
I have a setup with three identical hard drives, recognized as sdb, sdd and sde. I have one RAID0 array (md2) and two RAID5 arrays (md0 and md1) across these three disks. All my RAID arrays appear to be working normally, and have done so since I created them. However, I have seen messages on the console about md0 and md1 being "active with 2 out of 3 devices", which to me sounds like a problem.

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid0] [linear] [multipath] [raid1] [raid10]
md2 : active raid0 sdb3[0] sdd3[1] sde3[2]
      24574464 blocks super 1.2 512k chunks

md1 : active raid5 sdd2[1] sde2[3]
      5823403008 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]

md0 : active raid5 sdd1[1] sde1[3]
      20462592 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]

unused devices: <none>

I'm not experienced with mdadm, but to me this looks like arrays md0 and md1 are missing the sdb disk, while md2 does not seem to be missing anything. So, has the sdb disk failed, or is this just a configuration issue? Are there any more diagnostics I should do to figure that out?

EDIT:

# mdadm --examine /dev/sdb2
/dev/sdb2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 94d56562:90a999e8:601741c0:55d8c83f
           Name : jostein1:1  (local to host jostein1)
  Creation Time : Sat Aug 18 13:00:00 2012
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 5823404032 (2776.82 GiB 2981.58 GB)
     Array Size : 5823403008 (5553.63 GiB 5963.16 GB)
  Used Dev Size : 5823403008 (2776.82 GiB 2981.58 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262064 sectors, after=1024 sectors
          State : active
    Device UUID : cee60351:c3a525ce:a449b326:6cb5970d

    Update Time : Tue May 24 21:43:20 2016
       Checksum : 4afdc54a - correct
         Events : 7400

         Layout : left-symmetric
     Chunk Size : 512K

    Device Role : Active device 0
    Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)

# mdadm --examine /dev/sde2
/dev/sde2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 94d56562:90a999e8:601741c0:55d8c83f
           Name : jostein1:1  (local to host jostein1)
  Creation Time : Sat Aug 18 13:00:00 2012
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 5823404032 (2776.82 GiB 2981.58 GB)
     Array Size : 5823403008 (5553.63 GiB 5963.16 GB)
  Used Dev Size : 5823403008 (2776.82 GiB 2981.58 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262064 sectors, after=1024 sectors
          State : clean
    Device UUID : 9c5abb6d:8f1eecbd:4b0f5459:c0424d26

    Update Time : Tue Oct 11 21:17:10 2016
       Checksum : a3992056 - correct
         Events : 896128

         Layout : left-symmetric
     Chunk Size : 512K

    Device Role : Active device 2
    Array State : .AA ('A' == active, '.' == missing, 'R' == replacing)

So --examine on sdb claims the array is fully active (Array State AAA), while the same command on sdd and sde shows sdb's slot as missing (Array State .AA). Note also that sdb's superblock was last updated in May, months before the others.

# mdadm --detail --verbose /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Sat Aug 18 13:00:00 2012
     Raid Level : raid5
     Array Size : 5823403008 (5553.63 GiB 5963.16 GB)
  Used Dev Size : 2911701504 (2776.82 GiB 2981.58 GB)
   Raid Devices : 3
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Tue Oct 11 22:03:50 2016
          State : clean, degraded
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : jostein1:1  (local to host jostein1)
           UUID : 94d56562:90a999e8:601741c0:55d8c83f
         Events : 897492

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       50        1      active sync   /dev/sdd2
       3       8       66        2      active sync   /dev/sde2

EDIT2:

The event count for the device no longer part of the array is very different from the others:

# mdadm --examine /dev/sd[bde]1 | egrep 'Event|/dev/sd'
/dev/sdb1:
         Events : 603
/dev/sdd1:
         Events : 374272
/dev/sde1:
         Events : 374272
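
The same comparison should work for the md1 members too, assuming the same device naming; the --examine output earlier in the question already suggests the same divergence there:

# mdadm --examine /dev/sd[bde]2 | egrep 'Event|/dev/sd'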

Smartmontools output for the disk that is no longer part of the arrays:

# smartctl -d ata -a /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.0-36-generic] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD30EZRX-00MMMB0
Serial Number:    WD-WCAWZ2185619
LU WWN Device Id: 5 0014ee 25c58f89e
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Oct 12 18:54:30 2016 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (51480) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 494) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   147   144   021    Pre-fail  Always       -       9641
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1398
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7788
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1145
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       45
193 Load_Cycle_Count        0x0032   097   097   000    Old_age   Always       -       309782
194 Temperature_Celsius     0x0022   124   103   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
  • check the logfiles Commented Oct 11, 2016 at 18:48
  • There are no errors in dmesg. Any other logs I should check? Commented Oct 11, 2016 at 19:05
  • mdadm --examine /dev/sdb2 Commented Oct 11, 2016 at 19:08
  • Look at the SMART data for the failed drive? smartctl. See if it is reporting as failed? Commented Oct 11, 2016 at 22:30
  • @Zoredache smartctl shows the drive is fine, as far as I can tell, see output as comment to the question. The output is very similar to other disks of the same type that work just fine. Commented Oct 12, 2016 at 16:58

3 Answers


Your mdstat file says it all.

[3/2] [_UU] means that the array is defined with 3 devices but only 2 of them are currently in use. The _UU pattern shows which slot is affected: each U is a device that is up, and the underscore marks the missing member (the first slot, which in your case was sdb).
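
If you want a quick way to spot degraded arrays without reading /proc/mdstat by eye, a small loop over sysfs should also work. This is only a sketch and assumes your kernel exposes the md "degraded" attribute (non-redundant arrays such as the RAID0 one may not have it, hence the fallback):

for md in /sys/block/md*/md; do
    # "degraded" holds the number of missing devices; it is absent for arrays without redundancy
    echo "$(dirname "$md"): degraded=$(cat "$md"/degraded 2>/dev/null || echo n/a)"
done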

For greater detail on the RAID devices (before going to the physical ones), you'd run as root:

mdadm --detail --verbose /dev/md0
mdadm --detail --verbose /dev/md1
mdadm --detail --verbose /dev/md2

On my system (using raid6) I simulated a failure; this is an example of the output:

/dev/md0:
        Version : 1.2
  Creation Time : Thu Sep 29 09:51:41 2016
     Raid Level : raid6
     Array Size : 16764928 (15.99 GiB 17.17 GB)
  Used Dev Size : 8382464 (7.99 GiB 8.58 GB)
   Raid Devices : 4
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Thu Oct 11 13:06:50 2016
          State : clean                    <<== CLEAN!
 Active Devices : 4
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : ubuntu:0  (local to host ubuntu)
           UUID : 3837ba75:eaecb6be:8ceb4539:e5d69538
         Events : 43

    Number   Major   Minor   RaidDevice State
       4       8       65        0      active sync   /dev/sde1   <<== NEW ENTRY
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1

       0       8        1        -      faulty   /dev/sda1        <<== SW-REPLACED
  • So I have "state: clean, degraded" in the middle and "removed" for the listing of disk number 0 (which I assume should have been sdb). Can I recover from "removed" somehow? The full output has been included in the question. Commented Oct 11, 2016 at 20:07
  • @josteinb You'd add spare devices so the RAID can be reconstructed automatically. This is my current setup, as I showed in the answer. Commented Oct 12, 2016 at 7:29
  • @josteinb By the way, I prefer to 1) create a RAID device over entire disks, 2) then partition (and 3) then LVM everything). You did the partitioning first, then the RAID. My starting point is: does that make any sense to you? When a drive is gone, all its partitions are (supposed to be) gone as well. Commented Oct 12, 2016 at 8:12
  • I see your point about raid first, partitioning second. It is a long time since I created this setup, but I would think the reason why I partitioned first is that I have one raid0 and two raid5 partitions. Commented Oct 12, 2016 at 16:52

md0 and md1 are the RAID5 arrays, degraded because their respective partitions on /dev/sdb have failed or been marked faulty. Run mdadm --detail on the array itself for more details (mdadm --detail /dev/md1).

If all is well with /dev/sdb, re-add the partitions to the arrays. Get the correct partition numbers from your /etc/mdadm.conf or from the --detail output for each array.

mdadm --re-add /dev/sdb[?] /dev/md1
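
If you are unsure which sdb partition belongs to which array, one way to check (a sketch, not part of the original answer) is to compare the Array UUID reported by --examine on the partitions with the UUID reported by --detail on the arrays:

mdadm --examine /dev/sdb1 /dev/sdb2 | egrep 'sdb|Array UUID'
mdadm --detail /dev/md0 /dev/md1 | egrep '/dev/md|UUID'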

  • I tried this, --re-add does not work. mdadm just responded that the device could not be re-added. It may be related to the event counts shown in EDIT2 of the question. Commented Oct 12, 2016 at 16:50

Yes, /dev/sdb1 and /dev/sdb2 were kicked out of /dev/md0 and /dev/md1 respectively. You can grep your syslogs (/var/log/messages* on RHEL/CentOS-based distros, /var/log/syslog* on Debian/Ubuntu-based distros) to find out what caused this, if the logs from that time are still kept.
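
For example, something along these lines should surface the relevant kernel messages. This is only a sketch: the Debian/Ubuntu log paths are assumed, and your log rotation may differ.

grep -iE 'md[01]|sdb' /var/log/syslog            # current log
zgrep -iE 'md[01]|sdb' /var/log/syslog.*.gz      # rotated, compressed logs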

To fix this, the first thing I suggest is running a SMART self-test on /dev/sdb. This can be done with smartctl -t long /dev/sdb (a non-destructive test); you can then check both its progress and its results with smartctl -a /dev/sdb.
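
If you only want the self-test log rather than the full report, this should work as well (a small addition to the commands above):

smartctl -t long /dev/sdb        # start the extended self-test; it runs inside the drive, non-destructive
smartctl -l selftest /dev/sdb    # show just the self-test log (status and results)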

If everything looks fine with the disk afterwards, you can re-add the partitions to the RAID arrays. This might trigger rebuilds (and most likely will), which is still non-destructive:

mdadm /dev/md0 --add /dev/sdb1
mdadm /dev/md1 --add /dev/sdb2
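
One caveat that is my own assumption rather than part of this answer: since --re-add was already refused (see the comments above) and the event counts on sdb are far behind, --add will treat the partitions as new members and do a full resync. If even --add is refused because of the stale superblocks, the usual step is to clear the old md metadata on the sdb partitions first; this touches only sdb, not the running arrays:

mdadm --zero-superblock /dev/sdb1    # wipe the stale md metadata on the ex-member only
mdadm --zero-superblock /dev/sdb2
mdadm /dev/md0 --add /dev/sdb1       # added as a fresh device; a full resync follows
mdadm /dev/md1 --add /dev/sdb2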

You can watch the rebuild progress with watch cat /proc/mdstat (which reprints /proc/mdstat every 2 seconds).

If you didn't have a write-intent bitmap before, I strongly suggest adding one after the rebuild:

mdadm -G /dev/mdX -b internal 

Replace X with your array number, and repeat for every array you have active. It doesn't take much space and is non-destructive too, but it helps with data consistency and sometimes with rebuild speed.
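
To confirm the bitmap is in place (my addition, not part of the answer above), the array's --detail output should report an internal intent bitmap, and its stanza in /proc/mdstat gains a bitmap line:

mdadm --detail /dev/md1 | grep -i bitmap    # should show "Intent Bitmap : Internal"
grep -A3 '^md1' /proc/mdstat                # look for a "bitmap: ..." line in the stanza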
