Background: I had an Ubuntu Bionic system set up and running using 3 x 1TB disks. The disks were all partitioned with one circa 15GB partition and the remaining circa 983GB.
The 15GB partitions from two disks formed md0, a raid1 array used for swap. The 983GB partitions from all three disks formed md10, a raid10 far-2 array used for / and totalling around 1.4TB.
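For context, a layout like that would have been created with something along these lines (a sketch only - the device names are placeholders, not the ones actually used):

# RAID1 across two of the ~15GB partitions, later used for swap
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX1 /dev/sdY1
# RAID10 far-2 across the three large partitions, used for /
mdadm --create /dev/md10 --level=10 --layout=f2 --raid-devices=3 /dev/sdX2 /dev/sdY2 /dev/sdZ2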
What happened: One hard drive failed. The raid10 array carried on regardless. md0 required me to add the remaining unused 15GB partition, but the system then booted. No worries \o/
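(Adding the spare partition into md0 is a one-liner of this shape - sdZ1 here is a placeholder for whichever 15GB partition was unused:)

mdadm /dev/md0 --add /dev/sdZ1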
What happened next: A couple of hours after ordering a new drive, the filesystem went read-only and the system then failed to reboot. Essentially a second disk had failed, though SMART reported a clutch of CRC errors rather than bad blocks.
It should be noted that I also had issues with a bad RAM stick prior to this, and stability issues with the replacement RAM concurrent with this happening. (Now resolved.)
Where I am now: I imaged the two disks with ddrescue and have been examining and attempting to repair the filesystem with testdisk. (Currently re-imaging the disks for a fresh start.) Essentially one disk appears fine; the other shows no bad blocks or unreadable sectors while ddrescue is copying, but does appear to have filesystem issues.
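The imaging was done with GNU ddrescue, roughly this shape of invocation (paths and options here are illustrative, not the exact ones used):

# First pass: grab everything easily readable, keeping a map file so the run can resume
ddrescue -n /dev/sdd /recovered/imgs/sdd.img /recovered/imgs/sdd.map
# Second pass: go back over the difficult areas with direct access and a few retries
ddrescue -d -r3 /dev/sdd /recovered/imgs/sdd.img /recovered/imgs/sdd.map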
What I think happened is that, rather than a second disk failing in hardware, the bad RAM caused filesystem errors which have made that disk unreadable.
Evidence: mdadm --examine on the actual drives, sdf being the 'good' one and sdd the 'bad':
/dev/sdf:
   MBR Magic : aa55
Partition[0] :     31997952 sectors at         2048 (type fd)
Partition[1] :   1921523712 sectors at     32000000 (type fd)

/dev/sdd:
   MBR Magic : aa55
Partition[0] :     32002048 sectors at         2048 (type 83)
Partition[1] :   1921519616 sectors at     32004096 (type 83)

Two things are visible here: firstly, the sdd partitions have reverted back to straight Linux (83) rather than Linux RAID (fd), and secondly Partition[0] on sdd seems to have gained 4096 sectors from Partition[1] (unless I created the partitions that way... but I doubt that).
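One way to see the same discrepancy directly is to dump and diff the two partition tables (purely an inspection step; sfdisk -d only reads):

sfdisk -d /dev/sdf > sdf.parttable
sfdisk -d /dev/sdd > sdd.parttable
# the type (fd vs 83) and the shifted start/size of the second entry show up in the diff
diff sdf.parttable sdd.parttable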
Testdisk also seems to confirm the partition issue, good disk first:
Disk /dev/sdf - 1000 GB / 931 GiB - CHS 121601 255 63
Current partition structure:
     Partition                  Start        End    Size in sectors

 1 * Linux RAID               0  32 33  1991 231 32   31997952 [zen:0]
 2 P Linux RAID            1991 231 33 121601  57 56 1921523712 [zen:10]

Disk /dev/sdd - 1000 GB / 931 GiB - CHS 121601 255 63
Current partition structure:
     Partition                  Start        End    Size in sectors

 1 * Linux                    0  32 33  1992  41 33   32002048

Bad relative sector.
 2 P Linux                 1992  41 34 121601  57 56 1921519616

I haven't been able to get Testdisk to correct this - the version on PartedMagic doesn't seem to support Linux RAID partitions, and its suggestions to use fsck result in bad magic number in superblock errors even when using alternate superblocks.
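For reference, those fsck attempts were of this general shape (the device name and backup superblock location are illustrative - 32768 is a common backup location for a 4K-block ext4 filesystem):

# Always came back with "bad magic number in super-block"
e2fsck -b 32768 /dev/sdd2
# In hindsight this was probably doomed anyway: the ext4 filesystem lives inside the md
# array, which starts at the Data Offset (262144 sectors) shown in the mdadm output below,
# so fsck on the bare partition has no superblock where it expects one.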
Here are the results of mdadm --examine from images mounted as loop devices, again good sdf first, bad sdd second:
/dev/loop1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 152533db:4e587991:5e88fe61:7fc8bfb8
           Name : zen:10
  Creation Time : Wed Jun 20 10:47:05 2018
     Raid Level : raid10
   Raid Devices : 3

 Avail Dev Size : 1921261568 (916.13 GiB 983.69 GB)
     Array Size : 1440943104 (1374.19 GiB 1475.53 GB)
  Used Dev Size : 1921257472 (916.13 GiB 983.68 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262064 sectors, after=4096 sectors
          State : clean
    Device UUID : ef11197c:7461dd9e:ad2e28f6:bad63a7b

Internal Bitmap : 8 sectors from superblock
    Update Time : Thu Aug 30 15:36:14 2018
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : 93267b2f - correct
         Events : 55599

         Layout : far=2
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/loop2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 152533db:4e587991:5e88fe61:7fc8bfb8
           Name : zen:10
  Creation Time : Wed Jun 20 10:47:05 2018
     Raid Level : raid10
   Raid Devices : 3

 Avail Dev Size : 1921257472 (916.13 GiB 983.68 GB)
     Array Size : 1440943104 (1374.19 GiB 1475.53 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262064 sectors, after=0 sectors
          State : active
    Device UUID : f70177cb:7daea233:2feafe68:e36186cd

Internal Bitmap : 8 sectors from superblock
    Update Time : Thu Aug 30 15:25:37 2018
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : 3c6bebab - correct
         Events : 55405

         Layout : far=2
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AA. ('A' == active, '.' == missing, 'R' == replacing)

Again it's notable that the sdd member (aka loop2) has issues - no 'Used Dev Size' listed. I've tried recreating the array using the images, and whilst it seems to work the array is unmountable (bad magic superblock again).
Questions: Does it look like I'm right in thinking it's a corrupted partition map on sdd that is the root of the problem? Is it possible to fix that, and if so, with what? fdisk?
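For what it's worth, if it really is just the partition map, one possible (untested) repair would be to rewrite sdd's table with sfdisk using the layout copied from the good disk, on the assumption that sdd originally matched sdf - and only ever against a spare copy of the image attached as a loop device, never the original disk:

# /dev/loopN = a scratch copy of the sdd image; values taken from sdf's table above
sfdisk /dev/loopN <<'EOF'
label: dos
unit: sectors
start=2048,     size=31997952,   type=fd, bootable
start=32000000, size=1921523712, type=fd
EOF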
Desired outcome: make the array mountable so I can dump as much as possible to a different disk. I have a backup of /etc and /home (in theory - haven't tried to restore yet), but it would be helpful and give peace of mind if I could resurrect this array temporarily. A brief run of photorec suggests a hell of a lot of files are recoverable too, but sorting through a nearly one-terabyte haystack of files without directory structure or filenames...
[SOLVED] I put in place fresh copies of the disk images I'd made, so none of my previous fiddling could mess things up. In fact one was a partition image and one a whole-disk image, so mounting them:
losetup --show -f /recovered/imgs/sdf1-md10.img
losetup -o 16386097152 /dev/loop3 /media/adrian/855c7d84-25f0-4b2f-b8fa-a5408536aaff/sdd-801485.img

Checking cat /proc/mdstat showed them assembled as an inactive array, md127, so I stopped that and then ran assemble as suggested by @derobert:
mdadm --stop /dev/md127
mdadm -v --assemble --force /dev/md10 /dev/loop3 /dev/loop2

And can mount and access the array! \o/
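For the actual data dump a read-only mount is the safer option; something like this (the mount point is arbitrary):

mkdir -p /mnt/recovery
mount -o ro /dev/md10 /mnt/recovery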
The important fact I missed in my research at the beginning of my own attempts is that you need to specify the devices for --assemble if you're reassembling the array on a new system - I didn't even realise that was possible to begin with.
sdf1as swap" step;mkswapmakes assumptions based on the size of the volume that may not be valid if you then shrink the volume by half, without redoingmkswapon it.