Background: I had an Ubuntu Bionic system set up and running using 3 x 1TB disks. The disks were all partitioned with one circa 15GB partition and the remaining circa 983GB.
The 15GB partitions from two disks formed md0, a raid1 array used for swap. The 983GB partitions from all three disks formed md10, a raid10 far-2 array used for / and totalling around 1.4TB.
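For context, a layout like that would have been created with something along these lines (a sketch only - the device names are placeholders, not the ones actually used):

# RAID1 across two of the ~15GB partitions, later used for swap
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX1 /dev/sdY1
# RAID10 far-2 across the three large partitions, used for /
mdadm --create /dev/md10 --level=10 --layout=f2 --raid-devices=3 /dev/sdX2 /dev/sdY2 /dev/sdZ2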
What happened: One hard drive failed. The raid10 array carried on regardless. md0 required me to add the remaining unused 15GB partition, but the system then booted. No worries \o/
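(Adding the spare partition into md0 is a one-liner of this shape - sdZ1 here is a placeholder for whichever 15GB partition was unused:)

mdadm /dev/md0 --add /dev/sdZ1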
What happened next: A couple of hours after ordering a new drive, the filesystem went read-only and the system then failed to reboot. Essentially a second disk had failed, though SMART reported a clutch of CRC errors rather than bad blocks.
It should be noted that I also had issues with a bad RAM stick prior to this, and stability issues with the replacement RAM concurrent with this happening. (Now resolved.)
Where I am now: I imaged the two disks with ddrescue and have been examining and attempting to repair the filesystem with testdisk. (Currently re-imaging the disks for a fresh start.) Essentially one disk appears fine; the other shows no bad blocks or unreadable sectors while ddrescue is copying, but does appear to have filesystem issues.
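The imaging was done with GNU ddrescue, roughly this shape of invocation (paths and options here are illustrative, not the exact ones used):

# First pass: grab everything easily readable, keeping a map file so the run can resume
ddrescue -n /dev/sdd /recovered/imgs/sdd.img /recovered/imgs/sdd.map
# Second pass: go back over the difficult areas with direct access and a few retries
ddrescue -d -r3 /dev/sdd /recovered/imgs/sdd.img /recovered/imgs/sdd.map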
What I think happened is that, rather than a second disk failing in hardware, the bad RAM caused filesystem errors which have made that disk unreadable.
Evidence: mdadm --examine on the actual drives, sdf being the 'good' one and sdd the 'bad':
/dev/sdf:
   MBR Magic : aa55
Partition[0] :     31997952 sectors at         2048 (type fd)
Partition[1] :   1921523712 sectors at     32000000 (type fd)

/dev/sdd:
   MBR Magic : aa55
Partition[0] :     32002048 sectors at         2048 (type 83)
Partition[1] :   1921519616 sectors at     32004096 (type 83)

Two things are visible here: firstly, the sdd partitions have reverted back to straight Linux (83) rather than Linux RAID (fd), and secondly Partition[0] on sdd seems to have gained 4096 sectors from Partition[1] (unless I created the partitions that way... but I doubt that).
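One way to see the same discrepancy directly is to dump and diff the two partition tables (purely an inspection step; sfdisk -d only reads):

sfdisk -d /dev/sdf > sdf.parttable
sfdisk -d /dev/sdd > sdd.parttable
# the type (fd vs 83) and the shifted start/size of the second entry show up in the diff
diff sdf.parttable sdd.parttable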
Testdisk also seems to confirm the partition issue, good disk first:
Disk /dev/sdf - 1000 GB / 931 GiB - CHS 121601 255 63
Current partition structure:
     Partition                  Start        End    Size in sectors

 1 * Linux RAID               0  32 33  1991 231 32   31997952 [zen:0]
 2 P Linux RAID            1991 231 33 121601  57 56 1921523712 [zen:10]

Disk /dev/sdd - 1000 GB / 931 GiB - CHS 121601 255 63
Current partition structure:
     Partition                  Start        End    Size in sectors

 1 * Linux                    0  32 33  1992  41 33   32002048

Bad relative sector.
 2 P Linux                 1992  41 34 121601  57 56 1921519616

I haven't been able to get Testdisk to correct this - the version on PartedMagic doesn't seem to support Linux RAID partitions, and its suggestions to use fsck result in bad magic number in superblock errors even when using alternate superblocks.
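For reference, those fsck attempts were of this general shape (the device name and backup superblock location are illustrative - 32768 is a common backup location for a 4K-block ext4 filesystem):

# Always came back with "bad magic number in super-block"
e2fsck -b 32768 /dev/sdd2
# In hindsight this was probably doomed anyway: the ext4 filesystem lives inside the md
# array, which starts at the Data Offset (262144 sectors) shown in the mdadm output below,
# so fsck on the bare partition has no superblock where it expects one.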
Here are the results of mdadm --examine from images mounted as loop devices, again good sdf first, bad sdd second:
/dev/loop1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 152533db:4e587991:5e88fe61:7fc8bfb8
           Name : zen:10
  Creation Time : Wed Jun 20 10:47:05 2018
     Raid Level : raid10
   Raid Devices : 3

 Avail Dev Size : 1921261568 (916.13 GiB 983.69 GB)
     Array Size : 1440943104 (1374.19 GiB 1475.53 GB)
  Used Dev Size : 1921257472 (916.13 GiB 983.68 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262064 sectors, after=4096 sectors
          State : clean
    Device UUID : ef11197c:7461dd9e:ad2e28f6:bad63a7b

Internal Bitmap : 8 sectors from superblock
    Update Time : Thu Aug 30 15:36:14 2018
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : 93267b2f - correct
         Events : 55599

         Layout : far=2
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/loop2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 152533db:4e587991:5e88fe61:7fc8bfb8
           Name : zen:10
  Creation Time : Wed Jun 20 10:47:05 2018
     Raid Level : raid10
   Raid Devices : 3

 Avail Dev Size : 1921257472 (916.13 GiB 983.68 GB)
     Array Size : 1440943104 (1374.19 GiB 1475.53 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262064 sectors, after=0 sectors
          State : active
    Device UUID : f70177cb:7daea233:2feafe68:e36186cd

Internal Bitmap : 8 sectors from superblock
    Update Time : Thu Aug 30 15:25:37 2018
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : 3c6bebab - correct
         Events : 55405

         Layout : far=2
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AA. ('A' == active, '.' == missing, 'R' == replacing)

Again it's notable that the sdd member (aka loop2) has issues - no 'Used Dev Size' listed. I've tried recreating the array using the images, and whilst it seems to work the array is unmountable (bad magic superblock again).
Questions: Does it look like I'm right in thinking it's a corrupted partition map on sdd that is the root of the problem? Is it possible to fix that, and if so, with what? fdisk?
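For what it's worth, if it really is just the partition map, one possible (untested) repair would be to rewrite sdd's table with sfdisk using the layout copied from the good disk, on the assumption that sdd originally matched sdf - and only ever against a spare copy of the image attached as a loop device, never the original disk:

# /dev/loopN = a scratch copy of the sdd image; values taken from sdf's table above
sfdisk /dev/loopN <<'EOF'
label: dos
unit: sectors
start=2048,     size=31997952,   type=fd, bootable
start=32000000, size=1921523712, type=fd
EOF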
Desired outcome: make the array mountable so I can dump as much as possible to a different disk. I have a backup of /etc and /home (in theory - haven't tried to restore yet), but it would be helpful and give peace of mind if I could resurrect this array temporarily. A brief run of photorec suggests a hell of a lot of files are recoverable too, but sorting through a nearly one-terabyte haystack of files without directory structure or filenames...
[SOLVED] I put in place fresh copies of the disk images I'd made, so none of my previous fiddling could mess things up. In fact one was a partition image and one a whole-disk image, so mounting them:
losetup --show -f /recovered/imgs/sdf1-md10.img
losetup -o 16386097152 /dev/loop3 /media/adrian/855c7d84-25f0-4b2f-b8fa-a5408536aaff/sdd-801485.img

Checking cat /proc/mdstat showed them assembled as an inactive array, md127, so I stopped that and then ran assemble as suggested by @derobert:
mdadm --stop /dev/md127
mdadm -v --assemble --force /dev/md10 /dev/loop3 /dev/loop2

And can mount and access the array! \o/
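For the actual data dump a read-only mount is the safer option; something like this (the mount point is arbitrary):

mkdir -p /mnt/recovery
mount -o ro /dev/md10 /mnt/recovery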
The important fact I missed in my research at the beginning of my own attempts is that you need to specify the devices for --assemble if you're reassembling the array on a new system - I didn't even realise that was possible to begin with.
sdf1as swap" step;mkswapmakes assumptions based on the size of the volume that may not be valid if you then shrink the volume by half, without redoingmkswapon it.