We have an old backup server with lots of disks, and one mount backed by an md RAID 5 array has now frozen. How can I diagnose the problem and get it running again? I would like to avoid restarting the whole system because only one subsystem requires this specific mount point.
Diagnostics so far:
# cat /proc/mdstat
...
md0 : active raid5 sdn1[2] sdm1[1] sdo1[4] sdl1[0] sdg1[5] sda1[6]
      29301952000 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/6] [UUUUUU]
      [==========>..........]  check = 54.4% (3191189500/5860390400) finish=3314902.0min speed=13K/sec
      bitmap: 0/44 pages [0KB], 65536KB chunk

This doesn't progress even a little bit, even though I already adjusted /proc/sys/dev/raid/speed_limit_max and /proc/sys/dev/raid/speed_limit_min and waited for an hour.
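For reference, the limits were adjusted roughly like this (the values shown here are just example numbers, not necessarily the exact ones I used):

# raise the md check/resync bandwidth limits (values in KiB/s)
echo 50000  > /proc/sys/dev/raid/speed_limit_min
echo 200000 > /proc/sys/dev/raid/speed_limit_max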
However, according to mdadm everything seems fine:
# mdadm --query --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Wed Jun 15 23:50:50 2016
        Raid Level : raid5
        Array Size : 29301952000 (27944.52 GiB 30005.20 GB)
     Used Dev Size : 5860390400 (5588.90 GiB 6001.04 GB)
      Raid Devices : 6
     Total Devices : 6
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Jul 5 01:42:59 2021
             State : active, checking
    Active Devices : 6
   Working Devices : 6
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

      Check Status : 54% complete

              Name : examplehost:md0  (local to host examplehost)
              UUID : ed0000c4:47000085:8000006f:221938f5
            Events : 404407

    Number   Major   Minor   RaidDevice State
       0       8      177        0      active sync   /dev/sdl1
       1       8      193        1      active sync   /dev/sdm1
       2       8      209        2      active sync   /dev/sdn1
       4       8      225        3      active sync   /dev/sdo1
       6       8        1        4      active sync   /dev/sda1
       5       8       97        5      active sync   /dev/sdg1

And the underlying devices work just fine: I tested running
dd if=/dev/sdX of=/tmp/test.img bs=1M count=1 for every disk in this RAID and got the expected start-of-disk data with normal response times.
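The test was roughly along these lines (the exact loop and output paths here are just an illustration):

# read the first MiB of every member disk to confirm they respond
for d in sdl sdm sdn sdo sda sdg; do
  dd if=/dev/${d}1 of=/tmp/test-${d}.img bs=1M count=1
done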
So it seems that the underlying hardware is working just fine, but the md RAID itself has frozen in practice. The mount point of this RAID doesn't return any errors but never responds to any I/O requests; even a simple ls -la hangs forever.
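If it helps, I can also check what the stuck processes are blocked on, for example along these lines (the PID is a placeholder):

# list processes in uninterruptible sleep (state D) and what they are waiting on
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'
# kernel stack of one of them (needs root; replace 12345 with a real PID)
cat /proc/12345/stack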
Running journalctl --since "7 days ago" | grep "blocked for more than" shows that there has been slowness on md1, but md0 doesn't appear in the system logs at all even though it doesn't respond.
Jul 04 01:20:14 examplehost kernel: INFO: task jbd2/md1-8:2262 blocked for more than 120 seconds.
Jul 04 01:38:21 examplehost kernel: INFO: task jbd2/md1-8:2262 blocked for more than 120 seconds.
Jul 04 02:04:32 examplehost kernel: INFO: task jbd2/md1-8:2262 blocked for more than 120 seconds.

The mount point of md1 works just fine so that must have been just too much load for that night.
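If more detail would help, I could also dump the stacks of all currently blocked tasks into the kernel log via SysRq, something like:

# requires sysrq to be enabled (kernel.sysrq)
echo w > /proc/sysrq-trigger
dmesg | tail -n 200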
Can you provide any hints on how to fix this mount? I'd obviously prefer suggestions that can fix it without restarting the whole server, and if it can be fixed without even unmounting the filesystem, that would be better still. I first assumed it was a hardware hang, but that doesn't seem to be the case.
I think this has happened once before and in that case the server was just rebooted. However, I'd like to understand the real issue so that I can apply a real fix. The system is running Linux kernel version 5.4.119 in case it makes a difference.
smartctl -x /dev/sdX also says that all devices are okay. The underlying drives are Western Digital Red 6 TB drives, but I would guess that's not related to the issue at hand. I also tried echo "idle" > /sys/block/md0/md/sync_action but that too just hangs.
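For what it's worth, md0's sync state could also be inspected read-only through sysfs (I haven't verified whether these reads still respond while the array is stuck):

# read-only view of md0's current state, without writing to sync_action
cat /sys/block/md0/md/array_state
cat /sys/block/md0/md/sync_action
cat /sys/block/md0/md/sync_completed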