Timeline for How to debug frozen md raid?
Current License: CC BY-SA 4.0
22 events
| when | what | action | by | comment / license | |
|---|---|---|---|---|---|
| Jul 14, 2021 at 14:04 | comment | added | Mikko Rantalainen | The problem hasn't reappeared after the reboot and I've now completed a background mdadm check for all RAIDs in the system without a problem. I currently think I was unlucky enough to hit some kernel bug. The system is running with ECC RAM, so I would guess it wasn't caused by a memory error either. | |
| Jul 8, 2021 at 9:40 | comment | added | cas | Depends. True enough if you want to pipe zfs send into zfs recv to get a synced, duplicate mountable filesystem on the other end. Not true if all you want is to archive or back up the snapshots (zfs send is just a data stream and can be saved as a file or dumped to tape or something), but restoring would require additional steps (retrieving the snapshot dumps and processing them in the right order, oldest to newest). | |
| Jul 8, 2021 at 9:29 | comment | added | Mikko Rantalainen | ZFS send is only usable if both the source and target systems use ZFS. In our case there are about 8 source systems with various filesystems on each, and the off-site backup cannot use ZFS. | |
| Jul 8, 2021 at 8:49 | comment | added | cas | BTW, if you want to greatly reduce the IO load from backups, I recommend switching to ZFS and using snapshots and zfs send for syncing and backups. When I converted from mdadm raid to ZFS (over 10 years ago), the nightly backup jobs that used to take hours with rsync were completed in minutes with zfs send. And the snapshots gave me a versioned history without a horrendous performance-killing link farm as with programs like rdiff-backup or backuppc. | |
| Jul 8, 2021 at 8:38 | comment | added | Mikko Rantalainen | I don't know if the iommu setting affects the problem. The problem seems to appear randomly about twice a year, so testing any change is pretty slow. The idea for the iommu setting came from this kernel bug: bugzilla.kernel.org/show_bug.cgi?id=196683. I haven't tried this flag myself but added it as extra info for future readers. | |
| Jul 8, 2021 at 8:22 | comment | added | cas | If changing the kernel's iommu setting fixes it, that indicates that your motherboard needs to be replaced with something that isn't broken. Or, maybe just update the BIOS with a non-broken version. How old is the motherboard? What kind (brand, model) of SATA ports does it have? (some motherboards have 4, 6, or 8 ports provided by the CPU's chipset and a few more that are whatever cheap junk they can find). If the m/b is relatively "new" (i.e. within the last 5 years or so), it's probably OK. | |
| Jul 8, 2021 at 8:16 | comment | added | Mikko Rantalainen | The normal workload for that hardware is running multiple rsync routines in parallel, and in addition it runs crap called HitachiHDS for additional off-site remote backup. So it's seeing high loads daily. It syncs changes for an approximately 35 TB dataset in and another 35 TB dataset out daily. | |
| Jul 8, 2021 at 8:13 | comment | added | cas | The & in the for loop makes it run dd multiple times in the background, i.e. they'd all be running at once. This will make sure the drives are all running at the same time. It's not a comprehensive test, though, because there are no random seeks or a mix of reads and writes; it's all just sequential reads. That's why I suggested running other programs to thrash the raid array. rsync would be great for that, which may be why it triggered the original problem. | |
| Jul 8, 2021 at 8:13 | comment | added | Mikko Rantalainen | A shot in the dark: some kernel hangs can be fixed with the kernel flag iommu=pt. I saw the problem with an AMD CPU, which does have IOMMU hardware. | |
| Jul 8, 2021 at 8:08 | comment | added | Mikko Rantalainen | I did the dd for every drive to make sure that the kernel is able to access the drives but only showed the command line once. If this happens again, I'll also try dd if=/dev/md0 for sure; I don't know how I missed that test. There were no errors of any kind in the journalctl from md0, so it was working "all ok but a bit slow" according to the kernel. From the behavior I saw, I would expect that dd if=/dev/md0 of=/dev/null would have stalled without errors. The system is working just fine after rebooting, so I don't think it's a hardware failure. | |
| Jul 8, 2021 at 7:34 | comment | added | cas | And make sure that the power supply has enough spare capacity to provide enough power to all drives at once in addition to motherboard, cpu, gpu, ram, and whatever else is installed (this is unlikely to be the problem, but is worth checking if you can't find any other problem. Check the datasheet for your drive model to find out how much power it uses when reading & writing...probably around 5W per drive). | |
| Jul 8, 2021 at 7:30 | comment | added | cas | Or use dd if=/dev/md0 of=/dev/null. or try bonnie++ or fio or anything else that thrashes the raid array. Also check the SATA and power connectors on the drives, make sure they're securely plugged in. It's possible that a connector is loose and the vibrations from heavy load is causing "random" disconnects. | |
| Jul 8, 2021 at 7:17 | comment | added | cas | You said that it works ok with dd, but the example you gave only showed reading just 1MB of data from one drive. Does it hang when you run something more I/O-intensive like for d in /dev/sd{n,m,o,l,g,a}1; do dd if="$d" of=/dev/null & done to read lots of data from all drives at once? | |
| Jul 8, 2021 at 5:49 | comment | added | cas | You should check journalctl or your kernel log for any error messages while reading or writing to any of the drives in your raid array. Maybe check Western Digital's web site to see if there are any firmware updates for those drives (and google the model number to see if there are any known problems). When drives stall or hang, I tend to suspect that either the drives or the controller (or motherboard) they're plugged into are failing or buggy. | |
| Jul 7, 2021 at 14:21 | comment | added | Mikko Rantalainen | According to journalctl, after magically losing 8 minutes to something, the next boot took an extra 2.5 minutes to run fsck on the volume that was mounted from md0 during the hang, and it came back clean after recovering the journal. So the total time the reboot took was around 20 minutes. I still don't understand where those extra 8 minutes went before the kernel started to boot, but I'd guess the Ubuntu initrd did something before booting the kernel. I did the whole setup remotely, so I don't know what the local console was displaying during those 8 minutes. | |
| Jul 7, 2021 at 14:06 | comment | added | Mikko Rantalainen | The system reboot took about half an hour, but things seem to be working again. Reading through the journalctl after the reboot shows that even systemd failed to kill the processes frozen on md0: e.g. systemd: Killing process 1203568 (rsync) with signal SIGKILL. followed by systemd: Failed with result 'timeout'. and later Killing process 805983 (mdcheck) with signal SIGKILL. and mdcheck_start.service: Processes still around after final SIGKILL. Entering failed mode. It took the system 6 minutes to finally reboot, and the reboot took an extra 8 minutes for something unknown according to journalctl. | |
| Jul 7, 2021 at 11:25 | comment | added | Mikko Rantalainen | I also tried sudo fuser -M -m /path/to/md0/mount -k -9 but that couldn't make md0 non-busy either. I couldn't figure out any way to make md0 and the underlying devices non-busy, so I had to restart the whole system in the end. If I had just needed to make the mount point available, I could have used umount -l /path/to/md0/mount, but that wouldn't make the raid available for remounting; however, I needed the raid volume back online. Running just sync from the command line hung, too! | |
| Jul 7, 2021 at 10:47 | comment | added | Mikko Rantalainen | This didn't help me but could help somebody else with similar symptoms: spinics.net/lists/raid/msg41039.html | |
| Jul 7, 2021 at 10:44 | comment | added | Mikko Rantalainen | I also tried to run echo "idle" > /sys/block/md0/md/sync_action but that too just hangs. | |
| Jul 7, 2021 at 9:43 | comment | added | Mikko Rantalainen | smartctl -x /dev/sdX also says that all devices are okay. The underlying drives are Western Digital Red 6 TB drives but I would guess that's not related to the issue at hand. | |
| Jul 7, 2021 at 9:36 | comment | added | Mikko Rantalainen | The real hostname and UUID have been redacted. | |
| Jul 7, 2021 at 9:35 | history | asked | Mikko Rantalainen | CC BY-SA 4.0 | |
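
The Jul 8 comments suggest stressing every member drive at once and then reading the assembled array, rather than the 1 MB single-drive dd from the question. A minimal sketch of that test, assuming the six device names from the comment and /dev/md0 as the array (both are placeholders; the real members are listed in /proc/mdstat):

```bash
#!/bin/bash
# Read from every RAID member in parallel, then from the assembled array.
# Device names are taken from the comment's example and are assumptions.
set -u

for d in /dev/sd{n,m,o,l,g,a}1; do
    dd if="$d" of=/dev/null bs=1M count=4096 &   # ~4 GiB sequential read per drive
done
wait

# A healthy array should keep streaming here; in the reported hang this kind
# of read would be expected to stall without logging any errors.
dd if=/dev/md0 of=/dev/null bs=1M count=4096 status=progress
```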
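
As the comments point out, parallel dd produces only sequential reads. A hedged fio invocation that mixes random reads and writes on the mounted array gets closer to the rsync-style workload; the file path, size, and job parameters below are all assumptions to be tuned for the actual system:

```bash
# Mixed random read/write load on a test file inside the md0 mount point.
# All values are illustrative; shrink --size/--runtime for a quick check.
fio --name=md0-thrash \
    --filename=/path/to/md0/mount/fio-testfile \
    --size=10G --rw=randrw --rwmixread=70 \
    --bs=4k --iodepth=32 --ioengine=libaio \
    --numjobs=4 --time_based --runtime=600 --group_reporting
```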
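
For the log, array-state, and SMART checks discussed in the comments, a rough checklist run as root might look like the following; device names are again placeholders:

```bash
# Kernel messages from the current boot, errors and worse only.
journalctl -k -p err -b

# Array status and the state of any running check/resync.
cat /proc/mdstat
cat /sys/block/md0/md/sync_action

# Ask md to stop a running check; in the reported case this write simply hung.
echo idle > /sys/block/md0/md/sync_action

# SMART health summary for each member drive.
for d in /dev/sd{n,m,o,l,g,a}; do
    echo "=== $d ==="
    smartctl -x "$d" | grep -i -E 'device model|overall-health'
done
```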
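
The Jul 7 comments describe the recovery attempts that failed before the reboot; collected here as a sketch, keeping the placeholder mount point path from the comments:

```bash
MNT=/path/to/md0/mount    # placeholder mount point from the comments

# Kill everything holding the filesystem (and the block device behind it).
fuser -M -m "$MNT" -k -9

# Flush dirty data; in the reported hang even sync blocked indefinitely.
sync

# Lazy unmount frees the mount point for other use, but as noted it does not
# make the array available for remounting.
umount -l "$MNT"
```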
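
The iommu=pt flag mentioned as a shot in the dark is a kernel boot parameter; whether it helps with this particular hang is unverified. On a GRUB-based Ubuntu system it could be added roughly like this (the sed edit assumes the usual GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub):

```bash
# Append iommu=pt to the default kernel command line.
sudo sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 iommu=pt"/' /etc/default/grub
sudo update-grub

# After the next reboot, confirm the flag is active:
grep -o 'iommu=pt' /proc/cmdline
```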
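
The ZFS side discussion boils down to snapshots plus zfs send, either piped to zfs recv on another ZFS host or dumped to a plain file when the target cannot run ZFS (restoring then means replaying the dumps oldest to newest, as noted in the comments). A minimal sketch with placeholder pool, dataset, and host names:

```bash
# Take an atomic snapshot of the dataset.
zfs snapshot tank/data@2021-07-08

# Replicate to another ZFS host: full stream first, then incrementals.
zfs send tank/data@2021-07-08 | ssh backuphost zfs recv -F backup/data
zfs send -i tank/data@2021-07-07 tank/data@2021-07-08 | ssh backuphost zfs recv backup/data

# Or archive the stream as an ordinary file when the target has no ZFS.
zfs send tank/data@2021-07-08 | gzip > /backup/tank-data-2021-07-08.zfs.gz
```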