Timeline for How to debug frozen md raid?
Current License: CC BY-SA 4.0
22 events
| when | what | action | by | comment / license | |
|---|---|---|---|---|---|
| Jul 14, 2021 at 14:04 | comment | added | Mikko Rantalainen | The problem hasn't reappeared after the reboot and I've now completed a background mdadm check for all RAIDs in the system without a problem. I currently think I was unlucky enough to hit some kernel bug. The system is running with ECC RAM, so I would guess it wasn't caused by a memory error either. | |
| Jul 8, 2021 at 9:40 | comment | added | cas | Depends. True enough if you want to pipe zfs send into zfs recv to get a synced, duplicate mountable filesystem on the other end. Not true if all you want is to archive or back up the snapshots (zfs send is just a data stream and can be saved as a file or dumped to tape or something), but restoring would require additional steps (retrieving the snapshot dumps and processing them in the right order, oldest to newest). | |
| Jul 8, 2021 at 9:29 | comment | added | Mikko Rantalainen | ZFS send is only usable if both the source and target systems use ZFS. In our case there are about 8 source systems with various filesystems on each, and the off-site backup cannot use ZFS. | |
| Jul 8, 2021 at 8:49 | comment | added | cas | BTW, if you want to greatly reduce the IO load from backups, I recommend switching to ZFS and using snapshots and zfs send for syncing and backups. When I converted from mdadm raid to ZFS (over 10 years ago), the nightly backup jobs that used to take hours with rsync were completed in minutes with zfs send. And the snapshots gave me a versioned history without a horrendous performance-killing link farm as with programs like rdiff-backup or backuppc. | |
| Jul 8, 2021 at 8:38 | comment | added | Mikko Rantalainen | I don't know if the iommu setting affects the problem. The problem seems to appear randomly about twice a year, so testing any change is pretty slow. The idea for the iommu setting came from this kernel bug: bugzilla.kernel.org/show_bug.cgi?id=196683. I haven't tried this flag myself but added it as extra info for future readers. | |
| Jul 8, 2021 at 8:22 | comment | added | cas | If changing the kernel's iommu setting fixes it, that indicates that your motherboard needs to be replaced with something that isn't broken. Or, maybe just update the BIOS with a non-broken version. How old is the motherboard? What kind (brand, model) of SATA ports does it have? (some motherboards have 4, 6, or 8 ports provided by the CPU's chipset and a few more that are whatever cheap junk they can find). If the m/b is relatively "new" (i.e. within the last 5 years or so), it's probably OK. | |
| Jul 8, 2021 at 8:16 | comment | added | Mikko Rantalainen | The normal workload for that hardware is running multiple rsync routines in parallel, and in addition it runs crap called HitachiHDS for additional off-site remote backup. So it's seeing high loads daily. It syncs changes for an approximately 35 TB dataset in and another 35 TB dataset out daily. | |
| Jul 8, 2021 at 8:13 | comment | added | cas | The & in the for loop makes it run dd multiple times in the background, i.e. they'd all be running at once. This will make sure the drives are all running at the same time. It's not a comprehensive test, though, because there are no random seeks or a mix of reads and writes; it's all just sequential reads. That's why I suggested running other programs to thrash the raid array. rsync would be great for that, which may be why it triggered the original problem. | |
| Jul 8, 2021 at 8:13 | comment | added | Mikko Rantalainen | A shot in the dark: some kernel hangs can be fixed with the kernel flag iommu=pt. I saw the problem with an AMD CPU, which does have IOMMU hardware. | |
| Jul 8, 2021 at 8:08 | comment | added | Mikko Rantalainen | I did the dd for every drive to make sure that the kernel is able to access the drives but only showed the command line once. If this happens again, I'll also try dd if=/dev/md0 for sure; I don't know how I missed that test. There were no errors of any kind in the journalctl from md0, so it was working "all ok but a bit slow" according to the kernel. From the behavior I saw, I would expect that dd if=/dev/md0 of=/dev/null would have stalled without errors. The system is working just fine after rebooting, so I don't think it's a hardware failure. | |
| Jul 8, 2021 at 7:34 | comment | added | cas | And make sure that the power supply has enough spare capacity to provide enough power to all drives at once in addition to motherboard, cpu, gpu, ram, and whatever else is installed (this is unlikely to be the problem, but is worth checking if you can't find any other problem. Check the datasheet for your drive model to find out how much power it uses when reading & writing...probably around 5W per drive). | |
| Jul 8, 2021 at 7:30 | comment | added | cas | Or use dd if=/dev/md0 of=/dev/null. or try bonnie++ or fio or anything else that thrashes the raid array. Also check the SATA and power connectors on the drives, make sure they're securely plugged in. It's possible that a connector is loose and the vibrations from heavy load is causing "random" disconnects. | |
| Jul 8, 2021 at 7:17 | comment | added | cas | You said that it works ok with dd, but the example you gave only showed reading just 1MB of data from one drive. Does it hang when you run something more I/O-intensive like for d in /dev/sd{n,m,o,l,g,a}1; do dd if="$d" of=/dev/null & done to read lots of data from all drives at once? | |
| Jul 8, 2021 at 5:49 | comment | added | cas | You should check journalctl or your kernel log for any error messages while reading or writing to any of the drives in your raid array. Maybe check Western Digital's web site to see if there are any firmware updates for those drives (and google the model number to see if there are any known problems). When drives stall or hang, I tend to suspect that either the drives or the controller (or motherboard) they're plugged into are failing or buggy. | |
| Jul 7, 2021 at 14:21 | comment | added | Mikko Rantalainen | According to journalctl, after magically losing 8 minutes to something, the next boot took an extra 2.5 minutes to run fsck on the volume that was mounted from md0 during the hang, and it came back clean after recovering the journal. So the total time the reboot took was around 20 minutes. I still don't understand where those extra 8 minutes went before the kernel started to boot, but I'd guess the Ubuntu initrd did something before booting the kernel. I did the whole setup remotely, so I don't know what the local console was displaying during those 8 minutes. | |
| Jul 7, 2021 at 14:06 | comment | added | Mikko Rantalainen | The system reboot took about half an hour, but things seem to be working again. Reading through the journalctl after the reboot shows that even systemd failed to kill the processes frozen on md0: e.g. systemd: Killing process 1203568 (rsync) with signal SIGKILL. followed by systemd: Failed with result 'timeout'. and later Killing process 805983 (mdcheck) with signal SIGKILL. and mdcheck_start.service: Processes still around after final SIGKILL. Entering failed mode. It took the system 6 minutes to finally reboot, and the reboot took an extra 8 minutes for something unknown according to journalctl. | |
| Jul 7, 2021 at 11:25 | comment | added | Mikko Rantalainen | I also tried sudo fuser -M -m /path/to/md0/mount -k -9 but that couldn't make md0 non-busy either. I couldn't figure out any way to make md0 and the underlying devices non-busy, so I had to restart the whole system in the end. If I had just needed to make the mount point available, I could have used umount -l /path/to/md0/mount, but that wouldn't make the raid available for remounting; however, I needed the raid volume back online. Running just sync from the command line hung, too! | |
| Jul 7, 2021 at 10:47 | comment | added | Mikko Rantalainen | This didn't help me but could help somebody else with similar symptoms: spinics.net/lists/raid/msg41039.html | |
| Jul 7, 2021 at 10:44 | comment | added | Mikko Rantalainen | I also tried to run echo "idle" > /sys/block/md0/md/sync_action but that too just hangs. | |
| Jul 7, 2021 at 9:43 | comment | added | Mikko Rantalainen | smartctl -x /dev/sdX also says that all devices are okay. The underlying drives are Western Digital Red 6 TB drives but I would guess that's not related to the issue at hand. | |
| Jul 7, 2021 at 9:36 | comment | added | Mikko Rantalainen | The real hostname and UUID have been redacted. | |
| Jul 7, 2021 at 9:35 | history | asked | Mikko Rantalainen | CC BY-SA 4.0 | |
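
The Jul 8 comments suggest stressing every member drive at once and then reading the assembled array, rather than the 1 MB single-drive dd from the question. A minimal sketch of that test, assuming the six device names from the comment and /dev/md0 as the array (both are placeholders; the real members are listed in /proc/mdstat):

```bash
#!/bin/bash
# Read from every RAID member in parallel, then from the assembled array.
# Device names are taken from the comment's example and are assumptions.
set -u

for d in /dev/sd{n,m,o,l,g,a}1; do
    dd if="$d" of=/dev/null bs=1M count=4096 &   # ~4 GiB sequential read per drive
done
wait

# A healthy array should keep streaming here; in the reported hang this kind
# of read would be expected to stall without logging any errors.
dd if=/dev/md0 of=/dev/null bs=1M count=4096 status=progress
```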
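
As the comments point out, parallel dd produces only sequential reads. A hedged fio invocation that mixes random reads and writes on the mounted array gets closer to the rsync-style workload; the file path, size, and job parameters below are all assumptions to be tuned for the actual system:

```bash
# Mixed random read/write load on a test file inside the md0 mount point.
# All values are illustrative; shrink --size/--runtime for a quick check.
fio --name=md0-thrash \
    --filename=/path/to/md0/mount/fio-testfile \
    --size=10G --rw=randrw --rwmixread=70 \
    --bs=4k --iodepth=32 --ioengine=libaio \
    --numjobs=4 --time_based --runtime=600 --group_reporting
```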
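
For the log, array-state, and SMART checks discussed in the comments, a rough checklist run as root might look like the following; device names are again placeholders:

```bash
# Kernel messages from the current boot, errors and worse only.
journalctl -k -p err -b

# Array status and the state of any running check/resync.
cat /proc/mdstat
cat /sys/block/md0/md/sync_action

# Ask md to stop a running check; in the reported case this write simply hung.
echo idle > /sys/block/md0/md/sync_action

# SMART health summary for each member drive.
for d in /dev/sd{n,m,o,l,g,a}; do
    echo "=== $d ==="
    smartctl -x "$d" | grep -i -E 'device model|overall-health'
done
```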
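
The Jul 7 comments describe the recovery attempts that failed before the reboot; collected here as a sketch, keeping the placeholder mount point path from the comments:

```bash
MNT=/path/to/md0/mount    # placeholder mount point from the comments

# Kill everything holding the filesystem (and the block device behind it).
fuser -M -m "$MNT" -k -9

# Flush dirty data; in the reported hang even sync blocked indefinitely.
sync

# Lazy unmount frees the mount point for other use, but as noted it does not
# make the array available for remounting.
umount -l "$MNT"
```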
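
The iommu=pt flag mentioned as a shot in the dark is a kernel boot parameter; whether it helps with this particular hang is unverified. On a GRUB-based Ubuntu system it could be added roughly like this (the sed edit assumes the usual GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub):

```bash
# Append iommu=pt to the default kernel command line.
sudo sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 iommu=pt"/' /etc/default/grub
sudo update-grub

# After the next reboot, confirm the flag is active:
grep -o 'iommu=pt' /proc/cmdline
```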
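
The ZFS side discussion boils down to snapshots plus zfs send, either piped to zfs recv on another ZFS host or dumped to a plain file when the target cannot run ZFS (restoring then means replaying the dumps oldest to newest, as noted in the comments). A minimal sketch with placeholder pool, dataset, and host names:

```bash
# Take an atomic snapshot of the dataset.
zfs snapshot tank/data@2021-07-08

# Replicate to another ZFS host: full stream first, then incrementals.
zfs send tank/data@2021-07-08 | ssh backuphost zfs recv -F backup/data
zfs send -i tank/data@2021-07-07 tank/data@2021-07-08 | ssh backuphost zfs recv backup/data

# Or archive the stream as an ordinary file when the target has no ZFS.
zfs send tank/data@2021-07-08 | gzip > /backup/tank-data-2021-07-08.zfs.gz
```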