I have a situation where I am moving data into a new ZFS raidz pool with four devices, some of them virtual (file-based) to facilitate the migration. The system hung completely in the middle of replacing a file-based device with a physical device.

The system did not even respond to SysRq and had to be reset physically. When it came back online, ZFS had decided that only 2 of the 4 devices were online, started resilvering, and reported loads of errors. I didn't know how to stop it; it keeps going in the background even when the pool is unmounted.

By the time I managed to get the missing device, which was perfectly fine, back online, ZFS had reported a great many errors.

Does that mean that ZFS has destroyed data while resilvering because of the missing device? Or can it now resilver correctly again, now that it has its original devices in place?

When it was resilvering with only 2 devices, it was resilvering on sda3. The current status is below:

    NAME                             STATE     READ WRITE CKSUM
    zfs_raid                         DEGRADED     0     0 38.5K
      raidz1-0                       DEGRADED     0     0  129K
        sda3                         ONLINE       0     0     0
        sdc2                         ONLINE       0     0     0
        replacing-2                  DEGRADED     0     0     3
          /zfs_jbod2/zfs_raid/zfs.1  OFFLINE      0     0     0
          sdb1                       ONLINE       0     0     0  (resilvering)
        /zfs_jbod/zfs_raid/zfs.2     ONLINE       0     0     0  (resilvering)

errors: 25852 data errors, use '-v' for a list

  • It appears to have rebuilt things back to normal after I re-added the missing devices. Commented Sep 12, 2013 at 15:17

1 Answer

Having not inspected the code, this is just speculation, but I'd say "no". ZFS raidz1 is roughly equivalent to RAID-5, and any competent implementation of RAID-5 will stop a repair when it loses two drives.

That's the key right there: you lost two drives. That will kill any single-disk-redundancy system, whether ZFS raidz1, 2-disk RAID-1, or RAID-5 without spares.

Yes, you had started replacing the first drive, but according to your question the array hadn't finished rebuilding onto it, so that drive was effectively still missing.
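To make the "two missing drives" point concrete, here is a toy sketch of single-parity redundancy using plain XOR. It is not how ZFS lays out raidz1 on disk (ZFS uses variable-width stripes and block checksums); the helper names `xor_blocks`, `make_stripe`, and `rebuild` are made up for this illustration. It only shows that one parity block can reconstruct exactly one missing block.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks plus one XOR parity block (single redundancy)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe):
    """Rebuild a stripe in which missing blocks are marked as None.

    One missing block equals the XOR of all surviving blocks.
    Two or more missing blocks cannot be recovered from one parity block.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("more than one device missing: stripe is unrecoverable")
    repaired = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        repaired[missing[0]] = xor_blocks(survivors)
    return repaired


# Hypothetical 4-device layout: 3 data blocks + 1 parity block.
stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])

# One device missing: the stripe can be rebuilt exactly.
assert rebuild([stripe[0], None, stripe[2], stripe[3]]) == stripe

# Two devices missing: nothing can be rebuilt until one of them comes back.
try:
    rebuild([None, None, stripe[2], stripe[3]])
except ValueError as err:
    print(err)
```

In this model, bringing one of the two missing devices back turns the second case into the first one, and the rebuild can proceed again.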

Take the lesson: use dual-disk redundancy, add a hot spare, or both. Disks are too big these days to rebuild quickly, so single-disk redundancy is no longer good enough.
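As a rough back-of-the-envelope check on rebuild times, the sketch below divides disk capacity by a sustained write rate. The disk sizes and the 100 MB/s figure are assumptions picked for illustration, not measurements from this system, and ZFS only resilvers allocated data, so treat the result as a worst case for a full disk.

```python
def rebuild_hours(capacity_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to write a whole disk at a sustained rate (decimal units)."""
    capacity_bytes = capacity_tb * 1e12
    bytes_per_second = throughput_mb_s * 1e6
    return capacity_bytes / bytes_per_second / 3600


# Assumed disk sizes and an assumed 100 MB/s sustained rebuild rate.
for size_tb in (0.5, 2, 4):
    print(f"{size_tb} TB at 100 MB/s: {rebuild_hours(size_tb, 100):.1f} h")
```

The whole time the rebuild runs, a single-redundancy array has no protection against a second failure, which is why the window matters.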

  • Interestingly, when I re-added the missing virtual device, the read activity in the resilver appeared to be dominated by that device. I've also successfully opened a couple of files that had been reported as having errors. Commented Sep 1, 2013 at 17:08
