
Normally, to get a fault-tolerant or corruption-repairing filesystem, you use multiple drives with RAID 5 (or any level other than RAID 0).

There are also many ways to make a fault-tolerant archive file, with tools like dar.

What I am looking for is a way to make a single external SSD safer against bitrot from extended unpowered storage, yet otherwise use it as a normal drive: just mount it and read/write files when I want, like any other filesystem. The catch is that "when I want" can sometimes be years apart.

"normal" doesn't mean usable from Windows and Mac. Linux-only is ok.

  • RAID cannot repair corruption on its own: it only knows there is a mismatch, not which copy is correct. Instead, it relies on the drives to report bad sectors as read errors. If you trust your drive to report read errors properly, you could partition it and run RAID across its partitions, creating a single-drive RAID 5/6 (a sketch of this appears after these comments). That would be kinda weird, but it should work. Still, it is not the same as proper checksums. Commented Mar 29, 2023 at 7:26
  • Yes, RAID or ECC in general can replace lost information as well as detect and correct corrupted information, depending on the algorithm and the amount of redundant data maintained. Yes, one could create a virtual RAID out of image files or partitions, but that is essentially taking a solution that exists for the block-device layer and virtualizing it just to use it at the filesystem layer. The question is: doesn't an equivalent exist at the filesystem layer? Commented Mar 30, 2023 at 9:50
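To illustrate the single-drive RAID idea from the first comment, here is a sketch only: /dev/sdX and the three-equal-partition layout are assumptions, and this still depends on the drive reporting bad sectors as read errors.

mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdX1 /dev/sdX2 /dev/sdX3
mkfs.ext4 /dev/md0    # then use /dev/md0 like any other block device
mount /dev/md0 /mnt/archive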

2 Answers


First, it's good to know that most modern HDDs already have internal ECC mechanisms (https://en.wikipedia.org/wiki/Hard_disk_drive#Error_rates_and_handling). That doesn't replace RAID 5, but it's good to know it exists. I'd recommend actively monitoring some key SMART attributes, mainly #5, #187, #188, #197 and #198 (https://www.backblaze.com/blog/hard-drive-smart-stats/).
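For reference, the smartmontools package can pull exactly those attributes; /dev/sdX below is a placeholder for your drive:

smartctl -A /dev/sdX | grep -E '^ *(5|187|188|197|198) '    # print only those five attributes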

As for the file system, I guess what you are looking for is an FS that supports data scrubbing (https://en.wikipedia.org/wiki/Data_scrubbing). ZFS can do this (https://docs.oracle.com/cd/E19253-01/819-5461/gbbxi/index.html).

[EDIT] From https://www.45drives.com/community/articles/zfs-best-practices/

ZFS scrub is the best way to handle the dreaded bit rot. Every time ZFS reads a block, it compares that block to its checksum and then automatically fixes it if need be. However there may be data that you write to your ZPool and then it doesn't get read again for a very long time.

Unfortunately, this data is not then automatically protected from bit rot. This is the reason why we have data scrubs. It's best practice to schedule at least one scrub a month, and some may want to do it as often as once a week, although this isn't completely necessary. While you can still use your ZPool during a scrub, you may want to schedule these for off hours or downtime because, while it's not too intensive, it does generate some I/O on your disks.
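For reference, starting and checking a scrub is one command each; the pool name tank below is a placeholder:

zpool scrub tank     # start a scrub of the whole pool
zpool status tank    # shows scrub progress and any errors found/repaired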


@ChennyStar mentions ZFS, but there are other, possibly better or easier, options (ZFS can be a heavy lift to use correctly).

Btrfs and ZFS do this, plus maybe others

  • Btrfs supports a number of hashing functions for data integrity, some of which, like SHA-256, can also be used for data deduplication. It's a great filesystem, but less mature than something like Ext4 for production systems.
  • XFS supports CRC32 checksums of metadata ONLY.
  • Ext4 supports CRC32 checksums of metadata ONLY, and requires enabling them (see Ext4 Metadata Checksums).
  • ZFS supports a few different hash algorithms, supports deduplication, and many modes of data redundancy, including single disks and multiple disks. It has a somewhat steep learning curve, but is very mature; a sketch of a single-disk setup follows this list.
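As a minimal sketch of the single-disk ZFS route (pool and device names below are placeholders): with copies=2, ZFS stores two copies of every block on the one drive, so a scrub can repair a bad copy from the good one. Note that copies=2 roughly halves usable capacity and only applies to data written after the property is set.

zpool create archive /dev/sdX    # single-disk pool; checksumming is on by default
zfs set copies=2 archive         # keep two copies of each block for self-healing
zpool scrub archive              # verify (and repair) after each long storage period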

DIY, simpler perhaps

You might also consider writing your own cron job (or similar) to sha256 all your files, keep track of their state, and alert you to issues. You could even bolt this onto something like Ext4 by storing the hash in an extended attribute, like this: setfattr -n user.checksum -v "3baf9ebce4c664ca8d9e5f6314fb47fb" file.txt, or simply save it in a text file somewhere. This way you fully control which files are checked and what the behavior is; a sketch of such a checker follows the bit-flip example below. You can test the mechanism by copying a file, corrupting one bit of the copy, and seeing what happens. To corrupt one bit (which is surprisingly hard to do from the CLI):

a=$(xxd -p -l 1 -s 3 <original_important_file>); printf '%02x' $(( 0x$a ^ 1 )) | xxd -r -p | dd of=<corrupted_important_file> bs=1 seek=3 count=1 conv=notrunc

I sourced and tweaked this script from @Kantium's answer
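And here is a minimal sketch of the xattr-based checker described above; the path and xattr name are arbitrary choices, and there is no error handling:

#!/bin/bash
# Record each file's sha256 in a user.checksum xattr on the first pass;
# on later passes, flag any file whose current hash no longer matches.
find /mnt/archive -type f -print0 | while IFS= read -r -d '' f; do
    sum=$(sha256sum "$f" | awk '{print $1}')
    old=$(getfattr --only-values -n user.checksum "$f" 2>/dev/null)
    if [ -z "$old" ]; then
        setfattr -n user.checksum -v "$sum" "$f"    # first pass: record
    elif [ "$old" != "$sum" ]; then
        echo "CHECKSUM MISMATCH: $f" >&2            # possible bitrot
    fi
done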

  • The bitrot is occurring on a USB SSD in a storage unit or drawer for some years. It's not plugged in to anything, so no cron job can maintain it. Commented Nov 10, 2023 at 1:07
  • For an unplugged drive, the bitrot issue will be problematic regardless. The cron job could test for the filesystem being available. To really handle the bitrot with a single, offline drive, you'll probably need ZFS used like this: zfs set copies=2 users/home, then verify with zfs get copies users/home, which should report the copies property as 2 (local); for Btrfs, set the data profile to DUP, as sketched below. Commented Dec 4, 2023 at 21:09
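A sketch of that Btrfs route (device and mount point are placeholders): the DUP profile keeps two copies of everything on the one device, at the cost of half the capacity, and a scrub repairs a corrupted copy from the good one.

mkfs.btrfs -m dup -d dup /dev/sdX    # duplicate both metadata and data
mount /dev/sdX /mnt/archive
btrfs scrub start -B /mnt/archive    # foreground scrub after long storage
btrfs scrub status /mnt/archive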
