
I've read that ZFS and Btrfs store checksums, but they don't use them for error correction on their own; they only use them to recover data from a full local copy or a mirror.

On the other hand, RAR archives have long supported a configurable amount of recovery data: the more redundancy, the higher the probability of a successful recovery. The same goes for dvdisaster, which can create .ecc files with recovery data, albeit on a separate medium.

Many storage media, like optical discs or hard disks, have low-level ECC implemented in the drive controller, so ECC is less needed at higher levels of abstraction. But other media, like cheap microSD cards, may lack it and are noticeably unreliable.

So, there are ECC checks on hardware level and application level, but are there any ECC-backed filesystems?

  • Note that SD cards already do plenty of error correction under the hood that you don't see, because it's done in hardware. The claim that cheap SD cards in particular don't have it is plain wrong; the cheaper the physical storage in the card is, the more ECC it needs. All modern NAND flash SD controllers have extensive ECC built in, typically even configurable by the SD card manufacturer. Commented Sep 4, 2023 at 3:45
  • There's good reason to do this error correction at the hardware level: the decoders on the SD card controller get soft bits, which means they can correct better than anything that only gets already hard-decided bits. The hardware memory cells are interleaved to make errors appear uncorrelated or even independent. That structure is potentially already lost in the bits coming out of an iterative decoder. Commented Sep 4, 2023 at 3:49
  • @MarcusMüller I concur: this is why I'm a stickler for ECC memory in my boxes, and ALL the complementary parts required to make it work - most especially the motherboard and CPU. I'm generally OK with ext4 so long as one makes a confirming pass afterwards on anything that's actually important, though I may be being overly cautious. ...For around two decades now, I've literally "attached" checksums to my backups, starting with MD5 back in the day; now there are lots of better choices, but the idea is the same. Commented Sep 4, 2023 at 3:55

2 Answers


Linux has the dm-integrity layer with which you can add error correction to any block device.
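Conceptually, an integrity layer stores a checksum tag per sector and verifies it on every read; detection, rather than correction, is the base capability. A minimal Python sketch of the idea (not dm-integrity's actual on-disk format or API; the sector size and tag length here are illustrative):

```python
import hashlib

SECTOR = 512  # bytes per sector (illustrative; dm-integrity's geometry differs)

def write_sector(store, tags, n, data):
    """Store a sector together with a truncated checksum tag."""
    assert len(data) == SECTOR
    store[n] = data
    tags[n] = hashlib.sha256(data).digest()[:8]

def read_sector(store, tags, n):
    """Return the sector, or raise if the tag no longer matches (detected corruption)."""
    data = store[n]
    if hashlib.sha256(data).digest()[:8] != tags[n]:
        raise IOError(f"integrity check failed on sector {n}")
    return data

store, tags = {}, {}
write_sector(store, tags, 0, b"\x41" * SECTOR)
assert read_sector(store, tags, 0) == b"\x41" * SECTOR
# Simulate a bit flip on the medium: the next read is detected as bad,
# not silently returned.
store[0] = b"\x40" + store[0][1:]
```

With a redundant layer (e.g. dm-raid) stacked on top, such detected-bad sectors can then be repaired from another copy.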

Sadly, it'll be relatively bad at actually solving the issues unreliable SD cards pose:

The most typical fault mode is... just not working anymore. That typically happens once the physically available memory has sustained all the wear leveling it can and is depleted. Nothing you can do about that but write and read less. Adding error-coding information, counter-intuitively, hurts here, because you amplify the amount of data you write and read. But that's just an amplification by 1/r, r being the rate of the code.
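The 1/r amplification is easy to quantify: with a rate-r code, every payload byte becomes 1/r bytes on the medium. A quick back-of-the-envelope check:

```python
def written_bytes(payload_bytes, rate):
    """Total bytes hitting the flash for a code of rate `rate` (payload / total)."""
    return payload_bytes / rate

# A rate-1/2 code doubles the wear; a rate-8/9 code adds only ~12.5%.
assert written_bytes(1_000_000, 0.5) == 2_000_000
assert round(written_bytes(1_000_000, 8 / 9)) == 1_125_000
```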

After the SD card has applied its built-in error correction, the data you read is either correct, or the errors are block-local and correlated. If you need to correct these, you will have to use a code whose blocks span multiple logical blocks from the SD card. That again means a read and write amplification, but this time by an integer factor of at least two. So, that's actually significantly worse.
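To see why spanning blocks at least doubles the traffic, consider the simplest such code: one XOR parity block over a group of data blocks. Rebuilding any one lost block requires reading every other block in the group, and every small write must also update the parity block. A sketch, assuming the position of the lost block is known (e.g. the card returned a read error for it):

```python
from functools import reduce

def parity(blocks):
    """XOR parity over equal-sized blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def recover(surviving, parity_block):
    """Rebuild one missing block: XOR the parity with all surviving blocks."""
    return parity(surviving + [parity_block])

data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)
# Lose block 1; rebuilding it reads all remaining blocks plus the parity
# (read amplification), and p had to be rewritten on every update of any
# block (write amplification).
assert recover([data[0], data[2]], p) == b"BBBB"
```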

So, in all honesty, if your problem is unreliable flash storage, the appropriate response is to deal with it between the physical flash and the point where things appear as blocks of memory to your storage system; in other words, in the flash translation layer within the SD card. That would additionally allow you to apply soft decoding for additional coding gain, and to use codes designed for the asymmetric channel that flash memory represents at the lowest level (typically a Z-channel!). These are properties lost through the decoding/decision and deinterleaving happening in the FTL itself, and that loss will be hard to compensate for on the data you get from the SD card.
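The Z-channel mentioned here is the asymmetric channel in which one symbol is transmitted reliably and only the other can flip. For illustration, assume programmed 1-bits can decay to 0 but a 0 never reads back as 1 (which direction is the reliable one depends on the flash's conventions). A toy simulation:

```python
import random

def z_channel(bits, p, rng):
    """Asymmetric errors: a 1 decays to 0 with probability p; a 0 always reads as 0."""
    return [0 if b == 1 and rng.random() < p else b for b in bits]

rng = random.Random(42)
sent = [1, 0, 1, 1, 0, 0, 1, 1] * 1000
received = z_channel(sent, 0.1, rng)
# No 0 ever turns into a 1 -- a decoder designed for this channel can
# exploit the asymmetry, which a generic symmetric-channel code cannot.
assert all(r == 0 for s, r in zip(sent, received) if s == 0)
```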

There, you would directly choose a code that fulfills your reliability requirements. The problem is that the worse the physical flash memory is, and the more reliably you want the storage to behave, the lower your code rate gets, meaning you need more flash cells per bit of data. Which is exactly the trade-off that makes any flash-based storage device either cheap and less reliable, or expensive and more reliable.

So, with unreliable SD cards, you've basically lost. There might be a window where a bit of coding in your PC could correct errors without making errors more likely than it prevents, but you'd really need to run a large study on how long it takes to make your SD card fail before you could settle on a rate. Which isn't worth the trouble: you're not buying 100,000 cards from the same factory run just to figure out how to make them 0.1% more reliable. You'd just buy more reliable cards.

Sorry.

What you could of course do is add true redundancy by using independent cards in a mirroring or parity scheme, but the usual caveats for any kind of RAID apply: you need to make sure that, at the moment you need to restore one of the underlying volumes from the others, it's not too late and the intense recovery read load doesn't uncover, or even cause, further, then-uncorrectable errors. Again, cheap SD cards are the worst commercially available choice for that, because the quality of information on their reliability is low, and so is their individual device reliability.

Concluding, I don't really see a practical scenario where you would want to make reliable storage out of unreliable SD cards.

  • Thanks for the write-up; comprehensive and accurate. ...My own solution would be like what the Apollo astronauts used to get to the moon: An odd number of identical operations with any disagreement settled by "majority vote." Commented Sep 4, 2023 at 18:12
  • That is a pretty bad channel code, actually! Commented Sep 5, 2023 at 3:56
  • Thanks, dm-integrity seems to be a more straightforward solution for the task than RAID 5 built from partitions on the same medium. But it also seems to be oriented more toward consistency checking than error repair. I've read the manual of integritysetup and couldn't find whether I can increase the amount of ECC. The default amount of service data for a test 128 MB image turned out to be only around 2 MB, which is too little to help fix flaky bits. Commented Sep 5, 2023 at 11:09
  • Again, you have a strange model of what kind of errors appear, how often. If that is not enough, throw away the SD card; I intentionally spent most of the time writing that answer on explaining why it's mathematically hard to make reliable memory out of bad SD cards, but you seem to choose to cherry-pick the information out that it's possible. Commented Sep 5, 2023 at 12:07
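The "majority vote" scheme from the comments is classical triple modular redundancy: store an odd number of copies (three, here) and take the bitwise majority. As noted, it is a weak channel code: rate 1/3, and it fails silently whenever two copies are corrupted the same way at the same bit. A sketch:

```python
def majority(a, b, c):
    """Bitwise majority vote over three equal-length byte strings."""
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))

good = b"hello world"
# One corrupted copy is outvoted by the other two...
assert majority(good, b"hellX world", good) == good
# ...but matching corruption in two copies wins the vote, silently.
bad = b"hellX world"
assert majority(bad, bad, good) == bad
```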

Yes, they're called, collectively, "RAID": Redundant Array of Independent Disks.

The best that's out there for non-RAID configurations includes ext4 on Linux. I believe there are others. But these don't so much fix errors as catch them when writing, I believe.

  • I have mentioned mirror copies in the question, they're different from fixing errors with the use of error codes, and they have a minimum level of 100% redundancy. Commented Sep 4, 2023 at 1:54
  • @bodqhrohro many raid types do not do copies, but really do parity. Read up on raid5 and raid6. Commented Sep 4, 2023 at 3:39
  • @bodqhrohro also, a mirror with a checksum is an error-correcting code. The code rate is a little less than 1/2, the minimum Hamming distance is roughly half the checksum length, the vast majority of error patterns are detectable, all errors affecting only one of the copies are correctable, and basically everything else boils down to a list decoder that is very close to maximum-likelihood decoding. So, a classical error-correcting code in coding theory. Commented Sep 4, 2023 at 3:43
  • 1
    @MarcusMüller correct, I forgot about that kind of RAIDs. I just tried to create a 128 MB disk image, partitioned it into 6 even partitions, created RAID 5 from them and got a 95 MB virtual disk finally. Seems like a great solution for unreliable microSD cards, thank y'all so much! Commented Sep 4, 2023 at 11:58
  • It is totally not a great solution for unreliable microSD cards, whose failure modes are not single bit flips or small-region erasures, but complete-device failures accelerated by the now vastly increased number of both read and write accesses, which maximize degradation of the physical storage and minimize ability to do wear leveling. As said before, effective error correction should be done at the hardware level there. Commented Sep 4, 2023 at 12:52
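The "mirror plus checksum" construction from the earlier comment can be made concrete: keep two copies and a checksum of the original, and on read return whichever copy still matches the checksum. A sketch using CRC32 (any checksum with low collision probability would do):

```python
import zlib

def store(data):
    """Mirror the data and keep a checksum of the original."""
    return {"a": data, "b": data, "crc": zlib.crc32(data)}

def read(rec):
    """Return a copy that still matches the checksum.

    Any error pattern confined to one copy is corrected; if both copies
    are hit (or the checksum itself), detection is all that remains.
    """
    for copy in (rec["a"], rec["b"]):
        if zlib.crc32(copy) == rec["crc"]:
            return copy
    raise IOError("both copies corrupted (or the checksum itself was hit)")

rec = store(b"important data")
rec["a"] = b"imp0rtant data"   # corrupt one mirror
assert read(rec) == b"important data"
```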
