
While browsing folders with ncdu, I noticed that the apparent size of a file was sometimes much larger than the actual disk usage. Example via ncdu, then pressing a to toggle between showing disk usage and showing apparent size, then i to show more details:

[screenshot: ncdu file details showing an apparent size far larger than the disk usage]

I was told this may be due to some automatic process that only keeps a small portion of the data in a "fast" layer and keeps the rest in a slower place such as AWS S3. How can I check that?


As suggested by Chris Down, here is part of the output of hexdump run on that file:

[screenshot: hexdump output of the file, showing non-zero data]

It seems to indicate the file isn't sparse.

As suggested by Artem S. Tashkinov, the file system is Lustre (checked with sudo df -T).
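For reference, the same discrepancy can be reproduced outside ncdu with GNU du (somefile is a placeholder for the file in question); the first command prints the apparent size in bytes, the second the allocated size:

    du -B1 --apparent-size somefile
    du -B1 somefile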

  • That explanation sounds bogus, unless you are using FUSE or similar and the implementation is blatantly lying. Most likely this file is sparse. What does it actually look like if you view the hexdump? Commented Mar 9, 2024 at 16:53
  • @ChrisDown Thanks, not sparse it seems: i.sstatic.net/QPA2Q.jpg Commented Mar 9, 2024 at 16:58
  • Soft/Hard link maybe? Some weird FUSE fs? Commented Mar 9, 2024 at 17:27
  • 1
    That hex dump shows fewer than 512 bytes. Are there more non-zero bytes? Commented Mar 9, 2024 at 18:16
  • 2
    Since it's a virtual network (!) FS all bets are off. It's impossible to say how much these files actually occupy because files on this FS are not files per se, they are "objects" and can be physically stored anywhere. Commented Mar 9, 2024 at 18:47

1 Answer


This is usual (but non-obvious) behaviour on hierarchical storage systems such as Lustre. ncdu and other sparseness-measuring tools generally rely on the value given in st_blocks in response to a stat call. On historical, non-copy-on-write file systems, the obvious behaviour is the one we’ve grown to expect: each file occupies on disk at least the exact amount of non-zero data it contains, so st_blocks indicates at least the amount of actual non-zero data stored in the file. In Lustre, st_blocks represents the storage used on the frontend system, not the overall amount of data stored in the file.
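As an illustration of the two numbers ncdu compares, GNU stat can print both directly (somefile is a placeholder name):

    stat -c 'apparent size: %s bytes; allocated: %b blocks of %B bytes' somefile

On most local file systems %b × %B is at least as large as the non-zero data in the file; on Lustre it only reflects what currently resides on the frontend storage.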

There is just one slight exception to this: “released” files (whose contents are entirely removed from frontend storage) indicate that they occupy 512 bytes, not 0. This was implemented as a workaround for an issue with tools such as tar which will skip reading files with 0 st_blocks entirely, resulting in data loss on Lustre-based file systems (archiving a released file with sparse file detection, which is common in backup scenarios, would end up storing no data at all). When a file indicates that it occupies 512 bytes, tools have to read it (or use fiemap ioctls etc.) to determine exactly what it contains; on Lustre such actions prompt the file’s data to be retrieved from wherever it is stored in the hierarchy.
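For example, filefrag from e2fsprogs uses the FIEMAP ioctl to list the extents actually allocated, without reading the file's contents (how accurately Lustre answers FIEMAP for released files depends on the version, so treat this as a sketch):

    filefrag -v somefile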

With huge files, it’s unusual for the entire file to be restored to frontend storage, which is why you only end up with a partial “occupied block count” in some scenarios.
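To check where a given file currently sits in the hierarchy, the Lustre client tools provide lfs hsm_state, which reports HSM flags such as released, exists and archived (exact output varies with the Lustre version):

    lfs hsm_state somefile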
