
While browsing folders with ncdu, I noticed that the apparent size of a file was sometimes much larger than the actual disk usage. Example via ncdu, then pressing a to toggle between showing disk usage and showing apparent size, then i to show more details:

[screenshot: ncdu file details showing an apparent size far larger than the disk usage]

I was told this may be due to some automatic process that only keeps a small portion of the data in a "fast" layer and keeps the rest in a slower place such as AWS S3. How can I check that?


As suggested by Chris Down, here is part of the output of hexdump run on that file:

[screenshot: hexdump output of the file, showing non-zero data]

It seems to indicate the file isn't sparse.

As suggested by Artem S. Tashkinov, the file system is Lustre (checked with sudo df -T).
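For reference, the same discrepancy can be reproduced outside ncdu with GNU du (somefile is a placeholder for the file in question); the first command prints the apparent size in bytes, the second the allocated size:

    du -B1 --apparent-size somefile
    du -B1 somefile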

  • That explanation sounds bogus, unless you are using FUSE or similar and the implementation is blatantly lying. Most likely this file is sparse. What does it actually look like if you view the hexdump? Commented Mar 9, 2024 at 16:53
  • @ChrisDown Thanks, not sparse it seems: i.sstatic.net/QPA2Q.jpg Commented Mar 9, 2024 at 16:58
  • Soft/Hard link maybe? Some weird FUSE fs? Commented Mar 9, 2024 at 17:27
  • 1
    That hex dump shows fewer than 512 bytes. Are there more non-zero bytes? Commented Mar 9, 2024 at 18:16
  • 2
    Since it's a virtual network (!) FS all bets are off. It's impossible to say how much these files actually occupy because files on this FS are not files per se, they are "objects" and can be physically stored anywhere. Commented Mar 9, 2024 at 18:47

1 Answer


This is usual (but non-obvious) behaviour on hierarchical storage systems such as Lustre. ncdu and other sparseness-measuring tools generally rely on the value given in st_blocks in response to a stat call. On historical, non-copy-on-write file systems, the obvious behaviour is the one we’ve grown to expect: each file occupies on disk at least the exact amount of non-zero data it contains, so st_blocks indicates at least the amount of actual non-zero data stored in the file. In Lustre, st_blocks represents the storage used on the frontend system, not the overall amount of data stored in the file.
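As an illustration of the two numbers ncdu compares, GNU stat can print both directly (somefile is a placeholder name):

    stat -c 'apparent size: %s bytes; allocated: %b blocks of %B bytes' somefile

On most local file systems %b × %B is at least as large as the non-zero data in the file; on Lustre it only reflects what currently resides on the frontend storage.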

There is just one slight exception to this: “released” files (whose contents are entirely removed from frontend storage) indicate that they occupy 512 bytes, not 0. This was implemented as a workaround for an issue with tools such as tar which will skip reading files with 0 st_blocks entirely, resulting in data loss on Lustre-based file systems (archiving a released file with sparse file detection, which is common in backup scenarios, would end up storing no data at all). When a file indicates that it occupies 512 bytes, tools have to read it (or use fiemap ioctls etc.) to determine exactly what it contains; on Lustre such actions prompt the file’s data to be retrieved from wherever it is stored in the hierarchy.
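For example, filefrag from e2fsprogs uses the FIEMAP ioctl to list the extents actually allocated, without reading the file's contents (how accurately Lustre answers FIEMAP for released files depends on the version, so treat this as a sketch):

    filefrag -v somefile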

With huge files, it’s unusual for the entire file to be restored to frontend storage, which is why you only end up with a partial “occupied block count” in some scenarios.
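To check where a given file currently sits in the hierarchy, the Lustre client tools provide lfs hsm_state, which reports HSM flags such as released, exists and archived (exact output varies with the Lustre version):

    lfs hsm_state somefile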
