
When using du to get the total size of a folder, the command, as I understand it, enumerates every file in every (sub)folder and adds up the sizes.

    yann@p:~$ du /var/log
    4       /var/log/ntpstats
    ...
    148     /var/log/apt
    564     /var/log/installer
    8       /var/log/cups
    91748   /var/log

However, how can the df command instantly return results such as

    Filesystem     1K-blocks     Used Available Use% Mounted on
    /dev/sda1       35209808 18707476  14694008  57% /

without needing to enumerate all the files on the drive?

If there is a fast way to know the used space on a whole drive, then why is there no fast way to know the size of a folder? Or is there?

Thanks in advance.

2 Answers


df uses the statvfs() system call and asks the filesystem for its current space statistics. This is of course fast, as the filesystem always keeps track of the space used while it manages it.

So the reason df is fast is that it uses precomputed, cached values from the filesystem.
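As a rough illustration, here is a minimal sketch in Python, which exposes the same syscall as os.statvfs(). It reproduces df-style numbers for a mount point without touching a single file (a sketch, not df's actual source):

    import os

    def df(path="/"):
        # statvfs() returns the per-filesystem counters that the kernel
        # keeps up to date; no directory traversal is involved.
        st = os.statvfs(path)
        block = st.f_frsize                  # fundamental block size in bytes
        total = st.f_blocks * block          # size of the whole filesystem
        free  = st.f_bfree  * block          # free space (including root's reserve)
        avail = st.f_bavail * block          # free space available to non-root users
        used  = total - free
        print(f"{'1K-blocks':>10} {'Used':>10} {'Available':>10}")
        print(f"{total // 1024:>10} {used // 1024:>10} {avail // 1024:>10}")

    df("/")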

Here is the history:

In the 1970s, df was a set-uid root program that read the raw disk device and fetched the filesystem statistics from the super block.

In the mid-1980s, SunOS introduced the syscall statfs() together with the first VFS implementation. This call no longer required any privileges. The interface was given to *BSD during the last SunOS/BSD code exchange at the Tahoe meeting.

In 1989, SVr4/Solaris introduced an enhanced VFS interface that renamed the syscall to statvfs(). This version of the syscall was added to POSIX, from where various operating systems copied the interface.

Since the df data is indirectly obtained from the super block, which only holds values for the whole filesystem, there is no quick way to get the numbers for a single directory.


The file system probably keeps a count of used and free data blocks as part of normal operation. df uses this information.

Even if the file system doesn't keep a real-time counter, it needs a quick way to find free blocks when writing new data, and the same data structures can be used to count the free blocks.


In theory, some filesystem could keep such a used space counter on a per-directory basis, too. However, there are a few problems.

If the count were kept for the whole subtree recursively, the filesystem would need to propagate every usage change upwards to an arbitrary depth, which might slow down all write operations. If it were kept only for the files immediately within the directory, a recursive walk of the tree would still be required to find the total size of a subtree.
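A toy sketch of that hypothetical design (not any real filesystem) makes the cost visible: every write has to update counters all the way up to the root, so each write becomes O(depth):

    class Dir:
        # Hypothetical directory node that keeps a recursive usage counter.
        def __init__(self, parent=None):
            self.parent = parent
            self.used = 0                    # total bytes in this subtree

        def grow_file(self, nbytes):
            # The price of an O(1) "du": every write walks up to the root.
            node = self
            while node is not None:
                node.used += nbytes
                node = node.parent

    root = Dir()
    sub = Dir(parent=root)
    sub.grow_file(4096)
    print(root.used, sub.used)               # 4096 4096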

On Unix-like filesystems, hard links are an even bigger obstacle. When a file can be linked to from multiple directories (or multiple times from the same directory), it has no unique parent directory. Where would the file's size be counted? Counting it on all directories that link to it would produce an inflated total usage, as the file could be counted multiple times. Counting it on only one directory would also be obviously wrong.

In fact, files (i.e. inodes) on traditional Unix filesystems don't even know which directories they reside in, only the count of links to them (the number of names they have). In most usage, that information is not needed, since files are accessed primarily by name anyway. Storing it would also require an arbitrary amount of data in the inode, duplicating data that is already in the directories.
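To make the contrast with df concrete, here is a minimal du-like walk in Python (a sketch, not du's actual implementation). Note that it must visit every file, and that it has to remember (st_dev, st_ino) pairs so that a file with several hard links is only counted once:

    import os

    def du(top):
        # Total disk usage of a tree in bytes -- every inode must be visited.
        st = os.lstat(top)
        seen = {(st.st_dev, st.st_ino)}      # (device, inode) pairs already counted
        total = st.st_blocks * 512           # st_blocks is in 512-byte units
        for dirpath, dirnames, filenames in os.walk(top):
            for name in dirnames + filenames:
                st = os.lstat(os.path.join(dirpath, name))
                key = (st.st_dev, st.st_ino)
                if key in seen:              # hard link to something already counted
                    continue
                seen.add(key)
                total += st.st_blocks * 512
        return total

    print(du("/var/log") // 1024, "K")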

