Readdir()'s inode numbers versus OverlayFS
Recently I re-read Deep Down the Rabbit Hole: Bash, OverlayFS, and a 30-Year-Old Surprise (via) and this time around, I stumbled over a bit in the writeup that made me raise my eyebrows:
Bash’s fallback getcwd() assumes that the inode [number] from stat() matches one returned by readdir(). OverlayFS breaks that assumption.
I wouldn't call this an 'assumption' so much as 'sane POSIX semantics', although I'm not sure that POSIX absolutely requires this.
As we've seen before, POSIX talks about 'file serial number(s)' instead of inode numbers. The best definition of these is covered in sys/stat.h, where we see that a 'file identity' is uniquely determined by the combination the inode number and the device ID (st_dev), and POSIX says that 'at any given time in a system, distinct files shall have distinct file identities' while hardlinks have the same identity. The POSIX description of readdir() and dirent.h don't caveat the d_ino file serial numbers from readdir(), so they're implicitly covered by the general rules for file serial numbers.
In theory you can claim that the POSIX guarantees don't apply here since readdir() is only supplying d_ino, the file serial number, not the device ID as well. I maintain that this fails due to a POSIX requirement:
[...] The value of the structure's
d_inomember shall be set to the file serial number of the file named by thed_namemember. [...]
If readdir() gives one file serial number and a fstatat() of the same name gives another, a plain reading of POSIX is that one of them is lying. Files don't have two file serial numbers, they have one. Readdir() can return duplicate d_ino numbers for files that aren't hardlinks to each other (and I think legitimately may do so in some unusual circumstances), but it can't return something different than what fstatat() does for the same name.
The perverse argument here turns on POSIX's 'at any given time'. You can argue that the readdir() is at one time and the stat() is at another time and the system is allowed to entirely change file serial numbers between the two times. This is certainly not the intent of POSIX's language but I'm not sure there's anything in the standard that rules it out, even though it makes file serial numbers fairly useless since there's no POSIX way to get a bunch of them at 'a given time' so they have to be coherent.
So to summarize, OverlayFS has chosen what are effectively non-POSIX semantics for its readdir() inode numbers (under some circumstances, in the interests of performance) and Bash used readdir()'s d_ino in a traditional Unix way that caused it to notice. Unix filesystems can depart from POSIX semantics if they want, but I'd prefer if they were a bit more shamefaced about it. People (ie, programs) count on those semantics.
(The truly traditional getcwd() way wouldn't have been a problem, because it predates readdir() having d_ino and so doesn't use it (it stat()s everything to get inode numbers). I reflexively follow this pre-d_ino algorithm when I'm talking about doing getcwd() by hand (cf), but these days you want to use the dirent d_ino and if possible d_type, because they're much more efficient than stat()'ing everything.)
|
|