According to the Open Group specs, POSIX du doesn't have the -b option to display the size in bytes. So what is the POSIX-compliant way to get the size of a file or folder in bytes?
2 Answers
As an approximation of what GNU du does with -sb, you could do:
    cumulative_size() (
      export LC_ALL=C
      ret=0
      [ "$#" -gt 0 ] || set .
      for file do
        case $file in
          (/*) sanitized=$file;;
          (*) sanitized=./$file;;
        esac
        size=$(
          find "$sanitized" ! -type b ! -type c -exec ls -niqd {} + |
            awk '! seen[$1]++ {sum += $6} END {print sum}'
        )
        if [ -n "$size" ]; then
          printf '%s\t%s\n' "$size" "$file"
        else
          ret=1
        fi
      done
      exit "$ret"
    )

Like GNU du, we try to count each file only once by looking at its inode number (as reported in the first field of ls -ni). But since ls cannot report the device number, that assumes the directory hierarchies don't span several filesystems.
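To see what the inode-based deduplication buys, here is a minimal sketch that sums the sizes of two hard links to the same file, once with the seen[] filter and once without. The scratch directory name is an assumption made for the demo, and the find is restricted to -type f only so that the result does not depend on how the filesystem sizes directories.

```shell
# Scratch directory for the demo (hypothetical name, removed at the end).
demo=${TMPDIR:-/tmp}/inode_demo.$$
mkdir "$demo"

printf %s 0123456789 > "$demo/a"   # 10 bytes
ln "$demo/a" "$demo/b"             # second name for the same inode

# With ls -ni, field 1 is the inode number and field 6 is the size.
dedup=$(
  LC_ALL=C find "$demo" -type f -exec ls -niqd {} + |
    awk '! seen[$1]++ {sum += $6} END {print sum}'
)
# Same pipeline without the inode filter: every name is counted.
naive=$(
  LC_ALL=C find "$demo" -type f -exec ls -niqd {} + |
    awk '{sum += $6} END {print sum}'
)

printf 'deduplicated: %s\n' "$dedup"   # 10 where hard links are supported
printf 'naive:        %s\n' "$naive"   # 20 where hard links are supported

rm -rf "$demo"
```

The two names share one inode, so the filtered sum counts the 10 bytes once while the unfiltered one counts them twice.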
Contrary to du we also only do the deduplication in each file argument.
For instance in:
    cumulative_size dir dir

The cumulative disk usage of dir and its contents is reported twice, with files within each counted only once, while GNU du -bs would only report dir disk usage once.
We exclude device files because ls -n doesn't report their size. On Linux at least, that won't make a difference as their size is otherwise always reported as 0.
find can't be given file paths that start with - or whose name matches its predicates (including !, (... as well). Here we work around that by prefixing the file paths with ./ if they don't start with /, so find ! or find -print becomes find ./! or find ./-print for instance. That assumes no find implementation has a predicate that starts with /. That means we also don't need to pass a -- to ls to mark the end of its options.
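The prefixing rule can be tried in isolation. This is a minimal sketch; sanitize_path is a hypothetical helper name introduced for illustration (the function above assigns the result directly to a variable instead of printing it).

```shell
# Hypothetical helper: make a path safe to pass as a find starting point.
sanitize_path() {
  case $1 in
    (/*) printf '%s\n' "$1";;      # absolute paths are left alone
    (*)  printf '%s\n' "./$1";;    # everything else gets a ./ prefix
  esac
}

sanitize_path '-print'   # prints ./-print
sanitize_path '!'        # prints ./!
sanitize_path /etc       # prints /etc
```

Printing through a command substitution would strip trailing newlines from a file name, which is presumably one reason to assign inside the case statement as the original function does.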
We use ls -n instead of ls -l to avoid decoding uids/gids into user or group names (which would be expensive and would also cause problems here for names containing spaces). POSIX specifies -o/-g options to remove those fields altogether, but they are optional there.
The output of ls -n is only specified in the C/POSIX locale. Also, file paths being arbitrary sequences of non-null bytes, you can only process them as text in the C locale, hence the LC_ALL=C.
We also use -q to make sure newline characters in file names or symlink targets don't put a spanner in the works.
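The effect of -q can be checked with a file whose name contains a newline. A minimal sketch, assuming the filesystem accepts such a name; the scratch directory name is made up for the demo.

```shell
# Scratch directory for the demo (hypothetical name, removed at the end).
demo=${TMPDIR:-/tmp}/newline_demo.$$
mkdir "$demo"

nl='
'
: > "$demo/a${nl}b"   # one file, with a newline inside its name

# With -q the newline is written as ?, so the file takes one output line.
# The arithmetic expansion strips any padding wc may add around the count.
lines=$(( $(LC_ALL=C ls -nqd "$demo"/* | wc -l) ))
printf 'output lines with -q: %s\n' "$lines"

rm -rf "$demo"
```

With one record per line guaranteed, awk can rely on the first and sixth fields of every line belonging to a real file entry.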
Also note that since the full paths are passed to ls, we can't process directory structures of arbitrary depth, as it will stop working once path lengths exceed PATH_MAX.
The error reporting is rather crude. We only report a non-zero exit status if any of the computed sizes comes back empty. So a zero exit status is not a guarantee that all file sizes have been counted.
- Your answers are always very helpful and a great source to learn from. I am still trying to understand each line. "The output of ls -n is only specified in the C/POSIX locale." Where are you getting this information from? I had a look at the opengroup utilities/ls page – finefoot, May 29, 2023 at 23:34
- @finefoot, the date field is locale-dependent, as are the blank characters that may be used to separate fields (though in practice, that date field appears after the field we're interested in, so unless it contains newline characters, it's likely not going to be a problem, and I've not seen any ls implementation that uses blank characters other than space to separate fields). The C locale in any case removes complex processing in both printing and parsing and reduces the risk of bad surprises. – Stéphane Chazelas, May 30, 2023 at 5:06
- Ahh, great. That explains it, thank you. :) I've seen LC_ALL=C quite a few times in scripts and always wanted to read about the reason why it's used. And you opted for ls compared to wc -c (see below) due to much better performance of reading the size instead of counting the length, right? – finefoot, May 30, 2023 at 12:27
- @finefoot, See also What does "LC_ALL=C" do?. wc -c only works for non-directory files you have permission to read, and may have side effects for non-regular files. See also How can I get the size of a file in a bash script? – Stéphane Chazelas, May 30, 2023 at 12:53
Unfortunately, the output format of ls is apparently not standardized. So it might not be the best idea to parse its output.
An alternative POSIX-compliant way to find out the size in bytes of a single file is to use wc -c:
    -c  Write to the standard output the number of bytes in each input file.
Source: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/wc.html
    $ printf %s 0123456789ABCDEF >sixteenbytestestfile # example file of 16 bytes length
    $ wc -c sixteenbytestestfile
    16 sixteenbytestestfile

If we don't pass the file as an argument, but via standard input, the filename will be omitted from the output:

    $ wc -c <sixteenbytestestfile
    16

Apparently, some systems add some whitespace around the number output. We can remove it by using Arithmetic Expansion without any arithmetic operations:

    $ filesize=" 123 " # possible wc -c output
    $ printf %s\\n "-$filesize-"
    - 123 -
    $ printf %s\\n "-$((filesize))-"
    -123-

In conclusion, here is a definition of a simple function to get the size of a file:

    $ filesize() { printf %s\\n "$(($(wc -c <"$1")))"; }
    $ filesize sixteenbytestestfile
    16
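As noted in the comments on the other answer, wc -c only works for non-directory files you can read and may have side effects on non-regular files. A possible guard is sketched below; filesize_checked is a hypothetical name, and refusing everything except readable regular files is an assumption about what the caller wants, not part of the function above.

```shell
# Hypothetical variant of filesize that refuses anything but readable
# regular files. The body runs in a subshell so its variables don't leak.
filesize_checked() (
  file=$1
  # -f and -r follow symlinks, as the input redirection below does.
  if [ ! -f "$file" ] || [ ! -r "$file" ]; then
    printf '%s: not a readable regular file\n' "$file" >&2
    exit 1
  fi
  size=$(wc -c < "$file") || exit
  # Expanding $size inside $(( )) drops any padding wc added.
  printf '%s\n' "$(( $size ))"
)
```

Called on a directory or a missing file, it prints a message on standard error and returns a non-zero status instead of letting wc fail or block.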
ls -ld somefile | awk '{print $5}'
--apparent-size option as well (e.g. on fs with compression support disk usage could be less than apparent size).
du -b does?