4

I'm looking for a simpler way to have a cross-os-unix file size check. I could use wc -c but I'm concerned the performance may suck on large files (I'm assuming it just counts chars and doesn't do a stat under the covers?)

The below works for linux and macos (perhaps bsd). Is there a simpler well performing approach?

function filesize
{
    local file=$1
    size=`stat -c %s $file 2>/dev/null` # linux
    if [ $? -eq 0 ]
    then
        echo $size
        return 0
    fi

    eval $(stat -s $file) # macos
    if [ $? -eq 0 ]
    then
        echo $st_size
        return 0
    fi

    echo 0
    return -1
}
3
  • 1
    wc generally does a stat (and a lseek to leave the cursor at the end) for regular files. The only notable exception I'm aware of is busybox wc and old implementations on some systems (can't remember which now). Commented Aug 22, 2014 at 15:32
  • 1
    You are right in avoiding wc if you want a cross-OS solution. At least the SVR4 wc used on Solaris 10 reads the whole file. Commented Aug 22, 2014 at 15:52
  • 2
    function x { is the ksh syntax. The Bourne and POSIX syntax is x() {. Commented Aug 22, 2014 at 17:03

5 Answers

5

From the source of wc (coreutils/src/wc.c) in GNU coreutils (i.e. the version on non-embedded Linux and Cygwin):

 When counting only bytes, save some line- and word-counting overhead. If FD is a 'regular' Unix file, using lseek is enough to get its 'size' in bytes. 

So using wc -c to count the bytes will perform well.

You can easily test this optimisation on a large file (i.e. one that would take some time to read). On a file located on my server, wc -c on a 9.9Gb file took 0.015s of real time. I would rejoice if the whole file had been transferred in that time, but my gigabit ethernet is unfortunately not that fast: copying that file to /dev/null over the network takes 21s.
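A minimal sketch of how wc -c might be wrapped for this purpose, assuming a POSIX shell. The name filesize_wc is hypothetical, and the speed still depends on whether the local wc takes the lseek shortcut quoted above. Reading from standard input keeps the file name out of the output, and the unquoted expansion drops the padding that some wc implementations add.

```shell
# Hypothetical helper: print the size in bytes of a regular file using wc -c.
filesize_wc() {
    # Refuse anything that is not a regular file (or a symlink to one).
    [ -f "$1" ] || return 1
    # Redirect the file to stdin so wc prints only the count.
    size=$(wc -c < "$1") || return 1
    # Deliberately unquoted: word splitting strips leading blanks
    # that some wc implementations emit.
    printf '%s\n' $size
}
```

Calling filesize_wc on a directory or a missing path returns a non-zero status and prints nothing.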

1
  • 2
    Not all wc implementations will do that though. You may also want to make sure the file is a regular file first. Commented Aug 22, 2014 at 15:35
4

I'm ruling out stat and perl which are not POSIX so are more likely to be missing than ls and awk.

I'm ruling out wc too: while the GNU implementation of wc is optimized when the -c option is used, you should not rely on that optimization being present in a portable script. Moreover, some non-standard-compliant wc -c implementations might return the number of characters, which is not necessarily the same as the number of bytes, depending on the locale.

Here is then a solution only based on standard utilities that will report the size of the file provided as argument:

filesize() {
    [ -f "$1" ] && ls -dnL -- "$1" | awk '{print $5;exit}' || { echo 0; return 1; }
}

Beware though that the reported size might be either larger or smaller than the actual size the file content takes on disk, depending on the file system used, sparse files support, and options like compression or deduplication.
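The difference between reported size and space used on disk can be illustrated with a sparse file. This is only a sketch: the scratch path is hypothetical, the dd invocation (count=0 with seek) is a common idiom for setting a file's length without writing data, and the du figure depends entirely on the file system.

```shell
# Hypothetical scratch path; any writable location works.
f="${TMPDIR:-/tmp}/sparse-demo.$$"

# Set the file length to 1 MiB without writing any data.
# On file systems with sparse file support, no data blocks are allocated.
dd if=/dev/zero of="$f" bs=1 count=0 seek=1048576 2>/dev/null

# Apparent size in bytes (5th column of ls -n output).
apparent=$(LC_ALL=C ls -dn -- "$f" | awk '{print $5; exit}')

# Disk usage in 1024-byte units; the value is file-system dependent.
used_k=$(du -k -- "$f" | awk '{print $1; exit}')

printf 'apparent size: %s bytes\n' "$apparent"
printf 'disk usage:    %s KiB\n' "$used_k"

rm -f -- "$f"
```

On a file system with sparse file support, the disk usage line is typically far smaller than the apparent size.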

3
  • 3
    -g and -o are not POSIX (only required for those systems implementing the XSI option (Unix)). You don't want -l. You can use -n. You'll want a -- as well for file names that start with -. Note that [ -f "$1" ] is true for symlinks to regular files, you may want ls -L to be consistent. So maybe ls -nLd -- "$1" Commented Aug 22, 2014 at 17:01
  • Thanks @StéphaneChazelas , answer updated. I guess -d shouldn't be required as test -f rules out directories, doesn't it ? Commented Aug 22, 2014 at 18:24
  • Yes, -d strictly speaking would only be necessary in case the file was changed from a regular (or symlink to regular) file between the test -f and ls, but it doesn't harm otherwise, and it's good practice to always include it unless you want to list the content of directories. A good reflex is: to get file attributes, use ls -d .... Commented Aug 23, 2014 at 7:32
2

You should use this, I think. As I've just found out, it's a POSIX-specified standard utility.

du 

The POSIX-specified options include:

The du utility shall conform to XBD Utility Syntax Guidelines .

The following options shall be supported:

  • -a In addition to the default output, report the size of each file not of type directory in the file hierarchy rooted in the specified file. Regardless of the presence of the -a option, non-directories given as file operands shall always be listed.
  • -H If a symbolic link is specified on the command line, du shall count the size of the file or file hierarchy referenced by the link.
  • -k Write the files sizes in units of 1024 bytes, rather than the default 512-byte units.
  • -L If a symbolic link is specified on the command line or encountered during the traversal of a file hierarchy, du shall count the size of the file or file hierarchy referenced by the link.
  • -s Instead of the default output, report only the total sum for each of the specified files.
  • -x When evaluating file sizes, evaluate only those files that have the same device as the file specified by the file operand.

Specifying more than one of the mutually-exclusive options -H and -L shall not be considered an error. The last option specified shall determine the behavior of the utility.

The problem with that, though, is that it doesn't report file size; it reports disk usage. They are different concepts, and the differences are file-system dependent. If you wanted to get the file sizes for a set of files, you might use something like the following:

{ echo
  /usr/bin/ls -ndL .//*
} | sed '/\n/P;//D;N
\|//|s|\n|/&/|
$s|$|/|;s| .//|/\
/|;2!P;D'

It's a fairly simple-minded bit of sed that maintains a two-line addressable window on ls's output. It works by sliding through its input - always Printing then Deleting the oldest of the two lines in its pattern space then pulling in the Next input line to replace it. It's a one-line lookahead, basically.
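The sliding two-line window can be seen in isolation with a much smaller script. This is a sketch rather than the script from this answer: window_demo is a hypothetical name, and it uses sed's l command only to show what the pattern space holds on each cycle.

```shell
# Hypothetical helper: show the two-line window that N and D maintain.
window_demo() {
    # $!N  append the next input line, except on the last line
    # l    print the pattern space unambiguously (\n for newline, $ at the end)
    # D    delete the oldest line and restart without reading new input
    sed -n '$!N;l;D'
}

printf '1\n2\n3\n' | window_demo
```

Each printed line shows the oldest line followed by its one-line lookahead, and the final cycle holds only the last input line.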

It has some major deficiencies as written. For instance, for my own convenience I avoided handling ls's -> linkpath and use the -L option instead which makes ls report on the link target rather than the link itself. It also assumes only current directory globs. It depends on the / not occurring in the filenames - because it is the separator. This is actually fairly common for this kind of stuff - you cd into the directory and cd - back out.

All of that could be handled maybe in a few lines or more, but it's just a demo.

The key component here - and the reason for the lookahead - is this bit:

\|//|s|\n|/&/| 

When the newest line in pattern space contains the string .// append a / to the tail of the oldest line and inject a / at the head of the newest. I also then replace the .// with another \newline and two more line-delimiting slashes.

And so this:

drwxr-xr-x 1 1000 1000  6 Aug  4 14:40 .//dir*
drwxr-xr-x 1 1000 1000  0 Aug  4 14:40 .//dir1
drwxr-xr-x 1 1000 1000  6 Aug  8 17:34 .//dir2
drwxr-xr-x 1 1000 1000 22 Aug 10 18:12 .//dir3
drwxr-xr-x 1 1000 1000 16 Jul 11 21:59 .//new
-rw-r--r-- 1 1000 1000  8 Aug 20 11:32 .//newfile
-rw-r--r-- 1 1000 1000  0 Jul  6 11:24 .//new file
-rw-r--r-- 1 1000 1000  0 Jul  6 11:24 .//new file link

Becomes this:

/drwxr-xr-x 1 1000 1000  6 Aug  4 14:40/
/dir*/
/drwxr-xr-x 1 1000 1000  0 Aug  4 14:40/
/dir1/
/drwxr-xr-x 1 1000 1000  6 Aug  8 17:34/
/dir2/
/drwxr-xr-x 1 1000 1000 22 Aug 10 18:12/
/dir3/
/drwxr-xr-x 1 1000 1000 16 Jul 11 21:59/
/new/
/-rw-r--r-- 1 1000 1000  8 Aug 20 11:32/
/newfile/
/-rw-r--r-- 1 1000 1000  0 Jul  6 11:24/
/new file/
/-rw-r--r-- 1 1000 1000  0 Jul  6 11:24/
/new file link/

But what use is that, right? Well, it makes all the difference:

IFS=/; set -f; set $(set +f
{ echo
  /usr/bin/ls -ndL .//*
} | sed '/\n/P;//D;N
\|//|s|\n|/&/|
$s|$|/|;s| .//|/\
/|;2!P;D'
)
unset IFS
while [ -n "$2" ]
do  printf 'Type :\t <%.1s>\tSize :\t %.0s%.0s%.0s<%d>%.0s%.0s%.0s\nFile :\t %s\n' \
        $2 "<$4>"
    shift 4; done

OUTPUT

Type :   <d>     Size :   <6>
File :   <dir*>
Type :   <d>     Size :   <0>
File :   <dir1>
Type :   <d>     Size :   <6>
File :   <dir2>
Type :   <d>     Size :   <22>
File :   <dir3>
Type :   <d>     Size :   <16>
File :   <new>
Type :   <->     Size :   <8>
File :   <newfile>
Type :   <->     Size :   <0>
File :   <new file>
Type :   <->     Size :   <0>
File :   <new file link>
5
  • 1
    With most ls implementations, you can't combine -f with anything else. In the C locale, you can have good hope that the glob order will be the same as the ls order. Commented Aug 22, 2014 at 15:39
  • @StephaneChazelas - yup - thanks very much. I think I take the hint... This option shall turn off -l, -t, -s, and -r Commented Aug 22, 2014 at 15:47
  • No. I think you were hinting at LC_ALL=C - but, filenames beginning with a newline sort differently for the shell and ls. It breaks the thing. It's still worthwhile maybe by arg - it's still pretty cheap as execs go, probably, but I think I'm going to delete it. Commented Aug 22, 2014 at 16:03
  • 3
    du (like ls -s) reports the disk usage, not the file size. Commented Aug 22, 2014 at 17:09
  • @StéphaneChazelas - finally addressed that bit. Commented Aug 23, 2014 at 8:51
2

There's a time for everything, including parsing ls. There's no portable way to prevent ls from mangling file names, but when you're only interested in some of the metadata about one file, it's ok.

filesize () {
  LC_ALL=C ls -dn -- "$1" | awk 'NR==1 {print $5}'
}

This works on all systems compliant with POSIX with the XSI extension. It works with BusyBox, which understands -n but not -o. The option -n causes the user and group columns to contain numeric values (UID and GID) instead of names (which may contain spaces on some systems, making the column parsing unreliable). I don't think LC_ALL is necessary, as this part of the output isn't supposed to be affected by locales, but there might be an implementation out there that does something weird like print sizes according to LC_NUMERIC, and it can't hurt.
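As written, the pipeline's exit status is that of awk, so a missing file yields empty output and a zero status. A possible variant that reports failure is sketched below; filesize_checked is a hypothetical name, and it assumes the path is expected to be a regular file, so that the fifth column really is a byte count.

```shell
# Hypothetical variant: same ls -dn parsing, but with a usable exit status.
filesize_checked() {
    # Capture the 5th column; ls errors are silenced and leave size empty.
    size=$(LC_ALL=C ls -dn -- "$1" 2>/dev/null | awk 'NR==1 {print $5}')
    # Empty output means ls could not report on the path.
    [ -n "$size" ] || return 1
    printf '%s\n' "$size"
}
```

A caller can then write something like `size=$(filesize_checked "$f") || echo "cannot stat $f" >&2`.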

If you only want portability between systems where you can install zsh (which comes bundled with OSX), you can use zsh's zstat:

zmodload -F zsh/stat b:zstat
filesize () {
  zstat +size -- $1
}
1
  • 1
    zstat -L to get the size of the symlink like ls -dn or ls -Ldn to get the size of that target of the symlink like zstat alone. Commented Aug 23, 2014 at 13:38
2

A simpler and perhaps more portable option is perl:

filesize() {
    file="$1"
    if [ -e "$file" ]; then
        size="$(perl -e 'print -s shift' "$file")"
        printf '%s\n' "$size"
        return 0
    else
        printf "0\n"
        return -1
    fi
}
2
  • function filesize { is the ksh syntax. local is not portable (not in AT&T implementations of ksh for instance). You'll want to do error checking. Note that for symlinks, that report the size of the target. Commented Aug 22, 2014 at 17:08
  • @StéphaneChazelas: Updated. About symlink, what is difference between size of link and target? Commented Aug 22, 2014 at 17:15
