
We host a share of about 4 TB. What is an efficient way to find the largest file on it?

Usually we use:

du -ak | sort -k1 -bn | tail -1 

but it is slow to scan through a share of this size and then sort all the results.

Any suggestions for finding only the single largest file in the share?

Also, du -ak returns the size of the current directory too, like (". 123455"). How do I avoid that?

2 Answers


I don't know of any other way besides scanning the directory tree in question to collect the file sizes so that you can determine the largest file. If you know there's a size threshold, you can instruct find to dismiss files that are below it.

$ find . -type f -size +50M .... 

This would dismiss any files smaller than 50 MB. If you know these files are always in a specific location, you can target your find at that area instead of scanning the entire disk.
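
For instance, if most of the bulk lives under a known subtree (the path here is a hypothetical example):

$ find /export/share/media -type f -size +50M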

NOTE: This is a method I typically employ, since you typically shouldn't be getting random large files outside of /var-type directories.

As for du, you can tell it to output sizes in a human-readable format using the -h switch. The sort command knows how to sort these as well, again via its -h switch.
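
A minimal sketch of that combination (GNU du and sort), listing the five largest entries last:

$ du -ah . | sort -h | tail -n 5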

Example

$ find /home/saml/apps -type f -size +50M -print0 | \
    du -h --files0-from=- | sort -h | tail -1
1.4G    /home/saml/apps/MeVisLabSDK2.2.1_gcc-64.bin

The above find returns the list of files that are > 50MB using a null (\0) character as the separator. The du command takes this list and knows to split on nulls using the --files0-from=- switch. This output is then sorted by its human formatted sizes.

Without the tail -1:

$ find /home/saml/apps -type f -size +50M -print0 | \
    du -h --files0-from=- | sort -h
55M     /home/saml/apps/MeVisLabSDK/Packages/MeVis/ThirdParty/lib/libQtXmlPatternsMLAB.so.4.6.2.debug
55M     /home/saml/apps/MeVisLabSDK/Packages/MeVis/ThirdParty/Sources/Qt4/qt/lib/libQtXmlPatternsMLAB.so.4.6.2.debug
56M     /home/saml/apps/MeVisLabSDK/Packages/FMEwork/ThirdParty/lib/libitkvnl-4.0_d.a
66M     /home/saml/apps/MeVisLabSDK/Packages/FMEwork/Release/lib/libMLDcmtkAccessories_d.so
79M     /home/saml/apps/MeVisLabSDK/Packages/FMEwork/Release/lib/libMLDcmtkMLConverters_d.so
94M     /home/saml/apps/MeVisLabSDK/Packages/MeVis/ThirdParty/lib/libQtGuiMLAB.so.4.6.2.debug
94M     /home/saml/apps/MeVisLabSDK/Packages/MeVis/ThirdParty/Sources/Qt4/qt/lib/libQtGuiMLAB.so.4.6.2.debug
112M    /home/saml/apps/ParaView-3.14.1-Linux-64bit.tar.gz
204M    /home/saml/apps/Slicer-4.1.1-linux-amd64.tar.gz
283M    /home/saml/apps/MeVisLabSDK/Packages/FMEwork/Release/lib/libMLDcmtkIODWrappers_d.so
1.4G    /home/saml/apps/MeVisLabSDK2.2.1_gcc-64.bin
  • +1 for the hint to du -h | sort -h, or the SI-prefixed variant, which I prefer over the binary-prefixed one: du --si | sort -h. Commented May 18, 2014 at 19:42
  • @erik, you don't want to use -h here, as you lose precision. You could end up with 20 files all shown as 200M, not knowing which one is the largest, and thus return the wrong result. You want to convert to "human readable" after you've sorted your list. Commented May 19, 2014 at 9:33
  • @StephaneChazelas, well, OK. And how could I achieve that without writing a complex script? Is there a “convert-to-human-readable” command? Commented May 19, 2014 at 23:20
  • 1
    Ok, I’ve found it. But my distro (Fedora 17) is to old to have numfmt included, as I have a version lower than coreutils-8.21 which introduces this command line util. Commented May 19, 2014 at 23:29
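
For reference, a sketch of the "sort first, humanize after" approach described in these comments, assuming numfmt from coreutils 8.21 or later:

$ du -ak . | sort -k1n | tail -n 1 | numfmt --from-unit=1024 --to=iec --field=1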

You need to traverse the whole directory tree and check the size of each file in order to find the largest one.

In zsh, there's an easy way to sort files by size, thanks to the o glob qualifier:

print -rl -- **/*(D.oL) 

To see just the largest file:

echo **/*(D.oL[-1]) 

To see the 10 largest files:

print -rl -- **/*(D.oL[-10,-1]) 
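
If you also want the sizes displayed, you can hand the same glob to ls and let it re-sort by size (a sketch; the glob is zsh-specific):

ls -lhdS -- **/*(D.oL[-10,-1])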

You can also use ls -S to sort the files by size. For example, this shows the top 10 largest files. In bash, you need to run shopt -s globstar first to enable recursive globbing with **; in ksh93, run set -o globstar first, and in zsh this works out of the box. This only works if there aren't so many files that the combined length of their names goes over the command line limit.

ls -Sd **/* | head -n 10 
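
In bash, for example, the full sequence would be (a minimal sketch):

shopt -s globstar
ls -Sd -- **/* | head -n 10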

If there are a lot of files, collecting the information can take a very long time, so you should traverse the filesystem only once and save the output to a text file. Since you're interested in individual files, pass GNU du the -S option in addition to -a; this way, the display for a directory doesn't include the size of files in its subdirectories, only the files directly in that directory, which reduces the noise.

du -Sak >du
sort -k1n du | tail -n 2

If you only want the sizes of regular files, you can use GNU find's -printf action.

find -type f -printf '%s\t%P\n' | sort -k1n >file-sizes.txt
tail file-sizes.txt

Note that if you have file names that contain newlines, this will mess up automated processing. Most GNU utilities have a way to use null bytes (which cannot appear in file names) instead, e.g. du -0, sort -z, \0 instead of \n, etc.
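
As a sketch of the newline-safe variant of the pipeline above (this assumes GNU find and sort, and a GNU tail new enough to have -z, coreutils 8.25+):

find . -type f -printf '%s\t%p\0' | sort -zn | tail -zn 1 | tr '\0' '\n'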

  • Note that %s gives the file size (and oL sorts on file size), while du gives the disk usage which are separate things. -k1n is the same as -n. Commented May 19, 2014 at 9:38
  • To sort by file size in reverse order (from largest to smallest), use the uppercase O glob qualifier. zsh.sourceforge.net/Doc/Release/Expansion.html Commented Nov 10, 2015 at 16:05
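
For example, a sketch using that O qualifier to print the ten largest files, biggest first:

print -rl -- **/*(D.OL[1,10])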
