
I have some 5 million text files under a directory, all of the same format (nothing special, just plain text files with some integers on each line). I would like to compute the maximum and minimum line count amongst all these files, along with the two corresponding filenames (one for the max and one for the min).

I started out by trying to write out all the line counts like so (and then work out how to find the min and max from that list):

wc -l `find /some/data/dir/with/text/files/ -type f` > report.txt 

but this throws me an error:

bash: /usr/bin/wc: Argument list too long

Perhaps there is a better way to go about this?

2 Comments

  • Note that using $(find ...) or $(ls ...) in a command line is bad practice in general -- see BashPitfalls #1, and UsingFind for what to do instead. Commented Sep 28, 2020 at 22:40
  • ...btw, given that you have millions of files, I'd consider filtering on byte count before running the line counts; unless you're prone to wild outliers in line length, it'd be a lot more efficient to scan only the 100,000 largest and smallest files (measured in bytes, which is a constant-time check rather than one that scales with a file's length) for their line counts. A sketch of that idea follows these comments. Commented Sep 28, 2020 at 22:48
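
That byte-count pre-filter might look something like the following rough sketch (my own illustration, not from the comments; GNU find's -printf and GNU xargs's -d are assumed, the 100,000 cut-off is arbitrary, and filenames containing newlines would break it):

find /some/data/dir/with/text/files/ -type f -printf '%s %p\n' | sort -n > sizes.txt
head -n 100000 sizes.txt | cut -d' ' -f2- >  candidates.txt   # smallest files by byte count
tail -n 100000 sizes.txt | cut -d' ' -f2- >> candidates.txt   # largest files by byte count
xargs -d '\n' wc -l < candidates.txt > report.txt             # line counts for the candidates only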

1 Answer


There is a limit to the length of the argument list. Since several million file names are passed to wc at once, the command certainly exceeds that limit.

Better to invoke find with -exec COMMAND instead:

find /some/data/dir/with/text/files/ -type f -exec wc -l {} + > report.txt 

Here, each file found by find is appended to the argument list of the command following -exec, in place of {}. Just before the argument-length limit would be exceeded, the command is run, and the remaining files are handled the same way in further runs of the command, until the whole list has been processed.

See man page of find for more details.
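
With report.txt in hand, one way to finish the original task (a sketch of my own, not part of this answer) is to drop wc's per-batch "total" summary lines, sort numerically by line count, and print the first and last entries. It assumes no filename ends in " total" or contains a newline:

grep -v ' total$' report.txt | sort -n | sed -n '1p;$p'

The first line printed is the file with the fewest lines, the last one the file with the most.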


Thanks to Charles Duffy for the improvements to this answer.


4 Comments

I can't speak to the downvotes, but I would strongly suggest changing -exec ... {} \; to -exec ... {} +; as it is, this is wildly inefficient (starting a new copy of wc per file, instead of starting a new one only when the argument list would otherwise get too long).
BTW, while this takes us out of POSIX-compliance into GNUisms, one might also think of switching from -exec to -execdir (unless the files are organized in lots of small subdirectories); running wc in the individual directories means the directory names don't need to be passed on the command line, increasing the number of individual files that can be passed to each copy of wc. (A sketch of that variant follows these comments.)
@CharlesDuffy Is it a GNUism? find /some/data/dir/with/text/files/ -type f -print0 | xargs -0 wc -l >report.txt
@LéaGris, apparently not. (Adopted in PASC Interpretation 1003.2 #210 in 2001.)
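
For reference, the -execdir variant mentioned above might look like this (a sketch only; -execdir is not POSIX, as the comment notes, but is available in both GNU and BSD find):

find /some/data/dir/with/text/files/ -type f -execdir wc -l {} + > report.txt

Because wc is then run from inside each file's directory, the names written to report.txt come out as ./file rather than full paths; for this particular task, where the full path of the min and max files is wanted, the plain -exec form above preserves the full paths.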
