
I have some 5 million text files under a directory, all of the same format (nothing special, just plain text files with some integers on each line). I would like to compute the maximum and minimum line count amongst all these files, along with the two corresponding filenames (one for the max and one for the min).

I started out by trying to write out all the line counts like so (and then work out how to find the min and max from that list):

wc -l `find /some/data/dir/with/text/files/ -type f` > report.txt 

but this throws me an error:

bash: /usr/bin/wc: Argument list too long

Perhaps there is a better way to go about this?

2 Comments

  • Note that using $(find ...) or $(ls ...) in a command line is bad practice in general -- see BashPitfalls #1, and UsingFind for what to do instead. Commented Sep 28, 2020 at 22:40
  • ...btw, given that you have millions of files, I'd consider filtering on byte count before running the line counts; unless you're prone to wild outliers in line length, it'd be a lot more efficient to scan only the 100,000 largest and smallest files (measured in bytes, which is a constant-time check rather than one that scales with a file's length) for their line counts. A sketch of that idea follows these comments. Commented Sep 28, 2020 at 22:48
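
That byte-count pre-filter might look something like the following rough sketch (my own illustration, not from the comments; GNU find's -printf and GNU xargs's -d are assumed, the 100,000 cut-off is arbitrary, and filenames containing newlines would break it):

find /some/data/dir/with/text/files/ -type f -printf '%s %p\n' | sort -n > sizes.txt
head -n 100000 sizes.txt | cut -d' ' -f2- >  candidates.txt   # smallest files by byte count
tail -n 100000 sizes.txt | cut -d' ' -f2- >> candidates.txt   # largest files by byte count
xargs -d '\n' wc -l < candidates.txt > report.txt             # line counts for the candidates only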

1 Answer


There is a limit to the length of the argument list. Since several million file names are passed to wc at once, the command certainly exceeds that limit.

Better to invoke find with -exec COMMAND instead:

find /some/data/dir/with/text/files/ -type f -exec wc -l {} + > report.txt 

Here, each file found by find is appended to the argument list of the command following -exec, in place of {}. Just before the argument-length limit would be exceeded, the command is run, and the remaining files are handled the same way in further runs of the command, until the whole list has been processed.

See man page of find for more details.
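
With report.txt in hand, one way to finish the original task (a sketch of my own, not part of this answer) is to drop wc's per-batch "total" summary lines, sort numerically by line count, and print the first and last entries. It assumes no filename ends in " total" or contains a newline:

grep -v ' total$' report.txt | sort -n | sed -n '1p;$p'

The first line printed is the file with the fewest lines, the last one the file with the most.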


Thanks to Charles Duffy for the improvements to this answer.


4 Comments

I can't speak to the downvotes, but I would strongly suggest changing -exec ... {} \; to -exec ... {} +; as it is, this is wildly inefficient (starting a new copy of wc per file, instead of starting a new one only when the argument list would otherwise get too long).
BTW, while this takes us out of POSIX-compliance into GNUisms, one might also think of switching from -exec to -execdir (unless the files are organized in lots of small subdirectories); running wc in the individual directories means the directory names don't need to be passed on the command line, increasing the number of individual files that can be passed to each copy of wc. (A sketch of that variant follows these comments.)
@CharlesDuffy Is it a GNUism? find /some/data/dir/with/text/files/ -type f -print0 | xargs -0 wc -l >report.txt
@LéaGris, apparently not. (Adopted in PASC Interpretation 1003.2 #210 in 2001.)
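
For reference, the -execdir variant mentioned above might look like this (a sketch only; -execdir is not POSIX, as the comment notes, but is available in both GNU and BSD find):

find /some/data/dir/with/text/files/ -type f -execdir wc -l {} + > report.txt

Because wc is then run from inside each file's directory, the names written to report.txt come out as ./file rather than full paths; for this particular task, where the full path of the min and max files is wanted, the plain -exec form above preserves the full paths.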
