I have a folder containing approximately 320,116 .pdb.gz files, about 2 GB in total, and I want to uncompress them all. If I run `gunzip *.gz`, it fails with "argument list too long". What is an appropriate way to do this?
- If you will have to work on this directory structure over the long term, split this directory into several, for example based on file modification times or file names. – athena, Mar 19, 2015 at 7:35
- Yes, I do have to work with them long term. They have been extracted; now I want to subdivide and classify them into three folders on the basis of their names. Is there a shell script to do so? – Lily Sharpton, Mar 19, 2015 at 10:21
- I suggest you search for similar questions here. If you don't find one that suits your needs, ask your own new question. – athena, Mar 19, 2015 at 12:11
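For the follow-up raised in the comments (sorting the extracted files into three folders by name), here is a minimal sketch. The thread never states the actual classification rule, so the leading character of the file name is used below purely as a placeholder criterion:

```shell
# Hypothetical sketch: sort .pdb files into three folders by the first
# character of their names (placeholder rule -- substitute your own).
# Run inside the directory containing the extracted files.
mkdir -p groupA groupB groupC
for f in *.pdb; do
    [ -e "$f" ] || continue        # skip if the glob matched nothing
    case $f in
        1*) mv "$f" groupA/ ;;     # names starting with "1"
        2*) mv "$f" groupB/ ;;     # names starting with "2"
        *)  mv "$f" groupC/ ;;     # everything else
    esac
done
```

The shell iterates over the glob internally, so this avoids the "argument list too long" problem entirely.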
6 Answers
find . -name '*.pdb.gz' -exec gunzip {} +

will provide gunzip with many, but not too many, file names on its command line. This is more efficient than `-exec gunzip {} \;`, which starts a new gunzip process for each and every file.
- Note that the "+" is a GNUism and so will not work on non-GNU systems like *BSD. – Martin Schröder, Mar 20, 2015 at 8:03
- Later versions of BSD `find` allow the "+" notation. See, for example, the `find` man page for BSD 10.1. Also applies to OS X (10.9 and later at least, maybe earlier). – plasma, Mar 25, 2015 at 19:20
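The difference between `+` and `\;` can be seen directly with a few dummy files (not the OP's data); a counting `sh` stands in for gunzip so the number of invocations is visible:

```shell
# Create three dummy .pdb.gz files in a scratch directory.
dir=$(mktemp -d)
cd "$dir"
for i in 1 2 3; do : | gzip > "file$i.pdb.gz"; done

# "+" batches arguments: one invocation covering all three files.
find . -name '*.pdb.gz' -exec sh -c 'echo "invocation with $# file(s)"' sh {} +

# "\;" runs the command separately for each file: three invocations.
find . -name '*.pdb.gz' -exec sh -c 'echo "invocation with $# file(s)"' sh {} \;
```

The first `find` prints one line ("invocation with 3 file(s)"); the second prints three lines, one per file.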
Whenever you get "argument list too long" errors, you can work around them by invoking the desired command multiple times, each time with a subset of the arguments you want to use. xargs is a tool that helps you do that automatically.
find . -type f -a -name \*.pdb.gz -print0 | xargs -0 gunzip

- Doesn't this have the same inefficiency as `-execdir gunzip "{}" \;`, i.e. that xargs will invoke gunzip separately for each file? That is my reading of the man page. – gogoud, Mar 19, 2015 at 7:45
- No, `xargs` will stuff as many filenames as will fit on the `gunzip` command line. Try it! `echo a b c d e f | xargs echo` only invokes `echo` once with all 6 arguments, so you see one line of output (a pretty useless command to execute, though!), while if you force `xargs` to supply at most 3 arguments per invocation with `echo a b c d e f | xargs -n 3 echo`, then you get 2 lines of output. – Celada, Mar 19, 2015 at 7:49
- Another advantage of using `xargs` is that, with the `-P` option, you can run multiple `gunzip` processes in parallel, which (depending on the exact parameters of your system) may go faster. – psmears, Mar 19, 2015 at 14:44
- Thanks for the pointer to `-P`, @psmears. Now I learned something too! – Celada, Mar 20, 2015 at 1:00
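Celada's batching demonstration is runnable as-is, and it is worth trying to convince yourself of the behavior:

```shell
# One echo invocation receives all six arguments -> one line of output.
echo a b c d e f | xargs echo

# Capped at 3 arguments per invocation (-n 3) -> two invocations, two lines.
echo a b c d e f | xargs -n 3 echo

# For the original problem, -P adds parallelism (GNU xargs; the worker
# count of 4 below is an arbitrary assumption, not from the thread):
# find . -type f -name '*.pdb.gz' -print0 | xargs -0 -P 4 gunzip
```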
I think this should work; it passes the path/name of each file individually to gunzip for processing:
find /my/dir -name "*.pdb.gz" -execdir gunzip "{}" \;

- That will execute gunzip once per file. See John1024's answer for a slightly different way that avoids that inefficiency. – Celada, Mar 19, 2015 at 7:29
- @Celada This was deliberate; my concern was that using + might again lead to an error message because of overloading gunzip. If John1024's method works, it is technically more efficient, but mine should work if his doesn't. – gogoud, Mar 19, 2015 at 7:34
- `find` with `+` and `xargs` are expressly designed with exactly that issue in mind. They will always supply as many arguments as they can while not exceeding the operating system's limit. Because, by the way, it is an operating system limit; it has nothing to do with `gunzip`. – Celada, Mar 19, 2015 at 7:37
- @Celada OK, thanks for that info. So presumably with '+' gunzip might get invoked more than once, but fewer than 320,000 times? – gogoud, Mar 19, 2015 at 7:38
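The limit Celada refers to is a kernel constraint on the total size of the argument list (and environment) passed to execve(), not anything specific to gunzip. You can inspect it on your own system:

```shell
# Query the kernel's argument-list size limit (in bytes).
# The exact value varies by OS and configuration.
getconf ARG_MAX
```

With ~320,000 file names averaging a few dozen bytes each, the expanded `*.gz` glob easily exceeds this limit, which is exactly why `gunzip *.gz` failed.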
Try this way:
find . -name '*.gz' -exec gunzip {} \;

- That will execute `gunzip` once per file. See John1024's answer for a slightly different way that avoids that inefficiency. – Celada, Mar 19, 2015 at 7:27
- Make sure to escape the * in *.gz... – Stack Exchange Broke The Law, Mar 19, 2015 at 8:15
If you have a multi-core machine, you will probably see that a single gunzip is not going to max out the capabilities of your machine. For that you would need to run multiple gunzips in parallel. Keeping track by hand of which are done in which terminal is cumbersome, but you can do that easily with GNU parallel:
find . -name "*.gz" | parallel -X gunzip {}

- Won't that fail because the argument list to `parallel` is too long? – Stack Exchange Broke The Law, Mar 19, 2015 at 8:14
- @immibis Yes, I forgot the original problem, I'll update my post. – Anthon, Mar 19, 2015 at 8:16
- Won't that still fail because the argument list to `find` is too long? – Stack Exchange Broke The Law, Mar 19, 2015 at 8:27
- Yes, but you're passing all the filenames on `find`'s command line. – Stack Exchange Broke The Law, Mar 19, 2015 at 8:34
- Seems this is not a good day to answer questions; I forgot to quote the argument to `-name`. – Anthon, Mar 19, 2015 at 8:49
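The quoting issue Anthon mentions is easy to demonstrate with a couple of throwaway files: if the argument to `-name` is unquoted and matching files exist in the current directory, the shell expands the glob before `find` ever sees it, and the extra names make `find` error out:

```shell
# Scratch directory with two matching files.
dir=$(mktemp -d)
cd "$dir"
touch a.gz b.gz

# Quoted: the pattern reaches find intact and matches both files.
find . -name '*.gz' | wc -l

# Unquoted: the shell expands *.gz to "a.gz b.gz", so find is invoked as
# `find . -name a.gz b.gz` and rejects the stray path argument.
find . -name *.gz 2>/dev/null || echo 'unquoted glob: find errors out'
```

The quoted form prints 2; the unquoted form fails with a nonzero exit status.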
It is unnecessary to use find for this, since you did not mention subfolders. What you need to do is:
for f in *.gz; do gunzip "$f"; done

- You do need `find` if you don't want to spawn 320,116 `gunzip` processes, just like this loop does. – John WH Smith, Mar 19, 2015 at 17:29
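To make John WH Smith's point concrete: the loop sidesteps the execve() limit because the shell expands the glob internally and never passes all the names to a single command, but it still forks one gunzip per file. A small demonstration with dummy files:

```shell
# Scratch directory with three small .gz files.
dir=$(mktemp -d)
cd "$dir"
for i in 1 2 3; do echo "data$i" | gzip > "f$i.gz"; done

# The loop works regardless of how many files there are, at the cost of
# one gunzip process per iteration.
for f in *.gz; do gunzip "$f"; done

ls    # the decompressed f1 f2 f3 remain; the .gz files are gone
```

For 320,116 files, the per-process startup cost is why the `-exec ... +` and `xargs` answers are noticeably faster.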