I have a folder containing approximately 320116 .pdb.gz files, and I want to uncompress them all. If I use gunzip *.gz, it fails with the error "argument list too long". The folder is about 2 GB. Please give me an appropriate suggestion.

  • If you will have to work with this directory structure in the long term, split this directory into many, for example based on file modification time or on file name. Commented Mar 19, 2015 at 7:35
  • Yes, I do have to work with it long term. They have been extracted; now I want to subdivide and classify them into three folders on the basis of their names. Is there a shell script to do so? Commented Mar 19, 2015 at 10:21
  • I suggest you search for similar questions here. If you don't find one to suit your needs, ask your own new question. Commented Mar 19, 2015 at 12:11
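For the follow-up about classifying the extracted files, here is a hypothetical sketch; the three-way split by first letter and the folder names groupA/groupB/groupC are assumptions, so substitute your real grouping rule:

```shell
# Assumed grouping rule: move each .pdb file into one of three folders
# based on the first letter of its name (adjust the patterns as needed).
mkdir -p groupA groupB groupC
for f in *.pdb; do
  case $f in
    [a-h]*) mv -- "$f" groupA/ ;;
    [i-p]*) mv -- "$f" groupB/ ;;
    *)      mv -- "$f" groupC/ ;;
  esac
done
```

Because the glob is expanded inside the shell rather than passed to an external command, this loop does not hit the "argument list too long" limit.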

6 Answers

find . -name '*.pdb.gz' -exec gunzip {} + 

-exec gunzip {} + will provide gunzip with many but not too many file names on its command line. This is more efficient than -exec gunzip {} \; which starts a new gunzip process for each and every file.

  • One find, fewer gunzip! Commented Mar 19, 2015 at 7:32
  • Note that the "+" is a GNUism and so will not work on non-GNU systems like *BSD. Commented Mar 20, 2015 at 8:03
  • Later versions of BSD find allow the "+" notation. See, for example, the find man page for BSD 10.1. Also applies to OS X (10.9 and later at least, maybe earlier). Commented Mar 25, 2015 at 19:20
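The difference between \; and + is easy to see by substituting echo for gunzip and counting invocations; a quick sketch, with the scratch directory under /tmp purely for illustration:

```shell
# Create a handful of dummy files, then count command invocations.
mkdir -p /tmp/exec-demo && cd /tmp/exec-demo
touch a.gz b.gz c.gz d.gz e.gz

find . -name '*.gz' -exec echo batch {} \; | wc -l          # 5: one echo per file
find . -name '*.gz' -exec echo batch {} + | grep -c '^batch' # 1: one echo for all files
```

With 320116 files, that difference means one (or a few) gunzip processes instead of hundreds of thousands.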

Whenever you get "argument list too long" errors you can work around it by invoking the desired command multiple times, each time with a subset of the arguments you want to use. xargs is a tool that helps you do that automatically.

find . -type f -a -name \*.pdb.gz -print0 | xargs -0 gunzip 
  • Doesn't this have the same inefficiency as -execdir gunzip "{}" \; i.e. that xargs will invoke gunzip separately for each file? That is my reading of the man page. Commented Mar 19, 2015 at 7:45
  • No, xargs will stuff as many filenames as will fit on the gunzip command line. Try it! echo a b c d e f | xargs echo only invokes echo once with all 6 arguments, so you see one line of output (a pretty useless command to execute, though!), while if you force xargs to supply at most 3 arguments per invocation with echo a b c d e f | xargs -n 3 echo, you get 2 lines of output. Commented Mar 19, 2015 at 7:49
  • Another advantage of using xargs is that, with the -P option, you can run multiple gunzip processes in parallel, which (depending on the exact parameters of your system) may go faster. Commented Mar 19, 2015 at 14:44
  • thanks for the pointer to -P, @psmears. Now I learned something too! Commented Mar 20, 2015 at 1:00
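Building on the -P suggestion above, a hedged sketch of the parallel variant; the process count of 4 is an assumption, so tune it to your core count:

```shell
# Batch filenames with xargs and run up to 4 gunzip processes at once.
# -print0 / -0 keep filenames containing spaces or newlines intact.
find . -type f -name '*.pdb.gz' -print0 | xargs -0 -P 4 gunzip
```

Each gunzip still receives a large batch of filenames, so this keeps the low process count of the plain xargs approach while using more than one core.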

I think this should work; it passes the path/name of each file individually to gunzip for processing:

find /my/dir -name "*.pdb.gz" -execdir gunzip "{}" \; 
  • That will execute gunzip once per file. See John1024's answer for a slightly different way that avoids that inefficiency. Commented Mar 19, 2015 at 7:29
  • @Celada This was deliberate; my concern was that using + might again lead to an error message because of overloading gunzip. If John1024's method works, it is technically more efficient, but mine should work if his doesn't. Commented Mar 19, 2015 at 7:34
  • find with + and xargs are expressly designed with exactly that issue in mind. They will always supply as many arguments as they can while not exceeding the operating system's limit. Because, by the way, it is an operating system limit; it has nothing to do with gunzip. Commented Mar 19, 2015 at 7:37
  • @Celada OK, thanks for that info. So presumably with '+' gunzip might get invoked more than once, but fewer than 320,000 times? Commented Mar 19, 2015 at 7:38
  • Correct. Commented Mar 19, 2015 at 7:55
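As noted above, the limit is imposed by the operating system, not by gunzip; on Linux you can inspect it with getconf:

```shell
# ARG_MAX is the kernel's ceiling on the combined size of the argument
# list and environment passed to execve(); find's `+` and xargs build
# each batch to stay under this limit.
getconf ARG_MAX
```

Whatever value this prints on your system is the budget that 320116 filenames collectively exceed when the shell expands *.gz.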

Try it this way:

find . -name '*.gz' -exec gunzip {} \; 
  • That will execute gunzip once per file. See John1024's answer for a slightly different way that avoids that inefficiency. Commented Mar 19, 2015 at 7:27
  • Make sure to escape the * in *.gz... Commented Mar 19, 2015 at 8:15

If you have a multi-core machine, you will probably find that a single gunzip does not max out the capabilities of your machine. For that you would need to run multiple gunzips in parallel. Keeping track by hand of which are done in which terminal is cumbersome, but you can do it easily with GNU parallel:

find . -name "*.gz" | parallel -X gunzip {} 
  • Won't that fail because the argument list to parallel is too long? Commented Mar 19, 2015 at 8:14
  • @immibis Yes, I forgot the original problem, I'll update my post Commented Mar 19, 2015 at 8:16
  • Won't that still fail because the argument list to find is too long? Commented Mar 19, 2015 at 8:27
  • Yes, but you're passing all the filenames on find's command line. Commented Mar 19, 2015 at 8:34
  • Seems this is not a good day to answer questions, I forgot to quote the argument to -name Commented Mar 19, 2015 at 8:49

It is unnecessary to use find for this, since you did not mention subfolders. What you need to do is:

for f in *.gz; do gunzip "$f"; done 
  • You do need find if you don't want to spawn 320116 gunzip processes, just like this loop does. Commented Mar 19, 2015 at 17:29
