
I am trying to understand how I could use the xargs -n option to loop over a CSV file and run a curl operation for each entry, collecting output to a file, to achieve a faster processing time.

Example:

I need to check webpage health based on a CSV file with URIs (thousands of them).

URI.csv:

signup
account
edit
close

I am trying to check their status in parallel, using:

cat URI.csv | xargs -n1 -I {} /bin/bash -c 'curl -I http://localhost/{} &> /dev/null && echo "{},Online">>healthcheck.log || echo "{},Offline">>healthcheck.log '

Would I be able to speed up processing by making it -n2? I am aware that I could use something like -P4 to achieve parallelism; however, I am not able to understand how -n could be used for my use case.
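For intuition on what -n does on its own: it only controls how many arguments each spawned process receives per invocation; it adds no parallelism by itself (that is -P's job). A quick demo, assuming GNU xargs:

```shell
# With -n2, each invocation of echo receives at most two arguments,
# so four input lines become two echo invocations.
printf '%s\n' signup account edit close | xargs -n2 echo
# → signup account
# → edit close
```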

  • -n alone doesn't do anything wrt. parallelism; you can combine it with -P towards that end. Commented Apr 3, 2018 at 15:34
  • BTW, using -I {} and then {} inside a bash -c argument can easily lead to a shell injection vulnerability. Don't ever do that. (Think about what happens if you have a URI that includes $(rm -rf ~)) Commented Apr 3, 2018 at 15:34
  • Also, foo && bar || baz acts like a ternary sometimes, but it's not a ternary operator -- the corner cases can kill you. Think about what happens if foo succeeds but then bar fails -- you can end up running baz in addition to bar. Commented Apr 3, 2018 at 15:36
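To illustrate the injection-safe alternative the second comment alludes to (a sketch, not anyone's exact code): pass the substituted value as a positional parameter rather than splicing {} into the script text, so it is never parsed as shell code.

```shell
# Safe: the URI arrives in "$1" as data; a malicious value like
# '$(rm -rf ~)' would be printed literally, never executed.
printf '%s\n' signup | xargs -I {} /bin/bash -c 'printf "checking %s\n" "$1"' _ {}
# → checking signup
```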

2 Answers


Consider the following code:

xargs -d $'\n' -P2 -n2 \
  /bin/bash -c '
    for arg do
      if curl --fail -I "http://localhost/$arg" &>/dev/null; then
        printf "%s,Online\n" "$arg"
      else
        printf "%s,Offline\n" "$arg"
      fi
    done
  ' _ >>healthcheck.log <URI.csv
  • xargs -d $'\n' tells GNU xargs to operate line-by-line, rather than splitting your input file into words, trying to honor quotes, and otherwise using much more complicated parsing than you presumably actually want.
  • xargs -P 2 specifies that you run two processes at a time. Tune this as you wish.
  • xargs -n 2 specifies that each process is given two URLs to run. Tune this as you wish.
  • bash -c '...' _ arg1 arg2 runs the script ... with _ in $0, arg1 in $1, arg2 in $2, etc. Thus, arguments appended by xargs become the positional arguments to the script, which for arg do iterates over.
  • foo && bar || baz acts a little like if foo; then bar; else baz; fi, but it's not identical. See BashPitfalls #22.
  • Note that we're only opening healthcheck.log for write once, for the entire compound command, rather than re-opening the file every time we want to write a single line to it.
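The bash -c '...' _ arg1 arg2 argument mapping described above can be seen in isolation:

```shell
# _ lands in $0; the remaining words become $1, $2, ...,
# which the argument-less `for arg do` loop iterates over.
/bin/bash -c 'for arg do printf "got %s\n" "$arg"; done' _ signup account
# → got signup
# → got account
```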

11 Comments

Thanks for the very detailed response, including teaching me to write a better bash script. It is really useful and appreciated. When I executed the code, the output was written without a newline separator -- do I need to do anything to get results printed on new lines? Thanks!
Oops -- I made the mistake of using single quotes inside single quotes (used to hold the whole script). See the edit, using single-quotes for the format string (usually a bad practice, since format strings should be constant, but safe enough with this one).
Thanks for being patient with a novice shell script developer :).
Hopefully a last question -- is it possible to "pass" a variable to xargs? For example, localhost -- I would like to make it $host, which gets initialized at the beginning of the script.
host=localhost xargs ... will export an environment variable host for the duration of the command's execution. That's not specific to xargs -- works for any command. (Caveat: Note that foo=bar echo $foo doesn't do what you'd expect -- that's because the echo doesn't look at its environment for foo, but instead expects the shell calling it to perform expansions from local variables before the invocation). Or you can export host after initializing it earlier in your script; that'll make it available also.
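The var=value command form described in that comment can be checked directly (using bash here in place of xargs, since the mechanism is the same for any command):

```shell
# host is exported only for this single invocation; the child
# process reads it from its environment via the usual expansion.
host=localhost /bin/bash -c 'printf "%s\n" "$host"'
# → localhost
```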
0

Using GNU Parallel it looks like this:

cat URI.csv | parallel -j100 'curl -I http://localhost/{} &> /dev/null && echo {}",Online" || echo {}",Offline"' >>healthcheck.log 

Or easier to read:

#!/bin/bash
doit() {
  if curl -I http://localhost/"$1" &> /dev/null ; then
    echo "$1,Online"
  else
    echo "$1,Offline"
  fi
}
export -f doit
cat URI.csv | parallel -j100 doit >>healthcheck.log

Adjust -j100 to the number of jobs you want run in parallel.

By using GNU Parallel the jobs will be run in parallel, but the output to healthcheck.log will be serialized, and you will never see a race condition where two jobs write to the log simultaneously and mess up the logfile. In this example, account and edit wrote at the same time:

signup,Online
accouedit,Online
nt,Offline
close,Online

This will never happen to the output from GNU Parallel.

2 Comments

If you can generate the race condition discussed in the immediate case where we have a loop with a single echo or printf (of content less than 4KB in length), with default line-buffering of stdout, I'll be quite surprised.
@CharlesDuffy Why limit yourself to content less than 4KB in length? If that is a limitation of your solution, and you are aware of this, then would it not be right to warn the reader about that, too? (And yes: You can get mixing below 4 KB: See gist.github.com/ole-tange/88ae153797748b3618e2433377e2870a)
