
I was wondering how, if possible, I can create a simple job management in BASH to process several commands in parallel. That is, I have a big list of commands to run, and I'd like to have two of them running at any given time.

I know quite a bit about bash, so here are the requirements that make it tricky:

  • The commands have variable running time so I can't just spawn 2, wait, and then continue with the next two. As soon as one command is done a next command must be run.
  • The controlling process needs to know the exit code of each command so that it can keep a total of how many failed.

I'm thinking somehow I can use trap but I don't see an easy way to get the exit value of a child inside the handler.

So, any ideas on how this can be done?
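For context, the behaviour described above maps almost directly onto `wait -n`, which was added in bash 4.3 (well after this question was asked). This is a minimal sketch, assuming bash ≥ 4.3 and a made-up command list:

```shell
#!/usr/bin/env bash
# Sketch: keep at most $max_jobs children running; `wait -n` blocks until
# any one child exits and returns that child's exit code, so no trap needed.
max_jobs=2
failed=0
running=0
commands=('exit 0' 'exit 1' 'sleep 1; exit 0' 'exit 1')

for cmd in "${commands[@]}"; do
    bash -c "$cmd" &
    running=$((running + 1))
    if (( running >= max_jobs )); then
        wait -n || failed=$((failed + 1))   # reap exactly one child
        running=$((running - 1))
    fi
done

# drain the remaining children
while (( running > 0 )); do
    wait -n || failed=$((failed + 1))
    running=$((running - 1))
done
echo "failed: $failed"
```

Regardless of which child finishes first, every child is reaped exactly once, so the failure count is deterministic.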


Well, here is some proof-of-concept code that should probably work, but it breaks bash: it generates invalid command lines, hangs, and sometimes dumps core.

# need monitor mode for trap CHLD to work
set -m

# store the PIDs of the children being watched
declare -a child_pids

function child_done {
    echo "Child $1 result = $2"
}

function check_pid {
    # check if running
    kill -s 0 $1
    if [ $? == 0 ]; then
        child_pids=("${child_pids[@]}" "$1")
    else
        wait $1
        ret=$?
        child_done $1 $ret
    fi
}

# check by copying pids, clearing list and then checking each, check_pid
# will add back to the list if it is still running
function check_done {
    to_check=("${child_pids[@]}")
    child_pids=()
    for ((i=0;$i<${#to_check};i++)); do
        check_pid ${to_check[$i]}
    done
}

function run_command {
    "$@" &
    pid=$!
    # check this pid now (this will add to the child_pids list if still running)
    check_pid $pid
}

# run check on all pids anytime some child exits
trap 'check_done' CHLD

# test
for ((tl=0;tl<10;tl++)); do
    run_command bash -c "echo FAIL; sleep 1; exit 1;"
    run_command bash -c "echo OKAY;"
done

# wait for all children to be done
wait

Note that this isn't what I ultimately want, but would be groundwork to getting what I want.


Followup: I've implemented a system to do this in Python, so anybody using Python for scripting can have the above functionality. Refer to shelljob.

  • You can use the shell's builtin 'wait' command to reap each child and get its exit status, but you need to wait for a specific pid, otherwise it will not return until all children have exited. You don't want to wait in the signal handler though. This is tricky in bash, much easier to do it in C honestly. Commented Jun 17, 2011 at 10:11
  • Well, if I could get the PID in the signal handler I think I'd be fine, but I don't see anyway to get the PID. I know it can be easily done in other languages, but I'm trying to make an extension to a bash script. Commented Jun 17, 2011 at 10:13

7 Answers


GNU Parallel is awesomesauce:

$ parallel -j2 < commands.txt
$ echo $?

It will set the exit status to the number of commands that failed. If you have more than 253 commands, check out --joblog. If you don't know all the commands up front, check out --bg.


2 Comments

Thanks very much for the reference. This command seems great. I'll see if I can adapt my script.
FWIW, something like xargs -P2 -n1 -d '\n' sh -c < commands.txt can be used as a poor man's parallel substitute
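The xargs variant from the comment above can be fleshed out like this (a sketch assuming GNU xargs, whose `-d` option and 123 exit convention are used; the commands file is made up):

```shell
# Hypothetical commands file: one shell command per line.
cmds=$(mktemp)
printf '%s\n' 'echo one' 'echo two' 'false' > "$cmds"

# Run at most 2 commands at a time; with -n1 each line becomes `sh -c <line>`.
# GNU xargs exits with 123 if any invocation exited with status 1-125.
xargs -P2 -n1 -d '\n' sh -c < "$cmds"
status=$?
rm -f "$cmds"
echo "xargs exit status: $status"
```

Unlike GNU Parallel, this only tells you *whether* something failed, not how many commands failed.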

Can I persuade you to use make? This has the advantage that you can tell it how many commands to run in parallel (modify the -j number).

echo -e ".PHONY: c1 c2 c3 c4\nall: c1 c2 c3 c4\nc1:\n\tsleep 2; echo c1\nc2:\n\tsleep 2; echo c2\nc3:\n\tsleep 2; echo c3\nc4:\n\tsleep 2; echo c4" | make -f - -j2 

Stick it in a Makefile and it will be much more readable

.PHONY: c1 c2 c3 c4
all: c1 c2 c3 c4
c1:
	sleep 2; echo c1
c2:
	sleep 2; echo c2
c3:
	sleep 2; echo c3
c4:
	sleep 2; echo c4

Beware, those are not spaces at the beginning of the lines, they're a TAB, so a cut and paste won't work here.

Put an "@" infront of each command if you don't the command echoed. e.g.:

 @sleep 2; echo c1 

This would stop on the first command that failed. If you need a count of the failures you'd need to engineer that in the makefile somehow. Perhaps something like

command || echo F >> failed 

Then check the length of failed.
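A sketch of that failure-counting idea, with hypothetical target names; the makefile is generated with printf so the required TABs survive copy-paste:

```shell
# Build a throwaway makefile; \t emits the TAB each recipe line needs.
mk=$(mktemp)
fails=$(mktemp)
printf 'all: c1 c2 c3\nc1:\n\t@true  || echo F >> %s\nc2:\n\t@false || echo F >> %s\nc3:\n\t@false || echo F >> %s\n' \
    "$fails" "$fails" "$fails" > "$mk"

# `|| echo F` makes every recipe succeed, so make -j2 never stops early.
make -f "$mk" -j2 all
nb_failed=$(wc -l < "$fails")
echo "failed: $nb_failed"
rm -f "$mk" "$fails"
```

Because each recipe ends in `echo F` on failure, make sees every recipe as succeeding and runs all targets; the failure count lives entirely in the side file.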

3 Comments

No, this won't do what I want. All the command lines are generated and I need to keep a total count of failed and okay. Plus I don't want to stop running if one of the children fails.
The "command || echo F >> failed" will make them continue when they fail. What do you mean by the commands are generated? How does that fit with this?
I suppose I could generate the make file from the bash script. I wouldn't have much control over the output. Plus I still don't have an easy way to count the results (total number and failed). I'm not saying it won't work, it's just not an easy solution.

The problem you have is that you cannot wait for one of multiple background processes to complete. If you observe job status (using jobs) then finished background jobs are removed from the job list. You need another mechanism to determine whether a background job has finished.

The following example starts two background processes (sleeps). It then loops, using ps to see if they are still running. If not, it uses wait to gather the exit code and starts a new background process.

#!/bin/bash
sleep 3 &
pid1=$!
sleep 6 &
pid2=$!
while ( true )
do
    running1=`ps -p $pid1 --no-headers | wc -l`
    if [ $running1 == 0 ]
    then
        wait $pid1
        echo process 1 finished with exit code $?
        sleep 3 &
        pid1=$!
    else
        echo process 1 running
    fi
    running2=`ps -p $pid2 --no-headers | wc -l`
    if [ $running2 == 0 ]
    then
        wait $pid2
        echo process 2 finished with exit code $?
        sleep 6 &
        pid2=$!
    else
        echo process 2 running
    fi
    sleep 1
done

Edit: Using SIGCHLD (without polling):

#!/bin/bash
set -bm
trap 'ChildFinished' SIGCHLD

function ChildFinished() {
    running1=`ps -p $pid1 --no-headers | wc -l`
    if [ $running1 == 0 ]
    then
        wait $pid1
        echo process 1 finished with exit code $?
        sleep 3 &
        pid1=$!
    else
        echo process 1 running
    fi
    running2=`ps -p $pid2 --no-headers | wc -l`
    if [ $running2 == 0 ]
    then
        wait $pid2
        echo process 2 finished with exit code $?
        sleep 6 &
        pid2=$!
    else
        echo process 2 running
    fi
    sleep 1
}

sleep 3 &
pid1=$!
sleep 6 &
pid2=$!
sleep 1000d

8 Comments

Can this be done without the polling somehow? If I consume one processor just with BASH part of the value in running in parallel is somewhat lost.
Problem here is that ChildFinished could be called before you manage to set pid1. Obviously not with sleep 3 but some random process could exit quickly (particularly if it has an error at startup)
How about using (sleep 1 && realcommand) &? This will always take at least one second before ChildFinished is called. There is still a race with the second command finishing, so perhaps set pid1 to 0 (invalid) before starting the next command and check that in ChildFinished.
I don't like the sleep, but setting to 0 seems to be okay. I'll just skip 0 in the check, and each time I start a process do the check again after assigning the variable (in case already done). I'll wrap this in a few arrays and see if I can get it to work as I want.
This also presumes the bash properly cleans up, otherwise ps could return 1 with a zombie process. It usually does, so I'll probably be fine.

I think the following example answers some of your questions; I am looking into the rest of the question.

(cat list1 list2 list3 | sort | uniq > list123) &
(cat list4 list5 list6 | sort | uniq > list456) &

from:

Running parallel processes in subshells
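The same pattern with explicit PIDs, so the parent can collect each subshell's exit code (a sketch with made-up input files; note a failure inside a pipeline is masked by the last pipeline stage, so the failing subshell here calls sort directly):

```shell
dir=$(mktemp -d)
cd "$dir"
printf 'b\na\nb\n' > list1
printf 'c\na\n'    > list2

( cat list1 list2 | sort | uniq > list12 ) & pid1=$!
( sort missing > bad 2>/dev/null )         & pid2=$!   # 'missing' does not exist

wait "$pid1"; status1=$?
wait "$pid2"; status2=$?
echo "statuses: $status1 $status2"
echo "merged: $(tr '\n' ' ' < list12)"
```

Waiting on a specific PID (rather than a bare `wait`) is what makes the individual exit codes visible to the parent shell.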

Comments


There is another package for Debian systems named xjobs.

You might want to check it out:

http://packages.debian.org/wheezy/xjobs

Comments


If you cannot install parallel for some reason, this will work in plain shell or bash:

# String to detect failure in subprocess
FAIL_STR=failed_cmd
result=$( (false || echo ${FAIL_STR}1) &
          (true  || echo ${FAIL_STR}2) &
          (false || echo ${FAIL_STR}3) )
wait
if [[ ${result} == *"$FAIL_STR"* ]]; then
    failure=`echo ${result} | grep -E -o "$FAIL_STR[^[:space:]]+"`
    echo The following commands failed:
    echo "${failure}"
    echo See above output of these commands for details.
    exit 1
fi

Here true and false are placeholders for your commands. You can also echo $? along with FAIL_STR to get each command's status.
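One way to carry the exit code along with the marker, as suggested above (a sketch; the `rc=` tag is a made-up convention, not part of the original answer):

```shell
FAIL_STR=failed_cmd
# Each subshell captures its own exit code and appends it to the marker
# only on failure; the command substitution gathers all the markers.
result=$( ( false; rc=$?; [ $rc -ne 0 ] && echo "${FAIL_STR}1:rc=$rc" ) &
          ( true;  rc=$?; [ $rc -ne 0 ] && echo "${FAIL_STR}2:rc=$rc" ) &
          wait )
echo "$result"
```

Since only the failing subshell prints anything, `result` holds exactly the failing command's marker and status.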

Comments


Yet another bash-only example, for your interest. Of course, prefer GNU parallel, which offers many more features out of the box.

This solution involves creating temporary output files to collect the job status.

We use /tmp/${$}_ as the temporary file prefix; $$ is the parent process ID, which stays the same for the whole script execution.

First, the loop that starts the parallel jobs in batches. The batch size is set with max_parrallel_connection. try_connect_DB() is a slow bash function defined in the same file. We collect stdout + stderr (2>&1) for failure diagnostics.

nb_project=$(echo "$projects" | wc -w)
i=0
parrallel_connection=0
max_parrallel_connection=10

for p in $projects
do
    i=$((i+1))
    parrallel_connection=$((parrallel_connection+1))

    try_connect_DB $p "$USERNAME" "$pass" > /tmp/${$}_${p}.out 2>&1 &

    if [[ $parrallel_connection -ge $max_parrallel_connection ]]
    then
        echo -n " ... ($i/$nb_project)"
        wait
        parrallel_connection=0
    fi
done
if [[ $nb_project -gt $max_parrallel_connection ]]
then
    # final new line
    echo
fi
# wait for all remaining jobs
wait

After all jobs have finished, review the results:

SQL_connection_failed is our error convention, output by try_connect_DB(). You may filter for job success or failure in whatever way best suits your needs.

Here we decided to only output failed results in order to reduce the amount of output on large sized jobs. Especially if most of them, or all, passed successfully.

# displaying result that failed
file_with_failure=$(grep -l SQL_connection_failed /tmp/${$}_*.out)
if [[ -n $file_with_failure ]]
then
    nb_failed=$(wc -l <<< "$file_with_failure")
    # we will collect DB name from our output file naming convention, for post treatment
    db_names=""
    echo "=========== failed connections : $nb_failed/$nb_project"
    for failure in $file_with_failure
    do
        echo "============ $failure"
        cat $failure
        db_names+=" $(basename $failure | sed -e 's/^[0-9]\+_\([^.]\+\)\.out/\1/')"
    done
    echo "$db_names"
    ret=1
else
    echo "all tests passed"
    ret=0
fi
# temporary files cleanup; could be kept in case of error, adapt to suit your needs.
rm /tmp/${$}_*.out
exit $ret

Comments
