I have a task that processes a list of files on stdin. The program's start-up time is substantial, and the amount of time each file takes varies widely. I want to spawn a substantial number of these processes, then dispatch work to whichever ones are not busy. Several command-line tools almost do what I want; I've narrowed it down to two almost-working options:
```
find . -type f | split -n r/24 -u --filter="myjob"
find . -type f | parallel --pipe -u -l 1 myjob
```

The problem is that `split` does a pure round-robin, so one of the processes gets behind and stays behind, delaying the completion of the entire operation; meanwhile `parallel` wants to spawn one process per N lines or bytes of input, and I wind up spending way too much time on startup overhead.
Is there something like this that will re-use the processes and feed lines to whichever processes have unblocked stdins?
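If nothing off the shelf does this, here is a minimal sketch of the dispatcher I have in mind, assuming `myjob` reads one filename per line, 24 workers, and Python's `select` to find a writable stdin pipe (the worker count and script name `dispatch.py` are just placeholders):

```python
#!/usr/bin/env python3
# Sketch: spawn N long-lived workers and write each input line to
# whichever worker's stdin pipe currently has room.
import select
import subprocess
import sys

NWORKERS = 24        # assumption: tune to taste
CMD = ["myjob"]      # the slow-to-start worker from the question

workers = [
    subprocess.Popen(CMD, stdin=subprocess.PIPE, bufsize=0)
    for _ in range(NWORKERS)
]

for line in sys.stdin.buffer:
    # Block until at least one worker's stdin pipe has room, then hand
    # the line to the first such worker. Assumes filenames are short
    # enough that a single write() never splits a line across writes.
    _, writable, _ = select.select([], [w.stdin for w in workers], [])
    writable[0].write(line)

for w in workers:
    w.stdin.close()
for w in workers:
    w.wait()
```

Usage would be `find . -type f | python3 dispatch.py`. The caveat, which the comments below also raise, is that a writable pipe only means the kernel buffer has room, not that the worker is actually idle, so this evens out the load but can still queue several filenames behind a slow one.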
Comments:

- Where is that `split` command from? The name conflicts with the standard text-processing utility.
- You can't know when `myjob` is ready to receive more input. There is no way to know that a program is ready to process more input; all you can know is that some buffer somewhere (a pipe buffer, an stdio buffer) is ready to receive more input. Can you arrange for your program to send some kind of request (e.g. display a prompt) when it's ready?
- Watching the workers' `read` calls would do the trick. That's a fairly large programming endeavor.
- Why the `-l 1` in the `parallel` args? IIRC, that tells parallel to process one line of input per job (i.e. one filename per fork of `myjob`, so lots of startup overhead).
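To make the prompt idea from the comments concrete, a hedged sketch: it assumes a hypothetical wrapper `myjob-ready` around `myjob` that prints `READY` on its stdout whenever it can accept another filename, and the dispatcher feeds the next line to whichever worker asked for one. Only a worker that has explicitly signalled idleness ever receives work, which avoids the pipe-buffer ambiguity above.

```python
#!/usr/bin/env python3
# Sketch of the readiness protocol suggested in the comments: workers
# announce "READY" on stdout when idle; the dispatcher answers with the
# next filename on that worker's stdin.
import select
import subprocess
import sys

NWORKERS = 24                  # assumption: tune to taste
CMD = ["myjob-ready"]          # hypothetical wrapper: prints "READY\n" when idle

workers = [
    subprocess.Popen(CMD, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                     bufsize=0)
    for _ in range(NWORKERS)
]
by_stdout = {w.stdout: w for w in workers}

for line in sys.stdin.buffer:
    # Block until some worker announces it is idle, consume its token,
    # then hand it the next filename.
    readable, _, _ = select.select(list(by_stdout), [], [])
    w = by_stdout[readable[0]]
    w.stdout.readline()
    w.stdin.write(line)

for w in workers:
    w.stdin.close()
    w.wait()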