I have a task that processes a list of files on stdin. The program's start-up time is substantial, and the amount of time each file takes varies widely. I want to spawn a substantial number of these processes, then dispatch work to whichever ones are not busy. Several command-line tools almost do what I want; I've narrowed it down to two almost-working options:
```
find . -type f | split -n r/24 -u --filter="myjob"
find . -type f | parallel --pipe -u -l 1 myjob
```

The problem is that `split` does a pure round-robin, so one of the processes gets behind and stays behind, delaying the completion of the entire operation; meanwhile `parallel` wants to spawn one process per N lines or bytes of input, and I wind up spending way too much time on startup overhead.
Is there something like this that will re-use the processes and feed lines to whichever processes have unblocked stdins?
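If nothing off the shelf does this, here is a minimal sketch of the dispatcher I have in mind, assuming `myjob` reads one filename per line, 24 workers, and Python's `select` to find a writable stdin pipe (the worker count and script name `dispatch.py` are just placeholders):

```python
#!/usr/bin/env python3
# Sketch: spawn N long-lived workers and write each input line to
# whichever worker's stdin pipe currently has room.
import select
import subprocess
import sys

NWORKERS = 24        # assumption: tune to taste
CMD = ["myjob"]      # the slow-to-start worker from the question

workers = [
    subprocess.Popen(CMD, stdin=subprocess.PIPE, bufsize=0)
    for _ in range(NWORKERS)
]

for line in sys.stdin.buffer:
    # Block until at least one worker's stdin pipe has room, then hand
    # the line to the first such worker. Assumes filenames are short
    # enough that a single write() never splits a line across writes.
    _, writable, _ = select.select([], [w.stdin for w in workers], [])
    writable[0].write(line)

for w in workers:
    w.stdin.close()
for w in workers:
    w.wait()
```

Usage would be `find . -type f | python3 dispatch.py`. The caveat, which the comments below also raise, is that a writable pipe only means the kernel buffer has room, not that the worker is actually idle, so this evens out the load but can still queue several filenames behind a slow one.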
Comments:

- Where is that `split` command from? The name conflicts with the standard text-processing utility.
- You can't know when `myjob` is ready to receive more input. There is no way to know that a program is ready to process more input; all you can know is that some buffer somewhere (a pipe buffer, an stdio buffer) is ready to receive more input. Can you arrange for your program to send some kind of request (e.g. display a prompt) when it's ready?
- Watching the workers' `read` calls would do the trick. That's a fairly large programming endeavor.
- Why the `-l 1` in the `parallel` args? IIRC, that tells parallel to process one line of input per job (i.e. one filename per fork of `myjob`, so lots of startup overhead).
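To make the prompt idea from the comments concrete, a hedged sketch: it assumes a hypothetical wrapper `myjob-ready` around `myjob` that prints `READY` on its stdout whenever it can accept another filename, and the dispatcher feeds the next line to whichever worker asked for one. Only a worker that has explicitly signalled idleness ever receives work, which avoids the pipe-buffer ambiguity above.

```python
#!/usr/bin/env python3
# Sketch of the readiness protocol suggested in the comments: workers
# announce "READY" on stdout when idle; the dispatcher answers with the
# next filename on that worker's stdin.
import select
import subprocess
import sys

NWORKERS = 24                  # assumption: tune to taste
CMD = ["myjob-ready"]          # hypothetical wrapper: prints "READY\n" when idle

workers = [
    subprocess.Popen(CMD, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                     bufsize=0)
    for _ in range(NWORKERS)
]
by_stdout = {w.stdout: w for w in workers}

for line in sys.stdin.buffer:
    # Block until some worker announces it is idle, consume its token,
    # then hand it the next filename.
    readable, _, _ = select.select(list(by_stdout), [], [])
    w = by_stdout[readable[0]]
    w.stdout.readline()
    w.stdin.write(line)

for w in workers:
    w.stdin.close()
    w.wait()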