78

How do I execute the following shell command using the Python subprocess module?

echo "input data" | awk -f script.awk | sort > outfile.txt 

The input data will come from a string, so I don't actually need echo. I've got this far, can anyone explain how I get it to pipe through sort too?

    p_awk = subprocess.Popen(["awk", "-f", "script.awk"],
                             stdin=subprocess.PIPE,
                             stdout=file("outfile.txt", "w"))
    p_awk.communicate("input data")

UPDATE: Note that while the accepted answer below doesn't actually answer the question as asked, I believe S.Lott is right and it's better to avoid having to solve that problem in the first place!

9 Answers

57

You'd be a little happier with the following.

    import subprocess

    awk_sort = subprocess.Popen(
        "awk -f script.awk | sort > outfile.txt",
        stdin=subprocess.PIPE, shell=True)
    awk_sort.communicate(b"input data\n")

Delegate part of the work to the shell. Let it connect two processes with a pipeline.

You'd be a lot happier rewriting 'script.awk' into Python, eliminating awk and the pipeline.

Edit. Some of the reasons for suggesting that awk isn't helping.

[There are too many reasons to respond via comments.]

  1. Awk is adding a step of no significant value. There's nothing unique about awk's processing that Python doesn't handle.

  2. The pipelining from awk to sort, for large sets of data, may improve elapsed processing time. For short sets of data, it has no significant benefit. A quick measurement of awk >file ; sort file versus awk | sort will reveal whether concurrency helps (a rough timing sketch follows this list). With sort, it rarely helps, because sort is not a once-through filter.

  3. The simplicity of "Python to sort" processing (instead of "Python to awk to sort") prevents the exact kind of questions being asked here.

  4. Python -- while wordier than awk -- is also explicit where awk has certain implicit rules that are opaque to newbies, and confusing to non-specialists.

  5. Awk (like the shell script itself) adds Yet Another Programming language. If all of this can be done in one language (Python), eliminating the shell and the awk programming eliminates two programming languages, allowing someone to focus on the value-producing parts of the task.
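
For point 2, here is a rough way to take that measurement yourself. This is only a sketch: script.awk is from the question, and data.txt, tmp.txt and the two shell command strings are placeholders for whatever your actual data and pipeline look like.

    import subprocess
    import time

    def timed(cmd):
        # run a shell command and return its elapsed wall-clock time
        start = time.perf_counter()
        subprocess.check_call(cmd, shell=True)
        return time.perf_counter() - start

    # sequential: awk finishes before sort starts; pipelined: they run concurrently
    sequential = timed("awk -f script.awk data.txt > tmp.txt; sort tmp.txt > outfile.txt")
    pipelined = timed("awk -f script.awk data.txt | sort > outfile.txt")
    print("sequential: %.3fs  pipelined: %.3fs" % (sequential, pipelined))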

Bottom line: awk can't add significant value. In this case, awk is a net cost; it added enough complexity that it was necessary to ask this question. Removing awk will be a net gain.
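
For illustration, a hedged sketch of the "just do it in Python" suggestion. It assumes, purely for the example, that script.awk does nothing more than print the second field of each line; adjust to whatever the real script does.

    # Assumed for the example: the awk script just prints the second
    # whitespace-separated field of each input line.
    input_data = "input data\nmore input\n"
    fields = [line.split()[1] for line in input_data.splitlines()
              if len(line.split()) > 1]
    with open("outfile.txt", "w") as outfile:
        outfile.write("\n".join(sorted(fields)) + "\n")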

Sidebar: why building a pipeline (a | b) is so hard.

When the shell is confronted with a | b it has to do the following.

  1. Fork a child process of the original shell. This will eventually become b.

  2. Build an OS pipe (not a Python subprocess.PIPE): call os.pipe(), which returns two new file descriptors connected via a common buffer. At this point the process has stdin, stdout and stderr from its parent, plus a file that will be "a's stdout" and a file that will be "b's stdin".

  3. Fork a child. The child replaces its stdout with the new a's stdout. Exec the a process.

  4. The b child replaces its stdin with the new b's stdin (the pipe's read end). Exec the b process.

  5. The b child waits for a to complete.

  6. The parent is waiting for b to complete.

I think that the above can be used recursively to spawn a | b | c, but you have to implicitly parenthesize long pipelines, treating them as if they're a | (b | c).

Since Python has os.pipe(), os.exec() and os.fork(), and you can replace sys.stdin and sys.stdout, there's a way to do the above in pure Python. Indeed, you may be able to work out some shortcuts using os.pipe() and subprocess.Popen.
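
For the curious, here is a minimal POSIX-only sketch of those steps using os.pipe(), os.fork() and os.execvp(). The helper name and the echo | sort example are illustrative only.

    import os

    # Runs a two-stage pipeline a | b using nothing but os.pipe(), os.fork()
    # and exec, roughly following the steps described above.
    def run_pipeline(a_argv, b_argv):
        rd, wr = os.pipe()                  # two fds connected by a common buffer
        if os.fork() == 0:                  # child that becomes a
            os.close(rd)
            os.dup2(wr, 1)                  # a's stdout is the pipe's write end
            os.close(wr)
            os.execvp(a_argv[0], a_argv)
        if os.fork() == 0:                  # child that becomes b
            os.close(wr)
            os.dup2(rd, 0)                  # b's stdin is the pipe's read end
            os.close(rd)
            os.execvp(b_argv[0], b_argv)
        os.close(rd)                        # parent keeps neither end open
        os.close(wr)
        os.wait()                           # wait for both children
        os.wait()

    run_pipeline(["echo", "banana\napple"], ["sort"])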

However, it's easier to delegate that operation to the shell.


14 Comments

And I think Awk is actually a good fit for what I am doing, the code is shorter and simpler than the equivalent Python code (it's a domain specific language after all.)
-c tells the shell (the actual application you're starting) that the following argument is a command to run. In this case, the command is a shell pipeline.
"the code is shorter" does not -- actually -- mean simpler. It only means shorter. Awk has a lot of assumptions and hidden features that make the code very hard to work with. Python, while longer, is explicit.
Sure, I understand your points & concerns and agree that in many cases my example above would be better written in pure Python. I'm not ready to do that in my case yet, however, as the awk script works and is debugged. Sooner or later, but not right now.
And, that doesn't change the original question, which is how to use subprocess.Popen. Awk and sort are only used for illustration as potential answerers are likely to have them to test with.
37
    import subprocess

    some_string = b'input_data'
    sort_out = open('outfile.txt', 'wb', 0)
    sort_in = subprocess.Popen('sort', stdin=subprocess.PIPE,
                               stdout=sort_out).stdin
    subprocess.Popen(['awk', '-f', 'script.awk'],
                     stdout=sort_in,
                     stdin=subprocess.PIPE).communicate(some_string)

4 Comments

excellent! I modified it to make a self-contained example without the awk script, it uses sed: sam.nipl.net/code/python/pipeline.py
@SamWatkins: you don't need p1.wait() in your code. p1.communicate() reaps the child process.
Isn't this answer more pythonic and better? It doesn't use shell=True as discouraged in the subprocess documentation. I can't see the reason why people up-voted @S.Lott answer.
@KenT: the shell solution is more readable and less error-prone (if you don't accept untrusted input). The pythonic solution would use plumbum (the shell syntax embedded in Python) or another module that accepts a similar syntax (in a string) and constructs the pipeline for you (same behavior whatever local /bin/sh does).
22

To emulate a shell pipeline:

    from subprocess import check_call

    check_call('echo "input data" | a | b > outfile.txt', shell=True)

without invoking the shell (see 17.1.4.2. Replacing shell pipeline):

    #!/usr/bin/env python
    from subprocess import Popen, PIPE

    a = Popen(["a"], stdin=PIPE, stdout=PIPE)
    with a.stdin:
        with a.stdout, open("outfile.txt", "wb") as outfile:
            b = Popen(["b"], stdin=a.stdout, stdout=outfile)
        a.stdin.write(b"input data")
    statuses = [a.wait(), b.wait()]  # both a.stdin/stdout are closed already

plumbum provides some syntax sugar:

    #!/usr/bin/env python
    from plumbum.cmd import a, b  # magic

    (a << "input data" | b > "outfile.txt")()

The analog of:

    #!/bin/sh
    echo "input data" | awk -f script.awk | sort > outfile.txt

is:

    #!/usr/bin/env python
    from plumbum.cmd import awk, sort

    (awk["-f", "script.awk"] << "input data" | sort > "outfile.txt")()

6 Comments

Plumbum looks very nice, but I'm wary of "magic." This isn't Perl!
Plumbum does look nice! I wouldn't worry about the magic @KyleStrand - from a quick peek at the docs, you're not required to use the "magic" bits, the module has other ways of doing the same thing - and a quick look at the code shows that the magic is harmless and actually quite slick, not nasty at all.
@Tom I don't know, that's a lot of operator overloading with potentially surprising meanings. Part of me loves it, but I'd be reluctant to use it anywhere but in a personal project.
@KyleStrand: In general I would agree with you, but in practice it is much more likely that people either construct the command line incorrectly (e.g., by forgetting pipes.quote()) or introduce bugs while implementing the pipeline in Python; even a | b could be implemented with errors.
@jfs, how do I read the file if the file is coming via a POST request, using cat or the << operator?
16

The accepted answer sidesteps the actual question. Here is a snippet that chains the output of multiple processes. Note that it also prints the (somewhat) equivalent shell command so you can run it and make sure the output is correct.

    #!/usr/bin/env python3
    from subprocess import Popen, PIPE

    # cmd1 : dd if=/dev/zero bs=1M count=100
    # cmd2 : tee
    # cmd3 : wc -c
    cmd1 = ['dd', 'if=/dev/zero', 'bs=1M', 'count=100']
    cmd2 = ['tee']
    cmd3 = ['wc', '-c']

    print(f"Shell style : {' '.join(cmd1)} | {' '.join(cmd2)} | {' '.join(cmd3)}")

    p1 = Popen(cmd1, stdout=PIPE, stderr=PIPE)  # stderr=PIPE optional, dd is chatty
    p2 = Popen(cmd2, stdin=p1.stdout, stdout=PIPE)
    p3 = Popen(cmd3, stdin=p2.stdout, stdout=PIPE)

    print("Output from last process : " + (p3.communicate()[0]).decode())

    # theoretically p1 and p2 may still be running; this ensures we collect their return codes
    p1.wait()
    p2.wait()
    print("p1 return: ", p1.returncode)
    print("p2 return: ", p2.returncode)
    print("p3 return: ", p3.returncode)

7 Comments

If p*.returncode returns 0, could I assume that there is no error generated? @Omry Yadan
you can be sure it returned 0. "error generated" is not well defined. it can still print things to stderr.
So as I understand it, I have to check stderr as well; if it is an empty string, I can be sure that no error was generated.
It depends on what you mean by an error. some programs would print to stderr routinely even if there is no error.
Can this deadlock? cmd1 is writing to stderr and nothing is consuming p1.stderr. If the pipe buffer fills up, the OS will block the p1 process. Same for p2.
2

http://www.python.org/doc/2.5.2/lib/node535.html covered this pretty well. Is there some part of this you didn't understand?

Your program would be pretty similar, but the second Popen would have stdout= to a file, and you wouldn't need the output of its .communicate().
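
To make that concrete, here is a hedged sketch using the commands from the question; the downstream sort is created first so that communicate() only has to feed awk and wait for it.

    import subprocess

    # Sketch of the suggestion above: the second process (sort) writes straight
    # to the file, so its output never needs to be collected in Python.
    with open("outfile.txt", "wb") as outfile:
        p_sort = subprocess.Popen(["sort"], stdin=subprocess.PIPE, stdout=outfile)
        p_awk = subprocess.Popen(["awk", "-f", "script.awk"],
                                 stdin=subprocess.PIPE, stdout=p_sort.stdin)
        p_awk.communicate(b"input data\n")  # feed the input string to awk and wait
        p_sort.stdin.close()                # close the parent's copy so sort sees EOF
        p_sort.wait()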

4 Comments

What I don't understand (given the documentation's example) is if I say p2.communicate("input data"), does that actually get sent to p1.stdin?
You wouldn't. p1's stdin arg would be set to PIPE and you'd write p1.communicate('foo') then pick up the results by doing p2.stdout.read()
@Leonid - The Python people aren't very good at backwards compatibility. You can get much of the same information from: docs.python.org/2/library/subprocess.html#popen-objects but I've replaced the link with a wayback machine link anyway.
There's no need for the snarkiness of asking if there's "some part of [the docs] that [OP] didn't understand". As shown in this question, the part of the docs that you posted doesn't actually address the issue of passing input to the first process: stackoverflow.com/q/6341451/1858225
2

Inspired by @Cristian's answer. I met just the same issue, but with a different command. So I'm putting my tested example, which I believe could be helpful:

    import subprocess

    grep_proc = subprocess.Popen(["grep", "rabbitmq"],
                                 stdin=subprocess.PIPE,
                                 stdout=subprocess.PIPE)
    subprocess.Popen(["ps", "aux"], stdout=grep_proc.stdin)
    out, err = grep_proc.communicate()

This is tested.

What has been done

  • Declared the grep process with its stdin coming from a pipe. It starts consuming data once ps runs and fills the pipe with its stdout.
  • Called the primary command ps with stdout directed to the pipe used by the grep command.
  • Called communicate() on the grep process to collect its stdout from the pipe.

I like this approach because it is the natural pipe concept, gently wrapped in the subprocess interfaces.

1 Comment

to avoid zombies, call ps_proc.wait() after grep_proc.communicate(). err is always None unless you set stderr=subprocess.PIPE.
2

The previous answers missed an important point. Replacing shell pipeline is basically correct, as pointed out by geocar. It is almost sufficient to run communicate on the last element of the pipe.

The remaining problem is passing the input data to the pipeline. With multiple subprocesses, a simple communicate(input_data) on the last element doesn't work - it hangs forever. You need to create a pipeline and a child manually, like this:

    import os
    import subprocess

    input = b"""\
    input data
    more input
    """ * 10

    rd, wr = os.pipe()
    if os.fork() != 0:
        # parent
        os.close(wr)
    else:
        # child
        os.close(rd)
        os.write(wr, input)
        os.close(wr)
        exit()

    p_awk = subprocess.Popen(["awk", "{ print $2; }"],
                             stdin=rd,
                             stdout=subprocess.PIPE)
    p_sort = subprocess.Popen(["sort"],
                              stdin=p_awk.stdout,
                              stdout=subprocess.PIPE)
    p_awk.stdout.close()
    out, err = p_sort.communicate()
    print(out.rstrip())

Now the child provides the input through the pipe, and the parent calls communicate(), which works as expected. With this approach, you can create arbitrary long pipelines without resorting to "delegating part of the work to the shell". Unfortunately the subprocess documentation doesn't mention this.

There are ways to achieve the same effect without pipes:

    from tempfile import TemporaryFile

    tf = TemporaryFile()
    tf.write(input)
    tf.seek(0, 0)

Now use stdin=tf for p_awk. It's a matter of taste what you prefer.
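
For instance, a sketch reusing the pipeline from above, with the temporary file in place of the pipe-feeding child:

    # Same awk | sort pipeline as before, but awk reads from the temporary file.
    p_awk = subprocess.Popen(["awk", "{ print $2; }"],
                             stdin=tf, stdout=subprocess.PIPE)
    p_sort = subprocess.Popen(["sort"], stdin=p_awk.stdout, stdout=subprocess.PIPE)
    p_awk.stdout.close()
    out, err = p_sort.communicate()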

The above is still not 100% equivalent to bash pipelines because the signal handling is different. You can see this if you add another pipe element that truncates the output of sort, e.g. head -n 10. With the code above, sort will print a "Broken pipe" error message to stderr. You won't see this message when you run the same pipeline in the shell. (That's the only difference though, the result in stdout is the same). The reason seems to be that python's Popen sets SIG_IGN for SIGPIPE, whereas the shell leaves it at SIG_DFL, and sort's signal handling is different in these two cases.
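
If the difference matters, one commonly used POSIX workaround is to restore the default SIGPIPE handler in the child before exec, as in the sketch below; note that Python 3's Popen already does this by default via restore_signals=True.

    import signal
    import subprocess

    # Restore SIG_DFL for SIGPIPE in the child, so a downstream "head" kills
    # "sort" silently, the way it would in a shell pipeline.
    p_sort = subprocess.Popen(
        ["sort"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        preexec_fn=lambda: signal.signal(signal.SIGPIPE, signal.SIG_DFL),
    )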

1 Comment

it is sufficient to run communicate on the last process, see my answer.
1

EDIT: pipes is available on Windows but, crucially, doesn't appear to actually work on Windows. See comments below.

The Python standard library now includes the pipes module for handling this:

https://docs.python.org/2/library/pipes.html, https://docs.python.org/3.4/library/pipes.html

I'm not sure how long this module has been around, but this approach appears to be vastly simpler than mucking about with subprocess.
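
For what it's worth, here is a hedged sketch of what the question's pipeline might look like with pipes (POSIX only; the module has since been deprecated and was removed in Python 3.13):

    import pipes

    # Build the awk | sort pipeline from the question and feed it the input string.
    t = pipes.Template()
    t.append('awk -f script.awk', '--')   # '--': reads stdin, writes stdout
    t.append('sort', '--')
    f = t.open('outfile.txt', 'w')        # write end of the pipeline
    f.write('input data\n')
    f.close()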

4 Comments

pipes existed even before the subprocess module. It builds a (*nix) shell pipeline (a string with "|" that is executed in /bin/sh). It is not portable. It is not an alternative to the subprocess module, which is portable and does not require starting a shell to run a command. The pipes interface is from the time when Enterprise JavaBeans were shiny new things (that is not a compliment). Could you provide a pipes code example that is "vastly simpler" than subprocess': check_call('echo "input data" | a | b > outfile.txt', shell=True) from my answer?
@J.F.Sebastian Huh. Shouldn't your check_call command be equally non-portable? DOS provides standard (i.e. *NIX-comparable, at least AFAIK) | behavior, so which systems are you expecting pipes not to work on? I admit that using check_call with a string representing your shell command is arguably just as simple as using pipes, but I was hoping for something that would facilitate the programmatic construction of a pipeline rather than just taking a single string to pass to the shell (a la your other examples).
As I said in my comment, plumbum does look nice--it appears to provide exactly the simplicity, flexibility, and power that I'm looking for. However, the syntax is entirely opaque and non-Pythonic. So what I want is something that's approximately as simple and easy as standard *NIX-shell pipes (if perhaps slightly less concise) while still syntactically and stylistically "looking" like Python. pipes, at first glance, certainly seems to meet these requirements; if, however, you're right that it's non-portable (which you probably are), then of course it's the least attractive option.
...and, yes, it looks like a simple piping together of echo hello world with C:\cygwin\bin\tr a-z A-Z fails on Windows, even though echo hello world | C:\cygwin\bin\tr.exe a-z A-Z works. That's...strange and disappointing.
1

For me, the below approach is the cleanest and easiest to read

    from subprocess import Popen, PIPE

    def string_to_2_procs_to_file(input_s, first_cmd, second_cmd, output_filename):
        with open(output_filename, 'wb') as out_f:
            p2 = Popen(second_cmd, stdin=PIPE, stdout=out_f)
            p1 = Popen(first_cmd, stdout=p2.stdin, stdin=PIPE)
            p1.communicate(input=input_s.encode())  # encode the input string to bytes
            p1.wait()
            p2.stdin.close()
            p2.wait()

which can be called like so:

string_to_2_procs_to_file('input data', ['awk', '-f', 'script.awk'], ['sort'], 'output.txt') 

Comments
