Why does this Python script that writes to a file abruptly stop?

Question

This small script reads a file, tries to match each line with a regex, and appends matching lines to another file:

    regex = re.compile(r"<http://dbtropes.org/resource/Film/.*?> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbtropes.org/resource/Main/.*?> \.")
    with open("dbtropes-v2.nt", "a") as output, open("dbtropes.nt", "rb") as input:
        for line in input.readlines():
            if re.findall(regex,line):
                output.write(line)
    input.close()
    output.close()

However, the script abruptly stops after about 5 minutes. The terminal says "Process stopped", and the output file stays blank.

The input file can be downloaded from http://dbtropes.org/static/dbtropes.zip (a 4.3 GB N-Triples file).

Is there something wrong with my code? Is it something else? Any hint would be appreciated on this one!

Try using top to see how much memory the process is using. And/or add some progress output.
– Jesse W at Z - Given up on SE, Commented Oct 28, 2014 at 18:22
As a side note, you probably don't want findall if you're just checking whether there are any matches. It probably won't have a huge performance impact to find all the matches instead of just the first one, but it can't help, and since it's also conceptually a little confusing, better to just not do it.
– abarnert, Commented Oct 28, 2014 at 18:25
Also, if you're going to compile a pattern to a regex object, use its methods (regex.findall(line)), not the top-level functions (re.findall(regex, line)). The performance impact is probably even smaller here; again, it's about readability. (Also, the methods are more flexible if you ever want to extend things, e.g., to ignore the first 3 characters.)
– abarnert, Commented Oct 28, 2014 at 18:28
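
The suggestions in these comments can be combined in a short sketch: count lines to report progress, and call the compiled pattern's search method instead of re.findall. The function name filter_lines, the report_every parameter, and the shortened pattern are assumptions made for illustration; they are not part of the original script.

```python
import io
import re
import sys

# Shortened, hypothetical pattern; the real script uses the full dbtropes URLs.
PATTERN = re.compile(r"<film/.*?> <type> <main/.*?> \.")


def filter_lines(source, sink, report_every=1000000, log=sys.stderr):
    """Copy matching lines from source to sink, reporting progress to log."""
    matched = 0
    for count, line in enumerate(source, start=1):
        # search() stops at the first match, which is all we need here.
        if PATTERN.search(line):
            sink.write(line)
            matched += 1
        # Periodic progress output shows the script is still alive.
        if count % report_every == 0:
            log.write("%d lines read, %d matched\n" % (count, matched))
    return matched


# In-memory files stand in for the real input and output.
sample = io.StringIO("<film/A> <type> <main/B> .\nunrelated line\n")
result = io.StringIO()
progress = io.StringIO()
matched_count = filter_lines(sample, result, report_every=1, log=progress)
```

With a real multi-gigabyte input, a large report_every keeps the progress output from flooding the terminal.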

Robᵩ · Accepted Answer · 2014-10-28 18:31:48Z

It stopped because it ran out of memory. input.readlines() reads the entire file into memory before returning a list of the lines.

Instead, use input as an iterator. This only reads a few lines at a time, and returns them immediately.

Don't do this:

for line in input.readlines():

Do this instead:

for line in input:
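
To illustrate the difference, here is a small sketch that uses an in-memory file (io.StringIO stands in for the real 4.3 GB input, which is an assumption for the demo). A file object is its own iterator, so lines can be pulled one at a time, whereas readlines() builds the full list up front.

```python
import io

text = "first\nsecond\nthird\n"

# readlines() materializes every line in a list before the loop starts.
all_lines = io.StringIO(text).readlines()

# Iterating the file object yields one line per step instead.
stream = io.StringIO(text)
is_own_iterator = iter(stream) is stream
first_line = next(stream)
remaining = [line for line in stream]
```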

Taking everyone's advice into account, your program becomes:

    import re

    regex = re.compile(r"<http://dbtropes.org/resource/Film/.*?> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbtropes.org/resource/Main/.*?> \.")
    with open("dbtropes.nt", "rb") as input:
        with open("dbtropes-v2.nt", "a") as output:
            for line in input:
                if regex.search(line):
                    output.write(line)
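
The program above dates from 2014 and mixes a str pattern with a file opened in "rb" mode, which works on Python 2. On Python 3, lines read in "rb" mode are bytes, so a str pattern raises TypeError. One possible adaptation, sketched here as an assumption rather than as the answerer's code, is to use a bytes pattern and to open the output in binary append mode too. The function name filter_triples and the temporary file names are hypothetical.

```python
import os
import re
import tempfile

# Bytes pattern (rb"...") so it can be matched against lines read in "rb" mode.
FILM_TYPE = re.compile(
    rb"<http://dbtropes.org/resource/Film/.*?> "
    rb"<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    rb"<http://dbtropes.org/resource/Main/.*?> \."
)


def filter_triples(src_path, dst_path):
    """Append lines of src_path that match FILM_TYPE to dst_path."""
    # The input is opened first, so a missing input creates no output file.
    with open(src_path, "rb") as source, open(dst_path, "ab") as sink:
        for line in source:  # one line at a time, never the whole file
            if FILM_TYPE.search(line):
                sink.write(line)


# Small self-check with temporary files standing in for dbtropes.nt.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.nt")
dst = os.path.join(workdir, "out.nt")
match = (b"<http://dbtropes.org/resource/Film/Alien> "
         b"<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
         b"<http://dbtropes.org/resource/Main/Horror> .\n")
with open(src, "wb") as handle:
    handle.write(match + b"<x> <y> <z> .\n")
filter_triples(src, dst)
with open(dst, "rb") as handle:
    written = handle.read()
```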

Beat me to it. Even in Python, with enough data you have to be careful how you handle it, because you can end up needing more memory than your computer has.

theodox · Accepted Answer · 2014-10-28 18:24:45Z

Use for line in input rather than readlines() to keep it from reading the whole file.

A minor point: You don't need to close files if you open them as context managers. You might find it cleaner like this:

    with open("dbtropes-v2.nt", "a") as output:
        with open("dbtropes.nt", "rb") as input:
            for line in input:
                if re.findall(regex,line):
                    output.write(line)

3 Comments

Jonathan Eunice Over a year ago

I like this code sample. I might reorder the with statements, though, to open input before opening output. That way, if the input file is not present, no extra resources will be allocated, and no spurious output files will be created.

kormak Over a year ago

Why is it not necessary to close the file in this context? I read earlier (see here for example: stackoverflow.com/questions/5972277/write-not-working-in-python) that due to buffering a file might not get written at all if it's not properly closed.

Robᵩ Over a year ago

It is not necessary to close the file explicitly, because the with statement automatically calls .close() at the end of the indented block. Closing the file is still necessary; that is exactly why you used with. See the example at file.close().
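
A quick way to check this yourself, sketched with a temporary file (the file name is an assumption for the demo): the file object's closed attribute becomes True as soon as the with block ends, and the buffered data is on disk by then.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as handle:
    handle.write("buffered line\n")
    closed_inside = handle.closed   # still open inside the block

closed_after = handle.closed        # the with statement has closed the file

with open(path) as handle:
    contents = handle.read()        # data was flushed when the file closed
```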
