
This small script reads a file, tries to match each line with a regex, and appends matching lines to another file:

regex = re.compile(r"<http://dbtropes.org/resource/Film/.*?> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbtropes.org/resource/Main/.*?> \.")
with open("dbtropes-v2.nt", "a") as output, open("dbtropes.nt", "rb") as input:
    for line in input.readlines():
        if re.findall(regex,line):
            output.write(line)
input.close()
output.close()

However, the script abruptly stops after about 5 minutes. The terminal says "Process stopped", and the output file stays blank.

The input file, a 4.3 GB N-Triples file, can be downloaded here: http://dbtropes.org/static/dbtropes.zip

Is there something wrong with my code? Is it something else? Any hint would be appreciated on this one!

3 Comments
  • Try using top to see how much memory the process is using. And/or add some progress output. Commented Oct 28, 2014 at 18:22
  • As a side note, you probably don't want findall if you're just checking whether there are any matches. It probably won't have a huge performance impact to find all the matches instead of just the first one, but it can't help, and since it's also conceptually a little confusing, better to just not do it. Commented Oct 28, 2014 at 18:25
  • Also, if you're going to compile a pattern to a regex object, use its methods (regex.findall(line)), not the top-level functions (re.findall(regex, line)). The performance impact is probably even smaller here; again, it's about readability. (The methods are also more flexible if you ever want to extend things to, say, ignore the first 3 characters.) Commented Oct 28, 2014 at 18:28
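
To make the last two comments concrete, here is a small sketch of how search and findall differ on a compiled pattern. It assumes Python 3; the shortened example.org pattern and the sample line are made up for illustration and are not taken from the question.

```python
import re

# Hypothetical pattern and sample line, shortened for illustration.
pattern = re.compile(r"<http://example\.org/Film/.*?>")
line = "<http://example.org/Film/A> <http://example.org/Film/B> ."

# findall scans the whole line and builds a list of every match.
all_matches = pattern.findall(line)

# search stops at the first match and returns a match object (or None).
first_match = pattern.search(line)

# The method form also accepts a start position, e.g. to skip 3 characters.
skipped = pattern.search(line, 3)
```

For a plain "does this line match?" test, the truthiness of the search result is all that is needed.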

2 Answers

7

It stopped because it ran out of memory. input.readlines() reads the entire file into memory and builds a list of all its lines before the loop processes the first one.

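Instead, iterate over input directly. A file object is its own iterator: it reads the file in small buffered chunks and yields each line as soon as it is available, so memory use stays roughly constant no matter how large the file is.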

Don't do this:

for line in input.readlines(): 

Do this instead:

for line in input: 

Taking everyone's advice into account, your program becomes:

import re

regex = re.compile(r"<http://dbtropes.org/resource/Film/.*?> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbtropes.org/resource/Main/.*?> \.")
# Open the input first, so no output file is created if the input is missing.
with open("dbtropes.nt", "rb") as input:
    with open("dbtropes-v2.nt", "a") as output:
        # Iterating over the file object reads one line at a time.
        for line in input:
            if regex.search(line):
                output.write(line)
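
If you are on Python 3, note that a file opened with "rb" yields bytes, so the pattern has to be a bytes pattern too. Below is a minimal sketch of the same streaming approach under that assumption. The function name filter_lines, the FILM_TYPE constant, and the shortened example.org pattern are hypothetical and only there to keep the example short.

```python
import re

# Bytes pattern, because the input is opened in binary mode ("rb").
# The shortened URL is a placeholder, not the pattern from the question.
FILM_TYPE = re.compile(rb"<http://example\.org/Film/.*?> <type> <.*?> \.")


def filter_lines(src_path, dst_path, pattern):
    """Append lines of src_path that match pattern to dst_path; return the count."""
    kept = 0
    # Input is opened first, so a missing input creates no output file.
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        for line in src:  # the file object yields one line at a time
            if pattern.search(line):
                dst.write(line)
                kept += 1
    return kept
```

Returning the count is just a convenience for checking progress; a call would look like filter_lines("dbtropes.nt", "dbtropes-v2.nt", FILM_TYPE) with the real pattern substituted.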

1 Comment

Beat me to it. Even in Python, when working with enough data you have to be careful how you handle it, because you can end up needing more memory than your computer has.
1

Use for line in input rather than readlines() to keep it from reading the whole file into memory at once.

A minor point: You don't need to close files if you open them as context managers. You might find it cleaner like this:

with open("dbtropes-v2.nt", "a") as output:
    with open("dbtropes.nt", "rb") as input:
        for line in input:
            if re.findall(regex,line):
                output.write(line)

3 Comments

I like this code sample. I might reorder the with statements, though, to open input before opening output. That way, if the input file is not present, no extra resources will be allocated, and no spurious output files will be created.
Why is it not necessary to close the file in this context? I read earlier (see here for example: stackoverflow.com/questions/5972277/write-not-working-in-python) that due to buffering a file might not get written at all if it's not properly closed.
It is not necessary to explicitly close the file because with automatically calls .close() at the end of the indented block. The file does still need to be closed: that's exactly why you use with. See the example at file.close().
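
A quick way to convince yourself of this is to check the file object's closed attribute before and after the block. This is only a sketch, assuming Python 3; the scratch file demo.txt is a throwaway created for the demonstration.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as output:
    output.write("hello\n")
    closed_inside = output.closed   # still open inside the block

closed_after = output.closed        # closed once the block exits

with open(path) as check:
    contents = check.read()         # buffered data was flushed on close
```

Because closing flushes the buffer, the written data is on disk as soon as the with block ends, which addresses the buffering concern raised in the previous comment.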
