This small script reads a file, tries to match each line with a regex, and appends matching lines to another file:
    import re

    regex = re.compile(r"<http://dbtropes.org/resource/Film/.*?> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbtropes.org/resource/Main/.*?> \.")

    with open("dbtropes-v2.nt", "a") as output, open("dbtropes.nt", "rb") as input:
        for line in input.readlines():
            if re.findall(regex,line):
                output.write(line)

    input.close()
    output.close()

However, the script abruptly stops after about 5 minutes. The terminal says "Process stopped", and the output file stays blank.
The input file can be downloaded here: http://dbtropes.org/static/dbtropes.zip It's a 4.3 GB N-Triples file.
Is there something wrong with my code? Is it something else? Any hint would be appreciated on this one!
- Run `top` to see how much memory the process is using. And/or add some progress output.
- Don't use `findall` if you're just checking whether there are any matches. Finding all the matches instead of just the first one probably won't have a huge performance impact, but it can't help, and since it's also conceptually a little confusing, it's better to just not do it.
- Use the methods of the compiled regex object (`regex.findall(line)`), not the top-level functions (`re.findall(regex, line)`). The performance impact is probably even smaller here; again, it's about readability. (Also, the methods are more flexible if you ever want to extend things to, say, ignore the first 3 characters.)
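Putting these notes together, here is a minimal sketch of how the loop might look. It assumes Python 3 (so the pattern is compiled as bytes and the output is opened in binary append mode), and it assumes the memory pressure comes from `readlines()`, which builds a list of every line before the loop starts. The function name `filter_triples` and the `report_every` parameter are hypothetical, introduced only for illustration.

```python
import re

# Pattern copied from the question, compiled once. It is a bytes pattern
# because the input file is opened in binary mode.
TYPE_TRIPLE = re.compile(
    rb"<http://dbtropes.org/resource/Film/.*?> "
    rb"<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    rb"<http://dbtropes.org/resource/Main/.*?> \."
)


def filter_triples(src_path, dst_path, report_every=1000000):
    """Append lines of src_path that match TYPE_TRIPLE to dst_path.

    Returns (lines_read, lines_written). Hypothetical helper for this sketch.
    """
    lines_read = 0
    lines_written = 0
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        # Iterating over the file object yields one line at a time,
        # instead of loading every line into a list as readlines() does.
        for line in src:
            lines_read += 1
            # search() stops at the first match, which is all a yes/no
            # check needs; it is called as a method of the compiled pattern.
            if TYPE_TRIPLE.search(line):
                dst.write(line)
                lines_written += 1
            # Progress output, so a long run is visibly still alive.
            if lines_read % report_every == 0:
                print(f"{lines_read} lines read, {lines_written} written", flush=True)
    # The with block closes both files; no explicit close() calls are needed.
    return lines_read, lines_written
```

With the file names from the question, this would be called as `filter_triples("dbtropes.nt", "dbtropes-v2.nt")`. As an example of the extra flexibility of the compiled-pattern methods, `TYPE_TRIPLE.search(line, 3)` starts searching at index 3 of the line; the top-level `re.search` function has no such position argument.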