
I am currently trying to remove the majority of lines from a large text file and write the selected information into another. I have to read the original file line by line, as the order in which the lines appear is relevant. So far, the best approach I could think of pulls only the relevant lines and rewrites them using something like:

    import re

    with open('input.txt', 'r') as input_file:
        with open('output.txt', 'w') as output_file:
            # We only have to loop through the large file once
            for line in input_file:
                # Looping through my data many times is OK as it only contains ~100 elements
                for stuff in data:
                    # Search the line
                    line_data = re.search(r"(match group a)|(match group b)", line)
                    # Verify there is indeed a match to avoid raising an exception.
                    # I found using try/except was negligibly slower here
                    if line_data:
                        if line_data.group(1):
                            output_file.write('\n')
                        elif line_data.group(2) == stuff:
                            output_file.write('stuff')

However, this program still takes ~8 hours to run on a ~1 GB file with ~120,000 matched lines. I believe the bottleneck is either the regex or the output writing, as the time taken to complete this script scales linearly with the number of matched lines.

I have tried storing the output data in memory first and writing it to the new text file at the end, but a quick test showed that it stored data at roughly the same speed it was previously writing it.
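For reference, the buffered variant I tested looked roughly like this (a minimal sketch; data is the same ~100-element list as in the script above):

    import re

    # Collect the output in a list and write it in a single call at the end,
    # instead of calling write() for every match.
    buffered = []
    with open('input.txt', 'r') as input_file:
        for line in input_file:
            for stuff in data:
                line_data = re.search(r"(match group a)|(match group b)", line)
                if line_data:
                    if line_data.group(1):
                        buffered.append('\n')
                    elif line_data.group(2) == stuff:
                        buffered.append('stuff')

    with open('output.txt', 'w') as output_file:
        output_file.write(''.join(buffered))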

If it helps, I have a Ryzen 5 1500 and 8 GB of 2133 MHz RAM. However, my RAM usage never seems to cap out.

  • @JoshDetwiler That seemed to have shaved off a few percent of the time taken and I am grateful for it but I'm hoping to find something horribly inefficient that will save me perhaps an order of magnitude of my compute time. Commented Jul 9, 2018 at 18:34

2 Answers


You could move your inner loop so that it only runs when needed. Right now, you're looping over data for every line in the large file, but you only use the stuff variable when there's a match. So just move the for stuff in data: loop inside the if block that actually uses it:

    for line in input_file:
        # Search the line
        line_data = re.search(r"(match group a)|(match group b)", line)
        # Verify there is indeed a match to avoid raising an exception.
        # I found using try/except was negligibly slower here
        if line_data:
            for stuff in data:
                if line_data.group(1):
                    output_file.write('\n')
                elif line_data.group(2) == stuff:
                    output_file.write('stuff')

3 Comments

This is it! I knew I was doing something foolish but too much time spent thinking about it turned my brain into mush. I will accept this answer in 4 minutes when it lets me. Thank you.
@Physics Have you tried running it yet? Just wondering how much of a speed up it gets you. For personal curiosity.
I just did a quick test letting the original script run for ~102 seconds and moving the for loop reduced the time to complete this task by ~30x to ~3.4 seconds.

You're generating the regex for each line, which consumes a lot of CPU. You should compile the regex once at the beginning of the search instead; that would save some cycles.
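A minimal sketch of what that looks like (the pattern is taken from the question; the ... stands in for the question's existing match handling):

    import re

    # Compile the pattern once, outside the loop, rather than passing
    # the raw pattern string to re.search() for every line.
    pattern = re.compile(r"(match group a)|(match group b)")

    with open('input.txt', 'r') as input_file:
        for line in input_file:
            line_data = pattern.search(line)  # uses the precompiled pattern
            if line_data:
                ...  # same group(1)/group(2) handling as in the question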

3 Comments

Python automatically caches the last compiled regexes. While it's true that by compiling the regex explicitly you can avoid one lookup in this cache, it's unlikely that this is actually the bottleneck.
@MatteoItalia Is that a Python 3 thing?
@JoshDetwiler: nope, it's there since Python 2; see docs.python.org/2/library/re.html#re.compile
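For anyone curious, here is a quick way to measure the difference yourself (a rough sketch only; the test string is made up, and absolute numbers will vary by machine):

    import re
    import timeit

    line = "some text containing match group b somewhere"
    compiled = re.compile(r"(match group a)|(match group b)")

    def with_string():
        # re.search() compiles on first use, then hits re's internal cache.
        return re.search(r"(match group a)|(match group b)", line)

    def with_compiled():
        # A precompiled pattern skips the cache lookup entirely.
        return compiled.search(line)

    print(timeit.timeit(with_string, number=1000000))
    print(timeit.timeit(with_compiled, number=1000000))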
