
I am currently trying to remove the majority of lines from a large text file and write the selected information into another. I have to read the original file line by line, as the order in which the lines appear is relevant. So far, the best approach I could think of pulls only the relevant lines and rewrites them using something like:

    import re

    with open('input.txt', 'r') as input_file:
        with open('output.txt', 'w') as output_file:
            # We only have to loop through the large file once
            for line in input_file:
                # Looping through my data many times is OK as it only contains ~100 elements
                for stuff in data:
                    # Search the line
                    line_data = re.search(r"(match group a)|(match group b)", line)
                    # Verify there is indeed a match to avoid raising an exception.
                    # I found using try/except was negligibly slower here
                    if line_data:
                        if line_data.group(1):
                            output_file.write('\n')
                        elif line_data.group(2) == stuff:
                            output_file.write('stuff')

However, this program still takes ~8 hours to run on a ~1 GB file with ~120,000 matched lines. I believe the bottleneck is either the regex or the output writing, as the time taken to complete this script scales linearly with the number of matched lines.

I have tried storing the output data in memory first and writing it to the new text file at the end, but a quick test showed that it stored data at roughly the same speed it was previously writing it.
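For reference, the buffered variant I tested looked roughly like this (a minimal sketch; data is the same ~100-element list as in the script above):

    import re

    # Collect the output in a list and write it in a single call at the end,
    # instead of calling write() for every match.
    buffered = []
    with open('input.txt', 'r') as input_file:
        for line in input_file:
            for stuff in data:
                line_data = re.search(r"(match group a)|(match group b)", line)
                if line_data:
                    if line_data.group(1):
                        buffered.append('\n')
                    elif line_data.group(2) == stuff:
                        buffered.append('stuff')

    with open('output.txt', 'w') as output_file:
        output_file.write(''.join(buffered))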

If it helps, I have a Ryzen 5 1500 and 8 GB of 2133 MHz RAM. However, my RAM usage never seems to cap out.

  • @JoshDetwiler That seemed to have shaved off a few percent of the time taken and I am grateful for it but I'm hoping to find something horribly inefficient that will save me perhaps an order of magnitude of my compute time. Commented Jul 9, 2018 at 18:34

2 Answers


You could move your inner loop so that it only runs when needed. Right now, you're looping over data for every line in the large file, but you only use the stuff variable when there's a match. So just move the for stuff in data: loop inside the if block that actually uses it:

    for line in input_file:
        # Search the line
        line_data = re.search(r"(match group a)|(match group b)", line)
        # Verify there is indeed a match to avoid raising an exception.
        # I found using try/except was negligibly slower here
        if line_data:
            for stuff in data:
                if line_data.group(1):
                    output_file.write('\n')
                elif line_data.group(2) == stuff:
                    output_file.write('stuff')

3 Comments

This is it! I knew I was doing something foolish but too much time spent thinking about it turned my brain into mush. I will accept this answer in 4 minutes when it lets me. Thank you.
@Physics Have you tried running it yet? Just wondering how much of a speed up it gets you. For personal curiosity.
I just did a quick test letting the original script run for ~102 seconds and moving the for loop reduced the time to complete this task by ~30x to ~3.4 seconds.

You're generating the regex for each line, which consumes a lot of CPU. You should compile the regex once at the beginning of the search instead; that would save some cycles.
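A minimal sketch of what that looks like (the pattern is taken from the question; the ... stands in for the question's existing match handling):

    import re

    # Compile the pattern once, outside the loop, rather than passing
    # the raw pattern string to re.search() for every line.
    pattern = re.compile(r"(match group a)|(match group b)")

    with open('input.txt', 'r') as input_file:
        for line in input_file:
            line_data = pattern.search(line)  # uses the precompiled pattern
            if line_data:
                ...  # same group(1)/group(2) handling as in the question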

3 Comments

Python automatically caches the last compiled regexes. While it's true that by compiling the regex explicitly you can avoid one lookup in this cache, it's unlikely that this is actually the bottleneck.
@MatteoItalia Is that a Python 3 thing?
@JoshDetwiler: nope, it's there since Python 2; see docs.python.org/2/library/re.html#re.compile
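For anyone curious, here is a quick way to measure the difference yourself (a rough sketch only; the test string is made up, and absolute numbers will vary by machine):

    import re
    import timeit

    line = "some text containing match group b somewhere"
    compiled = re.compile(r"(match group a)|(match group b)")

    def with_string():
        # re.search() compiles on first use, then hits re's internal cache.
        return re.search(r"(match group a)|(match group b)", line)

    def with_compiled():
        # A precompiled pattern skips the cache lookup entirely.
        return compiled.search(line)

    print(timeit.timeit(with_string, number=1000000))
    print(timeit.timeit(with_compiled, number=1000000))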
