I am trying to strip the majority of lines from a large text file and write the selected information to another file. I have to read the original file line by line, as the order in which the lines appear is relevant. So far, the best approach I could come up with pulls only the relevant lines and rewrites them using something like:
```python
import re

with open('input.txt', 'r') as input_file:
    with open('output.txt', 'w') as output_file:
        # We only have to loop through the large file once
        for line in input_file:
            # Looping through my data many times is OK as it only contains ~100 elements
            for stuff in data:
                # Search the line
                line_data = re.search(r"(match group a)|(match group b)", line)
                # Verify there is indeed a match to avoid raising an exception.
                # I found using try/except was negligibly slower here
                if line_data:
                    if line_data.group(1):
                        output_file.write('\n')
                    elif line_data.group(2) == stuff:
                        output_file.write('stuff')

# (Redundant: the with-blocks already close the files)
output_file.close()
input_file.close()
```

However, this script still takes ~8 hours to run on a ~1 GB file with ~120,000 matched lines. I believe the bottleneck is either the regex or the output writes, since the time taken to complete the script scales linearly with the number of matched lines.
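For what it's worth, this is roughly how I tried to isolate the cost of the regex on its own (the pattern and sample line are placeholders, and I precompiled the pattern just for this timing loop):

```python
import re
import time

# Placeholder pattern and line, only to illustrate the timing test
pattern = re.compile(r"(match group a)|(match group b)")
sample_line = "some representative line containing match group b\n"

start = time.perf_counter()
for _ in range(100_000):
    pattern.search(sample_line)
elapsed = time.perf_counter() - start
print(f"100,000 searches took {elapsed:.3f} s")
```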
I have tried storing the output data in memory first and writing it to the new text file at the end, but a quick test showed that it accumulated data at roughly the same speed as it was previously writing it.
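That in-memory attempt looked roughly like this (same placeholder pattern as above; `data` is the same ~100-element list):

```python
import re

buffered_output = []  # collect output in memory, write it all at the end

with open('input.txt', 'r') as input_file:
    for line in input_file:
        for stuff in data:  # `data` is my ~100-element list, as above
            line_data = re.search(r"(match group a)|(match group b)", line)
            if line_data:
                if line_data.group(1):
                    buffered_output.append('\n')
                elif line_data.group(2) == stuff:
                    buffered_output.append('stuff')

with open('output.txt', 'w') as output_file:
    output_file.write(''.join(buffered_output))
```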
If it helps, I have a Ryzen 5 1500 and 8 GB of 2133 MHz RAM. However, my RAM usage never seems to cap out.