
I am currently using a C++ program with a Python wrapper to manipulate a large (15 GB) text file line by line. Effectively, it reads a line from input.txt, processes it, then outputs the result to output.txt. I am using the straightforward loop here (inp being opened as input.txt, out being opened as output.txt):

    for line in inp:
        result = operate(line)
        out.write(result)

However, because of issues in the C++ program, it has some failure rate, which causes the loop to stop after about ten million iterations. This leaves me with an output file covering only about 10% of the input.

Since I have no means of fixing the original program, I thought about simply restarting it where it stopped. I counted the lines of output.txt, created another file called output2.txt, and ran the following code:

    k = 0
    for line in inp:
        if k < 12123253:
            k += 1
        else:
            result = operate(line)
            out2.write(result)
            k += 1

However, while counting the lines finished in under a minute, this method takes many hours just to reach the designated line.

Why is this method so inefficient, and is there a faster one? I am on a Windows PC with plenty of computing power (72 GB RAM, good processors), using Python 2.7.


2 Answers


I suggest you use itertools:

    import itertools

    with open(inp) as f:
        result = itertools.islice(f, start_line, None)
        for i in result:
            # do something with this line
            pass
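Applied to the question's setup, a minimal sketch might look like this (assuming the operate function and file names from the question, and that the 12123253 already-written lines should be skipped):

    import itertools

    SKIP = 12123253  # lines already present in output.txt

    with open('input.txt') as inp, open('output2.txt', 'w') as out2:
        for line in itertools.islice(inp, SKIP, None):
            out2.write(operate(line))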

1 Comment

This will still read through the whole file up to the point of interest.
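For intuition, the skipping that islice does is roughly equivalent to the sketch below (not the actual itertools implementation): every skipped line is still read from disk, so the savings come only from avoiding the per-line Python-level counting.

    def skip_lines(f, n):
        # each skipped line is still consumed from disk; only the
        # Python-level counter bookkeeping of the original loop is avoided
        for _ in xrange(n):
            f.readline()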

You may use file.seek and file.tell. Below is some sample (pseudo)code:

    def serializebreakpoint(pos):
        pass

    def deserializebreakpoint():
        '''return -1 if there is actually no break point'''
        pass

    def process(inp):
        # Use readline() rather than "for line in inp": iterating a file
        # in Python 2 uses a read-ahead buffer, so tell() would report a
        # position beyond the line actually being processed.
        pos = inp.tell()
        while True:
            line = inp.readline()
            if not line:
                break
            try:
                result = operate(line)
                pos = inp.tell()
            except Exception:
                serializebreakpoint(pos)
                raise

    def processEntry(pathtoinput):
        bp = deserializebreakpoint()
        with open(pathtoinput, 'r') as inp:
            if bp > -1:
                inp.seek(bp)
            process(inp)
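As a usage note, here is one minimal way the two stubs could be filled in, assuming the offset is persisted in a small side file (checkpoint.txt is a hypothetical name, not part of the original answer):

    import os

    CHECKPOINT = 'checkpoint.txt'  # hypothetical side file holding the input byte offset

    def serializebreakpoint(pos):
        # persist the byte offset of the first unprocessed line
        with open(CHECKPOINT, 'w') as f:
            f.write(str(pos))

    def deserializebreakpoint():
        # return -1 if there is actually no break point yet
        if not os.path.exists(CHECKPOINT):
            return -1
        with open(CHECKPOINT) as f:
            return int(f.read())

On a later run, processEntry then seeks directly to the saved byte offset instead of re-reading the skipped prefix, which is what makes this approach fast on a 15 GB file.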
