Using pdftotext, I created a text file that includes the footers from the source PDF. The footers get in the way of other parsing I need to do. The format of the footer is as follows:

    This is important text.
    9
    Title 2012 and 2013

    \fCompany
    Important text begins again.

The line for Company is the only one that does not recur elsewhere in the file. It appears as \x0cCompany\n. I would like to search for this line and remove it and the preceding three lines (the page number, title, and a blank line) based on where the \x0cCompany\n appears. This is what I have so far:

    import re

    report = open('file.txt').readlines()
    data = range(len(report))
    name = []
    for line_i in data:
        line = report[line_i]
        if re.match('.*\\x0cCompany', line):
            name.append(report[line_i])
    print(name)

This allows me to make a list storing which line numbers have this occurrence, but I do not understand how to delete these lines as well as the three preceding lines. It seems I need to create some other loop based on this loop but I cannot make it work.

1 Answer

Instead of iterating through the file to collect the indices of the lines you want to delete, iterate through the lines and append only the ones you want to keep.

It would also be more efficient to iterate over the file object itself, rather than reading the whole file into a list first:

    import re

    keeplines = []
    with open('file.txt') as b:
        for line in b:
            if re.match('.*\\x0cCompany', line):
                # Footer marker found: skip it and shave off the three preceding lines
                keeplines = keeplines[:-3]
            else:
                keeplines.append(line)

    # Rewrite the file with only the lines that were kept
    with open('file.txt', 'w') as out:
        out.writelines(keeplines)
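To check the logic without touching a real file, the same idea can be wrapped in a small function and run on an in-memory list. This is only a sketch: `strip_footers`, its `marker` and `preceding` parameters, and the sample lines are hypothetical names and data made up for illustration, and the sample assumes the three footer lines come in the order page number, title, blank line. It also swaps the regular expression for a plain substring test, which should behave the same for this marker.

```python
def strip_footers(lines, marker="\x0cCompany", preceding=3):
    """Return a new list without each marker line and the `preceding` lines before it."""
    kept = []
    for line in lines:
        if marker in line:
            # Drop the footer lines already collected. The guard matters because
            # kept[-0:] is the whole list, so preceding == 0 would delete everything.
            if preceding:
                del kept[-preceding:]
        else:
            kept.append(line)
    return kept


# Hypothetical sample mimicking the footer layout described in the question.
sample = [
    "This is important text.\n",
    "9\n",
    "Title 2012 and 2013\n",
    "\n",
    "\x0cCompany\n",
    "Important text begins again.\n",
]

cleaned = strip_footers(sample)
print(cleaned)
```

If the marker shows up before three lines have been collected, the slice simply removes whatever is there, so no index error is raised. To process the real file, pass the open file object as `lines` and write the result back with `writelines`.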