Deleting Relative Lines with Regex

Question

I used pdftotext to create a text file, and it includes the footers from the source PDF. The footers get in the way of other parsing that needs to be done. The format of the footer is as follows:

    This is important text.
    9
    Title 2012 and 2013
    \fCompany
    Important text begins again.

The line for Company is the only one that does not recur elsewhere in the file. It appears as \x0cCompany\n. I would like to search for this line and remove it and the preceding three lines (the page number, title, and a blank line) based on where the \x0cCompany\n appears. This is what I have so far:

    import re

    report = open('file.txt').readlines()
    data = range(len(report))
    name = []
    for line_i in data:
        line = report[line_i]
        if re.match('.*\\x0cCompany', line):
            name.append(report[line_i])
    print(name)

This allows me to make a list storing which line numbers have this occurrence, but I do not understand how to delete these lines as well as the three preceding lines. It seems I need to create some other loop based on this loop but I cannot make it work.

R Nar · Accepted Answer · 2016-01-31 17:58:31Z

Instead of iterating through and collecting the indices of the lines you want to delete, iterate through your lines and append only the lines that you want to keep. When you reach the footer marker, drop the last three lines you kept.

It would also be more efficient to iterate over the file object directly, rather than reading it all into one list:

    import re

    keeplines = []
    with open('file.txt') as b:
        for line in b:
            if re.match('.*\\x0cCompany', line):
                # Footer marker found: skip it and shave off the three preceding lines
                keeplines = keeplines[:-3]
            else:
                keeplines.append(line)

    # Write the kept lines back over the original file
    with open('file.txt', 'w') as out:
        for line in keeplines:
            out.write(line)
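If you want to try the logic without touching a file, the same idea can be wrapped in a small function that takes any iterable of lines. This is only a sketch: `strip_footers`, `FOOTER_MARKER`, and `n_before` are names made up for this example, and the sample data assumes the blank line comes before the page number, which the question does not state.

```python
import re

# Hypothetical names for illustration; the pattern matches a form feed
# (\x0c) followed by "Company", as described in the question.
FOOTER_MARKER = re.compile(r'\x0cCompany')


def strip_footers(lines, marker=FOOTER_MARKER, n_before=3):
    """Return the lines with each marker line and the n_before lines before it removed."""
    kept = []
    for line in lines:
        if marker.search(line):
            # Skip the marker line and drop up to n_before lines already kept.
            # Using del with an explicit start index avoids the kept[:-0]
            # pitfall, which would empty the list when n_before is 0.
            del kept[max(0, len(kept) - n_before):]
        else:
            kept.append(line)
    return kept


# Assumed layout: blank line, page number, title, then the marker line.
sample = [
    "This is important text.\n",
    "\n",
    "9\n",
    "Title 2012 and 2013\n",
    "\x0cCompany\n",
    "Important text begins again.\n",
]

cleaned = strip_footers(sample)
print(cleaned)
# ['This is important text.\n', 'Important text begins again.\n']
```

One thing to watch: if you split the text yourself, avoid `str.splitlines()`, because it treats the form feed as a line boundary and would break `\x0cCompany` apart. Iterating over the file object, as in the answer, splits on newlines only.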
