I used pdftotext to create a text file, and the output includes the footers from the source PDF. The footers get in the way of other parsing that I need to do. The format of the footer is as follows:
    This is important text.
    9
    Title 2012 and 2013

    \fCompany
    Important text begins again.

The Company line is the only one that does not recur elsewhere in the file; it appears as \x0cCompany\n. I would like to find each occurrence of this line and remove it together with the three lines before it (the page number, the title, and a blank line). This is what I have so far:
    import re

    report = open('file.txt').readlines()
    data = range(len(report))
    name = []
    for line_i in data:
        line = report[line_i]
        # \x0c is the form feed that pdftotext emits at each page break
        if re.match('.*\\x0cCompany', line):
            name.append(line_i)  # store the line number of the match
    print(name)

This gives me a list of the line numbers where the footer occurs, but I do not understand how to delete those lines along with the three lines before each of them. It seems I need another loop that builds on this one, but I cannot make it work.
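
One possible way to get from the matching line numbers to the actual deletion is to collect every index that should go into a set, then rebuild the list without those indices. The following is only a minimal sketch: the function name `strip_footers` and its parameters `marker` and `n_before` are made up for illustration, and it assumes that the footer line always starts with the form feed followed by Company and that exactly three lines precede it.

```python
def strip_footers(lines, marker="\x0cCompany", n_before=3):
    """Return a copy of lines without each marker line and the n_before lines above it."""
    drop = set()
    for i, line in enumerate(lines):
        # Assumption: the footer line begins with the form feed plus "Company".
        if line.startswith(marker):
            # Mark the marker line itself and the lines directly above it.
            drop.update(range(max(0, i - n_before), i + 1))
    return [line for i, line in enumerate(lines) if i not in drop]


# Small in-memory sample shaped like the footer described above.
sample = [
    "This is important text.\n",
    "9\n",
    "Title 2012 and 2013\n",
    "\n",
    "\x0cCompany\n",
    "Important text begins again.\n",
]
cleaned = strip_footers(sample)
print(cleaned)
```

Building the set first avoids deleting from the list while looping over it, which would shift the remaining line numbers. For the real file, the input would be something like the result of `open('file.txt').readlines()` instead of `sample`.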