0

I have a file like this :

SOME_INFO_BEGIN .... .... SOME_INFO_END ACTUAL_DETAIL_BEGIN TEST|1|23|abcd| TEST|2|5|efgs| TEST|3|124|zyz| ACTUAL_DETAIL_END 

I only to read the lines between ACTUAL_DETAILS_BEGIN and ACTUAL_DETAILS_END and they will always start with TEST, however i also only to read the line which has 5 in the 3rd column.

The below code works for me except it gets all the 3 lines -

with open(dir+filename, 'r') as filehandle: filecontent = filehandle.readlines() ifa = [k for k in filecontent if 'TEST' in k] df = pd.DataFrame([sub.split("|") for sub in ifa]) df.columns= ['Type','Amt','Desc','Value1','Value2'] df1 = df[['Type','Desc']] print(df1) df1.to_excel (dir+"test.xlsx", index = False) 

Q1. Is there a better way to code this ? For eg. how is the filehandle closed for excel write?

Q2. How do i only pick up the 2nd row ?

2
  • Why are you not opening the file in pandas directly? Commented Jun 22, 2019 at 12:55
  • Nevermind. You could use the csv module to open the file and parse the lines lazily. readlines() is putting the whole thing in memory before you even start filtering Commented Jun 22, 2019 at 12:58

1 Answer 1

1

Nothing to test with but you can iterate files and lazily load the lines. Perhaps this is more efficient:

rebuilt = [] with open(dir+filename, 'r') as infile: for row in infile: if row[:4] == 'TEST': rebuild.append(row.split('|')) df = pd.DataFrame(rebuilt, columns= ['Type','Amt','Desc','Value1','Value2']) 

readlines() is going to load the whole thing into memory, regardless, so you can filter the lines as you read them instead. You're also splitting each line before checking a condition, so it may be faster to check against a list slice.

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.