
I have a file with many fields terminated by the "|" (pipe) character. I'd like to read this file and create as many files as there are values of a specific field. Here is an example:

    L219| |791|P|PIPPO|PLUTO|1|18081926|I262|XYZXCV12D35F345S||
    L219| |1241|P|PAPERINO|TOPOLINO|2|21041937|F335|FVGHWU54G56S456U||
    L219| |437793|G|TOPOLANDIA SAS|L219|12345678910|
    L219| |437794|G|PAPERANDIA|L219|10987654321|

If the fourth field is equal to "G", the record goes into "file_pg.txt"; if it is equal to "P", it goes into "file_pf.txt".

I wrote the code below (I'm new to Python), but it takes too long to run on a huge file (300 MB). Do you have any suggestions to improve it?

    file = open('D:\\mydirectory\\soggetti.txt','r')
    file_pf = open("D:\\mydirectory\\file_pf.txt","w")
    file_pg = open("D:\\mydirectory\\file_pg.txt","w")
    file_pf.close()
    file_pg.close()
    i = 0
    with file:
        for line in file:
            i = 0
            c = 0
            while i < len(line):
                carattere = line[i]
                if carattere == "|":
                    c = c + 1
                    if c == 4:
                        if line[i-1] == "P":
                            file_pf = open("D:\\mydirectory\\file_pf.txt","a")
                            file_pf.write(line)
                            file_pf.close()
                            break
                        elif line[i-1] == "G":
                            file_pg = open("D:\\mydirectory\\file_pg.txt","a")
                            file_pg.write(line)
                            file_pg.close()
                            break
                i = i + 1
    file.close()

Thanks!

Alberto

1
  • line.split('|')[3] should give you 'P' or 'G' for each line. Opening and closing your output files for each write is also quite expensive: open them at the start, and close them both at the end. If you are worried about exceptions, use the closing context manager. Commented Oct 10, 2013 at 13:43

5 Answers

1

I'll go with:

    with open('D:\\mydirectory\\soggetti.txt','r') as source_file:
        with open("D:\\mydirectory\\file_pf.txt","w") as file_pf:
            with open("D:\\mydirectory\\file_pg.txt","w") as file_pg:
                for line in source_file:
                    if line.split("|")[3] == "P":
                        file_pf.write(line)
                    elif line.split("|")[3] == "G":
                        file_pg.write(line)

If you are concerned with speed, it might be better to split each line only once and collect the lines in lists before writing them out (note that this keeps the whole file in memory):

    listP = []
    listG = []
    with open('D:\\mydirectory\\soggetti.txt','r') as source_file:
        for line in source_file:
            # split once and reuse the result
            char = line.split("|")[3]
            if char == "P":
                listP.append(line)
            elif char == "G":
                listG.append(line)
    with open("D:\\mydirectory\\file_pf.txt","w") as file_pf:
        for line in listP:
            file_pf.write(line)
    with open("D:\\mydirectory\\file_pg.txt","w") as file_pg:
        for line in listG:
            file_pg.write(line)


0

Opening and closing files are relatively slow operations. When possible, you should only open and close a file once. In your case, you can store your p and g lines in lists, and then write all of the lines at once after the loop is over.

    file = open('D:\\mydirectory\\soggetti.txt','r')
    file_pf = open("D:\\mydirectory\\file_pf.txt","w")
    file_pg = open("D:\\mydirectory\\file_pg.txt","w")
    file_pf.close()
    file_pg.close()
    p_lines = []
    g_lines = []
    i = 0
    with file:
        for line in file:
            i = 0
            c = 0
            while i < len(line):
                carattere = line[i]
                if carattere == "|":
                    c = c + 1
                    if c == 4:
                        if line[i-1] == "P":
                            p_lines.append(line)
                            break
                        elif line[i-1] == "G":
                            g_lines.append(line)
                            break
                i = i + 1
    file.close()
    file_pf = open("D:\\mydirectory\\file_pf.txt","w")
    file_pf.writelines(p_lines)
    file_pf.close()
    file_pg = open("D:\\mydirectory\\file_pg.txt","w")
    file_pg.writelines(g_lines)
    file_pg.close()

You can also more easily identify the contents of the fields in each line by using split.

    file = open('D:\\mydirectory\\soggetti.txt','r')
    file_pf = open("D:\\mydirectory\\file_pf.txt","w")
    file_pg = open("D:\\mydirectory\\file_pg.txt","w")
    file_pf.close()
    file_pg.close()
    p_lines = []
    g_lines = []
    with file:
        for line in file:
            fields = line.split("|")
            if fields[3] == "P":
                p_lines.append(line)
            elif fields[3] == "G":
                g_lines.append(line)
    file.close()
    file_pf = open("D:\\mydirectory\\file_pf.txt","w")
    file_pf.writelines(p_lines)
    file_pf.close()
    file_pg = open("D:\\mydirectory\\file_pg.txt","w")
    file_pg.writelines(g_lines)
    file_pg.close()

By the way, strictly speaking, you don't need to use with and explicitly close the file once you're done with it. You can do one or the other. And it isn't necessary to open and immediately close file_pf and file_pg at the beginning of the script.

    p_lines = []
    g_lines = []
    with open('D:\\mydirectory\\soggetti.txt','r') as file:
        for line in file:
            fields = line.split("|")
            if fields[3] == "P":
                p_lines.append(line)
            elif fields[3] == "G":
                g_lines.append(line)
    file_pf = open("D:\\mydirectory\\file_pf.txt","w")
    file_pf.writelines(p_lines)
    file_pf.close()
    file_pg = open("D:\\mydirectory\\file_pg.txt","w")
    file_pg.writelines(g_lines)
    file_pg.close()

If you want to have more line types other than "p" and "g" in the future, it may save you some time to store each kind of line in a dictionary:

    from collections import defaultdict

    # maps each line type (lowercased fourth field) to the lines of that type
    lines_to_write = defaultdict(list)

    with open('D:\\mydirectory\\soggetti.txt','r') as file:
        for line in file:
            fields = line.split("|")
            lineType = fields[3].lower()
            lines_to_write[lineType].append(line)

    for lineType, lines in lines_to_write.items():
        # note: this pattern gives file_pf.txt for "P" and file_gf.txt for "G"
        filename = "D:\\mydirectory\\file_{}f.txt".format(lineType)
        with open(filename,"w") as file:
            file.writelines(lines)
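The dictionary idea can also be applied to open file handles rather than lists of lines, so that a 300 MB input never has to sit in memory. The following is only a sketch under assumptions: the function name `split_by_type`, its parameters, and the `file_<type>.txt` naming scheme are made up for illustration and are not part of the question.

```python
import os


def split_by_type(source_path, out_dir, field_index=3):
    """Stream each record into one output file per value of the chosen field.

    Returns a dict mapping each (lowercased) field value to the number of
    lines written for it.
    """
    handles = {}  # line type -> open output file
    counts = {}   # line type -> number of lines written
    try:
        with open(source_path, "r") as source:
            for line in source:
                fields = line.split("|")
                if len(fields) <= field_index:
                    continue  # skip blank or malformed lines
                line_type = fields[field_index].lower()
                if line_type not in handles:
                    # Hypothetical naming scheme: file_<type>.txt
                    path = os.path.join(out_dir, "file_{}.txt".format(line_type))
                    handles[line_type] = open(path, "w")
                    counts[line_type] = 0
                handles[line_type].write(line)
                counts[line_type] += 1
    finally:
        # Close every output file exactly once, even if an error occurred.
        for handle in handles.values():
            handle.close()
    return counts
```

Each output file is opened the first time its type is seen and closed once at the end, so the cost of opening files does not grow with the number of lines.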

You can report to the user how many lines have been processed, by keeping track of what line number you are on, and periodically printing messages.

    how_often_to_report = 100 #prints message every one hundred lines
    with open('D:\\mydirectory\\soggetti.txt','r') as file:
        for line_number, line in enumerate(file):
            if line_number % how_often_to_report == 0:
                print("{} lines processed".format(line_number))
            #do rest of processing work here

2 Comments

Is it possible to add a counter while the process is running, to see how many records have been processed?
Yes, you can determine the number of records processed by tracking the current line number, using enumerate. Edited.
0
    P = empty list
    G = empty list
    for each line read from file
        split on |
        if splitted_line[index] is equal to P
            add line to P
        elif splitted_line[index] is equal to G
            add line to G
    open file for P
    write all lines in P
    close file for P
    open file for G
    write all lines in G
    close file for G
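One possible Python rendering of the steps above is sketched here. The function name `split_pf_pg` and its parameters are placeholders chosen for the example; `index=3` assumes the record type sits in the fourth field, as in the question.

```python
def split_pf_pg(source_path, pf_path, pg_path, index=3):
    """Collect "P" and "G" records in lists, then write each list once."""
    p_lines = []
    g_lines = []
    with open(source_path, "r") as source:
        for line in source:
            fields = line.split("|")
            if len(fields) <= index:
                continue  # not enough fields, e.g. a blank line
            if fields[index] == "P":
                p_lines.append(line)
            elif fields[index] == "G":
                g_lines.append(line)
    # Each output file is opened, written, and closed exactly once.
    with open(pf_path, "w") as file_pf:
        file_pf.writelines(p_lines)
    with open(pg_path, "w") as file_pg:
        file_pg.writelines(g_lines)
    return len(p_lines), len(g_lines)
```

The returned pair of counts is an addition for convenience; the steps themselves only require the two writes.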


0

I haven't tested this, but something like the code below should be quicker:

    file = open('D:\\mydirectory\\soggetti.txt','r')
    file_pf = open("D:\\mydirectory\\file_pf.txt","a")
    file_pg = open("D:\\mydirectory\\file_pg.txt","a")
    for line in file:
        bits = line.split("|")
        if bits[3] == "P":
            file_pf.write(line)
        if bits[3] == "G":
            file_pg.write(line)
    file.close()
    file_pf.close()
    file_pg.close()


0

The code below should be faster than what you are doing, because:

  1. You are not looping over every character.
  2. You are not opening the files every time you have to write.
  3. There are fewer if conditions to evaluate.

    file = open('D:\\mydirectory\\soggetti.txt','r')
    file_pf = open("D:\\mydirectory\\file_pf.txt","w")
    file_pg = open("D:\\mydirectory\\file_pg.txt","w")
    file_pf.close()
    file_pg.close()
    file_pf = open("D:\\mydirectory\\file_pf.txt","a")
    file_pg = open("D:\\mydirectory\\file_pg.txt","a")
    with file:
        for line in file:
            switch = line.split('|')[3]
            # every line that is not "P" is written to file_pg
            write = file_pf.write if 'P' in switch else file_pg.write
            write(line)
    file_pg.close()
    file_pf.close()
    file.close()

1 Comment

I believe you need to leave out the parentheses in your write = ... line, or else write won't refer to the function object you want.
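To illustrate the point about parentheses, here is a small sketch that uses in-memory `io.StringIO` buffers as stand-ins for the two output files (the buffers and the sample lines are assumptions made for the demonstration, not part of the answer).

```python
import io

out_p = io.StringIO()  # stand-in for file_pf
out_g = io.StringIO()  # stand-in for file_pg

for line in ["L219| |791|P|PIPPO|\n", "L219| |437793|G|TOPOLANDIA SAS|\n"]:
    switch = line.split("|")[3]
    # No parentheses: `write` is bound to the method itself and called below.
    write = out_p.write if "P" in switch else out_g.write
    write(line)

# With parentheses the method is called immediately, with no argument,
# so a TypeError is raised before anything is assigned.
called_too_early = False
try:
    write = out_p.write()
except TypeError:
    called_too_early = True
```

After the loop, each buffer holds only the line of its own type, and the attempt with parentheses fails as the comment predicts.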
