
I am splitting a text file using the number of lines as a variable. I wrote this function in order to save the split files in a temporary directory. Each file has 4 million lines except the last file.

import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)
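For readers unfamiliar with this groupby idiom: line=count() creates a single counter as a default argument, so the key function returns next(line) // chunk and consecutive runs of chunk lines share the same group key. A minimal demonstration with a toy chunk size of 3 (illustrative data, not from the post):

from itertools import groupby, count

lines = ['a\n', 'b\n', 'c\n', 'd\n', 'e\n', 'f\n', 'g\n']
# The counter is created once (as a default argument) and shared across
# calls, so the key sequence is 0, 0, 0, 1, 1, 1, 2 for a chunk size of 3.
groups = groupby(lines, key=lambda k, line=count(): next(line) // 3)
for k, group in groups:
    print k, list(group)
# 0 ['a\n', 'b\n', 'c\n']
# 1 ['d\n', 'e\n', 'f\n']
# 2 ['g\n']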

The main problem is the speed of this function. Splitting one file of 8 million lines into two files of 4 million lines takes more than 30 minutes on my Windows OS with Python 2.7.

4 Answers

for line in group:
    with open(output_name, 'a') as outfile:
        outfile.write(line)

is opening the file, and writing one line, for each line in group. This is slow.

Instead, write once per group.

with open(output_name, 'a') as outfile:
    outfile.write(''.join(group))

6 Comments

Thanks unutbu. line is a string. Could I change "a" to "w"?
Yes, you could change a to w.
Is there a "speed" difference between "a" and "w"?
Hey unutbu, change outfile.write(''.join(group)) to outfile.write(line), because with ''.join(group) the function saves one file for each line (~8 million files).
Gianni, keep the outfile.write(''.join(group)) and remove the for line in group entirely.
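Putting unutbu's answer and the comment thread together, a corrected version of the OP's function might look like this (a sketch, untested; 'w' is used instead of 'a', per the comments):

import os
import tempfile
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.join(temp_dir, "tempfile_%s.tmp" % k)
            # Open once per group and write the whole chunk in one call;
            # the inner "for line in group" loop is removed entirely.
            with open(output_name, 'w') as outfile:
                outfile.write(''.join(group))

Note that ''.join(group) buffers an entire chunk in memory before writing; outfile.writelines(group) would stream the same lines without building one large string.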

Just did a quick test with an 8 million line file (uptime lines) to run the length of the file and split the file in half. Basically, one pass to get the line count, a second pass to do the split write.

On my system, the first pass took about 2-3 seconds. Completing the run and writing the split files took under 21 seconds total.

Did not implement the lambda functions in the OP's post. Code used below:

#!/usr/bin/env python
import sys
import math

infile = open("input", "r")
linecount = 0
for line in infile:
    linecount = linecount + 1
splitpoint = linecount / 2
infile.close()

infile = open("input", "r")
outfile1 = open("output1", "w")
outfile2 = open("output2", "w")
print linecount, splitpoint
linecount = 0
for line in infile:
    linecount = linecount + 1
    if (linecount <= splitpoint):
        outfile1.write(line)
    else:
        outfile2.write(line)
infile.close()
outfile1.close()
outfile2.close()
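As written, the script expects a file named input in the working directory and writes the two halves to output1 and output2.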

No, it's not going to win any performance or code elegance tests. :) But short of something else being a performance bottleneck, the lambda functions causing the file to be cached in memory and forcing a swap issue, or the lines in the file being extremely long, I don't see why it would take 30 minutes to read/split the 8 million line file.

EDIT:

My environment: Mac OS X, storage was a single FW800 connected hard drive. File was created fresh to avoid filesystem caching benefits.

Comments


You can use tempfile.NamedTemporaryFile directly in the context manager:

import tempfile
import time
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4*10**6):
    fns = {}
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            with tempfile.NamedTemporaryFile(delete=False, dir=temp_dir,
                                             prefix='{}_'.format(str(k))) as outfile:
                outfile.write(''.join(group))
                fns[k] = outfile.name
    return fns

def make_test(size=8*10**6 + 1000):
    with tempfile.NamedTemporaryFile(delete=False) as fn:
        for i in xrange(size):
            fn.write('Line {}\n'.format(i))
    return fn.name

fn = make_test()
t0 = time.time()
print tempfile_split(fn, tempfile.mkdtemp()), time.time() - t0
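Note that delete=False is what keeps the chunk files around: by default, NamedTemporaryFile deletes its file as soon as it is closed, which here happens when each with block exits.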

On my computer, the tempfile_split part runs in 3.6 seconds, on OS X.

1 Comment

Perhaps explain why using the context manager changes things?

If you're in a Linux or Unix environment, you could cheat a little and use the split command from inside Python. It does the trick for me, and it's very fast too:

import os
import subprocess

def split_file(file_path, chunk=4000):
    p = subprocess.Popen(['split', '-a', '2', '-l', str(chunk), file_path,
                          os.path.dirname(file_path) + '/'],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.communicate()
    # Remove the original file if required
    try:
        os.remove(file_path)
    except OSError:
        pass
    return True
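A hypothetical call (the path and chunk size are illustrative, not from the answer):

# Splits into 4000-line pieces named /tmp/data/aa, /tmp/data/ab, ...
# ('split' appends a two-character suffix because of '-a', '2'),
# then deletes /tmp/data/input.txt.
split_file('/tmp/data/input.txt', chunk=4000)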

Comments
