
I am splitting a text file using the number of lines as a variable. I wrote this function in order to save the split files in a temporary directory. Each file has 4 million lines except the last file.

import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)
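For readers unfamiliar with this groupby idiom: line=count() creates a single counter as a default argument, so the key function returns next(line) // chunk and consecutive runs of chunk lines share the same group key. A minimal demonstration with a toy chunk size of 3 (illustrative data, not from the post):

from itertools import groupby, count

lines = ['a\n', 'b\n', 'c\n', 'd\n', 'e\n', 'f\n', 'g\n']
# The counter is created once (as a default argument) and shared across
# calls, so the key sequence is 0, 0, 0, 1, 1, 1, 2 for a chunk size of 3.
groups = groupby(lines, key=lambda k, line=count(): next(line) // 3)
for k, group in groups:
    print k, list(group)
# 0 ['a\n', 'b\n', 'c\n']
# 1 ['d\n', 'e\n', 'f\n']
# 2 ['g\n']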

The main problem is the speed of this function. Splitting one file of 8 million lines into two files of 4 million lines takes more than 30 minutes on my Windows OS with Python 2.7.

4 Answers

for line in group:
    with open(output_name, 'a') as outfile:
        outfile.write(line)

is opening the file, and writing one line, for each line in group. This is slow.

Instead, write once per group.

with open(output_name, 'a') as outfile:
    outfile.write(''.join(group))

6 Comments

Thanks unutbu. line is a string. Could I change "a" to "w"?
Yes, you could change a to w.
Is there a "speed" difference between "a" and "w"?
Hey unutbu, change outfile.write(''.join(group)) to outfile.write(line), because with ''.join(group) the function saves one file for each line (~8 million files).
Gianni, keep the outfile.write(''.join(group)) and remove the for line in group entirely.
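Putting unutbu's answer and the comment thread together, a corrected version of the OP's function might look like this (a sketch, untested; 'w' is used instead of 'a', per the comments):

import os
import tempfile
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.join(temp_dir, "tempfile_%s.tmp" % k)
            # Open once per group and write the whole chunk in one call;
            # the inner "for line in group" loop is removed entirely.
            with open(output_name, 'w') as outfile:
                outfile.write(''.join(group))

Note that ''.join(group) buffers an entire chunk in memory before writing; outfile.writelines(group) would stream the same lines without building one large string.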

Just did a quick test with an 8 million line file (uptime lines) to run the length of the file and split the file in half. Basically, one pass to get the line count, a second pass to do the split write.

On my system, the first pass took about 2-3 seconds. Completing the run and writing the split files took under 21 seconds total.

Did not implement the lambda functions in the OP's post. Code used below:

#!/usr/bin/env python
import sys
import math

infile = open("input", "r")
linecount = 0
for line in infile:
    linecount = linecount + 1
splitpoint = linecount / 2
infile.close()

infile = open("input", "r")
outfile1 = open("output1", "w")
outfile2 = open("output2", "w")
print linecount, splitpoint
linecount = 0
for line in infile:
    linecount = linecount + 1
    if (linecount <= splitpoint):
        outfile1.write(line)
    else:
        outfile2.write(line)
infile.close()
outfile1.close()
outfile2.close()
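As written, the script expects a file named input in the working directory and writes the two halves to output1 and output2.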

No, it's not going to win any performance or code elegance tests. :) But short of something else being a performance bottleneck, the lambda functions causing the file to be cached in memory and forcing a swap issue, or the lines in the file being extremely long, I don't see why it would take 30 minutes to read/split the 8 million line file.

EDIT:

My environment: Mac OS X, storage was a single FW800 connected hard drive. File was created fresh to avoid filesystem caching benefits.

Comments


You can use tempfile.NamedTemporaryFile directly in the context manager:

import tempfile
import time
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4*10**6):
    fns = {}
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            with tempfile.NamedTemporaryFile(delete=False, dir=temp_dir,
                                             prefix='{}_'.format(str(k))) as outfile:
                outfile.write(''.join(group))
                fns[k] = outfile.name
    return fns

def make_test(size=8*10**6 + 1000):
    with tempfile.NamedTemporaryFile(delete=False) as fn:
        for i in xrange(size):
            fn.write('Line {}\n'.format(i))
    return fn.name

fn = make_test()
t0 = time.time()
print tempfile_split(fn, tempfile.mkdtemp()), time.time() - t0
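Note that delete=False is what keeps the chunk files around: by default, NamedTemporaryFile deletes its file as soon as it is closed, which here happens when each with block exits.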

On my computer, the tempfile_split part runs in 3.6 seconds, on OS X.

1 Comment

Perhaps explain why using the context manager changes things?

If you're in a Linux or Unix environment, you could cheat a little and use the split command from inside Python. It does the trick for me, and it's very fast too:

import os
import subprocess

def split_file(file_path, chunk=4000):
    p = subprocess.Popen(['split', '-a', '2', '-l', str(chunk), file_path,
                          os.path.dirname(file_path) + '/'],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.communicate()
    # Remove the original file if required
    try:
        os.remove(file_path)
    except OSError:
        pass
    return True
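A hypothetical call (the path and chunk size are illustrative, not from the answer):

# Splits into 4000-line pieces named /tmp/data/aa, /tmp/data/ab, ...
# ('split' appends a two-character suffix because of '-a', '2'),
# then deletes /tmp/data/input.txt.
split_file('/tmp/data/input.txt', chunk=4000)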

Comments
