
I have a text file, say really_big_file.txt, that contains:

line 1
line 2
line 3
line 4
...
line 99999
line 100000

I would like to write a Python script that divides really_big_file.txt into smaller files of 300 lines each. For example, small_file_300.txt would have lines 1-300, small_file_600.txt lines 301-600, and so on until there are enough small files to contain all the lines of the big file.

I would appreciate any suggestions on the easiest way to accomplish this using Python.


10 Answers

lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()



Using the itertools grouper recipe:

from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

n = 300
with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}'.format(i * n), 'w') as fout:
            fout.writelines(g)

The advantage of this method, as opposed to storing each line in a list, is that it works with iterables, line by line, so it doesn't have to store each small_file in memory at once.

Note that the last file in this case will be small_file_100200 but will only go up to line 100000. This happens because fillvalue='', meaning I write nothing to the file when I have no more lines left to write, because the group size doesn't divide evenly. You can fix this by writing to a temp file and renaming it afterwards, instead of naming it first as I have. Here's how that can be done.

import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
        with tempfile.NamedTemporaryFile('w', delete=False) as fout:
            for j, line in enumerate(g, 1):  # count number of lines in group
                if line is None:
                    j -= 1  # don't count this line
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))

This time fillvalue=None, and I go through each line checking for None. When it occurs, I know the process has finished, so I subtract 1 from j to not count the filler, and then rename the file.

2 Comments

If you are using the first script in python 3.x, replace the izip_longest with the new zip_longest docs.python.org/3/library/itertools.html#itertools.zip_longest
@YuvalPruss I updated based on your comment now that Py3 is the standard
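To make the comments concrete, here is a quick demonstration of what the grouper recipe produces with the Python 3 zip_longest (the 'x' fill value is the one from the recipe's own docstring example):

```python
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Each group is a tuple of n items, padded at the end with the fillvalue
groups = [''.join(g) for g in grouper(3, 'ABCDEFG', fillvalue='x')]
print(groups)  # ['ABC', 'DEF', 'Gxx']
```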

I do this in a more understandable way, using fewer shortcuts, in order to give you a further understanding of how and why this works. Previous answers work, but if you are not familiar with certain built-in functions, you will not understand what the function is doing.

Because you posted no code, I decided to do it this way, since you could be unfamiliar with things other than basic Python syntax, given that the way you phrased the question made it seem as though you had not tried, nor had any clue how to approach the question.

Here are the steps to do this in basic python:

First, you should read your file into a list for safekeeping:

my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file, 'r') as text_file:
    for row in text_file:
        hold_lines.append(row)

Second, you need to set up a way of creating the new files by name. I would suggest a loop along with a couple of counters:

outer_count = 1
line_count = 0
sorting = True
while sorting:
    count = 0
    increment = (outer_count - 1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"

Third, inside that loop you need some nested loops that will save the correct rows into a list:

hold_new_lines = []
if left < 300:
    while count < left:
        hold_new_lines.append(hold_lines[line_count])
        count += 1
        line_count += 1
    sorting = False
else:
    while count < 300:
        hold_new_lines.append(hold_lines[line_count])
        count += 1
        line_count += 1

Last thing: again in your first loop, you need to write the new file and increment your outer counter so the loop will go around again and write a new file.

outer_count += 1
with open(file_name, 'w') as next_file:
    for row in hold_new_lines:
        next_file.write(row)

Note: if the number of lines is not divisible by 300, the last file will have a name that does not correspond to its last line.
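If that bothers you, one possible workaround (a sketch, not part of the original answer) is to cap the number in the filename at the total line count:

```python
# Hypothetical values: 100000 total lines, split into chunks of 300
total_lines = 100000
outer_count = 334          # the final chunk index when splitting by 300
last_line = min(outer_count * 300, total_lines)
file_name = "small_file_" + str(last_line) + ".txt"
print(file_name)  # small_file_100000.txt
```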

It is important to understand why these loops work. You have it set up so that on the next pass, the name of the file that you write changes, because the name depends on a changing variable. This is a very useful scripting tool for file accessing, opening, writing, organizing, etc.
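To see that idea in isolation, here is a tiny sketch of filenames that change because they depend on a loop counter, as in the answer's while loop:

```python
# Build filenames from a changing counter variable
names = []
for outer_count in range(1, 4):
    names.append("small_file_" + str(outer_count * 300) + ".txt")
print(names)  # ['small_file_300.txt', 'small_file_600.txt', 'small_file_900.txt']
```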

In case you could not follow what was in which loop, here is the entirety of the function:

my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file, 'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count - 1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name, 'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)


lines_per_file = 300  # Lines in each small file
lines = []            # Stores lines not yet written to a small file
lines_counter = 0     # Same as len(lines)
created_files = 0     # Counts how many small files have been created

with open('really_big_file.txt') as big_file:
    for line in big_file:  # Go through the whole big file
        lines.append(line)
        lines_counter += 1
        if lines_counter == lines_per_file:
            idx = lines_per_file * (created_files + 1)
            # Write all buffered lines to a small file
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                small_file.write(''.join(lines))  # lines already end in '\n'
            lines = []  # Reset variables
            lines_counter = 0
            created_files += 1  # One more small file has been created

# After the for-loop has finished
if lines_counter:  # There are still some lines not written to a file?
    idx = lines_per_file * (created_files + 1)
    # Write them to a last small file
    with open('small_file_%s.txt' % idx, 'w') as small_file:
        small_file.write(''.join(lines))
    created_files += 1

print('%s small files (with up to %s lines each) were created.' % (created_files, lines_per_file))

2 Comments

The only thing is that with this method you have to store each small_file in memory at once before writing it, which may or may not be a problem. Of course, you could fix that by changing this to write to the file line by line.
name 'strored_lines' is not defined
import csv
import os
import re

MAX_CHUNKS = 300

def writeRow(idr, row):
    # Python 3: open csv files in text mode with newline=''
    with open("file_%d.csv" % idr, 'a', newline='') as file:
        writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
        writer.writerow(row)

def cleanup():
    for f in os.listdir("."):
        if re.search("file_.*", f):
            os.remove(os.path.join(".", f))

def main():
    cleanup()
    with open("large_file.csv", newline='') as results:
        r = csv.reader(results, delimiter=',', quotechar='"')
        idr = 1
        for i, x in enumerate(r):
            temp = i + 1
            if not (temp % (MAX_CHUNKS + 1)):
                idr += 1
            writeRow(idr, x)

if __name__ == "__main__":
    main()

3 Comments

Hey, quick question: would you mind explaining why you use quotechar='\"'? Thanks.
I was using it because I had a different quote char (|) in my case. You can skip setting this one, as the default quote character is the double quote (").
For people concerned about speed: a CSV file with 98,500 records (about 13 MB in size) was split by this code in about 2.31 seconds. I'd say that's pretty good.
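For reference, quotechar sets the character the csv module wraps fields in. A small illustration using an in-memory buffer and the pipe character mentioned in the comments:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter=',', quotechar='|', quoting=csv.QUOTE_ALL)
writer.writerow(['a', 'b,c'])  # the comma inside 'b,c' must be protected
print(buf.getvalue().strip())  # |a|,|b,c|
```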

Set files to the number of files you want to split the master file into. In my example, I want to get 10 files from my master file.

files = 10
with open("data.txt", "r") as data:
    emails = data.readlines()
batchs = int(len(emails) / files)
for id, log in enumerate(emails):
    fileid = id // batchs  # integer division, so fileid is a whole batch index
    file = open("minifile{file}.txt".format(file=int(fileid) + 1), 'a+')
    file.write(log)
    file.close()

1 Comment

Thanks @JoeVenner, I tried that approach but it's too slow for big files.
with open('/really_big_file.txt') as infile:
    file_line_limit = 300
    counter = -1
    file_index = 0
    outfile = None
    for line in infile.readlines():
        counter += 1
        if counter % file_line_limit == 0:
            # close old file
            if outfile is not None:
                outfile.close()
            # create new file
            file_index += 1
            outfile = open('small_file_%03d.txt' % file_index, 'w')
        # write to file
        outfile.write(line)
    # close the last file
    if outfile is not None:
        outfile.close()


I had to do the same with 650,000-line files.

Use the enumerate index and integer-divide it (//) by the chunk size.

When that number changes, close the current file and open a new one.

This is a Python 3 solution using format strings.

chunk = 50000  # number of lines from the big file to put in each small file
this_small_file = open('./a_folder/0', 'a')

with open('massive_web_log_file') as file_to_read:
    for i, line in enumerate(file_to_read.readlines()):
        file_name = f'./a_folder/{i // chunk}'
        print(i, file_name)  # a bit of feedback that slows the process down a bit
        if file_name == this_small_file.name:
            this_small_file.write(line)
        else:
            # chunk boundary: switch files before writing the line
            this_small_file.close()
            this_small_file = open(f'{file_name}', 'a')
            this_small_file.write(line)
this_small_file.close()

2 Comments

You can get a significant speedup by commenting out print(i, file_name).
Also by changing file_to_read.readlines() to just file_to_read.
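Putting both comments together (drop the print, iterate the file object lazily), the approach can be sketched as a helper function; split_by_chunk and its arguments are hypothetical names for illustration, not part of the original answer:

```python
import os

def split_by_chunk(src, dst_dir, chunk):
    # Write line i of src to dst_dir/<i // chunk>, streaming lazily
    os.makedirs(dst_dir, exist_ok=True)
    current = open(os.path.join(dst_dir, '0'), 'a')
    with open(src) as file_to_read:
        for i, line in enumerate(file_to_read):  # no readlines(), no print
            file_name = os.path.join(dst_dir, str(i // chunk))
            if file_name != current.name:
                current.close()
                current = open(file_name, 'a')
            current.write(line)
    current.close()
```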

A very easy way, if you want to split it into 2 files for example:

with open("myInputFile.txt", 'r') as file:
    lines = file.readlines()

with open("OutputFile1.txt", 'w') as file:
    for line in lines[:int(len(lines) / 2)]:
        file.write(line)

with open("OutputFile2.txt", 'w') as file:
    for line in lines[int(len(lines) / 2):]:
        file.write(line)

Making that dynamic would be:

with open("inputFile.txt", 'r') as file:
    lines = file.readlines()

Batch = 10
end = 0
for i in range(1, Batch + 1):
    if i == 1:
        start = 0
    increase = int(len(lines) / Batch)
    end = end + increase
    with open("splitText_" + str(i) + ".txt", 'w') as file:
        for line in lines[start:end]:
            file.write(line)
    start = end



In Python, files are simple iterators. That gives us the option to iterate over them multiple times, always continuing from the place the previous iteration stopped. Keeping this in mind, we can use islice to get the next 300 lines of the file each time in a continuous loop. The tricky part is knowing when to stop. For this we "sample" the file for the next line, and once it is exhausted we can break the loop:

from itertools import islice

lines_per_file = 300
with open("really_big_file.txt") as file:
    i = 1
    while True:
        try:
            checker = next(file)
        except StopIteration:
            break
        with open(f"small_file_{i * lines_per_file}.txt", 'w') as out_file:
            out_file.write(checker)
            for line in islice(file, lines_per_file - 1):
                out_file.write(line)
        i += 1

