
I am having trouble reading and writing moderately sized Excel files in pandas. I have 5 files, each around 300 MB. I need to combine them into one, do some processing, and then save the result (preferably as Excel):

import pandas as pd

f1 = pd.read_excel('File_1.xlsx')
f2 = pd.read_excel('File_2.xlsx')
f3 = pd.read_excel('File_3.xlsx')
f4 = pd.read_excel('File_4.xlsx')
f5 = pd.read_excel('File_5.xlsx')

FULL = pd.concat([f1, f2, f3, f4, f5], axis=0, ignore_index=True, sort=False)
FULL.to_excel('filename.xlsx', index=False)

But unfortunately reading takes way too long (around 15 minutes), and writing used up 100% of the memory on my 16 GB RAM PC and was taking so long that I was forced to interrupt the program. Is there any way I could speed up both the read and the write?

  • The problem is the attempt to do everything in memory. You've loaded 5×300 MB; that's 1.5 GB if not 3 GB or more. An xlsx file is a ZIP package containing XML files, so the actual data size can be a lot bigger. Then you create a concatenated frame, that's another 1.5 GB (or 3 GB) in RAM. Then you try to export it in one go, which means generating the XML content in memory before saving it to a new ZIP package. Commented Jan 23, 2020 at 11:11
  • @PanagiotisKanavos I tried using the del keyword to delete variables before attempting to_excel(), but the memory % remained the same. Commented Jan 23, 2020 at 11:13
  • That's answered here. Remove all references to the intermediate dataframes. Loading the files in a loop, using the same variable for each dataframe and appending it to a "master" dataframe should be enough (see the sketch after these comments). You'll still be using 2x-3x more RAM than necessary, though. Commented Jan 23, 2020 at 11:15
  • Check this question - stackoverflow.com/questions/37756991/… Commented Jan 23, 2020 at 11:15
  • @DishinHGoyani that's what the OP is already doing and got into trouble. Commented Jan 23, 2020 at 11:16
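
A minimal sketch of the loop-and-reuse idea from the comment above (file names are copied from the question; as the comment notes, the fully concatenated frame still has to fit in RAM):

import pandas as pd

files = ['File_1.xlsx', 'File_2.xlsx', 'File_3.xlsx', 'File_4.xlsx', 'File_5.xlsx']

full = pd.DataFrame()
for path in files:
    df = pd.read_excel(path)                # reuse one variable per source file
    full = pd.concat([full, df], axis=0, ignore_index=True, sort=False)
    del df                                  # drop the per-file frame before reading the next one

full.to_excel('filename.xlsx', index=False)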

2 Answers


In this post, a handy function append_df_to_excel() is defined.

You can use that function to read the files one by one and append their contents to the final Excel file. This saves RAM, since you are not keeping all the files in memory at once.

files = ['File_1.xlsx', 'File_2.xlsx', ...]

for file in files:
    df = pd.read_excel(file)
    append_df_to_excel('filename.xlsx', df)

Depending on your input files, you may need to pass some extra arguments to the function. Check the linked post for extra info.

Note that you can use df.to_csv() with mode='a' to append to a CSV file. Most of the time you can swap Excel files for CSV easily. If that is your case as well, I would suggest this method instead of the custom function.
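
For the CSV route, a rough sketch might look like this (the output name full.csv is just an example; the header is written only once so the appended chunks line up):

import pandas as pd

files = ['File_1.xlsx', 'File_2.xlsx', 'File_3.xlsx', 'File_4.xlsx', 'File_5.xlsx']

for i, path in enumerate(files):
    df = pd.read_excel(path)
    # overwrite on the first file, append afterwards; write the header only once
    df.to_csv('full.csv',
              mode='w' if i == 0 else 'a',
              header=(i == 0),
              index=False)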


5 Comments

So I just need to create a file (Excel or CSV) called "full.csv", for instance, and then append the read files to it via to_csv(mode='a')?
If your Excel file is a simple table, yes: df.to_csv("file.csv", mode="a"). Check on Google what a CSV is; if it is enough for you, go for it. I personally prefer CSV over Excel for storing tabular data.
Ok, I will try that. I have worked with CSV before; I did not know that writing to it was faster, though.
@Ach113 CSV is nothing more than a text file with specific delimiters. That means it's a lot larger, too, since there's no compression. Concatenating CSVs is essentially the same as concatenating text files. Stream processing is a lot faster too, as you're just reading single lines. Writing is easy, even in multithreading scenarios, because you're just appending lines to a file, something typically supported at the OS level.
@Ach113 that's why cloud event processing and analytics systems use CSV or JSON-per-line. You can even partition a file simply by seeking to the closest newline, and have each thread (or process) work on a different file block

Not ideal (and dependent on use case), but I've always found it much quicker to load the XLSX in Excel and save it as a CSV file. I tend to do multiple reads on the data, so in the long run the time spent waiting for the XLSX to load outweighs the time it takes to convert the file.
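
If the files have already been re-saved as CSV this way, the combine step itself becomes cheap; a quick sketch, assuming the converted files sit next to the originals with the same base names:

import pandas as pd

# Hypothetical names for the manually converted files.
csv_files = ['File_1.csv', 'File_2.csv', 'File_3.csv', 'File_4.csv', 'File_5.csv']

# read_csv is typically far faster than read_excel for files of this size
full = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
full.to_csv('full.csv', index=False)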

