
I am having trouble reading and writing moderately sized Excel files in pandas. I have 5 files, each around 300 MB. I need to combine them into one, do some processing, and then save the result (preferably as Excel):

import pandas as pd

f1 = pd.read_excel('File_1.xlsx')
f2 = pd.read_excel('File_2.xlsx')
f3 = pd.read_excel('File_3.xlsx')
f4 = pd.read_excel('File_4.xlsx')
f5 = pd.read_excel('File_5.xlsx')

FULL = pd.concat([f1, f2, f3, f4, f5], axis=0, ignore_index=True, sort=False)
FULL.to_excel('filename.xlsx', index=False)

But unfortunately reading takes way too long (around 15 minutes), and writing used up 100% of the memory on my 16 GB RAM PC and was taking so long that I was forced to interrupt the program. Is there any way I could speed up both the read and the write?

  • The problem is the attempt to do everything in memory. You've loaded 5×300 MB; that's 1.5 GB if not 3 GB or more. An xlsx file is a ZIP package containing XML files, so the actual data size can be a lot bigger. Then you create a concatenated frame, that's another 1.5 GB (or 3 GB) in RAM. Then you try to export it in one go, which means generating the XML content in memory before saving it to a new ZIP package. Commented Jan 23, 2020 at 11:11
  • @PanagiotisKanavos I tried using the del keyword to delete variables before attempting to_excel(), but the memory % remained the same. Commented Jan 23, 2020 at 11:13
  • That's answered here. Remove all references to the intermediate dataframes. Loading the files in a loop, using the same variable for each dataframe and appending it to a "master" dataframe should be enough (see the sketch after these comments). You'll still be using 2x-3x more RAM than necessary, though. Commented Jan 23, 2020 at 11:15
  • Check this question - stackoverflow.com/questions/37756991/… Commented Jan 23, 2020 at 11:15
  • @DishinHGoyani that's what the OP is already doing and got into trouble. Commented Jan 23, 2020 at 11:16
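
A minimal sketch of the loop-and-reuse idea from the comment above (file names are copied from the question; as the comment notes, the fully concatenated frame still has to fit in RAM):

import pandas as pd

files = ['File_1.xlsx', 'File_2.xlsx', 'File_3.xlsx', 'File_4.xlsx', 'File_5.xlsx']

full = pd.DataFrame()
for path in files:
    df = pd.read_excel(path)                # reuse one variable per source file
    full = pd.concat([full, df], axis=0, ignore_index=True, sort=False)
    del df                                  # drop the per-file frame before reading the next one

full.to_excel('filename.xlsx', index=False)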

2 Answers


In this post, a handy function append_df_to_excel() is defined.

You can use that function to read the files one by one and append their contents to the final Excel file. This saves RAM, since you are not keeping all the files in memory at once.

files = ['File_1.xlsx', 'File_2.xlsx', ...]

for file in files:
    df = pd.read_excel(file)
    append_df_to_excel('filename.xlsx', df)

Depending on your input files, you may need to pass some extra arguments to the function. Check the linked post for extra info.

Note that you can use df.to_csv() with mode='a' to append to a CSV file. Most of the time you can swap Excel files for CSV easily. If that is your case as well, I would suggest this method instead of the custom function.
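
For the CSV route, a rough sketch might look like this (the output name full.csv is just an example; the header is written only once so the appended chunks line up):

import pandas as pd

files = ['File_1.xlsx', 'File_2.xlsx', 'File_3.xlsx', 'File_4.xlsx', 'File_5.xlsx']

for i, path in enumerate(files):
    df = pd.read_excel(path)
    # overwrite on the first file, append afterwards; write the header only once
    df.to_csv('full.csv',
              mode='w' if i == 0 else 'a',
              header=(i == 0),
              index=False)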


5 Comments

So I just need to create a file (Excel or CSV) called "full.csv", for instance, and then append the read files to it via to_csv(mode='a')?
If your Excel file is a simple table, yes: df.to_csv("file.csv", mode="a"). Check on Google what a CSV is; if it is enough for you, go for it. I personally prefer CSV over Excel for storing tabular data.
Ok, I will try that. I have worked with CSV before; I did not know that writing to it was faster, though.
@Ach113 CSV is nothing more than a text file with specific delimiters. That means it's a lot larger, too, since there's no compression. Concatenating CSVs is essentially the same as concatenating text files. Stream processing is a lot faster too, as you're just reading single lines. Writing is easy, even in multithreading scenarios, because you're just appending lines to a file, something typically supported at the OS level.
@Ach113 that's why cloud event processing and analytics systems use CSV or JSON-per-line. You can even partition a file simply by seeking to the closest newline, and have each thread (or process) work on a different file block

Not ideal (and dependent on use case), but I've always found it much quicker to load the XLSX in Excel and save it as a CSV file. I tend to do multiple reads on the data, so in the long run the time spent waiting for the XLSX to load outweighs the time it takes to convert the file.
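
If the files have already been re-saved as CSV this way, the combine step itself becomes cheap; a quick sketch, assuming the converted files sit next to the originals with the same base names:

import pandas as pd

# Hypothetical names for the manually converted files.
csv_files = ['File_1.csv', 'File_2.csv', 'File_3.csv', 'File_4.csv', 'File_5.csv']

# read_csv is typically far faster than read_excel for files of this size
full = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
full.to_csv('full.csv', index=False)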

