
I have a big Excel file that contains many different sheets. All the sheets have the same structure, like:

Name
col1  col2  col3  col4
1     1     2     4
4     3     2     1
  • How can I concatenate (vertically) all these sheets in Pandas without having to name each of them manually? If these were files, I could use glob to obtain a list of files in a directory. But here, for Excel sheets, I am lost.
  • Is there a way to create a variable in the resulting DataFrame that identifies the sheet name from which the data comes?

Thanks!

4 Answers


Try this:

dfs = pd.read_excel(filename, sheet_name=None, skiprows=1) 

This will return a dictionary of DataFrames, which you can easily concatenate using pd.concat(dfs), or, as @jezrael has already posted in his answer:

df = pd.concat(pd.read_excel(filename, sheet_name=None, skiprows=1)) 

sheet_name: None -> All sheets as a dictionary of DataFrames
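
If you do not need the sheet names at all, a minimal sketch along the same lines (reusing the filename variable from above; ignore_index simply gives the result a fresh 0..n-1 index) would be:

dfs = pd.read_excel(filename, sheet_name=None, skiprows=1)  # {sheet name: DataFrame}
df = pd.concat(dfs.values(), ignore_index=True)             # stack vertically, fresh index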

UPDATE:

Is there a way to create a variable in the resulting DataFrame that identifies the sheet name from which the data comes?

dfs = pd.read_excel(filename, sheet_name=None, skiprows=1) 

assuming we've got the following dict:

In [76]: dfs
Out[76]:
{'d1':    col1  col2  col3  col4
 0     1     1     2     4
 1     4     3     2     1,
 'd2':    col1  col2  col3  col4
 0     3     3     4     6
 1     6     5     4     3}

Now we can add a new column:

In [77]: pd.concat([df.assign(name=n) for n, df in dfs.items()])
Out[77]:
   col1  col2  col3  col4 name
0     1     1     2     4   d1
1     4     3     2     1   d1
0     3     3     4     6   d2
1     6     5     4     3   d2
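
If you prefer a clean 0..n-1 index instead of the repeated per-sheet index above, passing ignore_index=True to the same call should work (a small sketch with the same dfs):

pd.concat([df.assign(name=n) for n, df in dfs.items()], ignore_index=True)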

9 Comments

Then pd.concat(dfs.values()) yields the result.
nice but how can I get rid of the Name super column? I thought I could use some read_excel (skip = 1) somewhere with the dictionary?
@blacksite, yes, thank you. I thought OP knows it already... ;-)
@ℕʘʘḆḽḘ, use skiprows=1
@ℕʘʘḆḽḘ, i've updated my answer - is that what you want?

Taking a note from this question:

import pandas as pd

file = pd.ExcelFile('file.xlsx')
names = file.sheet_names  # see all sheet names
df = pd.concat([file.parse(name) for name in names])

Results:

df
Out[6]:
   A  B
0  1  3
1  2  4
0  5  6
1  7  8

Then you can run df.reset_index(), to, well, reset the index.
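
For example (drop=True discards the old per-sheet index instead of keeping it as an extra column):

df = df.reset_index(drop=True)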

Edit: pandas.ExcelFile.parse is, according to the pandas docs:

Equivalent to read_excel(ExcelFile, ...) See the read_excel docstring for more info on accepted parameters
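
So the skiprows trick from the other answers should work here as well, and assign can tag each row with its sheet name (a sketch reusing file and names from above):

df = pd.concat(
    [file.parse(name, skiprows=1).assign(sheet=name) for name in names],
    ignore_index=True,
)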

3 Comments

Thanks, but why use file.parse instead of read_excel?
See above, please.
great. thanks guys but I had to give this one to the fastest one! :)

First add the parameter sheet_name=None to get a dict of DataFrames and skiprows=1 to omit the first row, then use concat to get a MultiIndex DataFrame.

Last, use reset_index to turn the first index level into a column:

df = pd.concat(pd.read_excel('multiple_sheets.xlsx', sheet_name=None, skiprows=1))
df = df.reset_index(level=1, drop=True).rename_axis('filenames').reset_index()
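
An equivalent spelling names the index levels up front via concat's names parameter ('sheet' and 'row' are just illustrative level names here):

dfs = pd.read_excel('multiple_sheets.xlsx', sheet_name=None, skiprows=1)
df = (pd.concat(dfs, names=['sheet', 'row'])
        .reset_index(level='row', drop=True)
        .reset_index())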

3 Comments

thanks jezrael. same issue, how can I get rid of the first line for every sheet here?
skiprows=1 should help.
great. thanks guys but I had to give this one to the fastest one! :)
import os
import glob
import pandas as pd

file_save_location = 'myfolder'   # folder for the consolidated CSVs
file_name = 'filename'            # base name for the output files
location = 'myfolder1'            # folder containing the source workbooks

os.chdir(location)
files_xls = glob.glob("*.xls*")                   # all Excel workbooks in the folder
excel_names = [f for f in files_xls]
sheets = pd.ExcelFile(files_xls[0]).sheet_names   # sheet names taken from the first workbook

def combine_excel_to_dfs(excel_names, sheet_name):
    # Read the same sheet from every workbook and stack them vertically
    sheet_frames = [pd.read_excel(x, sheet_name=sheet_name) for x in excel_names]
    combined_df = pd.concat(sheet_frames).reset_index(drop=True)
    return combined_df

i = 0
while i < len(sheets):
    process = sheets[i]
    consolidated_file = combine_excel_to_dfs(excel_names, process)
    # Include the sheet name in the output path so each sheet gets its own CSV
    consolidated_file.to_csv(os.path.join(file_save_location, file_name + '_' + process + '.csv'))
    i = i + 1
else:
    print("we done on consolidation part")

1 Comment

It would be better if you could edit the post with some description followed by the code.
