Pandas: Looking up the list of sheets in an excel file

Question

The new version of Pandas uses the following interface to load Excel files:

read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])

but what if I don't know the sheets that are available?

For example, I am working with excel files that the following sheets

Data 1, Data 2 ..., Data N, foo, bar

but I don't know N a priori.

Is there any way to get the list of sheets from an excel document in Pandas?

Ezekiel Kruglick · Accepted Answer · 2015-07-18 19:46:19Z

532

You can still use the ExcelFile class (and the sheet_names attribute):

xl = pd.ExcelFile('foo.xls') xl.sheet_names # see all sheet names xl.parse(sheet_name) # read a specific sheet to DataFrame

see docs for parse for more options...

edited Jul 18, 2015 at 19:46

Ezekiel Kruglick

4,70641 silver badges49 bronze badges

answered Jul 31, 2013 at 18:01

Andy Hayden

378k110 gold badges640 silver badges546 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Andy Hayden Over a year ago

Mentioned before here, but I like to keep a dictionary of DataFrames using {sheet_name: xl.parse(sheet_name) for sheet_name in xl.sheet_names}

Ezekiel Kruglick Over a year ago

Wish I could give you more upvotes, this works across multiple versions of pandas too! (don't know why they like changing the API so often) Thanks for pointing me at the parse function, here is the current link though: pandas.pydata.org/pandas-docs/stable/generated/…

vjangus Over a year ago

When opening xlsx files, this will fail in pandas 1.1.5. But can be fixed by using xl = pd.ExcelFile('foo.xls', engine='openpyxl'). Related on my issue, see this thread

Ohumeronen · Accepted Answer · 2024-06-05 06:04:28Z

You should explicitly specify the second parameter (sheet_name) as None. like this:

 df = pandas.read_excel("/yourPath/FileName.xlsx", sheet_name=None);

"df" are all sheets as a dictionary of DataFrames, you can verify it by run this:

df.keys()

result like this:

[u'201610', u'201601', u'201701', u'201702', u'201703', u'201704', u'201705', u'201706', u'201612', u'fund', u'201603', u'201602', u'201605', u'201607', u'201606', u'201608', u'201512', u'201611', u'201604']

please refer pandas doc for more details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

This unnecessarily parses every sheet as a DataFrame, which is not required. "How to read an xls/xlsx file" is a different question.
@AndyHayden it might not be efficient, but it might be the best if you care about all the sheets, or you don't care about the additional overhead.
The named argument is called sheet_name. I.e., df = pandas.read_excel("/yourPath/FileName.xlsx", sheet_name=None, engine='openpyxl')

Jean-Francois T. · Accepted Answer · 2022-05-20 02:36:58Z

The easiest way to retrieve the sheet-names from an excel (xls., xlsx) is:

tabs = pd.ExcelFile("path").sheet_names print(tabs)

Then to read and store the data of a particular sheet (say, sheet names are "Sheet1", "Sheet2", etc.), say "Sheet2" for example:

data = pd.read_excel("path", "Sheet2") print(data)

S.E.A · Accepted Answer · 2019-09-06 21:38:13Z

This is the fastest way I have found, inspired by @divingTobi's answer. All The answers based on xlrd, openpyxl or pandas are slow for me, as they all load the whole file first.

from zipfile import ZipFile from bs4 import BeautifulSoup # you also need to install "lxml" for the XML parser with ZipFile(file) as zipped_file: summary = zipped_file.open(r'xl/workbook.xml').read() soup = BeautifulSoup(summary, "xml") sheets = [sheet.get("name") for sheet in soup.find_all("sheet")]

Note for anyone trying this. This will only work for xlsx, not xls. As xls is a binary format not xml.

Suraj Rao · Accepted Answer · 2021-12-07 12:04:06Z

#It will work for Both '.xls' and '.xlsx' by using pandas import pandas as pd excel_Sheet_names = (pd.ExcelFile(excelFilePath)).sheet_names #for '.xlsx' use only openpyxl from openpyxl import load_workbook excel_Sheet_names = (load_workbook(excelFilePath, read_only=True)).sheet_names

I believe the method is called ".sheetnames" (without underscore).

Glen Thompson · Accepted Answer · 2020-11-09 22:23:24Z

If you:

care about performance
don't need the data in the file at execution time.
want to go with conventional libraries vs rolling your own solution

Below was benchmarked on a ~10Mb xlsx, xlsb file.

`xlsx, xls`

from openpyxl import load_workbook def get_sheetnames_xlsx(filepath): wb = load_workbook(filepath, read_only=True, keep_links=False) return wb.sheetnames

Benchmarks: ~ 14x speed improvement

# get_sheetnames_xlsx vs pd.read_excel 225 ms ± 6.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 3.25 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

`xlsb`

from pyxlsb import open_workbook def get_sheetnames_xlsb(filepath): with open_workbook(filepath) as wb: return wb.sheets

Benchmarks: ~ 56x speed improvement

# get_sheetnames_xlsb vs pd.read_excel 96.4 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 5.36 s ± 162 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Notes:

This is a good resource - http://www.python-excel.org/
xlrd is no longer maintained as of 2020

Dhwanil shah · Accepted Answer · 2019-05-27 05:43:02Z

I have tried xlrd, pandas, openpyxl and other such libraries and all of them seem to take exponential time as the file size increase as it reads the entire file. The other solutions mentioned above where they used 'on_demand' did not work for me. If you just want to get the sheet names initially, the following function works for xlsx files.

def get_sheet_details(file_path): sheets = [] file_name = os.path.splitext(os.path.split(file_path)[-1])[0] # Make a temporary directory with the file name directory_to_extract_to = os.path.join(settings.MEDIA_ROOT, file_name) os.mkdir(directory_to_extract_to) # Extract the xlsx file as it is just a zip file zip_ref = zipfile.ZipFile(file_path, 'r') zip_ref.extractall(directory_to_extract_to) zip_ref.close() # Open the workbook.xml which is very light and only has meta data, get sheets from it path_to_workbook = os.path.join(directory_to_extract_to, 'xl', 'workbook.xml') with open(path_to_workbook, 'r') as f: xml = f.read() dictionary = xmltodict.parse(xml) for sheet in dictionary['workbook']['sheets']['sheet']: sheet_details = { 'id': sheet['@sheetId'], 'name': sheet['@name'] } sheets.append(sheet_details) # Delete the extracted files directory shutil.rmtree(directory_to_extract_to) return sheets

Since all xlsx are basically zipped files, we extract the underlying xml data and read sheet names from the workbook directly which takes a fraction of a second as compared to the library functions.

Benchmarking: (On a 6mb xlsx file with 4 sheets)
Pandas, xlrd: 12 seconds
openpyxl: 24 seconds
Proposed method: 0.4 seconds

Since my requirement was just reading the sheet names, the unnecessary overhead of reading the entire time was bugging me so I took this route instead.

@Daniel I have used only zipfile which is an in-built module and xmltodict which I used to convert the XML into an easily iterable dictionary. Although you can look at @divingTobi's answer below where you can read the same file without actually extracting the files within.
When I tried openpyxl with the read_only flag it is significantly faster (200X faster for my 5 MB file). load_workbook(excel_file).sheetnames averaged 8.24s where load_workbook(excel_file, read_only=True).sheetnames averaged 39.6ms.

divingTobi · Accepted Answer · 2019-07-01 14:01:13Z

Building on @dhwanil_shah 's answer, you do not need to extract the whole file. With zf.open it is possible to read from a zipped file directly.

import xml.etree.ElementTree as ET import zipfile def xlsxSheets(f): zf = zipfile.ZipFile(f) f = zf.open(r'xl/workbook.xml') l = f.readline() l = f.readline() root = ET.fromstring(l) sheets=[] for c in root.findall('{http://schemas.openxmlformats.org/spreadsheetml/2006/main}sheets/*'): sheets.append(c.attrib['name']) return sheets

The two consecutive readlines are ugly, but the content is only in the second line of the text. No need to parse the whole file.

This solution seems to be much faster than the read_excel version, and most likely also faster than the full extract version.

No, .xls is a completely different file format, so I would not expect this code to work.

flutefreak7 · Accepted Answer · 2020-06-04 20:54:41Z

from openpyxl import load_workbook sheets = load_workbook(excel_file, read_only=True).sheetnames

For a 5MB Excel file I'm working with, load_workbook without the read_only flag took 8.24s. With the read_only flag it only took 39.6 ms. If you still want to use an Excel library and not drop to an xml solution, that's much faster than the methods that parse the whole file.

Dominique · Accepted Answer · 2021-08-10 11:05:15Z

if you read excel file

dfs = pd.ExcelFile('file')

then use

dfs.sheet_names dfs.parse('sheetname')

another variant

df = pd.read_excel('file', sheet_name='sheetname')

philipnye · Accepted Answer · 2024-01-06 21:37:07Z

If using pd.read_excel() with sheet_name=None to read sheet names, setting nrows=0 can significantly boost speeds.

Benchmarking on an 8MB XLSX file with nine sheets:

%timeit pd.read_excel('foo.xlsx', sheet_name=None, nrows=0).keys() >>> 145 ms ± 50.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) %timeit pd.read_excel('foo.xlsx', sheet_name=None).keys() >>> 16.5 s ± 2.04 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

This takes advantage of the fact that (as you might expect) pandas stops reading a sheet once it reaches nrows (reference).

Pornthep Nivatyakul · Accepted Answer · 2022-11-08 00:00:38Z

import pandas as pd path = "\\DB\\Expense\\reconcile\\" file_name = "202209-v01.xlsx" df = pd.read_excel(path + file_name, None) print(df) sheet_names = list(df.keys()) # print last sheet name print(sheet_names[len(sheet_names)-1]) last_month = df.get(sheet_names[len(sheet_names)-1]) print(last_month)

Gopalarathnam Sambasivan · Accepted Answer · 2021-04-08 12:19:41Z

With the load_workbook readonly option, what was earlier seen as a execution seen visibly waiting for many seconds happened with milliseconds. The solution could however be still improved.

 import pandas as pd from openpyxl import load_workbook class ExcelFile: def __init__(self, **kwargs): ........ ..... self._SheetNames = list(load_workbook(self._name,read_only=True,keep_links=False).sheetnames)

The Excelfile.parse takes the same time as reading the complete xls in order of 10s of sec. This result was obtained with windows 10 operating system with below package versions

 C:\>python -V Python 3.9.1 C:\>pip list Package Version --------------- ------- et-xmlfile 1.0.1 numpy 1.20.2 openpyxl 3.0.7 pandas 1.2.3 pip 21.0.1 python-dateutil 2.8.1 pytz 2021.1 pyxlsb 1.0.8 setuptools 49.2.1 six 1.15.0 xlrd 2.0.1

Collectives™ on Stack Overflow

Pandas: Looking up the list of sheets in an excel file

13 Answers 13

3 Comments

3 Comments

Comments

1 Comment

1 Comment

`xlsx, xls`

`xlsb`

Comments

3 Comments

1 Comment

Comments

Comments

Comments

Comments

Comments

Linked

Hot Network Questions