Pandas read_excel: 'utf-8' codec can't decode byte 0xa8 in position 14: invalid start byte

Question

I am trying to read an MS Excel file, version 2016. The file contains several sheets with data. It was downloaded from a database and opens correctly in MS Office. In the example below I changed the file name.

EDIT: the file contains Russian and English words. It most probably uses the Latin-1 encoding, but encoding='latin-1' does not help.

import pandas as pd

with open('1.xlsx', 'r', encoding='utf8') as f:
    data = pd.read_excel(f)

Result:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 14: invalid start byte

Without encoding='utf8':

'charmap' codec can't decode byte 0x9d in position 622: character maps to <undefined>

P.S. The task is to process 52 files and merge the data in every sheet with the corresponding sheets across all 52 files. So please, no advice to do it by hand.

Alan Franzoni · Accepted Answer · 2020-07-08 09:54:09Z

Most probably you're using Python 3. In Python 2 this wouldn't happen.

xlsx files are binary (strictly speaking they are XML, but packed into a compressed ZIP archive), so you need to open them in binary mode. Use this call to open:

open('1.xlsx', 'rb')

There's no full traceback, but I imagine the UnicodeDecodeError comes from the file object, not from read_excel(). That happens because the stream of bytes can contain anything, but we don't want decoding to happen too soon; read_excel() must receive raw bytes and be able to process them.
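To make the "binary container" point concrete, here is a minimal standard-library sketch. The helper names `starts_like_zip` and `list_xlsx_parts` are hypothetical, and the demo file is a tiny stand-in archive, not a real workbook. It only shows that such a file starts with the ZIP signature and can be inspected once it is opened with 'rb'.

```python
import os
import tempfile
import zipfile


def starts_like_zip(path):
    """True if the file begins with the ZIP local-file-header signature."""
    with open(path, "rb") as f:
        return f.read(4) == b"PK\x03\x04"


def list_xlsx_parts(path):
    """Return the member names stored inside a ZIP-based container."""
    # "rb" hands raw bytes to zipfile; no text decoding is attempted.
    with open(path, "rb") as f:
        with zipfile.ZipFile(f) as archive:
            return archive.namelist()


# Build a tiny stand-in archive (NOT a valid workbook) to exercise the helpers.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.xlsx")
with zipfile.ZipFile(demo_path, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")

print(starts_like_zip(demo_path))  # True
print(list_xlsx_parts(demo_path))  # ['xl/workbook.xml']
```

Because the content is compressed bytes rather than text, any codec applied by a text-mode open() can fail on it, whichever encoding is chosen.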

Thanks, I was running into the same issue. Opening the file in binary mode before passing it to pandas.read_excel solved it for me.

Stu · Accepted Answer · 2018-11-02 18:03:49Z

The problem is that the original requester is calling read_excel with a filehandle as the first argument. As demonstrated by the last responder, the first argument should be a string containing the filename.

I ran into this same error using:

df = pd.read_excel(open("file.xlsx",'r'))

but correct is:

df = pd.read_excel("file.xlsx")
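Passing paths directly also fits the batch task described in the question. The following is only a sketch under assumptions: "merge" is taken to mean stacking rows of same-named sheets, and `merge_workbooks`, `merge_folder` and the "*.xlsx" pattern are hypothetical names chosen for illustration. It relies on `sheet_name=None`, which makes read_excel return a dict of DataFrames keyed by sheet name.

```python
from pathlib import Path

import pandas as pd


def merge_workbooks(workbooks):
    """Stack same-named sheets from several {sheet name: DataFrame} dicts."""
    collected = {}
    for book in workbooks:
        for sheet_name, frame in book.items():
            collected.setdefault(sheet_name, []).append(frame)
    # One concatenated DataFrame per sheet name, with a fresh row index.
    return {
        name: pd.concat(frames, ignore_index=True)
        for name, frames in collected.items()
    }


def merge_folder(folder, pattern="*.xlsx"):
    """Read every matching workbook by path and merge sheet by sheet."""
    paths = sorted(Path(folder).glob(pattern))
    # sheet_name=None asks read_excel for all sheets at once.
    return merge_workbooks(pd.read_excel(p, sheet_name=None) for p in paths)
```

With the 52 files in one directory, `merge_folder("path/to/files")` would return one DataFrame per sheet name; sheets whose columns differ between files would need extra handling that this sketch does not attempt.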

Your answer fixed my problem :) Don't know why it's at the bottom.
Passing in a file handle is perfectly fine, but it has to be opened in binary mode. read_excel's first parameter accepts a variety of argument types: str, bytes, ExcelFile, xlrd.Book, path object, or file-like object. pandas.pydata.org/docs/reference/api/…

plnnvkv · Accepted Answer · 2018-03-11 12:00:12Z

Most probably the problem is in the Russian symbols.

charmap is the default decoding method used when no encoding is specified.

As far as I can see, if utf-8 and latin-1 do not help, then try to read this file not as

pd.read_excel(f)

but

pd.read_table(f)

or even just

f.readline()

in order to check which symbol raises the exception, and then delete that symbol or symbols.
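A related way to see which byte triggers the error, without decoding anything, is to read the file in binary mode and look at the position named in the message. This is a hedged sketch: `bytes_around` is a hypothetical helper, and the sample file is synthetic, written only so that byte 0xa8 sits at offset 14 as in the question's error.

```python
import os
import tempfile


def bytes_around(path, position, width=4):
    """Return (offset, byte value) pairs near `position`, read in binary mode."""
    start = max(0, position - width)
    with open(path, "rb") as f:
        f.seek(start)
        chunk = f.read(position - start + width + 1)
    return [(start + i, value) for i, value in enumerate(chunk)]


# Synthetic sample: ZIP signature, ten zero bytes, then 0xa8 at offset 14.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.bin")
with open(sample_path, "wb") as f:
    f.write(b"PK\x03\x04" + bytes(10) + b"\xa8" + bytes(5))

for offset, value in bytes_around(sample_path, 14, width=2):
    print(offset, hex(value))
```

If the first bytes turn out to be `PK`, the file is a ZIP container, and the byte at the reported position belongs to the archive structure rather than to any Russian or English text.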

It helped indeed. I found that when I read the Excel file as plain text, one specific symbol appeared at the beginning.
Not only Russian symbols, but also Chinese, Japanese, Korean and other "special characters" can cause this decode problem in Python; it depends on the charset used for saving and for reading.

Nishant Patel · Accepted Answer · 2018-02-06 16:27:24Z

Pandas supports an encoding option for reading your Excel file. In your case you can use:

df = pd.read_excel('your_file.xlsx', encoding='utf-8')

or, if you want something more system-specific without any surprises, you can use:

df=pd.read_excel('your_file.xlsx',encoding='sys.getfilesystemencoding()')

Denis: sys.getfilesystemencoding() does not work either: unknown encoding.
Nishant Patel: @pure_true: in that case the encoding has to be set explicitly, and you need to identify it from your file.
Denis: It seems you did not understand me, or I do not understand you. When I put sys.getfilesystemencoding() into the encoding parameter, I get the error: Unknown encoding. Please explain properly what you meant.
Nishant Patel: Unknown encoding means your file contains characters that are not recognized by any built-in encoding method. Refer to the link below to find the encoding of your file: stackoverflow.com/questions/3710374/…
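One possible reading of the "Unknown encoding" message, offered as an assumption rather than a diagnosis: in the answer's second snippet the call is written inside quotes, so the encoding name is the literal text 'sys.getfilesystemencoding()' and not the function's result. The sketch below uses only the standard codecs registry; `is_known_encoding` is a hypothetical helper, and whether read_excel accepts an encoding argument at all in a given pandas version is not verified here.

```python
import codecs
import sys


def is_known_encoding(name):
    """True if Python's codec registry recognises `name`."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# The quoted form is just a string of characters, not a function call.
print(is_known_encoding("sys.getfilesystemencoding()"))  # False
# Calling the function yields a real codec name (for example "utf-8").
print(is_known_encoding(sys.getfilesystemencoding()))    # True
```

Either way, an encoding setting does not address the original error, since an xlsx file is a binary container and not encoded text.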

SAIKO · Accepted Answer · 2023-06-10 18:39:25Z

Sometimes this error occurs when the Excel sheet has empty cells, that is, null values in terms of the pandas DataFrame. In that case, the following approach worked for me.

import pandas as pd
import openpyxl

# Load Excel file using openpyxl
wb = openpyxl.load_workbook('excel_sheet_name.xlsx')
sheet = wb.active

# Collect the rows in a plain list
data = []

# Iterate over the rows in the sheet
for row in sheet.iter_rows(values_only=True):
    data.append(row)

# Convert to DataFrame
excel_file = pd.DataFrame(data)

xuzepei · Accepted Answer · 2023-10-27 02:07:53Z

If you get an error when reading empty cells or special characters in Excel, you can try the code below:

df = pd.read_excel(excel_filepath, sheet_name=sheet_name, dtype=str)
df = df.fillna('')
json_str = df.to_json(orient='records', force_ascii=False)
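To show what the last two steps do without needing a workbook, here is a small sketch on an in-memory DataFrame. The column names and values are invented for illustration. `fillna('')` replaces missing cells with empty strings, and `force_ascii=False` keeps non-ASCII characters as-is instead of escaping them.

```python
import json

import pandas as pd

# Hypothetical stand-in for the result of read_excel(..., dtype=str).
df = pd.DataFrame({"name": ["Müller", "Иван"], "city": ["Köln", None]})

df = df.fillna("")  # missing cells become empty strings
json_str = df.to_json(orient="records", force_ascii=False)

records = json.loads(json_str)
print(records)
```

With `force_ascii=True` (the default), the same characters would be written as `\u` escape sequences, which parse back to identical strings but are harder to read in the raw JSON.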

None of the solutions above worked for me, but this one solved the issue and works properly with "üäö"-like characters. Thanks!
