524

Why does the snippet below fail? And why does it succeed with the "latin-1" codec?

o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving v = o.decode("utf-8") 

Which results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte

13 Answers

600

I had the same error when I tried to open a CSV file with the pandas.read_csv method.

The solution was to change the encoding to latin-1:

pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, encoding='latin-1')
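A minimal sketch of why latin-1 silences the error, here with the builtin open (the filename is the one from the snippet above):

# latin-1 assigns a character to every byte value 0x00-0xFF, so decoding
# can never fail. It silences the error, but the text is only correct
# if the file really is latin-1 (or the closely related cp1252).
with open('ml-100k/u.item', encoding='latin-1') as f:
    first = f.readline()   # decodes byte-for-byte, never raises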

2 Comments

Does this actually solve the problem though? Doesn't it basically just tell pandas to ignore the byte by downgrading to a less complex encoding style?
Works well for the builtin open function too. Thanks
369

In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two bytes of the form 10xx xxxx. So, for example:

>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'

But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in latin 1. You can see how UTF-8 and latin 1 look different:

>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'

(Note, I'm using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)
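The same failure and recovery as a minimal, self-contained Python 3 sketch (the byte string is the one from the question; the variable names are illustrative):

# These bytes are really latin-1 that was mislabeled as UTF-8.
raw = b'a test of \xe9 char'

# raw.decode('utf-8')          # UnicodeDecodeError: invalid continuation byte
text = raw.decode('latin-1')   # 'a test of é char'

# Re-encoding as UTF-8 yields the proper two-byte sequence for é:
print(text.encode('utf-8'))    # b'a test of \xc3\xa9 char'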

2 Comments

Thanks (and to the other that replied), I was under the mistaken belief that chars up until 255 would directly convert.
I get UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128) when using .encode('latin-1')
77

It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.

If you don't know the codeset your strings arrive in, you're in a bit of trouble. It would be best to choose a single codeset (ideally UTF-8) for your protocol/application and then reject inputs that don't decode.

If you can't do that, you'll need heuristics.

1 Comment

And for heuristics, see the chardet library.
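A minimal sketch of that chardet-based heuristic, assuming the third-party chardet package is installed (pip install chardet):

import chardet

raw = b'a test of \xe9 char'
guess = chardet.detect(raw)   # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
text = raw.decode(guess['encoding'] or 'latin-1', errors='replace')

Detection is statistical, so on short inputs the guess can be wrong; rejecting undecodable input is still preferable when you control the protocol.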
55

Because UTF-8 is a multibyte encoding, and there is no character corresponding to your combination of \xe9 plus the following space.

Why should it succeed in both utf-8 and latin-1?

Here is how the same sentence should look in utf-8:

>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'
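A minimal sketch of the check the decoder performs (the bit masks come from the UTF-8 spec; the function name is illustrative):

def is_continuation(byte):
    # UTF-8 continuation bytes have the form 10xxxxxx.
    return byte & 0xC0 == 0x80

lead = 0xE9           # 1110 1001: starts a 3-byte sequence, needs 2 continuations
follower = ord(' ')   # 0x20 = 0010 0000: not a continuation byte
print(is_continuation(follower))   # False -> "invalid continuation byte"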

1 Comment

Latin-1 is a single-byte encoding, so everything in it should be representable in UTF-8. But why does Latin-1 sometimes win?
33

Use this if you get the UTF-8 error:

pd.read_csv('File_name.csv', encoding='latin-1')

Comments

31

If this error arises when manipulating a file that was just opened, check whether you opened it in 'rb' (binary) mode.
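A minimal sketch of the difference (the filename is illustrative):

# Text mode decodes eagerly, so a bad byte raises UnicodeDecodeError mid-read:
#     with open('webpage.html') as f: ...
# Binary mode hands back raw bytes and decodes nothing until you ask:
with open('webpage.html', 'rb') as f:
    raw = f.read()                                # bytes; never raises here
text = raw.decode('utf-8', errors='replace')      # decode explicitly, on your own terms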

3 Comments

Thanks to this answer, I was able to avoid the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 2024079: invalid continuation byte with soup = BeautifulSoup(open('webpage.html', 'rb'), 'html.parser')
This was also the solution for the same problem when using trimesh.util.concatenate()'s export method.
Got this working with png and jpg files.
22

A UTF-8 codec error usually appears when byte values fall outside the range 0 to 127 and do not form a valid multibyte sequence.

The reason this exception is raised:

1) If the code point is < 128, each byte is the same as the value of the code point.
2) If the code point is 128 or greater, the Unicode string can't be represented in this encoding (Python raises a UnicodeEncodeError exception in this case).

To overcome this there is a set of encodings; the most widely used is Latin-1, also known as ISO-8859-1.

In ISO-8859-1, Unicode code points 0–255 are identical to the Latin-1 byte values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can't be encoded into Latin-1.

When this exception occurs while you are trying to load a data set, try this:

df = pd.read_csv("top50.csv", encoding='ISO-8859-1')

Adding the encoding argument lets the data set load.
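A minimal sketch of that 0-255 identity (the characters are illustrative):

# Code points 0-255 map one-to-one onto Latin-1 byte values...
assert 'é'.encode('latin-1') == b'\xe9'
assert b'\xe9'.decode('latin-1') == chr(0xE9)

# ...but anything above 255 cannot be encoded:
# '€'.encode('latin-1')   # UnicodeEncodeError: ordinal not in range(256)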

1 Comment

Hi and welcome to SO! Please edit your answer to ensure that it improves upon other answers already present in this question.
13

Well, this type of error comes when you are reading a particular file or data set in pandas, such as:

data = pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv')

Then the error is displayed like this:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 1: invalid continuation byte

This type of error can be avoided by adding an encoding argument:

data = pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv', encoding='ISO-8859-1')

2 Comments

Please format your code properly; see the help center to learn how.
It works, but it doesn't get some country names right in the FAO data.
12

This happened to me too, while I was reading text containing Hebrew from a .txt file.

I clicked File -> Save As and saved the file with UTF-8 encoding.

1 Comment

I have a problem: I'm trying to set up Odoo 19, and I don't have any custom modules yet. I don't know if it's a library issue, but I've already installed all the libraries listed in the requirements.txt file. Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 85: invalid continuation byte
7

TLDR: I would recommend investigating the source of the problem in depth before switching encoders to silence the error.

I got this error as I was processing a large number of zip files with additional zip files in them.

My workflow was the following:

  1. Read zip
  2. Read child zip
  3. Read text from child zip

At some point I was hitting the encoding error above. Upon closer inspection, it turned out that some child zips erroneously contained further zips. Reading these zips as text led to some funky character representation that I could silence with encoding="latin-1", but which in turn caused issues further down the line. Since I was working with international data, it was not completely foolish to assume it was an encoding problem (I had problems with 0xc2: Â), but in the end it was not the actual issue.
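A minimal sketch of the kind of sanity check that would have caught the nested zips before any decoding (ZIP archives start with the magic bytes PK\x03\x04; the function names are illustrative):

import zipfile

def looks_like_zip(raw: bytes) -> bool:
    # ZIP local file headers start with the magic bytes PK\x03\x04.
    return raw[:4] == b'PK\x03\x04'

def read_member_text(zf: zipfile.ZipFile, name: str) -> str:
    raw = zf.read(name)
    if looks_like_zip(raw):
        raise ValueError(f'{name} is a nested zip, not text')
    return raw.decode('utf-8')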

Comments

3

In this case, I was trying to execute a .py which runs a path/file.sql.

My solution was to change the encoding of the file.sql to "UTF-8 without BOM" and it works!

You can do it with Notepad++.

I will leave a part of my code:

con = psycopg2.connect(host=sys.argv[1], port=sys.argv[2], dbname=sys.argv[3], user=sys.argv[4], password=sys.argv[5])
cursor = con.cursor()
sqlfile = open(path, 'r')
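If you can't re-save the file, Python's built-in utf-8-sig codec strips a leading BOM during decoding (a minimal sketch; path and cursor are the variables from the snippet above):

# 'utf-8-sig' decodes UTF-8 and silently drops a byte-order mark if one is
# present, so the SQL text doesn't begin with an invisible '\ufeff'.
with open(path, 'r', encoding='utf-8-sig') as sqlfile:
    cursor.execute(sqlfile.read())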

Comments

0

I encountered this problem, and it turned out that I had saved my CSV directly from a Google Sheets file. In other words, while in the Google Sheet I chose Save a copy, and when my browser downloaded it I chose Open; then I DIRECTLY saved the CSV. This was the wrong move.

What fixed it for me was first saving the sheet as an .xlsx file on my local computer, and from there exporting the single sheet as a .csv. Then the error went away for pd.read_csv('myfile.csv').

Comments

-1

The solution was to change the file to "UTF-8 without BOM".

Comments
