UnicodeDecodeError on python3 [duplicate]

Question

I'm currently trying to run some simple regexes over a very big .txt file (a couple of million lines of text). The simplest code that reproduces the problem:

file = open("exampleFileName", "r")
for line in file:
    pass

The error message:

Traceback (most recent call last):
  File "example.py", line 34, in <module>
    example()
  File "example.py", line 16, in example
    for line in file:
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7332: invalid continuation byte

How can I fix this? Is utf-8 the wrong encoding? And if it is, how do I know which one is right?

Thanks and best regards!

Post the output of file -bi [your_filename]; it reports the file's encoding. Then pass that value as the encoding argument to open().
– user5164080, Commented Aug 17, 2016 at 16:27

mic4ael · Accepted Answer · 2016-08-17 16:25:33Z

12

It looks like the file is not valid UTF-8. Try reading it with the latin-1 encoding instead:

file = open('exampleFileName', 'r', encoding='latin-1')
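A minimal sketch of that suggestion applied to the loop from the question, assuming the file really is latin-1. The sample bytes and the temporary file are made up for illustration. Keep in mind that latin-1 maps every possible byte to some character, so this never raises, even when the guess is wrong; a wrong guess just produces the wrong characters.

```python
import os
import tempfile

# Hypothetical sample data: "café" and "í" encoded as latin-1 (bytes 0xe9 and 0xed).
sample = b"caf\xe9\n\xed\n"

# Write the sample to a temporary file standing in for exampleFileName.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

lines = []
# The with-statement closes the file even if an error occurs mid-loop.
with open(path, "r", encoding="latin-1") as file:
    for line in file:
        lines.append(line.rstrip("\n"))

os.remove(path)
print(lines)
```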


4 Comments

chivorotkiv Over a year ago

Do you know how to do the same when reading from the command line? I use the input() function. Is there a way to configure its encoding, or is there some other configurable function?

Reihan_amn Over a year ago

How did you figure out to use latin-1 encoding?

mic4ael Over a year ago

0xed is the character í in the latin-1 encoding.

Reihan_amn Over a year ago

So confused! Now that Unicode covers all ~1.1 million code points, why is latin-1 still around? Shouldn't latin-1 be a subset of UTF? Shouldn't every code defined in latin-1 now be part of UTF? If so, why can't UTF support it? (Sorry, I'm kind of new to this field.)
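For what it's worth, a short sketch of why the two don't interchange at the byte level: every latin-1 character does have a Unicode code point, but UTF-8 stores code points above 0x7f as multi-byte sequences, so the single latin-1 byte 0xed is not valid UTF-8 on its own. The byte string b"\xedabc" below is an invented example.

```python
ch = "\u00ed"  # the character í, code point U+00ED

latin1_bytes = ch.encode("latin-1")  # one byte: 0xed
utf8_bytes = ch.encode("utf-8")      # two bytes: 0xc3 0xad

# In UTF-8, 0xed is the lead byte of a three-byte sequence, so it must be
# followed by continuation bytes. Here it is followed by plain ASCII instead.
try:
    b"\xedabc".decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(latin1_bytes, utf8_bytes, decoded_ok)
```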

Community · Accepted Answer · 2017-05-23 12:33:02Z

It is not possible to identify the encoding on the fly. So either use the method I described in my comment, or use a construction like the following (as proposed by another answer), though this is a shot in the dark:

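try:
    # open() itself does not decode anything; the error is raised on read
    with open("exampleFileName", "r", encoding="utf-8") as file:
        text = file.read()
except UnicodeDecodeError:
    try:
        with open("exampleFileName", "r", encoding="latin2") as file:
            text = file.read()
    except UnicodeDecodeError:
        pass  # ... try the next encoding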

And so on, until you test all the encodings from Standard Python Encodings.

So I don't think it's worth bothering with this nested hell. Just run file -bi [filename] once, copy the encoding it reports, and forget about it.
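If you do want the fallback approach anyway, the nesting can be flattened into a loop. This is only a sketch under stated assumptions: read_with_fallback is a hypothetical helper, the candidate list is arbitrary, and the sample bytes are invented. A decode that succeeds does not prove the encoding is correct, and latin-1 in particular accepts any byte, so it only makes sense as the last candidate.

```python
import os
import tempfile


def read_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: reads the whole file into memory, which may be
    too much for very large files.
    """
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue  # this candidate failed, try the next one
    raise ValueError("none of the candidate encodings could decode " + path)


# Invented sample: a lone 0xed byte, which is invalid UTF-8 but valid latin-1.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xed\n")
    sample_path = tmp.name

text, used = read_with_fallback(sample_path)
os.remove(sample_path)
print(repr(text), used)
```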

UPD. Actually, I've found another Stack Overflow answer which you can use if you're on Windows.
