6

I'm currently trying to run some simple regexes on a very big .txt file (a couple of million lines of text). The simplest code that reproduces the problem:

file = open("exampleFileName", "r")
for line in file:
    pass

The error message:

Traceback (most recent call last):
  File "example.py", line 34, in <module>
    example()
  File "example.py", line 16, in example
    for line in file:
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7332: invalid continuation byte

How can I fix this? Is UTF-8 the wrong encoding? And if it is, how do I know which one is right?

Thanks and best regards!

3
  • Possibly related to stackoverflow.com/questions/5552555/… Commented Aug 17, 2016 at 16:26
  • Post the output of file -bi [your_filename]. You'll get an encoding. After that provide the encoding argument to open(). Commented Aug 17, 2016 at 16:27
  • What does the file -bi command do? Commented Mar 1, 2018 at 23:15

2 Answers

12

It looks like the file is not valid UTF-8, so you should try reading it with the latin-1 encoding. Try:

file = open('exampleFileName', 'r', encoding='latin-1') 
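For a file with millions of lines, it may help to stream it inside a with-block rather than hold a bare file object. The following is only a sketch: count_lines and the throwaway demo file are hypothetical names introduced for illustration, not part of the question. Note that latin-1 maps every possible byte value to a character, so it will not raise; that also means a clean read does not prove latin-1 is the file's real encoding.

```python
import os
import tempfile


def count_lines(path, encoding="latin-1"):
    """Stream the file line by line so it never has to fit in memory."""
    total = 0
    # The with-block closes the file even if decoding fails part-way.
    with open(path, "r", encoding=encoding) as handle:
        for _ in handle:
            total += 1
    return total


# Demo on a throwaway file that contains the offending byte 0xed.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first line\nm\xeda\nlast line\n")
    demo_path = tmp.name

line_count = count_lines(demo_path)
os.remove(demo_path)
```

The regex work from the question would go inside the for loop in place of the counter.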

4 Comments

Do you know how to do the same when reading from the command line? I use the input() function; is there a way to configure its encoding, or is there some other configurable function?
How did you figure out that you should use the latin-1 encoding?
0xed is the character í, which you can find in the latin-1 encoding.
So confused! After Unicode came onto the scene to cover all ~2 m code points, why is the latin-1 encoding still here? Shouldn't latin-1 be a subset of UTF? Shouldn't all the codes defined in latin-1 now be part of UTF? If so, why can't UTF support it? (Sorry, I am kind of new to this field.)
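A short sketch may help with the last two comments. The character í is part of Unicode (code point U+00ED), but UTF-8 stores it as two bytes, while latin-1 stores it as the single byte 0xed. In UTF-8 a lone 0xed announces a multi-byte sequence, so if an ordinary letter follows, decoding fails. The variable names below are made up for illustration.

```python
# The same character, encoded two different ways.
latin1_bytes = "\u00ed".encode("latin-1")  # single byte 0xed
utf8_bytes = "\u00ed".encode("utf-8")      # two bytes 0xc3 0xad

# Decoding the latin-1 byte as latin-1 gives the character back.
decoded = b"\xed".decode("latin-1")

# Decoding 0xed followed by a plain ASCII letter as UTF-8 fails,
# because 0xed must be followed by continuation bytes in UTF-8.
try:
    b"m\xeda".decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

So the code points of latin-1 are indeed a subset of Unicode; what differs is the byte representation on disk.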
0

It is not possible to identify the encoding on the fly. So either use the method I described in my comment, or use a similar construction (as proposed in the other answer), but this is a wild shot:

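try:
    # open() alone never raises UnicodeDecodeError; decoding happens on read
    file = open("exampleFileName", "r")
    text = file.read()
except UnicodeDecodeError:
    try:
        file = open("exampleFileName", "r", encoding="latin2")
        text = file.read()
    except UnicodeDecodeError:
        pass  # ... try the next encoding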

And so on, until you test all the encodings from Standard Python Encodings.

So I think there's no need to bother with this nested hell: just run file -bi [filename] once, copy the encoding, and forget about this.
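If you do want the trial-and-error route, a flat loop avoids the nesting. This is a minimal sketch, assuming you supply your own list of candidates; first_decodable_encoding and the demo file are hypothetical names, not an existing API. Keep in mind that a candidate that decodes without error is not necessarily the correct encoding, and latin-1 in particular accepts any byte, so it belongs at the end of the list.

```python
import os
import tempfile


def first_decodable_encoding(path, candidates):
    """Return the first candidate that decodes the whole file, else None."""
    for encoding in candidates:
        try:
            with open(path, "r", encoding=encoding) as handle:
                for _ in handle:  # decoding happens here, not in open()
                    pass
        except UnicodeDecodeError:
            continue  # this candidate fails, try the next one
        return encoding
    return None


# Demo on a throwaway file that is not valid UTF-8.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"m\xeda\n")
    demo_path = tmp.name

guessed = first_decodable_encoding(demo_path, ["utf-8", "latin-1"])
no_match = first_decodable_encoding(demo_path, ["ascii", "utf-8"])
os.remove(demo_path)
```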

UPD. Actually, I've found another stackoverflow answer which you can use if you're on Windows.
