What is causing a UnicodeDecodeError when trying to read a text file?

Question

I am trying to run this code snippet in Python 3.8:

    def load_rightprob(self, rightprob_file):
        ''' dictionary with # people keys with # actions '''
        rightProb = {}
        for line in open(rightprob_file):
            items = line.strip().split("\t")
            if len(items) != len(self.action_qid_dict) + 1:
                continue
            pid = int(items[0])

but I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I tried for line in open(rightprob_file, 'rb'): instead, but then the following line fails with this error:

TypeError: a bytes-like object is required, not 'str'

Can somebody please suggest how to fix this? I am reading from a .txt file where each line is an ID followed by 377 columns of probability values associated with that ID.

Thanks.

We can’t tell you the correct encoding without seeing (a representative, ideally small sample of) the actual contents of the data in an unambiguous representation; a hex dump of the problematic byte(s) with a few bytes of context on each side is often enough, especially if you can tell us what you think those bytes are supposed to represent. See also meta.stackoverflow.com/questions/379403/…
– tripleee, Commented Jan 2, 2021 at 19:21
@tripleee a value of FF in byte 0 means it's 99% likely to be a BOM for UTF-16 little-endian.
– Mark Ransom, Commented Jan 2, 2021 at 19:26
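For readers who want to produce the kind of hex dump the first comment asks for, here is a minimal sketch. The helper name `hex_context` and its parameters are hypothetical and not part of the original question. It prints the bytes around the position reported in the error message.

```python
def hex_context(path, position, context=8):
    """Return the bytes around `position` in the file as a hex string.

    `position` is the byte offset from the UnicodeDecodeError message;
    `context` is how many bytes to show on each side (an arbitrary default).
    """
    start = max(0, position - context)
    # Binary mode: no decoding happens, so this cannot raise UnicodeDecodeError
    with open(path, "rb") as f:
        f.seek(start)
        chunk = f.read(position - start + context + 1)
    return " ".join(f"{byte:02x}" for byte in chunk)
```

For the error in the question, calling `hex_context(rightprob_file, 0)` would show the first nine bytes of the file. Output that begins with `ff fe` would support the UTF-16 little-endian guess.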

Mark Ransom · Accepted Answer · 2021-01-03 00:05:56Z

It's very unusual for a text file to start with 0xff. Because of that, it's sometimes placed deliberately at the start of the file as part of a Byte Order Mark (BOM) for Unicode, particularly on Windows. As you can see in the table in the link, only two Unicode encodings have a BOM that starts with 0xff: UTF-16 and UTF-32, both little endian. Of the two, UTF-16 is far more commonly encountered.
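If you would rather check than guess, the BOM reasoning above can be sketched in code using the constants in the standard `codecs` module. The function name `guess_encoding_from_bom` and the `default` fallback are assumptions made for illustration. A file without a BOM cannot be identified this way.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM (ff fe 00 00) begins with
# the UTF-16 LE BOM (ff fe), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(path, default="utf-8"):
    """Return a codec name based on the file's BOM, or `default` if none."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is four bytes
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding
    return default
```

The returned names are the BOM-aware codecs (`utf-16`, `utf-32`, `utf-8-sig`), which read the byte order from the BOM and drop it while decoding.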

So open your file like this:

    with open(rightprob_file, 'r', encoding='utf_16_le') as f:
        for line in f:

I added the with statement so that the file is closed automatically when you're done; leaving it open was a bug in your original code.

The first character read from the file will be u'\ufeff' and can be thrown away or otherwise ignored.
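A minimal sketch of dropping that leading character follows, assuming the file really is tab-separated UTF-16 with a BOM. The function name `read_tab_rows` is hypothetical. With the `utf_16_le` codec the BOM stays in the decoded text and has to be stripped by hand. Passing `encoding="utf-16"` instead lets the codec consume it.

```python
def read_tab_rows(path, encoding="utf_16_le"):
    """Return each line of a tab-separated text file as a list of strings."""
    rows = []
    with open(path, "r", encoding=encoding) as f:
        for index, line in enumerate(f):
            if index == 0:
                # str.strip() does not remove U+FEFF, so drop it explicitly.
                # This is a no-op when the codec already consumed the BOM.
                line = line.lstrip("\ufeff")
            rows.append(line.strip().split("\t"))
    return rows
```

Without the explicit `lstrip`, `int(items[0])` on the first row would raise a `ValueError`, because the ID string would still begin with `'\ufeff'`.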
