Python: how to read bytes-type data from a file and convert it to UTF-8?

Question

I need to read the content of test.txt and decode it as UTF-8 (into readable Chinese).

It seems like an easy task, but whether I use open(), codecs.open(), etc., each line is always read as str instead of being recognized as bytes.

with codecs.open(input_file, 'rb') as reader:
    for line in reader:
        print(type(line))
        # if it is bytes
        #print(line.decode('utf-8'))

The content of my input file test.txt is exactly as shown below, with the b' prefix marking it as bytes type:

b'\xe5\x95\x8a \xe6\x9c\x89 \xe4\xbb\x80 \xe4\xb9\x88 \xe4\xba\x8b \xe5\x95\x8a \xe6\x9c\x89 \xe4\xbb\x80 \xe4\xb9\x88 \xe4\xba\x8b \xe7\xbb\x99 \xe6\x88\x91 \xe6\x89\x93 \xe7\x94\xb5 \xe8\xaf\x9d \xe5\x95\x8a \xe5\x97\xaf \xe5\x97\xaf \xe5\xa5\xbd \xe5\xa5\xbd \xe5\xa5\xbd \xe5\xa5\xbd \xe5\x86\x8d \xe8\xa7\x81 \xe5\x93\x8e \xe5\x86\x8d \xe8\xa7\x81 \xe5\x97\xaf \xe5\xa5\xbd'

What I expect is shown below, except that the content needs to be read from the file:

>>> line = b'\xe5\x95\x8a \xe6\x9c\x89 \xe4\xbb\x80 \xe4\xb9\x88 \xe4\xba\x8b \xe5\x95\x8a \xe6\x9c\x89 \xe4\xbb\x80 \xe4\xb9\x88 \xe4\xba\x8b \xe7\xbb\x99 \xe6\x88\x91 \xe6\x89\x93 \xe7\x94\xb5 \xe8\xaf\x9d \xe5\x95\x8a \xe5\x97\xaf \xe5\x97\xaf \xe5\xa5\xbd \xe5\xa5\xbd \xe5\xa5\xbd \xe5\xa5\xbd \xe5\x86\x8d \xe8\xa7\x81 \xe5\x93\x8e \xe5\x86\x8d \xe8\xa7\x81 \xe5\x97\xaf \xe5\xa5\xbd'
>>> print(line.decode('utf-8'))
啊 有 什 么 事 啊 有 什 么 事 给 我 打 电 话 啊 嗯 嗯 好 好 好 好 再 见 哎 再 见 嗯 好

How can I do it? I googled a lot, but with no luck. Please help.

The b in the file mode actually prevents the input from being UTF-8 decoded automatically.
– Klaus D., Commented Jul 27, 2022 at 9:55
It isn't clear to me what the issue is. Are you telling me there is literally a b'\xe5 in the file? What does print(repr(open(file, 'rb').read()[:10])) give you?
– juanpa.arrivillaga, Commented Jul 27, 2022 at 16:30
@juanpa.arrivillaga Yes, it is literally b'\xe5 in the file. And the call you provided gives me this: b"b'01941c05"
– Phoenix Bai, Commented Jul 28, 2022 at 1:37
OK, that's the problem. You wrote the string representation of a bytes object to a file. Right now, you would have to eval it to recover the object (you can in this case only because of the way that string representation is implemented). But you should fix the source of this fundamental error.
– juanpa.arrivillaga, Commented Jul 28, 2022 at 1:50
@juanpa.arrivillaga The eval function does the job. Now it works as expected. Thank you so much!
– Phoenix Bai, Commented Jul 28, 2022 at 2:35
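
For readers following the comment thread: the file holds the text of a bytes literal, so that text has to be parsed back into a bytes object before it can be decoded. Below is a minimal sketch that swaps the eval mentioned above for ast.literal_eval from the standard library, which accepts only Python literals and does not run arbitrary code. The helper names decode_bytes_repr and decode_file are hypothetical, and the sketch assumes every non-empty line of the file is exactly one b'...' literal.

```python
import ast


def decode_bytes_repr(text: str, encoding: str = "utf-8") -> str:
    """Parse text that looks like b'...' into bytes, then decode it."""
    value = ast.literal_eval(text.strip())
    if not isinstance(value, bytes):
        raise ValueError(f"expected a bytes literal, got {type(value).__name__}")
    return value.decode(encoding)


def decode_file(path: str) -> list[str]:
    """Decode every non-empty line of a file that stores bytes literals as text."""
    with open(path, encoding="ascii") as reader:
        return [decode_bytes_repr(line) for line in reader if line.strip()]


# A shortened sample of the line from the question, as it appears in the file.
sample = r"b'\xe5\x95\x8a \xe6\x9c\x89 \xe4\xbb\x80 \xe4\xb9\x88 \xe4\xba\x8b'"
decoded = decode_bytes_repr(sample)
print(decoded)
```

Calling decode_file("test.txt") would apply the same conversion line by line. This is a workaround for a file that already exists; it does not replace fixing whatever wrote the literal into the file.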

Daweo · Accepted Answer · 2022-07-27 10:03:13Z

2

You should pass the encoding as an argument when opening the file, that is:

import codecs

with codecs.open("test.txt", encoding="utf-8") as reader:
    for line in reader:
        print(line)
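
The built-in open accepts the same encoding argument, so codecs is not strictly needed. The sketch below assumes a file that contains real UTF-8 bytes (not the text of a b'...' literal); it writes such a file to a temporary location as a stand-in for test.txt and reads it back in text mode.

```python
import os
import tempfile

text = "啊 有 什 么 事"

# Write raw UTF-8 bytes to a temporary file (a stand-in for a real input file).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    path = tmp.name

try:
    # Text mode with an explicit encoding: each line arrives already decoded as str.
    with open(path, encoding="utf-8") as reader:
        lines = [line.rstrip("\n") for line in reader]
finally:
    os.remove(path)

print(lines)
```

If the file instead contains the literal characters b'\xe5..., this approach returns those characters unchanged as a str, which matches what the asker observed.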


3 Comments

Nastor Over a year ago

That, and avoid using "b" in mode parameter when opening a file if you don't want it in binary.

juanpa.arrivillaga Over a year ago

This will read it as a string, and you can just use the built-in open.

Phoenix Bai Over a year ago

Reading it like this is no problem, but the bytes-type data is read as a string, which makes it hard to decode to UTF-8.
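
The question does not show how test.txt was produced. One plausible cause, stated here purely as an assumption, is that the repr of a bytes object was written in text mode (for example via str(data) or repr(data)). The sketch below contrasts that with writing the bytes themselves in binary mode, which leaves a file that decodes as UTF-8 directly. The file names bad.txt and good.txt are hypothetical.

```python
import os
import tempfile

data = "啊 有 什 么 事".encode("utf-8")  # bytes, as the producing code would hold them

with tempfile.TemporaryDirectory() as tmpdir:
    bad_path = os.path.join(tmpdir, "bad.txt")
    good_path = os.path.join(tmpdir, "good.txt")

    # Assumed cause: repr(data) (same text as str(data)) is the literal b'\xe5...'.
    with open(bad_path, "w", encoding="ascii") as f:
        f.write(repr(data))

    # Writing the bytes themselves in binary mode keeps the file valid UTF-8.
    with open(good_path, "wb") as f:
        f.write(data)

    with open(bad_path, "rb") as f:
        bad_raw = f.read()
    with open(good_path, "rb") as f:
        good_raw = f.read()

print(bad_raw[:6])               # starts with the characters b, ', \, x, e, 5
print(good_raw.decode("utf-8"))  # decodes directly to the Chinese text
```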
