I need to read the content of test.txt and decode it as UTF-8 (into readable Chinese).

It seems like an easy task, but whether I use open(), codecs.open(), etc., each line is always read as a str instead of being recognized as bytes.

    with codecs.open(input_file, 'rb') as reader:
        for line in reader:
            print(type(line))
            # if it is bytes
            #print(line.decode('utf-8'))

The content of my input file test.txt is exactly as shown below, with the b' prefix marking it as a bytes type:

b'\xe5\x95\x8a \xe6\x9c\x89 \xe4\xbb\x80 \xe4\xb9\x88 \xe4\xba\x8b \xe5\x95\x8a \xe6\x9c\x89 \xe4\xbb\x80 \xe4\xb9\x88 \xe4\xba\x8b \xe7\xbb\x99 \xe6\x88\x91 \xe6\x89\x93 \xe7\x94\xb5 \xe8\xaf\x9d \xe5\x95\x8a \xe5\x97\xaf \xe5\x97\xaf \xe5\xa5\xbd \xe5\xa5\xbd \xe5\xa5\xbd \xe5\xa5\xbd \xe5\x86\x8d \xe8\xa7\x81 \xe5\x93\x8e \xe5\x86\x8d \xe8\xa7\x81 \xe5\x97\xaf \xe5\xa5\xbd'

What I expect is shown below, but the content needs to be read from the file:

    >>> line = b'\xe5\x95\x8a \xe6\x9c\x89 \xe4\xbb\x80 \xe4\xb9\x88 \xe4\xba\x8b \xe5\x95\x8a \xe6\x9c\x89 \xe4\xbb\x80 \xe4\xb9\x88 \xe4\xba\x8b \xe7\xbb\x99 \xe6\x88\x91 \xe6\x89\x93 \xe7\x94\xb5 \xe8\xaf\x9d \xe5\x95\x8a \xe5\x97\xaf \xe5\x97\xaf \xe5\xa5\xbd \xe5\xa5\xbd \xe5\xa5\xbd \xe5\xa5\xbd \xe5\x86\x8d \xe8\xa7\x81 \xe5\x93\x8e \xe5\x86\x8d \xe8\xa7\x81 \xe5\x97\xaf \xe5\xa5\xbd'
    >>> print(line.decode('utf-8'))
    啊 有 什 么 事 啊 有 什 么 事 给 我 打 电 话 啊 嗯 嗯 好 好 好 好 再 见 哎 再 见 嗯 好

How can I do it? I googled a lot, but with no luck. Please help.

  • The b in the file mode actually prevents the input from being UTF-8 decoded automatically. Commented Jul 27, 2022 at 9:55
  • It isn't clear to me what the issue is. Are you telling me there is literally a b'\xe5 in the file? What does print(repr(open(file, 'rb').read()[:10])) give you? Commented Jul 27, 2022 at 16:30
  • @juanpa.arrivillaga Yes, it is literally b'\xe5 in the file, and the call you provided gives me this: b"b'01941c05" Commented Jul 28, 2022 at 1:37
  • OK, that's the problem. You wrote the string representation of a bytes object to a file. Right now, you would have to eval it to recover the object (you can in this case only because of the way that string representation is implemented). But you should fix the source of this fundamental error. Commented Jul 28, 2022 at 1:50
  • @juanpa.arrivillaga The eval function does the job. Now it works as expected. Thank you so much! Commented Jul 28, 2022 at 2:35
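
The recovery step described in the comments can be sketched roughly as below. This is only an illustration under stated assumptions: it swaps eval for ast.literal_eval from the standard library, which evaluates literals only and so will not run arbitrary code found in the file. The names decode_bytes_literal, decode_file and sample are hypothetical, and it assumes each non-empty line of the file holds one complete bytes literal.

```python
import ast


def decode_bytes_literal(text: str) -> str:
    """Turn the text of a bytes literal into the decoded string.

    `text` holds the literal characters b, ', \\, x, e, 5, ... as read
    from the file in text mode, not real bytes.
    """
    value = ast.literal_eval(text.strip())
    if not isinstance(value, bytes):
        raise ValueError("expected the text of a bytes literal")
    return value.decode("utf-8")


def decode_file(path: str) -> list[str]:
    """Decode every non-empty line of a file (hypothetical helper)."""
    with open(path, encoding="utf-8") as reader:
        return [decode_bytes_literal(line) for line in reader if line.strip()]


# A shortened version of the line from the question, as raw text.
sample = r"b'\xe5\x95\x8a \xe6\x9c\x89'"
print(decode_bytes_literal(sample))  # 啊 有
```

Calling decode_file("test.txt") would then return one decoded string per line, assuming the file matches the layout shown in the question.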

1 Answer

You should specify the encoding as an argument to open, that is:

    import codecs

    with codecs.open("test.txt", encoding="utf-8") as reader:
        for line in reader:
            print(line)

3 Comments

That, and avoid using "b" in the mode parameter when opening a file if you don't want binary mode.
This will read it as a string, and you can just use the built-in open.
Reading it like this is no problem, but the bytes-type data is read as a string, which makes it hard to decode as UTF-8.
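
The last comment is a symptom of the root cause: the file holds text that only looks like a bytes literal. How the file in the question was produced is not stated, so the following is a guess at one common way this happens, plus one way to avoid it at the source. The file names and the variable data are hypothetical.

```python
import os
import tempfile

data = "啊 有".encode("utf-8")  # real bytes: b'\xe5\x95\x8a \xe6\x9c\x89'

tmp_dir = tempfile.mkdtemp()
bad_path = os.path.join(tmp_dir, "bad.txt")
good_path = os.path.join(tmp_dir, "good.txt")

# Assumed cause: the repr of the bytes object is written as text, so the
# file contains the characters b, ', \, x, e, 5, ... rather than Chinese.
with open(bad_path, "w", encoding="utf-8") as writer:
    writer.write(repr(data))

# One way to avoid it: decode first, then write ordinary text.
with open(good_path, "w", encoding="utf-8") as writer:
    writer.write(data.decode("utf-8"))

with open(bad_path, encoding="utf-8") as reader:
    bad_text = reader.read()
with open(good_path, encoding="utf-8") as reader:
    good_text = reader.read()

print(bad_text)   # b'\xe5\x95\x8a \xe6\x9c\x89'
print(good_text)  # 啊 有
```

With the second form, reading the file with encoding="utf-8" as in the answer yields readable text directly, and no eval step is needed.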
