I need to read the content of test.txt and decode it as UTF-8 (into readable Chinese).
It seems like an easy task, but whether I use open(), codecs.open(), etc., each line is always read as type str instead of being recognized as bytes.
    import codecs

    input_file = 'test.txt'

    with codecs.open(input_file, 'rb') as reader:
        for line in reader:
            print(type(line))
            # if it is bytes
            # print(line.decode('utf-8'))

My input file test.txt contains exactly the line below, including the b' prefix that marks a bytes literal:
b'\xe5\x95\x8a \xe6\x9c\x89 \xe4\xbb\x80 \xe4\xb9\x88 \xe4\xba\x8b \xe5\x95\x8a \xe6\x9c\x89 \xe4\xbb\x80 \xe4\xb9\x88 \xe4\xba\x8b \xe7\xbb\x99 \xe6\x88\x91 \xe6\x89\x93 \xe7\x94\xb5 \xe8\xaf\x9d \xe5\x95\x8a \xe5\x97\xaf \xe5\x97\xaf \xe5\xa5\xbd \xe5\xa5\xbd \xe5\xa5\xbd \xe5\xa5\xbd \xe5\x86\x8d \xe8\xa7\x81 \xe5\x93\x8e \xe5\x86\x8d \xe8\xa7\x81 \xe5\x97\xaf \xe5\xa5\xbd'
What I expect is the result below, but with the content read from the file:

    >>> line = b'\xe5\x95\x8a \xe6\x9c\x89 \xe4\xbb\x80 \xe4\xb9\x88 \xe4\xba\x8b \xe5\x95\x8a \xe6\x9c\x89 \xe4\xbb\x80 \xe4\xb9\x88 \xe4\xba\x8b \xe7\xbb\x99 \xe6\x88\x91 \xe6\x89\x93 \xe7\x94\xb5 \xe8\xaf\x9d \xe5\x95\x8a \xe5\x97\xaf \xe5\x97\xaf \xe5\xa5\xbd \xe5\xa5\xbd \xe5\xa5\xbd \xe5\xa5\xbd \xe5\x86\x8d \xe8\xa7\x81 \xe5\x93\x8e \xe5\x86\x8d \xe8\xa7\x81 \xe5\x97\xaf \xe5\xa5\xbd'
    >>> print(line.decode('utf-8'))
    啊 有 什 么 事 啊 有 什 么 事 给 我 打 电 话 啊 嗯 嗯 好 好 好 好 再 见 哎 再 见 嗯 好

How can I do it? I googled a lot, but with no luck. Please help.
Comments:

The b in the file mode actually prevents the input from being UTF-8 decoded automatically.

Is there literally b'\xe5 in the file? What does print(repr(open(file, 'rb').read()[:10])) give you?

It looks like the string representation of a bytes object was written to a file. Right now, you would have to eval it to recover the object (you can in this case only because of the way that string representation is implemented). But you should fix the source of this fundamental error.
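
The recovery step described in the last comment could be sketched as below. This is only a sketch, under the assumption that every non-empty line of the file holds the text of one Python bytes literal (the b'...' form shown in the question). It swaps eval for ast.literal_eval, which parses only literals and does not execute arbitrary code. The helper names decode_bytes_literal and decode_file are hypothetical and chosen for illustration.

```python
import ast


def decode_bytes_literal(text):
    """Parse text such as "b'\\xe5\\x95\\x8a'" and decode the bytes as UTF-8."""
    # literal_eval only accepts Python literals, so a bytes literal is parsed
    # without running arbitrary code the way eval() could.
    value = ast.literal_eval(text.strip())
    if not isinstance(value, bytes):
        raise ValueError("expected a bytes literal, got " + type(value).__name__)
    return value.decode("utf-8")


def decode_file(path):
    """Decode every non-empty line of a file that stores bytes literals as text."""
    # The file itself is plain text (the escapes are ordinary ASCII characters),
    # so it is opened in text mode here.
    with open(path, "r", encoding="utf-8") as reader:
        return [decode_bytes_literal(line) for line in reader if line.strip()]


# A shortened sample of the line from the question (first two characters only).
sample = r"b'\xe5\x95\x8a \xe6\x9c\x89'"
decoded = decode_bytes_literal(sample)  # '啊 有'
```

With the file from the question, decode_file('test.txt') would presumably return a one-element list holding the readable Chinese line. As the comment notes, this only treats the symptom: if the program that produced the file can be changed, writing the bytes object to a file opened in 'wb' mode (rather than writing its string representation) would let a plain open(path, encoding='utf-8') read the text directly.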