How to decode a text in python3?

Question

I have the text Aur\xc3\xa9lien and want to decode it with Python 3.8.

I tried the following

    import codecs
    s = "Aur\xc3\xa9lien"
    codecs.decode(s, "urf-8")
    codecs.decode(bytes(s), "urf-8")
    codecs.decode(bytes(s, "utf-8"), "utf-8")

but none of them gives the correct result Aurélien.

How do I do it correctly?

And is there no simple, authoritative page that describes all these encodings for Python?

s = "Aur\xc3\xa9lien"; b = bytes(s, 'latin-1'); print(b.decode('utf-8'))
– user5386938, Commented Feb 4, 2021 at 15:23
Note: your "s" is not really text but a sequence of bytes, so you should prefix the literal with b. You are relying on a Python feature that allows \x escapes for byte values inside an otherwise Unicode string.
– Giacomo Catenazzi, Commented Feb 4, 2021 at 15:48
I read that string from a file. How do I prefix an existing string with a 'b'?
– Alex, Commented Feb 4, 2021 at 15:51
How do you read the string from the file? You are probably using the wrong open call. Which parameters do you pass? Usually open reads a text file and you get Unicode strings (possibly with replacement characters); in no normal case do you get such a "string". To get a byte string, use 'b' in the mode argument of open.
– Giacomo Catenazzi, Commented Feb 4, 2021 at 16:02
Note: you should tag people when replying. (You are tagged automatically because you are the asker, and I am notified when someone replies to my answer.) How do you read the CSV file? Usually I use with open('file.csv', encoding='utf8') as f: for line in f.readlines(): fields = line.split(','). But you may be using a module, such as the csv module? How do you read the file? [Long ago, in earlier 3.x versions, csv was buggy with Unicode files.]
– Giacomo Catenazzi, Commented Feb 4, 2021 at 16:32

jgphilpott · Accepted Answer · 2021-02-04 16:15:42Z

3

First find the encoding of the string, then decode it. To do this, you will need a byte string, which you get by adding the letter 'b' in front of the original string literal.

Try this:

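    import chardet

    s = "Aur\xc3\xa9lien"    # the text as a (mis-decoded) str
    bs = b"Aur\xc3\xa9lien"  # the same data as a byte string

    # Guess the encoding from the raw bytes
    encoding = chardet.detect(bs)["encoding"]

    # Re-encode the str with the detected encoding, then decode as UTF-8
    decoded = s.encode(encoding).decode("utf-8")
    print(decoded)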

If you are reading the text from a file, you can detect the encoding with the magic library; see here: https://stackoverflow.com/a/16203777/1544937
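As a point of comparison, a common heuristic that needs no third-party library is to try a strict UTF-8 decode first and fall back to another codec only if it fails. This is a minimal sketch, not what the answer above does; the function name `decode_bytes` and the latin-1 fallback are assumptions chosen for illustration.

```python
def decode_bytes(data: bytes, fallback: str = "latin-1") -> str:
    """Decode as strict UTF-8; use the fallback codec if that fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # The bytes are not valid UTF-8, so assume the fallback encoding.
        return data.decode(fallback)


from_utf8 = decode_bytes(b"Aur\xc3\xa9lien")   # valid UTF-8 input
from_latin1 = decode_bytes(b"Aur\xe9lien")     # invalid as UTF-8, valid latin-1
print(from_utf8)
print(from_latin1)
```

Strict UTF-8 decoding rarely succeeds by accident on text in another 8-bit encoding, which is why it is usually tried first.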

edited Feb 4, 2021 at 16:15

answered Feb 4, 2021 at 15:24


4 Comments

Alex Over a year ago

How do I know the original string is encoded in 'latin1'?

jgphilpott Over a year ago

@Alex I have updated my answer to programmatically detect the encoding.

jgphilpott Over a year ago

@Alex I also added a link to help you detect the encoding if the text is from a file and not a string in code.

Giacomo Catenazzi Over a year ago

But the problem is not about detecting the encoding (it is clearly UTF-8), and a Python string in theory has no encoding. The problem is that the Python string holds some characters as binary data that were not interpreted as Unicode code points (a hidden, not very well-known feature of Python that most programmers should never see).

mhhabib · Accepted Answer · 2021-02-04 15:25:33Z

1

You have UTF-8 decoded as latin-1, so the solution is to encode as latin-1 then decode as UTF-8.

    s = "Aur\xc3\xa9lien"
    s.encode('latin-1').decode('utf-8')
    print(s.encode('latin-1').decode('utf-8'))

Output

    Aurélien
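To see why this round trip works, it can help to reproduce the damage first. The sketch below starts from the correct text, encodes it as UTF-8, wrongly decodes those bytes as latin-1 to get the string from the question, and then reverses the mistake. The variable names are illustrative and not part of the answer.

```python
# Start from the correct text and reproduce the mis-decoding.
original = "Aurélien"
utf8_bytes = original.encode("utf-8")      # b'Aur\xc3\xa9lien'
mojibake = utf8_bytes.decode("latin-1")    # each byte becomes one character

# The damaged string is exactly the one from the question.
assert mojibake == "Aur\xc3\xa9lien"

# Reverse the mistake: latin-1 turns the characters back into the
# original bytes, and UTF-8 then decodes those bytes correctly.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)
```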

answered Feb 4, 2021 at 15:25


5 Comments

Alex Over a year ago

How do I know it is 'latin-1'?

mhhabib Over a year ago

@Alex In latin1 each character is exactly one byte long. In utf8 a character can consist of more than one byte. Consequently, utf8 has more characters than latin1. Further, if you want to know more about it then you can go through this answer. stackoverflow.com/questions/2708958/…

Alex Over a year ago

But in the actual text there is one character é that is 8 bytes long. No? Sorry I do not understand. é=\xc3\xa9

Axe319 Over a year ago

In UTF-8, é is 16 bits, or 2 bytes, long. You can see this for yourself by assigning it as bytes: b = b'\xc3\xa9'. Its length, len(b), is 2. Get the decimal value of each byte: byte1 = b[0] is 195 and byte2 = b[1] is 169. Then format them as binary: print(f'{byte1:b}') prints 11000011 and print(f'{byte2:b}') prints 10101001. Behind the scenes, UTF-8 reads these bits and translates them into characters, sometimes in chunks of 8 bits and sometimes more.

Axe319 Over a year ago

This answer explains it a lot better than I can: stackoverflow.com/a/27939161/12479639. What it boils down to is that the first 8-bit value read from disk indicates whether the next 8 bits belong to the same character.
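The point made in these comments can be checked directly: the high bits of the first byte say how many bytes the character occupies. The helper below is a simplified sketch with a hypothetical name (`utf8_sequence_length`); it only inspects the leading bit pattern and does not validate the rest of the sequence.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the length of a UTF-8 sequence, judged by its first byte only."""
    if lead_byte < 0x80:            # 0xxxxxxx: single byte (ASCII)
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a lead byte (continuation byte or invalid value)")


encoded = "é".encode("utf-8")
bit_patterns = [f"{byte:08b}" for byte in encoded]

print(len(encoded))                      # 2
print(bit_patterns)                      # ['11000011', '10101001']
print(utf8_sequence_length(encoded[0]))  # 2
```

The first byte, 11000011, starts with 110, so the decoder knows one more byte follows; the second byte, 10101001, starts with 10, which marks it as a continuation byte.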

Giacomo Catenazzi · Accepted Answer · 2021-02-04 15:58:04Z

Your string is not a Unicode sequence but a sequence of bytes, so you should prefix it with b:

    b = b"Aur\xc3\xa9lien"
    b.decode('utf-8')

This gives the expected 'Aurélien'.

If you want to start from the str s, encode it with latin-1 first. latin-1 maps the code points U+0000 to U+00FF one-to-one onto the bytes 0x00 to 0xFF, so it recovers the binary values in your string exactly. Other 8-bit codecs such as mbcs or mac_roman are not guaranteed to do this, because they may place the same characters at different byte values. The result is a bytes object, which you can then decode as shown in the first part of this answer.
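The one-to-one mapping can be verified with a short check. This is a sketch for illustration only; the comparison with mac_roman at the end is an added example that shows the byte values depend on the codec used.

```python
# latin-1 maps code points 0..255 to the bytes 0..255 unchanged.
all_chars = "".join(chr(i) for i in range(256))
assert all_chars.encode("latin-1") == bytes(range(256))

s = "Aur\xc3\xa9lien"

# Encoding with latin-1 recovers the original bytes exactly.
raw = s.encode("latin-1")
print(raw)
print(raw.decode("utf-8"))

# mac_roman stores some of the same characters at other byte values,
# so the resulting bytes differ from the latin-1 result.
mac_raw = s.encode("mac_roman")
print(mac_raw)
```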

I read that string from a file. How do I precede that with a 'b'?
If you read the string from a file, you should say so in your question, along with how you read it. It is not normal to get such a string when reading data from a file; it is far from the default or expected behaviour.
In any case, the second part of this answer tells you what to do if you have a string with binary data.
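One way such a string can arise when reading a file is by opening a UTF-8 file with the wrong encoding. The sketch below assumes that scenario (the question does not say how the file was read) and uses a temporary file as a stand-in for the real one.

```python
import os
import tempfile

# Create a small UTF-8 file to read back (a stand-in for the real file).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("Aurélien,42\n".encode("utf-8"))
    path = tmp.name

try:
    # Wrong encoding: every byte becomes one character.
    with open(path, encoding="latin-1") as f:
        wrong = f.read()

    # Correct encoding: the two bytes of 'é' are decoded together.
    with open(path, encoding="utf-8") as f:
        right = f.read()

    # Binary mode returns bytes, which can then be decoded explicitly.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(repr(wrong))
print(repr(right))
print(repr(raw.decode("utf-8")))
```

Passing the encoding explicitly to open avoids depending on the platform default, which varies between systems.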
