0

I have the text Aur\xc3\xa9lien and want to decode it with Python 3.8.

I tried the following

    import codecs

    s = "Aur\xc3\xa9lien"
    codecs.decode(s, "utf-8")
    codecs.decode(bytes(s), "utf-8")
    codecs.decode(bytes(s, "utf-8"), "utf-8")

but none of them gives the correct result Aurélien.

How to do it correctly?

And is there no basic, general authoritative simple page that describes all these encodings for python?

  • s = "Aur\xc3\xa9lien"; b = bytes(s, 'latin-1'); print(b.decode('utf-8')) Commented Feb 4, 2021 at 15:23
  • Note: your "s" is not really text but a sequence of bytes, so you should prefix the literal with b. You are relying on a special feature of Python (which allows byte-valued escapes alongside Unicode characters in a string literal). Commented Feb 4, 2021 at 15:48
  • I read that string from a file. How to precede an existing string with a 'b'? Commented Feb 4, 2021 at 15:51
  • How do you read the string from the file? You are probably using the wrong open call. Which parameters do you pass? Usually open reads a text file and gives you Unicode strings (possibly with replacement characters). In no normal case do you get such a "string". To get a byte string, just use 'b' in the open mode. Commented Feb 4, 2021 at 16:02
  • Note: you should tag people when replying. (You are notified automatically because you are the questioner, and I am notified when someone replies to my answer.) How do you read the CSV file? Usually I use with open('file.csv', encoding='utf8') as f: for line in f.readlines(): fields = line.split(','). But you may be using a module? The csv module? How do you read the file? [Long ago, in earlier 3.x versions, csv was buggy with Unicode files.] Commented Feb 4, 2021 at 16:32
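
For the file-reading case raised in these comments, here is a minimal sketch of how such a string typically arises: UTF-8 bytes on disk that are read with a single-byte codec. The file name and its contents are made up for illustration and are not taken from the question.

```python
import os
import tempfile

# Hypothetical file, written as UTF-8 purely for illustration.
path = os.path.join(tempfile.mkdtemp(), "names.csv")
with open(path, "w", encoding="utf-8") as f:
    f.write("Aurélien\n")

# Reading with the wrong single-byte codec yields one character per byte.
with open(path, encoding="latin-1") as f:
    wrong = f.read().strip()

# Reading with the matching codec yields the intended text.
with open(path, encoding="utf-8") as f:
    right = f.read().strip()

# Binary mode returns the raw bytes, which can then be decoded explicitly.
with open(path, "rb") as f:
    raw = f.read().strip()

print(repr(wrong))          # 'AurÃ©lien', i.e. "Aur\xc3\xa9lien"
print(right)                # Aurélien
print(raw.decode("utf-8"))  # Aurélien
```

If the file is read with encoding='utf-8' (or in binary mode and decoded as UTF-8), no repair step should be needed afterwards.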

3 Answers

3

First find the encoding of the string and then decode it. To do this, you will need a byte string, which you get by adding the letter 'b' in front of the original string literal.

Try this:

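    import chardet

    s = "Aur\xc3\xa9lien"    # the string as given: one character per UTF-8 byte
    bs = b"Aur\xc3\xa9lien"  # the same data as a byte string

    # chardet returns its best guess for the encoding of the raw bytes
    encoding = chardet.detect(bs)["encoding"]
    text = s.encode(encoding).decode("utf-8")  # named text to avoid shadowing the built-in str
    print(text)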

If you are reading the text from a file you can detect the encoding using the magic lib, see here: https://stackoverflow.com/a/16203777/1544937


4 Comments

How do I know the original string is encoded in 'latin1'?
@Alex I have updated my answer to programmatically detect the encoding.
@Alex I also added a link to help you detect the encoding if the text is from a file and not a string in code.
But the problem is not about detecting the encoding (it is clearly UTF-8). And a Python string in theory has no encoding. The problem is that the Python string holds some characters as binary data, not interpreted as Unicode code points (which is a hidden, not very well-known feature of Python that most programmers should never see).
1

You have UTF-8 decoded as latin-1, so the solution is to encode as latin-1 then decode as UTF-8.

    s = "Aur\xc3\xa9lien"
    s.encode('latin-1').decode('utf-8')
    print(s.encode('latin-1').decode('utf-8'))

Output

    Aurélien
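
The round trip above can be wrapped in a small helper that leaves already-correct text alone. This is only a sketch: fix_mojibake is a hypothetical name, and it assumes the damage is exactly "UTF-8 bytes decoded as latin-1".

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as latin-1.

    Hypothetical helper: returns the input unchanged when it does not
    fit that pattern.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters above U+00FF, or the resulting
        # bytes are not valid UTF-8, so assume the text was already fine.
        return text


print(fix_mojibake("Aur\xc3\xa9lien"))  # Aurélien
print(fix_mojibake("Aurélien"))         # Aurélien (unchanged)
```

A string made only of ASCII characters passes through unchanged either way, so the helper is safe to apply to mixed input, although it cannot tell apart text that legitimately contains a sequence such as "Ã©".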

5 Comments

How do I know it is 'latin-1'?
@Alex In latin1 each character is exactly one byte long. In utf8 a character can consist of more than one byte. Consequently, utf8 has more characters than latin1. Further, if you want to know more about it then you can go through this answer. stackoverflow.com/questions/2708958/…
But in the actual text there is one character é that is 8 bytes long. No? Sorry I do not understand. é=\xc3\xa9
é is actually 16 bits or 2 bytes long. You can see this for yourself by assigning it as bytes b = b'\xc3\xa9'. See that the length is 2 len(b). Get the decimal value of both bytes. byte1 = b[0] which is 195 and byte2 = b[1] is 169. Then formatting them as binary. print(f'{byte1:b}') returns '11000011' and print(f'{byte2:b}') returns '10101001'. Behind the scenes utf-8 is reading the binary bits and translating them to the characters they're decoded as. Sometimes in chunks of 8 bits. Sometimes more.
This answer explains it a lot better than I can: stackoverflow.com/a/27939161/12479639. What it boils down to is that the 8-bit value read from disk indicates whether or not the next 8 bits belong to the same character.
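
To make the point in these comments concrete, here is a rough sketch of how the leading bits of the first byte announce the length of a UTF-8 sequence. utf8_sequence_length is a hypothetical helper written for this illustration; it checks only the bit pattern and does not reject every invalid lead byte the way a real decoder would.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes a UTF-8 sequence starting with `lead` spans."""
    if lead < 0x80:             # 0xxxxxxx: ASCII, one byte
        return 1
    if lead >> 5 == 0b110:      # 110xxxxx: two bytes
        return 2
    if lead >> 4 == 0b1110:     # 1110xxxx: three bytes
        return 3
    if lead >> 3 == 0b11110:    # 11110xxx: four bytes
        return 4
    raise ValueError(f"{lead:#04x} is not a UTF-8 lead byte")


b = "é".encode("utf-8")
print(len(b), [f"{byte:08b}" for byte in b])  # 2 ['11000011', '10101001']
print(utf8_sequence_length(b[0]))             # 2

# Combine the payload bits by hand: 5 bits from the lead byte,
# 6 bits from the continuation byte (10xxxxxx).
code_point = ((b[0] & 0b00011111) << 6) | (b[1] & 0b00111111)
print(hex(code_point), chr(code_point))       # 0xe9 é
```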
0

Your string is not a Unicode sequence, so you should prefix it with b:

    b = b"Aur\xc3\xa9lien"
    b.decode('utf-8')

So you have the expected: 'Aurélien'.

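If you want to use s, encode it with latin-1 first. latin-1 maps each code point from U+0000 to U+00FF to the byte with the same value (a one-to-one mapping), so it recovers the binary characters in your string exactly. Other 8-bit encodings such as mbcs or mac_roman are not guaranteed to do this, because they assign some of those characters to different byte values or do not define them at all. You then have a byte string, which you can decode as in the first part of this answer.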
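
As a quick check of the one-to-one mapping, the sketch below encodes every code point from 0 to 255 with latin-1 and compares the result with the matching byte values. The mac_roman line is included only as a contrast and is my own addition, not part of the original answer.

```python
# latin-1 maps code points 0..255 to the byte with the same value.
all_chars = "".join(chr(i) for i in range(256))
assert all_chars.encode("latin-1") == bytes(range(256))

s = "Aur\xc3\xa9lien"
raw = s.encode("latin-1")
print(raw)                  # b'Aur\xc3\xa9lien'
print(raw.decode("utf-8"))  # Aurélien

# Contrast: mac_roman stores "Ã" at a different byte value, so the
# original bytes are not recovered.
print(s.encode("mac_roman") == raw)  # False
```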

3 Comments

I read that string from a file. How do I precede that with a 'b'?
If you read the string from a file, you should say so in your question, along with how you read it. It is not normal to get such a string when reading data from a file. Really, it is far from the default or expected behaviour of reading files.
And in any case, the second part of the answer tells you what to do if you have a string with binary data.
