Decoding bytes as unicode string

Question

The question is how to extract a string that is represented as escaped bytes (warning) inside another string. What I actually mean:

    >>> s1 = '\\xd0\\xb1'  # But these are NOT the bytes of s1! s1 should be 'б'!
    >>> s1
    '\\xd0\\xb1'
    >>> s1[0]
    '\\'
    >>> len(s1)  # The problem is here: I thought I would see (2), but:
    8
    >>> type(s1)
    <class 'str'>
    >>> type(s1[0])
    <class 'str'>
    >>> s1[0] == '\\'
    True

So how can I convert s1 to 'б' (the Cyrillic character that '\xd0\xb1' really represents)? I already asked a similar question here, but my mistake was misunderstanding how s1 is actually represented (I thought that '\' was '\', not '\\').

Deck · Accepted Answer · 2013-11-26 06:45:28Z

4

    >>> s1 = b'\xd0\xb1'
    >>> s1.decode("utf8")
    'б'
    >>> len(s1)
    2


4 Comments

Nafiul Islam Over a year ago

Why do you put a b in there, why not r for raw string?

lvc Over a year ago

@GamesBrainiac because it isn't a raw string - the backslashes are meaningful. The b makes it a byte string. \xd0 is a single byte, with the value 0xD0. You can combine them (making it a raw byte string), but then you trigger the same error as the OP.

Nafiul Islam Over a year ago

I see. Thanks, I did not know that these were byte-strings. Much appreciated :) Come to the python chatroom sometimes, I'm sure we could all learn a lot from you :)

user3034492 Over a year ago

It could be a solution to the problem, but s1 may in principle be defined elsewhere (other sources, data coming from the internet, et cetera). The question is not how to convert '\xd0\xb1' with len == 2 to 'б', but how to convert '\\xd0\\xb1' with len == 8 to 'б'.
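The difference between the literal forms discussed in the comments above can be checked directly. This is only a small illustrative sketch; the variable names are made up for the example and do not come from the original posts.

```python
# Three ways of writing "the same" characters in source code.
as_bytes = b'\xd0\xb1'       # bytes literal: two bytes, 0xD0 and 0xB1
as_raw_str = r'\xd0\xb1'     # raw str literal: eight characters
as_raw_bytes = rb'\xd0\xb1'  # raw bytes literal: eight ASCII bytes

print(len(as_bytes), len(as_raw_str), len(as_raw_bytes))

# The raw str literal is the same object the question starts from.
print(as_raw_str == '\\xd0\\xb1')

# Only the two-byte form decodes directly to the Cyrillic letter.
print(as_bytes.decode('utf-8') == '\u0431')
```

Only the first form holds the two real bytes; the other two hold the backslash, the `x`, and the hex digits as separate items, which is the situation described in the question.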

pepr · Accepted Answer · 2013-11-27 13:54:32Z

Try the following code. Warning: it is only a proof of concept. When the text also contains characters that are not written as escape sequences, the replacement must be done in a more complicated way (I will show it later if wanted). See the comments below.

    import binascii

    s1 = '\\xd0\\xb1'
    print('s1 =', repr(s1), '=', list(s1))  # list() to emphasize what are the characters

    s2 = s1.replace('\\x', '')
    print('s2 =', repr(s2))

    b = binascii.unhexlify(s2)
    print('b =', repr(b), '=', list(b))

    s3 = b.decode('utf8')
    print('s3 =', ascii(s3))

    with open('output.txt', 'w', encoding='utf-8') as f:
        f.write(s3)

It prints on the console:

    c:\__Python\user\so20210201>py a.py
    s1 = '\\xd0\\xb1' = ['\\', 'x', 'd', '0', '\\', 'x', 'b', '1']
    s2 = 'd0b1'
    b = b'\xd0\xb1' = [208, 177]
    s3 = '\u0431'

And it writes the character to the output.txt file.

The problem is that the question combines both unicode escaping and escaping binary values. In other words, the unicode string can contain some sequence that represents binary value somehow; however, you cannot force that binary value into the unicode string directly, because any unicode character is actually an abstract integer, and the integer can be represented in many ways (as a sequence of bytes).
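To illustrate the point that a character is an abstract integer with several possible byte representations, here is a minimal sketch. The choice of encodings is arbitrary and only for illustration.

```python
ch = '\u0431'  # CYRILLIC SMALL LETTER BE, the character from the question

# The abstract integer (code point) behind the character.
code_point = ord(ch)
print(hex(code_point))

# The same code point serialized as bytes under different encodings.
encoded = {enc: ch.encode(enc) for enc in ('utf-8', 'utf-16-le', 'cp1251')}
for enc, raw in encoded.items():
    print(enc, list(raw))
```

The bytes 0xD0 0xB1 from the question are only the UTF-8 form, so they become 'б' only after being decoded as UTF-8.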

If the unicode string contained escape sequences like \\n, it could be done differently, using the 'unicode_escape' prescription for bytes.decode(). However, in this case, you need both decoding from ascii escape sequences and then from utf-8.
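One possible way to chain the two decoding steps mentioned above uses only the standard codecs. This is a sketch, not the answer's own code, and it assumes the input is pure ASCII (the `encode('ascii')` call raises `UnicodeEncodeError` otherwise). The intermediate names are illustrative.

```python
s1 = '\\xd0\\xb1'  # eight characters: backslash, x, d, 0, backslash, x, b, 1

# Step 1: view the text as ASCII bytes so that a bytes codec can be applied.
ascii_bytes = s1.encode('ascii')

# Step 2: interpret the \xNN escapes; each one becomes a character U+0000..U+00FF.
unescaped = ascii_bytes.decode('unicode_escape')

# Step 3: latin-1 maps U+0000..U+00FF one-to-one onto byte values 0..255.
raw = unescaped.encode('latin-1')

# Step 4: the recovered bytes are UTF-8, so decode them as such.
result = raw.decode('utf-8')

print(len(unescaped), list(raw), ascii(result))
```

Note that `unicode_escape` also interprets other escapes such as `\n` or `\u0431`, which may or may not be wanted for a given input.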

Update: Here is a function for converting your kind of strings when they also contain other ASCII characters (i.e. not only the escape sequences). The function uses a finite automaton; it may look too complex at first (actually it is only verbose).

    def userDecode(s):
        status = 0
        lst = []    # result as list of bytes as ints
        xx = None   # variable for one byte escape conversion

        for c in s:  # unicode character
            print(status, ' c ==', c)  ## just for debugging

            if status == 0:
                if c == '\\':
                    status = 1  # escape sequence for a byte starts
                else:
                    lst.append(ord(c))  # convert to integer
            elif status == 1:  # x expected
                assert(c == 'x')
                status = 2
            elif status == 2:  # first nibble expected
                xx = c
                status = 3
            elif status == 3:  # second nibble expected
                xx += c
                lst.append(int(xx, 16))  # this is a hex representation of int
                status = 0

        # Construct the bytes from the ordinal values in the list, and decode
        # it as UTF-8 string.
        return bytes(lst).decode('utf-8')


    if __name__ == '__main__':
        s = userDecode('\\xd0\\xb1whatever')
        print(ascii(s))  # cannot be displayed on console that does not support unicode

        with open('output.txt', 'w', encoding='utf-8') as f:
            f.write(s)

Also look inside the generated file. Remove the debug print once you have seen how it works. With the debug print in place, it displays the following on the console:

    c:\__Python\user\so20210201>b.py
    0  c == \
    1  c == x
    2  c == d
    3  c == 0
    0  c == \
    1  c == x
    2  c == b
    3  c == 1
    0  c == w
    0  c == h
    0  c == a
    0  c == t
    0  c == e
    0  c == v
    0  c == e
    0  c == r
    '\u0431whatever'
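A more compact alternative to the state machine above could use a regular expression over bytes. This is a sketch under stated assumptions, not part of the original answer: the name `decode_hex_escapes` is hypothetical, only `\xNN` escapes are handled, and literal characters are assumed to belong in the same encoding as the escaped bytes (UTF-8 by default).

```python
import re

# Matches a literal backslash, an 'x', and exactly two hex digits.
_HEX_ESCAPE = re.compile(rb'\\x([0-9a-fA-F]{2})')


def decode_hex_escapes(text, encoding='utf-8'):
    """Replace \\xNN escape sequences in text with the bytes they name,
    then decode the whole byte sequence with the given encoding."""
    # Literal characters become their own encoded bytes first.
    raw = text.encode(encoding)
    # Each escape sequence is replaced by the single byte it describes.
    raw = _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return raw.decode(encoding)


if __name__ == '__main__':
    print(ascii(decode_hex_escapes('\\xd0\\xb1whatever')))
```

Unlike `userDecode`, this version does not fail on a backslash that is not followed by `x` and two hex digits; it simply leaves such text unchanged, which may or may not be the desired behavior.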

You are welcome :) Anyway, how did you get the string with the escape sequences?
There is a Flask server. The message (a string) is encrypted with an RSA key on the server side and returned as binary data ... in a string (like s1 in the example). It is fetched with the Requests package on the client side. Bad news: I have no access to the server sources, so I cannot change the format used to send the encrypted message. Update: a few things were missing: 1. The message is encrypted with an RSA key on the server; 2. It is sent to the client as binary data in string format (like s1); 3. It is received by the client and decrypted; 4. The result is something like s1.
I see. Anyway, isn't it some "well known" (not to me) way of escaping the transferred content? If so, there may be a module around for that purpose.
