Timeline for Process escape sequences in a string in Python
Current License: CC BY-SA 2.5
20 events
| when | what | action | by | comment | license |
|---|---|---|---|---|---|
| Aug 5, 2022 at 0:19 | comment | added | user2357112 | Tried using codecs.decode(myString, 'unicode-escape'), since codecs.decode accepts Unicode input directly. Turns out that still fails on input outside the ASCII range, in the exact same way Apalala pointed out the current version of the answer already fails. | |
| Mar 22, 2022 at 3:24 | comment | added | metatoaster | @DonovanBaarda no, there is no multi-byte utf-8 representation of any unicode codepoint > 127 that produces bytes within the ascii range (0-127), as all bytes of a multi-byte sequence are in the range 128-255 (i.e. 0x80 - 0xff), because the designers of unicode and utf-8 understood this exact issue. In other words, no, it is impossible for str.encode('utf-8') to produce the byte b'\x5c' (0x5c) from anything other than the unicode codepoint U+005C. | |
| Mar 21, 2022 at 11:30 | comment | added | Donovan Baarda | @metatoaster But isn't your solution still a bit fragile, since s.encode('utf-8') produces utf-8 output while decode('unicode_escape') assumes the input is latin-1? Is it possible that the utf-8 encoding introduces some backslash bytes? It would probably work fine most of the time, but if the input string included a unicode character that, when utf-8 encoded, contained a 0x5c backslash byte, that backslash would get escape-processed, which would then break the final decode('utf-8'). | |
| Dec 17, 2020 at 1:38 | comment | added | Glen Whitney | Just wanted to note that metatoaster is correct, unicode_escape does need a latin-1 coded byte sequence, but it's not necessary to make two roundtrips between strings and byte sequences (see alternate answer for python3). | |
| Dec 16, 2020 at 22:26 – Dec 17, 2020 at 0:13 | review | Suggested edits | | | |
| Jul 10, 2018 at 2:41 | comment | added | OpenAI stole this from rspeer | @metatoaster Oh, I see! Yes, that actually does work. Nice. | |
| Jul 6, 2018 at 5:19 | comment | added | metatoaster | @rspeer the whole string, when being decoded as unicode_escape, is bytes, which means it doesn't have any encoding; but unicode_escape is a valid codec that produces the same characters from those bytes as latin1 would from the input string. For ease of illustration please look at this example and see how that actually works through every single step (to ease the effort from having to manually try it on your end). Hence I said "redo the encode/decode bit". | |
| Jul 6, 2018 at 3:39 | comment | added | OpenAI stole this from rspeer | @metatoaster As stated in my answer, that doesn't work if your string contains any characters that aren't in latin-1. | |
| May 25, 2018 at 9:01 | comment | added | metatoaster | Since latin1 is assumed by unicode_escape, redo the encode/decode bit, e.g. s.encode('utf-8').decode('unicode_escape').encode('latin1').decode('utf8') | |
| Mar 28, 2016 at 3:26 | comment | added | Christian Aichinger | Agreed with @Apalala: this is not good enough. Check out rspeer's answer below for a complete solution that works in Python 2 and 3! | |
| Jul 1, 2014 at 19:04 | comment | added | Apalala | This solution is not good enough because it doesn't handle the case in which there are legit unicode characters in the original string. If you try: >>> print("juancarlo\\tañez".encode('utf-8').decode('unicode_escape')) You get: juancarlo aÃ±ez | |
| May 14, 2013 at 8:44 | comment | added | Chris | In Python 2.7, myStr.decode('unicode_escape') seems better than myStr.decode('string_escape'), because it will also unescape unicode \udddd escape sequences into actual unicode characters. For example, r"\u2014".decode('unicode_escape') yields u"\u2014". string_escape, in contrast, leaves unicode escapes untouched. Though note that (at least in my locale) while I can put non-ASCII unicode escapes in myStr, I can't put actual non-ASCII characters in myStr, or decode will give me "UnicodeEncodeError: 'ascii' codec can't encode character" problems. | |
| Feb 13, 2013 at 16:22 – Feb 13, 2013 at 16:27 | review | Suggested edits | | | |
| Feb 17, 2012 at 9:59 | comment | added | Ning Sun | @dln385 Does it work with non-ascii characters? I have some non-ascii chars with \\t. In python2, string-escape just works for that. But in python3, the codec is removed. And the unicode-escape just escapes all non-ascii bytes and breaks my encoding. | |
| Oct 26, 2010 at 6:29 | history | edited | Jerub | added 97 characters in body | CC BY-SA 2.5 |
| Oct 26, 2010 at 6:06 | vote | accept | dln385 | ||
| Oct 26, 2010 at 6:06 | comment | added | dln385 | In Python 3, the command needs to be print(bytes(myString, "utf-8").decode("unicode_escape")) | |
| Oct 26, 2010 at 5:44 | comment | added | dln385 | @Nas Banov The documentation does make a small mention about that: Notice that spelling alternatives that only differ in case or use a hyphen instead of an underscore are also valid aliases; therefore, e.g. 'utf-8' is a valid alias for the 'utf_8' codec. | |
| Oct 26, 2010 at 5:18 | comment | added | Nas Banov | hands down, the best solution! btw, by the docs it should be "string_escape" (with underscore) but for some reason it accepts anything in the pattern 'string escape', 'string@escape' and whatnot... basically 'string\W+escape' | |
| Oct 26, 2010 at 5:01 | history | answered | Jerub | | CC BY-SA 2.5 |
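The failure mode Apalala points out and the latin-1 round-trip metatoaster proposes can both be sketched in Python 3; the string value is just the illustrative one from the comments:

```python
# A string containing a literal backslash escape ("\\t") plus a
# non-ASCII character (ñ), as in Apalala's example.
s = "juancarlo\\tañez"

# Naive approach: 'unicode_escape' treats the UTF-8 bytes as latin-1,
# so the "\t" is processed but multi-byte characters come out as mojibake.
naive = s.encode("utf-8").decode("unicode_escape")
print(naive)  # juancarlo<TAB>aÃ±ez

# metatoaster's fix: re-encode the mojibake as latin-1 (recovering the
# original UTF-8 bytes of the non-ASCII characters) and decode as UTF-8.
fixed = naive.encode("latin1").decode("utf-8")
print(fixed)  # juancarlo<TAB>añez

# The round trip is safe because no byte of a multi-byte UTF-8 sequence
# falls in the ASCII range, so encoding a string can never introduce a
# spurious 0x5c backslash byte (metatoaster's Mar 22, 2022 point).
assert b"\x5c" not in "ñ€\u2014".encode("utf-8")
```

As the timeline notes, this still assumes the input uses only \t/\n/\uXXXX-style escapes that unicode_escape understands; in Python 2 the simpler myStr.decode('string_escape') codec existed, but it was removed in Python 3.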