Timeline for Process escape sequences in a string in Python
Current License: CC BY-SA 2.5
20 events
| when | what | action | by | comment | license |
|---|---|---|---|---|---|
| Aug 5, 2022 at 0:19 | comment | added | user2357112 | Tried using codecs.decode(myString, 'unicode-escape'), since codecs.decode accepts Unicode input directly. Turns out that still fails on input outside the ASCII range, in the exact same way Apalala pointed out the current version of the answer already fails. | |
| Mar 22, 2022 at 3:24 | comment | added | metatoaster | @DonovanBaarda no, there is no multi-byte utf-8 representation of any unicode codepoint > 127 that produces bytes within the ascii range (0-127), as all bytes of a multi-byte sequence are in the range 128-255 (i.e. 0x80 - 0xff), because the designers of unicode and utf-8 understood this exact issue. In other words, no, it is impossible for str.encode('utf-8') to produce the byte b'\x5c' (0x5c) from anything other than the unicode codepoint U+005C. | |
| Mar 21, 2022 at 11:30 | comment | added | Donovan Baarda | @metatoaster But isn't your solution still a bit fragile, since s.encode('utf-8') produces utf-8 output while decode('unicode_escape') assumes the input is latin-1? Is it possible that the utf-8 encoding introduces some backslash bytes? It would probably work fine most of the time, but if the input string included a unicode character that, when utf-8 encoded, contained a 0x5c backslash byte, that backslash would get escape-processed, which would then break the final decode('utf-8'). | |
| Dec 17, 2020 at 1:38 | comment | added | Glen Whitney | Just wanted to note that metatoaster is correct, unicode_escape does need a latin-1 coded byte sequence, but it's not necessary to make two roundtrips between strings and byte sequences (see alternate answer for python3). | |
| Dec 16, 2020 at 22:26 – Dec 17, 2020 at 0:13 | review | Suggested edits | | | |
| Jul 10, 2018 at 2:41 | comment | added | OpenAI stole this from rspeer | @metatoaster Oh, I see! Yes, that actually does work. Nice. | |
| Jul 6, 2018 at 5:19 | comment | added | metatoaster | @rspeer the whole string, when being decoded as unicode_escape, is bytes, which means it doesn't have any encoding; but unicode_escape is a valid codec that produces the same characters from those bytes as latin1 would from the input string. For ease of illustration please look at this example and see how that actually works through every single step (to ease the effort from having to manually try it on your end). Hence I said "redo the encode/decode bit". | |
| Jul 6, 2018 at 3:39 | comment | added | OpenAI stole this from rspeer | @metatoaster As stated in my answer, that doesn't work if your string contains any characters that aren't in latin-1. | |
| May 25, 2018 at 9:01 | comment | added | metatoaster | Since latin1 is assumed by unicode_escape, redo the encode/decode bit, e.g. s.encode('utf-8').decode('unicode_escape').encode('latin1').decode('utf8') | |
| Mar 28, 2016 at 3:26 | comment | added | Christian Aichinger | Agreed with @Apalala: this is not good enough. Check out rspeer's answer below for a complete solution that works in Python 2 and 3! | |
| Jul 1, 2014 at 19:04 | comment | added | Apalala | This solution is not good enough because it doesn't handle the case in which there are legit unicode characters in the original string. If you try: >>> print("juancarlo\\tañez".encode('utf-8').decode('unicode_escape')) You get: juancarlo aÃ±ez | |
| May 14, 2013 at 8:44 | comment | added | Chris | In Python 2.7, myStr.decode('unicode_escape') seems better than myStr.decode('string_escape'), because it will also unescape unicode \udddd escape sequences into actual unicode characters. For example, r"\u2014".decode('unicode_escape') yields u"\u2014". string_escape, in contrast, leaves unicode escapes untouched. Though note that (at least in my locale) while I can put non-ASCII unicode escapes in myStr, I can't put actual non-ASCII characters in myStr, or decode will give me "UnicodeEncodeError: 'ascii' codec can't encode character" problems. | |
| Feb 13, 2013 at 16:22 – Feb 13, 2013 at 16:27 | review | Suggested edits | | | |
| Feb 17, 2012 at 9:59 | comment | added | Ning Sun | @dln385 Does it work with non-ascii characters? I have some non-ascii chars with \\t. In python2, string-escape just works for that. But in python3, the codec is removed. And the unicode-escape just escapes all non-ascii bytes and breaks my encoding. | |
| Oct 26, 2010 at 6:29 | history | edited | Jerub | added 97 characters in body | CC BY-SA 2.5 |
| Oct 26, 2010 at 6:06 | vote | accept | dln385 | ||
| Oct 26, 2010 at 6:06 | comment | added | dln385 | In Python 3, the command needs to be print(bytes(myString, "utf-8").decode("unicode_escape")) | |
| Oct 26, 2010 at 5:44 | comment | added | dln385 | @Nas Banov The documentation does make a small mention about that: Notice that spelling alternatives that only differ in case or use a hyphen instead of an underscore are also valid aliases; therefore, e.g. 'utf-8' is a valid alias for the 'utf_8' codec. | |
| Oct 26, 2010 at 5:18 | comment | added | Nas Banov | hands down, the best solution! btw, by the docs it should be "string_escape" (with underscore) but for some reason it accepts anything in the pattern 'string escape', 'string@escape' and whatnot... basically 'string\W+escape' | |
| Oct 26, 2010 at 5:01 | history | answered | Jerub | | CC BY-SA 2.5 |
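The failure mode Apalala points out and the latin-1 round-trip metatoaster proposes can both be sketched in Python 3; the string value is just the illustrative one from the comments:

```python
# A string containing a literal backslash escape ("\\t") plus a
# non-ASCII character (ñ), as in Apalala's example.
s = "juancarlo\\tañez"

# Naive approach: 'unicode_escape' treats the UTF-8 bytes as latin-1,
# so the "\t" is processed but multi-byte characters come out as mojibake.
naive = s.encode("utf-8").decode("unicode_escape")
print(naive)  # juancarlo<TAB>aÃ±ez

# metatoaster's fix: re-encode the mojibake as latin-1 (recovering the
# original UTF-8 bytes of the non-ASCII characters) and decode as UTF-8.
fixed = naive.encode("latin1").decode("utf-8")
print(fixed)  # juancarlo<TAB>añez

# The round trip is safe because no byte of a multi-byte UTF-8 sequence
# falls in the ASCII range, so encoding a string can never introduce a
# spurious 0x5c backslash byte (metatoaster's Mar 22, 2022 point).
assert b"\x5c" not in "ñ€\u2014".encode("utf-8")
```

As the timeline notes, this still assumes the input uses only \t/\n/\uXXXX-style escapes that unicode_escape understands; in Python 2 the simpler myStr.decode('string_escape') codec existed, but it was removed in Python 3.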