convert string represented as unicode code points to utf-8 characters

Question

I have a file that contains ASCII lines like

"\u0627\u0644\u0625\u062f\u0627"

(including the quote marks). I want to output these lines with the actual UTF-8 characters, like

"الإدا"

(These happen to be Arabic, but a solution would presumably work fine for any Unicode code points, at least in the Basic plane.)

If I type in an ASCII string like that to the Python3 interpreter, say

s = '"\u0627\u0644\u0625\u062f\u0627"'

and then ask Python what the value of that variable is, it displays the string in the way I want:

'"الإدا"'

But if I readline() a file containing strings like that, and write each line back out, I just get the ASCII representation back out. In other words, this code:

for s in stdin.readlines(): stdout.write(s)

just gives me back an output file identical to the input file.

How do I convert the read-in string so it writes out as the UTF-8 (not just ASCII) output, including the non-ASCII UTF-8 characters?

I know I can parse the string and handle each \uXXXX sub-string individually using regex, slices and chr(int()). But surely there is a way to use Python's built-in handling of strings represented in this way, so I don't have to parse the strings myself, not to mention being faster. (And yes, if there are improperly represented \u strings in the input, I can deal with the resulting error msgs.)

Does setting the stdout encoding to UTF-8 help? stackoverflow.com/a/52372390/765091
– slothrop, Commented May 16, 2023 at 18:18
@slothrop: No, the output encoding already is set to UTF-8, but just to be sure I tried the reconfigure('utf-8') things and I get the same result. I think the problem with that solution is that ASCII is UTF-8.
– Mike Maxwell, Commented May 16, 2023 at 18:31
Does this answer your question? Not knowing a whole unicode character python
– JosefZ, Commented May 16, 2023 at 18:40

Mark Tolonen · Accepted Answer · 2023-05-16 20:40:39Z

To convert a string of that content, encode as ASCII first to create a byte string, then decode with the 'unicode-escape' codec:

s = r'"\u0627\u0644\u0625\u062f\u0627"'
print(s)
print(s.encode('ascii').decode('unicode-escape'))

Output:

"\u0627\u0644\u0625\u062f\u0627"
"الإدا"

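Applied to the line-by-line loop from the question, the same call can be wrapped in a small helper. This is only a minimal sketch: the name `decode_unicode_escapes` is hypothetical, and a pair of `io.StringIO` objects stands in for `sys.stdin` and `sys.stdout` so the example is self-contained.

```python
import io

def decode_unicode_escapes(text):
    """Replace literal \\uXXXX escapes in an ASCII string with the characters they name."""
    # encode('ascii') raises UnicodeEncodeError if the input is not pure ASCII.
    return text.encode('ascii').decode('unicode-escape')

# Stand-ins for sys.stdin and sys.stdout (an assumption, to keep the demo self-contained).
fake_stdin = io.StringIO('"\\u0627\\u0644\\u0625\\u062f\\u0627"\n')
fake_stdout = io.StringIO()

for line in fake_stdin:
    fake_stdout.write(decode_unicode_escapes(line))

result = fake_stdout.getvalue()
print(ascii(result))  # ascii() re-escapes the text so the code points are visible
```

Note that the unicode-escape codec also interprets other backslash sequences such as \n and \t, so input containing those is changed as well.
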
Writing and reading a file that way:

with open('file.txt', 'w', encoding='unicode-escape') as f:
    f.write('"\u0627\u0644\u0625\u062f\u0627"')

with open('file.txt', 'r', encoding='unicode-escape') as f:
    print(f.read())

Content of file:

"\u0627\u0644\u0625\u062f\u0627"

Output:

"الإدا"

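Since the goal in the question is UTF-8 output, the decoded text can be written to a second file opened with encoding='utf-8'. The following is a sketch under assumptions: the file names `escaped.txt` and `utf8.txt` are placeholders, and a temporary directory is used so the demo leaves nothing behind.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'escaped.txt')  # hypothetical input name
    dst_path = os.path.join(tmp, 'utf8.txt')     # hypothetical output name

    # Create an ASCII input file holding the literal escape sequences.
    with open(src_path, 'w', encoding='ascii') as f:
        f.write('"\\u0627\\u0644\\u0625\\u062f\\u0627"\n')

    # Decode the escapes while reading, then write real UTF-8.
    with open(src_path, encoding='unicode-escape') as src:
        text = src.read()
    with open(dst_path, 'w', encoding='utf-8') as dst:
        dst.write(text)

    # Read the result back as raw bytes to inspect the encoding.
    with open(dst_path, 'rb') as f:
        utf8_bytes = f.read()

print(utf8_bytes)  # raw UTF-8 bytes of the output file
```
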
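To support escapes that spell out surrogate pairs, one more step is needed. The unicode-escape codec decodes each half of a pair as a separate surrogate code point, and these need to be combined into actual Unicode code points. The surrogatepass error handler allows that, but it requires another encode/decode cycle: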

s = r'"\ud83c\uddfa\ud83c\uddf8"'
print(s)
print(s.encode('ascii').decode('unicode-escape').encode('utf-16le', errors='surrogatepass').decode('utf-16le'))

Output:

"\ud83c\uddfa\ud83c\uddf8"
"🇺🇸"

with open('file.txt', 'w', encoding='unicode-escape') as f:
    f.write('"\ud83c\uddfa\ud83c\uddf8"')

with open('file.txt', encoding='unicode-escape') as f:
    data = f.read().encode('utf-16le', errors='surrogatepass').decode('utf-16le')

print(data)
print(ascii(data))  # To see the Unicode codepoints

Output:

"🇺🇸"
'"\U0001f1fa\U0001f1f8"'

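Both steps can be combined into one helper so that ordinary escapes and surrogate pairs take the same path. This is a minimal sketch, and the function name `unescape` is hypothetical.

```python
def unescape(text):
    """Decode \\uXXXX escapes and join surrogate pairs into single code points."""
    # Step 1: turn escapes into characters; each pair becomes two lone surrogates.
    decoded = text.encode('ascii').decode('unicode-escape')
    # Step 2: round-trip through UTF-16 so each surrogate pair is merged.
    return decoded.encode('utf-16le', errors='surrogatepass').decode('utf-16le')

samples = [
    r'"\u0627\u0644\u0625\u062f\u0627"',  # Basic plane only
    r'"\ud83c\uddfa\ud83c\uddf8"',        # two surrogate pairs
]
for sample in samples:
    print(ascii(unescape(sample)))  # ascii() shows the resulting code points
```

The final decode step is strict, so an unpaired surrogate in the input would still raise a UnicodeDecodeError rather than pass through silently.
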
Thanks, that works and is simple. I guess the trick is the 'unicode-escape' arg!
I'll add for those coming here that it turns out my input file has Unicode surrogates, and Python chokes when I try to write those out. For now, I have a try...except to skip over those lines (and print out the exception). It's not clear to me whether the original file is erroneous in not having surrogate pairs, or whether encode...decode just doesn't handle them.
Example input string that triggers this error: \ud83c\uddfa\ud83c\uddf8
@MikeMaxwell Which part of the code triggered the error? I put that string in the file writing portion and it passed correctly on writing, but it wouldn't read it back.
Here's the code I'm using where the surrogate pairs error gets triggered (I'm having trouble inputting code: I'm using the back-tick, but my newlines get removed):
from sys import stdin, stdout
for sLine in stdin.readlines():
    stdout.write(sLine.encode('ascii').decode('unicode-escape'))
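
The error in that loop can be reproduced without any files. The following is a sketch of what appears to happen, assuming stdout is encoded as UTF-8: the unicode-escape step produces lone surrogate code points, which UTF-8 refuses to encode.

```python
line = r'\ud83c\uddfa\ud83c\uddf8'

decoded = line.encode('ascii').decode('unicode-escape')
print(len(decoded))  # four separate surrogate code points, not two characters

try:
    # This mirrors what writing to a UTF-8 stdout has to do.
    decoded.encode('utf-8')
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print('cannot encode:', exc.reason)

# After merging the pairs via UTF-16, UTF-8 encoding succeeds.
merged = decoded.encode('utf-16le', errors='surrogatepass').decode('utf-16le')
utf8_bytes = merged.encode('utf-8')
print(len(merged), len(utf8_bytes))
```

This suggests the input file is not malformed: it writes each character outside the Basic plane as a surrogate pair, and a single encode/decode cycle leaves the two halves unmerged.
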
