I have a string in a document that contains some non-ASCII Unicode characters. All I want is to remove them or replace each one with a space (" "). Example:

doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c" 

How do I convert it to the following?

doc = "Hello my name is Ruth! I really like swimming and dancing" 

I already tried this: https://stackoverflow.com/a/20078869/5505608, but nothing happens. I'm using Python 3.

3
  • If the answer you linked didn't work, there's something you're not telling us. Commented May 16, 2017 at 21:32
  • I already tried re.sub(r'[^\x00-\x7F]+',' ', text). The code runs, but nothing changed. @MarkRansom Commented May 17, 2017 at 5:38
  • That's because strings don't update in-place, they're immutable. You need to take the return value of re.sub and assign it back to text. Commented May 17, 2017 at 14:00
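The point made in the last comment can be sketched as follows. This is a minimal illustration using the question's example string; the variable names `result_ignored` and `cleaned` are made up for the example.

```python
import re

doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"

# re.sub returns a NEW string; calling it without using the
# return value leaves `doc` exactly as it was.
re.sub(r'[^\x00-\x7F]+', ' ', doc)
result_ignored = doc  # still contains the non-ASCII characters

# Keep the return value to get the cleaned text.
cleaned = re.sub(r'[^\x00-\x7F]+', ' ', doc)
```

Here each run of non-ASCII characters is replaced by a single space, so `cleaned` still has some doubled and trailing spaces.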

1 Answer

9

You can encode to ASCII and ignore errors (i.e. code points that cannot be converted to an ASCII character).

>>> doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
>>> doc.encode('ascii', errors='ignore')
b'Hello my name is Ruth ! I really like swimming and dancing '

If the trailing whitespace bothers you, strip it off. The result of encode is a bytes object, so if you need a str, decode it again with ASCII. Chaining everything would look like this:

>>> doc.encode('ascii', errors='ignore').strip().decode('ascii')
'Hello my name is Ruth ! I really like swimming and dancing'
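The chained call above still leaves a space before the exclamation mark ("Ruth !"), while the question asks for "Ruth!". One possible way to close that gap is sketched below. The helper name `strip_non_ascii` and the choice of punctuation characters are assumptions for illustration, not part of the original answer.

```python
import re


def strip_non_ascii(text: str) -> str:
    """Drop non-ASCII characters and tidy the leftover whitespace.

    Hypothetical helper: the name and the punctuation set are
    illustrative choices, not taken from the answer.
    """
    # Same approach as the answer: encode to ASCII, ignoring errors.
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse runs of whitespace and trim both ends.
    collapsed = re.sub(r"\s+", " ", ascii_only).strip()
    # Remove the space left behind in front of punctuation.
    return re.sub(r"\s+([!?.,;:])", r"\1", collapsed)


doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
# Assign the result; the original string is never modified in place.
cleaned = strip_non_ascii(doc)
# cleaned == "Hello my name is Ruth! I really like swimming and dancing"
```

Note that the result is assigned to a new name. Calling the function without keeping its return value would leave `doc` unchanged.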

5 Comments

I've already tried encoding; the code runs, but still nothing changes. Thanks for your reply.
My goal is to strip the Unicode characters from tweets that I've streamed. I ran the code on my tweet.txt, which contains 10 tweets.
Which one? @timgeb
The one in the answer.
The Unicode characters still appear after using tweet.encode('ascii', errors='ignore').
