Python convert unicode to ASCII

Question

I have a list of strings containing various characters that resemble Latin ones. I get them from a website that I download using urllib2. The website is encoded in UTF-8. However, after trying quite a few variations, I can't figure out how to convert these strings to their plain ASCII equivalents. For example, one of the strings I have is:

u'Atl\xc3\xa9tico Madrid'

In plain text it's "Atlético Madrid"; what I want is to change it to just "Atletico Madrid". If I simply run unidecode on it, I get "AtlA(c)tico Madrid". What am I doing wrong?

Martijn Pieters · Accepted Answer · 2014-08-05 17:51:19Z

You have UTF-8 bytes in a Unicode string. That's not a proper Unicode string; it's Mojibake:

>>> print u'Atl\xc3\xa9tico Madrid'
AtlÃ©tico Madrid

Repair your string first:

>>> u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
u'Atl\xe9tico Madrid'
>>> print u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
Atlético Madrid
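To see why this round trip works, here is a minimal sketch that reproduces the damage from the correct text and then reverses it. The variable names are illustrative and not from the question; the assumption is that the UTF-8 bytes were decoded as Latin-1 somewhere along the way.

```python
# -*- coding: utf-8 -*-
# Sketch: how the Mojibake arises and why the latin1/utf8 round trip undoes it.
# All variable names are illustrative, not taken from the original post.

correct = u'Atl\xe9tico Madrid'          # the intended text, "Atlético Madrid"

utf8_bytes = correct.encode('utf8')      # the bytes a UTF-8 website would send
mojibake = utf8_bytes.decode('latin1')   # wrong codec: every byte becomes one character

# Latin-1 maps code points 0-255 onto the same byte values, so encoding as
# Latin-1 recovers the original bytes exactly; decoding as UTF-8 then succeeds.
repaired = mojibake.encode('latin1').decode('utf8')
```

The two-byte UTF-8 sequence for "é" turns into two separate characters in the damaged string, which is why it is one character longer than the correct one.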

and Unidecode will give you what you expected:

>>> import unidecode
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid')
'AtlA(c)tico Madrid'
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8'))
'Atletico Madrid'
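If installing Unidecode is not an option, a rougher standard-library approximation is possible for accented Latin letters. This is a different technique from the one above, not what Unidecode does internally: it decomposes each character with unicodedata.normalize and drops whatever is not ASCII. The helper name strip_accents_to_ascii is hypothetical.

```python
import unicodedata


def strip_accents_to_ascii(text):
    """Approximate ASCII conversion using only the standard library.

    NFKD normalization splits an accented letter into its base letter plus
    combining marks; encoding with errors='ignore' then discards the marks
    (and any other non-ASCII character).
    """
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Apply it to the repaired string, not to the Mojibake.
result = strip_accents_to_ascii(u'Atl\xe9tico Madrid')
```

Characters with no decomposition, such as u'\xf8' (ø), are silently dropped by this approach, so Unidecode remains the better choice when the input goes beyond simple accents.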

Better still would be to read your data correctly in the first place; you appear to have decoded the data as Latin-1 (or perhaps the Windows CP-1252 codepage) rather than as UTF-8.
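As a rough sketch of reading the data correctly, decode the raw bytes as UTF-8 as soon as they arrive. The bytes literal below stands in for the result of a urllib2 read() call so that the example runs without network access, and decode_body is a hypothetical helper name.

```python
def decode_body(raw_bytes, encoding='utf-8'):
    """Decode a downloaded response body using the site's actual encoding."""
    return raw_bytes.decode(encoding)


# Stand-in for the bytes returned by read() on the urllib2 response.
raw = b'Atl\xc3\xa9tico Madrid'

text = decode_body(raw)          # correct: u'Atl\xe9tico Madrid'
wrong = raw.decode('latin1')     # the mistake: reproduces the Mojibake
```

With the text decoded this way, it can be passed straight to unidecode without any repair step.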

Your solution obviously worked, but after reading the last part, I realized that I could just add .decode("utf-8") right after the urllib2 request read(), so now I can simply run unidecode() without any other encode/decodes. Thanks!
