I have a list of strings containing various characters that resemble Latin ones. I get them from a website that I download with urllib2; the site is encoded in UTF-8. However, after trying quite a few variations, I can't figure out how to convert them to a plain ASCII equivalent. For example, one of the strings I have is:

u'Atl\xc3\xa9tico Madrid' 

In plain text it's "Atlético Madrid", and I want to change it to just "Atletico Madrid". If I simply run unidecode on it, I get "AtlA(c)tico Madrid". What am I doing wrong?

1 Answer

You have UTF-8 bytes in a Unicode string. That's not a proper Unicode string; it's Mojibake:

>>> print u'Atl\xc3\xa9tico Madrid'
AtlÃ©tico Madrid

Repair your string first:

>>> u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
u'Atl\xe9tico Madrid'
>>> print u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
Atlético Madrid

and Unidecode will give you what you expected:

>>> import unidecode
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid')
'AtlA(c)tico Madrid'
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8'))
'Atletico Madrid'
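If you need to apply that repair to a whole list, the round trip can be wrapped in a small helper. This is only a sketch, written for Python 3 rather than the Python 2 used in the question; `repair_mojibake` is a name I made up for illustration, and leaving a string untouched when the round trip fails is my own assumption about what you would want:

```python
def repair_mojibake(text):
    """Undo a wrong Latin-1 decode of UTF-8 data (hypothetical helper).

    Encoding as Latin-1 recovers the original bytes, which are then
    decoded as UTF-8. If either step fails, the string was probably
    not Mojibake, so it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


names = ["Atl\xc3\xa9tico Madrid", "Real Madrid"]
print([repair_mojibake(name) for name in names])
```

The fallback is a heuristic: a correctly decoded string whose Latin-1 bytes happen to form valid UTF-8 would be altered too, so treat this as a stopgap rather than a general-purpose fix.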

Better still would be to read your data correctly in the first place; you appear to have decoded the data as Latin-1 (or perhaps the Windows CP-1252 codepage) rather than as UTF-8.
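To make the difference concrete, here is a minimal sketch in Python 3 (the question uses Python 2, where `urllib2` plays the role of Python 3's `urllib.request`). The `raw` bytes are a hard-coded stand-in for whatever `read()` returns from your request, and `ascii_fold` is a hypothetical stdlib-only substitute for Unidecode that I use here just to keep the example self-contained; it only strips combining accents and covers far fewer characters than Unidecode does:

```python
import unicodedata

# Assumed sample: the UTF-8 bytes a read() on the HTTP response would return.
raw = b"Atl\xc3\xa9tico Madrid"

wrong = raw.decode("latin-1")  # one character per byte: Mojibake
right = raw.decode("utf-8")    # bytes C3 A9 become the single character U+00E9


def ascii_fold(text):
    """Hypothetical stand-in for Unidecode: decompose, then drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(repr(wrong))
print(repr(right))
print(ascii_fold(right))
```

Decoding as UTF-8 once, right where the bytes enter your program, means nothing downstream needs the encode/decode round trip.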

1 Comment

Your solution obviously worked, but after reading the last part, I realized that I could just add .decode("utf-8") right after the urllib2 request read(), so now I can simply run unidecode() without any other encode/decodes. Thanks!
