I have some external data I need to import. How do I encode the input string as unicode/utf8?

Here is an example of a problematic line:

>>> 'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL'.encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 5: ordinal not in range(128)

3 Answers

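.encode("utf8") expects the source to be a unicode string. You are calling it on a "regular" (byte) string, so Python 2 first tries to decode it implicitly with the default "ascii" codec, and that implicit step is what raises the UnicodeDecodeError. You should do something like: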

original_string.decode('original_encoding').encode('utf-8')

In your case my guess would be:

'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL'.decode("iso8859-1").encode("utf8") 
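
Spelled out step by step, the same conversion might look like the sketch below. It assumes the data really is Latin-1 (that is only a guess), the variable names are made up for illustration, and it uses a b'' bytes literal so it runs on Python 2.6+ as well as Python 3.

```python
# Raw bytes as they arrive from the external source (ASSUMED to be Latin-1).
raw = b'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL'

# Step 1: bytes -> unicode text, using the encoding the data was written in.
text = raw.decode('iso8859-1')

# Step 2: unicode text -> bytes in the target encoding.
utf8_bytes = text.encode('utf-8')
```

Keeping the intermediate unicode value around is usually more useful than chaining the two calls, since most of the program should work on text and only encode on output.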

To convert bytes to a Unicode string use decode instead of encode.

Also, that data is not UTF-8. My guess is that it's Latin-1:

>>> print 'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL'.decode("latin1")
Compañía Dominicana de Teléfonos, C. por A. - CODETEL
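
One way to check the "not UTF-8" part is to try a strict UTF-8 decode and see whether it fails. This is only a rough sketch with illustrative names: a failed UTF-8 decode rules UTF-8 out, but a successful Latin-1 decode proves nothing, because Latin-1 maps every possible byte to some character.

```python
raw = b'Compa\xf1\xeda Dominicana de Tel\xe9fonos, C. por A. - CODETEL'

# Strict UTF-8 decoding rejects byte sequences that are not valid UTF-8.
try:
    raw.decode('utf-8')
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

# Latin-1 never raises, so treat the result as a guess to be checked by eye.
text = raw.decode('latin1')
```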

encode converts from a unicode string to a sequence of bytes. decode converts from a sequence of bytes to a unicode string. You want decode, because your data are already encoded.
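
As a small illustration of the two directions (names are illustrative, and the code is written to run on both Python 2.6+ and Python 3.3+), one unicode character can be encoded into different byte sequences, and each of them decodes back to the same character:

```python
text = u'\xf1'  # the single unicode character LATIN SMALL LETTER N WITH TILDE

# encode: unicode string -> bytes (the result depends on the encoding chosen)
as_latin1 = text.encode('latin-1')   # one byte
as_utf8 = text.encode('utf-8')       # two bytes

# decode: bytes -> unicode string (must use the encoding the bytes were written in)
back_from_latin1 = as_latin1.decode('latin-1')
back_from_utf8 = as_utf8.decode('utf-8')
```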

More generally, if you're reading a string from an external source, you always want to decode, because there's no such thing as a "unicode string" out there in the world. There are only representations of that unicode string in various encodings. Unicode strings are like a Platonic ideal that can only be transmitted through the corporeal medium of encodings.
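
For data that comes from a file, a common way to "always decode" is to let the file object do it at the boundary. The sketch below is hedged accordingly: it writes a throwaway temp file to stand in for the external source, and it assumes Latin-1 as the encoding, which you would have to confirm for your own data. io.open is available on Python 2.6+ and is the same as the built-in open on Python 3.

```python
import io
import os
import tempfile

# Create a small Latin-1 encoded file to stand in for the external data.
raw = b'Compa\xf1\xeda Dominicana de Tel\xe9fonos\n'
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(raw)

# io.open decodes while reading, so the rest of the program only sees unicode text.
with io.open(path, encoding='latin-1') as f:
    line = f.readline()

os.remove(path)  # clean up the stand-in file
```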
