Python: decode('utf-8') breaking with odd characters

Question

I'm building a chat and it seems some strange characters are creeping into some of the messages... This is an excerpt from a dictionary containing a bunch of messages.

{'message': '"..." \x85 H.L. Mencken via Midas du Metropole #quotes',...}

Notice the \x85. This is just one example; \x91, \x92, and others show up as well. As far as I can tell, these are bad quotation marks and the like, probably pasted in by someone.

This dictionary is run through the following...

simplejson.dumps(DICTIONARY, indent=4).encode('utf-8')

Which leads to this error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x85 in position 157: invalid start byte

Which originates here in the simplejson module:

s = s.decode('utf-8')

I'm kind of lost here. How can I clean the original input so that I don't run into this problem?

When you say "clean", do you mean getting rid of all non-ASCII characters, or do you just want dumps to produce JSON?
– Anzel, Commented Jan 22, 2015 at 23:55
Ideally the characters would come out translated and usable if possible... getting rid of them will probably ruin the intended presentation of messages. But yeah, the main concern is getting json.dumps to actually work.
– Christopher Reid, Commented Jan 22, 2015 at 23:58
You can try json.dumps(DICTIONARY, ensure_ascii=False, encoding='utf-8'), or use latin-1 as the encoding, to encode and dump to JSON.
– Anzel, Commented Jan 23, 2015 at 0:00

ErikR · Accepted Answer · 2015-01-23 01:51:48Z


Try transforming each value in the dictionary with:

v = v.decode('iso-8859-1')

before passing it to simplejson.

Update: This also works:

simplejson.dumps(DICTIONARY, encoding='iso-8859-1', indent=4)

Some other things to try:

print simplejson.dumps(DICTIONARY, encoding='cp1252')

You will see \u2026 for the \x85 character, but this is the correct Unicode code point for that character.
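To see why the choice of codec matters here, the following minimal sketch compares the two encodings on the bytes from the question. It is written for Python 3, where the bytes literal stands in for the Python 2 str values in the original dictionary; the sample value is made up for illustration.

```python
# Python 3 sketch: a bytes literal plays the role of a Python 2 str.
raw = b'\x85 \x91quoted\x92'  # hypothetical sample value

as_cp1252 = raw.decode('cp1252')      # Windows "ANSI" code page
as_latin1 = raw.decode('iso-8859-1')  # maps each byte to the same code point

print(ascii(as_cp1252))  # '\u2026 \u2018quoted\u2019'
print(ascii(as_latin1))  # '\x85 \x91quoted\x92'
```

Both calls succeed, but only cp1252 turns the bytes into the ellipsis and curly quotes; iso-8859-1 leaves them as C1 control characters.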

edited Jan 23, 2015 at 1:51

answered Jan 23, 2015 at 0:05


7 Comments

Christopher Reid Over a year ago

Only problem with this is that it changes lots of non-problem characters into strange things... some types of commas turn into "â" and π turns into "Ï"

ErikR Over a year ago

What are the \x?? codes for those characters? What code page are you using?

Christopher Reid Over a year ago

Well the most recent break was from … which as I stated is \x85... it's mostly common punctuation in ANSI Hex otherwise... I'm guessing because people are pasting things. I'm not explicitly using any code page as far as I know... Just encoding in utf-8 as you can see from my example

ErikR Over a year ago

try: print simplejson.dumps(DICTIONARY, encoding='iso-8859-1').encode('iso-8859-1')

Christopher Reid Over a year ago

Hmm, that returns the strings in question in the form "\u00e2\u0080\u00a6"
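That output is what you would expect if the message in question was already UTF-8 and got decoded as iso-8859-1. A small Python 3 sketch of the effect (the variable names are only for illustration):

```python
import json

ellipsis_utf8 = '\u2026'.encode('utf-8')      # the three bytes e2 80 a6
misread = ellipsis_utf8.decode('iso-8859-1')  # three characters instead of one

print(json.dumps(misread))                        # "\u00e2\u0080\u00a6"
print(json.dumps(ellipsis_utf8.decode('utf-8')))  # "\u2026"
```

This suggests the data may be a mix of UTF-8 and single-byte encoded messages, which is an assumption based only on the comments above.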


Mark Tolonen · Accepted Answer · 2015-01-23 02:17:16Z

The input strings are encoded in cp1252. Decode them to Unicode strings with .decode before serializing to JSON:

>>> D = {'message': '\x85 H.L. Mencken via \x91Midas\x92 du Metropole'.decode('cp1252')}
>>> D
{'message': u'\u2026 H.L. Mencken via \u2018Midas\u2019 du Metropole'}
>>> import json
>>> print(json.dumps(D))
{"message": "\u2026 H.L. Mencken via \u2018Midas\u2019 du Metropole"}

The next call might raise a UnicodeEncodeError if your terminal's default code page doesn't support the characters, but it demonstrates that the serialization above has the correct Unicode code points.

>>> print(json.dumps(D, ensure_ascii=False))
{"message": "… H.L. Mencken via ‘Midas’ du Metropole"}
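If some messages are already valid UTF-8 while others are cp1252 (which the comments on the other answer hint at, but which is an assumption), one possible approach is to try UTF-8 first and fall back to cp1252. The sketch below is for Python 3; to_text and the sample dictionary are hypothetical names, not part of simplejson or json.

```python
import json

def to_text(value, fallback='cp1252'):
    """Hypothetical helper: decode bytes as UTF-8, else with `fallback`."""
    if not isinstance(value, bytes):
        return value  # already text (or not a string at all)
    try:
        return value.decode('utf-8')
    except UnicodeDecodeError:
        # Bytes undefined in cp1252 (e.g. 0x81) become U+FFFD instead of raising.
        return value.decode(fallback, errors='replace')

# Made-up sample: one cp1252 message, one UTF-8 message.
messages = {
    'a': b'\x85 H.L. Mencken via \x91Midas\x92 du Metropole',
    'b': '\u03c0 \u2026'.encode('utf-8'),
}
cleaned = {key: to_text(value) for key, value in messages.items()}
print(json.dumps(cleaned, indent=4))
```

The fallback is a heuristic: a cp1252 string whose bytes happen to form valid UTF-8 would be decoded as UTF-8, so fixing the encoding where the messages enter the system is the more reliable option when that is possible.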
