
I'm building a chat and it seems some strange characters are creeping into some of the messages... This is an excerpt from a dictionary containing a bunch of messages.

{'message': '"..." \x85 H.L. Mencken via Midas du Metropole #quotes',...} 

Notice the \x85. This is just one example; \x91, \x92, and others show up as well. As far as I can tell, these are bad quotation marks and similar punctuation, probably pasted in by someone.

This dictionary is run through the following...

simplejson.dumps(DICTIONARY, indent=4).encode('utf-8') 

Which leads to this error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x85 in position 157: invalid start byte 

Which originates here in the simplejson module:

s = s.decode('utf-8') 

I'm kind of lost here, how can I clean the original input so that I don't run into this problem?

  • When you say clean, do you mean getting rid of all non-ASCII characters, or do you just want to dump to JSON? Commented Jan 22, 2015 at 23:55
  • Ideally the characters would come out translated and usable if possible... getting rid of them will probably ruin the intended presentation of messages. But yeah, the main concern is getting the json.dumps to actually work. Commented Jan 22, 2015 at 23:58
  • you can try json.dumps(DICTIONARY, ensure_ascii=False, encoding='utf-8'), or use latin-1, to encode and dumps to json Commented Jan 23, 2015 at 0:00

2 Answers

Try transforming each value in the dictionary with:

v = v.decode('iso-8859-1') 

before passing it to simplejson.

Update: This also works:

simplejson.dumps(DICTIONARY, encoding='iso-8859-1', indent=4) 

Some other things to try:

print simplejson.dumps(DICTIONARY, encoding='cp1252') 

You will see \u2026 in place of the \x85 character; that is expected, because U+2026 (HORIZONTAL ELLIPSIS) is the Unicode code point that cp1252 assigns to byte 0x85.
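The difference between the two codecs suggested above can be checked directly. This is a minimal sketch, assuming Python 3, where the raw message is a bytes object (in the Python 2 code above it would be a plain str); the sample bytes are the ones quoted in this thread.

```python
raw = b'\x85 H.L. Mencken via \x91Midas\x92 du Metropole'

# iso-8859-1 maps every byte to the code point with the same number,
# so 0x85 becomes the invisible control character U+0085.
as_latin1 = raw.decode('iso-8859-1')

# cp1252 maps 0x85, 0x91 and 0x92 to real punctuation
# (ellipsis, left and right single quotation marks).
as_cp1252 = raw.decode('cp1252')

print(ascii(as_latin1[0]))   # '\x85'
print(ascii(as_cp1252[0]))   # '\u2026'
```

Both calls succeed without an error, but only the cp1252 result keeps the punctuation the sender presumably intended.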

7 Comments

Only problem with this is that it changes lots of non-problem characters into strange things... some types of commas turn into "â" and π turns into "Ï"
What are the \x?? codes for those characters? What code page are you using?
Well, the most recent break was from the ellipsis character, which as I stated is \x85... otherwise it's mostly common punctuation in ANSI hex. I'm guessing it's because people are pasting things. I'm not explicitly using any code page as far as I know, just encoding in utf-8 as you can see from my example.
try: print simplejson.dumps(DICTIONARY, encoding='iso-8859-1').encode('iso-8859-1')
Hmmm, that returns the strings in question in the form "\u00e2\u0080\u00a6"
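The "â" and "Ï" reported in these comments look like what happens when text that is already valid UTF-8 gets decoded as iso-8859-1. A small check, assuming Python 3 and assuming that this is what happened to those messages:

```python
# UTF-8 bytes for an ellipsis and for pi, as they might arrive in a message.
ellipsis_utf8 = '\u2026'.encode('utf-8')   # three bytes: e2 80 a6
pi_utf8 = '\u03c0'.encode('utf-8')         # two bytes: cf 80

# Decoding those bytes with the wrong codec yields one character per byte.
wrong_ellipsis = ellipsis_utf8.decode('iso-8859-1')
wrong_pi = pi_utf8.decode('iso-8859-1')

print(ascii(wrong_ellipsis))  # '\xe2\x80\xa6'
print(ascii(wrong_pi))        # '\xcf\x80'
```

U+00E2 is "â" and U+00CF is "Ï", which matches the output described above. If that reading is right, the messages are a mix of UTF-8 and cp1252, and no single codec will fit all of them.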

The input strings are encoded in cp1252. Decode them to Unicode strings with .decode('cp1252') before serializing to JSON:

>>> D = {'message': '\x85 H.L. Mencken via \x91Midas\x92 du Metropole'.decode('cp1252')}
>>> D
{'message': u'\u2026 H.L. Mencken via \u2018Midas\u2019 du Metropole'}
>>> import json
>>> print(json.dumps(D))
{"message": "\u2026 H.L. Mencken via \u2018Midas\u2019 du Metropole"}

This one might raise a UnicodeEncodeError if your terminal doesn't support the characters in its default code page, but it demonstrates that the above serialization has the correct Unicode codepoints.

>>> print(json.dumps(D, ensure_ascii=False))
{"message": "… H.L. Mencken via ‘Midas’ du Metropole"}
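Since the comments on the first answer suggest that some messages are already UTF-8 while others are cp1252, one possible way to clean the whole dictionary is to try UTF-8 first and fall back to cp1252. This is only a sketch, assuming Python 3 with raw messages held as bytes; to_text is a hypothetical helper name, not part of json or simplejson.

```python
import json


def to_text(value):
    """Recursively decode bytes: UTF-8 first, cp1252 as a fallback."""
    if isinstance(value, bytes):
        try:
            return value.decode('utf-8')
        except UnicodeDecodeError:
            # cp1252 leaves a few bytes undefined (0x81, for example);
            # errors='replace' turns those into U+FFFD instead of raising.
            return value.decode('cp1252', errors='replace')
    if isinstance(value, dict):
        return {to_text(k): to_text(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_text(v) for v in value]
    return value


# Illustrative data: one cp1252 message and one UTF-8 message.
messages = {
    b'legacy': b'\x85 H.L. Mencken via \x91Midas\x92 du Metropole',
    b'modern': '\u03c0 \u2026'.encode('utf-8'),
}
clean = to_text(messages)
print(json.dumps(clean, indent=4))
```

Trying UTF-8 first works because cp1252 punctuation bytes such as 0x85 rarely form valid UTF-8 sequences, but it is a heuristic: a short cp1252 string could by coincidence also be valid UTF-8 and would then be decoded differently.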
