I'm building a chat and it seems some strange characters are creeping into some of the messages... This is an excerpt from a dictionary containing a bunch of messages.
{'message': '"..." \x85 H.L. Mencken via Midas du Metropole #quotes',...}
Notice the \x85. This is just one example; \x92, \x91, and others show up as well. As far as I can tell, these are bad quotation marks and the like, probably pasted in by someone.
This dictionary is run through the following...
simplejson.dumps(DICTIONARY, indent=4).encode('utf-8')
Which leads to this error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x85 in position 157: invalid start byte
Which originates here in the simplejson module:
s = s.decode('utf-8')
I'm kind of lost here. How can I clean the original input so that I don't run into this problem?
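Tell dumps what encoding your byte strings are actually in. encoding='utf-8' is the default and is exactly what fails here, since 0x85 is not a valid UTF-8 start byte. Use latin-1 instead, which accepts every byte:
json.dumps(DICTIONARY, ensure_ascii=False, encoding='latin-1')
That said, \x85, \x91 and \x92 look like Windows-1252 punctuation (an ellipsis and curly single quotes), so encoding='cp1252' is probably closer to what was pasted in. Note that the encoding argument is a Python 2 thing; the standard json module in Python 3 no longer accepts it.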
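If you would rather clean the input before it ever reaches the JSON encoder, here is a rough sketch of one way to do it. It is written for Python 3, where byte strings are a separate type, and it assumes the stray bytes are Windows-1252 (that is only a guess based on \x85, \x91 and \x92). The helper name to_text and the fallback order are made up for this example, not part of simplejson or json.

```python
import json


def to_text(value, encodings=("utf-8", "cp1252")):
    """Recursively turn byte strings into text (hypothetical helper).

    Each byte string is tried against the encodings in order; latin-1 is the
    last resort because it maps every possible byte to some character.
    """
    if isinstance(value, bytes):
        for enc in encodings:
            try:
                return value.decode(enc)
            except UnicodeDecodeError:
                continue
        return value.decode("latin-1")  # never raises
    if isinstance(value, dict):
        return {to_text(k, encodings): to_text(v, encodings)
                for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_text(v, encodings) for v in value]
    return value  # numbers, None, already-decoded text, ...


# Sample message modelled on the one in the question.
raw = {'message': b'"..." \x85 H.L. Mencken via Midas du Metropole #quotes'}

clean = to_text(raw)

# Everything is text now, so dumping and encoding as UTF-8 is safe.
payload = json.dumps(clean, indent=4, ensure_ascii=False).encode('utf-8')
```

Trying utf-8 first means messages that are already valid UTF-8 pass through untouched, and only the ones that fail get the cp1252 treatment. If your data really is latin-1 rather than Windows-1252, swap the order or drop cp1252 from the tuple.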