1

I'm downloading files from S3 that contains JSON (like) data which I intend to parse into a Pandas dataframe using pd.read_json.

My problem is that the files dumped into the S3 bucket use an 'octal escape' formatting for non english characters but Python/Pandas objects to the fact that an escape for the \ character is also included.

An example would be the string: "destination":"Provence-Alpes-C\\303\\264te d\'Azur"

Which prints as:

enter image description here

If I manually remove one of the \ characters then Python happily interprets the string and it prints as:

enter image description here

There is some good stuff in this thread and although .decode('string_escape') works well on an individual snippet, when its part of the much longer string comprising thousands of records then it doesn't work.

I believe that I need a clever way to replace the \\ with \ but for well documented reasons, .replace('\\', '\') doesn't work.

In order to get the files to work at all I used a regex to remove all \ followed by a number: re.sub(r'\\(?=[0-9])', '', g) - I'm thinking that an adaptation of this might be the way forward but the number needs to be dynamic as I don't know what it will be (i.e. using \3 and \2 for the example above isn't going to work')

Help appreciated.

3
  • Thanks Martin, this works well. Very rarely I'm getting an error - chr() arg not in range(256). I've tried to pinpoint exactly where this is falling over although its tough as these files are massive!! Commented Aug 25, 2015 at 15:48
  • See the gist I linked in my second comment. Commented Aug 25, 2015 at 15:57
  • I've been thinking of re-opening actually, because you have octal escapes for UTF-8 bytes, not for Unicode codepoints. Related, not the same. Commented Aug 25, 2015 at 16:02

1 Answer 1

2

Rather than have Python interpret \ooo octal escapes, repair the JSON with a regular expression, then parse it as JSON. I did so before in similar circumstances

Your data has UTF-8 bytes escaped to octal \ooo sequences, so you are looking for a more limited range of values here:

import re invalid_escape = re.compile(r'\\([1-3][0-7]{2}|[1-7][0-7]?)') # octal digits from 1 up to FF def replace_with_codepoint(match): return chr(int(match.group(0)[1:], 8)) def repair(brokenjson): return invalid_escape.sub(replace_with_codepoint, brokenjson) 

Demo:

>>> import json >>> sample = '{"destination":"Provence-Alpes-C\\303\\264te d\'Azur"}' >>> repair(sample) '{"destination":"Provence-Alpes-C\xc3\xb4te d\'Azur"}' >>> json.loads(repair(sample)) {u'destination': u"Provence-Alpes-C\xf4te d'Azur"} >>> print json.loads(repair(sample))['destination'] Provence-Alpes-Côte d'Azur 
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.