
I have a 500 MB JSON file with a structure like

{"user": "[email protected]","contact":[{"name":"Jack "John","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]} 

The issue is that when I parse this with pandas read_json, I get the error

Unexpected character found when decoding object value

The issue is caused by the unescaped double quote before John in the first element of the contact array.

I need help escaping this quote with sed/awk/Python so that I can load the file directly into pandas.

I am fine with ignoring the contact whose name is Jack as well, but I cannot ignore the complete row.
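For reference, once the stray quote is escaped, pandas parses the record fine; a minimal sketch (the repaired line below is hypothetical, and lines=True assumes one JSON object per line):

import io

import pandas as pd

# Hypothetical repaired record: the stray quote before John escaped as \".
fixed = '{"user": "[email protected]","contact":[{"name":"Jack \\"John","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}'

df = pd.read_json(io.StringIO(fixed), lines=True)  # lines=True: one JSON object per line
print(df)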

  • Is it possible for you to have this quoted at generation time of the JSON? You would almost need to write a parser for this, since it can get ugly pretty fast: "name":"Jack "The big,ugly ripper" Johnson", "number":... Writing a regexp for that would be fairly hard. By the way, in your example you need to have a double quote AFTER John: "name":"Jack "John"" Commented Jun 9, 2021 at 8:19
  • The data I am getting is not generated by me, and it cannot be changed at the source. So can you give me a regex to process it? Commented Jun 9, 2021 at 8:23
  • I just want to add a backslash whenever a double quote appears inside a name. Commented Jun 9, 2021 at 8:26

2 Answers


In general there is no robust way to fix this, nor to ignore any part of the row, nor even to ignore the whole row, because if a quoted string can contain quotes and can also contain :s and ,s, then a messed-up string can look exactly like a valid set of fields.

Having said that, if we target only the "name" field and assume that the associated string cannot contain ,s, then with GNU awk for the 3rd arg to match() we can do:

$ cat tst.awk
{
    while ( match($0,/(.*"name":")(([^,"]*"[^,"]*)+)(".*)/,a) ) {
        gsub(/"/,"",a[2])
        $0 = a[1] a[2] a[4]
    }
    print
}

$ awk -f tst.awk file
{"user": "[email protected]","contact":[{"name":"Jack O'Malley","number":"+1 23456789"},{"name":"Jack Jill Smith-Jones","number":"+1 232324789"}]}
{"user": "[email protected]","contact":[{"name":"X Æ A-12 Musk","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}

I ran the above on this input file:

$ cat file
{"user": "[email protected]","contact":[{"name":"Jack "O'Malley","number":"+1 23456789"},{"name":"Jack "Jill "Smith-Jones","number":"+1 232324789"}]}
{"user": "[email protected]","contact":[{"name":"X Æ A-12 "Musk","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}

and you can do the same using any awk:

$ cat tst.awk
{
    while ( match($0,/"name":"([^,"]*"[^,"]*)+"/) ) {
        tgt = substr($0,RSTART+8,RLENGTH-9)
        gsub(/"/,"",tgt)
        $0 = substr($0,1,RSTART+7) tgt substr($0,RSTART+RLENGTH-1)
    }
    print
}

$ awk -f tst.awk file
{"user": "[email protected]","contact":[{"name":"Jack O'Malley","number":"+1 23456789"},{"name":"Jack Jill Smith-Jones","number":"+1 232324789"}]}
{"user": "[email protected]","contact":[{"name":"X Æ A-12 Musk","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}
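Since the question also asks about Python, here is a rough sketch of the same idea in Python, under the same assumption that a "name" value never contains a comma; the filename 'file' refers to the sample input above, and the pandas step assumes one JSON object per line:

import io
import re

import pandas as pd

# Same assumption as the awk scripts: a "name" value never contains a comma,
# so any quotes strictly inside it can be detected and stripped.
NAME_RE = re.compile(r'("name":")((?:[^,"]*"[^,"]*)+)(")')

def strip_inner_quotes(line):
    # Remove every quote found inside the captured name value.
    return NAME_RE.sub(lambda m: m.group(1) + m.group(2).replace('"', '') + m.group(3), line)

with open('file', encoding='utf-8') as fh:
    cleaned = ''.join(strip_inner_quotes(line) for line in fh)

df = pd.read_json(io.StringIO(cleaned), lines=True)  # one JSON object per line
print(df)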



You can try removing that extra quote using the regex ("[\s\w]*)(")([\s\w]*").

The regex matches an opening quote plus spaces/word characters, then the extra quote, then more spaces/word characters up to the closing quote; the replacement keeps the first and third groups and drops the extra quote in the middle.

This will work for the given problem but for more complex patterns, you may have to tweak the regex.

Complete code:

import re

text_json = '{"user": "[email protected]","contact":[{"name":"Jack "John","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}'
text_json = re.sub(r'("[\s\w]*)(")([\s\w]*")', r"\1\3", text_json)
print(text_json)

Output:

'{"user": "[email protected]","contact":[{"name":"Jack John","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}' 

1 Comment

Would the correct (desired) output here not be {"name":"Jack \"John\"","number":"+1 23456789"}? I believe the original example missed an end quote after John.
