
Okay, so I've got over a hundred JSON files with predictable bad formatting in several places per file.

Instead of using [ ] to indicate an array, they use { } instead.

For example:

"grid": { "C1", "D1", "E1", "C2", "D2", "E2", "F2", "B3", "C3", "D3", "E3", "F3", "B4", "C4", "D4", "E4", "F4", "C5", "D5", "E5", "F5", "C6", "D6", "E6" }, 

Each file has multiple arrays in it with this problem, each with a different key.

I came up with this to fix the above example, but it isn't very universal:

sed 's/^\t\t"grid": {/\t\t"grid": [/; s/"E6" },$/"E6" ],/' myfile.json 

I also tried writing a more complicated awk script, something along these lines:

awk -i '/grid/ { gsub("{","["); gsub("}","]"); print $0 }' myfile.json 

But it replaced the contents of myfile.json to be only the row that contained the string "grid".

Is there a reliable one-liner to fix this issue?

  • Does it have to be universal? Are you continuously going to get a new batch of these files from somewhere that need repair? If so, is there no way to fix the problem upstream, where they are generated? Commented Jul 14, 2022 at 0:07
  • @Kaz, no, I'm creating all the JSON files myself by hand, converting non-digital records. So I am the "upstream" problem, and I have already fixed it. :) I am just trying to save myself the work of fixing the existing problems. Commented Jul 14, 2022 at 0:14
  • Awk doesn't print anything by default. However, if a condition-action pair omits the action, then the { print } action is implicit. Note that gsub doesn't print anything. Commented Jul 14, 2022 at 0:17
  • Newer versions of GNU Awk don't have an -i option for "in place"; rather, there is -i <include-file>. You can use -i inplace to load the inplace extension. The semantics is still the same; the Awk produces output and that output replaces the file. If nothing is printed, the file will be empty. Commented Jul 14, 2022 at 0:18
  • Not sure if this will give you the completely fixed output you need, but change to awk -i '{ ....code ...}1' file. The added 1 will get the other lines to print. Also, agree about using -i; better to get in the habit of awk 'code' file >file.fix && mv file.fix file. There is no system economy in using inplace options; there is always a temp file created. Good luck. Commented Jul 14, 2022 at 0:19
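
Putting the comments' advice together, a minimal sketch (assuming, as in the question, that the stray braces only ever appear on lines containing a known key such as grid, and that nothing else on those lines uses { }):

```shell
# gsub() edits $0 in place but prints nothing; the trailing 1 is an
# always-true pattern whose implicit action prints every line, changed or not.
# Writing to a temp file and renaming it back sidesteps awk's -i/inplace issues.
awk '/grid/ { gsub(/[{]/, "["); gsub(/[}]/, "]") } 1' myfile.json > myfile.fix &&
    mv myfile.fix myfile.json
```

This still needs the key-per-line assumption to hold, which is why the answers below try to be more general.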

3 Answers


I propose the following GNU AWK solution. Let file.json content be

{"hello": 1, "grid": {"C1", "D1", "E1", "C2", "D2", "E2", "F2", "B3", "C3", "D3", "E3", "F3", "B4", "C4", "D4", "E4", "F4", "C5", "D5", "E5", "F5", "C6", "D6", "E6"}, "something": "else"} 

then

awk 'BEGIN{FPAT=".";OFS=""}/grid/&&match($0,/\{[^}]*\}/){$RSTART="[";$(RSTART+RLENGTH-1)="]"}{print}' file.json 

gives output

{"hello": 1, "grid": ["C1", "D1", "E1", "C2", "D2", "E2", "F2", "B3", "C3", "D3", "E3", "F3", "B4", "C4", "D4", "E4", "F4", "C5", "D5", "E5", "F5", "C6", "D6", "E6"], "something": "else"} 

Explanation: first I inform GNU AWK that a field is any single character (.) and that the output field separator (OFS) is the empty string (without that there would be unwanted spaces in the output). Then, for each line containing grid and a literal { followed by zero or more (*) non-} characters ([^}]*) and a literal }, I replace the first character of the match ($RSTART) with [ and the last character ($(RSTART+RLENGTH-1)) with ]. Finally, each line, altered or not, is printed. Note that I use the match function rather than just a regular expression because I need RSTART and RLENGTH, which are set by that function. Since the return value of match is part of the condition, a line that contains grid but no {...} remains unchanged.

(tested in gawk 4.2.1)
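
As a toy illustration of the field mechanics (a hypothetical example, and gawk-specific, since FPAT is a GNU extension):

```shell
# FPAT="." makes every character its own field; assigning to a field
# rebuilds $0 joined with OFS, which is empty here, so exactly one
# character is replaced in place.
echo 'abc' | gawk 'BEGIN{ FPAT="."; OFS="" } { $2="X"; print }'
# prints: aXc
```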


1 Comment

Fails for broken arrays split over multiple lines, and ones with a single element: ideone.com/zo3Fr3
#!/bin/bash
FILE="test.json"
JSON="$(sed -E 's/([}{])/\n\1\n/g' $FILE)"

while :; do
    JQTEST=$(jq '.' <<<"$JSON" 2>&1 | grep "Objects must consist of key:value pairs at line")
    rc=$?
    if [ $rc -eq 0 ]; then
        LINE=$(sed -E "s/.* line ([0-9]+), .*/\1/" <<<"$JQTEST")
        COL=$(sed -E "s/.* column ([0-9]+)$/\1/" <<<"$JQTEST")
        [ "$COL" -ne 1 ] && LINE=$((LINE-1))
        JSON=$(sed -E "$LINE s/\{/[/; $LINE s/}/]/" <<<"$JSON")
    else
        jq '.' <<<"$JSON"   # > "new_${FILE}" or "${FILE}"
        break
    fi
done

$ cat test.json
{
"grid1": {"C1", "D1", "E1", "C2"},
"grid2": {"C1", "D1", "E1", "C2"},
"grid3": {"C1", "D1", "E1", "C2"}
}
$ script.sh
{
  "grid1": [
    "C1",
    "D1",
    "E1",
    "C2"
  ],
  "grid2": [
    "C1",
    "D1",
    "E1",
    "C2"
  ],
  "grid3": [
    "C1",
    "D1",
    "E1",
    "C2"
  ]
}
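
The loop is driven entirely by jq's parse error, which names the line and column of the first brace-array it chokes on; roughly (assuming jq is installed):

```shell
# jq rejects {"C1", "C2"} as an object; the script above greps this
# message from stderr to locate which line's braces to rewrite next.
printf '{"grid": {"C1", "C2"}}\n' | jq . 2>&1 | head -n 1
# e.g. parse error: Objects must consist of key:value pairs at line 1, column ...
```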



How's this? (Update: probably scroll down to the Perl version near the end.)

sed -e 's/{\(\([0-9.]\+\|false\|true\|null\|"[^"]*"\) *[,}]\)/[\1/g' \
    -e 's/\([,[] *\([0-9.]\+\|false\|true\|null\|"[^"]*"\)\)}/\1]/g' file 

In other words, if the thing after {"thing" or before "thing"} is a comma or a curly brace (and not a colon, like you would expect in a proper JSON dictionary), switch the curly to a square bracket. (In the second expression, we will already have replaced any opening curly with a square one, so look for that instead.)

The regex could be made less fugly if your sed supports -E or -r, but unfortunately, this non-standard option is not portable. (In brief, it lets you use the ERE regex dialect instead of BRE, where you mind-numbingly have to backslash grouping parentheses etc.)

Unfortunately, it requires the curly to be on the same line as the contents of the array. Also, like any regex solution, it's not easily able to distinguish between (what looks like JSON inside) a quoted string and actual JSON.

Demo: https://ideone.com/PoZguV

I suppose the same approach could be extended to examine lines which start or end with a lone curly brace, but I'd switch to Awk or Perl for that. In fact, Perl's "slurp mode" perl -0777 could probably handle the entire input file in one go with minor modifications to the regexes.

perl -0777 -pe '
    s/\{(\s*(?:[0-9.]+|false|true|null|"[^"]*")\s*[,}])/[$1/g;
    s/([,[]\s*(?:[0-9.]+|false|true|null|"[^"]*")\s*)\}/$1]/g' file.json 

This removes any reliance on newlines for analyzing the file, since we read all of it into memory, and rely on \s to match any whitespace, including newlines. If you want to modify the file in-place, Perl also supports the -i option, like some versions of sed.

Demo: https://ideone.com/0C4gPt

