A large (and growing) number of utilities on Unix-like systems seem to be choosing the JSON format for data interchange¹, even though JSON strings cannot represent arbitrary file paths, process names, command line arguments and more generally C strings².
For instance, many util-linux, Linux LVM, and systemd utilities, as well as curl, GNU parallel, ripgrep, sqlite3, tree... can output data in JSON, which can then allegedly be parsed programmatically "reliably".
But if some of the strings they're meant to output (like file names) contain text not encoded in UTF-8, that all seems to fall apart.
I see different types of behaviour across utilities in that case:
- those that transform the bytes that can't be decoded, either by replacing them with a replacement character such as `?` (like `exiftool`) or U+FFFD (�), or by using some form of encoding, sometimes in a non-reversible way (`"\\x80"` for instance in `column`)
- those that switch to a different representation, like from `"json-string"` to a `[65, 234]` array of bytes in `journalctl`, or from `{"text":"foo"}` to `{"bytes":"base64-encoded"}` in `rg`
- those that handle it in a bogus way, like `curl`
- and a great majority that just dump the bytes that don't make up valid UTF-8 as-is, that is, with JSON strings containing invalid UTF-8.
Most util-linux utilities are in the last category. For example, with `lsfd`:
```
$ sh -c 'lsfd -Joname -p "$$" --filter "(ASSOC == \"3\")"' 3> $'\x80' | sed -n l
{$
   "lsfd": [$
      {$
         "name": "/home/chazelas/tmp/\200"$
      }$
   ]$
}$
```

That means they output invalid UTF-8, and therefore invalid JSON.
Now, though strictly invalid, that output is still unambiguous and could in theory be post-processed³.
However, I've checked a lot of the JSON processing utilities and none of them were able to process that. They either:
- error out with a decoding error
- replace those bytes with U+FFFD
- fail in some miserable way or another
I feel like I'm missing something. Surely when that format was chosen, that must have been taken into account?
TL;DR
So my questions are:
- Does that JSON format with strings not properly UTF-8 encoded (with some byte values >= 0x80 that don't form part of a valid UTF-8-encoded character) have a name?
- Are there any tools or programming language modules (preferably `perl`, but I'm open to others) that can process that format reliably?
- Or can that format be converted to/from valid JSON so it can be processed by JSON processing utilities such as `jq`, `json_xs`, `mlr`... Preferably in a way that preserves valid JSON strings and without losing information?
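For what it's worth, the most promising workaround I've found so far relies on the fact that ISO-8859-1 (latin1) maps every byte value 0x00..0xFF to the codepoint of equal value: decode the tool's output as if it were latin1, process it, and encode the result back to latin1 to restore the original bytes. A minimal sketch with perl's core JSON::PP module (string operations see mojibake where the input was valid UTF-8, but string contents round-trip byte-for-byte):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;

# Read the raw bytes without decoding them as UTF-8: each byte 0x00..0xFF
# becomes the character U+0000..U+00FF, so the parser never sees any
# invalid UTF-8.
my $bytes = do { local $/; <STDIN> };

my $codec = JSON::PP->new->latin1;  # encode() emits codepoints <= 0xFF as raw bytes
my $data  = $codec->decode($bytes);

# ... process $data here: each string holds the original bytes, one per character ...

print $codec->encode($data);
```

The same trick should work with tools like `jq` by sandwiching them between `iconv -f latin1 -t utf-8` and `iconv -f utf-8 -t latin1`, as long as the processing doesn't itself introduce characters above U+00FF.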
Additional info
Below is the state of my own investigations. That's just supporting data you might find useful: a quick dump, with commands in zsh syntax, run on a Debian unstable system. Sorry for the mess.
`lsfd` (and most util-linux utilities): outputs raw:
```
$ sh -c 'lsfd -Joname -p "$$" --filter "(ASSOC == \"3\")"' 3> $'\x80' | sed -n l
{$
   "lsfd": [$
      {$
         "name": "/home/chazelas/\200"$
      }$
   ]$
}$
```

`column`: escapes ambiguously:
```
$ printf '%s\n' $'St\351phane' 'St\xe9phane' $'a\0b' | column -JC name=firstname
{
   "table": [
      {
         "firstname": "St\\xe9phane"
      },{
         "firstname": "St\\xe9phane"
      },{
         "firstname": "a"
      }
   ]
}
```

Switching to a locale using latin1 (or any single-byte charset covering the whole byte range) helps to get a raw format instead:
```
$ printf '%s\n' $'St\351phane' $'St\ue9phane' | LC_ALL=C.iso88591 column -JC name=firstname | sed -n l
{$
   "table": [$
      {$
         "firstname": "St\351phane"$
      },{$
         "firstname": "St\303\251phane"$
      }$
   ]$
}$
```

`journalctl`: array of bytes:
```
$ logger $'St\xe9phane'
$ journalctl -r -o json | jq 'select(._COMM == "logger").MESSAGE'
[
  83,
  116,
  233,
  112,
  104,
  97,
  110,
  101
]
```
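At least that representation is reversible. Something like this rough perl sketch seems workable to recover the original bytes (assuming `MESSAGE` is either a plain JSON string or that array-of-byte-values form; wide-character output for regular UTF-8 messages is glossed over):

```
$ journalctl -o json | perl -MJSON::PP -lne '
    my $msg = decode_json($_)->{MESSAGE} // next;
    $msg = pack "C*", @$msg if ref $msg eq "ARRAY";  # byte values back to raw bytes
    print $msg;
  '
```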
`curl`: bogus:

```
$ printf '%s\r\n' 'HTTP/1.0 200' $'Test: St\xe9phane' '' | socat -u - tcp-listen:8000,reuseaddr &
$ curl -w '%{header_json}' http://localhost:8000
{"test":["St\uffffffe9phane"]
}
```

Could have made sense with `\U`, except Unicode is now restricted to codepoints up to `\U0010FFFF` only.
`cvtsudoers`: raw
```
$ printf 'Defaults secure_path="/home/St\351phane/bin"' | cvtsudoers -f json | sed -n l
{$
    "Defaults": [$
        {$
            "Options": [$
                { "secure_path": "/home/St\351phane/bin" }$
            ]$
        }$
    ]$
}$
```

`dmesg`: raw
```
$ printf 'St\351phane\n' | sudo tee /dev/kmsg
$ sudo dmesg -J | sed -n /phane/l
         "msg": "St\351phane"$
```

`exiftool`: changes the bytes to `?`
```
$ exiftool -j $'St\xe9phane.txt'
[{
  "SourceFile": "St?phane.txt",
  "ExifToolVersion": 12.65,
  "FileName": "St?phane.txt",
  "Directory": ".",
  "FileSize": "0 bytes",
  "FileModifyDate": "2023:09:30 10:04:21+01:00",
  "FileAccessDate": "2023:09:30 10:04:26+01:00",
  "FileInodeChangeDate": "2023:09:30 10:04:21+01:00",
  "FilePermissions": "-rw-r--r--",
  "Error": "File is empty"
}]
```

`lsar`: interprets byte values as if they were Unicode codepoints, for tar:
```
$ tar cf f.tar $'St\xe9phane.txt' $'St\ue9phane.txt'
$ lsar --json f.tar | grep FileNa
      "XADFileName": "Stéphane.txt",
      "XADFileName": "Stéphane.txt",
```

For zip: URI-encoding:
```
$ bsdtar --format=zip -cf a.zip St$'\351'phane.txt Stéphane.txt
$ lsar --json a.zip | grep FileNa
      "XADFileName": "St%e9phane.txt",
      "XADFileName": "Stéphane.txt",
```

`lsipc`: raw
```
$ ln -s /usr/lib/firefox-esr/firefox-esr $'St\xe9phane'
$ ./$'St\xe9phane' -new-instance
$ lsipc -mJ | grep -a phane | sed -n l
         "command": "./St\351phane -new-instance"$
         "command": "./St\351phane -new-instance"$
```

GNU `parallel`: raw
```
$ parallel --results -.json echo {} ::: $'\xe9' | sed -n l
{ "Seq": 1, "Host": ":", "Starttime": 1696068481.231, "JobRuntime": 0\
.001, "Send": 0, "Receive": 2, "Exitval": 0, "Signal": 0, "Command": \
"echo '\351'", "V": [ "\351" ], "Stdout": "\351\\u000a", "Stderr": ""\
 }$
```

`rg`: switches from `"text":"..."` to `"bytes":"base64..."`
```
$ echo $'St\ue9phane' | rg --json '.*'
{"type":"begin","data":{"path":{"text":"<stdin>"}}}
{"type":"match","data":{"path":{"text":"<stdin>"},"lines":{"text":"Stéphane\n"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"Stéphane"},"start":0,"end":9}]}}
{"type":"end","data":{"path":{"text":"<stdin>"},"binary_offset":null,"stats":{"elapsed":{"secs":0,"nanos":137546,"human":"0.000138s"},"searches":1,"searches_with_match":1,"bytes_searched":10,"bytes_printed":235,"matched_lines":1,"matches":1}}}
{"data":{"elapsed_total":{"human":"0.002445s","nanos":2445402,"secs":0},"stats":{"bytes_printed":235,"bytes_searched":10,"elapsed":{"human":"0.000138s","nanos":137546,"secs":0},"matched_lines":1,"matches":1,"searches":1,"searches_with_match":1}},"type":"summary"}

$ echo $'St\xe9phane' | LC_ALL=C rg --json '.*'
{"type":"begin","data":{"path":{"text":"<stdin>"}}}
{"type":"match","data":{"path":{"text":"<stdin>"},"lines":{"bytes":"U3TpcGhhbmUK"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"St"},"start":0,"end":2},{"match":{"text":"phane"},"start":3,"end":8}]}}
{"type":"end","data":{"path":{"text":"<stdin>"},"binary_offset":null,"stats":{"elapsed":{"secs":0,"nanos":121361,"human":"0.000121s"},"searches":1,"searches_with_match":1,"bytes_searched":9,"bytes_printed":275,"matched_lines":1,"matches":2}}}
{"data":{"elapsed_total":{"human":"0.002471s","nanos":2471435,"secs":0},"stats":{"bytes_printed":275,"bytes_searched":9,"elapsed":{"human":"0.000121s","nanos":121361,"secs":0},"matched_lines":1,"matches":2,"searches":1,"searches_with_match":1}},"type":"summary"}
```

Interesting: the `x-user-defined` encoding:
```
$ echo $'St\xe9\xeaphane' | rg -E x-user-defined --json '.*' | jq -a .data.lines.text
null
"St\uf7e9\uf7eaphane\n"
null
null
```

Non-ASCII bytes are mapped to characters in a private-use area; see https://www.w3.org/International/docs/encoding/#x-user-defined
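Those private-use characters are just 0xF700 + the byte value, so the mapping is losslessly reversible. A sketch of undoing it in perl on rg's output:

```
$ echo $'St\xe9\xeaphane' | rg -E x-user-defined --json '.*' |
    perl -MJSON::PP -ne '
      # undo x-user-defined: byte B >= 0x80 was mapped to U+F700+B
      my $text = decode_json($_)->{data}{lines}{text} // next;
      $text =~ s/([\x{F780}-\x{F7FF}])/chr(ord($1) - 0xF700)/ge;
      print $text;
    ' | sed -n l
St\351\352phane$
```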
`sqlite3`: raw
```
$ sqlite3 -json a.sqlite3 'select * from a' | sed -n l
[{"a":"a"},$
{"a":"\351"}]$
```

`tree`: raw
```
$ tree -J | sed -n l
[$
  {"type":"directory","name":".","contents":[$
    {"type":"file","name":"\355\240\200\355\260\200"},$
    {"type":"file","name":"a.zip"},$
    {"type":"file","name":"f.tar"},$
    {"type":"file","name":"St\303\251phane.txt"},$
    {"type":"link","name":"St\351phane","target":"/usr/lib/firefox-es\
r/firefox-esr"},$
    {"type":"file","name":"St\351phane.txt"}$
  ]}$
,$
  {"type":"report","directories":1,"files":6}$
]$
```

`lslocks`: raw
```
$ lslocks --json | sed -n /phane/l
         "path": "/home/chazelas/1/St\351phane.txt"$
```

JSON processing tools
`jsesc`: accepts but transforms to U+FFFD
```
$ jsesc -j $'\xe9'
"\uFFFD"
```

`jq`: accepts, transforms to U+FFFD, but in a bogus way:
$ print '"a\351b"' | jq -a . "a\ufffd" $ print '"a\351bc"' | jq -a . "a\ufffdbc" gojq: same without the bug
$ echo '"\xe9ab"' | gojq -j . | uconv -x hex \uFFFD\u0061\u0062 json_pp: accepts, transforms to U+FFFD
$ print '"a\351b"' | json_pp -json_opt ascii,pretty "a\ufffdb" json_xs: same
$ print '"a\351b"' | json_xs | uconv -x hex \u0022\u0061\uFFFD\u0062\u0022\u000A Same with -e:
$ print '"\351"' | PERL_UNICODE= json_xs -t none -e 'printf "%x\n", ord($_)' fffd jshon: error
$ printf '{"file":"St\351phane"}' | jshon -e file -u json read error: line 1 column 11: unable to decode byte 0xe9 near '"St' json5: accepts, transforms to U+FFFD
$ echo '"\xe9"' | json5 | uconv -x hex \u0022\uFFFD\u0022 jc: error
```
$ echo 'St\xe9phane' | jc --ls
jc:  Error - ls parser could not parse the input data.
             If this is the correct parser, try setting the locale to C (LC_ALL=C).
             For details use the -d or -dd option. Use "jc -h --ls" for help.
```

`mlr`: accepts, converts to U+FFFD
$ echo '{"f":"St\xe9phane"}' | mlr --json cat | sed -n l [$ {$ "f": "St\357\277\275phane"$ }$ ]$ vd: error
```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1: invalid continuation byte
```

`JSON::Parse`: error
$ echo '"\xe9"'| perl -MJSON::Parse=parse_json -l -0777 -ne 'print parse_json($_)' JSON error at line 1, byte 3/4: Unexpected character '"' parsing string starting from byte 1: expecting bytes in range 80-bf: 'x80-\xbf' at -e line 1, <> chunk 1. jo: error
```
$ echo '\xe9' | jo -a
jo: json.c:1209: emit_string: Assertion `utf8_validate(str)' failed.
zsh: done                 echo '\xe9' |
zsh: IOT instruction      jo -a
```

Can use base64:
```
$ echo '\xe9' | jo a=@-
jo: json.c:1209: emit_string: Assertion `utf8_validate(str)' failed.
zsh: done                 echo '\xe9' |
zsh: IOT instruction      jo a=@-
$ echo '\xe9' | jo a=%-
{"a":"6Qo="}
```

`jsed`: accepts and transforms to U+FFFD
$ echo '{"a":"\xe9"}' | ./jsed get --path a | uconv -x hex \uFFFD% ¹ See zgrep -li json ${(s[:])^"$(man -w)"}/man[18]*/*(N) for a list of commands that may be processing JSON.
² And conversely, C strings cannot represent arbitrary JSON strings, as C strings, contrary to JSON strings, cannot contain NULs.
³ Though handling it could become problematic, as concatenating two such strings could end up forming valid characters and break some assumptions.
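For instance, bytes 0xC3 and 0xA9 are each invalid UTF-8 on their own, but concatenated they form the valid encoding of é, as this quick perl check illustrates:

```
$ perl -MEncode=decode -le '
    for my $bytes ("\xc3", "\xa9", "\xc3\xa9") {
      my $copy = $bytes;  # decode() with FB_CROAK modifies its argument
      my $ok = eval { decode("UTF-8", $copy, Encode::FB_CROAK); 1 };
      print join(" ", map { sprintf "%02X", ord } split //, $bytes),
            ($ok ? ": valid" : ": invalid");
    }'
C3: invalid
A9: invalid
C3 A9: valid
```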