In python3 (at least version 3.11.5 where I'm testing this) and its json module, the behaviour is similar to that of perl and its JSON modules: the input/output is decoded/encoded outside of the json module, by default as per the locale's charset, though the character encoding can be overridden with the PYTHONIOENCODING environment variable.

The C and C.UTF-8 locales (contrary to other locales using UTF-8 as the charset) seem to be a special case, where input/output is decoded/encoded in UTF-8 (even though the charset of the C locale is invariably ASCII), but bytes that don't form part of valid UTF-8 are decoded to code points in the range 0xDC80 to 0xDCFF (those code points fall in the range used for the second half of UTF-16 surrogate pairs, so they are not valid character code points, which makes them safe to use here).
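
To illustrate that error handler outside of any JSON or locale handling, here's a minimal standalone sketch (the byte string is made up for the example):

raw = b'St\xc3\xa9phane \xe9'   # valid UTF-8 "é" followed by a stray lone 0xe9 byte
text = raw.decode('utf-8', 'surrogateescape')
print(hex(ord(text[-1])))                               # 0xdce9: the stray byte became U+DCE9
print(text.encode('utf-8', 'surrogateescape') == raw)   # True: it round-trips losslessly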

The same can be achieved without changing the locale by setting

PYTHONIOENCODING=utf-8:surrogateescape 

Then we can process JSON that is meant to be encoded in UTF-8 overall but may contain strings that are not valid UTF-8.

$ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c ' import json, sys; _ = json.load(sys.stdin); print(hex(ord(_)))' 0xdce9 

The 0xe9 byte is decoded as character 0xdce9.

$ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c ' import json, sys; _ = json.load(sys.stdin); print(_)' | od -An -vtx1 e9 0a 

0xdce9 is encoded back to the 0xe9 byte on output.
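
If in doubt about which codec and error handler the interpreter actually ended up using for the standard streams, they can be inspected directly (a quick check, not specific to JSON):

import sys
print(sys.stdin.encoding, sys.stdin.errors)    # e.g. utf-8 surrogateescape
print(sys.stdout.encoding, sys.stdout.errors)  # e.g. utf-8 surrogateescape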

Example processing the output of lsfd:

$ exec 3> $'\x80\xff'
$ lsfd -Jp "$$" | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
import json, sys
_ = json.load(sys.stdin)
for e in _["lsfd"]:
  if e["assoc"] == "3": print(e["name"])' | sed -n l
/home/chazelas/tmp/\200\377$

Note: if generating some JSON on output, you'll want to pass ensure_ascii=False, otherwise, for bytes that couldn't be decoded as UTF-8, you'd get:

$ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c ' import json, sys; _ = json.load(sys.stdin); print(json.dumps(_))' "\udce9" 

Which most things outside of python would reject.

$ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c ' import json, sys _ = json.load(sys.stdin) print(json.dumps(_, ensure_ascii=False))' | sed -n l "\351"$ 

Also, as noted in the question, if you have two JSON strings that are the result of a UTF-8 encoded string split in the middle of a character, concatenating them in python will not merge those byte sequences back into a character until they're encoded back to UTF-8:

$ printf '{"a":"St\xc3","b":"\xa9phane"}' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c ' import json, sys _ = json.load(sys.stdin) myname = _["a"] + _["b"]; print(len(myname), myname)' 9 Stéphane 

My name has been reconstituted OK on output, but note how the length is incorrect as myname contains the \udcc3 and \udca9 surrogate characters rather than a reconstituted \u00e9 character.

You can force that merging by going through encode and decode steps using the IO encoding:

$ printf '{"a":"St\xc3","b":"\xa9phane"}' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c ' import json,sys _ = json.load(sys.stdin) myname = (_["a"] + _["b"]).encode(sys.stdout.encoding,sys.stdout.errors).decode(sys.stdout.encoding,sys.stdout.errors) print(len(myname), myname)' 8 Stéphane 

In any case, it's also possible to encode/decode in latin1 like in perl, so that the character values match the byte values in the strings, by calling it in a locale that uses that charset or by using PYTHONIOENCODING=latin1.
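
As a sketch of that latin1 approach: since ISO-8859-1 maps every byte 0x00 to 0xFF to the code point of the same value, decoding never fails and the character values are the byte values:

raw = b'St\xc3\xa9phane'
text = raw.decode('latin1')
print(hex(ord(text[2])), hex(ord(text[3])))   # 0xc3 0xa9
print(text.encode('latin1') == raw)           # True: also round-trips losslessly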

vd (visidata), though written in python3, doesn't seem to honour $PYTHONIOENCODING when the input comes from stdin, and in the C or C.UTF-8 locales doesn't seem to do that surrogate escaping (see this issue). However, calling it with --encoding=latin1 with version 2.5 or newer (where that issue was fixed), or in a locale that uses the latin1 charset, seems to work, so you can do:

lsfd -J | binary jq .lsfd | LC_CTYPE=C.iso88591 vd -f json 

That gives you a visual lsfd that doesn't crash if there are command or file names in the output of lsfd -J that are not UTF-8 encoded text.

When the JSON is passed as a file path argument, it seems to decode the input as per the --encoding and --encoding-errors options (which default to utf-8 and surrogateescape respectively), and to honour the locale's charset for output.

So, in a shell with process substitution support such as ksh, zsh, bash (or rc, es, akanga with a different syntax), you can just do:

vd -f json <(lsfd -J | binary jq .lsfd) 

However, I find it sometimes fails randomly for non-regular files such as those pipes (see that other issue). Using a format with one JSON value per line (jsonl) works better:

vd -f jsonl <(lsfd -J | binary jq -c '.lsfd[]') 

Or use the =(...) form of process substitution in zsh (or (...|psub -f) in fish, same as (...|psub) in current versions) that uses a temp file instead of a pipe:

vd -f json =(lsfd -J | binary jq .lsfd) 
Stéphane Chazelas