In Python 3 (at least 3.11.5, where I'm testing this) and its json module, the behaviour is similar to that of perl and its JSON modules: the input/output is decoded/encoded outside of the json module (by Python's IO layer), in this case as per the locale's charset, though that character encoding can be overridden with the PYTHONIOENCODING environment variable.
The C and C.UTF-8 locales (contrary to other locales using UTF-8 as their charset) seem to be a special case, where input/output is decoded/encoded in UTF-8 (even though the charset of the C locale is invariably ASCII), but bytes that are not part of valid UTF-8 are decoded to code points in the range 0xDC80 to 0xDCFF (those land in the range used for the second half of UTF-16 surrogate pairs, so they are not valid character code points, which makes them safe to use here).
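That surrogateescape error handler can be exercised directly; here is a minimal sketch in plain Python, independent of the json module:

```python
# A byte sequence that is not valid UTF-8: a lone 0xe9 byte at the end.
raw = b"caf\xe9"

# Strict decoding rejects it...
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass

# ...but the surrogateescape handler maps the offending byte to the
# lone surrogate U+DCE9 (0xDC00 + 0xE9)...
text = raw.decode("utf-8", errors="surrogateescape")
print(hex(ord(text[-1])))  # 0xdce9

# ...and encoding back with the same handler restores the bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw
```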
The same can be achieved without changing the locale by setting PYTHONIOENCODING=utf-8:surrogateescape. Then we can process JSON that is meant to be encoded in UTF-8 overall but may contain strings that are not valid UTF-8.
    $ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
    import json, sys; _ = json.load(sys.stdin); print(hex(ord(_)))'
    0xdce9

The 0xe9 byte is decoded as the character 0xdce9.
    $ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
    import json, sys; _ = json.load(sys.stdin); print(_)' | od -An -vtx1
     e9 0a

0xdce9 is encoded back to the 0xe9 byte on output.
Example processing the output of lsfd:
    $ exec 3> $'\x80\xff'
    $ lsfd -Jp "$$" | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
    import json, sys
    _ = json.load(sys.stdin)
    for e in _["lsfd"]:
      if e["assoc"] == "3":
        print(e["name"])' | sed -n l
    /home/chazelas/tmp/\200\377$

Note: if generating some JSON on output, you'll want to pass ensure_ascii=False, as otherwise, for bytes that couldn't be decoded as UTF-8, you'd get:
    $ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
    import json, sys; _ = json.load(sys.stdin); print(json.dumps(_))'
    "\udce9"

Which most things outside of Python would reject.
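The difference can also be reproduced without a shell pipeline; a small sketch, with the 0xe9 byte standing in for any byte that isn't valid UTF-8:

```python
import json

s = b"\xe9".decode("utf-8", "surrogateescape")  # '\udce9'

# With the default ensure_ascii=True, json.dumps emits the lone
# surrogate as a \udce9 escape, which most JSON parsers reject.
print(json.dumps(s))  # "\udce9"

# With ensure_ascii=False, the surrogate is kept as a raw character,
# so encoding with surrogateescape restores the original 0xe9 byte.
out = json.dumps(s, ensure_ascii=False)
assert out.encode("utf-8", "surrogateescape") == b'"\xe9"'
```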
    $ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
    import json, sys
    _ = json.load(sys.stdin)
    print(json.dumps(_, ensure_ascii=False))' | sed -n l
    "\351"$

Also, as noted in the question, if you have two JSON strings that are the result of a UTF-8 encoded string having been split in the middle of a character, concatenating them in JSON will not merge those byte sequences into a character until they're encoded back to UTF-8:
    $ printf '{"a":"St\xc3","b":"\xa9phane"}' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
    import json, sys
    _ = json.load(sys.stdin)
    myname = _["a"] + _["b"]; print(len(myname), myname)'
    9 Stéphane

My name has been reconstituted OK on output, but note how the length is wrong, as myname contains the \udcc3 and \udca9 surrogate characters rather than a reconstituted \u00e9 character.
You can force that merging by going through encode and decode steps using the IO encoding:
    $ printf '{"a":"St\xc3","b":"\xa9phane"}' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
    import json, sys
    _ = json.load(sys.stdin)
    myname = (_["a"] + _["b"]).encode(sys.stdout.encoding, sys.stdout.errors).decode(sys.stdout.encoding, sys.stdout.errors)
    print(len(myname), myname)'
    8 Stéphane

In any case, it's also possible to encode/decode in latin1 as in perl, so that character values match byte values in the strings, by calling python3 in a locale that uses that charset or with PYTHONIOENCODING=latin1.
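The same merge can be sketched in plain Python, without the shell pipeline (assuming utf-8 with surrogateescape as the IO encoding, as above):

```python
# The two halves of UTF-8 "é" (0xc3 0xa9), decoded separately.
a = b"St\xc3".decode("utf-8", "surrogateescape")
b = b"\xa9phane".decode("utf-8", "surrogateescape")

name = a + b
print(len(name))  # 9: '\udcc3' and '\udca9' remain two lone surrogates

# A round-trip through bytes lets the UTF-8 decoder merge them
# back into one character.
fixed = name.encode("utf-8", "surrogateescape").decode("utf-8", "surrogateescape")
print(len(fixed), fixed)  # 8 Stéphane
```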
vd (visidata), though written in Python 3, doesn't seem to honour $PYTHONIOENCODING when the input comes from stdin, and in the C or C.UTF-8 locales doesn't seem to do that surrogate escaping (see this issue); but calling it with --encoding=latin1 (with version 2.5 or newer, where that issue was fixed) or in a locale that uses the latin1 charset seems to work, so you can do:
    lsfd -J | binary jq .lsfd | LC_CTYPE=C.iso88591 vd -f json

for a visual lsfd that doesn't crash when there are command or file names in the output of lsfd -J that are not UTF-8 encoded text.
When the JSON is passed as a file path argument instead, vd seems to decode the input as per the --encoding and --encoding-errors options, which default to utf-8 and surrogateescape respectively, and to honour the locale's charset for output.
So, in a shell with process substitution support such as ksh, zsh, bash (or rc, es, akanga with a different syntax), you can just do:
    vd -f json <(lsfd -J | binary jq .lsfd)

However, I find it sometimes fails randomly for non-regular files such as those pipes (see that other issue). Using a format with one JSON value per line (jsonl) works better:
    vd -f jsonl <(lsfd -J | binary jq -c '.lsfd[]')

Or use the =(...) form of process substitution in zsh (or (...|psub -f) in fish, same as (...|psub) in current versions), which uses a temp file instead of a pipe:
    vd -f json =(lsfd -J | binary jq .lsfd)