In `python3` (at least version 3.11.5 where I'm testing this) and its `json` module, the behaviour is similar to that of `perl` and its JSON modules. The input/output is decoded/encoded outside of the `json` module, in this case as per the locale's charset, though the character encoding can be overridden with [the `PYTHONIOENCODING` environment variable](https://docs.python.org/3/using/cmdline.html#envvar-PYTHONIOENCODING).
The C and C.UTF-8 locales (contrary to other locales using UTF-8 as the charset) seem to be a special case, where input/output is
decoded/encoded in UTF-8 (even though the charset of the C locale is invariably ASCII), but bytes that don't form part of valid UTF-8 are decoded to code points in the range 0xDC80 to 0xDCFF (those land among the code points used for the second half of UTF-16 surrogate pairs, so they are not valid character code points, which makes them safe to use here).
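That mapping is the `surrogateescape` error handler of PEP 383, and it can be observed directly with the codec machinery, independently of any locale or I/O setup:

```python
# The "surrogateescape" error handler maps each undecodable byte 0xXY
# to the lone surrogate code point 0xDCXY, and back again on encoding.
b = b'\xe9'                                   # not valid UTF-8 on its own
s = b.decode('utf-8', 'surrogateescape')
print(hex(ord(s)))                            # 0xdce9
print(s.encode('utf-8', 'surrogateescape'))   # b'\xe9', a lossless round trip
```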
The same can be achieved without changing the locale by setting
```
PYTHONIOENCODING=utf-8:surrogateescape
```
Then we can process JSON meant to be overall encoded in UTF-8 but that may contain strings that are not UTF-8.
```
$ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
import json, sys; _ = json.load(sys.stdin); print(hex(ord(_)))'
0xdce9
```
The 0xe9 byte is decoded to the character 0xdce9.
```
$ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
import json, sys; _ = json.load(sys.stdin); print(_)' | od -An -vtx1
e9 0a
```
The 0xdce9 character is encoded back to the 0xe9 byte on output.
Example processing the output of `lsfd`:
```
$ exec 3> $'\x80\xff'
$ lsfd -Jp "$$" | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
import json, sys
_ = json.load(sys.stdin)
for e in _["lsfd"]:
    if e["assoc"] == "3":
        print(e["name"])' | sed -n l
/home/chazelas/tmp/\200\377$
```
Note: if generating some JSON on output, you'll want to pass `ensure_ascii=False`, as otherwise, for bytes that can't be decoded into UTF-8, you'd get:
```
$ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
import json, sys; _ = json.load(sys.stdin); print(json.dumps(_))'
"\udce9"
```
Which most things outside of `python` would reject.
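The reason: a lone surrogate such as U+DCE9 is not a valid Unicode scalar value, so a strict UTF-8 encoder (which is what most JSON consumers use) refuses to encode it:

```python
# A lone surrogate is not a valid character, so the default (strict)
# UTF-8 error handler rejects it:
try:
    '\udce9'.encode('utf-8')
except UnicodeEncodeError as e:
    print(e)   # 'utf-8' codec can't encode character '\udce9' ...
```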
```
$ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
import json, sys
_ = json.load(sys.stdin)
print(json.dumps(_, ensure_ascii=False))' | sed -n l
"\351"$
```
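Put together, the whole round trip can also be done without relying on `$PYTHONIOENCODING` at all, by decoding and encoding explicitly with the `surrogateescape` error handler. A sketch of the same approach in pure python (the sample bytes are made up for illustration):

```python
import json

raw = b'"St\xc3\xa9phane \xe9"'   # UTF-8 JSON with one stray 0xe9 byte
obj = json.loads(raw.decode('utf-8', 'surrogateescape'))
# ensure_ascii=False keeps the surrogate as-is instead of a \udce9 escape,
# and surrogateescape turns it back into the original byte on encoding:
out = json.dumps(obj, ensure_ascii=False).encode('utf-8', 'surrogateescape')
print(out == raw)                 # the bytes round-trip unchanged
```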
Also, as noted in the question, if you have two JSON strings that are the result of a UTF-8-encoded string being split in the middle of a character, concatenating them in `python` will not merge those byte sequences into a single character until they're encoded back to UTF-8:
```
$ printf '{"a":"St\xc3","b":"\xa9phane"}' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
import json, sys
_ = json.load(sys.stdin)
myname = _["a"] + _["b"]; print(len(myname), myname)'
9 Stéphane
```
My name has been reconstituted OK on output, but note how the length is incorrect, as `myname` contains the `\udcc3` and `\udca9` surrogate escape characters rather than a reconstituted `\u00e9` character.
You can force that merging by going through `encode` and `decode` steps using the IO encoding:
```
$ printf '{"a":"St\xc3","b":"\xa9phane"}' |
PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
import json,sys
_ = json.load(sys.stdin)
myname = (_["a"] + _["b"]).encode(
    sys.stdout.encoding, sys.stdout.errors).decode(
    sys.stdout.encoding, sys.stdout.errors)
print(len(myname), myname)'
8 Stéphane
```
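The same merging step can be done with explicit codec arguments rather than `sys.stdout`'s; a minimal sketch, not tied to `$PYTHONIOENCODING`:

```python
# "Stéphane" whose é (0xc3 0xa9 in UTF-8) was split across two strings:
a, b = 'St\udcc3', '\udca9phane'
joined = a + b
print(len(joined))          # 9: the two surrogates are still separate code points
# Encoding to bytes reassembles 0xc3 0xa9, and decoding parses them as é:
merged = joined.encode('utf-8', 'surrogateescape') \
               .decode('utf-8', 'surrogateescape')
print(len(merged), merged)  # 8 Stéphane
```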
In any case, it's also possible to encode/decode in latin1 like in `perl` so that the character values match the byte values in the strings by calling it in a locale that uses that charset or using `PYTHONIOENCODING=latin1`.
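That works because latin1 (ISO-8859-1) maps every byte 0x00 to 0xFF to the code point of the same value, so any byte sequence decodes and re-encodes losslessly. A quick check:

```python
# latin1 is a bijection between bytes 0x00-0xff and code points
# U+0000-U+00FF, so character values match byte values exactly:
data = bytes(range(256))
s = data.decode('latin1')
print(all(ord(c) == b for c, b in zip(s, data)))  # True
print(s.encode('latin1') == data)                 # True: lossless round trip
```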
`vd` (visidata), though written in `python3`, doesn't seem to honour `$PYTHONIOENCODING`, and in the C or C.UTF-8 locales doesn't seem to be doing that surrogate escaping, but calling it in a locale that uses the latin1 charset seems to work, as in:
```
lsfd -J | binary jq .lsfd | LC_CTYPE=C.iso88591 vd -f json
```
That gives a visual `lsfd` that doesn't crash if there are command or file names that are not UTF-8-encoded text.