
I've run into a bit of a weird behaviour that I don't fully understand with ls and Chinese filenames. I'm running macOS 13.6.1 with SIP enabled (no core OS modifications), MacPorts installed, and US English as the primary language.

First, run this little script in a blank folder to make some test files:

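    import random

    random.seed(42)  # fixed seed so every run generates the same names
    for i in range(30):
        n = random.randrange(3, 8)  # name length: 3 to 7 characters
        fn = "".join(random.choice("一二三") for _ in range(n))
        open(fn, "w").close()  # create an empty file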

This makes up to 30 files named with random combinations of the characters 一二三 (one, two, three); duplicate names collapse into a single file, so the listings below show 29.

Next, I run ls -l on my Mac (version "macOS 13.5" according to the manpage):

    % ls -l
    total 8
    -rw-r--r--@ 1 brx staff 164 Nov 25 02:41 test.py
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一一三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三三一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二一二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一一三一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三三一三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三三三三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三二三一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三二二三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二一一一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二一三三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二三三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三一三一一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三二一三二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三三二三一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一二一一三三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一二三二一一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二一一三三二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二三二三二三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二二一一二一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二二三二一二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二一一一一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二一一一二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二三三三一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一二一三二三三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三一一二二二三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二三三二三二二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二三二一二二一

The files are clearly sorted by filename length, but are otherwise not sorted within identical lengths, as if ls is treating all Chinese characters as being exactly equivalent.

LANG is set to en_US.UTF-8 (and no LC_* variable is set), so maybe this is just a problem with the English sorting?

    % LANG=zh_CN.utf-8 ls -l
    total 8
    -rw-r--r--@ 1 brx staff 164 11 25 02:41 test.py
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 一一三
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 一三二
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 三三一
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 二一二
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 一一三一
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 一三二一
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 三三一三
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 三三三三
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 三二三一
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 三二二三
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 二一一一
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 二一三三
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 一三二三三
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 三一三一一
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 三二一三二
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 一三三二三一
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 一二一一三三
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 一二三二一一
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 二一一三三二
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 二三二三二三
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 二二一一二一
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 二二三二一二
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 一三二一一一一
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 一三二一一一二
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 一三二三三三一
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 一二一三二三三
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 三一一二二二三
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 二三三二三二二
    -rw-r--r--+ 1 brx staff 0 11 25 02:41 二三二一二二一

Right, maybe this is just the Mac built-in ls being crappy; let's try GNU Coreutils (from MacPorts, ls (GNU coreutils) 9.4):

    % gls -l
    total 4
    -rw-r--r--+ 1 brx staff 164 Nov 25 02:41 test.py
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一一三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三三一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二一二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一一三一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三三一三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三三三三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三二三一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三二二三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二一一一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二一三三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二三三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三一三一一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三二一三二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三三二三一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一二一一三三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一二三二一一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二一一三三二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二三二三二三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二二一一二一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二二三二一二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二一一一一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二一一一二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二三三三一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一二一三二三三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三一一二二二三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二三三二三二二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二三二一二二一

    % LANG=zh_CN.utf-8 gls -l
    总计 4
    -rw-r--r--+ 1 brx staff 164 1125日 02:41 test.py
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 一一三
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 一三二
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 三三一
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 二一二
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 一一三一
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 一三二一
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 三三一三
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 三三三三
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 三二三一
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 三二二三
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 二一一一
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 二一三三
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 一三二三三
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 三一三一一
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 三二一三二
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 一三三二三一
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 一二一一三三
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 一二三二一一
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 二一一三三二
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 二三二三二三
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 二二一一二一
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 二二三二一二
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 一三二一一一一
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 一三二一一一二
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 一三二三三三一
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 一二一三二三三
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 三一一二二二三
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 二三三二三二二
    -rw-r--r--+ 1 brx staff 0 1125日 02:41 二三二一二二一

Besides the humorously broken date display from GNU Coreutils, nothing changes. The only thing that kinda seems to work is C.utf-8:

    % LANG=C.utf-8 ls -l
    total 8
    -rw-r--r--@ 1 brx staff 164 Nov 25 02:41 test.py
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ?????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ??????????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ?????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ?????????????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ?????????????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ???????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ?????????????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ??????????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ?????????????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ??????????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ?????????????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ???????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ?????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ???????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ??????????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ?????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ?????????????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ?????????????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ??????????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ??????????????????
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 ??????????????????

    % LANG=C.utf-8 gls -l
    total 4
    -rw-r--r--+ 1 brx staff 164 Nov 25 02:41 test.py
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一一三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一一三一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三三二三一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二一一一一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二一一一二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二三三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一三二三三三一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一二一一三三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一二一三二三三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 一二三二一一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三一一二二二三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三一三一一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三三一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三三一三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三三三三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三二一三二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三二三一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 三二二三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二一一一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二一一三三二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二一三三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二一二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二三三二三二二
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二三二一二二一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二三二三二三
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二二一一二一
    -rw-r--r--+ 1 brx staff 0 Nov 25 02:41 二二三二一二

What's going on here? Are locales broken on my Mac?

EDIT: To clarify the expected behaviour: I would expect ls to sort the characters in any reasonable sorting order; some reasonable orders would be Unicode codepoint (一, 三, 二), numerical or stroke-count order (一, 二, 三), or Pinyin order (二, 三, 一, corresponding to "er, san, yi").
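To make the three "reasonable" orders above concrete, here is a minimal Python sketch. The `NUMERIC` and `PINYIN` tables are hand-written for these three characters only (they are illustrative assumptions, not data taken from any locale definition).

```python
# The three characters used in the test filenames.
chars = ["二", "一", "三"]

# Hand-written lookup tables, for illustration only.
NUMERIC = {"一": 1, "二": 2, "三": 3}
PINYIN = {"一": "yi", "二": "er", "三": "san"}

# Plain string comparison in Python is by Unicode code point:
# 一 is U+4E00, 三 is U+4E09, 二 is U+4E8C.
by_codepoint = sorted(chars)
by_number = sorted(chars, key=NUMERIC.get)
by_pinyin = sorted(chars, key=PINYIN.get)

print(by_codepoint)  # ['一', '三', '二']
print(by_number)     # ['一', '二', '三']
print(by_pinyin)     # ['二', '三', '一']
```

Any of these would be a defensible result; the listings above match none of them across filenames of different lengths.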

Some extra information to answer comments (in en_US.UTF-8 locale):

  • the order stays the same when piped to sort or gsort with or without -u.

  • there is actually no C.utf-8 locale on my system, which explains why I get the same output there as in the C locale: an order by byte value, with each byte rendered as ?.

  • expr '一二三' '<' '三一二', expr '一二三' '>' '三一二', expr '一二三' = '三一二' return 1, 0 and 0 respectively, whether that's with macOS expr or GNU expr.

  • perl -MPOSIX -le 'print strcoll@ARGV' -- '一' '二' outputs -140, whether that's with the perl shipped with macOS or the one from MacPorts.

  • perl -MPOSIX -le 'print strcoll@ARGV' -- '一一' '二' outputs 19968

  • the encodings of those characters look like $'\344\270\200\344\270\200\344\270\211' for 一一三 as reported by gls in the C locale, so it looks like they are properly encoded in UTF-8.

  • the output of perl -MPOSIX -le 'print unpack "H*", strxfrm$_ for @ARGV' -- '一' '一一' '二' is:

    303034323030303030346c32
    30303432303034323030303030346c3230346c32
    303034323030303030346e3e
  • the output of perl -MPOSIX -le 'print "$_\t" . unpack "H*", strxfrm$_ for <*>' is

    一一三一 303034323030343230303432303034323030303030346c3230346c3230346c3b30346c32
    一一三 3030343230303432303034323030303030346c3230346c3230346c3b
    一三三二三一 3030343230303432303034323030343230303432303034323030303030346c3230346c3b30346c3b30346e3e30346c3b30346c32
    一三二一一一一 303034323030343230303432303034323030343230303432303034323030303030346c3230346c3b30346e3e30346c3230346c3230346c3230346c32
    一三二一一一二 303034323030343230303432303034323030343230303432303034323030303030346c3230346c3b30346e3e30346c3230346c3230346c3230346e3e
    一三二一 303034323030343230303432303034323030303030346c3230346c3b30346e3e30346c32
    一三二三三三一 303034323030343230303432303034323030343230303432303034323030303030346c3230346c3b30346e3e30346c3b30346c3b30346c3b30346c32
    一三二三三 30303432303034323030343230303432303034323030303030346c3230346c3b30346e3e30346c3b30346c3b
    一三二 3030343230303432303034323030303030346c3230346c3b30346e3e
    一二一一三三 3030343230303432303034323030343230303432303034323030303030346c3230346e3e30346c3230346c3230346c3b30346c3b
    一二一三二三三 303034323030343230303432303034323030343230303432303034323030303030346c3230346e3e30346c3230346c3b30346e3e30346c3b30346c3b
    一二三二一一 3030343230303432303034323030343230303432303034323030303030346c3230346e3e30346c3b30346e3e30346c3230346c32
    三一一二二二三 303034323030343230303432303034323030343230303432303034323030303030346c3b30346c3230346c3230346e3e30346e3e30346e3e30346c3b
    三一三一一 30303432303034323030343230303432303034323030303030346c3b30346c3230346c3b30346c3230346c32
    三三一三 303034323030343230303432303034323030303030346c3b30346c3b30346c3230346c3b
    三三一 3030343230303432303034323030303030346c3b30346c3b30346c32
    三三三三 303034323030343230303432303034323030303030346c3b30346c3b30346c3b30346c3b
    三二一三二 30303432303034323030343230303432303034323030303030346c3b30346e3e30346c3230346c3b30346e3e
    三二三一 303034323030343230303432303034323030303030346c3b30346e3e30346c3b30346c32
    三二二三 303034323030343230303432303034323030303030346c3b30346e3e30346e3e30346c3b
    二一一一 303034323030343230303432303034323030303030346e3e30346c3230346c3230346c32
    二一一三三二 3030343230303432303034323030343230303432303034323030303030346e3e30346c3230346c3230346c3b30346c3b30346e3e
    二一三三 303034323030343230303432303034323030303030346e3e30346c3230346c3b30346c3b
    二一二 3030343230303432303034323030303030346e3e30346c3230346e3e
    二三三二三二二 303034323030343230303432303034323030343230303432303034323030303030346e3e30346c3b30346c3b30346e3e30346c3b30346e3e30346e3e
    二三二一二二一 303034323030343230303432303034323030343230303432303034323030303030346e3e30346c3b30346e3e30346c3230346e3e30346e3e30346c32
    二三二三二三 3030343230303432303034323030343230303432303034323030303030346e3e30346c3b30346e3e30346c3b30346e3e30346c3b
    二二一一二一 3030343230303432303034323030343230303432303034323030303030346e3e30346e3e30346c3230346c3230346e3e30346c32
    二二三二一二 3030343230303432303034323030343230303432303034323030303030346e3e30346e3e30346c3b30346e3e30346c3230346e3e
    test.py 303033563030333830303355303033563030314d303033523030335f30303030303033563030333830303355303033563030314d303033523030335f
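The UTF-8 check mentioned in the list above (the octal escapes that gls reports in the C locale) can be reproduced independently of ls. This is a small sketch; `octal_escapes` is a helper name made up for the example.

```python
def octal_escapes(name):
    """Render the UTF-8 bytes of name as backslash-octal escapes."""
    return "".join(f"\\{byte:03o}" for byte in name.encode("utf-8"))

# 一 is U+4E00 (bytes E4 B8 80), 三 is U+4E09 (bytes E4 B8 89).
print(octal_escapes("一一三"))  # \344\270\200\344\270\200\344\270\211
```

The result matches what gls printed, which supports the conclusion that the filenames are valid UTF-8 and the encoding is not the problem.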

2 Answers


You'll notice that among strings of the same length, those characters do appear to have a relative order, so they're not treated as totally equivalent. It's not like 🧚🧛🧜, which have no defined order at all in GNU libc locales, so that you get a random order in most UTF-8 locales there, such as your en_US.UTF-8:

    $ ls -1
    🧚🧜🧛
    🧜🧛🧚
    🧚🧚🧚
    🧜🧚🧜
    🧛🧚🧛
    🧛🧚🧜
    🧚🧜🧚
    🧚🧚🧜
    🧜🧜🧚
    🧚🧚🧜🧚
    🧜🧜🧚🧚
    🧜🧛🧛🧜
    🧜🧛🧜🧚
    🧛🧚🧜🧜
    🧜🧛🧚🧜🧛
    🧜🧜🧛🧜🧚
    🧚🧚🧚🧛🧛
    🧜🧜🧜🧚🧜
    🧚🧚🧜🧜🧛
    🧜🧜🧛🧜🧛
    🧚🧚🧜🧛🧚
    🧜🧛🧚🧛🧛
    🧜🧚🧜🧚🧜
    🧚🧛🧜🧛🧚🧚
    🧛🧛🧚🧚🧛🧚
    🧜🧛🧚🧛🧛🧚
    🧛🧛🧜🧜🧜🧚
    🧚🧛🧚🧚🧜🧜
    🧛🧜🧛🧜🧛🧜
    test.py
    $ ls -1 | sort -u
    🧚🧜🧛
    🧚🧚🧜🧚
    🧜🧛🧚🧜🧛
    🧚🧛🧜🧛🧚🧚
    test.py

What you got is the same kind of sorting order one gets when sorting strings made of characters that have the same primary collation weight but different subsequent weights.

For instance, in most locales, e, E, É and é have the same primary weight, for good reason: this is how Stéphane or STÉPHANE can sort before Stephen even though Stephane sorts before Stéphane.

    EeE
    eeé
    eéE
    éée
    Eeee
    eeée
    Eeéé
    eéEe
    éEEé
    éEée
    ééeé
    éééé
    eéEéé
    éEeéE
    éeéee
    EEeeEe
    eEeeéé
    EeeééE
    eEéEee
    EEéEeE
    EéEéEé
    eééEée
    eEeéEéé
    eéEeeee
    eéEeeeE
    EéEeEEe
    eéEééée
    EééEéEE
    éeeEEEé

(note that all of e, E and é have the same primary weight, e and E also have the same secondary weight)
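To illustrate the multi-level idea, here is a toy Python model, not how any real libc implements collation. It assumes the primary weight is the lowercased base letter, the secondary weight is "has an accent", and the tertiary weight is "is upper case"; `collation_key` is a name invented for this sketch.

```python
import unicodedata

def collation_key(s):
    """Toy three-level key: base letters, then accents, then case."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFC", s):
        # NFD splits é into 'e' plus a combining acute accent.
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        primary.append(base.lower())           # level 1: which letter
        secondary.append(len(decomposed) > 1)  # level 2: accented or not
        tertiary.append(base.isupper())        # level 3: upper or lower case
    # Tuples compare level by level: a later level only matters
    # when every earlier level is identical for the whole string.
    return (primary, secondary, tertiary)

names = ["Stephen", "STÉPHANE", "Stéphane", "Stephane"]
print(sorted(names, key=collation_key))
# ['Stephane', 'Stéphane', 'STÉPHANE', 'Stephen']
```

All three spellings of "stephane" beat "Stephen" on the primary level alone; accents and case only break the tie among them.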

From your result of strxfrm(), which looks like it's actually ASCII text which we can decode to:

    一一三 004200420042000004l204l204l;
    一三二 004200420042000004l204l;04n>
    三三一 004200420042000004l;04l;04l2
    二一二 004200420042000004n>04l204n>
    一一三一 0042004200420042000004l204l204l;04l2
    一三二一 0042004200420042000004l204l;04n>04l2
    三三一三 0042004200420042000004l;04l;04l204l;
    三三三三 0042004200420042000004l;04l;04l;04l;
    三二三一 0042004200420042000004l;04n>04l;04l2
    三二二三 0042004200420042000004l;04n>04n>04l;
    二一一一 0042004200420042000004n>04l204l204l2
    二一三三 0042004200420042000004n>04l204l;04l;
    一三二三三 00420042004200420042000004l204l;04n>04l;04l;
    三一三一一 00420042004200420042000004l;04l204l;04l204l2
    三二一三二 00420042004200420042000004l;04n>04l204l;04n>
    一三三二三一 004200420042004200420042000004l204l;04l;04n>04l;04l2
    一二一一三三 004200420042004200420042000004l204n>04l204l204l;04l;
    一二三二一一 004200420042004200420042000004l204n>04l;04n>04l204l2
    二一一三三二 004200420042004200420042000004n>04l204l204l;04l;04n>
    二三二三二三 004200420042004200420042000004n>04l;04n>04l;04n>04l;
    二二一一二一 004200420042004200420042000004n>04n>04l204l204n>04l2
    二二三二一二 004200420042004200420042000004n>04n>04l;04n>04l204n>
    一三二一一一一 0042004200420042004200420042000004l204l;04n>04l204l204l204l2
    一三二一一一二 0042004200420042004200420042000004l204l;04n>04l204l204l204n>
    一三二三三三一 0042004200420042004200420042000004l204l;04n>04l;04l;04l;04l2
    一二一三二三三 0042004200420042004200420042000004l204n>04l204l;04n>04l;04l;
    三一一二二二三 0042004200420042004200420042000004l;04l204l204n>04n>04n>04l;
    二三三二三二二 0042004200420042004200420042000004n>04l;04l;04n>04l;04n>04n>
    二三二一二二一 0042004200420042004200420042000004n>04l;04n>04l204n>04n>04l2

you can see that 0042 is likely its representation of the primary weight of those 一二三 characters and it's the same for all 3. Then there's likely a 0000 separator and by the look of it only one additional (secondary) weight which is 04n>, 04l; and 04l2 for 二 (U+4E8C), 三 (U+4E09), and 一 (U+4E00) respectively¹.
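A short Python sketch of that reading: it decodes one of the hex strings from the question, then rebuilds keys of the same shape (one 0042 per character, a 0000 separator, then one per-character weight) and sorts by them. The layout is inferred from the strxfrm() output only; it is an assumption, not Apple's documented format, and `observed_key` is a name made up here.

```python
# Hex output of strxfrm('一') quoted in the question, decoded as ASCII.
decoded = bytes.fromhex("303034323030303030346c32").decode("ascii")
print(decoded)  # 0042000004l2

# Per-character fields as they appear in the decoded keys.
FIRST_FIELD = "0042"   # identical for all three characters
SEPARATOR = "0000"
SECOND_FIELD = {"一": "04l2", "三": "04l;", "二": "04n>"}

def observed_key(name):
    """Rebuild a key with the same layout as the observed strxfrm() output."""
    return (FIRST_FIELD * len(name) + SEPARATOR
            + "".join(SECOND_FIELD[ch] for ch in name))

names = ["二一二", "一一三一", "三三一", "一三二", "一一三"]
print(sorted(names, key=observed_key))
# ['一一三', '一三二', '三三一', '二一二', '一一三一']
```

Because the first field is the same for every character, the 0000 separator is reached sooner in shorter names, and 0000 compares lower than 0042. That is why the listing is grouped by length first, and only then ordered by the second field.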

Why their collation order is defined like that I don't know. It's not the case on GNU systems where in most locales, the primary weights for U+4E00 to U+9FA5 are different and in sequence of their Unicode code point. Nor is it the case on FreeBSD 12.4-RELEASE-p5 at least.

It's also possible (even likely) that what we're seeing above is that those characters have an undefined primary weight and the 0042 we're seeing is the secondary weight. Which would explain why we seem to see only two weights per character in the strxfrm() result.

That means that in the first pass of comparing strings that happen to contain those characters, those characters are just ignored for the purpose of comparison. That's normally the case for blank or punctuation characters where you don't want them to have a strong influence on the order. For instance foo-bar would sort between fooargh and football and the secondary and further weights of - would only be used to determine the relative order of foo-bar and foo+bar for instance.
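The foo-bar case can be mimicked with a two-pass key in Python. This is only a sketch of the idea: the `IGNORED` set (ASCII punctuation plus space) and the use of the raw string as the tie-breaker are simplifying assumptions, not what a real locale does.

```python
import string

# Characters assumed to carry no primary weight in this toy model.
IGNORED = set(string.punctuation + " ")

def two_pass_key(s):
    """First pass skips ignored characters; second pass uses the full string."""
    first_pass = "".join(ch for ch in s if ch not in IGNORED)
    return (first_pass, s)

words = ["football", "foo-bar", "fooargh", "foo+bar"]
print(sorted(words, key=two_pass_key))
# ['fooargh', 'foo+bar', 'foo-bar', 'football']
```

Both foo-bar and foo+bar compare as "foobar" in the first pass, which places them between fooargh and football; the punctuation only decides their order relative to each other.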

Apple might have decided that since not everybody agrees on the order of those characters, we might as well ignore them.


¹ Interesting to note (though it doesn't shed any light on this issue) is that 19968 is 0x4E00 suggesting the last weight is based on the code point. 04l2, 04l;, 04n> and even 0042 in the strxfrm strings seem to be numbers in some sort of base 64 with 0123...lmno as the digits corresponding to those weights offset by 258 (42 in that base).
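The arithmetic in this footnote can be checked with a few lines of Python. The digit alphabet is an assumption read off the output: 64 consecutive ASCII characters starting at 0 (so l is 60, n is 62, ; is 11 and > is 14). `decode_weight` is a hypothetical helper name.

```python
def decode_weight(field):
    """Read a 4-character field as a base-64 number with '0' as digit zero."""
    value = 0
    for ch in field:
        value = value * 64 + (ord(ch) - ord("0"))
    return value

OFFSET = decode_weight("0042")
print(OFFSET)  # 258

for field in ("04l2", "04l;", "04n>"):
    code_point = decode_weight(field) - OFFSET
    print(field, hex(code_point), chr(code_point))
# 04l2 0x4e00 一
# 04l; 0x4e09 三
# 04n> 0x4e8c 二
```

The same code points are consistent with the strcoll results quoted in the question: 0x4E00 minus 0x4E8C is -140.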

1
2

I found out where macOS stores locale files, in /usr/share/locale, and to my surprise this is how the zh_CN.UTF-8 locale is defined:

    % ls -la /usr/share/locale/zh_CN.UTF-8
    total 8
    drwxr-xr-x 8 root wheel 256 Oct 12 04:10 .
    drwxr-xr-x 209 root wheel 6688 Oct 12 04:10 ..
    lrwxr-xr-x 1 root wheel 28 Oct 12 04:10 LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
    lrwxr-xr-x 1 root wheel 17 Oct 12 04:10 LC_CTYPE -> ../UTF-8/LC_CTYPE
    drwxr-xr-x 3 root wheel 96 Oct 12 04:10 LC_MESSAGES
    -r--r--r-- 2 root wheel 36 Oct 12 04:10 LC_MONETARY
    lrwxr-xr-x 1 root wheel 25 Oct 12 04:10 LC_NUMERIC -> ../zh_CN.eucCN/LC_NUMERIC
    -r--r--r-- 2 root wheel 408 Oct 12 04:10 LC_TIME

LC_COLLATE is symlinked to /usr/share/locale/la_LN.US-ASCII/LC_COLLATE. This file is just over 2KB in size, and while it is a binary file (rather than the textual format used by some other systems), it's pretty clearly defining collation for just 256 bytes:

    % xxd /usr/share/locale/la_LN.US-ASCII/LC_COLLATE
    00000000: 312e 3141 0a00 0000 0000 0101 0102 0000  1.1A............
    00000010: 0102 ffff fefe 0000 0000 0000 0000 0000  ................
    00000020: 0000 0000 0000 0000 0002 0000 0002 0000  ................
    00000030: 0003 0000 0003 0000 0004 0000 0004 0000  ................
    00000040: 0005 0000 0005 0000 0006 0000 0006 0000  ................
    00000050: 0007 0000 0007 0000 0008 0000 0008 0000  ................
    00000060: 0009 0000 0009 0000 000a 0000 000a 0000  ................
    00000070: 000b 0000 000b 0000 000c 0000 000c 0000  ................
    00000080: 000d 0000 000d 0000 000e 0000 000e 0000  ................
    00000090: 000f 0000 000f 0000 0010 0000 0010 0000  ................
    ...
    00000760: 00e9 0000 00e9 0000 00ea 0000 00ea 0000  ................
    00000770: 00eb 0000 00eb 0000 00ec 0000 00ec 0000  ................
    00000780: 00ed 0000 00ed 0000 00ee 0000 00ee 0000  ................
    00000790: 00ef 0000 00ef 0000 00f0 0000 00f0 0000  ................
    000007a0: 00f1 0000 00f1 0000 00f2 0000 00f2 0000  ................
    000007b0: 00f3 0000 00f3 0000 00f4 0000 00f4 0000  ................
    000007c0: 00f5 0000 00f5 0000 00f6 0000 00f6 0000  ................
    000007d0: 00f7 0000 00f7 0000 00f8 0000 00f8 0000  ................
    000007e0: 00f9 0000 00f9 0000 00fa 0000 00fa 0000  ................
    000007f0: 00fb 0000 00fb 0000 00fc 0000 00fc 0000  ................
    00000800: 00fd 0000 00fd 0000 00fe 0000 00fe 0000  ................
    00000810: 00ff 0000 00ff 0000 0100 0000 0100 0000  ................
    00000820: 0101 0000 0101                           ......

So the problem seems to be that macOS simply does not define collation ordering for any Chinese characters at all (or really anything beyond the first 256 codepoints), even in Chinese locales.
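To see whether other locales share the same collation file, the symlink check can be scripted. This Python sketch assumes the layout shown above (one directory per locale under /usr/share/locale, each with an LC_COLLATE entry); `collate_targets` is a helper name invented here, and it simply returns nothing on systems that lay out locales differently.

```python
import os

def collate_targets(root="/usr/share/locale"):
    """Map each locale directory under root to its LC_COLLATE symlink target."""
    targets = {}
    if not os.path.isdir(root):
        return targets
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry, "LC_COLLATE")
        if os.path.islink(path):
            # readlink returns the raw target, e.g. ../la_LN.US-ASCII/LC_COLLATE
            targets[entry] = os.readlink(path)
    return targets

# Show where the Chinese locales point, if any are present.
for name, target in collate_targets().items():
    if name.startswith("zh_"):
        print(name, "->", target)
```

Locales whose LC_COLLATE is a regular file rather than a symlink are skipped by this sketch, so an empty result does not by itself mean a locale has no collation data.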
