By default, `ls` sorts files according to the locale's collation order of their names, as if by using the standard `strcoll()` function¹.

On GNU systems or any system using the GNU libc, the source definition of the `en_US` locales can be seen in `$prefix/share/i18n/locales/en_US`.

In there, in the `LC_COLLATE` section that defines collation order, you'll see:

```
copy "iso14651_t1"
```

(itself having a `copy "iso14651_t1_common"`).

That's based on (an older version of) [*table 1* found in appendix of the ISO 14651 international standard](https://standards.iso.org/iso-iec/14651/ed-6/en/).

Most other locales use that table too; it's not limited to `en_US`.

In there (see [here](https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/iso14651_t1_common;h=227400cc4ea4362d0e82464a803ec534a0cfe3a6;hb=74f59e9271cbb4071671e5a474e7d4f1622b186f) for the one found in the recently released glibc 2.41, though the file hasn't changed since 2018), you'll find:

```
% Third-level weight assignments
[...]
<MIN>
[...]
<CAP>
[...]
% First-level weight assignments
[...]
<S0030> % DIGIT ZERO
<S0031> % DIGIT ONE
<S0032> % DIGIT TWO
[...]
<S0067> % LATIN SMALL LETTER G
[...]
order_start <SPECIAL>;forward;backward;forward;forward,position
[...]
<U002D> IGNORE;IGNORE;IGNORE;<U002D> % HYPHEN-MINUS
<U002E> IGNORE;IGNORE;IGNORE;<U002E> % FULL STOP
[...]
<U0030> <S0030>;<BASE>;<MIN>;<U0030> % DIGIT ZERO
[...]
<U0031> <S0031>;<BASE>;<MIN>;<U0031> % DIGIT ONE
[...]
<U0032> <S0032>;<BASE>;<MIN>;<U0032> % DIGIT TWO
[...]
<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G
[...]
<U0067> <S0067>;<BASE>;<MIN>;<U0067> % LATIN SMALL LETTER G
```

In that order.

Those last few lines define the weight for each collating element:

```
<collating-element> <weight1>;<weight2>;<weight3>;<weight4> % comment
```

You'll notice that *hyphen* and *full stop*, like most punctuation characters, have `IGNORE` as their primary, secondary and tertiary weights: only the fourth, last-resort one is defined, whereas ASCII decimal digits and letters have all four.

When comparing `abc.zml-1.gz` against `abc.zml-12.gz`, the comparison is first done on the primary weights. As those of `-` and `.` are `IGNORE`, it is as if comparing `abczml1gz` to `abczml12gz`, and since the primary weight of `2` comes before that of `g`, `abc.zml-12.gz` sorts first.

If we were comparing `abc.zml-1.gz` to `abc.zml-1Gz`, all the primary and secondary weights would be the same, so the determination would be done on the tertiary weights, comparing `<MIN><MIN><MIN><MIN><MIN><MIN><MIN><MIN><MIN>` to `<MIN><MIN><MIN><MIN><MIN><MIN><MIN><CAP><MIN>` (taking the tertiary weight of each character; those of `.` and `-` are still `IGNORE`, so they are removed). As `<MIN>` comes before `<CAP>`, the name with the lowercase `g` comes first.

When comparing `abc.zml-1.gz` to `abc-zml-1.gz`, which differ only in punctuation, we'd have to go all the way to the fourth weight.
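Those three comparisons can be reproduced with a toy model of the level-by-level comparison. This is only a minimal Python sketch, not glibc's algorithm: the `weights()` table is hypothetical and covers only the characters used in the examples, the numeric values are made up to preserve the relative order shown in the excerpt (digits before letters, `<MIN>` before `<CAP>`), code points stand in for the fourth-level weights, and the `backward` and `position` directives are ignored.

```python
IGNORE = None
BASE, MIN, CAP = 0, 0, 1  # made-up values; only MIN < CAP matters


def weights(ch):
    """Hypothetical (primary, secondary, tertiary, fourth) weights for ch."""
    if ch in "-.":
        # Punctuation: only the fourth, last-resort weight is defined.
        return (IGNORE, IGNORE, IGNORE, ord(ch))
    if ch.isdigit():
        return (ord(ch), BASE, MIN, ord(ch))
    # Letters: both cases share a primary weight that sorts after all digits.
    return (0x100 + ord(ch.lower()), BASE, CAP if ch.isupper() else MIN, ord(ch))


def level_keys(s, level):
    """Weights of s at the given 0-based level, with IGNORE entries dropped."""
    return [w[level] for w in map(weights, s) if w[level] is not IGNORE]


def compare(a, b):
    """Return (result, level): result is -1, 0 or 1; level is the 1-based
    level that decided the comparison (0 if the strings tie at every level)."""
    for level in range(4):
        ka, kb = level_keys(a, level), level_keys(b, level)
        if ka != kb:
            return (-1 if ka < kb else 1), level + 1
    return 0, 0


print(compare("abc.zml-1.gz", "abc.zml-12.gz"))    # (1, 1): decided on primary weights
print(compare("abc.zml-1.gz", "abc.zml-1Gz"))      # (-1, 3): decided on tertiary weights
print(compare("abc.zml-1.gz", "abc-zml-1.gz")[1])  # 4: only the fourth level differs
```

The direction of the last result is deliberately not shown, as the sketch's fourth-level values are placeholders rather than the real ones.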

That's meant to mimic the ordering used in the user's locale, for instance in a dictionary, where punctuation, case and diacritics are generally ignored in the first instance but can be used to refine the order, the rest being equal (in which case some locales prefer lower case before small caps and before uppercase, some acute accents before grave accents...).

With `LC_COLLATE=C`, on GNU systems, regardless of the value of `LC_CTYPE`, `strcoll()` resorts to `strcmp()`², that is, it sorts based on the byte values of the encoding of the text, without bothering to decode the bytes into characters.

UTF-8 has the property that its encoding sorts by byte value the same way as the characters it encodes sort by code point, but that's generally not the case for other multibyte encodings³. So with `LC_COLLATE=C`, the relative order of two characters can differ between locales that use different encodings.

In a UTF-8 locale:

```
$ mkdir UTF-8 ISO8859-15 GB18030
$ locale charmap
UTF-8
$ touch UTF-8/{é,€}
$ locale | grep COLLATE
LC_COLLATE="en_US.UTF-8"
$ ls UTF-8
€ é
```

`€`, like all currency symbols, comes before digits and letters in `iso14651_t1_common`.

```
$ LC_COLLATE=C ls UTF-8
é €
```

The UTF-8 encoding of `é` (U+00E9) is 0xc3 0xa9 and those byte values sort before that of `€` (U+20AC) which is 0xe2 0x82 0xac.

But if you do the same in a locale that uses ISO8859-15, where the encoding of `é` is 0xe9 and that of `€` is 0xa4, or GB18030, where they are 0xa8 0xa6 and 0xa2 0xe3 respectively:

```
$ LANG=en_US.iso885915 luit
$ locale charmap
ISO-8859-15
$ touch ISO8859-15/{é,€}
$ ls ISO8859-15
€ é
```

That order is still based on ISO14651, but this time `LC_COLLATE=C` gives the same result, as 0xa4 sorts before 0xe9:

```
$ LC_COLLATE=C ls ISO8859-15
€ é
```

Same in GB18030:

```
$ LANG=zh_CN.gb18030 luit
$ locale charmap
GB18030
$ touch GB18030/{é,€}
$ ls GB18030
€ é
$ LC_COLLATE=C ls GB18030
€ é
```


---

Note that the GNU implementation of `ls` has a `-v` / `--sort=version` option that performs a version sort, and the GNU implementation of `sort` has `-V` / `--version-sort`, both of which can help to order things numerically. See also the `n` glob qualifier of `zsh`.

For example, in zsh:

```
print -rC1 -- *.gz(n)
```

Will `print` `r`aw on `1` `C`olumn the list of non-hidden file names ending in `.gz`, sorted `n`umerically (sequences of decimal digits are compared numerically and the rest based on collation order).
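Outside zsh, the idea behind such a numeric sort can be approximated as follows. This is a rough Python sketch, not zsh's implementation: the helper name `numeric_key` is hypothetical, and the non-digit parts are compared by code point rather than by the locale's collation order.

```python
import re


def numeric_key(name):
    """Split name into digit and non-digit runs; digit runs compare as integers."""
    # With a capturing group, re.split alternates: text, digits, text, digits, ...
    parts = re.split(r"([0-9]+)", name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


names = ["abc.zml-12.gz", "abc.zml-2.gz", "abc.zml-1.gz"]
print(sorted(names, key=numeric_key))
# ['abc.zml-1.gz', 'abc.zml-2.gz', 'abc.zml-12.gz']
```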

0-padding all numbers to the same width ensures things sort lexically and numerically the same regardless of the locale.

In `zsh`, 0-padding all numbers in all file names in the current working directory to a length of 3 digits (beware it also *truncates* longer numbers to 3 digits) can be done with:

```
autoload zmv
zmv '*' '${f//(#m)<->/${(l[3][0])MATCH}}'
```
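The effect of that padding on the sort order can be illustrated outside zsh. This is a minimal Python sketch with a hypothetical helper `pad_numbers`; unlike the `l[3][0]` flag above, `str.zfill` never truncates, so longer numbers are left intact.

```python
import re


def pad_numbers(name, width=3):
    """Left-pad every run of ASCII digits in name with zeros to the given width."""
    return re.sub(r"[0-9]+", lambda m: m.group().zfill(width), name)


names = ["abc.zml-12.gz", "abc.zml-2.gz", "abc.zml-1.gz"]
# A plain code-point sort of the padded names now matches the numeric order.
print(sorted(pad_numbers(n) for n in names))
# ['abc.zml-001.gz', 'abc.zml-002.gz', 'abc.zml-012.gz']
```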

---
<sup>¹ And in the case of the GNU implementation of `ls`, it [does call `strcoll()`](https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ls.c?h=v9.6#n3775).</sup>

<sup>² See [it in the code when _NL_COLLATE_NRULES is 0](https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcoll_l.c;h=72388f0322a763bfee952c2bb6d9e91c53bee01e;hb=74f59e9271cbb4071671e5a474e7d4f1622b186f#l269), which [it is for the C locale](https://sourceware.org/git/?p=glibc.git;a=blob;f=locale/C-collate.c;h=4c1245e5e8989bb46d9b0c348593c9015d5640bc;hb=74f59e9271cbb4071671e5a474e7d4f1622b186f#l33).</sup>

<sup>³ Even though in general in the GNU C library, the `wchar_t` values for characters in locales using multibyte encodings are chosen to be the Unicode code points.</sup>