By default, ls sorts files based on the locale's collation order of their names, as if by using the strcoll() standard function¹.
With LC_COLLATE=C, on GNU systems, regardless of the value of LC_CTYPE, strcoll() resorts to strcmp()², that is, it sorts based on the byte values of the encoding of the text, without bothering to decode the bytes into characters.
UTF-8 has the property that its encoding sorts by byte value the same as the characters it encodes sort by code point, but that's generally not the case for other multibyte encodings³. With LC_COLLATE=C, the relative order of two characters can therefore be different between locales, as the following examples show.
In a UTF-8 locale:
    $ mkdir UTF-8 ISO8859-15 GB18030
    $ locale charmap
    UTF-8
    $ touch UTF-8/{é,€}
    $ locale | grep COLLATE
    LC_COLLATE="en_US.UTF-8"
    $ ls UTF-8
    € é
€, like all currency symbols, comes before digits and letters in iso14651_t1_common.
    $ LC_COLLATE=C ls UTF-8
    é €
The UTF-8 encoding of é (U+00E9) is 0xc3 0xa9, and those byte values sort before those of € (U+20AC), which is encoded as 0xe2 0x82 0xac.
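Those byte values can be checked without depending on the terminal's own encoding by spelling the bytes out as octal escapes and dumping them with od. This is only a minimal sketch; the hex_bytes helper name is made up for the illustration.

```shell
# Hypothetical helper: dump the bytes of its argument in hexadecimal.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1
}

e_acute=$(printf '\303\251')    # é (U+00E9) encoded in UTF-8
euro=$(printf '\342\202\254')   # € (U+20AC) encoded in UTF-8

hex_bytes "$e_acute"   # bytes c3 a9
hex_bytes "$euro"      # bytes e2 82 ac
```

The first byte already differs (0xc3 versus 0xe2), which is all a byte-wise comparison needs to place é before €.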
But if you do the same in a locale that uses ISO8859-15, where the encoding of é is 0xe9 and that of € is 0xa4, or GB18030, where they are 0xa8 0xa6 and 0xa2 0xe3 respectively:
    $ LANG=en_US.iso885915 luit
    $ locale charmap
    ISO-8859-15
    $ touch ISO8859-15/{é,€}
    $ ls ISO8859-15
    € é
Still based on ISO14651, but:
    $ LC_COLLATE=C ls ISO8859-15
    € é
Same in GB18030:
    $ LANG=zh_CN.gb18030 luit
    $ locale charmap
    GB18030
    $ touch GB18030/{é,€}
    $ ls GB18030
    € é
    $ LC_COLLATE=C ls GB18030
    € é
UTF-8 has the property that its encoding sorts by byte value the same as its characters sort by code point, so for UTF-8 encoded text, LC_CTYPE=en_US.UTF-8 LC_COLLATE=C and LC_CTYPE=C LC_COLLATE=C (or LC_ALL=C) should give the same result. The latter would also be a lot less expensive, and would cope better with text that is not properly encoded.
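The byte-order comparison can also be reproduced without installing any of those locales, by feeding the encoded bytes directly to sort under LC_ALL=C. The sketch below assumes a POSIX shell with sort, cut and paste; byte_order is a hypothetical helper name, and the ASCII labels e-acute and euro only stand in for the two characters so that the output is readable whatever the terminal's encoding.

```shell
# Hypothetical helper: $1 and $2 are printf octal escapes for the encoded
# bytes of é and € in some charset. Each line is "<bytes><TAB><label>";
# the lines are sorted by raw byte value and only the labels are printed.
byte_order() {
  printf "$1\te-acute\n$2\teuro\n" | LC_ALL=C sort | LC_ALL=C cut -f2 | paste -sd' ' -
}

byte_order '\303\251' '\342\202\254'   # UTF-8:      e-acute euro
byte_order '\351' '\244'               # ISO8859-15: euro e-acute
byte_order '\250\246' '\242\343'       # GB18030:    euro e-acute
```

Only the UTF-8 case matches the code point order (U+00E9 before U+20AC); in the other two encodings the byte order puts € first.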
¹ And in the case of the GNU implementation of ls, it does call strcoll().
² See it in the code when _NL_COLLATE_NRULES is 0, which it is for the C locale.
³ Even though, in general, in the GNU C library the wchar_t values for characters in locales using multibyte encodings are chosen to be the Unicode code points.