added 20 characters in body
Stéphane Chazelas
% Third-level weight assignments
[...]
<MIN>
[...]
<CAP>
[...]

% First-level weight assignments
[...]
<S0030> % DIGIT ZERO
<S0031> % DIGIT ONE
<S0032> % DIGIT TWO
[...]
<S0067> % LATIN SMALL LETTER G
[...]

order_start <SPECIAL>;forward;backward;forward;forward,position
[...]
<U002D> IGNORE;IGNORE;IGNORE;<U002D> % HYPHEN-MINUS
<U002E> IGNORE;IGNORE;IGNORE;<U002E> % FULL STOP
[...]
<U0030> <S0030>;<BASE>;<MIN>;<U0030> % DIGIT ZERO
[...]
<U0031> <S0031>;<BASE>;<MIN>;<U0031> % DIGIT ONE
[...]
<U0032> <S0032>;<BASE>;<MIN>;<U0032> % DIGIT TWO
[...]
<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G
[...]
<U0067> <S0067>;<BASE>;<MIN>;<U0067> % LATIN SMALL LETTER G
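To make the effect of those weights concrete, here is a minimal Python sketch of a multi-level comparison built only from the entries quoted in the excerpt. It is a toy model, not how the GNU libc implements strcoll(): the names `WEIGHTS` and `toy_collation_key` are made up for illustration, the numeric ranks given to `<S0030>`...`<S0067>` and to `<MIN>`/`<CAP>` are assumed to follow declaration order, and the `position` rule on the fourth level is not modelled.

```python
IGNORE = None
BASE = 0            # the only second-level weight in the excerpt
MIN, CAP = 0, 1     # third-level weights, assumed ranked in declaration order

# character -> (first, second, third) level weights, copied from the excerpt.
# First-level ranks reuse the number in the <Sxxxx> symbol (an assumption).
WEIGHTS = {
    "-": (IGNORE, IGNORE, IGNORE),
    ".": (IGNORE, IGNORE, IGNORE),
    "0": (0x30, BASE, MIN),
    "1": (0x31, BASE, MIN),
    "2": (0x32, BASE, MIN),
    "G": (0x67, BASE, CAP),
    "g": (0x67, BASE, MIN),
}

def toy_collation_key(text):
    """Build one weight list per level; levels are compared in order."""
    levels = []
    for level in range(3):
        weights = [WEIGHTS[ch][level] for ch in text
                   if WEIGHTS[ch][level] is not IGNORE]
        if level == 1:
            weights.reverse()   # the second level is declared "backward"
        levels.append(weights)
    # Fourth level: the character itself, used as a last tie-breaker.
    levels.append([ord(ch) for ch in text])
    return levels

names = ["g-2", "G1", "g.1", "g0"]
print(sorted(names, key=toy_collation_key))  # ['g0', 'g.1', 'G1', 'g-2']
print(sorted(names))                         # ['G1', 'g-2', 'g.1', 'g0']
```

In this model the hyphen and full stop are invisible at the first three levels, so `g-2` sorts as if it were `g2`, and `g.1` and `G1` only differ at the third level, where `<MIN>` comes before `<CAP>`. Plain code point order, shown by the second `sorted()` call, gives a different result.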
added 1186 characters in body

ls by default sorts files based on the locale collation order of their name as if by using the standard strcoll() function¹.

On GNU systems or any system using the GNU libc, the source definition of the en_US locales can be seen in $prefix/share/i18n/locales/en_US².

With LC_COLLATE=C, on GNU systems, regardless of the value of LC_CTYPE, strcoll() resorts to strcmp()³; that is, it sorts based on the byte values of the encoded text, without bothering to decode the bytes into characters.

UTF-8 has the property that its encoding sorts by byte value in the same order as the characters it encodes sort by code point, but that's generally not the case for other multibyte encodings⁴. With LC_COLLATE=C, the relative order of two characters can therefore differ between locales.
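That property can be checked empirically. The following Python sketch is only an illustration: the helper name `utf8_order_matches_code_points` and the random sample are assumptions, not something from the answer. It compares the two orderings on a sample of code points, then shows é and € as a counter-example in GB18030.

```python
import random

def utf8_order_matches_code_points(chars):
    """Return True if sorting by UTF-8 bytes equals sorting by code point."""
    by_bytes = sorted(chars, key=lambda c: c.encode("utf-8"))
    by_code_point = sorted(chars, key=ord)
    return by_bytes == by_code_point

# Sample code points, skipping the surrogate range U+D800..U+DFFF,
# which cannot be encoded in UTF-8.
rng = random.Random(0)
sample = {chr(cp) for cp in rng.sample(range(0x110000), 2000)
          if not 0xD800 <= cp <= 0xDFFF}
print(utf8_order_matches_code_points(sample))   # True

# Counter-example in another multibyte encoding (GB18030):
e_acute, euro = "\u00e9", "\u20ac"
print(ord(e_acute) < ord(euro))                              # True
print(e_acute.encode("gb18030") < euro.encode("gb18030"))    # False
```

Python compares `bytes` objects as sequences of unsigned byte values, which is the same comparison strcmp() performs.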

In a locale using the UTF-8 charmap:

The UTF-8 encoding of é (U+00E9) is 0xc3 0xa9, and those byte values sort before those of € (U+20AC), whose encoding is 0xe2 0x82 0xac.

Will print raw, in one column, the list of non-hidden file names ending in .gz, sorted numerically (sequences of decimal digits are compared numerically and the rest based on collation order).
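The numeric ordering described here can be sketched in Python as follows. This is an approximation under stated assumptions, not the command the answer refers to: `numeric_sort_key` is a hypothetical helper, and the non-digit parts are compared by code point rather than by the locale's collation order.

```python
import re

def numeric_sort_key(name):
    """Split a name into text and digit runs; digit runs compare as integers."""
    # With a capturing group, re.split() alternates text, digits, text, ...
    # so odd indices always hold a run of decimal digits.
    parts = re.split(r"([0-9]+)", name)
    return [int(part) if index % 2 else part
            for index, part in enumerate(parts)]

names = ["f10.gz", "f2.gz", "f1.gz"]
print(sorted(names, key=numeric_sort_key))   # ['f1.gz', 'f2.gz', 'f10.gz']
print(sorted(names))                         # ['f1.gz', 'f10.gz', 'f2.gz']
```

Because text parts always sit at even indices and numbers at odd ones, two keys never compare a string with an integer.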

² An en_US.UTF-8 or en_US.iso885915 locale, for instance, if enabled on a system, will have been compiled with something like localedef -i "$prefix/share/i18n/locales/en_US" -f "$prefix/share/i18n/charmaps/UTF-8.gz" or localedef -i "$prefix/share/i18n/locales/en_US" -f "$prefix/share/i18n/charmaps/ISO-8859-15.gz".

³ See it in the code when _NL_COLLATE_NRULES is 0, which it is for the C locale.

⁴ Even though, in general, in the GNU C library the wchar_t value of characters in locales using a multibyte encoding is chosen to be the Unicode code point.

added 1186 characters in body

ls by default sorts files based on the locale collation order of their name.

In the C locale, on GNU systems, the order is based on the character value. If LC_CTYPE uses a multibyte encoding such as (but not limited to) UTF-8, that will be based on the Unicode code point (from U+0000 to U+10FFFF; not all systems do the same). If not (including with LC_ALL=C, which implies LC_CTYPE=C), it is based on the byte value. For instance, € (U+20AC) would sort after é (U+00E9) with LC_CTYPE=en_US.UTF-8 LC_COLLATE=C, but before it with LC_CTYPE=en_US.iso885915 LC_COLLATE=C, because there € is encoded as 0xA4 and é as 0xE9.

$ printf '\u20ac\n\u00e9\n' | iconv -t iso885915 | LC_CTYPE=en_US.iso885915 LC_COLLATE=C sort | iconv -f iso885915
€
é
$ printf '\u20ac\n\u00e9\n' | LC_CTYPE=en_US.UTF-8 LC_COLLATE=C sort
é
€

UTF-8 has the property that its encoding sorts by byte value in the same order as its characters sort by code point, so for UTF-8 encoded text, LC_CTYPE=en_US.UTF-8 LC_COLLATE=C and LC_CTYPE=C LC_COLLATE=C (or LC_ALL=C) should give the same result. The latter would also be a lot less expensive, and would cope better with improperly encoded text.

ls by default sorts files based on the locale collation order of their name as if by using the strcoll() standard function¹.

With LC_COLLATE=C, on GNU systems, regardless of the value of LC_CTYPE, strcoll() resorts to strcmp()²; that is, it sorts based on the byte values of the encoded text, without bothering to decode the bytes into characters.

UTF-8 has the property that its encoding sorts by byte value in the same order as the characters it encodes sort by code point, but that's generally not the case for other multibyte encodings³. With LC_COLLATE=C, the relative order of two characters can therefore differ between locales.

In a UTF-8 locale:

$ mkdir UTF-8 ISO8859-15 GB18030
$ locale charmap
UTF-8
$ touch UTF-8/{é,€}
$ locale | grep COLLATE
LC_COLLATE="en_US.UTF-8"
$ ls UTF-8
€
é

€, like all currency symbols, comes before digits and letters in iso14651_t1_common.

$ LC_COLLATE=C ls UTF-8
é
€

The UTF-8 encoding of é (U+00E9) is 0xc3 0xa9, and those byte values sort before those of € (U+20AC), whose encoding is 0xe2 0x82 0xac.

But if you do the same in a locale that uses ISO8859-15, where the encoding of é is 0xe9 and that of € is 0xa4, or GB18030, where they are 0xa8 0xa6 and 0xa2 0xe3 respectively:

$ LANG=en_US.iso885915 luit
$ locale charmap
ISO-8859-15
$ touch ISO8859-15/{é,€}
$ ls ISO8859-15
€
é

Still based on ISO14651, but:

$ LC_COLLATE=C ls ISO8859-15
€
é

Same in GB18030:

$ LANG=zh_CN.gb18030 luit
$ locale charmap
GB18030
$ touch GB18030/{é,€}
$ ls GB18030
€
é
$ LC_COLLATE=C ls GB18030
€
é
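The byte-value ordering in the three charmaps can be reproduced without having those locales installed. The Python sketch below assumes that sorting the encoded names as `bytes` is a fair stand-in for what strcmp() does with LC_COLLATE=C; `c_collate_order` is a hypothetical helper, and the codec names are Python's, not the locale names used above.

```python
def c_collate_order(names, charmap):
    """Order names by the byte values of their encoding in `charmap`."""
    return sorted(names, key=lambda name: name.encode(charmap))

E_ACUTE, EURO = "\u00e9", "\u20ac"   # é and €

for charmap in ("utf-8", "iso8859-15", "gb18030"):
    ordered = c_collate_order([E_ACUTE, EURO], charmap)
    encoded = [name.encode(charmap).hex(" ") for name in ordered]
    print(charmap, ordered, encoded)
```

Only the UTF-8 run puts é first; in the other two encodings the first byte of € is smaller than the first byte of é, so € comes first.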

¹ And in the case of the GNU implementation of ls, it does call strcoll().

² See it in the code when _NL_COLLATE_NRULES is 0, which it is for the C locale.

³ Even though, in general, in the GNU C library the wchar_t value of characters in locales using a multibyte encoding is chosen to be the Unicode code point.
