edited Feb 23 at 12:23

586.2k
96
1.1k
1.7k

ls by default sorts files based on the locale collation order of their name.

On GNU systems or any system using the GNU libc, the source definition of the en_US locales can be seen in $prefix/share/i18n/locales/en_US.

In there, in the LC_COLLATE section that defines collation order, you'll see:

copy "iso14651_t1"

(itself having a copy "iso14651_t1_common").

That's based on (an older version of) table 1 found in the appendix of the ISO 14651 international standard.

Most other locales use it as well; it's not limited to en_US.

In there, you'll find:

% Third-level weight assignments
[...]
<MIN>
[...]
<CAP>
[...]
% First-level weight assignments
[...]
<S0030> % DIGIT ZERO
<S0031> % DIGIT ONE
<S0032> % DIGIT TWO
[...]
<S0067> % LATIN SMALL LETTER G
[...]
order_start <SPECIAL>;forward;backward;forward;forward,position
[...]
<U002D> IGNORE;IGNORE;IGNORE;<U002D> % HYPHEN-MINUS
<U002E> IGNORE;IGNORE;IGNORE;<U002E> % FULL STOP
[...]
<U0030> <S0030>;<BASE>;<MIN>;<U0030> % DIGIT ZERO
[...]
<U0031> <S0031>;<BASE>;<MIN>;<U0031> % DIGIT ONE
[...]
<U0032> <S0032>;<BASE>;<MIN>;<U0032> % DIGIT TWO
[...]
<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G
[...]
<U0067> <S0067>;<BASE>;<MIN>;<U0067> % LATIN SMALL LETTER G

In that order.

Those last few lines define the weight for each collating element:

<collating-element> <weight1>;<weight2>;<weight3>;<weight4> % comment

You'll notice that hyphen and full stop, like most punctuation characters, have IGNORE as their primary, secondary and tertiary weights; only the fourth, last-resort one is defined. ASCII decimal digits and letters have all four.
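To check which collating elements are ignored at the first level, the weight lines can be split on ";". A minimal sketch, assuming the lines have the `<collating-element> <weight1>;<weight2>;<weight3>;<weight4>` shape shown above; primary_ignored is a hypothetical helper name, and the input is just the excerpt, not the full iso14651_t1_common file:

```shell
# Hypothetical helper: read locale-source weight lines on stdin and print
# the collating elements whose first (primary) weight is IGNORE.
primary_ignored() {
  awk '/^<U[0-9A-F]+>/ {
    split($2, w, ";")              # $2 is weight1;weight2;weight3;weight4
    if (w[1] == "IGNORE") print $1
  }'
}

# Feed it a few lines copied from the excerpt.
primary_ignored <<'EOF'
<U002D> IGNORE;IGNORE;IGNORE;<U002D> % HYPHEN-MINUS
<U002E> IGNORE;IGNORE;IGNORE;<U002E> % FULL STOP
<U0030> <S0030>;<BASE>;<MIN>;<U0030> % DIGIT ZERO
<U0067> <S0067>;<BASE>;<MIN>;<U0067> % LATIN SMALL LETTER G
EOF
```

On that input it prints `<U002D>` and `<U002E>`, the hyphen and the full stop.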

When comparing abc.zml-1.gz against abc.zml-12.gz, the comparison is first done on the primary weights. As that of - and . is IGNORE, it is as if comparing abczml1gz to abczml12gz, and the primary weight of 2 comes before that of g, so abc.zml-12.gz sorts before abc.zml-1.gz.
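That first-level comparison can be roughly imitated by deleting the ignored characters and comparing what is left byte by byte. This is only a sketch of the idea, not what the libc does: primary_key and sort_by_primary_key are hypothetical helpers, only . and - are treated as ignored, and case and later weight levels are not handled.

```shell
# Hypothetical helper: approximate the primary-level key of a name by
# deleting the characters whose primary weight is IGNORE ('.' and '-').
primary_key() {
  printf '%s\n' "$1" | LC_ALL=C tr -d -- '-.'
}

# Sort the names given as arguments by that approximate key (byte order),
# then print the original names.
sort_by_primary_key() {
  for name in "$@"; do
    printf '%s\t%s\n' "$(primary_key "$name")" "$name"
  done | LC_ALL=C sort | cut -f2-
}

# Plain byte order: '.' (0x2E) sorts before '2' (0x32), so -1.gz comes first.
printf '%s\n' abc.zml-1.gz abc.zml-12.gz | LC_ALL=C sort

# With punctuation ignored: '2' sorts before 'g', so -12.gz comes first.
sort_by_primary_key abc.zml-1.gz abc.zml-12.gz
```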

If we were comparing abc.zml-1.gz to abc.zml-1Gz, all the primary and secondary weights would be the same, so the determination would be done on the tertiary weights, comparing <MIN><MIN><MIN><MIN><MIN><MIN><MIN><MIN><MIN> to <MIN><MIN><MIN><MIN><MIN><MIN><MIN><CAP><MIN> (. and - still ignored). With <MIN> coming before <CAP>, the one with lowercase g comes first.
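The two levels involved in that comparison can be spelled out with a small sketch. It is an approximation under stated assumptions: level_keys is a hypothetical helper, only ASCII letters and digits are handled, 0 stands in for <MIN> and 1 for <CAP>, and the secondary level is skipped because every character here has <BASE>.

```shell
# Hypothetical helper: print approximate first- and third-level keys of a
# name, separated by a tab. '.' and '-' are dropped at both levels.
level_keys() {
  stripped=$(printf '%s\n' "$1" | LC_ALL=C tr -d -- '-.')
  # First level: G and g share the same weight, so fold case.
  primary=$(printf '%s\n' "$stripped" | LC_ALL=C tr 'A-Z' 'a-z')
  # Third level: 0 for <MIN> (lowercase letters, digits), 1 for <CAP>.
  tertiary=$(printf '%s\n' "$stripped" |
    LC_ALL=C sed 's/[a-z0-9]/0/g; s/[A-Z]/1/g')
  printf '%s\t%s\n' "$primary" "$tertiary"
}

level_keys abc.zml-1.gz   # abczml1gz  000000000
level_keys abc.zml-1Gz    # abczml1gz  000000010
```

The first-level keys are identical; the third-level keys differ at the eighth position, where 0 sorts before 1.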

When comparing abc.zml-1.gz to abc-zml-1.gz, we'd have to go up to the fourth weight.

That's meant to mimic the ordering used in the user's locale, for instance in a dictionary, where punctuation, case and diacritics are generally ignored in the first instance but can be used to refine the order, the rest being equal (in which case some locales prefer lower case before small caps and before uppercase, some acute accents before grave accents...).

In the C locale, on GNU systems, the order is based on the character value. If LC_CTYPE uses a multibyte encoding such as (but not limited to) UTF-8, that will be based on the Unicode code point (from U+0000 to U+10FFFF; not all systems do the same). Otherwise (including with LC_ALL=C, which implies LC_CTYPE=C), it is based on the byte value. For instance € (U+20AC) would sort after é (U+00E9) with LC_CTYPE=en_US.UTF-8 LC_COLLATE=C but before it with LC_CTYPE=en_US.iso885915 LC_COLLATE=C, because € there is encoded as 0xA4 and é as 0xE9.

$ printf '\u20ac\n\u00e9\n' | iconv -t iso885915 | LC_CTYPE=en_US.iso885915 LC_COLLATE=C sort | iconv -f iso885915
€
é
$ printf '\u20ac\n\u00e9\n' | LC_CTYPE=en_US.UTF-8 LC_COLLATE=C sort
é
€
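To see where those byte values come from, the encodings can be dumped with iconv and od. A sketch, assuming an iconv that accepts the ISO-8859-15 charset name; bytes_in is a hypothetical helper, and the characters are written as octal escapes of their UTF-8 encoding so the example doesn't depend on the shell's \u support:

```shell
# Hypothetical helper: print, in hex, the bytes of a string once converted
# from UTF-8 to the given charset.
# Usage: bytes_in CHARSET PRINTF_FORMAT
bytes_in() {
  printf "$2" | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
  echo
}

euro='\342\202\254'   # U+20AC EURO SIGN, UTF-8 encoded
eacute='\303\251'     # U+00E9 LATIN SMALL LETTER E WITH ACUTE, UTF-8 encoded

bytes_in UTF-8 "$euro"          # e282ac
bytes_in UTF-8 "$eacute"        # c3a9
bytes_in ISO-8859-15 "$euro"    # a4
bytes_in ISO-8859-15 "$eacute"  # e9
```

In ISO-8859-15, 0xa4 is less than 0xe9, so € sorts first by byte value; by code point, U+00E9 is less than U+20AC, so é sorts first.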

UTF-8 has the property that its encoding sorts by byte value in the same order as its characters sort by code point, so for UTF-8 encoded text, LC_CTYPE=en_US.UTF-8 LC_COLLATE=C and LC_CTYPE=C LC_COLLATE=C (or LC_ALL=C) should give the same result. The latter is also a lot less expensive, and copes better with improperly encoded text.
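As a quick illustration of that property with the same two characters, sorting their UTF-8 encodings purely by byte value gives the code point order. The function name utf8_byte_sort is hypothetical, and the characters are again written as octal escapes so that only LC_ALL=C is needed:

```shell
# Sort the UTF-8 encodings of € (E2 82 AC) and é (C3 A9) by byte value.
# 0xC3 < 0xE2, so é (U+00E9) comes out before € (U+20AC), which matches
# the order of their code points.
utf8_byte_sort() {
  printf '\342\202\254\n\303\251\n' | LC_ALL=C sort
}

# Show the result as hex so it is readable on any terminal.
utf8_byte_sort | od -An -tx1
```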

Note that the GNU implementation of ls has a -v / --sort=version option that performs a version sort, and the GNU implementation of sort has -V / --version-sort, which can help to order things numerically. See also the n glob qualifier of zsh.
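For instance, with GNU sort (the -V option is not in POSIX, so this sketch assumes GNU coreutils or a compatible sort; version_sorted is a hypothetical wrapper and the third file name is made up for illustration):

```shell
# Version sort: runs of digits are compared as numbers, so 2 sorts
# between 1 and 12 instead of after 12.
version_sorted() {
  printf '%s\n' abc.zml-12.gz abc.zml-2.gz abc.zml-1.gz | LC_ALL=C sort -V
}

version_sorted
```

That lists abc.zml-1.gz, abc.zml-2.gz, then abc.zml-12.gz.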

For example, in zsh:

print -rC1 -- *.gz(n)

That will print raw on 1 Column the list of non-hidden file names ending in .gz, sorted numerically (sequences of decimal digits are compared numerically and the rest based on collation order).

0-padding all numbers to the same width ensures things sort lexically and numerically the same regardless of the locale.
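A small sketch of the effect, using printf's %03d to pad a number to 3 digits; padded_sorted is a hypothetical helper and the file names are only illustrative:

```shell
# Generate names with the number 0-padded to 3 digits and sort them by
# byte value: lexical order and numeric order now agree.
padded_sorted() {
  printf 'abc.zml-%03d.gz\n' 12 2 1 | LC_ALL=C sort
}

padded_sorted
```

That lists abc.zml-001.gz, abc.zml-002.gz, then abc.zml-012.gz.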

In zsh, 0-padding all numbers in all file names in the current working directory to a length of 3 digits (beware it also truncates longer numbers to 3 digits) can be done with:

autoload zmv
zmv '*' '${f//(#m)<->/${(l[3][0])MATCH}}'

answered Feb 23 at 11:59

Stéphane Chazelas
