Collation order through LC_COLLATE defines not only the sort order of individual characters, but also the meaning of character ranges. Or does it? Consider the following snippet:

    unset LANGUAGE LC_ALL
    echo B | LC_COLLATE=en_US grep '[a-z]'

Intuitively, B isn't in [a-z], so this shouldn't output anything. That's what happens on Ubuntu 8.04 or 10.04. But on some machines running Debian lenny or squeeze, B is found, because the range a-z includes everything that's between a and z in the collation order, including the capital letters B through Z.
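To compare machines more systematically than with a single letter, the test can be run over the whole ASCII alphabet. The following is only a sketch: `range_members` is a hypothetical helper name, and it assumes a POSIX shell plus a `grep` that honors LC_COLLATE the way the snippet above relies on.

```shell
#!/bin/sh
# range_members LOCALE BRACKET_EXPR   (hypothetical helper, not a standard tool)
# Feeds each ASCII letter to grep on its own line, with LC_COLLATE set to
# LOCALE, and prints the letters that matched BRACKET_EXPR on a single line.
range_members() {
    for c in a b c d e f g h i j k l m n o p q r s t u v w x y z \
             A B C D E F G H I J K L M N O P Q R S T U V W X Y Z; do
        printf '%s\n' "$c"
    done | (
        # Same setup as the original snippet: make sure LC_ALL and LANGUAGE
        # cannot override LC_COLLATE. The subshell keeps the unset local.
        unset LANGUAGE LC_ALL
        LC_COLLATE=$1 grep -- "$2"
    ) | tr -d '\n'
    echo
}

# Example calls:
#   range_members C '[a-z]'
#   range_members en_US '[a-z]'
```

With `C` as the locale this prints only the 26 lowercase letters. With `en_US` the output depends on the machine: on the systems where B matches, capital letters should show up in the list as well, which makes it easy to see exactly where the range begins and ends.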
All systems tested do have the en_US locale generated. I also tried varying the locale: on the machines where B is matched above, the same happens in every available locale (mostly Latin-based: {en_{AU,CA,GB,IE,US},fr_FR,it_IT,es_ES,de_DE}{iso8859-1,iso8859-15,utf-8}, plus Chinese locales) except Japanese (in any available encoding) and C/POSIX.
What do character ranges mean in regular expressions, when you go beyond ASCII? Why is there a difference between some Debian installations on the one hand, and other Debian installations and Ubuntu on the other? How do other systems behave? Who's right, and against whom should a bug be reported?
(Note that I'm specifically asking about the behavior of character ranges such as [a-z] in en_US locales, primarily on GNU libc-based systems. I'm not asking how to match lowercase letters or ASCII lowercase letters.)
On two Debian machines, one where B is in [a-z] and one where it isn't, the output of LC_COLLATE=en_US locale -k LC_COLLATE is

    collate-nrules=4
    collate-rulesets=""
    collate-symb-hash-sizemb=1
    collate-codeset="ISO-8859-1"

and the output of LC_COLLATE=en_US.utf8 locale -k LC_COLLATE is

    collate-nrules=4
    collate-rulesets=""
    collate-symb-hash-sizemb=2039
    collate-codeset="UTF-8"
en_US is generated, though. C locale is used as a fallback, and its collation order is straight byte values, so B won't be matched. Test in a locale that appears in the output of locale -a.

printf '%s' $(printf '%s\n' {a..z} {A..Z} | sort); echo in both systems?