Does (should) LC_COLLATE affect character ranges?

Question

Collation order through LC_COLLATE defines not only the sort order of individual characters, but also the meaning of character ranges. Or does it? Consider the following snippet:

unset LANGUAGE LC_ALL
echo B | LC_COLLATE=en_US grep '[a-z]'

Intuitively, B isn't in [a-z], so this shouldn't output anything. That's what happens on Ubuntu 8.04 or 10.04. But on some machines running Debian lenny or squeeze, B is found, because the range a-z includes everything that's between a and z in the collation order, including the capital letters B through Z.
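The experiment above can be repeated for several locales with a small wrapper. This is only a sketch: matches_az is a hypothetical helper name, and the result for en_US is deliberately not stated because it is exactly what differs between systems. Only the C-locale results are fixed, since there the range is compared by byte value.

```shell
# matches_az is a hypothetical helper name, not part of grep or libc.
# $1 = locale name to use for LC_COLLATE, $2 = single character to test.
matches_az() {
    # Unset LC_ALL inside a subshell so that it cannot override LC_COLLATE.
    if printf '%s\n' "$2" | (unset LC_ALL; LC_COLLATE="$1" grep -q '[a-z]'); then
        echo yes
    else
        echo no
    fi
}

matches_az C B        # byte-value order: B (0x42) lies outside a (0x61) .. z (0x7a)
matches_az C b
matches_az en_US B    # result varies between systems, as described above
```

If the named locale has not been generated, the call silently behaves like the C locale, so a "no" for en_US is only meaningful when the locale actually exists.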

All systems tested do have the en_US locale generated. I also tried varying the locale: on the machines where B is matched above, the same happens in every available locale (mostly latin-based: {en_{AU,CA,GB,IE,US},fr_FR,it_IT,es_ES,de_DE}{iso8859-1,iso8859-15,utf-8}, also Chinese locales) except Japanese (in any available encoding) and C/POSIX.

What do character ranges mean in regular expressions, when you go beyond ASCII? Why is there a difference between some Debian installations on the one hand, and other Debian installations and Ubuntu on the other? How do other systems behave? Who's right, and who should have a bug reported against?

(Note that I'm specifically asking about the behavior of character ranges such as [a-z] in en_US locales, primarily on GNU libc-based systems. I'm not asking how to match lowercase letters or ASCII lowercase letters.)

On two Debian machines, one where B is in [a-z] and one where it isn't, the output of LC_COLLATE=en_US locale -k LC_COLLATE is

collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=1
collate-codeset="ISO-8859-1"

and the output of LC_COLLATE=en_US.utf8 locale -k LC_COLLATE is

collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=2039
collate-codeset="UTF-8"

Doesn't reproduce on a Debian Lenny instance I've had handy. Didn't check if en_US is generated, though. — alex
– alex, Commented Jul 2, 2011 at 13:35
@alex If the locale isn't generated, the C locale is used as a fallback, and its collation order is straight byte values, so B won't be matched. Test in a locale that appears in the output of locale -a. — Gilles 'SO- stop being evil'
– Gilles 'SO- stop being evil', Commented Jul 2, 2011 at 13:43
Note that en_US is NOT the same as en_US.utf8, and typically means en_US.iso-8859-1, depending on exactly what you have installed. If en_US (with no suffix) doesn't appear in the output of locale -a you don't actually have this locale. What does LC_COLLATE=en_US locale -k LC_COLLATE show? — Neil Mayhew
– Neil Mayhew, Commented Jul 4, 2011 at 13:26
This has since turned up in a practical rather than theoretical question here: Why are capital letters included in a range of lower-case letters in an awk regex? — Caleb
– Caleb, Commented Aug 24, 2011 at 21:22
Could you provide the output of printf '%s' $(printf '%s\n' {a..z} {A..Z} | sort); echo in both systems? — user232326
– user232326, Commented May 15, 2018 at 22:24
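Several of the comments above hinge on whether the locale under test is really generated. A minimal sketch of that check follows, assuming the locale utility is installed; has_locale is a hypothetical helper name. The name must be spelled the way locale -a prints it, and glibc prints UTF-8 locales as en_US.utf8 rather than en_US.UTF-8.

```shell
# has_locale is a hypothetical helper: it succeeds only if the name is
# listed by `locale -a` exactly as given (fixed-string, whole-line match).
has_locale() {
    locale -a 2>/dev/null | grep -qxF -- "$1"
}

for loc in C en_US en_US.utf8; do
    if has_locale "$loc"; then
        echo "$loc: available"
    else
        echo "$loc: not listed, so tests would fall back to the C locale"
    fi
done
```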

Neil Mayhew · Accepted Answer · 2018-02-14 17:59:03Z

5

If you are using anything other than the C locale, you shouldn't be using ranges like [a-z], since these are locale-dependent and don't always give the results you would expect. As well as the case issue you've already encountered, some locales treat characters with diacritics (e.g. á) the same as the base character (i.e. a).

Instead, use a named character class:

echo B | grep '[[:lower:]]'

This will always give the correct result for the locale. However, you need to choose the locale to reflect the meaning of both your input text and the test you are trying to apply.

For example, if you need to find a particular byte value, use the C locale, which is always available:

echo B | LANG=C grep '[a-z]'

If this doesn't work as expected, it really is a bug.
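A slightly more defensive variant of the same idea is sketched below. It sets LC_ALL rather than LANG, on the assumption that the caller's environment may already contain LC_ALL or LC_COLLATE, both of which take precedence over LANG. The name ascii_lowers is hypothetical, and grep -o is an extension that GNU and BSD grep provide rather than a POSIX option.

```shell
# ascii_lowers is a hypothetical helper: it prints each ASCII lowercase
# letter found on standard input, one per line.
# LC_ALL=C overrides every other locale variable for this one command,
# so [a-z] is compared by byte value whatever the caller's environment is.
ascii_lowers() {
    LC_ALL=C grep -o '[a-z]'
}

printf 'aBcD\n' | ascii_lowers    # prints a and c on separate lines
```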

edited Feb 14, 2018 at 17:59

answered Jul 3, 2011 at 22:27

Neil Mayhew


I know that, it isn't what I asked. I'm specifically asking about what an explicit range means, and why different distributions (even with GNU libc and GNU grep) have different behaviors. (Downvoted because even though what you say is correct, it's irrelevant.)

– Gilles 'SO- stop being evil', Commented Jul 3, 2011 at 22:47
2

My point is that the meaning of an explicit range is locale-dependent, and different systems are not required to define their locales the same way, so this is not a bug. Technically, you are abusing the system, so you shouldn't be surprised at getting "undefined" behaviour. Also, several people have commented that they can't reproduce the behaviour on their Debian systems, so there seems to be something unusual about your system(s).

– Neil Mayhew, Commented Jul 4, 2011 at 13:17
1

I know that the behavior of ranges depends on the locales. I'm asking how, and surprised that different systems using Glibc (and, it turns out, even different installations of the same Debian release) have different behaviors. I've added the output of locale -k to my question; it's identical on two Debian machines, one where B is in the range and one where it isn't. BTW I'm not root on either machine (so it's not something peculiar that I do as an admin).

– Gilles 'SO- stop being evil', Commented Jul 4, 2011 at 19:46
echo "Baü" | LC_COLLATE=C grep -o '[[:lower:]]' returns both a and ü, while echo "Baü" | LC_COLLATE=C grep -o '[a-z]' returns only a. In my eyes, "lower" is not really what the OP wanted.

– Daniel Alder, Commented Jan 24, 2018 at 12:47
1

My original point still stands, though: don't use ranges unless you're in the C locale. I believe this is relevant to the OP, who was looking to report a bug. If you're not in the C locale, the results of using ranges are highly unpredictable and therefore can't ever be considered a bug. On the other hand, if you need to find a particular byte value, just use the C locale. My secondary point was that if you really do want to search for lowercase letters in a locale, use a character class. Even though the OP may not have been looking for this, others might if they find this question.

– Neil Mayhew, Commented Feb 14, 2018 at 17:54


Peter Eisentraut · Accepted Answer · 2011-07-04 20:57:45Z

Ranges in regular expressions should observe the collation setting. Here is the relevant standard: http://pubs.opengroup.org/onlinepubs/007908799/xbd/re.html (look for "range expressions"). So echo B | LC_COLLATE=en_US grep '[a-z]' should output B given a sensible definition of the respective locale. I can't explain why this sometimes doesn't work for you, but I would be very surprised if I encountered this on a non-ancient system that is properly installed and configured.
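One way to see what "between a and z" means for a given collation setting is to sort a few letters under it. The sketch below uses a hypothetical helper, order_under. Only the C-locale result is stated, because there the order is by byte value; the en_US result depends on how the system defines that locale.

```shell
# order_under is a hypothetical helper.
# $1 = locale name for LC_COLLATE, remaining arguments = strings to sort.
# Prints the sorted strings joined on one line.
order_under() {
    loc=$1
    shift
    # Unset LC_ALL in a subshell so that LC_COLLATE is what sort honors.
    printf '%s\n' "$@" | (unset LC_ALL; LC_COLLATE="$loc" sort) | tr -d '\n'
    echo
}

order_under C a B z        # prints Baz: B sorts before a by byte value
order_under en_US a B z    # order depends on the system's en_US definition
```

If B lands between a and z in the second line of output, a collation-based reading of [a-z] would include it.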

echo B | LC_COLLATE=en_US.utf8 grep '[a-z]': doesn't print anything on Ubuntu 12.04 with grep 2.10; doesn't print anything on CentOS 6.5 with grep 2.6.3; does work on Debian 6.0.8 with grep 2.6.3. — Ian D. Allen
– Ian D. Allen, Commented Jan 27, 2014 at 6:20
Note when testing this, you may need to unset LC_ALL first so it's not overriding LC_COLLATE. — SpinUp __ A Davis
– SpinUp __ A Davis, Commented Jul 13, 2022 at 17:49
None of my non-ancient systems (Debian, Ubuntu, macOS) output anything for echo B | LC_COLLATE=en_US grep '[a-z]'. — SpinUp __ A Davis
– SpinUp __ A Davis, Commented Jul 13, 2022 at 18:06
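The LC_ALL caveat in the comments follows from the standard precedence of the locale variables: a non-empty LC_ALL wins, then the category variable (here LC_COLLATE), then LANG. A minimal sketch of that lookup is below; effective_collate is a hypothetical helper, and printing POSIX for the unset case is an assumption, since the default is implementation-defined.

```shell
# effective_collate is a hypothetical helper: it prints the value that
# governs collation, following the LC_ALL > LC_COLLATE > LANG precedence.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: the implementation-defined default is shown as POSIX.
        printf 'POSIX\n'
    fi
}

effective_collate    # shows which setting a test command would really use
```

Running it before a test such as the grep command above shows at a glance whether LC_COLLATE=en_US would actually take effect.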
