Why does [a-z] asterisk match numbers?

Question

I have 3 directories at current path.

    $ ls
    a_0db_data  a_clean_0db_data  a_clean_data
    $ ls a_*_data
    a_0db_data:
    a_clean_0db_data:
    a_clean_data:
    $ ls a_[a-z]*_data
    a_clean_0db_data:
    a_clean_data:

I expected the last ls command to match only a_clean_data. Why did it also match the one containing 0?

    bash --version
    GNU bash, version 4.2.24(1)-release (i686-pc-linux-gnu)

See this question for more on the difference between a regular expression and a glob.
– terdon ♦, Commented Sep 10, 2014 at 14:24
So the fact that a_*_data matched any of these files didn't surprise you?
– Cthulhu, Commented Sep 10, 2014 at 16:37

l0b0 · Accepted Answer · 2014-09-10 15:26:12Z

The [a-z] part isn't what matches the number; it's the *. You may be confusing shell globbing and regular expressions.

Tools like grep accept various flavours of regexes (basic by default, -E for extended, -P for Perl regex).

For example (-v inverts the match):

    $ ls a_[a-z]*_data | grep -v "[0-9]"
    a_clean_data

If you want to use a bash regex, here is an example of how to test whether the variable $ref is an integer:

    re='^[0-9]+$'
    if ! [[ $ref =~ $re ]]; then
        echo "error"
    fi
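
The same `=~` test can be applied to the names from the question. This is only a minimal sketch, assuming bash and a scratch directory created with `mktemp -d`; the variable names `tmp`, `re` and `matches` are illustrative and not part of the original answer.

```bash
#!/usr/bin/env bash
# Recreate the three directories from the question in a scratch location.
tmp=$(mktemp -d)
mkdir "$tmp/a_0db_data" "$tmp/a_clean_0db_data" "$tmp/a_clean_data"

# In a regex, [a-z]* means "zero or more lowercase letters".
re='^a_[a-z]*_data$'

matches=()
for path in "$tmp"/a_*_data; do      # the glob still selects all three
    name=${path##*/}                 # strip the leading directory
    if [[ $name =~ $re ]]; then      # regex match, not a glob match
        matches+=("$name")
    fi
done

printf '%s\n' "${matches[@]}"        # prints: a_clean_data
rm -rf "$tmp"
```

The glob does the coarse selection and the regex does the filtering, which avoids parsing the output of ls.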

How to use bash regex then? (see tldp.org/LDP/Bash-Beginners-Guide/html/sect_04_01.html)
– user13107, Commented Sep 10, 2014 at 8:39

umläute · Accepted Answer · 2014-09-11 08:20:29Z

So the problem is: why does a_[a-z]*_data match a_clean_0db_data?

This can be broken down into four parts:

a_ matches the beginning of a_clean_0db_data, leaving clean_0db_data to be matched
[a-z] matches any character in the range a-z (e.g. c), leaving lean_0db_data to be matched
* matches any number of characters, e.g. lean_0db
_data matches the trailing _data
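
The four steps above can be replayed with plain parameter expansion. This is an illustrative sketch, assuming bash; the variables `rest`, `first`, `star` and `verdict` are made up for the demonstration.

```bash
#!/usr/bin/env bash
name=a_clean_0db_data

rest=${name#a_}       # step 1: "a_" consumed, leaving clean_0db_data
first=${rest:0:1}     # step 2: [a-z] takes exactly one character, c
rest=${rest:1}        #         leaving lean_0db_data
star=${rest%_data}    # steps 3 and 4: "_data" must be the tail, so * gets lean_0db

echo "[a-z] matched: $first"
echo "*     matched: $star"

# The glob as a whole therefore matches the name, digits included.
if [[ $name == a_[a-z]*_data ]]; then
    verdict=match
else
    verdict="no match"
fi
echo "glob: $verdict"
```

The digit 0 ends up inside the part consumed by `*`, which is why the bracket expression never gets a chance to reject it.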

In regular expressions, [a-z]* would mean any number of characters (including zero) in the range of a..z, but you are dealing with shell globbing, not with regular expressions.

If you want regular expressions, a few find implementations have a -regex predicate for that:

find . -maxdepth 1 -regex "^.*/a_[a-z]*_data$"

The -maxdepth is only there to limit the search results to the directory you are in. The regular expression has to match the entire path, not just the file name, so I have added ^.*/ to match the directory portion.
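
A quick way to try this without touching real data is a scratch directory. The sketch below assumes a find that supports -regex (GNU and BSD find do) and `mktemp -d`; the names `tmp` and `found` are illustrative.

```bash
#!/usr/bin/env bash
# Recreate the three directories from the question in a scratch location.
tmp=$(mktemp -d)
mkdir "$tmp/a_0db_data" "$tmp/a_clean_0db_data" "$tmp/a_clean_data"

# The expression is matched against the whole path (./name),
# hence the leading .*/ in front of the file-name part.
found=$(cd "$tmp" && find . -maxdepth 1 -regex '.*/a_[a-z]*_data')

echo "$found"                        # prints: ./a_clean_data
rm -rf "$tmp"
```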

Stéphane Chazelas · Accepted Answer · 2014-09-10 09:16:11Z

* in shell patterns matches 0 or more characters. It's not to be confused with the * regular expression operator that means 0 or more of the preceding atom.

There is no equivalent of regexp * in basic shell patterns. However, various shells have extensions for that.

ksh has *(something):
```
ls a_*([a-z])_data 
```
You can have the same in bash with shopt -s extglob, or in zsh with setopt kshglob:
```
shopt -s extglob
ls a_*([a-z])_data
```
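
To see the difference between the plain glob and the extglob form side by side, here is a small sketch for bash. It assumes a scratch directory from `mktemp -d`; the array names `plain` and `ext` are illustrative.

```bash
#!/usr/bin/env bash
# extglob has to be enabled on an earlier line than the one using *( ... ),
# because bash needs it when it parses the pattern.
shopt -s extglob

tmp=$(mktemp -d)
mkdir "$tmp/a_0db_data" "$tmp/a_clean_0db_data" "$tmp/a_clean_data"

plain=("$tmp"/a_[a-z]*_data)     # one letter, then anything at all
ext=("$tmp"/a_*([a-z])_data)     # zero or more letters and nothing else

# Drop the directory part of every element for display.
plain=("${plain[@]##*/}")
ext=("${ext[@]##*/}")

printf 'plain glob: %s\n' "${plain[*]}"
printf 'extglob:    %s\n' "${ext[*]}"
rm -rf "$tmp"
```

With the directories from the question, the plain glob picks up both a_clean_0db_data and a_clean_data, while the extglob pattern is left with a_clean_data only.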
In zsh with extendedglob enabled, # is equivalent to regexp *:
```
setopt extendedglob
ls a_[a-z]#_data
```
In recent versions of ksh93, you can also use regular expressions in globs. Here with extended regular expressions:
```
ls ~(E:a_[a-z]*_data) 
```

Note that [a-z] matches different things depending on the current locale. In the C locale, it generally matches only the 26 unaccented Latin letters a to z. In other locales, it generally matches more, and doesn't always make sense. To match a letter in your locale, you may prefer [[:alpha:]].
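
As a rough illustration of the range versus the character class, the sketch below pins the locale to C so the result is repeatable; in other locales the [a-z] line may come out differently, which is the point of the note above. It assumes bash, and the sample names and the arrays `range` and `class` are made up for the demonstration.

```bash
#!/usr/bin/env bash
shopt -s extglob
# Pin the locale: in the C locale [a-z] is the 26 lowercase ASCII letters.
LC_ALL=C

range=()
class=()
for name in a_clean_data a_Clean_data a_clean_0db_data; do
    if [[ $name == a_*([a-z])_data ]]; then
        range+=("$name")
    fi
    if [[ $name == a_*([[:alpha:]])_data ]]; then
        class+=("$name")
    fi
done

printf '[a-z]:       %s\n' "${range[*]}"
printf '[[:alpha:]]: %s\n' "${class[*]}"
```

Note that [[:alpha:]] also accepts uppercase letters, so it is not a drop-in replacement when only lowercase is wanted; [[:lower:]] is the class for that case.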

Could you give an example of [a-z] matching more than the 26 letters matched in the C locale? As far as I remember from when I last looked at this, all encodings used in practice on Unix variants had ISO-646 as a base (the upper 128 codes were then used differently: directly for characters in encodings like ISO-8859-X, or combined in encodings like UTF-8 or the EUC family). Even AIX didn't have EBCDIC locales (at least none available to me). I remember trying to find out whether the POSIX/UNIX standards demanded it, but I don't remember the result.
– AProgrammer, Commented Sep 10, 2014 at 9:40
@AProgrammer, that's independent of the encoding; it's based on sort order (LC_COLLATE). [a-z] generally includes é or í (but not necessarily ź) in the locales where the charset has them, whether or not the codepoint in that encoding lies between those of a and z. Only the C locale guarantees a sort order based on codepoint value. See this other answer for more details.
– Stéphane Chazelas, Commented Sep 10, 2014 at 9:46
Ok, what I missed was that the range was interpreted according to the current collation sequence.
– AProgrammer, Commented Sep 10, 2014 at 11:29
