
I have 3 directories at current path.

$ ls
a_0db_data  a_clean_0db_data  a_clean_data
$ ls a_*_data
a_0db_data:
a_clean_0db_data:
a_clean_data:
$ ls a_[a-z]*_data
a_clean_0db_data:
a_clean_data:

I expected the last ls command to match only a_clean_data. Why did it also match the one containing 0?

$ bash --version
GNU bash, version 4.2.24(1)-release (i686-pc-linux-gnu)
  • See this question for more on the difference between a regular expression and a glob. Commented Sep 10, 2014 at 14:24
  • So the fact that a_*_data matched any of these files didn't surprise you? Commented Sep 10, 2014 at 16:37
  • @Cthulhu you got me! Commented Sep 11, 2014 at 6:59

3 Answers

29

The [a-z] part isn't what matches the number; it's the *. You may be confusing shell globbing and regular expressions.

Tools like grep accept various flavours of regexes: basic by default, -E for extended and, in GNU grep, -P for Perl-compatible regexes.

For example (-v inverts the match):

$ ls a_[a-z]*_data | grep -v "[0-9]"
a_clean_data

If you want to use a bash regex, here is an example of how to test whether the variable $ref is an integer:

re='^[0-9]+$'
if ! [[ $ref =~ $re ]] ; then
    echo "error"
fi
21

So the problem is: why does a_[a-z]*_data match a_clean_0db_data?

This can be broken down into four parts:

  • a_ matches the beginning of a_clean_0db_data, leaving clean_0db_data to be matched

  • [a-z] matches any character in the range a-z (e.g. c), leaving lean_0db_data to be matched

  • * matches any number of characters, e.g. lean_0db

  • _data matches the trailing _data

In regular expressions, [a-z]* would mean any number of characters (including zero) in the range of a..z, but you are dealing with shell globbing, not with regular expressions.
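The difference can be checked without touching the filesystem. The following is a minimal sketch, assuming bash: [[ == ]] does glob matching on strings and [[ =~ ]] does extended-regex matching. The array names glob_hits and regex_hits are only illustrative.

```shell
#!/usr/bin/env bash
# Compare the glob a_[a-z]*_data with the regex ^a_[a-z]*_data$ on the
# three directory names from the question. No directories need to exist.
glob_hits=()
regex_hits=()
regex='^a_[a-z]*_data$'

for name in a_0db_data a_clean_0db_data a_clean_data; do
    # Unquoted right-hand side: treated as a shell pattern (glob).
    if [[ $name == a_[a-z]*_data ]]; then
        glob_hits+=("$name")
    fi
    # Unquoted variable on the right of =~ : treated as an extended regex.
    if [[ $name =~ $regex ]]; then
        regex_hits+=("$name")
    fi
done

echo "glob:  ${glob_hits[*]}"
echo "regex: ${regex_hits[*]}"
```

The glob line lists a_clean_0db_data and a_clean_data, while the regex line lists only a_clean_data, which is the result the question expected.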

If you want regular expressions, a few find implementations have a -regex predicate for that:

find . -maxdepth 1 -regex "^.*/a_[a-z]*_data$" 

The -maxdepth is only there to limit the search results to the folder you are in. The regular expression is matched against the whole path, not just the file name, so I have added ^.*/ to match the directory portion.
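To try this safely, here is a small sketch that recreates the three directories in a scratch location first. It assumes a find that supports -regex and -maxdepth (GNU and BSD find do); the variable names tmp and found are made up for the example, and the anchors are dropped because -regex already has to match the whole path.

```shell
#!/usr/bin/env bash
# Recreate the question's directories in a temporary location.
tmp=$(mktemp -d)
mkdir "$tmp/a_0db_data" "$tmp/a_clean_0db_data" "$tmp/a_clean_data"

# -regex is matched against the whole path (./name), hence the leading .*/
found=$(cd "$tmp" && find . -maxdepth 1 -type d -regex '.*/a_[a-z]*_data')
echo "$found"

# Clean up the scratch directory.
rm -rf "$tmp"
```

Only ./a_clean_data should be printed, since [a-z]* here is the regex operator and cannot span the 0 or the extra underscore.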

11

* in shell patterns matches 0 or more characters. It's not to be confused with the * regular expression operator that means 0 or more of the preceding atom.

There is no equivalent of regexp * in basic shell patterns. However, various shells have extensions for that.

  • ksh has *(something):

    ls a_*([a-z])_data 
  • you can have the same in bash with shopt -s extglob or zsh with setopt kshglob:

    shopt -s extglob
    ls a_*([a-z])_data
  • In zsh with extendedglob enabled, # is equivalent to regexp *:

    setopt extendedglob
    ls a_[a-z]#_data
  • In recent versions of ksh93, you can also use regular expressions in globs. Here with extended regular expressions:

    ls ~(E:a_[a-z]*_data) 
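As a rough illustration of the bash variant, the sketch below applies the extglob pattern to plain strings rather than real directories. The extra name a__data is not from the question; it is added here only to show that *( ) also accepts zero occurrences. The array name ext_hits is likewise hypothetical.

```shell
#!/usr/bin/env bash
# Enable ksh-style extended patterns; this has to happen on an earlier
# line than any pathname glob that uses them.
shopt -s extglob

ext_hits=()
for name in a_0db_data a_clean_0db_data a_clean_data a__data; do
    # *([a-z]) means "zero or more characters from [a-z]", like regex [a-z]*
    if [[ $name == a_*([a-z])_data ]]; then
        ext_hits+=("$name")
    fi
done

echo "${ext_hits[*]}"
```

This prints a_clean_data and a__data; the two names containing 0 are rejected.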

Note that [a-z] matches different things depending on the current locale. In the C locale, it generally matches only the 26 unaccented Latin letters a to z. In other locales, it generally matches more, and doesn't always make sense. To match a letter in your locale, you may prefer [[:alpha:]].
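A small sketch of the two bracket expressions in bash, with the locale pinned to C so the result is reproducible. In other locales the [a-z] line may differ, which is the point of the note above. The sample characters and array names are chosen only for illustration.

```shell
#!/usr/bin/env bash
# Pin the C locale so the outcome does not depend on the caller's LC_COLLATE.
LC_ALL=C

range_hits=()
alpha_hits=()
for ch in a z B 0; do
    # Range expression: depends on the collation order of the locale.
    if [[ $ch == [a-z] ]]; then
        range_hits+=("$ch")
    fi
    # Character class: any character the locale classifies as a letter.
    if [[ $ch == [[:alpha:]] ]]; then
        alpha_hits+=("$ch")
    fi
done

echo "[a-z]:       ${range_hits[*]}"
echo "[[:alpha:]]: ${alpha_hits[*]}"
```

In the C locale, [a-z] accepts a and z only, while [[:alpha:]] also accepts B; neither accepts 0.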

  • Could you give an example of [a-z] matching more than the 26 letters matched in the C locale? From what I remember from when I last looked at this, all encodings practically used in Unix variants had ISO-646 as a base (the upper 128 codes were then used differently: directly for characters in encodings like the ISO-8859-X family, combined in encodings like UTF-8 or the EUC family). Even AIX didn't have EBCDIC locales (at least none available to me). I remember trying to find out whether the POSIX/UNIX standards demanded it, but I don't remember the result. Commented Sep 10, 2014 at 9:40
  • @AProgrammer, that's independent of the encoding; it's based on sort order (LC_COLLATE). [a-z] generally includes é or í (but not necessarily ź) in the locales whose charset has them, whether or not the codepoint in that encoding lies between those of a and z. Only the C locale guarantees a sort order based on codepoint value. See this other answer for more details. Commented Sep 10, 2014 at 9:46
  • Ok, what I missed was that the range was interpreted according to the current collation sequence. Commented Sep 10, 2014 at 11:29
