I'm trying to work out why this find command doesn't behave as expected; it shouldn't match the file named this_should_not_match below:

$ find . -type f -name "*[^ -~]*"
./__º╚t
./this_should_not_match
./__╞_u
./__¡VW
./__▀√Z
./__εè_
./__∙Σ_
./__Σ_9
./__Σhm
./__φY_

My shell is Bash 3.2.

1 Answer

Ranges only work reliably and portably in the C locale. In other locales the behaviour varies, but generally [x-y] matches the characters (strictly speaking, collating elements, so it can even match sequences of characters) that sort between x and y in the locale's collation order. That order is often obscure and not always the same as the one sort would use.

In the C locale (see What does “LC_ALL=C” do?), characters are single bytes and ranges are based on the characters' code points, that is, on byte values.

LC_ALL=C find . -type f -name "*[^ -~]*" 

on an ASCII-based system would list regular files whose names contain bytes other than those with values 32 (space) to 126 (~). POSIX doesn't guarantee that the C locale uses the ASCII charset, but in practice you will be using ASCII unless you're on some EBCDIC-based IBM mainframe OS, in which case you'd know about it.
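To check that behaviour on your own system, here is a minimal sketch. The scratch directory and the second file name are made up for the illustration, and it assumes an ASCII-based system. It uses [! -~], the POSIX spelling of the negated bracket expression; [^ -~] as in the question is accepted by many find implementations, but POSIX leaves a leading ^ unspecified.

```shell
#!/bin/sh
# Hypothetical scratch directory; both file names are illustrative only.
dir=$(mktemp -d)

# A name made only of printable ASCII bytes (values 32 to 126).
: > "$dir/this_should_not_match"

# A name containing the bytes 0xC3 0xA9 (UTF-8 for "é"), both outside 32 to 126.
: > "$dir/$(printf 'caf\303\251')"

# In the C locale the range " -~" is compared by byte value, so the negated
# bracket expression matches any byte outside space..tilde.
matches=$(LC_ALL=C find "$dir" -type f -name '*[! -~]*')
printf '%s\n' "$matches"

# Clean up the scratch directory.
rm -rf "$dir"
```

With LC_ALL=C this should print only the path of the second file. Dropping LC_ALL=C lets you see what your usual locale does with the same range.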

Also note that in a multi-byte character locale (such as the UTF-8 ones that are the norm nowadays), * may not even match all file names: on some systems it fails to match sequences of bytes that don't form valid characters.

  • The behaviour of this pattern varies between shells. With LC_ALL=C "$shell" -c 'case a in [!" "-~]) echo yes;; esac', only bash, ksh and yash say yes. Commented Mar 8, 2016 at 14:35
  • This works. Looks like $LC_ALL wasn't set to anything in my case Commented Mar 8, 2016 at 14:40
  • @Zaid In which case it should default to C, shouldn't it? Commented Mar 8, 2016 at 14:43
  • @Murphy I wouldn't know - I've never had to deal with locales before Commented Mar 8, 2016 at 14:45
  • @cuonglm, zsh ranges are based on code points. In UTF-8, bytes that don't form part of valid characters are mapped to the would-be code points DC80 to DCFF (not valid characters), which is something to bear in mind when using a range that spans them, like [$'\u1000-\U10000']. You'd want to use [$'\u1000-\ud7ff\ue000-\U10000'] to exclude the invalid [$'\ud800-\udfff'] range. You can also match them with byte ranges like [$'\x00-\xFF'], but note that this won't match bytes that are otherwise part of multi-byte characters (you should only use byte ranges in single-byte character locales). Commented Mar 8, 2016 at 15:28
