Why is this find command not returning filenames containing non-ASCII characters only?

Question

I'm trying to determine the root cause of why this find command is not working; it shouldn't match the file called this_should_not_match below:

$ > find . -type f -name "*[^ -~]*"
./__º╚t
./this_should_not_match
./__╞_u
./__¡VW
./__▀√Z
./__εè_
./__∙Σ_
./__Σ_9
./__Σhm
./__φY_

My shell is Bash 3.2

Stéphane Chazelas · Accepted Answer · 2016-03-08 15:18:37Z

Ranges only work reliably and portably in the C locale. In other locales the behaviour varies, but generally [x-y] matches the characters (actually collating elements, so it could even match sequences of characters) that sort after x and before y in some sort order, which is often obscure and not always the same as the one the sort command would use.

In the C locale (see What does “LC_ALL=C” do?), characters are bytes and ranges are based on the code point of the characters (on byte values).
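To make "characters are bytes" concrete, here is a small sketch that prints the byte values of one non-ASCII character. The choice of é and its UTF-8 encoding (octal 303 251) is an assumption for illustration and is not taken from the question's file names.

```shell
# UTF-8 encoding of é, written as octal escapes so the script itself stays ASCII.
# od prints each byte as an unsigned decimal number; set -- splits them into $1, $2.
set -- $(printf '\303\251' | LC_ALL=C od -An -tu1)
echo "byte values: $*"

# Compare each byte against the range covered by [ -~] (32 to 126).
for b in "$@"; do
  if [ "$b" -lt 32 ] || [ "$b" -gt 126 ]; then
    echo "$b is outside 32-126, so [^ -~] matches it in the C locale"
  fi
done
```

Both bytes fall outside the printable ASCII range, so in the C locale a single such character gives the bracket expression two chances to match.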

LC_ALL=C find . -type f -name "*[^ -~]*"

on an ASCII-based system would list the regular files whose names contain bytes other than those between 32 and 126. POSIX doesn't guarantee that the C locale uses the ASCII charset, but in practice it does unless you're on some special EBCDIC-based IBM mainframe OS, and then you'd know about it.
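A minimal sketch of that command run against a scratch directory, assuming an ASCII-based system. The directory, the two file names, and the use of mktemp -d (widely available but not required by POSIX) are hypothetical choices for illustration. The pattern keeps the [^ ...] spelling from the question; [! -~] is the spelling POSIX specifies for patterns.

```shell
# Hypothetical scratch directory with one ASCII-only name and one name
# containing the UTF-8 bytes for é (octal 303 251).
dir=$(mktemp -d)
nonascii="caf$(printf '\303\251')"
touch "$dir/plain_ascii" "$dir/$nonascii"

# In the C locale, [^ -~] matches any single byte outside 0x20-0x7E,
# so only the second file should be listed.
matches=$(cd "$dir" && LC_ALL=C find . -type f -name '*[^ -~]*')
printf '%s\n' "$matches"

# Clean up the scratch directory.
rm -rf "$dir"
```

Setting LC_ALL only for the find invocation leaves the locale of the rest of the session untouched.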

Also note that in a multi-byte character locale (like the UTF-8 ones, the norm nowadays), the * may not even match all file names: on some systems, it fails to match sequences of bytes that don't form valid characters.

Pattern behaviour varies between shells. With LC_ALL=C "$shell" -c 'case a in [!" "-~]) echo yes;; esac', only bash, ksh and yash say yes.
– cuonglm, Commented Mar 8, 2016 at 14:35
This works. Looks like $LC_ALL wasn't set to anything in my case.
– Zaid, Commented Mar 8, 2016 at 14:40
@Murphy I wouldn't know. I've never had to deal with locales before.
– Zaid, Commented Mar 8, 2016 at 14:45
@cuonglm, zsh ranges are based on code points. In UTF-8, bytes that don't form part of valid characters are mapped to would-be code points DC80 to DCFF (not valid characters), which is something to bear in mind when using a range that spans them, like [$'\u1000-\U10000']. You'd want to use [$'\u1000-\ud7ff\ue000-\U10000'] to exclude the invalid [$'\ud800-\udfff'] range. You can also match them with byte ranges like [$'\x00-\xFF'], but note that this won't match bytes that are otherwise part of multi-byte characters (you should only use byte ranges in single-byte character locales).
– Stéphane Chazelas, Commented Mar 8, 2016 at 15:28
