
I want to find all filenames in a directory tree that contain extended ASCII characters (0x80-0xFF). I thought I could do it like this:

find . -regex '.*[\x80-\xFF]+.*' 

but instead it matches everything. Alternatively, I tried to look for files that contain any character not in a standard set of a-z, A-Z, 0-9, hyphen or period.

find . -regex '.*[^- a-zA-Z0-9]+.*' 

Obviously I'm misunderstanding a fundamental aspect here.

Examples of the files in my tree:

./file 1/file - 1 - A2.mkv
./file 1/file - 1 - A2.nfo
./tést/tést - 2 - 2.mkv
./français/français - 2 -3.mkv

I'm using find (GNU findutils) 4.7.0, under Ubuntu 20.04.

  • The second regex matches the filenames with français and tést if you add . and / to the negated character class: '.*[^- a-zA-Z0-9./]+.*' (and you could remove the +). Commented Jun 19, 2020 at 1:04
  • @Freddy: That works, although I had to add a more lengthy list for my real world application. But it's functional and gets me what I needed. Commented Jun 19, 2020 at 1:26

2 Answers

$ tree
.
|-- file 1
|   |-- file - 1 - A2.mkv
|   `-- file - 1 - A2.nfo
|-- français
|   `-- français - 2 -3.mkv
`-- tést
    `-- tést - 2 - 2.mkv

3 directories, 4 files

$ LC_ALL=C find . -name '*[![:print:]]*'
./tést
./tést/tést - 2 - 2.mkv
./français
./français/français - 2 -3.mkv

This sets the locale for the find command to the standard POSIX locale (C). In that locale, the print character class contains the characters of the alpha, digit, and punct classes plus the space character, all of which are ASCII. The test -name '*[![:print:]]*' is therefore true for any filename that contains a byte outside the print class, which includes every byte of a multibyte UTF-8 character.

If you do not want to match names that contain other whitespace characters (tabs etc.), use [![:graph:][:space:]] as the bracket expression in the pattern instead (the only difference between print and graph is that graph does not contain the space character).
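To see the difference between the two tests, here is a minimal sketch that can be run against a scratch directory. The file names (plain.txt, a name containing a tab, and "café" encoded as UTF-8) are made up for illustration and are not from the question. The names are built with printf octal escapes so that the snippet does not rely on the shell supporting $'...'.

```shell
# Scratch directory with three hypothetical names (not from the original post).
dir=$(mktemp -d)
tab_name=$(printf 'tab\there')       # ASCII-only name containing a tab
utf8_name=$(printf 'caf\303\251')    # "café" as UTF-8 (bytes c3 a9)
touch "$dir/plain.txt" "$dir/$tab_name" "$dir/$utf8_name"

# Count names with a byte outside [:print:]: the tab name and the UTF-8 name.
print_count=$(cd "$dir" && LC_ALL=C find . -name '*[![:print:]]*' | wc -l)

# Count names with a byte outside both [:graph:] and [:space:]: the UTF-8 name only.
graph_count=$(cd "$dir" && LC_ALL=C find . -name '*[![:graph:][:space:]]*' | wc -l)

echo "print test: $print_count, graph/space test: $graph_count"
rm -rf "$dir"
```

Assuming GNU find, the first count should be 2 and the second 1, because a tab belongs to space but not to print.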

  • I don't know if this is desirable for the OP or not, but this would also include ASCII filenames like that created by touch $'\x01'. Interesting to note that it prints it in the terminal as ./?, but it actually writes the proper name when piped or redirected to a non-terminal, with GNU's find, at least. Commented Jun 24, 2020 at 17:07

Kusalananda's answer also includes filenames with ASCII control characters. That may be desirable, but in case it's not, here's a solution based on Kusalananda's that more exactly answers the question:

LC_ALL=C find . -name $'*[\x80-\xff]*' 

Example use:

$ touch foo bár $'baz\x01'
$ ls
bár 'baz'$'\001' foo
$ LC_ALL=C find . -name $'*[\x80-\xff]*'
./b??r
$ LC_ALL=C find . -name $'*[\x80-\xff]*' | od -tx1z
0000000 2e 2f 62 c3 a1 72 0a                             >./b..r.<
0000007

The difference from what you tried is that here the shell interprets the hex escape sequences (through the $'...' quoting that bash supports) instead of leaving them to find. LC_ALL=C is probably needed because otherwise . in a regex or * in a glob would match those bytes as part of other characters.
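If your shell does not support $'...' (plain POSIX sh, for instance), a possible variant is to let printf produce the raw bytes instead. This is only a sketch of the same idea, not something taken from the answer above: the variable names and the scratch directory are assumptions for illustration, and the file names mirror the example.

```shell
# Build the glob with printf octal escapes: \200 is byte 0x80, \377 is byte 0xff.
pattern=$(printf '*[\200-\377]*')

# Scratch directory with the same three names as in the example above.
dir=$(mktemp -d)
touch "$dir/foo" "$dir/$(printf 'b\303\241r')" "$dir/$(printf 'baz\001')"

# Run the search in the C locale and dump the result as hex, so the
# output does not depend on how a terminal would render the bytes.
found_hex=$(cd "$dir" && LC_ALL=C find . -name "$pattern" | od -An -tx1 | tr -d ' \n')
echo "$found_hex"
rm -rf "$dir"
```

The hex dump should correspond to ./bár followed by a newline (bytes 2e 2f 62 c3 a1 72 0a); the name with the \001 control character is not listed.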
