0

I have a list of duplicate files on my hard disk. I'm having a hard time checking whether a file is not on the list. Grepping with

grep $1 $2 > /dev/null || echo $1 

works. But I can't get it to work in the -exec part of the find command.

find 250G_EXT4/ -type f -exec grep "{}" duplicates_sorted.txt \> /dev/null \|\| echo {} \; 

The messages are

grep: >: Datei oder Verzeichnis nicht gefunden (File or directory not found)
grep: ||: Datei oder Verzeichnis nicht gefunden
...
grep: echo: Datei oder Verzeichnis nicht gefunden
...

Does anybody have a clue how to get the escaping right? Or maybe a different idea?

3
  • -exec takes a single command and its arguments; you can't include redirections, which are processed by a shell, not passed to the exec system call, or ||, which is a shell operator for conditional execution. Commented Apr 27, 2014 at 13:54
  • Do your file paths contain spaces (or some other naughty character)? Commented Apr 27, 2014 at 15:12
  • They do contain spaces, and whatever other naughty things can be found in a file name, like umlauts. Commented Apr 27, 2014 at 15:18

4 Answers 4

5

Why not simply

find | grep -vFf duplicates_sorted.txt - 

This should be a lot faster as well.

(The -F flag specifies literal matching, i.e. no regex matching. Otherwise a.c would match abc, etc.)
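To make the effect of -F concrete (and of -x, which is an addition here and not part of the command above), here is a small sketch. The function name, the temporary list file, and the sample names are all made up for illustration.

```shell
# Hypothetical demo: the list contains only "a.c"; we ask grep which of
# four candidate names are NOT on the list, under three flag combinations.
demo_grep_flags() {
    list=$(mktemp) || return 1
    printf '%s\n' 'a.c' > "$list"
    names='a.c
abc
bodega.c
other.txt'
    for flags in -v -vF -vFx; do
        echo "grep $flags:"
        # -v inverts the match, -F treats list lines as literal strings,
        # -x requires a list line to match the whole input line.
        printf '%s\n' "$names" | grep "$flags" -f "$list"
    done
    rm -f "$list"
}

demo_grep_flags
```

With plain regex matching only other.txt is reported as missing, because the pattern a.c also matches abc and bodega.c. With -F, abc is reported as well, and only with -Fx is bodega.c no longer mistaken for a listed file.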

find -exec takes a single command; that single command can be a shell with an arbitrarily complex script passed to it:

find -exec sh -c 'grep -q "$1" file || echo "$1"' dummy {} \; 

The first argument after the command string passed to sh -c is used as $0, so we pass in a dummy placeholder value.
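As a rough sketch of how the arguments line up, and of a variant that starts fewer shells by batching file names with +, consider the following. The function name list_missing and its two parameters are hypothetical, and the -x and -F flags are an assumption on my part about what "on the list" should mean (whole-line, literal matches).

```shell
# The word right after the command string becomes $0, the next one $1.
sh -c 'echo "0=$0 1=$1"' dummy 'some file.txt'
# prints: 0=dummy 1=some file.txt

# Hypothetical helper: print every file under $1 whose path is not a
# line of the list file $2.
list_missing() {
    find "$1" -type f -exec sh -c '
        list=$1; shift                 # first real argument: the list file
        for f do                       # remaining arguments: found paths
            grep -qxF -- "$f" "$list" || printf "%s\n" "$f"
        done' sh "$2" {} +
}
```

Called as list_missing 250G_EXT4/ duplicates_sorted.txt, this would only work if the list stores paths in the same form that find prints them. Because each name arrives as a separate argument, spaces and umlauts need no extra escaping.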


4 Comments

Although possibly also -x, which specifies "match whole lines"; otherwise a.c also matches bodega.c.
But the output from find will contain a path component which probably isn't in the file.
+1; good to know how to pass additional parameters to sh -c and to be aware of the associated $0 pitfall - though you can always just use {} directly inside the command string.
Didn't quite grasp the usage of sh but after a minute it seems comprehensible. Thanks for that nice solution!
3

-exec takes a single command and its arguments. > /dev/null is not an argument, but a redirection that the shell processes before running grep. Likewise, || is not an argument, but a shell operator used to determine whether or not to run echo depending on the exit status of grep. To answer your exact question, you need to pass your command list as an argument to sh -c.

find 250G_EXT4/ -type f \
    -exec sh -c 'grep "{}" duplicates_sorted.txt > /dev/null || echo "{}"' \;

1 Comment

+1, but you should double-quote the {} instances to protect them from shell expansions. (Note that this is only necessary inside a command string used with sh -c, not when using {} as a direct, separate argument).
0

First of all, I would use grep's -q option and not bother with stream redirection. I would also consider using fgrep (or grep -F) instead of grep, together with the -x option, to match entire lines as plain strings rather than regular expressions. And finally, I would avoid shell piping.

The resulting command should look like this:

find /path/to/dir -type f -exec grep -v -q -x -F {} /path/to/duplicates.txt \; -print 

Or something similar depending on your needs.
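One caveat with combining -v and -q: grep then succeeds as soon as any line of the list differs from the file name, which is the case for almost any list with more than one entry. A possible variant, sketched here as a hypothetical function not_in_list, drops -v and negates the -exec test with ! instead, so -print fires only when grep finds no exact match.

```shell
# Hypothetical helper: print files under $1 whose path is not a whole
# line of the list file $2.
not_in_list() {
    # grep -q -x -F exits 0 only if some line of "$2" equals the path;
    # the leading "!" inverts that result before -print is evaluated.
    find "$1" -type f ! -exec grep -q -x -F -- {} "$2" \; -print
}
```

For example, not_in_list /path/to/dir /path/to/duplicates.txt. This still runs one grep per file, so it will be slower on large trees than a single grep over the whole find output.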

3 Comments

This finds those entries which are in the file, not the ones which aren't.
Well, your answer is probably better, especially since it runs grep only once on the entire find output. But there could be corner cases when a file name contains "bad characters". I'm not sure whether both of our answers handle them correctly.
grep is line-oriented anyway, so this could not work if file names contained newlines.
0

If your goal is to find duplicate files (i.e. files having the same content, independently of their name) I would proceed differently.

I would first compute a checksum, perhaps simply with md5sum, for each file, and sort them by their checksum, e.g.

find 250G_EXT4/ -type f -exec md5sum '{}' \; \
    | sort > /tmp/md5sumlist.txt

then I would handle those few entries having the same md5 checksum, and use cmp to compare their content.
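A minimal sketch of that second step, assuming GNU md5sum and file names without newlines or backslashes: the hypothetical function dup_groups keeps only the lines whose checksum occurs more than once, so the candidates for cmp end up next to each other.

```shell
# Hypothetical helper: list "checksum  path" lines for files under $1
# that share their MD5 checksum with at least one other file.
dup_groups() {
    find "$1" -type f -exec md5sum {} + |
        sort |
        awk '{ count[$1]++; line[NR] = $0; key[NR] = $1 }
             END {
                 # second pass over the stored lines, in sorted order
                 for (i = 1; i <= NR; i++)
                     if (count[key[i]] > 1) print line[i]
             }'
}
```

For two paths reported with the same checksum, cmp -s file1 file2 exits with status 0 only if their contents are byte-for-byte identical, which rules out a checksum collision.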

1 Comment

Well, it already is a list based on md5sums. The handling is the issue.
