102

I found this command used to find duplicate files, but it is quite long and confuses me.

For example, if I remove -printf "%s\n", nothing comes out. Why is that? Also, why do they use xargs -I{} -n1?

Is there an easier way to find duplicate files?

[4a-o07-d1:root/798]# find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
0bee89b07a248e27c83fc3d5951213c1  ./test1.txt
0bee89b07a248e27c83fc3d5951213c1  ./test2.txt

  • By "quick", do you mean quickest to type, or quickest to finish? If you want the latter, it will pay to partition by file sizes prior to computing and partitioning by MD5 hashes. Commented Apr 20, 2016 at 10:22
  • Sorry, I don't think I made that clear. I want the least complicated command line for finding duplicate files. Commented Apr 22, 2016 at 1:07
  • Have a look at jDupes - github.com/jbruchon/jdupes Commented Mar 7, 2021 at 13:17
  • jdupes -Q only compares hashes, and -T only compares the first block of file data which is extremely fast but risky. Commented Feb 27, 2024 at 0:08

3 Answers

142

You can make it shorter:

find . ! -empty -type f -exec md5sum {} + | sort | uniq -w32 -dD 

Run md5sum on the found files through find's -exec action, then sort the output and use uniq -w32 -dD to print every line whose first 32 characters (the MD5 sum) occur more than once, i.e. the groups of files with identical content.
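
For illustration, run against the question's two test files, the output would look something like this (hashes and names taken from the question; yours will differ):

$ find . ! -empty -type f -exec md5sum {} + | sort | uniq -w32 -dD
0bee89b07a248e27c83fc3d5951213c1  ./test1.txt
0bee89b07a248e27c83fc3d5951213c1  ./test2.txt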

  • This is not the quickest. For files several GB in size, there's no need to hash them whole. You can hash the first N kB and only do a full hash if the partial hashes match (a rough sketch of this idea follows these comments). Commented Jun 2, 2017 at 0:20
  • What does -dD mean to uniq? Commented Jul 23, 2018 at 2:52
  • @Tim I see you've asked a new question already and got the answer :) Commented Jul 23, 2018 at 15:42
  • This approach was too slow for me. It took >90 minutes to process >380 GB of JPG and MOV files in a nested directory. Using ls -lTR plus the following POSIX awk script processed the same data in 72 seconds: github.com/taltman/scripts/blob/master/unix_utils/… Commented Jan 10, 2021 at 2:58
  • @OndraŽižka true, and even faster is to first compare file sizes without even opening the files. If only two files have the same size, just compare them (cmp); if more than two share a size, then do checksums. But why re-invent the wheel, just use one of the other tools that already does that :) Commented Apr 12, 2021 at 15:33
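
To make the first comment's idea concrete, here is a rough sketch (bash and GNU tools assumed; file names containing newlines are not handled). It hashes only the first 64 kB of each file as a cheap pre-filter; files whose partial hashes match are only candidates and still need a full md5sum or cmp to confirm:

# Pre-filter: hash only the first 64 kB of every non-empty regular file.
# Matching partial hashes only mark *candidate* duplicates; confirm them
# with a full md5sum (or cmp) before treating them as identical.
find . -type f ! -empty -print0 |
while IFS= read -r -d '' f; do
    printf '%s  %s\n' "$(head -c 65536 "$f" | md5sum | cut -d' ' -f1)" "$f"
done | sort | uniq -w32 -dD
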
104

You can use fdupes. From man fdupes:

Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.

You can call it like fdupes -r /path/to/dup/directory and it will print out a list of dupes.
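
For example (the directory name and output below are invented for illustration), duplicates are printed in groups, one group per blank-line-separated block:

$ fdupes -r /path/to/dup/directory
/path/to/dup/directory/test1.txt
/path/to/dup/directory/test2.txt

Adding -d (fdupes -rd) additionally prompts you to choose which copy of each group to keep and deletes the rest.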

Update

You can also give fslint a try. After setting up fslint, run cd /usr/share/fslint/fslint && ./fslint /path/to/directory

  • Not sure why I wasn't able to install fdupes on my CentOS 7. [root@ip-10-0-7-125 ~]# yum install fdupes Loaded plugins: fastestmirror ftp.iij.ad.jp/pub/linux/centos/7.2.1511/os/x86_64/repodata/…: [Errno 14] curl#7 - "Failed to connect to 2001:240:bb8f::1:70: Network is unreachable" Trying other mirror. mirror.vastspace.net/centos/7.2.1511/os/x86_64/repodata/…: [Errno 12] Timeout on mirror.vastspace.net/centos/7.2.1511/os/x86_64/repodata/…: (28, 'Connection timed out after 30001 milliseconds') Trying other mirror. Commented Apr 20, 2016 at 5:35
  • @TheOne: have you enabled the rpmforge repo before running yum install fdupes? Commented Apr 20, 2016 at 5:41
  • Seems like there was something wrong with my CentOS. I couldn't even use the yum update command. Commented Apr 20, 2016 at 5:54
  • @theone that looks like another question. Does it need to be here? Commented Jan 16, 2021 at 9:35
  • fdupes doesn't consider hardlinks to be duplicates, which is great. Commented Aug 16, 2021 at 17:42
42

In case you want to understand the original command, let's go through it step by step.

find -not -empty -type f 

Find all non-empty files in the current directory or any of its subdirectories.

 -printf "%s\n" 

Print each file's size. If you drop these arguments, find will print paths instead, breaking the subsequent steps.
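
For instance, for a directory containing two identical small files and one larger file, this stage alone might print something like this (the sizes are invented for illustration):

$ find -not -empty -type f -printf "%s\n"
4
4
129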

 | sort -rn 

Sort numerically (-n), in reverse order (-r). Sorting in ascending order and comparing as strings not numbers should work just as well, though, so you may drop the -rn flags.
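
For example, with the illustrative sizes from above, a numeric reverse sort keeps equal sizes next to each other, which is all the next step needs:

$ printf '4\n129\n4\n' | sort -rn
129
4
4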

 | uniq -d 

Look for duplicate consecutive rows and keep only those.
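
Continuing the example, only the size that appears on consecutive lines survives:

$ printf '129\n4\n4\n' | uniq -d
4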

 | xargs -I{} -n1 

For each line of input (i.e. each size that occurs more than once), execute the following command, but replace {} by the size. Execute the command once for each line of input, as opposed to passing multiple inputs to a single invocation.
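
You can watch the substitution happen by swapping find for echo, purely for demonstration:

$ printf '4\n129\n' | xargs -I{} echo find -type f -size {}c -print0
find -type f -size 4c -print0
find -type f -size 129c -print0

With -I{} in effect, xargs already runs the command once per input line, so the extra -n1 in the original adds little (newer GNU xargs may even warn that the two options are mutually exclusive).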

 find -type f -size {}c -print0 

This is the command to run for each size: Find files in the current directory which match that size, given in characters (c) or more precisely bytes. Print all the matching file names, separated by null bytes instead of newlines so filenames which contain newlines are treated correctly.
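
To inspect that null-separated output, you can translate the null bytes back into newlines purely for viewing (file names from the question; the 4-byte size is assumed for illustration, the real pipeline substitutes whatever sizes the earlier steps produced):

$ find -type f -size 4c -print0 | tr '\0' '\n'
./test1.txt
./test2.txt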

 | xargs -0 md5sum 

For each of these null-separated names, compute the MD5 checksum of said file. This time we allow passing multiple files to a single invocation of md5sum.

 | sort 

Sort by checksums, since uniq only considers consecutive lines.

 | uniq -w32 --all-repeated=separate 

Find lines which agree in their first 32 bytes (the checksum; after that comes the file name). Print all members of such runs of duplicates, with distinct runs separated by newlines.
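
With two groups of duplicates, the final output would look something like this (the first pair is taken from the question; the second hash and file names are invented for illustration):

0bee89b07a248e27c83fc3d5951213c1  ./test1.txt
0bee89b07a248e27c83fc3d5951213c1  ./test2.txt

9e107d9d372bb6826bd81d3542a419d6  ./a/photo.jpg
9e107d9d372bb6826bd81d3542a419d6  ./b/photo-copy.jpg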

Compared to the simpler command suggested by heemayl, this has the benefit that it will only checksum files which have another file of the same size. It pays for that with repeated find invocations, thus traversing the directory tree multiple times. For those reasons, this command is particularly well-suited for directories with few but big files, since in those cases avoiding a checksum call may be more important than avoiding repeated tree traversal.
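
If you want the size pre-filter without the repeated traversals, one possible middle ground (a sketch only, assuming GNU find, awk and xargs, and file names without newlines or tabs) is to collect sizes and paths in a single pass and checksum only the files whose size occurs more than once:

# One traversal: record "size<TAB>path" for every non-empty file, then
# checksum only the files whose size appears more than once.
# Assumes GNU tools and file names without newlines or tabs.
find . -not -empty -type f -printf '%s\t%p\n' |
awk -F'\t' '{ count[$1]++; line[NR] = $0 }
    END { for (i = 1; i <= NR; i++) { split(line[i], a, "\t")
          if (count[a[1]] > 1) print a[2] } }' |
xargs -r -d '\n' md5sum | sort | uniq -w32 --all-repeated=separate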

  • Which original command? It might be worth adding the full command at the top of the answer... Commented Apr 6 at 14:16
