I want to write a program for personal use that walks the file tree of all of my volumes to find duplicate files. I know there are programs out there that do this, but none do it the way I want to, and few seem to employ file hashing as a check for accuracy, probably because hashing takes time.

While I walk the file trees, I will store three pieces of information in a MySQL database (a rough sketch of the walk follows the list):

  • Full file path
  • File size
  • Hash signature
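
For context, this is roughly the walk I have in mind, using java.nio.file; the root path is only a placeholder and the actual database insert is omitted:

    import java.io.IOException;
    import java.nio.file.*;
    import java.nio.file.attribute.BasicFileAttributes;

    public class TreeWalk {
        public static void main(String[] args) throws IOException {
            Path root = Paths.get("/Volumes/MyDrive"); // placeholder volume root
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    // The full path and size would go into the database here; the
                    // hash column can be filled in now or in a later pass.
                    System.out.println(file.toAbsolutePath() + " -> " + attrs.size() + " bytes");
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException exc) {
                    // Skip unreadable files rather than aborting the whole walk.
                    return FileVisitResult.CONTINUE;
                }
            });
        }
    }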

This is because, for my purposes, a file will be considered a duplicate only if all of these conditions are met:

  • The file name is the same
  • The file size is the same
  • The hash signature is the same

Given that the first two conditions are true, condition three does NOT need an incredibly accurate hashing algorithm.

Once the tree walks are all done, I will search the database for matching file hashes and then check the other conditions...
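
For what it's worth, the lookup I have in mind is roughly the following; the connection URL, credentials, and table/column names are just placeholders:

    import java.sql.*;

    public class FindDuplicateCandidates {
        public static void main(String[] args) throws SQLException {
            // Placeholder URL and schema: files(path, size, hash)
            String url = "jdbc:mysql://localhost:3306/dedupe";
            String sql = "SELECT hash, COUNT(*) AS n FROM files GROUP BY hash HAVING COUNT(*) > 1";
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    // Every hash returned here has at least two entries; the file name
                    // and file size checks on those rows happen as a second step.
                    System.out.println(rs.getString("hash") + " appears " + rs.getLong("n") + " times");
                }
            }
        }
    }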

I know that MD5 seems to be the de facto standard for generating unique file hash signatures, but it is costly in terms of time, and in my project I will be generating a hash signature for millions of files and don't want to wait several days for the process to finish.

So based on my requirements, what would be the fastest way to generate a file hash signature in Java that would be good enough to use as a final validation that the two files are indeed duplicates?

Thank you

Update: After some thought and the discussion below, I've decided to slightly alter my method so that I only perform a deeper comparison between files after the first two conditions are met. That is, I'll walk the tree and create the database entries, then do the deeper computation only once the file name and the size are equal, and I'll be exploring a checksum method as opposed to hashing.
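
As a starting point for the checksum route, here is a minimal sketch using the JDK's built-in java.util.zip.CRC32 streamed through a CheckedInputStream (Adler32 would be a drop-in alternative); the method name and buffer size are arbitrary:

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.CRC32;
    import java.util.zip.CheckedInputStream;

    public class FileChecksum {
        // Streams the file through CRC32 so large files never have to fit in memory.
        static long crc32(Path file) throws IOException {
            try (InputStream in = new BufferedInputStream(Files.newInputStream(file));
                 CheckedInputStream cin = new CheckedInputStream(in, new CRC32())) {
                byte[] buffer = new byte[64 * 1024];
                while (cin.read(buffer) != -1) {
                    // Reading drives the checksum; nothing else to do per block.
                }
                return cin.getChecksum().getValue();
            }
        }
    }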

  • When you say the hash doesn't have to be accurate, do you mean you can tolerate false positives or false negatives? Commented Jun 21, 2022 at 20:15
  • have you checked this: stackoverflow.com/questions/5632604/… they use a checksum Commented Jun 21, 2022 at 20:16
  • Why not hash only files that turn out to be having the same name and same size? It might save a lot of time if there aren't many duplicates to begin with. Commented Jun 21, 2022 at 20:18
  • @shmosel - My thinking in terms of "accuracy" was that an algorithm like MD5 strives to produce a truly unique signature for any given file, which I assume makes that method more costly in terms of clock cycles... and given that in my case the file name and the file size are already the same, I assumed that the same level of "accuracy" in the hash signature would not require extreme detail - but as I write this, my thinking might be incorrect, since the files could be very similar, and perhaps that kind of detail would be necessary... I'm not entirely sure about this. Commented Jun 21, 2022 at 21:00
  • @akarnokd - It's funny that you made this comment, because literally two seconds after clicking submit on this question, the thought occurred to me to just record the tree in the database without hashing, then only hash the files when the first two conditions are met... it would still benefit from the fastest method when it does come to hashing the files, however. Commented Jun 21, 2022 at 21:01

1 Answer

I have recently been researching a similar problem and ended up with a similar set of conditions. I decided to try MurmurHash3, as it seems purpose-built for this application. It is not cryptographically secure, which is not needed in this scenario, but it seems to be very lightweight.

Apache has an implementation in their commons-codec package.
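
A rough sketch of how it can be used (not production code): this reads the whole file into memory for simplicity, so for very large files you would hash in chunks or use the incremental variants instead. The class and method names here are just placeholders:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import org.apache.commons.codec.digest.MurmurHash3;

    public class Murmur3FileHash {
        // Hashes the entire file contents with the 128-bit x64 variant.
        static String hash128(Path file) throws IOException {
            byte[] data = Files.readAllBytes(file);
            long[] h = MurmurHash3.hash128x64(data); // two 64-bit halves
            return Long.toHexString(h[0]) + Long.toHexString(h[1]);
        }
    }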

5 Comments

I have a feeling that the I/O cost of reading the file would make the choice of algorithm insignificant. But maybe there's a way to sample bytes instead of reading the whole thing.
I agree, hashing the first block of the file may be enough to get a unique hash, depending on the file type.
In my use case I have to read the file to perform multiple qc checks on it anyway, so the read cost is already incurred.
@SephB - When you say it might be just as accurate to hash the first block of the file depending on the type, what kinds of different types would make that method potentially inaccurate, and also if you're just reading the first block of a file, wouldn't it be faster to simply compare the blocks against one another instead of hashing them?
@MichaelSims Many file types start with fixed content, and each byte of fixed content is going to decrease the number of variable bytes in the first block, essentially decreasing the amount of data you are hashing. Yes, it is always faster to not hash than to hash if you are reading the files at the same time. If you are processing the files at different times, storing and loading a hash saves having to read the file again.
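
To make the block-comparison idea above concrete, here is a minimal sketch that compares only the first block of two candidate files (the block size and helper name are arbitrary, and it assumes Java 9+ for readNBytes and the ranged Arrays.equals):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;

    public class FirstBlockCompare {
        // Compares only the first blockSize bytes of two files.
        static boolean firstBlocksEqual(Path a, Path b, int blockSize) throws IOException {
            byte[] bufA = new byte[blockSize];
            byte[] bufB = new byte[blockSize];
            try (InputStream inA = Files.newInputStream(a);
                 InputStream inB = Files.newInputStream(b)) {
                int readA = inA.readNBytes(bufA, 0, blockSize);
                int readB = inB.readNBytes(bufB, 0, blockSize);
                return readA == readB && Arrays.equals(bufA, 0, readA, bufB, 0, readB);
            }
        }
    }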
