I want to write a program for personal use that walks the file tree of all of my volumes to find duplicate files. I know there are programs out there that do this, but none do it the way I want to, and few seem to employ file hashing as a check for accuracy, probably because hashing takes time.

While I walk the file trees, I will store three pieces of information in a MySQL database (a rough sketch of the walk follows the list):

  • Full file path
  • File size
  • Hash signature
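
For context, this is roughly the walk I have in mind, using java.nio.file; the root path is only a placeholder and the actual database insert is omitted:

    import java.io.IOException;
    import java.nio.file.*;
    import java.nio.file.attribute.BasicFileAttributes;

    public class TreeWalk {
        public static void main(String[] args) throws IOException {
            Path root = Paths.get("/Volumes/MyDrive"); // placeholder volume root
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    // The full path and size would go into the database here; the
                    // hash column can be filled in now or in a later pass.
                    System.out.println(file.toAbsolutePath() + " -> " + attrs.size() + " bytes");
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException exc) {
                    // Skip unreadable files rather than aborting the whole walk.
                    return FileVisitResult.CONTINUE;
                }
            });
        }
    }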

This is because, for my purposes, a file will be considered a duplicate only if all of these conditions are met:

  • The file name is the same
  • The file size is the same
  • The hash signature is the same

Given that the first two conditions are true, condition three does NOT need an incredibly accurate hashing algorithm.

Once the tree walks are all done, I will search the database for matching file hashes and then check the other conditions...
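
For what it's worth, the lookup I have in mind is roughly the following; the connection URL, credentials, and table/column names are just placeholders:

    import java.sql.*;

    public class FindDuplicateCandidates {
        public static void main(String[] args) throws SQLException {
            // Placeholder URL and schema: files(path, size, hash)
            String url = "jdbc:mysql://localhost:3306/dedupe";
            String sql = "SELECT hash, COUNT(*) AS n FROM files GROUP BY hash HAVING COUNT(*) > 1";
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    // Every hash returned here has at least two entries; the file name
                    // and file size checks on those rows happen as a second step.
                    System.out.println(rs.getString("hash") + " appears " + rs.getLong("n") + " times");
                }
            }
        }
    }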

I know that MD5 seems to be the de facto standard for generating unique file hash signatures, but it is costly in terms of time, and in my project I will be generating a hash signature for millions of files and don't want to wait several days for the process to finish.

So based on my requirements, what would be the fastest way to generate a file hash signature in Java that would be good enough to use as a final validation that the two files are indeed duplicates?

Thank you

Update: After some thought and the discussion below, I've decided to slightly alter my method so that I only perform a deeper comparison between files after the first two conditions are met. That is, I'll walk the tree and create the database entries, then do the deeper computation only once the file name and the size are equal, and I'll be exploring a checksum method as opposed to hashing.
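
As a starting point for the checksum route, here is a minimal sketch using the JDK's built-in java.util.zip.CRC32 streamed through a CheckedInputStream (Adler32 would be a drop-in alternative); the method name and buffer size are arbitrary:

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.CRC32;
    import java.util.zip.CheckedInputStream;

    public class FileChecksum {
        // Streams the file through CRC32 so large files never have to fit in memory.
        static long crc32(Path file) throws IOException {
            try (InputStream in = new BufferedInputStream(Files.newInputStream(file));
                 CheckedInputStream cin = new CheckedInputStream(in, new CRC32())) {
                byte[] buffer = new byte[64 * 1024];
                while (cin.read(buffer) != -1) {
                    // Reading drives the checksum; nothing else to do per block.
                }
                return cin.getChecksum().getValue();
            }
        }
    }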

  • When you say the hash doesn't have to be accurate, do you mean you can tolerate false positives or false negatives? Commented Jun 21, 2022 at 20:15
  • have you checked this: stackoverflow.com/questions/5632604/… they use a checksum Commented Jun 21, 2022 at 20:16
  • Why not hash only files that turn out to be having the same name and same size? It might save a lot of time if there aren't many duplicates to begin with. Commented Jun 21, 2022 at 20:18
  • @shmosel - My thinking in terms of "accuracy" was that an algorithm like MD5 strives to produce a truly unique signature for any given file, which I assume makes that method more costly in terms of clock cycles... and given that in my case the file name and the file size are already the same, I assumed that the same level of "accuracy" in the hash signature would not require extreme detail - but as I write this, my thinking might be incorrect, since the files could be very similar, and perhaps that kind of detail would be necessary... I'm not entirely sure about this. Commented Jun 21, 2022 at 21:00
  • @akarnokd - It's funny that you made this comment, because literally two seconds after clicking submit on this question, the thought occurred to me to just record the tree in the database without hashing, then only hash the files when the first two conditions are met... it would still benefit from the fastest method when it does come to hashing the files, however. Commented Jun 21, 2022 at 21:01

1 Answer

I have recently been researching a similar problem and ended up with a similar set of conditions. I decided to try MurmurHash3, as it seems purpose-built for this application. It is not cryptographically secure, which is not needed in this scenario, but it seems to be very lightweight.

Apache has an implementation in their commons-codec package.
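
A rough sketch of how it can be used (not production code): this reads the whole file into memory for simplicity, so for very large files you would hash in chunks or use the incremental variants instead. The class and method names here are just placeholders:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import org.apache.commons.codec.digest.MurmurHash3;

    public class Murmur3FileHash {
        // Hashes the entire file contents with the 128-bit x64 variant.
        static String hash128(Path file) throws IOException {
            byte[] data = Files.readAllBytes(file);
            long[] h = MurmurHash3.hash128x64(data); // two 64-bit halves
            return Long.toHexString(h[0]) + Long.toHexString(h[1]);
        }
    }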

5 Comments

I have a feeling that the I/O cost of reading the file would make the choice of algorithm insignificant. But maybe there's a way to sample bytes instead of reading the whole thing.
I agree, hashing the first block of the file may be enough to get a unique hash, depending on the file type.
In my use case I have to read the file to perform multiple qc checks on it anyway, so the read cost is already incurred.
@SephB - When you say it might be just as accurate to hash the first block of the file depending on the type, what kinds of different types would make that method potentially inaccurate, and also if you're just reading the first block of a file, wouldn't it be faster to simply compare the blocks against one another instead of hashing them?
@MichaelSims Many file types start with fixed content, and each byte of fixed content is going to decrease the number of variable bytes in the first block, essentially decreasing the amount of data you are hashing. Yes, it is always faster to not hash than to hash if you are reading the files at the same time. If you are processing the files at different times, storing and loading a hash saves having to read the file again.
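
To make the block-comparison idea above concrete, here is a minimal sketch that compares only the first block of two candidate files (the block size and helper name are arbitrary, and it assumes Java 9+ for readNBytes and the ranged Arrays.equals):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;

    public class FirstBlockCompare {
        // Compares only the first blockSize bytes of two files.
        static boolean firstBlocksEqual(Path a, Path b, int blockSize) throws IOException {
            byte[] bufA = new byte[blockSize];
            byte[] bufB = new byte[blockSize];
            try (InputStream inA = Files.newInputStream(a);
                 InputStream inB = Files.newInputStream(b)) {
                int readA = inA.readNBytes(bufA, 0, blockSize);
                int readB = inB.readNBytes(bufB, 0, blockSize);
                return readA == readB && Arrays.equals(bufA, 0, readA, bufB, 0, readB);
            }
        }
    }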
