If you really need to guarantee 100% that the files are 100% identical, then you need to do a byte-for-byte comparison. That's just entailed in the problem: the only hashing method with 0% risk of false matching is the identity function!
What we're left with is short-cuts that can give us quick answers and let us skip the byte-for-byte comparison some of the time.
As a rule, the only short-cut on proving equality is proving identity. In OO code that would be showing that two objects were in fact the same object. The closest thing in files is if a binding or NTFS junction meant two paths led to the same file. This happens so rarely that, unless the nature of the work makes it more usual than normal, it's not going to be a net gain to check for.
So we're left short-cutting on finding mis-matches. This does nothing to speed up our passes, but it makes our fails faster:
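If the identity check does turn out to be worth it for your workload, it is cheap to write. A minimal sketch in Python (the helper name `is_same_file` is made up for illustration); `os.path.samefile` compares the underlying device and file identifiers, so it sees through links and different spellings of one path:

```python
import os

def is_same_file(path_a, path_b):
    """Return True if both paths resolve to the same underlying file."""
    try:
        return os.path.samefile(path_a, path_b)
    except OSError:
        # A path could not be stat'ed, so identity is not proven.
        return False
```

A `True` here proves equality outright; a `False` proves nothing and the real comparison still has to happen.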
- Different size, not byte-for-byte equal. Simples!
- If you will examine the same file more than once, then hash it and record the hash. Different hash, guaranteed not equal. The reduction in files that need a one-to-one comparison is massive.
- Many file formats are likely to have some areas in common. In particular, the first bytes of many formats tend to be "magic numbers", headers etc. Either skip them, or skip them and then check them last (if there is a chance of them being different but it's low).
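The first two short-cuts can be sketched roughly like this in Python. The names `file_digest`, `files_equal` and `_digest_cache` are made up for illustration, the cache is a plain in-memory dict keyed on path, size and modification time (an assumption about how you'd detect a changed file), and BLAKE2b from the standard library stands in for whatever fast hash you prefer:

```python
import filecmp
import hashlib
import os

# Hypothetical in-memory cache: (path, size, mtime_ns) -> digest
_digest_cache = {}

def file_digest(path):
    """Hash a file once and remember the result for later comparisons."""
    st = os.stat(path)
    key = (os.path.abspath(path), st.st_size, st.st_mtime_ns)
    if key not in _digest_cache:
        h = hashlib.blake2b(digest_size=16)  # 128-bit digest, stand-in hash
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        _digest_cache[key] = h.digest()
    return _digest_cache[key]

def files_equal(path_a, path_b):
    # Cheapest test first: different sizes can never be byte-for-byte equal.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    # Different digests prove a mismatch; equal digests prove nothing.
    if file_digest(path_a) != file_digest(path_b):
        return False
    # Only a full comparison can confirm equality.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Note that the hash step only pays off when the same files get compared more than once; for a one-off pair, hashing both costs more than just comparing them.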
Then there's the matter of making the actual comparison as fast as possible. Loading batches of 4 octets at a time into an integer and doing integer comparison will often be faster than octet-per-octet.
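Python doesn't give you direct control over word-sized loads, but the same idea applies one level up: read sizeable blocks and compare whole buffers, so the inner loop runs in native code rather than octet-by-octet in the interpreter. A minimal sketch, with `same_bytes` and the 64 KiB block size as arbitrary choices of mine:

```python
def same_bytes(path_a, path_b, chunk_size=1 << 16):
    """Byte-for-byte comparison, done one block at a time."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # first differing block ends the comparison
            if not chunk_a:
                return True   # both files hit end-of-file together
```

The right block size depends on the storage and the runtime, so it's worth profiling rather than trusting any particular number.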
Threading can help. One way is to split the actual comparison of the file into more than one operation, but if possible a bigger gain will be found by doing completely different comparisons in different threads. I'd need to know a bit more about just what you are doing to advise much, but the main thing is to make sure the output of the tests is thread-safe.
If you do have more than one thread examining the same files, have them work far from each other. E.g. if you have four threads, you could split the file in four, or you could have one take byte 0, 4, 8 while another takes byte 1, 5, 9, etc. (or 4-octet group 0, 4, 8 etc). The latter is much more likely to have false sharing issues than the former, so don't do that.
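Splitting one comparison into contiguous regions could look something like the sketch below. The function names and the default of four workers are my own choices, and each worker opens its own file handles so the seeks don't interfere. Each worker simply returns its own result, so there is no shared mutable output to protect:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def _region_equal(path_a, path_b, start, length, chunk_size=1 << 16):
    """Compare one contiguous region of the two files."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        fa.seek(start)
        fb.seek(start)
        remaining = length
        while remaining > 0:
            n = min(chunk_size, remaining)
            chunk_a = fa.read(n)
            chunk_b = fb.read(n)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:
                break  # file ended early; nothing more to read
            remaining -= len(chunk_a)
    return True

def files_equal_threaded(path_a, path_b, workers=4):
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False
    if size == 0:
        return True
    region = -(-size // workers)  # ceiling division: bytes per worker
    starts = range(0, size, region)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda s: _region_equal(path_a, path_b, s, min(region, size - s)),
            starts,
        )
        return all(results)
```

Whether this actually beats a single-threaded pass depends heavily on the storage and the runtime, so measure before committing to it.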
Edit:
It also depends on just what you're doing with the files. You say you need 100% certainty, so this bit doesn't apply to you, but it's worth adding for the more general problem that if the cost of a false-positive is a waste of resources, time or memory rather than an actual failure, then reducing it through a fuzzy short-cut could be a net-win and it can be worth profiling to see if this is the case.
If you are using a hash to speed things up (it can at least find some definite mis-matches faster), then Bob Jenkins' SpookyHash is a good choice; it's not cryptographically secure, but if that's not your purpose it creates a 128-bit hash very quickly (much faster than a cryptographic hash, or even than the approaches taken with many GetHashCode() implementations) that is extremely good at not having accidental collisions (the sort of deliberate collisions cryptographic hashes avoid is another matter). I implemented it for .NET and put it on NuGet because nobody else had when I found myself wanting to use it.
"not 100% match the two files with the same hash"
Are you sure? Do you know MD5, SHA2, SHA-224, SHA-256, SHA-384, SHA-512, and their probabilities?