2

I need to compare the contents of very large files. The speed of the program is important. I need a 100% match. I have read a lot of information but have not found an optimal solution. I am considering two options, and both have problems.

  1. Compare the whole file byte by byte - not fast enough for large files.
  2. Compare the files using hashes - two files with the same hash are not guaranteed to be a 100% match.

What would you suggest? Maybe I could make use of threads? Could MemoryMappedFile be helpful?

  • What have you tried on your own? For example, have you run a Google search? Try that first. Commented Aug 24, 2012 at 21:13
  • Do you need to see diffs, or just confirmation of whether those 2 files are equal or different? Commented Aug 24, 2012 at 21:13
  • How large is your large file? What is the exact time in nanoseconds a comparison may take? Commented Aug 24, 2012 at 21:13
  • What are you trying to accomplish? Determine if two files are the same? Or find duplicate files in a large set of files? Or are you comparing in an effort to "sort" a list of files based on content? Commented Aug 24, 2012 at 21:15
  • 2
    "not 100% match the two files with the same hash" Are you sure? Do you know MD5, SHA-2, SHA-224, SHA-256, SHA-384, SHA-512, and their collision probabilities? Commented Aug 24, 2012 at 21:15

7 Answers

9

If you really need to guarantee that the files are 100% identical, then you need to do a byte-by-byte comparison. That's inherent in the problem: the only hashing method with 0% risk of false matches is the identity function!

What we're left with is short-cuts that can give us quick answers and let us skip the byte-for-byte comparison some of the time.

As a rule, the only short-cut to proving equality is proving identity. In OO code, that would be showing that two objects were in fact the same object. The closest thing for files is when a binding or NTFS junction means two paths point to the same file. This happens so rarely that, unless the nature of the work makes it more common than usual, it's not going to be a net gain to check for it.

So we're left with short-cuts for finding mis-matches. This does nothing to speed up our passes, but it makes our fails faster:

  1. Different size, not byte-for-byte equal. Simples!
  2. If you will examine the same file more than once, then hash it and record the hash. Different hash, guaranteed not equal. The reduction in files that need a one-to-one comparison is massive.
  3. Many file formats are likely to have some areas in common. Particularly the first bytes for many formats tend to be "magic numbers", headers etc. Either skip them, or skip them at first and check them last (if there is a chance of them being different, but it's low).

Then there's the matter of making the actual comparison as fast as possible. Loading batches of 4 octets at a time into an integer and doing integer comparisons will often be faster than comparing octet by octet.
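For illustration, here's a minimal sketch of that trick in C#, comparing two in-memory buffers eight bytes at a time with BitConverter and falling back to single bytes for the tail; the method name and buffer handling are my own choices, not part of the answer above.

    // Compare `count` bytes of two buffers a machine word (8 bytes) at a time,
    // then byte by byte for any remainder. Assumes both buffers hold `count` valid bytes.
    static bool BuffersEqual(byte[] a, byte[] b, int count)
    {
        int i = 0;
        for (; i <= count - 8; i += 8)
        {
            if (BitConverter.ToInt64(a, i) != BitConverter.ToInt64(b, i))
                return false;
        }
        for (; i < count; i++)
        {
            if (a[i] != b[i])
                return false;
        }
        return true;
    }

On newer .NET, a.AsSpan(0, count).SequenceEqual(b.AsSpan(0, count)) does the same thing (and better, with SIMD) in a single call.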

Threading can help. One way is to split the actual comparison of the file into more than one operation, but if possible a bigger gain will be found by doing completely different comparisons in different threads. I'd need to know a bit more about just what you are doing to advise much, but the main thing is to make sure the output of the tests is thread-safe.

If you do have more than one thread examining the same files, have them work far from each other. E.g. if you have four threads, you could split the file in four, or you could have one take byte 0, 4, 8 while another takes byte 1, 5, 9, etc. (or 4-octet group 0, 4, 8 etc). The latter is much more likely to have false sharing issues than the former, so don't do that.
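As a rough sketch of the "split the file into contiguous ranges" layout (the former option above), each worker below compares one slice of both files through its own pair of streams. The 16 MB slice size, the FillBuffer helper and the use of Parallel.For are illustrative choices, not something prescribed by this answer. Requires System, System.IO and System.Threading.Tasks.

    // Each worker compares one contiguous slice of the two files.
    static bool ParallelRangeCompare(string path1, string path2)
    {
        long length = new FileInfo(path1).Length;
        if (length != new FileInfo(path2).Length)
            return false;

        const long slice = 1L << 24;                       // 16 MB per worker slice
        long sliceCount = (length + slice - 1) / slice;    // includes the partial tail
        bool equal = true;

        Parallel.For(0L, sliceCount, (index, state) =>
        {
            long start = index * slice;
            int toRead = (int)Math.Min(slice, length - start);

            using (FileStream f1 = File.OpenRead(path1))
            using (FileStream f2 = File.OpenRead(path2))
            {
                f1.Position = start;
                f2.Position = start;

                var b1 = new byte[toRead];
                var b2 = new byte[toRead];
                FillBuffer(f1, b1);
                FillBuffer(f2, b2);

                for (int i = 0; i < toRead; i++)
                {
                    if (b1[i] != b2[i])
                    {
                        equal = false;
                        state.Stop();
                        return;
                    }
                }
            }
        });

        return equal;
    }

    // FileStream.Read may return fewer bytes than requested; keep reading until the buffer is full.
    static void FillBuffer(FileStream stream, byte[] buffer)
    {
        int offset = 0;
        while (offset < buffer.Length)
        {
            int read = stream.Read(buffer, offset, buffer.Length - offset);
            if (read == 0) throw new EndOfStreamException();
            offset += read;
        }
    }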

Edit:

It also depends on just what you're doing with the files. You say you need 100% certainty, so this bit doesn't apply to you, but it's worth adding for the more general problem: if the cost of a false positive is wasted resources, time or memory rather than an actual failure, then reducing false positives through a fuzzy short-cut could be a net win, and it can be worth profiling to see whether that is the case.

If you are using a hash to speed things up (it can at least find some definite mis-matches faster), then Bob Jenkins' Spooky Hash is a good choice; it's not cryptographically secure, but if that's not your purpose it creates a 128-bit hash very quickly (much faster than a cryptographic hash, or even than the approaches taken with many GetHashCode() implementations) that is extremely good at not having accidental collisions (the sort of deliberate collisions cryptographic hashes avoid is another matter). I implemented it for .Net and put it on NuGet because nobody else had when I found myself wanting to use it.


Comments

2

Serial Compare

Test File Size(s): 118 MB
Duration: 579 ms
Equal? true

    static bool Compare(string filePath1, string filePath2)
    {
        using (FileStream file = File.OpenRead(filePath1))
        {
            using (FileStream file2 = File.OpenRead(filePath2))
            {
                if (file.Length != file2.Length)
                {
                    return false;
                }

                int count;
                const int size = 0x1000000;
                var buffer = new byte[size];
                var buffer2 = new byte[size];

                while ((count = file.Read(buffer, 0, buffer.Length)) > 0)
                {
                    file2.Read(buffer2, 0, buffer2.Length);

                    for (int i = 0; i < count; i++)
                    {
                        if (buffer[i] != buffer2[i])
                        {
                            return false;
                        }
                    }
                }
            }
        }
        return true;
    }


Parallel Compare

Test File Size(s): 118 MB
Duration: 340 ms
Equal? true

    static bool Compare2(string filePath1, string filePath2)
    {
        bool success = true;

        var info = new FileInfo(filePath1);
        var info2 = new FileInfo(filePath2);

        if (info.Length != info2.Length)
        {
            return false;
        }

        long fileLength = info.Length;
        const int size = 0x1000000;

        Parallel.For(0, fileLength / size, x =>
        {
            var start = (int)x * size;
            if (start >= fileLength)
            {
                return;
            }

            using (FileStream file = File.OpenRead(filePath1))
            {
                using (FileStream file2 = File.OpenRead(filePath2))
                {
                    var buffer = new byte[size];
                    var buffer2 = new byte[size];

                    file.Position = start;
                    file2.Position = start;

                    int count = file.Read(buffer, 0, size);
                    file2.Read(buffer2, 0, size);

                    for (int i = 0; i < count; i++)
                    {
                        if (buffer[i] != buffer2[i])
                        {
                            success = false;
                            return;
                        }
                    }
                }
            }
        });

        return success;
    }


MD5 Compare

Test File Size(s): 118 MB
Duration: 702 ms
Equal? true

    static bool Compare3(string filePath1, string filePath2)
    {
        byte[] hash1 = GenerateHash(filePath1);
        byte[] hash2 = GenerateHash(filePath2);

        if (hash1.Length != hash2.Length)
        {
            return false;
        }

        for (int i = 0; i < hash1.Length; i++)
        {
            if (hash1[i] != hash2[i])
            {
                return false;
            }
        }

        return true;
    }

    static byte[] GenerateHash(string filePath)
    {
        MD5 crypto = MD5.Create();

        using (FileStream stream = File.OpenRead(filePath))
        {
            return crypto.ComputeHash(stream);
        }
    }

tl;dr Compare byte segments in parallel to determine if two files are equal.

1 Comment

Parallel compare doesn't really work. Although it finds similar files, not all bytes are the same. If I compare two files that were marked as duplicates using your algorithm outside the parallel for, I get differences in some bytes.
1

Why not both?

Compare with hashes for the first pass, then return to conflicts and perform the byte-by-byte comparison. This allows maximal speed with guaranteed 100% match confidence.
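A sketch of what that two-pass approach could look like, assuming the first-pass hashes are cached per path (as discussed in the comments below); MD5 as the first-pass hash, the dictionary cache and the helper names are illustrative choices. Requires System.Collections.Generic, System.IO, System.Linq and System.Security.Cryptography.

    // Pass 1: compare cheap (cached) hashes. Pass 2: byte-by-byte only when the hashes collide.
    static readonly Dictionary<string, byte[]> HashCache = new Dictionary<string, byte[]>();

    static bool FilesEqual(string path1, string path2)
    {
        byte[] h1 = GetOrComputeHash(path1);
        byte[] h2 = GetOrComputeHash(path2);

        if (!h1.SequenceEqual(h2))
            return false;                          // different hashes: guaranteed different

        return ByteByByteEqual(path1, path2);      // same hash: confirm, it could be a collision
    }

    static byte[] GetOrComputeHash(string path)
    {
        byte[] hash;
        if (!HashCache.TryGetValue(path, out hash))
        {
            using (var md5 = MD5.Create())
            using (var stream = File.OpenRead(path))
            {
                hash = md5.ComputeHash(stream);
            }
            HashCache[path] = hash;
        }
        return hash;
    }

    static bool ByteByByteEqual(string path1, string path2)
    {
        using (FileStream f1 = File.OpenRead(path1))
        using (FileStream f2 = File.OpenRead(path2))
        {
            if (f1.Length != f2.Length)
                return false;

            var b1 = new byte[1 << 16];
            var b2 = new byte[1 << 16];
            int n;
            while ((n = f1.Read(b1, 0, b1.Length)) > 0)
            {
                // Read the same number of bytes from the second file so the buffers stay aligned.
                int filled = 0;
                while (filled < n)
                {
                    int r = f2.Read(b2, filled, n - filled);
                    if (r == 0) return false;
                    filled += r;
                }

                for (int i = 0; i < n; i++)
                    if (b1[i] != b2[i])
                        return false;
            }
            return true;
        }
    }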

6 Comments

  • Once you've done the hash, you've already read the whole file. Why do it again?
  • @ZaidMasud The hash could be stored, though.
  • Depending on the number of comparisons, the chance of a collision in a hash (even MD5, which is not cryptographically secure) can be unacceptably high. For an explanation, here is a link: blog.mischel.com/2012/04/13/hash-codes-are-not-unique
  • @jtm001 For a quick route to "false", cryptographic security is irrelevant; the collision safety just has to be good enough that false collisions don't cause more unnecessary checks than they prevent, which even a CRC would likely be enough for.
  • @JonHanna Agreed completely; the point was intended to address the comment "not 100% match the two files with the same hash Are you sure?"
1

There's no avoiding byte-for-byte comparisons if you want a perfect comparison (the file still has to be read byte-for-byte to do any hashing), so the issue is how you read and compare the data.

So there are two things you'll want to address:

  • Concurrency - Make sure you're reading data at the same time you're checking it.
  • Buffer Size - Reading the file 1 byte at a time is going to be slow; make sure you read it into a decent-sized buffer (around 8 MB should do nicely for very large files).

The objective is to do your comparison as fast as the hard disk can read the data, and to keep reading with no delays. If you're doing everything as fast as the data can be read from the drive, then that's as fast as it can possibly be done, because the disk's read speed becomes the bottleneck.
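One way to get that overlap is double buffering: start reading the next chunk of both files while the current chunk is being compared. The sketch below assumes asynchronous file streams and an 8 MB chunk size picked only as an example; the names and structure are mine, not a known-good implementation. Requires System.IO and System.Threading.Tasks.

    // Double-buffered comparison: the next pair of chunks is read while the current pair is compared.
    static async Task<bool> CompareOverlappedAsync(string path1, string path2)
    {
        const int chunkSize = 8 * 1024 * 1024;   // 8 MB, worth tuning

        using (var f1 = new FileStream(path1, FileMode.Open, FileAccess.Read, FileShare.Read, 4096, useAsync: true))
        using (var f2 = new FileStream(path2, FileMode.Open, FileAccess.Read, FileShare.Read, 4096, useAsync: true))
        {
            if (f1.Length != f2.Length)
                return false;

            byte[] curA = new byte[chunkSize], curB = new byte[chunkSize];
            byte[] nextA = new byte[chunkSize], nextB = new byte[chunkSize];

            Task<int> pending = ReadPairAsync(f1, f2, curA, curB);   // first chunk
            while (true)
            {
                int count = await pending;
                if (count == 0)
                    return true;                                     // both files exhausted, no mismatch

                pending = ReadPairAsync(f1, f2, nextA, nextB);       // read ahead while comparing

                for (int i = 0; i < count; i++)
                {
                    if (curA[i] != curB[i])
                    {
                        await pending;   // let the in-flight read finish before the streams are disposed
                        return false;
                    }
                }

                // The freshly read chunk becomes the current one.
                byte[] t = curA; curA = nextA; nextA = t;
                t = curB; curB = nextB; nextB = t;
            }
        }
    }

    // Fills both buffers with the same number of bytes and returns that count (0 at end of file).
    static async Task<int> ReadPairAsync(FileStream f1, FileStream f2, byte[] b1, byte[] b2)
    {
        int n = 0;
        while (n < b1.Length)
        {
            int r = await f1.ReadAsync(b1, n, b1.Length - n);
            if (r == 0) break;
            n += r;
        }
        for (int m = 0; m < n; )
        {
            int r = await f2.ReadAsync(b2, m, n - m);
            if (r == 0) break;   // should not happen when the lengths are equal
            m += r;
        }
        return n;
    }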

Comments

1

Ultimately a hash is going to read the file byte by byte anyway ... so if you are looking for an accurate comparison then you might as well do the comparison. Can you give some more background on what you are trying to accomplish? How big are the 'big' files? How often do you have to compare them?

Comments

1

If you have a large set of files and you are trying to identify duplicates, I would try to break down the work by order of expense. I might try something like the following:

1) Group files by size. Files with different sizes clearly can't be identical. This information is very inexpensive to retrieve. If each group contains only one file, you are done (no dupes); otherwise proceed to step 2.

2) Within each size group, generate a hash of the first n bytes of the file. Identify a reasonable n that is likely to detect differences. Many files have identical headers, so you want to make sure n is greater than that header length. Group by the hashes; if each group contains one file, you are done (no dupes within this group), otherwise proceed to step 3.

3) At this point you are likely going to have to do more expensive work, such as generating a hash of the whole file or doing a byte-by-byte comparison. Depending on the number of files and the nature of the file contents, you might try different approaches. Hopefully, the previous groupings will have narrowed down the likely duplicates so that the number of files you actually have to fully scan is very small.
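As a sketch of steps 1 and 2, assuming `files` is a collection of paths, with MD5 over the first `headerLength` bytes standing in for whichever hash you prefer; the names are illustrative, and every group this yields still needs the step-3 confirmation. Requires System.Collections.Generic, System.IO, System.Linq and System.Security.Cryptography.

    // Yields groups of paths that share both a size and a hash of their first bytes.
    static IEnumerable<List<string>> CandidateDuplicateGroups(IEnumerable<string> files, int headerLength)
    {
        // Step 1: group by size; a size that occurs only once cannot have duplicates.
        foreach (var sizeGroup in files.GroupBy(path => new FileInfo(path).Length))
        {
            if (sizeGroup.Count() < 2)
                continue;

            // Step 2: within a size group, group by a hash of the first n bytes.
            foreach (var headerGroup in sizeGroup.GroupBy(path => HeaderHash(path, headerLength)))
            {
                if (headerGroup.Count() < 2)
                    continue;

                // Step 3 (not shown): confirm with a full hash or a byte-by-byte comparison.
                yield return headerGroup.ToList();
            }
        }
    }

    static string HeaderHash(string path, int headerLength)
    {
        var buffer = new byte[headerLength];
        int filled = 0, read;
        using (FileStream stream = File.OpenRead(path))
        {
            while (filled < buffer.Length &&
                   (read = stream.Read(buffer, filled, buffer.Length - filled)) > 0)
            {
                filled += read;
            }
        }
        using (var md5 = MD5.Create())
        {
            return Convert.ToBase64String(md5.ComputeHash(buffer, 0, filled));
        }
    }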

Comments

0

To calculate a hash, the entire file needs to be read.

How about opening both files together, and comparing them chunk by chunk?

Pseudo code:

    open file A
    open file B
    while file A has more data
    {
        if next chunk of A != next chunk of B
            return false
    }
    return true

This way you are not loading too much into memory at once, and you stop reading as soon as you find a mismatch. You should set up a test that varies the chunk size to determine the optimal size for performance.
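For example, a C# version of that loop with the chunk size exposed as a parameter so it can be varied in such a test; the FillChunk helper (which keeps the chunks aligned even if a read returns fewer bytes than requested) is my own addition.

    static bool ChunkCompare(string pathA, string pathB, int chunkSize)
    {
        using (FileStream a = File.OpenRead(pathA))
        using (FileStream b = File.OpenRead(pathB))
        {
            if (a.Length != b.Length)
                return false;

            var bufA = new byte[chunkSize];
            var bufB = new byte[chunkSize];

            while (true)
            {
                int readA = FillChunk(a, bufA);
                int readB = FillChunk(b, bufB);
                if (readA != readB)
                    return false;        // should not happen when the lengths match
                if (readA == 0)
                    return true;         // both files exhausted, every chunk matched

                for (int i = 0; i < readA; i++)
                    if (bufA[i] != bufB[i])
                        return false;
            }
        }
    }

    // Reads until the buffer is full or the stream ends.
    static int FillChunk(FileStream stream, byte[] buffer)
    {
        int total = 0, read;
        while (total < buffer.Length &&
               (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
        {
            total += read;
        }
        return total;
    }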

Comments
