What's the fastest way to tell whether two strings or binary files are different?

Question

I'm writing a unit test and need to compare a result file to a golden file. What's the easiest way to do so?

So far I have (for a Linux environment):

int result = system("diff file1 file2");

They are different if result != 0.

There are various standard diff options to suppress output. Use them if you call it through system.
– pmr, Commented Feb 27, 2013 at 17:45
The absolute fastest approach, if these are big files, may be to check that they are the same length, then mmap() them and call memcmp().
– chrisaycock, Commented Feb 27, 2013 at 17:53

Dave S · Accepted Answer · 2013-02-27 18:34:35Z

If you want a pure C++ solution, I would do something like this:

    #include <algorithm>
    #include <iterator>
    #include <string>
    #include <fstream>

    template<typename InputIterator1, typename InputIterator2>
    bool range_equal(InputIterator1 first1, InputIterator1 last1,
                     InputIterator2 first2, InputIterator2 last2)
    {
        while(first1 != last1 && first2 != last2)
        {
            if(*first1 != *first2) return false;
            ++first1;
            ++first2;
        }
        return (first1 == last1) && (first2 == last2);
    }

    bool compare_files(const std::string& filename1, const std::string& filename2)
    {
        std::ifstream file1(filename1);
        std::ifstream file2(filename2);

        std::istreambuf_iterator<char> begin1(file1);
        std::istreambuf_iterator<char> begin2(file2);

        std::istreambuf_iterator<char> end;

        return range_equal(begin1, end, begin2, end);
    }

It avoids reading the entire file into memory, and stops as soon as the files are different (or at end of file). The range_equal helper is there because std::equal (before C++14) doesn't take a pair of iterators for the second range, and isn't safe if the second range is shorter.
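For compilers that support C++14 or later, std::equal gained an overload that takes both ranges as iterator pairs, so the hand-written loop may no longer be needed. A possible variant is sketched below; the name `compare_files_cpp14`, the binary open mode, and the choice to treat unreadable files as "not equal" are assumptions added here.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical variant of compare_files; requires C++14 or later.
bool compare_files_cpp14(const std::string& filename1, const std::string& filename2) {
    // Binary mode avoids newline translation on platforms that perform it.
    std::ifstream file1(filename1, std::ios::binary);
    std::ifstream file2(filename2, std::ios::binary);
    if (!file1 || !file2) return false;  // assumption: unreadable means "not equal"

    using It = std::istreambuf_iterator<char>;
    // The four-iterator overload also returns false when the lengths differ.
    return std::equal(It(file1), It(), It(file2), It());
}
```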

Can you explain why you use an uninitialized iterator as end? OP mentions binary files; would it make sense to use std::ios::binary? P.S.: I would note this is not the fastest, as it checks one byte at a time even for big files. But as a simple solution it seems excellent.
@Antonio An uninitialized (default-constructed) std::istreambuf_iterator is the end iterator. For performance, the code assumes that your stream is doing the buffering (for example, in many implementations of std::ifstream, the underlying stream is buffered).
How about comparing MD5 hashes? Doesn't computing an MD5 also read the entire file? If so, it isn't faster than directly comparing the files chunk by chunk.

Antonio · Accepted Answer · 2018-07-09 16:32:58Z

Building on Dave S's answer, and checking the file size first:

    #include <algorithm>
    #include <fstream>
    #include <iterator>
    #include <string>

    bool compare_files(const std::string& filename1, const std::string& filename2)
    {
        std::ifstream file1(filename1, std::ifstream::ate | std::ifstream::binary); // open file at the end
        std::ifstream file2(filename2, std::ifstream::ate | std::ifstream::binary); // open file at the end
        const std::ifstream::pos_type fileSize = file1.tellg();

        if (fileSize != file2.tellg()) {
            return false; // different file size
        }

        file1.seekg(0); // rewind
        file2.seekg(0); // rewind

        std::istreambuf_iterator<char> begin1(file1);
        std::istreambuf_iterator<char> begin2(file2);

        // Second argument is the end-of-range iterator. The three-iterator
        // std::equal is safe here because both files have the same size.
        return std::equal(begin1, std::istreambuf_iterator<char>(), begin2);
    }

(I wonder if, before rewinding, fileSize could be used to create a more efficient end-of-stream iterator which, by knowing the stream length, would allow std::equal to process more bytes at a time.)
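One way to process more bytes at a time, sketched here as an alternative rather than as an answer to the iterator question, is to read both files in fixed-size blocks and compare each pair of blocks with std::memcmp. The function name `compare_files_chunked` and the 64 KiB block size are arbitrary choices for illustration.

```cpp
#include <cstring>   // std::memcmp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: compares two files block by block.
bool compare_files_chunked(const std::string& filename1, const std::string& filename2) {
    constexpr std::size_t kChunk = 64 * 1024;  // arbitrary block size
    std::ifstream f1(filename1, std::ios::binary);
    std::ifstream f2(filename2, std::ios::binary);
    if (!f1 || !f2) return false;  // assumption: unreadable means "not equal"

    std::vector<char> buf1(kChunk), buf2(kChunk);
    for (;;) {
        f1.read(buf1.data(), static_cast<std::streamsize>(kChunk));
        f2.read(buf2.data(), static_cast<std::streamsize>(kChunk));
        const std::streamsize n1 = f1.gcount();
        const std::streamsize n2 = f2.gcount();
        if (n1 != n2) return false;  // one file ended before the other
        if (std::memcmp(buf1.data(), buf2.data(), static_cast<std::size_t>(n1)) != 0) {
            return false;            // first differing block found
        }
        if (static_cast<std::size_t>(n1) < kChunk) return true;  // both hit end of file
    }
}
```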

Why not simply if (file1.tellg() != file2.tellg())? How does storing the first one in fileSize help?
@iammilind Thanks for pointing that out. There was an error in the postscript after the code; it should now be clear why I wanted to emphasize that this value is the size of the file.

gbjbaanb · Accepted Answer · 2013-02-27 17:50:21Z

One way to avoid reading both files is to pre-compute a hash of the golden file, e.g. an MD5. Then you only have to read and hash the test file. Note that this may be slower than just reading both files!
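To illustrate the shape of this approach without pulling in a crypto library, the sketch below swaps MD5 for the much simpler 64-bit FNV-1a hash. That is a substitution made here, not what the answer proposes: FNV-1a is not cryptographic, and equal hashes only make equality very likely, not certain. The names `fnv1a_file` and `matches_golden` are hypothetical, and the golden hash is assumed to have been computed once and stored alongside the test.

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// 64-bit FNV-1a over a file's bytes (a stand-in for MD5 in this sketch).
std::uint64_t fnv1a_file(const std::string& filename) {
    std::ifstream in(filename, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + filename);

    std::uint64_t h = 0xcbf29ce484222325ULL;         // FNV-1a offset basis
    char buf[4096];
    // read() fails on the final partial block, so also check gcount().
    while (in.read(buf, sizeof buf) || in.gcount() > 0) {
        for (std::streamsize i = 0; i < in.gcount(); ++i) {
            h ^= static_cast<unsigned char>(buf[i]);
            h *= 0x100000001b3ULL;                   // FNV-1a prime
        }
    }
    return h;
}

// golden_hash is assumed to be pre-computed from the golden file and stored.
bool matches_golden(const std::string& result_file, std::uint64_t golden_hash) {
    return fnv1a_file(result_file) == golden_hash;
}
```

Only the result file is read at test time; the cost is that a mismatch tells you nothing about where the files differ.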

Alternatively, layer your checks: look at the file sizes first. If they're different, then the files are different, and you can avoid a lengthy read-and-compare operation.

Ophir Carmi · Accepted Answer · 2023-11-14 13:16:09Z

This should work:

    #include <string>
    #include <fstream>
    #include <streambuf>
    #include <iterator>

    bool equal_files(const std::string& a, const std::string& b)
    {
        std::ifstream stream{a};
        std::string file1{std::istreambuf_iterator<char>(stream),
                          std::istreambuf_iterator<char>()};

        stream = std::ifstream{b};
        std::string file2{std::istreambuf_iterator<char>(stream),
                          std::istreambuf_iterator<char>()};

        return file1 == file2;
    }

I suspect this is not as fast as diff, but it avoids calling system. It should be sufficient for a test case, though.

Lufi · Accepted Answer · 2013-02-27 18:19:03Z

This might be overkill, but you could build a table of SHA-256 hashes using boost/bimap and boost/scope_exit.

Here is a video by Stephan T. Lavavej showing how to do this (starts at 8:15): http://channel9.msdn.com/Series/C9-Lectures-Stephan-T-Lavavej-Advanced-STL/C9-Lectures-Stephan-T-Lavavej-Advanced-STL-5-of-n

For more information about the algorithm: http://en.wikipedia.org/wiki/SHA-2
