18

I have a MemoryStream containing the bytes of a PNG-encoded image, and want to check if there is an exact duplicate of that image data in a directory on disk. The first obvious step is to only look for files that match the exact length, but after this I'd like to know what's the most efficient way to compare the memory against the files. I'm not very experienced working with streams.

I had a couple thoughts on the matter:

First, if I could get a hash code for the file, it would (presumably) be more efficient to compare hash codes rather than every byte of the image. Similarly, I could compare just some of the bytes of the image, giving a "close-enough" answer.

And then of course I could just compare the entire stream, but I don't know how quick that would be.

What's the best way to compare a MemoryStream to a file? Byte-by-byte in a for-loop?

4
  • "...only look for files that match the exact length..." Caution: Size of the file on disk might, probably will, be different from the size of the stream... The disk file could have an embedded thumbnail that the in memory stream does not.... Image files can be a little goofy that way :) Commented Jun 5, 2010 at 1:58
  • In my case I'm creating the image files on disk too, so that should be safe no? Commented Jun 5, 2010 at 2:02
  • Yes FileStream.Length == FileInfo.Length... but if you use Image.FromFile and save it to a MemoryStream they will not be the same length... i usually work with Image objects, hence my concern. Commented Jun 5, 2010 at 2:20
  • Interesting. Well it seems to be working so far. I will keep your concerns in mind if things start acting up. :) Thanks! Commented Jun 5, 2010 at 2:31

6 Answers

25

Another solution:

    // Requires: using System.IO; using System.Linq;
    private static bool CompareMemoryStreams(MemoryStream ms1, MemoryStream ms2)
    {
        if (ms1.Length != ms2.Length)
            return false;

        ms1.Position = 0;
        ms2.Position = 0;

        var msArray1 = ms1.ToArray();
        var msArray2 = ms2.ToArray();

        return msArray1.SequenceEqual(msArray2);
    }

5 Comments

pretty much memory critical, but perfectly suits my needs for small streams. ;)
Why do you set Position = 0? MemoryStream.ToArray() documentation says "Writes the stream contents to a byte array, regardless of the Position property."
Just habit, I suppose, from being bitten by not setting the position before operating on a stream.
This is a poor solution, quite aside from the fact that it requires allocating as much memory as the entire length of file (times two). If you have two big files and they differ at the start of the stream, this will wait until the entire contents of both files have been read to detect that and abort.
@MahmoudAl-Qudsi the request was to compare two memory streams, which by definition, are already in memory, not files.
17

Firstly, getting hashcode of the two streams won't help - to calculate hashcodes, you'd need to read the entire contents and perform some simple calculation while reading. If you compare the files byte-by-byte or using buffers, then you can stop early (as soon as you find the first two bytes or blocks that don't match).

However, hashing would make sense if you needed to compare the MemoryStream against multiple files, because then you'd need to loop through the MemoryStream just once (to calculate the hashcode) and then loop through all the files.
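A minimal sketch of that multi-file case, assuming SHA-256 as the hash and a hypothetical helper named `FindDuplicates` (neither is prescribed above). The MemoryStream is hashed once, files of a different length are skipped, and a matching hash is confirmed byte-for-byte to rule out a collision. Note that each candidate file is still read in full to hash it, so the gain is larger if file hashes can be cached between runs, which this sketch does not do.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class DuplicateFinder
{
    // Returns the paths of files in `directory` whose contents equal `image`.
    // FindDuplicates is an illustrative name, not an existing API.
    public static List<string> FindDuplicates(MemoryStream image, string directory)
    {
        byte[] imageBytes = image.ToArray();
        var matches = new List<string>();

        using SHA256 sha = SHA256.Create();
        byte[] imageHash = sha.ComputeHash(imageBytes);   // hashed only once

        foreach (string path in Directory.EnumerateFiles(directory))
        {
            // Cheap filter first: a different length means different content.
            if (new FileInfo(path).Length != imageBytes.LongLength)
                continue;

            byte[] fileHash;
            using (FileStream file = File.OpenRead(path))
                fileHash = sha.ComputeHash(file);

            if (!fileHash.SequenceEqual(imageHash))
                continue;

            // Hashes match; confirm byte-for-byte to rule out a collision.
            if (File.ReadAllBytes(path).SequenceEqual(imageBytes))
                matches.Add(path);
        }
        return matches;
    }
}
```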

In any case, you'll have to write code to read the entire file. As you mentioned, this can be done either byte-by-byte or using buffers. Reading data into a buffer is a good idea, because it may be more efficient when reading from an HDD (e.g. reading a 1 kB buffer). Moreover, you could use the asynchronous BeginRead method if you need to process multiple files in parallel.

Summary:

  • If you need to compare multiple files, use hashcode
  • To read/compare content of single file:
    • Read 1kB of data into a buffer from both streams
    • See if there is a difference (if yes, quit)
    • Continue looping

Implement the above steps asynchronously using BeginRead if you need to process multiple files in parallel.
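The single-file steps above could look roughly like this. It is a synchronous sketch, not a tuned implementation: `EqualsFile` is a hypothetical name, the 1 kB default simply mirrors the figure in the summary, and the span-based `SequenceEqual` assumes .NET Core 2.1 or later (or the System.Memory package). Only the file is read in chunks, since the MemoryStream contents are already in memory.

```csharp
using System;
using System.IO;

public static class ChunkedCompare
{
    // Compares the full contents of `memory` with the file at `path`,
    // reading the file in small chunks and stopping at the first mismatch.
    public static bool EqualsFile(MemoryStream memory, string path, int bufferSize = 1024)
    {
        // ToArray copies the whole stream, regardless of its Position.
        byte[] data = memory.ToArray();

        using FileStream file = File.OpenRead(path);
        if (file.Length != data.Length)
            return false;

        byte[] buffer = new byte[bufferSize];
        int offset = 0;
        int read;
        while ((read = file.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Guard against the file growing while it is being read.
            if (read > data.Length - offset)
                return false;

            // Compare only the bytes actually read against the matching slice.
            if (!buffer.AsSpan(0, read).SequenceEqual(data.AsSpan(offset, read)))
                return false;   // early exit on the first differing chunk

            offset += read;
        }
        return offset == data.Length;
    }
}
```

Running a few timings with different `bufferSize` values is probably the easiest way to pick one for a given disk.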

5 Comments

It's important to be aware of the (unlikely) possibility of hash collisions. Byte comparison would be necessary to avoid this issue.
So to be clear, I would read 1 kb chunks from the file into a buffer, then compare those buffers to the memstream byte by byte?
BufferedStream as a wrapper for the FileStream should take care of the buffering issue.
Concurrently reading multiple files from the same HDD isn't necessarily more efficient than one at a time, due to repositioning of the head.
@chaiguy: Yes, that should be the most efficient option, although if you use BufferedStream, reading byte-by-byte should work too. You may also run some performance tests to identify the best buffer size.
4

Firstly, getting hashcode of the two streams won't help - to calculate hashcodes, you'd need to read the entire contents and perform some simple calculation while reading.

I'm not sure whether I misunderstood this or it simply isn't true. Here's an example of hash calculation using streams:

    // Requires: using System.IO; using System.Security.Cryptography;
    private static byte[] ComputeHash(Stream data)
    {
        using HashAlgorithm algorithm = MD5.Create();
        byte[] bytes = algorithm.ComputeHash(data);
        // I'll use this trick so the caller won't end up with the stream in an unexpected position
        data.Seek(0, SeekOrigin.Begin);
        return bytes;
    }

I measured this code with BenchmarkDotNet, and it allocated 384 bytes on a 900 MB file. Needless to say, loading the whole file into memory would be far less efficient in this case.

However, this is true:

It's important to be aware of the (unlikely) possibility of hash collisions. Byte comparison would be necessary to avoid this issue.

So if the hashes match, you still have to perform an additional check to be sure that the files are 100% identical. In that case, the following is a great approach.

As you mentioned, this can be done either byte-by-byte or using buffers. Reading data into a buffer is a good idea, because it may be more efficient when reading from an HDD (e.g. reading a 1 kB buffer).

Recently I had to perform such checks, so I'll post the results of this exercise as two utility methods:

    // Requires: using System.IO; using System.Linq;
    private bool AreStreamsEqual(Stream stream, Stream other)
    {
        const int bufferSize = 2048;
        if (other.Length != stream.Length)
        {
            return false;
        }

        byte[] buffer = new byte[bufferSize];
        byte[] otherBuffer = new byte[bufferSize];
        // Caveat: whole buffers are compared and the counts returned by Read are
        // ignored, so this assumes both streams return the same number of bytes per call.
        while (stream.Read(buffer, 0, buffer.Length) > 0)
        {
            other.Read(otherBuffer, 0, otherBuffer.Length);
            if (!otherBuffer.SequenceEqual(buffer))
            {
                stream.Seek(0, SeekOrigin.Begin);
                other.Seek(0, SeekOrigin.Begin);
                return false;
            }
        }
        stream.Seek(0, SeekOrigin.Begin);
        other.Seek(0, SeekOrigin.Begin);
        return true;
    }

    private bool IsStreamEqualToByteArray(byte[] contents, Stream stream)
    {
        const int bufferSize = 2048;
        var offset = 0;
        if (contents.Length != stream.Length)
        {
            return false;
        }

        byte[] buffer = new byte[bufferSize];
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            var contentsBuffer = contents
                .Skip(offset)
                .Take(bytesRead);

            // Compare only the bytes actually read in this pass.
            if (!contentsBuffer.SequenceEqual(buffer.Take(bytesRead)))
            {
                stream.Seek(0, SeekOrigin.Begin);
                return false;
            }
            offset += bytesRead; // advance to the next slice of contents
        }
        stream.Seek(0, SeekOrigin.Begin);
        return true;
    }

3 Comments

Great example code. You could encapsulate the reading and comparing of the streams in a task to ensure the I/O access does not block the caller's thread.
You can't just ignore the value returned from Stream.Read. Because of that, if one of the streams reads fewer bytes than requested, this code will return an incorrect result.
How do you think the hash calculation would be performed without reading the entire contents? Of course it needs to read both files. It may not read them into a big memory block, but it does need to read them. So this is very inefficient, especially for files which differ early on.
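Following up on the comment about Stream.Read's return value, here is one way a short-read-safe comparison might look. This is a sketch rather than the answer author's code: `ReadFully` and `AreEqual` are hypothetical names, the 4096-byte default is arbitrary, and the span-based `SequenceEqual` assumes .NET Core 2.1 or later. Each buffer is topped up until it is full or the stream ends, so both sides are always compared over the same number of bytes, and the loop stops at the first differing block.

```csharp
using System;
using System.IO;

public static class StreamComparer
{
    // Reads until `buffer` is full or the stream ends; returns the bytes read.
    private static int ReadFully(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0)
                break;          // end of stream
            total += read;
        }
        return total;
    }

    // Compares both streams from their current positions to the end.
    public static bool AreEqual(Stream first, Stream second, int bufferSize = 4096)
    {
        byte[] a = new byte[bufferSize];
        byte[] b = new byte[bufferSize];

        while (true)
        {
            int readA = ReadFully(first, a);
            int readB = ReadFully(second, b);

            if (readA != readB)
                return false;   // one stream ended before the other
            if (readA == 0)
                return true;    // both ended with no difference found
            if (!a.AsSpan(0, readA).SequenceEqual(b.AsSpan(0, readB)))
                return false;   // early exit on the first differing block
        }
    }
}
```

Unlike the methods in the answer, this does not check Length up front or rewind the streams afterwards, so it also works on non-seekable streams; callers that need those behaviours would add them.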
2

We've open sourced a library to deal with this at NeoSmart Technologies, because we've had to compare opaque Stream objects for bytewise equality one time too many. It's available on NuGet as StreamCompare and you can read about its advantages over existing approaches in the official release announcement.

Usage is very straightforward:

    var stream1 = ...;
    var stream2 = ...;
    var scompare = new StreamCompare();
    var areEqual = await scompare.AreEqualAsync(stream1, stream2);

It's written to abstract away as many of the gotchas and performance pitfalls as possible, and contains a number of optimizations to speed up comparisons (and to minimize memory usage). There's also a file comparison wrapper FileCompare included in the package, that can be used to compare two files by path.

StreamCompare is released under the MIT license and runs on .NET Standard 1.3 and above. NuGet packages for .NET Standard 1.3, .NET Standard 2.0, .NET Core 2.2, and .NET Core 3.0 are available. Full documentation is in the README file.

1 Comment

Seems rather elaborate but would be perfect to compare file stream. For memory stream, I think this will be less effective
0

rdfind has an interesting algorithm. In addition to comparing sizes, it first looks at the first and last bytes of each file (the first bytes alone are often the same because of standardized file headers). See rdfind.
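A rough sketch of that kind of pre-check, applied to the in-memory bytes and one file. This is not rdfind's code: `MightMatch`, `FillBuffer`, and the 64-byte probe size are assumptions made for illustration. For PNG specifically, every file starts with the same 8-byte signature and ends with the same 12-byte IEND chunk, so the probe has to be larger than those to be useful. A `true` result only means "possibly equal" and still calls for a full comparison.

```csharp
using System;
using System.IO;

public static class QuickCheck
{
    // Returns false if the file certainly differs from `data`;
    // true means "possibly equal", so a full comparison is still required.
    public static bool MightMatch(byte[] data, string path, int probeSize = 64)
    {
        using FileStream file = File.OpenRead(path);
        if (file.Length != data.Length)
            return false;

        int n = Math.Min(probeSize, data.Length);
        byte[] probe = new byte[n];

        // Compare the first n bytes.
        if (!FillBuffer(file, probe) || !probe.AsSpan().SequenceEqual(data.AsSpan(0, n)))
            return false;

        // Compare the last n bytes.
        file.Seek(-n, SeekOrigin.End);
        return FillBuffer(file, probe)
            && probe.AsSpan().SequenceEqual(data.AsSpan(data.Length - n, n));
    }

    // Fills the whole buffer, or returns false if the stream ends first.
    private static bool FillBuffer(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0)
                return false;
            total += read;
        }
        return true;
    }
}
```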


-7

Using a Stream we don't get the result: each and every file has a unique identity, such as the last modified date and so on, so each and every file is different. This information is included in the stream.

1 Comment

If you read a file with a stream, you only read its content, not additionally metadata stored by the filesystem. Also this question is especially about comparing the content of files.
