
I have 2 lists. One is a list of MD5 and SHA1 hashes from files on a computer I have (ListA). The other is a list of MD5 and SHA1 hashes I have downloaded from NSRL (ListB). It's a compilation of MD5 and SHA1 hashes from files included in many different applications.

I am trying to find a quick way to compare these lists to each other.

Just for reference on performance: the hash list from the system is a 7.2 GB text file, and the NSRL hash list is approximately 20 GB. I have a system with 32 GB of RAM to perform the processing, so it should have enough memory to load both files into memory if need be.

I've looked into Except, and also considered reading each line from ListA and comparing it to ListB, but there has to be a better way than this. Any ideas?

Also, this is a comparison of the hashes from a machine against known hashes from a hash database. It's pretty common practice in forensics (from what I understand), so I'm open to suggestions of applications that already exist to do this.

  • I'm curious to learn what this will be used for Commented May 20, 2014 at 22:12
  • You might consider sorting your lists first, then comparing them using an approach similar to the merge step of merge sort. Commented May 20, 2014 at 22:18
  • @Phillippe Leybaert, This is a common forensics practice when looking for files on a system: get an MD5/SHA1 hash of every file, compare it to a list of known good files (or sometimes to a list of known bad files), and see what stands out. Commented May 20, 2014 at 22:20
  • @Sugitime I understand it's a common forensics practice, but I'm still curious what your use case is. Commented May 20, 2014 at 22:31
  • @Phillippe Leybaert, It's some class work. Take the image of a system, get the hashes (I used ftimes), compare the hashes to the NSRL db. Is that what you mean? Commented May 20, 2014 at 22:35
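The merge-style comparison suggested in the comments can be sketched like this (a minimal sketch, not a full tool: the class and method names are illustrative, and both files are assumed to already be sorted, e.g. by an external sort):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class SortedMergeCompare
{
    // Streams two pre-sorted files and yields the lines they have in common.
    // Memory use is constant regardless of file size, which suits multi-GB lists.
    public static IEnumerable<string> CommonLines(string pathA, string pathB)
    {
        using (var a = new StreamReader(pathA))
        using (var b = new StreamReader(pathB))
        {
            string la = a.ReadLine(), lb = b.ReadLine();
            while (la != null && lb != null)
            {
                int cmp = string.CompareOrdinal(la, lb);
                if (cmp == 0)
                {
                    yield return la;          // hash present in both lists
                    la = a.ReadLine();
                    lb = b.ReadLine();
                }
                else if (cmp < 0)
                {
                    la = a.ReadLine();        // advance whichever side is smaller
                }
                else
                {
                    lb = b.ReadLine();
                }
            }
        }
    }
}
```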

3 Answers


Using a hash table would be fastest, but you don't have enough memory to load all of those items into one. Assuming the MD5 and SHA-1 entries are divided equally, you would have approximately 500 million entries in ListA and 1 billion in ListB; at 8 bytes per pointer, that's at least 8 billion bytes for the pointers alone.

Instead, you should use a radix trie to store just ListB, then perform the comparison as you're reading ListA. Lookups aren't as fast as a hash table's, but it's a good space-time trade-off.
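A minimal sketch of the idea, assuming the hashes are hex strings. (This is a plain fixed-fanout trie over hex digits rather than a true path-compressed radix trie; the class and member names are illustrative, not from the answer.)

```csharp
using System;

class HexTrie
{
    private sealed class Node
    {
        public Node[] Children = new Node[16]; // one slot per hex digit
        public bool IsTerminal;                // true if a stored hash ends here
    }

    private readonly Node root = new Node();

    private static int HexValue(char c) =>
        c <= '9' ? c - '0' : char.ToLowerInvariant(c) - 'a' + 10;

    // Insert one hex-encoded hash (e.g. a line from ListB).
    public void Add(string hash)
    {
        var node = root;
        foreach (var c in hash)
        {
            int i = HexValue(c);
            node = node.Children[i] ?? (node.Children[i] = new Node());
        }
        node.IsTerminal = true;
    }

    // Test one hash from ListA against the stored set.
    public bool Contains(string hash)
    {
        var node = root;
        foreach (var c in hash)
        {
            node = node.Children[HexValue(c)];
            if (node == null) return false;
        }
        return node.IsTerminal;
    }
}
```

You would load ListB into the trie once, then stream ListA line by line (e.g. with File.ReadLines) and call Contains on each hash.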




Use HashSets. First, load all the items from both lists into HashSets. Then use IntersectWith, which runs in O(n).

The bottleneck in your case will most likely be reading the data from disk. For performance, I suggest reading the text files into memory first and then parsing them.
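A minimal sketch of that approach (file names and the method name are placeholders; one hex hash per line is assumed). Note that IntersectWith accepts any IEnumerable&lt;string&gt;, so only ListA actually has to be held in memory; the larger NSRL list can be streamed with File.ReadLines:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class HashSetCompare
{
    // Returns the hashes from ListA that also appear in ListB.
    public static HashSet<string> KnownHashes(string pathA, string pathB)
    {
        // Load the smaller list (the machine's hashes) into a set,
        // comparing case-insensitively since hex case may differ between tools.
        var setA = new HashSet<string>(File.ReadLines(pathA),
                                       StringComparer.OrdinalIgnoreCase);

        // Stream the big NSRL list through IntersectWith without loading it.
        setA.IntersectWith(File.ReadLines(pathB));
        return setA;
    }
}
```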


  1. Create a class that can hold the data of a single hash item.
  2. Make sure it implements GetHashCode and Equals correctly.
  3. Create two HashSets of the type you created, one for ListA and one for ListB.
  4. Load all items from the lists into the hashsets.
  5. Use SymmetricExceptWith (which takes O(n)) to get all the hashes that aren't in both lists.

    var setA = new HashSet<Item>(LoadListA());
    var setB = new HashSet<Item>(LoadListB());
    setA.SymmetricExceptWith(setB);
    if (setA.Count > 0)
    {
        Console.WriteLine("Extra items in A or B");
    }
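One possible shape for the Item class from steps 1–2 (a sketch, not the answer's implementation; the property name is illustrative). Equality is defined by the hash string itself, so Equals and GetHashCode stay consistent with each other, which HashSet requires:

```csharp
using System;

public sealed class Item : IEquatable<Item>
{
    public string Hash { get; }

    public Item(string hash)
    {
        Hash = hash ?? throw new ArgumentNullException(nameof(hash));
    }

    // Two items are equal when their hash strings match,
    // ignoring hex case differences between tools.
    public bool Equals(Item other) =>
        other != null &&
        string.Equals(Hash, other.Hash, StringComparison.OrdinalIgnoreCase);

    public override bool Equals(object obj) => Equals(obj as Item);

    // Must agree with Equals: equal items return equal hash codes.
    public override int GetHashCode() =>
        StringComparer.OrdinalIgnoreCase.GetHashCode(Hash);
}
```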

