
OK, I am designing a piece of software that will keep one system synced with another. The problem is that the originating system is a legacy DB2 nightmare to which I have only read-only access, and its tables have no timestamping capability whatsoever, so there is no way to detect which rows were changed.

My idea is to load all the rows (about 60,000 in total, synced every half hour), calculate their hashes, and keep <ID, hash> tuples in my integration database. Change detection then becomes a job of comparing hashes and updating records in the destination system wherever hashes mismatch or tuples are missing altogether (a sketch of this comparison follows below). I forgot to mention that reading the source is cheap while updating the destination is expensive: it's a web service with a lot of background processing, so I want to avoid updating everything every time.
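
A minimal sketch of that comparison step in C#, assuming two hypothetical in-memory maps: storedHashes, the <ID, hash> tuples kept in the integration database, and currentHashes, the hashes just computed from a fresh DB2 read. Only the IDs it returns would need to touch the expensive destination web service:

using System.Collections.Generic;
using System.Linq;

static (List<long> Upserts, List<long> Deletes) Diff(
    IReadOnlyDictionary<long, ulong> storedHashes,   // <ID, hash> from the integration DB
    IReadOnlyDictionary<long, ulong> currentHashes)  // <ID, hash> from the fresh DB2 read
{
    // Rows that are new, or whose hash no longer matches the stored one.
    var upserts = currentHashes
        .Where(kv => !storedHashes.TryGetValue(kv.Key, out var old) || old != kv.Value)
        .Select(kv => kv.Key)
        .ToList();

    // Rows that have vanished from the source since the last sync.
    var deletes = storedHashes.Keys.Except(currentHashes.Keys).ToList();

    return (upserts, deletes);
}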

Now, my problem: C#'s built-in GetHashCode is documented as unsuitable for this purpose (an equal hash does not imply an equal object), and crypto hashes seem like big overkill at 256+ bits. I don't think more than 64 bits is needed; by the birthday bound (60,000^2 / 2 / 2^64), that would give me roughly a 1 in 10^10 chance of collision given a perfectly distributed hash, and would allow fast hash comparison on the x64 arch.
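
A quick sanity check on that 1 in 10^10 figure, using the birthday bound (the numbers come from the question; the snippet is just illustrative arithmetic):

using System;

double n = 60000.0;                           // rows per sync
double p = n * (n - 1) / 2 / Math.Pow(2, 64); // expected chance of any collision
Console.WriteLine(p);                         // ~9.8e-11, i.e. about 1 in 10^10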

So what should I use to generate unique hashes?

  • You can use another hash function, like MD5 (128 bits), CRC32, or CRC64... You may also use a crypto hash that generates 256 bits and keep only the first 64. Commented Dec 2, 2015 at 10:43
  • Do you need an off-the-shelf solution, as opposed to coding something yourself? Commented Dec 2, 2015 at 10:49
  • I can code it if there is no off-the-shelf solution. Speaking of crypto, I am not that versed in its math: if I do SHA-256 and take the lower 64 bits, is that sufficiently uniform for my purpose? (A truncation sketch follows these comments.) Commented Dec 2, 2015 at 11:33
  • Some alternatives to crypto hashing might be error detection and correction or compression algorithms, for example Hamming Codes or Lempel–Ziv–Welch. Hamming Codes are relatively expensive to calculate but good for fixed-length data and a fixed-length 'hash'. LZW is cheaper to calculate, but you won't be able to predict the size of the 'hash' you get at the end. Both would allow you to detect changes with high (but not complete) confidence. Commented Dec 2, 2015 at 11:47
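
On the SHA-256 question in the comments above: every bit of a cryptographic digest is designed to be uniformly distributed, so keeping any 64 bits of a SHA-256 hash is as uniform as the full 256. A minimal sketch (the Hash64 name and the byte[] input are hypothetical):

using System;
using System.Security.Cryptography;

static ulong Hash64(byte[] rowBytes)
{
    using (var sha = SHA256.Create())
    {
        byte[] digest = sha.ComputeHash(rowBytes); // 32 bytes
        return BitConverter.ToUInt64(digest, 0);   // keep the first 8 bytes (64 bits)
    }
}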

2 Answers


Another option: calculate the hash in C# using a function like this:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// The field must be static, because CalculateSignature is static;
// the original instance field would not compile here.
private static readonly System.Security.Cryptography.HashAlgorithm hash =
    System.Security.Cryptography.SHA1.Create();

public static string CalculateSignature(IEnumerable<object> values)
{
    var sb = new StringBuilder();
    foreach (var value in values)
    {
        // Mark nulls explicitly and separate fields with a NUL character,
        // so that ("ab", "c") and ("a", "bc") produce different signatures.
        string valueToHash = value == null
            ? ">>null<<"
            : Convert.ToString(value, CultureInfo.InvariantCulture);
        sb.Append(valueToHash).Append(char.ConvertFromUtf32(0));
    }
    var bytesToHash = Encoding.UTF8.GetBytes(sb.ToString());
    var hashedBytes = hash.ComputeHash(bytesToHash);
    // Decoding raw hash bytes with Encoding.UTF8.GetString is lossy;
    // Base64 preserves the full digest.
    return Convert.ToBase64String(hashedBytes);
}
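
Usage would look something like this (the row object and its fields are hypothetical; pass the columns that matter for change detection, always in the same order):

var signature = CalculateSignature(new object[] { row.Id, row.Field1, row.Field2 });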

Edit: Hashing profiling test

To show how fast SHA1 hashing is, here's a quick test. On my dev machine, I get 60,000 hashes in 176ms. MD5 takes 161ms.

var hash = System.Security.Cryptography.MD5.Create();
var stringToHash = "3490518cvm90wg89puse5gu3tgu3v0afgmvkldfjgmvvvvvsh,9semc9petgucm9234ucv0[vhd,flhgvzemgu904vq2m0";
var sw = System.Diagnostics.Stopwatch.StartNew();
for (var i = 0; i < 60000; i++)
{
    var bytesToHash = Encoding.UTF8.GetBytes(stringToHash);
    var hashedBytes = hash.ComputeHash(bytesToHash);
    var signature = Convert.ToBase64String(hashedBytes); // Base64 rather than a lossy UTF8.GetString
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);

4 Comments

This is using a crypto hash, and SHA1 at that. I want to avoid crypto hashing, and even if I were to use it I would go with binary-serialized objects rather than string manipulation.
Is it because of the CPU expense? Calculating 60,000 SHA1 hashes will be cheap enough for a half-hourly import. Alternatively, replace with MD5 in the same namespace, which is 128-bit. But until you've profiled and can show that the hashing algorithm is too expensive, don't waste your time worrying!
@mmix -- just added a quick bit of code to show the speed of SHA1. On my machine, 60,000 SHA1 hashes took 176ms.
Cheers. Hope it helps.

In your staging SQL tables, add a 'checksum' column and populate it using SQL Server's CHECKSUM function.

Something like this:

update mysourcetable set [checksum] = checksum(id, field1, field2, field3, field4 ...)

Clarification

You mentioned having an integration database; my thought was that you would read the data from DB2 into an interim database, like SQL Server, where you're already storing the ID/hash pairs. If you copied all the data out of DB2, not just the IDs, then you could calculate the checksum in the integration database (a sketch follows below).
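
A minimal sketch of that idea in C#, assuming a hypothetical staging table dbo.StagingRows that has already been bulk-loaded from DB2 and has a RowChecksum column:

using System.Data.SqlClient;

static void RecalculateChecksums(string connectionString)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        // CHECKSUM runs inside SQL Server, over the copied columns.
        var sql = @"UPDATE dbo.StagingRows
                    SET RowChecksum = CHECKSUM(Id, Field1, Field2, Field3, Field4)";
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.ExecuteNonQuery();
        }
    }
}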

1 Comment

Not an option: I don't have write or schema-change access on the source database, and I need to calculate hashes on the source data.
