
OK, I am designing a piece of software that will keep one system synced with another. The problem is that the originating system is a legacy DB2 nightmare to which I have only read-only access, and its tables have no timestamping capability whatsoever, so there is no way to detect which rows were changed.

My idea is to load all the rows (about 60,000 in total, synced every half hour), calculate their hashes, and keep <ID, hash> tuples in my integration database. Change detection then becomes a job of comparing hashes and updating records in the destination system wherever hashes mismatch or tuples are missing altogether (a sketch of this comparison follows below). I forgot to mention that reading the source is cheap while updating the destination is expensive: it's a web service with a lot of background processing, so I want to avoid updating everything every time.
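
A minimal sketch of that comparison step in C#, assuming two hypothetical in-memory maps: storedHashes, the <ID, hash> tuples kept in the integration database, and currentHashes, the hashes just computed from a fresh DB2 read. Only the IDs it returns would need to touch the expensive destination web service:

using System.Collections.Generic;
using System.Linq;

static (List<long> Upserts, List<long> Deletes) Diff(
    IReadOnlyDictionary<long, ulong> storedHashes,   // <ID, hash> from the integration DB
    IReadOnlyDictionary<long, ulong> currentHashes)  // <ID, hash> from the fresh DB2 read
{
    // Rows that are new, or whose hash no longer matches the stored one.
    var upserts = currentHashes
        .Where(kv => !storedHashes.TryGetValue(kv.Key, out var old) || old != kv.Value)
        .Select(kv => kv.Key)
        .ToList();

    // Rows that have vanished from the source since the last sync.
    var deletes = storedHashes.Keys.Except(currentHashes.Keys).ToList();

    return (upserts, deletes);
}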

Now, my problem: C#'s built-in GetHashCode is documented as unsuitable for this purpose (an equal hash does not imply an equal object), and crypto hashes seem like big overkill at 256+ bits. I don't think more than 64 bits is needed; by the birthday bound (60,000^2 / 2 / 2^64), that would give me roughly a 1 in 10^10 chance of collision given a perfectly distributed hash, and would allow fast hash comparison on the x64 arch.
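
A quick sanity check on that 1 in 10^10 figure, using the birthday bound (the numbers come from the question; the snippet is just illustrative arithmetic):

using System;

double n = 60000.0;                           // rows per sync
double p = n * (n - 1) / 2 / Math.Pow(2, 64); // expected chance of any collision
Console.WriteLine(p);                         // ~9.8e-11, i.e. about 1 in 10^10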

So what should I use to generate unique hashes?

  • You can use another hash function, like MD5 (128 bits), CRC32, or CRC64... You may also use a crypto hash that generates 256 bits and keep only the first 64. Commented Dec 2, 2015 at 10:43
  • Do you need an off-the-shelf solution, as opposed to coding something yourself? Commented Dec 2, 2015 at 10:49
  • I can code it if there is no off-the-shelf solution. Speaking of crypto, I am not that versed in its math: if I do SHA-256 and take the lower 64 bits, is that sufficiently uniform for my purpose? (A truncation sketch follows these comments.) Commented Dec 2, 2015 at 11:33
  • Some alternatives to crypto hashing might be error detection and correction or compression algorithms, for example Hamming Codes or Lempel–Ziv–Welch. Hamming Codes are relatively expensive to calculate but good for fixed-length data and a fixed-length 'hash'. LZW is cheaper to calculate, but you won't be able to predict the size of the 'hash' you get at the end. Both would allow you to detect changes with high (but not complete) confidence. Commented Dec 2, 2015 at 11:47
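
On the SHA-256 question in the comments above: every bit of a cryptographic digest is designed to be uniformly distributed, so keeping any 64 bits of a SHA-256 hash is as uniform as the full 256. A minimal sketch (the Hash64 name and the byte[] input are hypothetical):

using System;
using System.Security.Cryptography;

static ulong Hash64(byte[] rowBytes)
{
    using (var sha = SHA256.Create())
    {
        byte[] digest = sha.ComputeHash(rowBytes); // 32 bytes
        return BitConverter.ToUInt64(digest, 0);   // keep the first 8 bytes (64 bits)
    }
}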

2 Answers


Another option: calculate the hash in C# using a function like this:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// The field must be static, because CalculateSignature is static;
// the original instance field would not compile here.
private static readonly System.Security.Cryptography.HashAlgorithm hash =
    System.Security.Cryptography.SHA1.Create();

public static string CalculateSignature(IEnumerable<object> values)
{
    var sb = new StringBuilder();
    foreach (var value in values)
    {
        // Mark nulls explicitly and separate fields with a NUL character,
        // so that ("ab", "c") and ("a", "bc") produce different signatures.
        string valueToHash = value == null
            ? ">>null<<"
            : Convert.ToString(value, CultureInfo.InvariantCulture);
        sb.Append(valueToHash).Append(char.ConvertFromUtf32(0));
    }
    var bytesToHash = Encoding.UTF8.GetBytes(sb.ToString());
    var hashedBytes = hash.ComputeHash(bytesToHash);
    // Decoding raw hash bytes with Encoding.UTF8.GetString is lossy;
    // Base64 preserves the full digest.
    return Convert.ToBase64String(hashedBytes);
}
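
Usage would look something like this (the row object and its fields are hypothetical; pass the columns that matter for change detection, always in the same order):

var signature = CalculateSignature(new object[] { row.Id, row.Field1, row.Field2 });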

Edit: Hashing profiling test

To show how fast SHA1 hashing is, here's a quick test. On my dev machine, I get 60,000 hashes in 176ms. MD5 takes 161ms.

var hash = System.Security.Cryptography.MD5.Create();
var stringToHash = "3490518cvm90wg89puse5gu3tgu3v0afgmvkldfjgmvvvvvsh,9semc9petgucm9234ucv0[vhd,flhgvzemgu904vq2m0";
var sw = System.Diagnostics.Stopwatch.StartNew();
for (var i = 0; i < 60000; i++)
{
    var bytesToHash = Encoding.UTF8.GetBytes(stringToHash);
    var hashedBytes = hash.ComputeHash(bytesToHash);
    var signature = Convert.ToBase64String(hashedBytes); // Base64 rather than a lossy UTF8.GetString
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);

4 Comments

This is using a crypto hash, and SHA1 at that. I want to avoid crypto hashing, and even if I were to use it I would go with binary-serialized objects rather than string manipulation.
Is it because of the CPU expense? Calculating 60,000 SHA1 hashes will be cheap enough for a half-hourly import. Alternatively, replace with MD5 in the same namespace, which is 128-bit. But until you've profiled and can show that the hashing algorithm is too expensive, don't waste your time worrying!
@mmix -- just added a quick bit of code to show the speed of SHA1. On my machine, 60,000 SHA1 hashes took 176ms.
Cheers. Hope it helps.

In your staging SQL tables, add a 'checksum' column and populate it using SQL Server's CHECKSUM function.

Something like this:

update mysourcetable set [checksum] = checksum(id, field1, field2, field3, field4 ...)

Clarification

You mentioned having an integration database; my thought was that you would read the data from DB2 into an interim database, like SQL Server, where you're already storing the ID/hash pairs. If you copied all the data out of DB2, not just the IDs, then you could calculate the checksum in the integration database (a sketch follows below).
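
A minimal sketch of that idea in C#, assuming a hypothetical staging table dbo.StagingRows that has already been bulk-loaded from DB2 and has a RowChecksum column:

using System.Data.SqlClient;

static void RecalculateChecksums(string connectionString)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        // CHECKSUM runs inside SQL Server, over the copied columns.
        var sql = @"UPDATE dbo.StagingRows
                    SET RowChecksum = CHECKSUM(Id, Field1, Field2, Field3, Field4)";
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.ExecuteNonQuery();
        }
    }
}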

1 Comment

Not an option: I don't have write or schema-change access on the source database, and I need to calculate hashes on the source data.
