Is finding collisions in a part-hash not often enough a bad problem?

Question

My situation: I've been working now for a couple of months on my own unique hash function, I've changed it many times and had two main versions but I won't bore anyone with the details of my work; at least not here. I'm here to ask a question that's specific and helpful to anyone attempting to design a cryptographic hash. (I hope)

Now I'm more than aware that simply being a normally distributed random function does not make a cryptographically secure hash.

MY QUESTION however IS: If collisions are found LESS often than would be predicted by a random function, is this an alarm-bell that we may have some empirical weakness?

Allow me to elaborate:

In my tests, I found that any given set of 32 bits (from a set position in the hash) from an array of unrelated input (or related input but different output) hash functions was indeed seemingly random, until I noticed that in order to find a collision one would have to go through about 89000 different attempts on average (although obviously the numbers swayed massively). Since the birthday bound is only 65536 (or ~77000), something seems amiss. My testing involved batteries of different PRNGs to create the inputs, as well as non-random methodical ones, and I ran these tests hundreds of millions of times, meaning I've made literally billions of hashes, so I don't think it's my testing that is flawed or that my sample sizes are too small.

I can say with almost 100% confidence, that it is taking more attempts to find a collision than would be expected of a truly random (ideal) output.

Side note: I tried with 16 bit, 40 bit samples, etc and always came up with the same results: collisions were LESS frequent.

2nd Side note: When describing the hash as a PRNG or indeed a PRNG using the hash, it passes birthday-spacing tests on the dieharder suite just fine (I find this bit very confusing).

If anyone can explain this strange phenomenon (passing on dieharder but failing on my own tests) then extra points, but what I want to know is:

So given we do have a "desired probability of random collision" in a given hash (and logically speaking any smaller word within that hash), what happens when in a given word (and presumably the hash in general) collisions are found less often than expected? What info do we gain about the likelihood of pre-image attacks? Non-ideal distribution, or any other info in fact?

I know this is probably a hard question to answer but maybe not.

I can't get how " ~7700 " comes into the question, unless the 32-bit output is not binary. Independently: beware that the average number of random 32-bit samples to get a collision is not $2^{16}$; see this
– fgrieu ♦, Commented Jan 11, 2016 at 17:57
Apologies, I'll amend the question, I meant to say 77,000, because it's the figure published on the wiki page on the birthday attack. Pure maths is not my forte and maybe this is where I'm going wrong... I will also check I was averaging about 89000, since as I explained I did do a LOT of tests.
– Iam Nick, Commented Jan 11, 2016 at 18:34
@fgrieu you mind explaining the crucial difference between where 77,000 and 82,137 come from? If I understand right: 77,000 is using an approximation of the p50, while 82,137 on the other hand is the EXACT number of AVERAGE attempts before we find a collision (which is a different thing). Have I got the right idea, there?
– Iam Nick, Commented Jan 11, 2016 at 18:48
It's not possible to get collisions "less often than expected" because I'm pretty sure this bound is the tightest possible over all probability distributions (being lowest for a uniform distribution). Either you are miscalculating the bound or generating your values wrong; your results cannot be correct.
– Thomas, Commented Jan 11, 2016 at 19:26

Mike Edward Moras · Accepted Answer · 2016-04-27 17:36:04Z

There are different birthday bounds when we draw independent uniform random integers less than $d$ (for some large $d$, including $d=2^{32}$ of the question) and watch for collision(s):

In crypto, we often consider the bound of $\sqrt d$ ($65536$ for $d=2^{32}$) draws, at which there is a fair probability of collision: $p\approx1-1/\sqrt e\approx39.3\%$. This is also about the number of draws for which it is most likely that the first collision will occur.
The number of draws starting at which it becomes probable ($p\ge1/2$) that there is at least one collision; that is about $\sqrt{\log4}\sqrt d$ ($\approx1.17741\sqrt d$, $\approx77163$ for $d=2^{32}$); more precise formulas are here.
The average (equivalently: expected) number of draws at which the first collision appears; that is about $\sqrt{\pi/2}\sqrt d$ ($\approx1.25331\sqrt d$ , $\approx82137$ for $d=2^{32}$); more precise formulas are here.

The difference between the three is because mode, median and mean of a discrete probability distribution are not the same thing (sort of: we rarely have a collision much before $77163$, and often have the first collision way after that, hence the significantly larger mean). The confusion is probably what causes the discrepancy in the question.

Image credit: Cmglee; source; Creative Commons license.
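As a rough check of the three figures for $d=2^{32}$, here is a minimal Python sketch. The function names `birthday_figures` and `collision_probability` are made up for illustration; the first uses the approximations quoted above, the second evaluates the exact product $1-\prod_{i=0}^{n-1}(1-i/d)$ for uniform independent draws.

```python
import math

def birthday_figures(d):
    """Approximate (mode, median, mean) of the number of draws
    until the first collision among d equally likely values."""
    root = math.sqrt(d)
    mode = root                               # about sqrt(d)
    median = math.sqrt(math.log(4)) * root    # about 1.17741 * sqrt(d)
    mean = math.sqrt(math.pi / 2) * root      # about 1.25331 * sqrt(d)
    return mode, median, mean

def collision_probability(n, d):
    """Exact probability of at least one collision among n
    independent uniform draws from d values."""
    # log P(no collision) = sum over i of log(1 - i/d), for i = 0 .. n-1
    log_no_collision = sum(math.log1p(-i / d) for i in range(n))
    return 1.0 - math.exp(log_no_collision)

d = 2 ** 32
mode, median, mean = birthday_figures(d)
print(round(mode), round(median), round(mean))
print(collision_probability(65536, d))   # close to 1 - 1/sqrt(e)
print(collision_probability(77163, d))   # close to 1/2
```

An observed average near 82137 draws for 32-bit words would therefore be consistent with ideal behaviour, even though it is well above 65536 and 77163.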

If for some would-be hash function we consistently observe collisions much less frequently than predicted by these bounds, for would-be random input (restricted to a domain still much larger than the output domain is), then

our would-be random input (or its domain) is biased
and our hash function is not secure in the random oracle model, in a manner that is related to the bias in our random input.
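One way to compare a measurement against these figures is sketched below, under stated assumptions: SHA-256 stands in for the hash under test (it is not the asker's function), the inputs are distinct counter strings, and a 16-bit word is used so the run stays short. The helper name `draws_until_collision` is hypothetical. The count includes the draw that produces the collision.

```python
import hashlib
import math
import statistics

def draws_until_collision(trial, bits):
    """Hash distinct inputs, keep a `bits`-bit word from a fixed position,
    and return how many draws it took until a word repeated."""
    mask = (1 << bits) - 1
    seen = set()
    counter = 0
    while True:
        # Distinct input per draw; the trial number separates the runs.
        message = f"{trial}:{counter}".encode()
        digest = hashlib.sha256(message).digest()  # stand-in hash (assumption)
        word = int.from_bytes(digest[:4], "big") & mask
        counter += 1
        if word in seen:
            return counter
        seen.add(word)

bits = 16
trials = 2000
samples = [draws_until_collision(t, bits) for t in range(trials)]

observed_mean = statistics.mean(samples)
observed_median = statistics.median(samples)
expected_mean = math.sqrt(math.pi / 2 * 2 ** bits)   # about 1.25331 * sqrt(d)

print("observed mean:  ", observed_mean)
print("observed median:", observed_median)
print("expected mean:  ", expected_mean)
```

The individual counts vary widely (the standard deviation is roughly half the mean), so the comparison should be made on the average over many trials, and against the mean figure rather than $\sqrt d$ or the median.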

Thank you, I know I said I was going to step away from this question for 24 hours but I really appreciate that input. :) That clears up a lot.
– Iam Nick, Commented Jan 11, 2016 at 22:33
