17

Recently, I was curious how hash algorithms for floating-point values work, so I looked at the source code for boost::hash_value. It turns out to be fairly complicated: the actual implementation loops over each digit in the radix and accumulates a hash value. Compared to the integer hash functions, it's much more involved.

My question is: why should a floating-point hash algorithm be any more complicated? Why not just hash the binary representation of the floating point value as if it was an integer?

Like:

std::size_t hash_value(float f) { return hash_value(*(reinterpret_cast<int*>(&f))); } 

I realize that float is not guaranteed to be the same size as int on all systems, but that sort of thing could be handled with a few template meta-programs to deduce an integral type that is the same size as float. So what is the advantage of introducing an entirely different hash function that specifically operates on floating point types?
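For what it's worth, here is a minimal sketch of the bit-pattern approach I have in mind (the name hash_value_bits is mine, and it assumes a 32-bit float; it uses memcpy rather than reinterpret_cast so the type pun is well-defined):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>

// Sketch: hash a float by hashing its raw bit pattern.
// Assumes sizeof(float) == 4; memcpy avoids the strict-aliasing
// problem of reinterpret_cast<int*>(&f).
std::size_t hash_value_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);      // well-defined type pun
    return std::hash<std::uint32_t>{}(bits);  // hash the integer bits
}
```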

2
  • 1
    As written you violate the strict-aliasing rules (and worse if int isn't the same size as float), but I'd be curious for an answer involving converting it to a char* of given length. Commented Sep 13, 2011 at 14:09
  • possible duplicate of Hash function for floats Commented Mar 27, 2014 at 10:42

4 Answers

6

Take a look at https://svn.boost.org/trac/boost/ticket/4038

In essence it boils down to two things:

  • Portability: when you take the binary representation of a float, then on some platform it could be possible that a float with the same value has multiple representations in binary. I don't know whether a platform with such an issue actually exists, but with the complication of denormalized numbers, I'm not sure it can't happen.

  • Size: the second issue is what you proposed; it might be that sizeof(float) does not equal sizeof(int).

I did not find anyone claiming that the boost hash actually produces fewer collisions. Although I assume that separating the mantissa from the exponent might help, the above link does not suggest that this was the driving design decision.
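As a sketch of how the size issue could be handled at compile time (the uint_of_size template below is hypothetical, not something from boost):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: pick an unsigned integer type with the same
// size as the floating-point type, at compile time. Only the widths
// that occur on common platforms are specialized here.
template <std::size_t N> struct uint_of_size;  // primary template
template <> struct uint_of_size<4> { using type = std::uint32_t; };
template <> struct uint_of_size<8> { using type = std::uint64_t; };

using float_bits_t  = uint_of_size<sizeof(float)>::type;
using double_bits_t = uint_of_size<sizeof(double)>::type;
```

A platform whose float width has no matching specialization would fail to compile, which is arguably the right behavior for a bit-pattern hash.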


5

One reason not to just use the bit pattern is that some distinct bit patterns must be considered equal and thus must have the same hash code, namely

  • positive and negative zero
  • possibly denormalized numbers (I don't think this can occur with IEEE 754, but C allows other float representations).
  • possibly NaNs (there are many, at least in IEEE 754. It actually requires NaNs to compare unequal to themselves, which arguably means they cannot be meaningfully used in a hashtable)
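To make the first bullet concrete, here is a small check (assuming IEEE 754 single-precision floats; the bits_of helper is just a name made up for this example):

```cpp
#include <cstdint>
#include <cstring>

// Return the raw bit pattern of a float.
// Assumes sizeof(float) == 4, as on IEEE 754 platforms.
inline std::uint32_t bits_of(float f) {
    std::uint32_t b;
    std::memcpy(&b, &f, sizeof b);  // well-defined type pun
    return b;
}
```

On an IEEE 754 machine, `0.0f == -0.0f` is true while `bits_of(0.0f)` and `bits_of(-0.0f)` differ (the sign bit), so a naive bit-pattern hash would break the "equal keys hash equally" requirement.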

3 Comments

Why must NaNs be considered equal? NaN != NaN.
Actually, I take that back. The standard says "all evaluations of the expression h(k) with the same value for k yield the same result." not (as I expected), "all evaluations h(k) and h(l) where k == l yield the same result". So == is less relevant than "being the same value", according to the hash requirements. Whether two different NaNs are "the same value" is a standards-combing exercise I can't be bothered with :-)
even though NaN != NaN, hash functions are to be treated as black-boxes; that is, the consumer of hash values (such as associative containers) will not make provision for non-comparability. Thus, either hash(NaN) == hash(NaN) has to be true, or the data type shouldn't be hashable.
4

Why do you want to hash floating-point values? For the same reason that comparing floating-point values for equality has a number of pitfalls, hashing them can have similar (negative) consequences.

However, given that you really do want to do this, I suspect that the boost algorithm is complicated because when you take denormalized numbers into account, different bit patterns can represent the same number (and should probably have the same hash). In IEEE 754 there are also both positive and negative 0 values that compare equal but have different bit patterns.

This probably wouldn't come up in the hashing if it hadn't already come up elsewhere in your algorithm, but you still need to take care with signaling NaN values.

Additionally what would be the meaning of hashing +/- infinity and/or NaN? Specifically NaN can have many representations, should they all result in the same hash? Infinity seems to have just two representations so it seems like it would work out ok.
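To illustrate the many-NaNs point (assuming IEEE 754 single precision; from_bits and the specific payload values are just for illustration):

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float (well-defined via memcpy).
inline float from_bits(std::uint32_t b) {
    float f;
    std::memcpy(&f, &b, sizeof f);
    return f;
}

// 0x7FC00000 and 0x7FC00001 are both quiet NaNs in IEEE 754 single
// precision (same exponent field of all ones, different payloads), so
// a raw bit-pattern hash would put them in different buckets even if
// the table wanted to treat "NaN" as one value.
```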

6 Comments

Can you elaborate on why you think hashing +/- infinity and NaN is meaningless? I fail to see why this could be an issue.
At least with IEEE 754, I don't think there can be different patterns with the same value due to denormalization
@Michael Borgwardt Certainly +/- 0 and various values of NaN could occur in IEEE 754. I'm not sure about other values though.
Unless the standard says otherwise, I'd expect +/- inf to be treated like any other values of the type - ideally they should have different hashes. If there are two representations of +inf in your floating point type, then they must have the same hash.
Nice answer except the pointless "Why would you want to do this?". There are use-cases. Just because you can't think of them...
|
0

I imagine it's so that two machines with incompatible floating-point formats hash to the same value.

1 Comment

And imagine I just want to use floats in a std::unordered_set; there is no reason the hash value needs to be shared between machines.
