
I've read that unordered_map places elements with the same hash in buckets and that's how it handles hash collisions. However, when I checked the insert function, it says:

Each element is inserted only if its key is not equivalent to the key of any other element already in the container

Does that mean that I cannot insert an element with the same hash? I'm supposed to be able to insert a new element even if its hash collides with an existing one, since the unordered_map structure can handle collisions, right? I think I may be missing something.

1 Comment
2 different keys can result in the same hash, and even 2 different hashes can correspond to the same bucket.

1 Answer


It's certainly possible for those statements to be consistent, once you realise that the hash isn't necessarily the key.

A group of distinct keys may generate the same hash value and so be stored in the same bucket, but that is still compatible with the restriction that duplicate keys are disallowed.

For example, let's say you have a friends collection using first name as the key. The hash function is (a rather simplistic) "use the first letter of the name".

So, while Albert, Andrew, Adam, Bill, Benny and Chloe are six different keys, they only account for three different hash values:

               A                    B          C      (buckets)
        ______/|\_______          /   \        |
       /       |       \         /     \       |
    Albert  Andrew   Adam      Bill  Benny   Chloe    (keys)
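As a rough sketch of that example (the FirstLetterHash functor below is hypothetical, written only to force the collisions described above): all six distinct keys are inserted even though several of them share a hash value, while a second insert with the key "Albert" is rejected.

    #include <iostream>
    #include <string>
    #include <unordered_map>

    // Deliberately simplistic hash: "use the first letter of the name".
    struct FirstLetterHash {
        std::size_t operator()(const std::string& name) const {
            return name.empty() ? 0 : static_cast<std::size_t>(name[0]);
        }
    };

    int main() {
        std::unordered_map<std::string, int, FirstLetterHash> friends;

        for (const std::string& name : {"Albert", "Andrew", "Adam", "Bill", "Benny", "Chloe"})
            friends.emplace(name, 0);               // six distinct keys, only three hash values

        std::cout << friends.size() << '\n';        // 6 -- hash collisions don't block insertion

        auto result = friends.emplace("Albert", 42);
        std::cout << std::boolalpha
                  << result.second << '\n';         // false -- a duplicate *key* is rejected
    }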

5 Comments

Worth mentioning that 2 different hashes can end up in the same bucket, given that there will (almost certainly) be far more possible hashes than buckets (a sketch of this follows the comments).
@BoBTFish, that depends on how you define hash, I guess. If you get a 32-bit "hash" that's further reduced to an 8-bit bucket ID, I'd tend not to call the 32-bit value a hash, rather an intermediate value. The true hash would be the bucket ID.
To me, it's the result_type of the hasher. The "bucket id" must surely be the same type as returned by bucket_count or max_bucket_count, which is size_type.
@BoBTFish, I think then this is just a matter of terminology. The whole point of a hash function is to efficiently turn the data into a bucket ID, there's little use for an intermediate "hash" value once you have that bucket ID. It seems like you're saying something like hash = hashFn(data); bucket = hash % 200; whereas I view it more as hashAndBucket = hashFn(data) % 200; (or the hash function itself reducing the intermediate value to a bucket ID).
@paxdiablo: use in hash tables is only one application of hash function outputs - they can also be used e.g. as checksums, as keys, as fixed-size statistically-imperfect-but-good irreversible records of recent passwords that are disallowed when setting another etc. It's also useful to distinguish getting the same hash function outputs from collisions after mapping into buckets, so you know whether a better hash function's needed, or a lower load factor, or for a custom non-Standard-library hash table perhaps moving from a power-of-2 bucket count to a prime number...
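To make that distinction concrete, here is a small sketch (the exact numbers printed are implementation-dependent): hash_function() exposes the hasher's output, while bucket() shows which bucket the container mapped that hash to, so two different hash values can still land in the same bucket.

    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, int> m{{"apple", 1}, {"pear", 2}, {"plum", 3}};

        std::cout << "bucket_count: " << m.bucket_count() << '\n';
        for (const auto& kv : m) {
            // hasher output vs. the bucket index the implementation derived from it
            std::cout << kv.first
                      << "  hash="   << m.hash_function()(kv.first)
                      << "  bucket=" << m.bucket(kv.first) << '\n';
        }
    }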
