Hash function to give fewer collisions

Question

I'm using this hash function but I'm getting a lot of collisions. The aim is to add the ascii values of elements and output the value. Any way to optimize this or another function to reduce the number of collisions?

int hash(char* s) { int hash = 0; while(*s) { hash = hash + *s; s++; } return hash; }

There are some well-known collision-free hash function. Since these are known, you can google them. — Eugene Sh.
– Eugene Sh., Commented Oct 11, 2018 at 21:17
@nicomp Why not? The length of the hash should be constant, that's right, but it should hash any input. — Eugene Sh.
– Eugene Sh., Commented Oct 11, 2018 at 21:28
One thing to note: a simple sum of all the characters means that permutations will always collide: hash("god") == hash("dog") — Barmar
– Barmar, Commented Oct 11, 2018 at 21:52

Jonathan Wood · Accepted Answer · 2018-10-11 21:24:33Z

A 32-bit int has a range of over 4 billion. (If your ints are 64-bit, the range is much bigger.) But your code simply adds up the values of each character in the string and it will never get anywhere near the upper range. All your hash codes will be smaller numbers, crowding the lower end of possible values, and increasing the chance of collisions.

That's why a good algorithm will be more complicated than this.

Here's one article that turned up in a quick Google search.

Bwebb · Accepted Answer · 2018-10-11 22:26:40Z

"foo bar" and "bar foo" hash to the same value right? Implement it in such a way that the ascii value and its position in the string are used to compute the hash, I naively imagine this will reduce collision significantly.

int hash(char* s) { int hash = 0; int pos = 0; while(*s) { pos++; hash += (*s * pos); s++; } return hash; }

Try this and see if it helps. I dont have much theoretical knowledge behind this answer.

EDIT* as mentioned below, you would probably want hash to be an unsigned int. I tested this out on codechef.com, here is the source and results:

#include <stdio.h> unsigned int hash(char* s); unsigned int hash2(char* s); int main(void) { unsigned int temp1 = hash("foo bar"); unsigned int temp2 = hash("bar foo"); printf("temp1 is %d and temp2 is %d\n",temp1, temp2); temp1 = hash2("foo bar"); temp2 = hash2("bar foo"); printf("temp1 is %d and temp2 is %d\n",temp1, temp2); return 0; } unsigned int hash(char* s) { unsigned int hash = 0; while(*s) { hash = hash + *s; s++; } return hash; } unsigned int hash2(char* s) { unsigned int hash = 0; int pos = 0; while(*s) { pos++; hash += (*s * pos); s++; } return hash; }

With output:

temp1 is 665 and temp2 is 665

temp1 is 2655 and temp2 is 2715

This is going to overflow quickly. And mind the fact that int overflow is undefined behavior.

MiCo · Accepted Answer · 2018-10-11 23:09:40Z

Yes, your "hash" function will have collisions for strings which consist of the same letters, for example "rail safety" and "fairy tales". This is because you are only using the addition which is commutative.

You can use something like this which involves a prime as factor.

unsigned long int hashBetter(const char* s) { unsigned long int hash = 1234567890ul; while(*s) { hash = (*s + hash) * 4294967291ul; s++; } return hash; }

Or you involve a CRC which spreads broadly the input data over the valid range of possible hash values:

unsigned long int hashGood(const char* s) { unsigned long int hash = 1234567890ul; while(*s) { hash = crc(hash, *s); s++; } return hash; }

Collectives™ on Stack Overflow

Hash function to give fewer collisions

3 Answers 3

Comments

1 Comment

Comments

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

1 Comment

Comments

Related