How would you go about designing a function for a perfect hash?

Question

The domain of interest is string matching. Assume I have a structure like this.

typedef struct { char *name, int (*function)(); } StringArray StringArray s[] = { {"George", func1}, {"Paul", func2}, {"Ringo", func3}, {"John", func4}, {"", NULL} /* End of list */ }

There are a fixed number of strings in the array. They are hard coded as in the example. If the table changes, there would be a need to re-evaluate the quality of the hash function.

I want to apply a hash function to a string, and if the string matches one in the array, then call the function. A perfect hash function is needed for this. No collisions are allowed.The purpose of requiring hashing is to get O(1) performance on the lookup.

What ideas do you have on designing a function to do this?

The following page has several implementations of general purpose hash functions that are efficient and exhibit minimal collisions: partow.net/programming/hashfunctions/index.html — Matthieu N.
– Matthieu N., Commented Oct 31, 2010 at 23:10

David Thornley · Accepted Answer · 2009-04-09 17:10:10Z

The summary lists both C and C++. Which of them are you looking for? C and C++ are two distinct languages, and differ greatly in their string handling and data structures (and the fact that the C ones work in C++ doesn't change that).

Why, specifically, do you want a perfect hash function? Is it that you want to associate a string with a function, and thought that would be a good way to do it? Is this some sort of homework assignment? Do you have a reason not to use map<> in C++? (Or unordered_map<> if available?)

If you do need a perfect hash, what are the constraints on the strings? Will there be a certain fixed set you want to dispatch on? What about strings that don't match one of the set? Are you willing to accept hits from random strings, or is the number of incoming strings limited?

If you could edit your question to include information like that, we could be a lot more helpful.

EDIT (in response to the first two comments):

OK, we should look at C solutions, since you presumably want this for both your C and C++ work. You presumably want the performance, but have you tested? If we're dealing with strings coming in on the I/O system, the time there is likely to dwarf the dispatch time.

You are expecting arbitrary strings. It's a bit much to expect a perfect hash function that will avoid all collisions from random data, so you do need to consider that.

Have you considered a trie? It may be more efficient than a perfect hash function (or may not be), it should be fairly easy to implement in C, and it will avoid problems with redoing your list of dispatching strings or possible collisions.

I code in both c and c++, and god help me Pro*C. O(1) hashing for performance. Lol, no homework assignment. I am looking to build a tool to speed up some performance critical code. The example is made simple for discussion purposes. The real world use is not.
The strings will very in length. None of them will be of length zero. As a practical limit, no string in the array will be longer than 32 characters. What the caller passes in can be of any length, but if it is longer than the strings in the table it is the case of a no-match

EvilTeach · Accepted Answer · 2011-01-01 17:41:01Z

The final result of this exercise was to

Steal a number of string oriented hash functions off the net.
Build a sort of factory class which tested each of the functions against the data set with a range of mod operator values, looking for the smallest perfect hash that works with that function.
That factory class default constructor returns a string, which represents a set of arguments that when use picks the correct hash function, and mod size to give the perfect hash that requires the least amount of memory.
under normal usage, you simply instantiate the class with the returned arguments, and the class puts itself into a working state with the desired functions.
That constructor validates that there are no collisions and aborts if there are.
In the case where no perfect hash is found, it degrades into a binary search across a sorted version of the input table.

For the set of arrays I have in my domain, this seems to work very very well. A possible future optimization would be to do the same sort of testing, on substrings of the input. In the sample case, the first letter of each musicians name is sufficient to tell them apart. Then one would need to balance the cost of the actual hash function against the memory used.

My thanks to everyone who contributed ideas.

Evil

Vinay · Accepted Answer · 2009-04-09 15:35:58Z

0

You can use map

std::string foo() { return "Foo"; } std::string bar() { return "Bar"; } int main() { std::map<std::string, std::string (*)()> m; m["foo"] = &foo; m["bar"] = &bar; }

answered Apr 9, 2009 at 15:35

Vinay

4,8037 gold badges36 silver badges45 bronze badges

5 Comments

anon Over a year ago

std::map doesn't use a hash - it is tree based

Vinay Over a year ago

why to invent the wheel, you can use existing libraries like map.

anon Over a year ago

perhaps the questioner wanted the performance characteristics of hashing rather than those of tree searching?

Mykola Golubyev Over a year ago

@Mitch: what's the problem? We should often ask ourselves not how to do this or that but why I want to use this or that.

David Thornley Over a year ago

Precisely. Why does EvilTeach want a hash function? There may be good reasons for exactly that, but I haven't seen them. For the general case of taking a string and executing a function, this works fine.

galets · Accepted Answer · 2009-04-09 15:41:26Z

If collisions are absolutely not allowed, your only option is to keep track of every string in the database, which is probably not a best way to go.

What I would do is apply one of existing common strong hashing algorithms, such as: MD5 or SHA. There's miriads of samples all around, here's one for example: http://www.codeproject.com/KB/security/cryptest.aspx

user82238 · Accepted Answer · 2009-04-09 16:03:27Z

0

Use a balanced binary tree. Then you KNOW behaviour is ALWAYS O(logn).

I strongly dislike hashes. People don't realise how much risk they take with their algorithm. They run some test data and then deploy in the field. I have NEVER seen a deployed hash algorithm get checked for behaviour in the field.

O(log n) is almost always acceptable in lieu of O(1).

answered Apr 9, 2009 at 16:03

user82238

5 Comments

Konrad Rudolph Over a year ago

“O(log n) is almost always acceptable in lieu of O(1).” In many applications, this statement is flat out wrong. Just increase the number of data points to above a few millions to see this.

David Thornley Over a year ago

Once you've done that, test. Hashes do not give guaranteed results, unless you know in advance what all possible inputs can be. A hash function that tends to clump the input will likely not give you O(1).

EvilTeach Over a year ago

In this case, all the inputs are known. They are sitting in the array. and input string is either an exact match, or a no-match.

David Thornley Over a year ago

No, all the dispatchable inputs are known. Once you've hashed the input string, and gotten a hit, you do need to compare.

user82238 Over a year ago

Konrad - yes. But the number of applications with so many datum are few and far between. They form a specific, rather rare class. This is why I said almost always acceptable.

Megacan · Accepted Answer · 2009-04-09 16:00:46Z

-1

Well, there is no perfect hash function.

You have several that minimize collisions, but none eliminates them.

Can't advise one though :P

EDIT: The solution can't be finding a perfect hash function. The solution is to be aware of collisions. Generically a hash function has collisions. That obviously depends on the dataset and the size of the resulting hash code.

edited Apr 9, 2009 at 16:00

answered Apr 9, 2009 at 15:41

Megacan

2,5183 gold badges20 silver badges31 bronze badges

7 Comments

sipsorcery Over a year ago

@Adam: There's a pretty big caveat there given that it only applies when there is a distinct data set. Since the OP didn't make any mention of limiting the strings being used I'd agree with Megacan that there is no perfect hash in this case. +1.

anon Over a year ago

The questioner does make that mention, at least implicitly - there were only four Beatles )or siz if you include the drummer they sacked and Stu whatsisname) - still, a fixed data set.

Megacan Over a year ago

I was only saying, the solution can't be finding a perfect hash function. The solution is to be aware of collisions. Generically a hash function has collisions. That obviously depends on the dataset and the size of the resulting hash code.

sipsorcery Over a year ago

Yes it will be a fixed data set, there are only so many strings he could type in a lifetime :). The point here though is that "there is no perfect hash function" is a truism unless there are explicit constraints specified. Megacan was logoically correct in my book.

anon Over a year ago

@megacan - follow the link in the answer I posted - perfect hash functions are possible and quite widely used

|

Collectives™ on Stack Overflow

How would you go about designing a function for a perfect hash?

6 Answers 6

2 Comments

Comments

5 Comments

Comments

5 Comments

7 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

2 Comments

Comments

5 Comments

Comments

5 Comments

7 Comments

Linked

Related