13

The domain of interest is string matching. Assume I have a structure like this.

typedef struct { char *name, int (*function)(); } StringArray StringArray s[] = { {"George", func1}, {"Paul", func2}, {"Ringo", func3}, {"John", func4}, {"", NULL} /* End of list */ } 

There are a fixed number of strings in the array. They are hard coded as in the example. If the table changes, there would be a need to re-evaluate the quality of the hash function.

I want to apply a hash function to a string, and if the string matches one in the array, then call the function. A perfect hash function is needed for this. No collisions are allowed.The purpose of requiring hashing is to get O(1) performance on the lookup.

What ideas do you have on designing a function to do this?

2
  • See the gperf home page. Commented Apr 9, 2009 at 15:35
  • 32
    The following page has several implementations of general purpose hash functions that are efficient and exhibit minimal collisions: partow.net/programming/hashfunctions/index.html Commented Oct 31, 2010 at 23:10

6 Answers 6

2

The summary lists both C and C++. Which of them are you looking for? C and C++ are two distinct languages, and differ greatly in their string handling and data structures (and the fact that the C ones work in C++ doesn't change that).

Why, specifically, do you want a perfect hash function? Is it that you want to associate a string with a function, and thought that would be a good way to do it? Is this some sort of homework assignment? Do you have a reason not to use map<> in C++? (Or unordered_map<> if available?)

If you do need a perfect hash, what are the constraints on the strings? Will there be a certain fixed set you want to dispatch on? What about strings that don't match one of the set? Are you willing to accept hits from random strings, or is the number of incoming strings limited?

If you could edit your question to include information like that, we could be a lot more helpful.

EDIT (in response to the first two comments):

OK, we should look at C solutions, since you presumably want this for both your C and C++ work. You presumably want the performance, but have you tested? If we're dealing with strings coming in on the I/O system, the time there is likely to dwarf the dispatch time.

You are expecting arbitrary strings. It's a bit much to expect a perfect hash function that will avoid all collisions from random data, so you do need to consider that.

Have you considered a trie? It may be more efficient than a perfect hash function (or may not be), it should be fairly easy to implement in C, and it will avoid problems with redoing your list of dispatching strings or possible collisions.

Sign up to request clarification or add additional context in comments.

2 Comments

I code in both c and c++, and god help me Pro*C. O(1) hashing for performance. Lol, no homework assignment. I am looking to build a tool to speed up some performance critical code. The example is made simple for discussion purposes. The real world use is not.
The strings will very in length. None of them will be of length zero. As a practical limit, no string in the array will be longer than 32 characters. What the caller passes in can be of any length, but if it is longer than the strings in the table it is the case of a no-match
1

The final result of this exercise was to

  • Steal a number of string oriented hash functions off the net.
  • Build a sort of factory class which tested each of the functions against the data set with a range of mod operator values, looking for the smallest perfect hash that works with that function.
  • That factory class default constructor returns a string, which represents a set of arguments that when use picks the correct hash function, and mod size to give the perfect hash that requires the least amount of memory.
  • under normal usage, you simply instantiate the class with the returned arguments, and the class puts itself into a working state with the desired functions.
  • That constructor validates that there are no collisions and aborts if there are.
  • In the case where no perfect hash is found, it degrades into a binary search across a sorted version of the input table.

For the set of arrays I have in my domain, this seems to work very very well. A possible future optimization would be to do the same sort of testing, on substrings of the input. In the sample case, the first letter of each musicians name is sufficient to tell them apart. Then one would need to balance the cost of the actual hash function against the memory used.

My thanks to everyone who contributed ideas.

Evil

Comments

0

You can use map

std::string foo() { return "Foo"; } std::string bar() { return "Bar"; } int main() { std::map<std::string, std::string (*)()> m; m["foo"] = &foo; m["bar"] = &bar; } 

5 Comments

std::map doesn't use a hash - it is tree based
why to invent the wheel, you can use existing libraries like map.
perhaps the questioner wanted the performance characteristics of hashing rather than those of tree searching?
@Mitch: what's the problem? We should often ask ourselves not how to do this or that but why I want to use this or that.
Precisely. Why does EvilTeach want a hash function? There may be good reasons for exactly that, but I haven't seen them. For the general case of taking a string and executing a function, this works fine.
0

If collisions are absolutely not allowed, your only option is to keep track of every string in the database, which is probably not a best way to go.

What I would do is apply one of existing common strong hashing algorithms, such as: MD5 or SHA. There's miriads of samples all around, here's one for example: http://www.codeproject.com/KB/security/cryptest.aspx

Comments

0

Use a balanced binary tree. Then you KNOW behaviour is ALWAYS O(logn).

I strongly dislike hashes. People don't realise how much risk they take with their algorithm. They run some test data and then deploy in the field. I have NEVER seen a deployed hash algorithm get checked for behaviour in the field.

O(log n) is almost always acceptable in lieu of O(1).

5 Comments

“O(log n) is almost always acceptable in lieu of O(1).” In many applications, this statement is flat out wrong. Just increase the number of data points to above a few millions to see this.
Once you've done that, test. Hashes do not give guaranteed results, unless you know in advance what all possible inputs can be. A hash function that tends to clump the input will likely not give you O(1).
In this case, all the inputs are known. They are sitting in the array. and input string is either an exact match, or a no-match.
No, all the dispatchable inputs are known. Once you've hashed the input string, and gotten a hit, you do need to compare.
Konrad - yes. But the number of applications with so many datum are few and far between. They form a specific, rather rare class. This is why I said almost always acceptable.
-1

Well, there is no perfect hash function.

You have several that minimize collisions, but none eliminates them.

Can't advise one though :P

EDIT: The solution can't be finding a perfect hash function. The solution is to be aware of collisions. Generically a hash function has collisions. That obviously depends on the dataset and the size of the resulting hash code.

7 Comments

@Adam: There's a pretty big caveat there given that it only applies when there is a distinct data set. Since the OP didn't make any mention of limiting the strings being used I'd agree with Megacan that there is no perfect hash in this case. +1.
The questioner does make that mention, at least implicitly - there were only four Beatles )or siz if you include the drummer they sacked and Stu whatsisname) - still, a fixed data set.
I was only saying, the solution can't be finding a perfect hash function. The solution is to be aware of collisions. Generically a hash function has collisions. That obviously depends on the dataset and the size of the resulting hash code.
Yes it will be a fixed data set, there are only so many strings he could type in a lifetime :). The point here though is that "there is no perfect hash function" is a truism unless there are explicit constraints specified. Megacan was logoically correct in my book.
@megacan - follow the link in the answer I posted - perfect hash functions are possible and quite widely used
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.