Hash algorithm for english phrases

Question

I'm working on an English words app right now, I want every word has a different int id, as all words are different from one another, I think they simply can be assigned an integer (or long?) easily.

I don't want to give them ids serially according to alphabetical order. I think there may be an existing algorithm for this very requirement, I don't want to invent my own wheels, so, please help me.

I prefer integer id because I want the structure to be compact and small enough for transfer over the internet, because one word list might contain hundreds and thousands of words.

Imagine I have data structure as bellow:

struct word { int wordId; byte familiarity; } // I prefer the mapping like this apple -> 0x1, 0x4 app -> 0x2E, 0x2 ape -> 0xEA, 0x1

UPDATE:

Okay, what I'm trying to do is to provide the users several wordlists, each of which contains a couple of words, chances are the user already learn some of the words(eg. apple), so he/she wants to skip those words, and hope they'll never show up again. So, I want to enable the user to skip those words, and the selected words will be sent to the server or kept in a local file, it might be unnecessary to send the whole word or phrase. I've found a question here: http://stackoverflow.com/questions/7700400/whats-a-good-hash-function-for-english-words, do you have any better solutions?

It's called a dictionary; several good ones, both Oxfords and Websters, are available at your local bookstore. However, it may take you a while to type in all 300,000 words from a good one. — Pieter Geerkens
– Pieter Geerkens, Commented Mar 18, 2013 at 8:29
"I don't want to give them ids serially according to alphabetical order" ― why? — n. m. could be an AI
– n. m. could be an AI, Commented Mar 18, 2013 at 8:30
I see this as a proposed solution for an unknow problem. Please provide some background on the actual business problem being addressed, as something like the LZW encryption algorithm may be much better suited to it than this trial balloon. — Pieter Geerkens
– Pieter Geerkens, Commented Mar 18, 2013 at 8:35
@n.m. because the vocabulary is growing, for instance, in v1.0, my vocabulary is "apple, orange, strawberry", if I give them id serially, it may have a mapping like "apple 1, orange 2, strawberry 3", and in v2.0, I add a lemon, if I assign ids alphabetically, it should be "apple 1, lemon 2, orange 3, strawberry 4", but the ids are already taken. — Liang Zhao
– Liang Zhao, Commented Mar 18, 2013 at 8:57

Community · Accepted Answer · 2017-05-23 11:50:09Z

Yes, it seems impossible to find a perfect collision free hashing algorithm, I might end up with maintaining a mapping file.

I also find a great question and answer here.

Actually I don't mind the performance of this algorithm for it's all done on server, and only once while starting up. All I want is the id for every word/phrase is unique, as short as possible, like a fingerprint. I wonder if I can take advantage of the prime numbers..

Finally, I decide to use a long for my id

(8 bit) first letter of the first word
(8 bit) last letter of the last word
(4 bit) word count
(4 bit) the serial number of the longest word in the phrase
(8 bit) character count, space included
(32bit) MurmurHash3 result

You can find murmurHash3 cs implementation here:
https://gist.github.com/automatonic/3725443
I think this approach will generate unique ids for any existing words and phrases with no collision.

Collectives™ on Stack Overflow

Hash algorithm for english phrases

1 Answer 1

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Linked

Related