1

I'm working on an English words app right now, I want every word has a different int id, as all words are different from one another, I think they simply can be assigned an integer (or long?) easily.

I don't want to give them ids serially according to alphabetical order. I think there may be an existing algorithm for this very requirement, I don't want to invent my own wheels, so, please help me.

I prefer integer id because I want the structure to be compact and small enough for transfer over the internet, because one word list might contain hundreds and thousands of words.

Imagine I have data structure as bellow:

struct word { int wordId; byte familiarity; } // I prefer the mapping like this apple -> 0x1, 0x4 app -> 0x2E, 0x2 ape -> 0xEA, 0x1 

UPDATE:

Okay, what I'm trying to do is to provide the users several wordlists, each of which contains a couple of words, chances are the user already learn some of the words(eg. apple), so he/she wants to skip those words, and hope they'll never show up again. So, I want to enable the user to skip those words, and the selected words will be sent to the server or kept in a local file, it might be unnecessary to send the whole word or phrase. I've found a question here: http://stackoverflow.com/questions/7700400/whats-a-good-hash-function-for-english-words, do you have any better solutions?

11
  • It's called a dictionary; several good ones, both Oxfords and Websters, are available at your local bookstore. However, it may take you a while to type in all 300,000 words from a good one. Commented Mar 18, 2013 at 8:29
  • "I don't want to give them ids serially according to alphabetical order" ― why? Commented Mar 18, 2013 at 8:30
  • I see this as a proposed solution for an unknow problem. Please provide some background on the actual business problem being addressed, as something like the LZW encryption algorithm may be much better suited to it than this trial balloon. Commented Mar 18, 2013 at 8:35
  • @n.m. because the vocabulary is growing, for instance, in v1.0, my vocabulary is "apple, orange, strawberry", if I give them id serially, it may have a mapping like "apple 1, orange 2, strawberry 3", and in v2.0, I add a lemon, if I assign ids alphabetically, it should be "apple 1, lemon 2, orange 3, strawberry 4", but the ids are already taken. Commented Mar 18, 2013 at 8:57
  • 1
    So assign the numbers in chronological order. Commented Mar 18, 2013 at 8:59

1 Answer 1

0

Yes, it seems impossible to find a perfect collision free hashing algorithm, I might end up with maintaining a mapping file.

I also find a great question and answer here.

Actually I don't mind the performance of this algorithm for it's all done on server, and only once while starting up. All I want is the id for every word/phrase is unique, as short as possible, like a fingerprint. I wonder if I can take advantage of the prime numbers..

Finally, I decide to use a long for my id

(8 bit) first letter of the first word
(8 bit) last letter of the last word
(4 bit) word count
(4 bit) the serial number of the longest word in the phrase
(8 bit) character count, space included
(32bit) MurmurHash3 result

You can find murmurHash3 cs implementation here:
https://gist.github.com/automatonic/3725443
I think this approach will generate unique ids for any existing words and phrases with no collision.

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.