29

I'm trying to develop a system that converts a string into a unique integer value. For example, say the word "account" has an encoded numerical value of 0891; no other word may convert to 0891 under the same conversion process. The generated integer does not, however, need to be convertible back into the string.

At the same time it will be dependent on the word structure rules, meaning words such as "accuracy" and "announcement" will have a generated number greater than 0891 and words such as "a", "abacus" and "abbreviation" will have a generated number less than 0891.

The purpose of this is to serve as something like an index or primary key. I'm not using an incrementing index for security reasons, and because such an index depends on the number of items in the data set.

(e.g.)

[0] A, [1] B, [2] C, [3] D, [4] E, [5] F 

Each of the letters above has a corresponding index; E has index 4.

However, if entries are suddenly added or removed and the data is then re-sorted:

[0] A, [1] AA, [2] AAB, [3] C, [4] D, [5] DA, [6] DZ, [7] E, [8] F 

E now has index 7.

Each word must have a unique integer equivalent that is independent of the other data and reflects the word's weight in dictionary order.

I need to know whether there exists an algorithm that can do the above.

Any help will be appreciated.

8
  • 3
    Unless you impose a maximum word length, this is not possible. (Even if you do impose a word length, I'm still not sure). Commented May 13, 2013 at 11:45
  • 3
    I would start by saying that if you want security, you should drop your dependence on the "word structure rules". You are already making the job easier for an attacker with such a requirement. Commented May 13, 2013 at 11:46
  • 1
    What UmNyobe said, plus you should accept collisions. An index usually has collisions, there's nothing wrong with that as long as they're the exception, not the rule. Commented May 13, 2013 at 11:47
  • 1
    Please note that if you want to convert something into something else, without the option of returning to the original value, you are referring to hashing instead of encryption. Your problem is that most hashing algorithms have a chance of returning the same hash for different inputs. And especially if the output has to be an integral value; this will be quite likely. Commented May 13, 2013 at 11:47
  • 2
    @marcolopes: There are only 2^32 possible integer hash codes. There are many more than 2^32 possible strings, so String.hashCode() is guaranteed to generate the same hash code for multiple strings. Commented Nov 14, 2016 at 21:35

9 Answers

15

This is not possible with the constraints you have given, unless you impose a maximum length.

Assume that k("a") and k("b") are the codes of these two strings.

With your constraints, every string of the form "a....a" needs a unique integer that falls in between these two values: k("a") < k("a....a") < k("b"). There is an infinite number of strings of the form "a....a" (and "akjhdsfkjhs") that would need to fit in between the two codes, but only finitely many integers between k("a") and k("b"), so such an order-preserving, general, unique, fixed-length code cannot exist for strings of arbitrary length. You would need as many integers between the two codes as there are strings, and since strings are not bounded in length, this cannot work.

Drop one of the requirements: generality (so don't allow inserting new strings), uniqueness (allow collisions, e.g. use the first four letters as the code!), unbounded length (limit it to e.g. 3 characters), or the order-preserving property.
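The "first four letters as the code" option can be sketched as follows. This is only an illustration under stated assumptions: lowercase a-z input, and the names `prefix_code` and `width` are made up for this example. It keeps dictionary order (weakly) but deliberately gives up uniqueness.

```python
def prefix_code(word: str, width: int = 4) -> int:
    """Order-preserving, non-unique code built from the first `width` letters.

    Assumes the word contains only lowercase a-z (an assumption of this sketch).
    """
    code = 0
    for i in range(width):
        # 0 means "no letter at this position", 1..26 mean a..z,
        # so a shorter word sorts before any longer word it prefixes.
        digit = ord(word[i]) - ord("a") + 1 if i < len(word) else 0
        code = code * 27 + digit
    return code


# Dictionary order is kept for words that differ within the first four letters...
print(prefix_code("abacus") < prefix_code("account") < prefix_code("accuracy"))
# ...but words sharing those four letters collide.
print(prefix_code("account") == prefix_code("accountant"))
```

Both lines print `True`: the code never inverts dictionary order, but it cannot tell apart words with the same four-letter prefix.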


13

For simplicity, I'll assume a to z are the only characters allowed in words.

Let's assign numbers up to length 2 strings:

    String  Value
    a       0
    aa      1
    ab      2
    ...
    az      26
    b       27
    ba      28
    bb      29
    ...
    bz      53
    c       54
    ...

Now, by just looking at that, you should be able to appreciate that, to determine the offset of any given shorter-length string, you'd need the maximum length allowed. Let's assume we know this number.

For algorithmic simplicity, we would prefer to start at 27: (feel free to try to figure it out for starting from 0, you'll need some special cases)

    String  Value
    a       27
    aa      28
    ab      29
    ...

So, essentially, the left-most character contributes 27 times a value from 1 to 26 (for a to z), and the next character to the right, if one exists, contributes a value from 1 to 26 (for a to z) to the value of the string.

Now this can be generalized: the left-most character contributes c*27^(len-1), the next c*27^(len-2), and so on, down to c*27^0, where c is the character's value from 1 to 26 and len is the maximum allowed length.

Which leads me to some Java code:

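    // pow is assumed here to be an exact integer (long) power helper;
    // Math.pow returns a double and loses precision at this magnitude.
    long result = 0;
    for (int i = 0; i < s.length(); i++)
        result += pow(27, MAX_LENGTH - i - 1) * (1 + s.charAt(i) - 'a');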

Test output:

    a            = 150094635296999121
    aa           = 155653695863554644
    aaa          = 155859586995649293
    aaaa         = 155867212593134280
    aaaaa        = 155867495022670761
    abacus       = 161447654121636735
    abbreviation = 161763445236432690
    account      = 167509959568845165
    accuracy     = 167554723653128367
    announcement = 230924421746611173
    z            = 3902460517721977146

Online demo.

Yes, those are some reasonably big numbers for just up to length 13 strings, but, without sequentially assigning numbers to words in an actual dictionary, you can't do any better (except that you can start at 0, which is, relatively speaking, a small difference), since there are that many possibilities of letter sequences.
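For comparison, here is a rough Python port of the same base-27 idea. It is a sketch, not the answer's original program: the name `encode` is made up, and `MAX_LENGTH = 13` is an assumption inferred from the size of the test output (the value for "a" equals 27^12). Python integers are arbitrary precision, so no separate integer power helper is needed.

```python
MAX_LENGTH = 13  # assumed maximum word length, inferred from the test output


def encode(word: str, max_length: int = MAX_LENGTH) -> int:
    """Order-preserving, unique code for lowercase a-z words up to max_length."""
    if len(word) > max_length:
        raise ValueError("word is longer than the maximum supported length")
    result = 0
    for i, ch in enumerate(word):
        if not "a" <= ch <= "z":
            raise ValueError("only lowercase a-z is supported in this sketch")
        # Position i carries weight 27^(max_length - i - 1); letters map to 1..26,
        # leaving 0 for "no letter", which is what makes shorter words sort first.
        result += 27 ** (max_length - i - 1) * (1 + ord(ch) - ord("a"))
    return result


words = ["a", "aa", "abacus", "abbreviation", "account", "accuracy", "announcement", "z"]
for w in words:
    print(w, "=", encode(w))
```

Because missing positions count as zero, the numeric order of the codes matches the dictionary order of the words, as long as no word exceeds the assumed maximum length.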


5

For uniqueness, start with assigning primes to the letters: A -> 2, B -> 3, C -> 5, D -> 7 etc.

To calculate the "key" of a given letter in a word, raise the prime to the power of the position index in the word. To get the "key" of the whole word, multiply all the letter keys together.

For example the word CAB:

    C   -> 5 ^ 1 = 5
    A   -> 2 ^ 2 = 4
    B   -> 3 ^ 3 = 27
    CAB -> 5 * 4 * 27 = 540

No other word will ever give you 540 as a key.

Note: you don't have to start with A -> 2 or assign primes to the characters of the alphabet in order as long as you keep track of the mapping. Also bear in mind that the results of this will get large very quickly.

However, bear in mind the other comments about security - this is not a particularly secure algorithm.

3 Comments

same answer as rahulroc, same counter-example: hash('abba') == hash("baab")
Does it actually preserve order as requested? AFAICT your order is incorrect: B < AB
fwiw, in python 2.7 hash('abba') and hash('baab') are not equal. I wonder what they are doing
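The counter-example and the ordering question raised in these comments can be checked with a small sketch of the scheme as the answer describes it (prime per letter, raised to the 1-based position, all multiplied together). The names `PRIMES` and `prime_key` are made up for this illustration, and lowercase a-z input is assumed.

```python
# First 26 primes, assigned to a..z in order (a -> 2, b -> 3, c -> 5, ...).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]


def prime_key(word: str) -> int:
    """Product of (letter prime) ** (1-based position) over the word."""
    key = 1
    for position, ch in enumerate(word, start=1):
        key *= PRIMES[ord(ch) - ord("a")] ** position
    return key


print(prime_key("cab"))                        # 5 * 2**2 * 3**3
print(prime_key("abba"), prime_key("baab"))    # same letters, exponents sum the same
print(prime_key("b"), prime_key("ab"))         # numeric order vs. dictionary order
```

In "abba" and "baab" each letter's positions add up to 5, so both keys come out as 2^5 * 3^5. The last line shows that "b" gets a smaller key than "ab", even though "ab" comes first alphabetically.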
2

If you don't have any limit on the number of bytes that these integers can occupy, then the underlying (e.g. ASCII) byte codes for each character will give you an integer representation. Equivalently, assign A=0, B=1, up to Z=25, and then the word itself is the integer in base 26.

1 Comment

How would this handle the strings "10020" and "100C0" ?
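One thing worth checking with the base-26 reading is what happens when the first letter maps to 0. The sketch below assumes lowercase a-z only; `base26` follows the A=0 assignment from the answer, while `bijective_base26` (A=1 to Z=26) is a variation added here for comparison, not something the answer proposes.

```python
def base26(word: str) -> int:
    """Read the word as a base-26 number with a=0, ..., z=25."""
    value = 0
    for ch in word:
        value = value * 26 + (ord(ch) - ord("a"))
    return value


def bijective_base26(word: str) -> int:
    """Variation with a=1, ..., z=26, so no letter behaves like a leading zero."""
    value = 0
    for ch in word:
        value = value * 26 + (ord(ch) - ord("a") + 1)
    return value


print(base26("a"), base26("aa"))    # 'a' acts as a leading zero, so these collide
print(bijective_base26("a"), bijective_base26("aa"), bijective_base26("b"))
```

With A=0, "a" and "aa" both map to 0. Shifting to A=1 makes the values unique, but "b" then gets a smaller number than "aa", so dictionary order is still not preserved without a fixed maximum length.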
1

Assign a unique prime value to each letter of the alphabet in increasing order (the order is not necessary).

Please note: because a product of prime numbers can only be factored back into those same primes, it will give you unique values for each word.

Algorithm :

    int hash = 0;
    for (int i = 0; i < word.length; i++) {
        hash *= (prime[c[i]] ** (length - i));
    }

prime: an array storing the prime value corresponding to each character. Each prime is raised to the power (length - i) to give weight to the position at which the character occurs, to maintain dictionary order.

This algorithm will give sufficiently large values that will overrun your array.

Also: words with smaller lengths may give lower values than some longer words, which may affect your dictionary order, but I'm not sure why you want dictionary order, as uniqueness will be maintained here.

6 Comments

counter-example: hash('abba') == hash("baab")
Plus, it doesn't scale. You quickly run out of integers, and it does not preserve order: hash('b') < hash('ab'), isn't it?
@ErichSchubert On a side note, there are infinitely many integers, even though integral types in programming languages are usually limited. How many integers this skips might be a good question.
I already mentioned in my answer that it will most likely give very huge numbers for very large strings. Also, smaller length words may have a lesser hashcode than the words with large lengths. But hash('abba') != hash('baab'). Let's say a=2, b=3. Then hash("abba") = 2**3 * 3**2 * 3**1 * 2 = 432 and hash('baab') = 3**3 * 2**2 * 2**1 * 3 = 648, which are different numbers. Multiplication of prime numbers will give unique values which can only be obtained by these prime numbers.
First, "hash *=" with an initial value of 0 for hash will always be zero. Perhaps you meant 1. Second, the (length - i) part goes from length to 1 in the for loop, so from 4 to 1 in the example. Thus, hash("abba") would be 2^4 * 3^3 * 3^2 * 2^1 = 7776, and hash("baab") would be 3^4 * 2^3 * 2^2 * 3^1 = 7776 again.
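The arithmetic in the last comment can be reproduced with a short sketch of the loop from the answer. Assumptions made for this illustration: only a=2 and b=3 are mapped, the function name `product_hash` is made up, and the `start` parameter exists only to compare the two initial values discussed above.

```python
PRIME_OF = {"a": 2, "b": 3}  # illustrative mapping from the comments


def product_hash(word: str, start: int = 1) -> int:
    """Multiply prime(letter) ** (length - i) for each index i, as in the answer's loop."""
    result = start
    length = len(word)
    for i, ch in enumerate(word):
        result *= PRIME_OF[ch] ** (length - i)
    return result


print(product_hash("abba", start=0))                # starting from 0 stays 0
print(product_hash("abba"), product_hash("baab"))   # exponents 4, 3, 2, 1 for both words
```

In both words the exponents attached to each letter sum to 5 (4 + 1 and 3 + 2), so the two products are equal.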
1

Yes, but mostly no.

Yes as in Stochastically's answer. By setting up a base 26 (or base 128 for all ASCII), you could theoretically hash each string uniquely.

On the other hand, this is impractical: not only would the numbers get too big for the integer types of most languages, but computing them would likely be very time-consuming. Furthermore, if strings are allowed to be infinitely long, a form of Cantor's diagonal argument applies, which also "breaks" this algorithm: the set of infinite strings is uncountable, so it cannot be mapped one-to-one onto a set of cardinality aleph-null (the integers).


1

You can do this:

    SEPARATOR = '000'
    string_to_hash = "some_string"
    hashed_result = int(SEPARATOR.join(str(ord(character)) for character in string_to_hash))

Enjoy!


1

The function in general form for a string s of length n is:

hashCode(s) = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] 

Where ^ denotes exponentiation. As Java uses 32-bit integers to hold the hash value, all values should be kept as such.
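As a rough illustration of that formula and of the 32-bit wraparound, here is a Python sketch. The name `java_hash_code` is made up, and using `ord` to stand in for Java's UTF-16 code units is an assumption that only holds for characters in the Basic Multilingual Plane.

```python
def java_hash_code(s: str) -> int:
    """Evaluate s[0]*31^(n-1) + ... + s[n-1] with signed 32-bit wraparound."""
    h = 0
    for ch in s:
        # Horner's rule; the mask keeps only the low 32 bits, like Java int overflow.
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed int.
    return h - 2**32 if h >= 2**31 else h


print(java_hash_code("a"), java_hash_code("ab"))
print(java_hash_code("Aa"), java_hash_code("BB"))   # a well-known colliding pair
```

"Aa" and "BB" both evaluate to 2112 (65*31 + 97 and 66*31 + 66), which shows that this kind of hash is not unique even for very short strings.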

If you want to hash a string to a small integer, you can use the following C# code:

    using System;
    using System.Security.Cryptography;
    using System.Text;

    int StringToIntegerHash(string str)
    {
        int hash = 0;
        str = GetTicketHash(str);
        for (int i = 0; i < str.Length; i++)
        {
            // Cast the whole product: Math.Pow returns a double.
            hash += (int)(str[i] * Math.Pow(2, str.Length - i));
        }
        return hash;
    }

    string GetTicketHash(string str)
    {
        const string chars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        byte[] bytes = Encoding.UTF8.GetBytes(str);
        SHA256Managed hashstring = new SHA256Managed();
        byte[] hash = hashstring.ComputeHash(bytes);
        char[] hash2 = new char[16];
        // Note that here we are wasting bits of hash!
        // But it isn't really important, because hash.Length == 32
        for (int i = 0; i < hash2.Length; i++)
        {
            hash2[i] = chars[hash[i] % chars.Length];
        }
        return new string(hash2);
    }


0

I would just convert the string into a byte array and then convert that into a number. Here is some PowerShell sample code:

    $string = "test"
    # convert string into byte-array:
    $enc = [System.Text.Encoding]::UTF8
    $arr = $enc.GetBytes($string)
    # convert byte-array into number:
    $hexbin = [System.Runtime.Remoting.Metadata.W3cXsd2001.SoapHexBinary]::new()
    $hexbin.Value = $arr
    $result = $hexbin.ToString()
    write-host $result

Of course, you could go with any other, shorter conversion such as base 26, but that makes the code much more complex and slower.

FYI: if you want to convert strings into numbers for faster comparison in a DB, keep in mind that most DBs already hash strings internally. No need for any further fine-tuning.
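The byte-array idea can be sketched in Python as well. The function name `bytes_to_int` is made up, and treating the UTF-8 bytes as one big-endian integer is an assumption about what "convert that into a number" means here; it relies on Python's arbitrary-precision integers.

```python
def bytes_to_int(text: str) -> int:
    """Interpret the UTF-8 bytes of text as one big-endian unsigned integer."""
    return int.from_bytes(text.encode("utf-8"), "big")


value = bytes_to_int("test")
print(value, hex(value))                        # 1952805748 0x74657374
print(bytes_to_int("b"), bytes_to_int("aa"))    # 98 and 24929
```

The mapping is unique for strings that do not start with a NUL character, but it follows dictionary order only among strings of equal byte length: "b" maps to a smaller number than "aa".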
