  • 103
    Nice. I have a machine-learning application, doing statistical NLP over a large corpus. After a few initial passes of morphological normalization on the original words in the text, I throw away the string values and use hash codes instead. Across my entire corpus, there are about 600,000 unique words, and using the default Java hashCode function, I was getting about 3.5% collisions. But if I SHA-256 the string value and then generate a hash code from the digested string, the collision ratio is less than 0.0001%. Thanks! Commented Nov 18, 2010 at 18:54
  • 29
    @benjismith One in a million is far too large... is "less than 0.0001%" an oblique way of saying "exactly 0"? I really doubt that you saw a SHA-256 collision because that has never been observed, anywhere, ever; not even for 160-bit SHA-1. If you have two strings that produce the same SHA-256, the security community would love to see them; you'll be world-famous... in a very obscure way. See Comparison of SHA Functions Commented Feb 26, 2014 at 1:03
  • 10
    @TimSylvester, you misunderstood. I didn't find SHA-256 collisions. I computed the SHA-256 and then fed the resultant byte sequences into a typical Java "hashCode" function, because I needed a 32-bit hash. That's where I found the collisions. Nothing remarkable :) Commented Feb 27, 2014 at 5:09
  • 1
    Isn't there a difference between 'hashing' and 'encrypting'? I understand MessageDigest is a one way hashing function, right? Also, when I used the function, I got the hashed string as a lot of junk UTF characters when I opened the file in LibreOffice. Is it possible to get the hashed string as a random bunch of alphanumeric characters instead of junk UTF characters? Commented Nov 13, 2016 at 17:54
  • 2
    The names String encryptedString and stringToEncrypt.getBytes() refer to encryption, when this is really a hashing algorithm. Commented May 10, 2017 at 17:43
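
The points raised in the comments above can be sketched together in a few lines of Java. This is a minimal illustration, not the answer's original code: the class name Sha256Sketch and the helpers sha256, toHex, and hash32 are hypothetical names chosen here. It assumes UTF-8 input, hex-encodes the digest so it prints as plain alphanumeric text rather than raw bytes, and assumes that "a typical Java hashCode function" means something like Arrays.hashCode applied to the digest bytes. The variable names say "hash" and "digest" rather than "encrypt", since SHA-256 is a one-way hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class Sha256Sketch {

    // Compute the raw 32-byte SHA-256 digest of a string (UTF-8 assumed).
    static byte[] sha256(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return md.digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Every compliant Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Render bytes as lowercase hex, so the digest is readable text
    // instead of arbitrary bytes misinterpreted as characters.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Reduce the 256-bit digest to a 32-bit int. Arrays.hashCode is an
    // assumption about what the commenter used; collisions are expected
    // at this width and say nothing about SHA-256 itself.
    static int hash32(String input) {
        return Arrays.hashCode(sha256(input));
    }

    public static void main(String[] args) {
        // Prints 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
        System.out.println(toHex(sha256("hello")));
        System.out.println(hash32("hello"));
    }
}
```

Any reduction to 32 bits will collide sooner or later on a large enough vocabulary, so a collision in hash32 is a property of the 32-bit width, not a SHA-256 collision.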