
I'm studying Item 9 of Effective Java, "Always override hashCode when you override equals".

I have a few questions about the points made by the author:

  1. The author says:

A nonzero initial value is used in step 1 so the hash value will be affected by initial fields whose hash value, as computed in step 2.a, is zero. If zero were used as the initial value in step 1, the overall hash value would be unaffected by any such initial fields, which could increase collisions. The value 17 is arbitrary.

Step 2.a is:

For each significant field f in your object (each field taken into account by the equals method, that is), do the following: a. Compute an int hash code c for the field:

i. If the field is a boolean, compute (f ? 1 : 0).

ii. If the field is a byte, char, short, or int, compute (int) f.

iii. If the field is a long, compute (int) (f ^ (f >>> 32)).

iv. If the field is a float, compute Float.floatToIntBits(f).

v. If the field is a double, compute Double.doubleToLongBits(f), and then hash the resulting long as in step 2.a.iii.

vi. If the field is an object reference and this class’s equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If a more complex comparison is required, compute a “canonical representation” for this field and invoke hashCode on the canonical representation. If the value of the field is null, return 0 (or some other constant, but 0 is traditional).

vii. If the field is an array, treat it as if each element were a separate field. That is, compute a hash code for each significant element by applying these rules recursively, and combine these values per step 2.b. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.

Suppose result is calculated as:

result = 31 * result + areaCode;
result = 31 * result + prefix;
result = 31 * result + lineNumber;

If the initial value of result is 0 and all the fields above are 0, result stays 0. But even if result doesn't start at 0, it would amount to the same constant every time the initial fields are 0, namely 31 * (31 * (31 * 17)). How would this value help in decreasing collisions?
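For reference, the three lines above slot into the full recipe like this. This is a minimal sketch, assuming a hypothetical PhoneNumber class with exactly those three int fields (the field names mirror the snippet above; the surrounding class is my own assumption):

```java
// Sketch of the Item-9 recipe for a hypothetical PhoneNumber class.
final class PhoneNumber {
    private final int areaCode;
    private final int prefix;
    private final int lineNumber;

    PhoneNumber(int areaCode, int prefix, int lineNumber) {
        this.areaCode = areaCode;
        this.prefix = prefix;
        this.lineNumber = lineNumber;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PhoneNumber)) return false;
        PhoneNumber p = (PhoneNumber) o;
        return p.areaCode == areaCode && p.prefix == prefix && p.lineNumber == lineNumber;
    }

    @Override
    public int hashCode() {
        int result = 17;                   // step 1: nonzero initial value
        result = 31 * result + areaCode;   // step 2: fold in each significant field
        result = 31 * result + prefix;
        result = 31 * result + lineNumber;
        return result;
    }
}
```

With this class, new PhoneNumber(0, 0, 0).hashCode() is 31 * 31 * 31 * 17 = 506447, which is exactly the constant the question asks about.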

  2. The last paragraph states that:

Many classes in the Java platform libraries, such as String , Integer , and Date , include in their specifications the exact value returned by their hashCode method as a function of the instance value. This is generally not a good idea, as it severely limits your ability to improve the hash function in future releases. If you leave the details of a hash function unspecified and a flaw is found or a better hash function discovered, you can change the hash function in a subsequent release, confident that no clients depend on the exact values returned by the hash function.

What does he mean by saying that the exact value returned by hashCode is a function of the instance value?

Thanks in advance for any help.

3 Comments
  • In my understanding, "hashCode method as a function of the instance value" means that the generated hash code depends on the values of the object's fields. So two objects with the same values may generate the same hashCode, which can lead to a collision. A better hashCode algorithm then needs to be implemented in order to avoid collisions. Commented Dec 1, 2015 at 7:21
  • If you check the documentation of the String hashCode() method, it implements a formula based on the characters in the string. Commented Dec 1, 2015 at 7:22
  • It all depends on what you actually do with the hashcode. Ultimately it is a tradeoff between implementation effort, runtime cost, and the number of collisions you can tolerate. There is no 'better'. When in doubt, use one of the convenient hashcode builders from e.g. Guava, or just return -1 if you really don't care. Alternatively, consider MD5 or Murmur to get a decent spread of values at not too much computational overhead. I've found out the hard way that String.hashCode() and HashMaps don't scale that well beyond a few tens of thousands of entries. Commented Dec 7, 2015 at 9:49

5 Answers


How would this value help in decreasing collisions?

Avoiding hash collisions is primarily achieved by a good distribution across the whole hash range (here, the int type).

By defining 0 as the initial value for calculating the hash result, you have a somewhat restricted distribution in a small range. Objects that differ in a minor way - maybe in some field only - produce hash codes that are not far away from each other. This makes hash collisions more likely.

By defining a non-zero initial value, you simply increase the gaps between calculated hash codes for objects that differ only in a minor way. So you better utilize the hash range and effectively make hash collisions more unlikely.
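A small sketch may make this concrete. The fold helper below is my own illustration of the step-2.b combining rule, not code from the book: with a zero initial value, leading zero-valued fields leave no trace in the result, so objects with different numbers of leading zero fields collide; with 17 they don't.

```java
public class InitialValueDemo {
    // Fold fields into a hash, step-2.b style, starting from a given initial value.
    static int fold(int init, int... fields) {
        int result = init;
        for (int f : fields) {
            result = 31 * result + f;
        }
        return result;
    }

    public static void main(String[] args) {
        // With init 0, leading zero fields are invisible: both calls yield 5.
        System.out.println(fold(0, 0, 0, 5));   // 5
        System.out.println(fold(0, 5));         // 5  -> collision
        // With init 17, every folded field shifts the result.
        System.out.println(fold(17, 0, 0, 5));  // 506452
        System.out.println(fold(17, 5));        // 532
    }
}
```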

What does he mean by saying that the exact value returned by hashCode is a function of the instance value?

It simply means that you should calculate the hash code from the object's value, i.e. the values of its fields. You already did this in your example, and I think you already implicitly understood it.

But: Joshua Bloch intended to say something else with this paragraph. He wanted to warn you against documenting the exact function by which the hash code is calculated. If you do, you restrict yourself: you can no longer change the implementation in future releases, because some users might expect a specific implementation, and you would break code that depends on yours.


1 Comment

When you say 'By defining a non-zero initial value, you simply increase the gaps between calculated hash codes for objects that differ only in a minor way...' is there any chance you could edit your answer with a simple example of when that would be the case? I'm struggling to see how that works, thanks :-)

See this example:

String a = "Abc";
String b = "Abc";
String c = "Pqr";
System.out.println(" " + a.hashCode() + " " + b.hashCode() + " " + c.hashCode());

Output: 65602 65602 80497

This clearly shows that the hashCode() of a string depends on its value.

Extract from hashCode() documentation:
int java.lang.String.hashCode()

Returns a hash code for this string. The hash code for a String object is computed as

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)
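The documented polynomial can be checked by hand. The stringHash helper below is a hypothetical re-implementation of that formula (via Horner's method, which computes the same polynomial with repeated multiply-and-add), so it should agree with String.hashCode():

```java
public class StringHashDemo {
    // Computes s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] by Horner's method.
    static int stringHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(stringHash("Abc"));           // 65602, matching the output above
        System.out.println("Abc".hashCode());            // 65602
        System.out.println(stringHash(""));              // 0, as the documentation specifies
    }
}
```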

2 Comments

Your observation makes sense to me. What I infer is that for the aforementioned classes, if the hashCode method depends on the instance variables, it might be hard to rewrite that hash method in a future release without breaking code written against the previous version. For example, two strings that are logically equal (when their hash codes are computed from the characters they contain) may end up with different hash codes if a future hashCode definition takes other parameters into account. In a way, there's a tight coupling between hashCode and the instance, which is always bad.
But I'm still not clear about the first part. Can you provide a reason for the initial value of 'result' not being 0? How would that help in reducing collisions?

The hashCode implementation in Effective Java specifically instructs you to pick a non-zero value for the initial value of result. As to your second question, hashCode is supposed to produce the same value when the internal state used for equals comparisons of the object is the same. So the fact that you'd get the same value when the instance variables are all zero fulfills the contract of hashCode. Note that the entire subheading is 'Always override hashCode when you override equals'.
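To see why that subheading matters, here is a sketch of what happens when equals is overridden but hashCode is not (the Key class is my own hypothetical example, not from the book): two logically equal keys almost always fall back to different identity hash codes, so a HashMap lookup misses.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class that overrides equals but deliberately NOT hashCode,
// so hashCode falls back to Object's identity-based default.
final class Key {
    final int id;
    Key(int id) { this.id = id; }

    @Override
    public boolean equals(Object o) {
        return o instanceof Key && ((Key) o).id == id;
    }
}

public class MissingHashCodeDemo {
    public static void main(String[] args) {
        Map<Key, String> map = new HashMap<>();
        map.put(new Key(42), "value");
        // An equal-but-distinct key almost certainly hashes to a different
        // bucket, so the lookup misses even though equals would return true.
        System.out.println(map.get(new Key(42)));
    }
}
```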



For your first question: if two objects are equal, they are supposed to return the same hash value, which is why overriding the hash method is a good idea when you override the equals method. It doesn't avoid collisions between equal objects, but it does make collisions less likely when objects are not equal, which is more important, since we want to be able to find unique objects as fast as possible.

As for your second question, I don't pretend to have much experience with designing hash codes, but I believe he means that certain objects may only ever return a single hash value (for example, a singleton).

He is saying that putting that value in the documentation is bad practice, since you may want to change the hash function later, or other variables within the hash function may change at a later date, altering the return value.

Either way, specifying the return value, or relying on a specified return value, is a bad idea.

2 Comments

My first question is about the initial value of result not being 0. How would that help in decreasing collisions? I know that I have to override hashCode when I override equals; this is more about the design of the hash method.
Ah ok, I misunderstood. It isn't just 0 that shouldn't be used, but any non-prime number. (Large prime numbers make the best hashing algorithms.) This is because operations on prime numbers are much less likely to end up with similar results. 0 is simply the worst non-prime number, because the result will often be the same for every number (e.g. 10 * 0 is 0, and so is 15 * 0).

First of all, I want to say a very important thing that is often not clearly stated:

Implementing hashCode is, in MOST CASES, NOT IMPORTANT. It boils down to a performance issue only. So if you have a problem with hashCode and object identity, just return -1. You will get poor performance but a solid and correct implementation, and until you have thousands of objects that make use of hashCode, you won't notice the poor performance. By the way: "collision" looks like a significant word in the context of hashCode. It is, but only if performance is really an issue. A collision of hashCode values does not mean your program works incorrectly. It means your program may run slower, because access via a key in a map causes a sequential iteration over objects with the same hashCode. In high-performance environments this may be an issue; in most cases it is not.

But what IS IMPORTANT if you override hashCode: you have to implement it CORRECTLY, so that the contract is always satisfied: if equals returns true, hashCode must return the same value.

One other thing: while you may not immediately run into problems with it, calculating hashCode from non-immutable values is a bad idea. This is because once hashCode is used, the object is placed at a specific position within the map. If the values the hashCode depends on change, the object may be lost or become hardly accessible. This WILL affect the correctness of your program.
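A short sketch of that failure mode, using a hypothetical mutable Point class (the class and field names are my own assumption):

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical class whose hashCode depends on mutable fields.
final class Point {
    int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        return o instanceof Point && ((Point) o).x == x && ((Point) o).y == y;
    }

    @Override
    public int hashCode() { return Objects.hash(x, y); }
}

public class MutableKeyDemo {
    public static void main(String[] args) {
        Set<Point> set = new HashSet<>();
        Point p = new Point(1, 2);
        set.add(p);

        p.x = 99;  // mutate a field the hash depends on

        // The set looks in the bucket for the NEW hash, but the entry was
        // stored under the OLD hash, so the element is now unreachable:
        System.out.println(set.contains(p));                // false
        System.out.println(set.contains(new Point(99, 2))); // false
        System.out.println(set.size());                     // 1: still stored, just lost
    }
}
```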

Conclusion: only make use of hashCode if you really need the performance, and then make sure you apply it correctly. It's easy to make mistakes here, and these mistakes may be the hardest to identify.

5 Comments

While it is true that you shouldn't worry about making a state-of-the-art hash function unless you need it, and that hash functions usually aren't particularly important, you should never just return some constant value. If you want a "no effort required" hash function, just use Objects.hash(field1, field2, field3).
As for your other suggestion about only relying on immutable fields, I feel that is misguided. Personally, I would be more confused / concerned if I used an object as a key in a HashMap, mutated it, and could still get the value out of the HashMap; I would wonder if, for example, something happened and it actually wasn't mutated.
Returning -1 is a solid fallback solution when your hashCode is corrupt AND the object's position in a hash structure relies on it. And your hashCode method is corrupt if it uses mutable fields. Try it on your own: put an object in a HashMap (as a key) or in a HashSet and mutate fields that are used in the hashCode method. The object becomes unreachable, because it now has a state that does not correspond to the bucket it was put in previously. BTW, you did not explain why -1 should never be returned, as I clearly stated that it is a fallback for when you have problems with this.
Exactly. Every Object you are likely to be familiar with will become unreachable; that is the expected behavior.
You shouldn't use a pathological hash function because there's simply no benefit. Mutating keys is a bad idea in general, so destroying the performance of hash-based containers just to allow yourself to possibly mutate them simply isn't a good trade-off. As for the fallback comment: it isn't hard to satisfy the hashCode() contract.
