What’s In A Hash Collision?

When I have used Custom Types in Map Keys and Sets, I generally follow a similar pattern listed in the Apex Developer Guide:

public Integer hashCode() {
    return (31 * x) ^ y;

Someone asked recently and I honestly don’t know: Why Multiply by 31? My guess is the primary intent is to reduce the risk of hash collision. Does it matter that it’s prime? Are larger or smaller numbers worse for some reason?

The involvement of this seemingly magic number is especially interesting because sometimes we actually want all the keys to return the same hash code, for example when comparing date ranges.

Consider this example (Execute Anonymous script) which seems to operate the same whether the hashes are distinct or all the same:

class Key
    final String value1, value2;
    Key(String value1, String value2)
        this.value1 = value1;
        this.value2 = value2;
    public Boolean equals(Object instance)
        Key that = (Key)instance;
        return this.value1 == that.value1 &&
            this.value2 == that.value2;
    public Integer hashCode() { return 1; }
    // as opposed to (31 * value1.hashCode()) ^ value2.hashCode()

Set<Key> keys = new Set<Key>
    new Key('a', 'a'), new Key('a', 'a'),
    new Key('b', 'b'), new Key('b', 'b'),
    new Key('c', 'c'), new Key('c', 'c'),
    new Key('d', 'd'), new Key('d', 'd')
system.assertEquals(4, keys.size());

What is the benefit in returning distinct hash codes, when the behavior appears to be the same? Is there a performance gain?


The difference is how many times equals will be called (and thus, your performance). As an example, let’s take a look at the following class:

public class KeyTrial {
    public static Integer eCounter = 0, hCounter = 0;
    Integer value;
    public KeyTrial(Integer val) {
        value = val;
    public Boolean equals(Object o) {
        return ((KeyTrial)o).value == value;
    public Integer hashCode() {
        return value;

Now, we’ll execute the following code:

Set<KeyTrial> trials = new Set<KeyTrial>();
Long startTime = DateTime.now().getTime();
for(Integer i = 0; i < 1000; i++) {
    trials.add(new KeyTrial((Math.random()*10000000).intValue()));
Long endTime = DateTime.now().getTime();
System.debug(LoggingLevel.error, 'equals called: '+KeyTrial.eCounter);
System.debug(LoggingLevel.error, 'hashCode called: '+KeyTrial.hCounter);
System.debug(LoggingLevel.error, 'time elapsed: '+(endTime-startTime));

The eCounter (number of equals methods called) will vary randomly, only when there’s a hash collision. In a sample run, I got the following output:

14:24:09.1 (129448217)|USER_DEBUG|[7]|ERROR|equals called: 0
14:24:09.1 (129469186)|USER_DEBUG|[8]|ERROR|hashCode called: 1000
14:24:09.1 (129491729)|USER_DEBUG|[9]|ERROR|time elapsed: 117

As you can see, there were no collisions, and it ran 1000 items in 117 ms.

In contrast, when we change the return of hashCode to 1, we get the following output:

14:23:16.11 (4860282035)|USER_DEBUG|[7]|ERROR|equals called: 508207
14:23:16.11 (4860302147)|USER_DEBUG|[8]|ERROR|hashCode called: 1000
14:23:16.11 (4860324908)|USER_DEBUG|[9]|ERROR|time elapsed: 4856

The time difference is staggering. It took 41 times more CPU time, because equals had to be called 508207 times to verify if there was a match. The exact performance will vary, but expect it to be absolutely terrible.

This is because Set collections are b-trees. In other words, they’re sorted by hash code first, then compared by equals if more than 1 item exists for that hash code to verify if it is a duplicate.

B-trees are quick because they can quickly locate an entry even in thousands of items. If you’ve ever played “guess my number”, where someone thinks of a value between 1 and 100, and they tell you higher or lower for each guess, you know that you should binary search, for example, guessing 50, 75, 63, 69, 66, 65 instead of counting from 1 to 100 until you hit it. In fact, if you do a binary search, you are guaranteed to hit the correct value within 7 guesses.

Similarly, a binary search only needs to make 1 comparison for each bit in the size of the item list. For example, if there are 1000 items, it requires no more than 10 decisions to find the correct hash code.

You can read more about them by searching your favorite engine, or looking at how the hashmap works in Java.

There’s a variety of ways you can generate a unique hash code; you should select an algorithm that suits your purposes. Simple multiplication is a bad choice, because x times y is the same as y times x, so you’d have far more hash collisions. Hash collisions means that equals needs to be called, which is bad. Ideally, you want to spread your values across a wide spectrum to avoid collisions. However, using something like (31 * x) ^ y will not yield the same value as (31 * y) ^ x. You’re using a non-associative property, so there will be fewer hash collisions.

As for why 31 was selected, I found this answer, which says it better than I ever could:

Because you want the number you are multiplying by and the number of
buckets you are inserting into to have orthogonal prime

Suppose there are 8 buckets to insert into. If the number you are
using to multiply by is some multiple of 8, then the bucket inserted
into will only be determined by the least significant entry (the one
not multiplied at all). Similar entries will collide. Not good for a
hash function.

31 is a large enough prime that the number of buckets is unlikely to
be divisible by it (and in fact, modern java HashMap implementations
keep the number of buckets to a power of 2).

Source : Link , Question Author : Adrian Larson , Answer Author : Community

Leave a Comment