TSM - An introduction to optimising a hashing strategy

Peter Lawrey - CEO @ Higher Frequency Trading Ltd

The strategy that's used for hashing keys, can have a direct impact on the performance of a hashed collections such as a HashMap or HashSet. The built-in hashing functions are designed to be generic and work well in a wide range of use cases. Can we do better, especially if you have a good idea of the use case?

Testing a hashing strategy

In a previous article I looked at a number of ways to test hashing strategies and in particular looked at a hashing strategy which had been optimised for "Orthogonal Bits" which looked at making sure each hash result was as different as possible based on just one bit changing.

However, if you have a known set of elements/keys to hash you can optimise for that specific use case, rather trying to find a generic solution.

Minimising Collisions

One of the main things you want to avoid in a hashed collection is collisions. This is when two or more keys map to the same bucket. These collisions mean you have to do more work to check the key is the one you expected as there is now multiple keys in the same bucket. Ideally there is at most 1 key in each bucket.

I just need unique hash codes don't I?

A common misconception is that to avoid collisions all you need to have unique hash code. While unique hash codes is highly desirable, it is not enough.

Say you have a set of keys and all of them have unique 32-bit hash codes. If you then have an array of 4 billion buckets, each key will have its own bucket, and there is no collisions. It is generally undesirable to have such large arrays for all hash collections. In fact HashMap and HashSet are limited by the largest power of 2 size you can have for an array which is 2^30 or just over one billion.

What happens when you have a more realistically sized hashed collection? The number of buckets needs to be smaller and the hash codes are modulo-ed to the number of buckets. If the number of buckets is a power of two, you can use a mask of the lowest bits.

Lets look at an example, ftse350.csv If we take the first column as a key or element, we get 352 strings. These strings have unique String.hashCode()s, but say we take the lower bits of these hash code. Do we see collisions?

The size of the HashMap for a load factor of 0.7 (the default) is 512 which uses a mask of the lower 9 bits. As you can see around 30% of keys have a collision even though we started with unique hash codes.

To reduce the impact of a poor hashing strategy, the HashMap uses an agitating function. In Java 8 it is fairly simple.

This mixes the high bits of the hash code with the low bits, to improve the randomness of the lower bits. For the case above where there is a high collision rate, there is an improvement. See the third column.

A look at the hash function for String

Note: the implementation for String is defined in the Javadoc so there is little chance we can change it but we could define a new hashing strategy.

Components of a hashing strategy.

While magic numbers matter, the reason you don't want them to be too important is that there is always a chance that your choice of magic number wasn't right for a given use case. This is why you also want a code structure which has a low worst case outcome even for a poorly chosen magic number.

Lets try some different multiplying factors instead of 31.

You can see that the choice of a magic number matters, but also there is lots of numbers to try. We need to write a test to try out a good random selection. The source for HashSearchMain.

As you can see the repeated multiplication of each hash plus the next character is reasonable if you provide a good multiplier, or a multiplier which happens to work well with your key set. If you compare 130795 as a multiplier instead of 31, you get only 81 collisions instead of 103 collisions for the key set tested.

If you use the agitation function as well you can get around 68 collisions. This is getting close to the same collision rate as doubling the size of the array. i.e. an improved collision rate without using more memory.

But what happens when we add new keys to the hash collection, will our magic number still be good for us? This is where I look at the worst collision rates to determine which structure is likely to produce good results for a wider range of possible inputs. The worst case for hash() is 250 collisions, That is 70% of keys colliding which is pretty bad. The agitation function improves this a little however it's still not great. Note: if we add the shifted value instead of xor-ing it we get a worse result in this case.

However if we do two shifts, to mix not just the top and bottom bits, but bits from four different portions of the hash code generated, we find that the worst case collision rate is much lower. This indicates to me that should the selection of keys change, we are less likely to get a bad result as the structure is better and the choice of magic number or choice of inputs matters less.

What if we have add instead of xor in the hash function?

In the agitation function using xor was perhaps better than using add. What happens if we change this

The best case numbers are slightly better, however the worst case collision rate are notably higher. This indicates to me that the choice of magic number matters more, but it also means that choice of keys will matter more. This would seem a risky choice as you have to consider that the keys may change over time.

Why do we chose odd multipliers?

When you multiply by an odd number the lower bit of the result has an equal chance of being 0 or 1. This is because 0 1 = 0 and 1 1 = 1. However, if you multiply by an even number the lower bit is always going to 0. i.e. it is no longer random. Say we repeat the earlier test but only using even numbers, how does this look?

If you are lucky and have the right input for your magic number the results are just as good as for odd numbers, however if you are unlucky, the results can be pretty bad. 325 collisions means that only 27 out of 512 buckets are being used.

How do more advanced hashing strategies differ?

For the hashing strategies we use based on City, Murmur, XXHash and Vanilla Hash (our own)

In summary

By exploring how we generate the hash code, we have found ways to reduce the number of collisions for 352 keys down from 103 collisions to 68 collisions but also have some confidence than should the key set change, we have reduced the impact this might have had.

For comparison, you can see that doubling the size of the array can improve the best case, but you still have the problem that a miss match between the key set and the magic number can still have a high collision rate.

Conclusion

In situations where you have a stable key set you can get a significant improvement in the rate of collisions by tuning the hashing strategy used.

You also need tests which indicate how bad things are likely to get if the key set changes without re-optimisation.

Using these two in combination you can develop new hashing strategies to improve performance without having to use more memory or much more CPU.