An introduction to optimising a hashing strategy

Peter Lawrey
CEO @ Higher Frequency Trading Ltd



PROGRAMMING

The strategy used for hashing keys can have a direct impact on the performance of hashed collections such as a HashMap or HashSet. The built-in hashing functions are designed to be generic and work well in a wide range of use cases. Can we do better, especially if we have a good idea of the use case?

Testing a hashing strategy

In a previous article I looked at a number of ways to test hashing strategies, and in particular at a hashing strategy optimised for "Orthogonal Bits", which aims to make each hash result as different as possible when just one bit of the input changes.

However, if you have a known set of elements/keys to hash, you can optimise for that specific use case rather than trying to find a generic solution.

Minimising Collisions

One of the main things you want to avoid in a hashed collection is collisions. This is when two or more keys map to the same bucket. These collisions mean you have to do more work to check that the key is the one you expected, as there are now multiple keys in the same bucket. Ideally there is at most one key in each bucket.

I just need unique hash codes, don't I?

A common misconception is that to avoid collisions all you need is a unique hash code. While unique hash codes are highly desirable, they are not enough.

Say you have a set of keys and all of them have unique 32-bit hash codes. If you then have an array of 4 billion buckets, each key will have its own bucket, and there are no collisions. It is generally undesirable to have such large arrays for all hash collections. In fact, HashMap and HashSet are limited to the largest power-of-two size you can have for an array, which is 2^30, or just over one billion buckets.

What happens when you have a more realistically sized hashed collection? The number of buckets needs to be smaller, and the hash codes are reduced modulo the number of buckets. If the number of buckets is a power of two, you can use a mask of the lowest bits.
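
As a minimal sketch (the bucketIndex method here is illustrative, not the actual HashMap code), reducing a hash code to a bucket index for a power-of-two table looks like this:

// Illustrative only: reduce a 32-bit hash code to a bucket index
// for a table whose capacity is a power of two, e.g. 512.
static int bucketIndex(int hashCode, int capacity) {
    return hashCode & (capacity - 1);   // keeps only the lowest bits
}

// Two distinct hash codes can still land in the same bucket:
// 0x12345001 & 0x1FF == 1 and 0x6789A001 & 0x1FF == 1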

Let's look at an example, ftse350.csv. If we take the first column as a key or element, we get 352 strings. These strings have unique String.hashCode() values, but say we take only the lower bits of these hash codes. Do we see collisions?

The size of the HashMap for a load factor of 0.75 (the default) is 512, which uses a mask of the lower 9 bits. Around 30% of the keys have a collision even though we started with unique hash codes.

The code for HashTesterMain is here.
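
That source is not reproduced here, but the idea can be sketched roughly as follows (countCollisions and its use of java.util collections are illustrative, not the actual HashTesterMain code): count how many keys land in a bucket that is already occupied.

// Illustrative sketch: count collisions when hashing a set of strings
// into 2^bits buckets using only the lower bits of hashCode().
static int countCollisions(java.util.Collection<String> keys, int bits) {
    java.util.Set<Integer> usedBuckets = new java.util.HashSet<>();
    int collisions = 0;
    int mask = (1 << bits) - 1;
    for (String key : keys) {
        int bucket = key.hashCode() & mask;
        if (!usedBuckets.add(bucket))
            collisions++;             // this bucket already held a key
    }
    return collisions;
}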

To reduce the impact of a poor hashing strategy, HashMap uses an agitation function. In Java 8 it is fairly simple.

From the source for HashMap.hash (you can read the Javadoc for more details):

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

This mixes the high bits of the hash code with the low bits, to improve the randomness of the lower bits. For the case above, where there is a high collision rate, this gives an improvement.

A look at the hash function for String

The code for String.hashCode():

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;
        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

Note: the implementation for String is defined in the Javadoc, so there is little chance it will change, but we could define a new hashing strategy.
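
As a sketch of what defining a new hashing strategy could look like (the HashingStrategy interface and the multiplier 109 are hypothetical, not taken from any particular library), the hash computation can be made pluggable rather than relying on the key's own hashCode():

// Hypothetical interface: a pluggable hashing strategy for keys of type K.
interface HashingStrategy<K> {
    int hash(K key);
}

// An alternative strategy for String using a different multiplier.
HashingStrategy<String> strategy = s -> {
    int h = 0;
    for (int i = 0; i < s.length(); i++)
        h = 109 * h + s.charAt(i);   // 109 is just an example multiplier
    return h;
};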

Components of a hashing strategy

There are two parts I look at in a hashing strategy: the choice of magic numbers, and the structure of the code.

While magic numbers matter, the reason you don't want them to be too important is that there is always a chance that your choice of magic number wasn't right for a given use case.  This is why you also want a code structure which has a low worst case outcome even for a poorly chosen magic number.

Let's try some different multiplying factors instead of 31.

The choice of magic number matters, but there are also lots of numbers to try, so we need to write a test to try out a good random selection. The source for HashSearchMain is here.

The key code to look at is:

// Build a hash from the string using the given multiplier instead of 31.
public static int hash(String s, int multiplier) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = multiplier * h + s.charAt(i);
    }
    return h;
}

// Agitate by xor-ing the high 16 bits into the low 16 bits.
private static int xorShift16(int hash) {
    return hash ^ (hash >> 16);
}

// Agitate by adding the high 16 bits to the low 16 bits.
private static int addShift16(int hash) {
    return hash + (hash >> 16);
}

// Agitate twice, mixing bits from four different portions of the hash code.
private static int xorShift16n9(int hash) {
    hash ^= (hash >>> 16);
    hash ^= (hash >>> 9);
    return hash;
}

As you can see, repeatedly multiplying the hash and adding the next character is reasonable if you provide a good multiplier, or a multiplier which happens to work well with your key set. If you use 130795 as a multiplier instead of 31, you get only 81 collisions instead of 103 collisions for the key set tested.

If you use the agitation function as well, you can get around 68 collisions. This is getting close to the same collision rate as doubling the size of the array, i.e. an improved collision rate without using more memory.

But what happens when we add new keys to the hash collection, will our magic number still be good for us? This is where I look at the worst collision rates to determine which structure is likely to produce good results for a wider range of possible inputs. The worst case for hash() is 250 collisions; that is 70% of keys colliding, which is pretty bad. The agitation function improves this a little, however it's still not great. Note: if we add the shifted value instead of xor-ing it, we get a worse result in this case.

However, if we do two shifts, to mix not just the top and bottom bits but bits from four different portions of the generated hash code, we find that the worst-case collision rate is much lower. This indicates to me that should the selection of keys change, we are less likely to get a bad result, as the structure is better and the choice of magic number or choice of inputs matters less.

What if we use xor instead of add in the hash function?

In the agitation function, using xor appeared to be better than using add. What happens if we replace this

h = multiplier * h + s.charAt(i);

with

h = multiplier * h ^ s.charAt(i);

The best-case numbers are slightly better, however the worst-case collision rates are notably higher. This indicates to me that the choice of magic number matters more, but it also means that the choice of keys will matter more. This would seem a risky choice, as you have to consider that the keys may change over time.

Why do we choose odd multipliers?

When you multiply by an odd number, the lower bit of the result has an equal chance of being 0 or 1. This is because 0 * 1 = 0 and 1 * 1 = 1. However, if you multiply by an even number, the lower bit is always going to be 0, i.e. it is no longer random. Say we repeat the earlier test but only using even numbers; how does this look?

If you are lucky and have the right input for your magic number, the results are just as good as for odd numbers; however, if you are unlucky the results can be pretty bad. 325 collisions means that only 27 out of 512 buckets are being used.
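
A tiny check makes the low-bit effect described above visible; this snippet (illustrative only) compares an even multiplier, 30, with an odd one, 31:

// With an even multiplier the product's low bit is always 0;
// with an odd multiplier it tracks the low bit of h.
for (int h = 0; h < 4; h++) {
    System.out.println((30 * h & 1) + " vs " + (31 * h & 1));
    // prints: 0 vs 0, 0 vs 1, 0 vs 0, 0 vs 1
}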

How do more advanced hashing strategies differ?

For the hashing strategies we use based on City, Murmur, XXHash and Vanilla Hash (our own), the agitation function is more complex.
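
As an illustration of what "more complex" can mean, here is the well-known 64-bit finaliser from MurmurHash3; it is shown only as an example of a stronger avalanche step, not as the exact code used by the strategies above:

// MurmurHash3's 64-bit finaliser: every input bit affects every output bit.
static long fmix64(long h) {
    h ^= h >>> 33;
    h *= 0xff51afd7ed558ccdL;
    h ^= h >>> 33;
    h *= 0xc4ceb9fe1a85ec53L;
    h ^= h >>> 33;
    return h;
}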

In summary

By exploring how we generate the hash code, we have found ways to reduce the number of collisions for our 352 keys from 103 down to 68, and we also have some confidence that, should the key set change, we have reduced the impact this might have.

This is without using more memory, or even much more processing power.

We still have the option of utilising more memory.

For comparison, doubling the size of the array can improve the best case, but you still have the problem that a mismatch between the key set and the magic number can result in a high collision rate.

Conclusion

In situations where you have a stable key set, you can get a significant improvement in the rate of collisions by tuning the hashing strategy used.

You also need tests which indicate how bad things are likely to get if the key set changes without re-optimisation. 

Using these two in combination you can develop new hashing strategies to improve performance without having to use more memory or much more CPU.
