Vectorised Polynomial Hash Codes
To provide support for the idea of pluggable hashing strategies, Peter Lawrey demonstrates that there are better and faster hash codes than the JDK implementation of String.hashCode
or Arrays.hashCode
. I really like the analysis of output distribution so recommend reading the post. However, I’m not absolutely sure if pluggable hashing strategies would be a good idea. They can induce coupling between the strategy implementation and the contents of the hashed data structure, which have different life cycles and code ownership. If performance is what matters, why not just make the existing algorithm much faster?
Peter’s hash code uses Unsafe
to reinterpret each four bytes as an integer, but is otherwise just another polynomial hash code with a different coefficient. It produces a slightly different result, but with potentially better properties. Here’s that hash code.
private static final int M2 = 0x7A646E4D;
// read 4 bytes at a time from a byte[] assuming Java 9+ Compact Strings
private static int getIntFromArray(byte[] value, int i) {
return UnsafeMemory.INSTANCE.UNSAFE.getInt(value, BYTE_ARRAY_OFFSET + i);
}
public static int nativeHashCode(byte[] value) {
long h = getIntFromArray(value, 0);
for (int i = 4; i < value.length; i += 4)
h = h * M2 + getIntFromArray(value, i);
h *= M2;
return (int) h ^ (int) (h >>> 25);
}
Leaving the output distribution to one side, Peter reports that this hash code performs better than Arrays.hashCode(byte[])
and this is accurate. Where does the performance come from? The reintepretation reduces the number of multiplications by a factor of four, but you need Unsafe
to achieve this. This also obviates the need to convert each byte to an integer to avoid overflow. Another problem is solved just by changing the multiplier. Arrays.hashCode
is generally slow because the multiplication by 31 gets strength reduced to a left shift by five and a subtraction, which inadvertently creates a data dependency which can’t be unrolled. When the multiplier is 31, just unrolling the multiplication to disable the strength reduction can increase throughput by 2x, and the rather obscure choice of 0x7A646E4D
means that no such transformation takes place: this results in independent chains of multiplications and additions in the main loop:
0.18% 0.46% ││ 0x00007f3b21c05285: movslq 0x18(%rdx,%r8,1),%rsi
5.93% 6.28% ││ 0x00007f3b21c0528a: movslq 0x1c(%rdx,%r8,1),%rax
0.12% 0.42% ││ 0x00007f3b21c0528f: imul $0x7a646e4d,%rcx,%rcx
11.87% 37.31% ││ 0x00007f3b21c05296: movslq 0x14(%rdx,%r8,1),%rdi
0.10% 0.18% ││ 0x00007f3b21c0529b: movslq 0x10(%rdx,%r8,1),%r8
0.06% 0.58% ││ 0x00007f3b21c052a0: add %rcx,%r8
5.29% 1.30% ││ 0x00007f3b21c052a3: imul $0x7a646e4d,%r8,%r8
18.34% 21.94% ││ 0x00007f3b21c052aa: add %r8,%rdi
6.33% 1.96% ││ 0x00007f3b21c052ad: imul $0x7a646e4d,%rdi,%r8
17.60% 10.88% ││ 0x00007f3b21c052b4: add %r8,%rsi
5.39% 0.72% ││ 0x00007f3b21c052b7: imul $0x7a646e4d,%rsi,%rcx
17.80% 11.58% ││ 0x00007f3b21c052be: add %rax,%rcx
Is this as good as it can get and is there something fundamentally wrong with the JDK algorithm? The algorithm can be vectorised, but this is beyond C2’s autovectoriser. The same algorithm is used for Arrays.hashCode(int[])
, which doesn’t have the complication of type promotion from byte
to int
. I have noted before that this can be transformed to a loop which C2 can autovectorise by precomputing the coefficients of the polynomial (i.e. the powers of 31 until they repeat modulo 32, but x -> x * 31
has a very long period modulo 32) but this requires either an enormous array or a maximum length.
private int[] coefficients;
private int seed;
void init(int size) {
coefficients = new int[size];
coefficients[size - 1] = 1;
for (int i = size - 2; i >= 0; --i) {
coefficients[i] = 31 * coefficients[i + 1];
}
seed = 31 * coefficients[0];
}
public int hashCodeAutoVectorised() {
int result = seed;
for (int i = 0; i < data.length && i < coefficients.length; ++i) {
result += coefficients[i] * data[i];
}
return result;
}
This idea isn’t practical but demonstrates that this kind of polynomial can be computed efficiently, if only the coefficients could be generated without disabling autovectorisation. Generating the coefficients on the fly is possible with the Vector API. It requires a multiplier containing eight consecutive powers of 31, and the exponent of each element needs to be increased by eight in each iteration. This can be achieved with a broadcast variable.
public int polynomialHashCode() {
var next = YMM_INT.broadcast(POWERS_OF_31_BACKWARDS[33 - 9]);
var coefficients = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 8);
var acc = YMM_INT.zero();
for (int i = data.length; i - YMM_INT.length() >= 0; i -= YMM_INT.length()) {
acc = acc.add(coefficients.mul(YMM_INT.fromArray(data, i - YMM_INT.length())));
coefficients = coefficients.mul(next);
}
return acc.addAll() + coefficients.get(7);
}
There’s a problem here - it sandwiches a low latency addition between two high latency multiplications, so there is a data dependency and unrolling without breaking the dependency isn’t necessarily helpful. The dependency can be broken manually by using four accumulators, four coefficient vectors consisting of 32 consecutive powers of 31, and each coefficient must have its logarithm increased by 32 in each iteration. It may look dirty but the dependencies are eradicated.
public int polynomialHashCodeUnrolled() {
var next = YMM_INT.broadcast(POWERS_OF_31_BACKWARDS[0]);
var coefficients1 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 8);
var coefficients2 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 16);
var coefficients3 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 24);
var coefficients4 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 32);
var acc1 = YMM_INT.zero();
var acc2 = YMM_INT.zero();
var acc3 = YMM_INT.zero();
var acc4 = YMM_INT.zero();
for (int i = data.length; i - 4 * YMM_INT.length() >= 0; i -= YMM_INT.length() * 4) {
acc1 = acc1.add(coefficients1.mul(YMM_INT.fromArray(data, i - YMM_INT.length())));
acc2 = acc2.add(coefficients2.mul(YMM_INT.fromArray(data, i - 2 * YMM_INT.length())));
acc3 = acc3.add(coefficients3.mul(YMM_INT.fromArray(data, i - 3 * YMM_INT.length())));
acc4 = acc4.add(coefficients4.mul(YMM_INT.fromArray(data, i - 4 * YMM_INT.length())));
coefficients1 = coefficients1.mul(next);
coefficients2 = coefficients2.mul(next);
coefficients3 = coefficients3.mul(next);
coefficients4 = coefficients4.mul(next);
}
return acc1.add(acc2).add(acc3).add(acc4).addAll() + coefficients1.get(7);
}
The implementation above does not handle arbitrary length input, but the input could either be padded with zeroes (see this paper on padded SLP autovectorisation) or followed by a post loop. It produces the same output as Arrays.hashCode
, so how much faster is it?
Benchmark | (size) | Mode | Cnt | Score | Error | Units |
---|---|---|---|---|---|---|
arraysHashCode | 1024 | thrpt | 20 | 1095.089 | 3.980 | ops/ms |
arraysHashCode | 65536 | thrpt | 20 | 16.963 | 0.130 | ops/ms |
hashCodeAutoVectorised | 1024 | thrpt | 20 | 3716.853 | 18.025 | ops/ms |
hashCodeAutoVectorised | 65536 | thrpt | 20 | 57.265 | 0.907 | ops/ms |
polynomialHashCode | 1024 | thrpt | 20 | 2623.090 | 7.920 | ops/ms |
polynomialHashCode | 65536 | thrpt | 20 | 39.344 | 0.238 | ops/ms |
polynomialHashCodeUnrolled | 1024 | thrpt | 20 | 8266.119 | 34.480 | ops/ms |
polynomialHashCodeUnrolled | 65536 | thrpt | 20 | 131.196 | 6.234 | ops/ms |
So there really is absolutely nothing wrong with the algorithm from a performance perspective, but the implementation can be improved vastly (~8x). It seems that the tools required for JDK engineers and users alike to make optimisations like these are in the pipeline!
What about byte arrays? One obstacle to vectorisation is that to implement the JDK algorithm strictly, the bytes must accumulate into 32 bit values, which means that the lanes need to widen, so the contents of a single vector register of bytes would need to fan out to four vector registers of integers. This would be achievable by loading vectors at eight byte offsets and permuting the first eight bytes of each vector into every fourth position, but this is quite convoluted.
Peter demonstrates that reinterpreting four bytes as an integer doesn’t necessarily degrade the hash distribution, and may even improve it, so I used the same trick: rebracketing a ByteVector
to an IntVector
to produce the “wrong” result, but a reasonable hash code nonetheless. A nice feature of the Vector API is allowing this kind of reinterpretation without resorting to Unsafe
, via Vector.rebracket
.
public int hashCodeVectorAPINoDependencies() {
var next = YMM_INT.broadcast(POWERS_OF_31[8]);
var coefficients1 = YMM_INT.fromArray(POWERS_OF_31, 0);
var coefficients2 = coefficients1.mul(next);
var coefficients3 = coefficients2.mul(next);
var coefficients4 = coefficients3.mul(next);
next = next.mul(next);
next = next.mul(next);
var acc1 = YMM_INT.zero();
var acc2 = YMM_INT.zero();
var acc3 = YMM_INT.zero();
var acc4 = YMM_INT.zero();
for (int i = 0; i < data.length; i += YMM_BYTE.length() * 4) {
acc1 = acc1.add(coefficients1.mul(YMM_BYTE.fromArray(data, i).rebracket(YMM_INT)));
acc2 = acc2.add(coefficients2.mul(YMM_BYTE.fromArray(data, i + YMM_BYTE.length()).rebracket(YMM_INT)));
acc3 = acc3.add(coefficients3.mul(YMM_BYTE.fromArray(data, i + 2 * YMM_BYTE.length()).rebracket(YMM_INT)));
acc4 = acc4.add(coefficients4.mul(YMM_BYTE.fromArray(data, i + 3 * YMM_BYTE.length()).rebracket(YMM_INT)));
coefficients1 = coefficients1.mul(next);
coefficients2 = coefficients2.mul(next);
coefficients3 = coefficients3.mul(next);
coefficients4 = coefficients4.mul(next);
}
return acc1.add(acc2).add(acc3).add(acc4).addAll();
}
Owing to its use of better tools yet to be released, this version is many times faster than either the JDK implementation or Peter’s.
Benchmark | (size) | Mode | Cnt | Score | Error | Units |
---|---|---|---|---|---|---|
arraysHashCode | 128 | thrpt | 20 | 8897.392 | 220.582 | ops/ms |
arraysHashCode | 256 | thrpt | 20 | 4286.621 | 156.794 | ops/ms |
arraysHashCode | 512 | thrpt | 20 | 2024.858 | 72.030 | ops/ms |
arraysHashCode | 1024 | thrpt | 20 | 1002.173 | 39.917 | ops/ms |
hashCodeVectorAPINoDependencies | 128 | thrpt | 20 | 88395.374 | 3369.397 | ops/ms |
hashCodeVectorAPINoDependencies | 256 | thrpt | 20 | 64799.697 | 1035.175 | ops/ms |
hashCodeVectorAPINoDependencies | 512 | thrpt | 20 | 48248.967 | 864.728 | ops/ms |
hashCodeVectorAPINoDependencies | 1024 | thrpt | 20 | 27429.025 | 916.850 | ops/ms |
hashCodeVectorAPIDependencies | 128 | thrpt | 20 | 96679.811 | 316.470 | ops/ms |
hashCodeVectorAPIDependencies | 256 | thrpt | 20 | 52140.582 | 1825.173 | ops/ms |
hashCodeVectorAPIDependencies | 512 | thrpt | 20 | 26327.195 | 492.730 | ops/ms |
hashCodeVectorAPIDependencies | 1024 | thrpt | 20 | 10929.500 | 351.732 | ops/ms |
nativeHashCode | 128 | thrpt | 20 | 38799.185 | 188.636 | ops/ms |
nativeHashCode | 256 | thrpt | 20 | 17438.536 | 257.168 | ops/ms |
nativeHashCode | 512 | thrpt | 20 | 7041.815 | 209.817 | ops/ms |
nativeHashCode | 1024 | thrpt | 20 | 3217.187 | 96.379 | ops/ms |
Benchmark source code
Paul Sandoz discussed this topic at Oracle Code One 2018: Slides, Video.