To provide support for the idea of pluggable hashing strategies, Peter Lawrey demonstrates that there are better and faster hash codes than the JDK implementations of String.hashCode and Arrays.hashCode. I really like the analysis of output distribution, so I recommend reading the post. However, I’m not convinced that pluggable hashing strategies are a good idea: they can induce coupling between the strategy implementation and the contents of the hashed data structure, which have different life cycles and code ownership. If performance is what matters, why not just make the existing algorithm much faster?

Peter’s hash code uses Unsafe to reinterpret each group of four bytes as an int, but is otherwise just another polynomial hash code with a different multiplier. It produces a slightly different result, but with potentially better properties. Here’s that hash code.

private static final int M2 = 0x7A646E4D;

// read 4 bytes at a time from a byte[] assuming Java 9+ Compact Strings;
// BYTE_ARRAY_OFFSET is presumably Unsafe.arrayBaseOffset(byte[].class), and
// UnsafeMemory caches a sun.misc.Unsafe instance
private static int getIntFromArray(byte[] value, int i) {
    return UnsafeMemory.INSTANCE.UNSAFE.getInt(value, BYTE_ARRAY_OFFSET + i); 
}

public static int nativeHashCode(byte[] value) {
    long h = getIntFromArray(value, 0);
    for (int i = 4; i < value.length; i += 4)
        h = h * M2 + getIntFromArray(value, i);
    h *= M2;
    return (int) h ^ (int) (h >>> 25);
}

Leaving the output distribution to one side, Peter reports that this hash code performs better than Arrays.hashCode(byte[]), and this is accurate. Where does the performance come from? The reinterpretation reduces the number of multiplications by a factor of four, though you need Unsafe to achieve it, and it also obviates the need to widen each byte to an int. Another problem is solved just by changing the multiplier. Arrays.hashCode is generally slow because the multiplication by 31 gets strength reduced to a left shift by five and a subtraction, which inadvertently creates a data dependency which can’t be unrolled. When the multiplier is 31, just unrolling the multiplication to disable the strength reduction can increase throughput by 2x, whereas the rather obscure choice of 0x7A646E4D means that no such transformation takes place: the result is independent chains of multiplications and additions in the main loop:

  0.18%    0.46%     ││  0x00007f3b21c05285: movslq 0x18(%rdx,%r8,1),%rsi
  5.93%    6.28%     ││  0x00007f3b21c0528a: movslq 0x1c(%rdx,%r8,1),%rax
  0.12%    0.42%     ││  0x00007f3b21c0528f: imul   $0x7a646e4d,%rcx,%rcx
 11.87%   37.31%     ││  0x00007f3b21c05296: movslq 0x14(%rdx,%r8,1),%rdi
  0.10%    0.18%     ││  0x00007f3b21c0529b: movslq 0x10(%rdx,%r8,1),%r8
  0.06%    0.58%     ││  0x00007f3b21c052a0: add    %rcx,%r8
  5.29%    1.30%     ││  0x00007f3b21c052a3: imul   $0x7a646e4d,%r8,%r8
 18.34%   21.94%     ││  0x00007f3b21c052aa: add    %r8,%rdi
  6.33%    1.96%     ││  0x00007f3b21c052ad: imul   $0x7a646e4d,%rdi,%r8
 17.60%   10.88%     ││  0x00007f3b21c052b4: add    %r8,%rsi
  5.39%    0.72%     ││  0x00007f3b21c052b7: imul   $0x7a646e4d,%rsi,%rcx
 17.80%   11.58%     ││  0x00007f3b21c052be: add    %rax,%rcx 
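
For comparison, here is a minimal sketch of the unrolling trick mentioned above for multiplier 31 (an illustration, not taken from either benchmark): folding powers of 31 into constants keeps the accumulator multiplication as a single imul and exposes independent multiply-adds in each iteration, while computing exactly what Arrays.hashCode(int[]) computes.

// sketch: unroll by four, folding 31^2, 31^3 and 31^4 into constants
static int unrolledHashCode(int[] data) {
    int result = 1;
    int i = 0;
    for (; i + 3 < data.length; i += 4) {
        result = 923521 * result      // 31^4
               + 29791 * data[i]      // 31^3
               + 961 * data[i + 1]    // 31^2
               + 31 * data[i + 2]
               + data[i + 3];
    }
    for (; i < data.length; ++i) {    // post loop for the remainder
        result = 31 * result + data[i];
    }
    return result;
}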

Is this as good as it can get, and is there something fundamentally wrong with the JDK algorithm? The algorithm can be vectorised, but this is beyond C2’s autovectoriser. The same algorithm is used for Arrays.hashCode(int[]), which doesn’t have the complication of type promotion from byte to int. I have noted before that this can be transformed to a loop which C2 can autovectorise by precomputing the coefficients of the polynomial (i.e. the powers of 31 until they repeat modulo 2^32), but x -> x * 31 has a very long period modulo 2^32, so this requires either an enormous array or a cap on the input length.

    private int[] coefficients;
    private int seed;

    void init(int size) {
        coefficients = new int[size];
        coefficients[size - 1] = 1;
        for (int i = size - 2; i >= 0; --i) {
            coefficients[i] = 31 * coefficients[i + 1]; // coefficients[i] = 31^(size - 1 - i)
        }
        seed = 31 * coefficients[0]; // 31^size, the contribution of the initial 1 in Arrays.hashCode
    }

    public int hashCodeAutoVectorised() {
        int result = seed;
        for (int i = 0; i < data.length && i < coefficients.length; ++i) {
            result += coefficients[i] * data[i];
        }
        return result;
    }

This idea isn’t practical, but it demonstrates that this kind of polynomial can be computed efficiently, if only the coefficients could be generated without disabling autovectorisation. Generating the coefficients on the fly is possible with the Vector API. It requires a vector of eight consecutive powers of 31 as the coefficients, and the exponent of each lane needs to increase by eight in each iteration, which can be achieved by multiplying by a broadcast of 31^8.
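
For reference, the table of powers isn’t shown in the listings: it holds the 33 powers of 31 from 31^32 down to 31^0 in descending order of exponent (wrapping modulo 2^32), and could be built like the sketch below. YMM_INT and YMM_BYTE are 256-bit int and byte vector species from the incubating Vector API of the time.

  // assumed layout: POWERS_OF_31_BACKWARDS[32] == 31^0 == 1,
  // POWERS_OF_31_BACKWARDS[0] == 31^32 (arithmetic wraps modulo 2^32)
  private static final int[] POWERS_OF_31_BACKWARDS = new int[33];
  static {
    POWERS_OF_31_BACKWARDS[32] = 1;
    for (int i = 31; i >= 0; --i) {
      POWERS_OF_31_BACKWARDS[i] = 31 * POWERS_OF_31_BACKWARDS[i + 1];
    }
  }

With the table in place, the straightforward vectorised loop looks like this.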

  public int polynomialHashCode() {
    var next = YMM_INT.broadcast(POWERS_OF_31_BACKWARDS[33 - 9]);
    var coefficients = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 8);
    var acc = YMM_INT.zero();
    for (int i = data.length; i - YMM_INT.length() >= 0; i -= YMM_INT.length()) {
      acc = acc.add(coefficients.mul(YMM_INT.fromArray(data, i - YMM_INT.length())));
      coefficients = coefficients.mul(next);
    }
    return acc.addAll() + coefficients.get(7);
  }

There’s a problem here: a low latency addition is sandwiched between two high latency multiplications, so there is a data dependency, and unrolling without breaking that dependency isn’t necessarily helpful. The dependency can be broken manually by using four accumulators and four coefficient vectors covering 32 consecutive powers of 31 between them, with each coefficient’s exponent increased by 32 in each iteration. It may look dirty, but the dependencies are eradicated.

  public int polynomialHashCodeUnrolled() {
    var next = YMM_INT.broadcast(POWERS_OF_31_BACKWARDS[0]);
    var coefficients1 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 8);
    var coefficients2 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 16);
    var coefficients3 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 24);
    var coefficients4 = YMM_INT.fromArray(POWERS_OF_31_BACKWARDS, 33 - 32);
    var acc1 = YMM_INT.zero();
    var acc2 = YMM_INT.zero();
    var acc3 = YMM_INT.zero();
    var acc4 = YMM_INT.zero();
    for (int i = data.length; i - 4 * YMM_INT.length() >= 0; i -= YMM_INT.length() * 4) {
      acc1 = acc1.add(coefficients1.mul(YMM_INT.fromArray(data, i - YMM_INT.length())));
      acc2 = acc2.add(coefficients2.mul(YMM_INT.fromArray(data, i - 2 * YMM_INT.length())));
      acc3 = acc3.add(coefficients3.mul(YMM_INT.fromArray(data, i - 3 * YMM_INT.length())));
      acc4 = acc4.add(coefficients4.mul(YMM_INT.fromArray(data, i - 4 * YMM_INT.length())));
      coefficients1 = coefficients1.mul(next);
      coefficients2 = coefficients2.mul(next);
      coefficients3 = coefficients3.mul(next);
      coefficients4 = coefficients4.mul(next);
    }
    return acc1.add(acc2).add(acc3).add(acc4).addAll() + coefficients1.get(7);
  }

The implementation above does not handle arbitrary length input, but the input could either be padded with zeroes (see this paper on padded SLP autovectorisation) or followed by a post loop, as in the sketch below.
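
One way to write the post loop (a sketch, not part of the benchmarked code): declare i outside the vector loop, hash the elements the loop didn’t consume the way Arrays.hashCode would, and scale that partial result by the trailing power of 31 which coefficients1 conveniently holds once the loop exits.

  int i = data.length;
  for (; i - 4 * YMM_INT.length() >= 0; i -= YMM_INT.length() * 4) {
    // ... loop body exactly as in polynomialHashCodeUnrolled ...
  }
  // data[0..i) was not consumed by the vector loop; hash it scalar-style
  int head = 1;
  for (int k = 0; k < i; ++k) {
    head = 31 * head + data[k];
  }
  // lane 7 of coefficients1 now holds 31 raised to the number of elements the
  // vector loop consumed, which is the factor needed to combine the two parts
  return head * coefficients1.get(7) + acc1.add(acc2).add(acc3).add(acc4).addAll();

It produces the same output as Arrays.hashCode, so how much faster is it?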

Benchmark                   (size)   Mode  Cnt     Score   Error   Units
arraysHashCode                1024  thrpt   20  1095.089   3.980  ops/ms
arraysHashCode               65536  thrpt   20    16.963   0.130  ops/ms
hashCodeAutoVectorised        1024  thrpt   20  3716.853  18.025  ops/ms
hashCodeAutoVectorised       65536  thrpt   20    57.265   0.907  ops/ms
polynomialHashCode            1024  thrpt   20  2623.090   7.920  ops/ms
polynomialHashCode           65536  thrpt   20    39.344   0.238  ops/ms
polynomialHashCodeUnrolled    1024  thrpt   20  8266.119  34.480  ops/ms
polynomialHashCodeUnrolled   65536  thrpt   20   131.196   6.234  ops/ms

So there is nothing fundamentally wrong with the algorithm from a performance perspective, but its implementation can be improved vastly (roughly 8x here). It seems that the tools JDK engineers and users alike need to make optimisations like these are in the pipeline!

What about byte arrays? One obstacle to vectorisation is that, to implement the JDK algorithm strictly, the bytes must accumulate into 32-bit values, which means the lanes need to widen: the contents of a single vector register of bytes would need to fan out to four vector registers of integers. This could be achieved by loading vectors at eight-byte offsets and permuting the first eight bytes of each vector into every fourth position, but it is quite convoluted.

Peter demonstrates that reinterpreting four bytes as an integer doesn’t necessarily degrade the hash distribution, and may even improve it, so I used the same trick: rebracketing a ByteVector to an IntVector to produce the “wrong” result, but a reasonable hash code nonetheless. A nice feature of the Vector API is allowing this kind of reinterpretation without resorting to Unsafe, via Vector.rebracket.
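
This version uses a forward table of powers, assumed here to satisfy POWERS_OF_31[i] == 31^i (again wrapping modulo 2^32):

  // assumed helper: powers of 31 in ascending order of exponent
  private static final int[] POWERS_OF_31 = new int[33];
  static {
    POWERS_OF_31[0] = 1;
    for (int i = 1; i < POWERS_OF_31.length; ++i) {
      POWERS_OF_31[i] = 31 * POWERS_OF_31[i - 1];
    }
  }

Combining the rebracketing trick with the dependency-breaking unroll gives the following.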

  public int hashCodeVectorAPINoDependencies() {
    var next = YMM_INT.broadcast(POWERS_OF_31[8]);
    var coefficients1 = YMM_INT.fromArray(POWERS_OF_31, 0);
    var coefficients2 = coefficients1.mul(next);
    var coefficients3 = coefficients2.mul(next);
    var coefficients4 = coefficients3.mul(next);
    next = next.mul(next);
    next = next.mul(next);
    var acc1 = YMM_INT.zero();
    var acc2 = YMM_INT.zero();
    var acc3 = YMM_INT.zero();
    var acc4 = YMM_INT.zero();
    for (int i = 0; i < data.length; i += YMM_BYTE.length() * 4) {
      acc1 = acc1.add(coefficients1.mul(YMM_BYTE.fromArray(data, i).rebracket(YMM_INT)));
      acc2 = acc2.add(coefficients2.mul(YMM_BYTE.fromArray(data, i + YMM_BYTE.length()).rebracket(YMM_INT)));
      acc3 = acc3.add(coefficients3.mul(YMM_BYTE.fromArray(data, i + 2 * YMM_BYTE.length()).rebracket(YMM_INT)));
      acc4 = acc4.add(coefficients4.mul(YMM_BYTE.fromArray(data, i + 3 * YMM_BYTE.length()).rebracket(YMM_INT)));
      coefficients1 = coefficients1.mul(next);
      coefficients2 = coefficients2.mul(next);
      coefficients3 = coefficients3.mul(next);
      coefficients4 = coefficients4.mul(next);
    }
    return acc1.add(acc2).add(acc3).add(acc4).addAll();
  }

Owing to its use of better tools yet to be released, this version is many times faster than either the JDK implementation or Peter’s.

Benchmark                        (size)   Mode  Cnt      Score     Error   Units
arraysHashCode                      128  thrpt   20   8897.392   220.582  ops/ms
arraysHashCode                      256  thrpt   20   4286.621   156.794  ops/ms
arraysHashCode                      512  thrpt   20   2024.858    72.030  ops/ms
arraysHashCode                     1024  thrpt   20   1002.173    39.917  ops/ms
hashCodeVectorAPINoDependencies     128  thrpt   20  88395.374  3369.397  ops/ms
hashCodeVectorAPINoDependencies     256  thrpt   20  64799.697  1035.175  ops/ms
hashCodeVectorAPINoDependencies     512  thrpt   20  48248.967   864.728  ops/ms
hashCodeVectorAPINoDependencies    1024  thrpt   20  27429.025   916.850  ops/ms
hashCodeVectorAPIDependencies       128  thrpt   20  96679.811   316.470  ops/ms
hashCodeVectorAPIDependencies       256  thrpt   20  52140.582  1825.173  ops/ms
hashCodeVectorAPIDependencies       512  thrpt   20  26327.195   492.730  ops/ms
hashCodeVectorAPIDependencies      1024  thrpt   20  10929.500   351.732  ops/ms
nativeHashCode                      128  thrpt   20  38799.185   188.636  ops/ms
nativeHashCode                      256  thrpt   20  17438.536   257.168  ops/ms
nativeHashCode                      512  thrpt   20   7041.815   209.817  ops/ms
nativeHashCode                     1024  thrpt   20   3217.187    96.379  ops/ms

Benchmark source code

Paul Sandoz discussed this topic at Oracle Code One 2018: Slides, Video.