Collecting Rocks and Benchmarks
For as long as I can remember, I have been interested in rocks; I have hundreds of them in storage. Rocks are interesting because they hold little clues about processes nobody has ever seen happen. For instance, one of the first rocks I ever took an interest in was a smooth granite pebble, which I collected on a walk when I was six. Let’s be honest: as a six year old, it was the pattern on the surface of the rock that caught my eye, but this truly was a fascinating rock, because it shouldn’t have been smooth and it shouldn’t have been in England. It probably came from Norway, and while it’s possible that a Norwegian brought the rock to England, it’s highly unlikely.
The best explanation of the rock’s existence at that point in space and time is that there was once a glacier covering Norway, the North Sea, and Northern England, moving in that direction. The glacier carried the pebble, and many others like it, and dumped it as glacial outwash in the English Midlands. Nobody alive saw this happen, but if it did, Scotland (relieved of the weight of an ice sheet) should still be rising in altitude, and there should be clay in the Midlands but not in the South. It’s likely that there was an ice age, not just because I found a granite pebble, but because Scotland is rising and the English Midlands are covered in clay. You may lack the tools, funds, or time to do so, but you can apply this process to virtually anything to figure out how it happened or how it works.
If you’re an application developer, as I am, it’s highly unlikely that you wrote the platform you use, so you probably don’t really understand it: the vast majority of it was written by other people over what might as well be geological timescales. Benchmarks are a lot like rocks in that they can reveal implementation details you aren’t otherwise party to, and, with the right trade-offs, you may be able to harness these in your applications. Collecting rocks doesn’t make you a geologist, and I think you need to be quite inquisitive for benchmarking to work out, because you need to seek corroborating evidence: just as the presence of a pebble is scant support for an ice age, a single benchmark score doesn’t say much about anything.
If you’re interested in the JVM, I think instruction profiling is essential because it gives so much away. For instance, you may not appreciate the significance of your choice of garbage collector, but if you profile some benchmarks you’ll see lots of strange instructions, and you may have the curiosity to ask what they do and what put them there. If you don’t profile, you won’t really know the limits within which your observations are valid: you could find, for instance, that changing your garbage collector invalidates your findings.
Partly because I want the information on this site to be basically correct, but also to illustrate how easy it is to jump to a conclusion because a benchmark score confirms a personal bias, I’ll look again at a couple of very superficial benchmarks I wrote about last year, to do with polynomial hash codes. The measurements were probably fine, but I didn’t really get to the bottom of the issue, and I could easily have found a faster implementation had I looked a bit harder.
Polynomial Hash Codes
The Java String.hashCode and Arrays.hashCode methods are implemented as the dot product of the data and a vector of powers of 31.
// Arrays.java
public static int hashCode(int a[]) {
    if (a == null)
        return 0;

    int result = 1;
    for (int element : a)
        result = 31 * result + element;

    return result;
}
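Unrolling a few iterations shows the polynomial structure. For a three-element array {x0, x1, x2}, the loop computes:

((31 * 1 + x0) * 31 + x1) * 31 + x2
    == 31^3 + 31^2 * x0 + 31 * x1 + x2

That is, the dot product of the data with {31^2, 31, 1}, plus a leading 31^3 term contributed by the initial value of result.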
Since String caches its hash code, you’d be hard-pressed to make use of the following optimisations, but I found I learned something about what C2 does and doesn’t do by attempting to optimise it. First of all, what does the built-in hash code actually do? You can only really find out with instruction profiling, which you can enable in JMH with -prof perfasm on Linux or -prof xperfasm on Windows.
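For reference, a minimal JMH harness along these lines is enough to reproduce these profiles (the class name and setup here are my own, not from the original benchmarks; the sizes match the results table at the end of this post):

import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class HashCode {

    @Param({"256", "1024", "8192"})
    int size;

    int[] data;

    @Setup(Level.Trial)
    public void init() {
        data = ThreadLocalRandom.current().ints(size).toArray();
    }

    @Benchmark
    public int BuiltIn() {
        return Arrays.hashCode(data);
    }
}

Running it with java -jar benchmarks.jar HashCode -prof perfasm prints the hottest compiled region, annotated with per-instruction sample percentages.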
The bulk of the sampled instructions are in the block below:
0x000002720f204af2: add edi,dword ptr [rax+rbx*4+10h]
2.37% 0x000002720f204af6: mov r10d,edi
1.72% 0x000002720f204af9: shl r10d,5h
2.06% 0x000002720f204afd: sub r10d,edi
3.77% 0x000002720f204b00: add r10d,dword ptr [rax+rbx*4+14h]
3.93% 0x000002720f204b05: mov r8d,r10d
0.01% 0x000002720f204b08: shl r8d,5h
3.80% 0x000002720f204b0c: sub r8d,r10d
3.98% 0x000002720f204b0f: add r8d,dword ptr [rax+rbx*4+18h]
4.12% 0x000002720f204b14: mov r10d,r8d
0x000002720f204b17: shl r10d,5h
3.78% 0x000002720f204b1b: sub r10d,r8d
3.75% 0x000002720f204b1e: add r10d,dword ptr [rax+rbx*4+1ch]
3.81% 0x000002720f204b23: mov r8d,r10d
0x000002720f204b26: shl r8d,5h
4.04% 0x000002720f204b2a: sub r8d,r10d
4.15% 0x000002720f204b2d: add r8d,dword ptr [rax+rbx*4+20h]
3.98% 0x000002720f204b32: mov r10d,r8d
0x000002720f204b35: shl r10d,5h
4.27% 0x000002720f204b39: sub r10d,r8d
3.95% 0x000002720f204b3c: add r10d,dword ptr [rax+rbx*4+24h]
3.77% 0x000002720f204b41: mov r8d,r10d
0x000002720f204b44: shl r8d,5h
4.01% 0x000002720f204b48: sub r8d,r10d
4.02% 0x000002720f204b4b: add r8d,dword ptr [rax+rbx*4+28h]
4.11% 0x000002720f204b50: mov ecx,r8d
0x000002720f204b53: shl ecx,5h
4.08% 0x000002720f204b56: sub ecx,r8d
4.31% 0x000002720f204b59: add ecx,dword ptr [rax+rbx*4+2ch]
The first thing to ask is: where is the multiplication? There is no multiplication; it has been replaced by a left shift and a subtraction, exploiting the identity 31 * x == (x << 5) - x. Written in Java, the compiled loop looks like this:
@Benchmark
public int StrengthReduction() {
    int result = 1;
    for (int i = 0; i < data.length; ++i) {
        result = (result << 5) - result + data[i];
    }
    return result;
}
This is the compiler trying to be helpful, because shifts and subtractions are cheaper than multiplications, and since 31 = 2^5 - 1, this transformation is possible. The trouble is that the result is one long chain of instructions: notice the register dependencies in the assembly below:
4.15% 0x000002720f204b2d: add r8d,dword ptr [rax+rbx*4+20h]
3.98% 0x000002720f204b32: mov r10d,r8d
0x000002720f204b35: shl r10d,5h
4.27% 0x000002720f204b39: sub r10d,r8d
The addition must complete before the contents of r8d are available for the move, the left shift waits for the move, and the subtraction waits for the shift: no two elements of the array are ever processed simultaneously. The dependency can be broken by manually unrolling the loop, a trick first suggested by Peter Levart, which I came across on Daniel Lemire’s blog:
@Benchmark
public int Unrolled() {
    if (data == null)
        return 0;

    int result = 1;
    int i = 0;
    for (; i + 7 < data.length; i += 8) {
        result = 31 * 31 * 31 * 31 * 31 * 31 * 31 * 31 * result
               + 31 * 31 * 31 * 31 * 31 * 31 * 31 * data[i]
               + 31 * 31 * 31 * 31 * 31 * 31 * data[i + 1]
               + 31 * 31 * 31 * 31 * 31 * data[i + 2]
               + 31 * 31 * 31 * 31 * data[i + 3]
               + 31 * 31 * 31 * data[i + 4]
               + 31 * 31 * data[i + 5]
               + 31 * data[i + 6]
               + data[i + 7];
    }
    for (; i < data.length; i++) {
        result = 31 * result + data[i];
    }
    return result;
}
Weirdly, this implementation does very well (this isn’t new: there has been a ticket for it for several years). Without even bothering with a throughput score, the profiling output shows that this must be much better. The assembly is quite hard to read because it’s full of tricks I didn’t know existed, but look out for the hexadecimal constants and convince yourself that several of them are simply powers of 31. The multiplication by 31 is strength-reduced to a left shift and a subtraction again.
0.26% 0x000001d67bdd3c8e: mov r8d,94446f01h
0.01% 0x000001d67bdd3c94: jmp 1d67bdd3cb1h
0x000001d67bdd3c96: nop word ptr [rax+rax+0h]
0x000001d67bdd3ca0: mov edi,r11d
0.03% 0x000001d67bdd3ca3: vmovq r14,xmm0
0.42% 0x000001d67bdd3ca8: mov ebp,dword ptr [rsp+70h]
7.14% 0x000001d67bdd3cac: vmovq rbx,xmm1
0.01% 0x000001d67bdd3cb1: cmp edi,r9d
0x000001d67bdd3cb4: jnb 1d67bdd3d6ch
0.04% 0x000001d67bdd3cba: imul r10d,dword ptr [rcx+rdi*4+10h],67e12cdfh ; Another strength reduction trick
7.74% 0x000001d67bdd3cc3: add r10d,r8d ; 1742810335 * x + 2487512833
2.69% 0x000001d67bdd3cc6: mov r11d,edi
0.09% 0x000001d67bdd3cc9: add r11d,7h
0.46% 0x000001d67bdd3ccd: cmp r11d,r9d
0x000001d67bdd3cd0: jnb 1d67bdd3dbah
6.82% 0x000001d67bdd3cd6: vmovq xmm1,rbx
0.61% 0x000001d67bdd3cdb: mov dword ptr [rsp+70h],ebp
0.06% 0x000001d67bdd3cdf: vmovq xmm0,r14
0.46% 0x000001d67bdd3ce4: mov r14,qword ptr [r15+70h]
6.60% 0x000001d67bdd3ce8: mov r11d,edi
0.67% 0x000001d67bdd3ceb: add r11d,8h
0.04% 0x000001d67bdd3cef: movsxd rax,edi
0.41% 0x000001d67bdd3cf2: add edi,0fh
6.87% 0x000001d67bdd3cf5: mov edx,dword ptr [rcx+rax*4+28h]
0.68% 0x000001d67bdd3cf9: imul r8d,dword ptr [rcx+rax*4+14h],34e63b41h ; multiply by 887503681
0.67% 0x000001d67bdd3d02: add r8d,r10d ; --------------------------
7.30% 0x000001d67bdd3d05: mov r10d,edx ; Multiply by 31
0.63% 0x000001d67bdd3d08: shl r10d,5h ;
0.08% 0x000001d67bdd3d0c: sub r10d,edx ; --------------------------
0.73% 0x000001d67bdd3d0f: imul edx,dword ptr [rcx+rax*4+24h],3c1h ; multiply by 961
7.47% 0x000001d67bdd3d17: imul ebp,dword ptr [rcx+rax*4+20h],745fh ; multiply by 29791
0.56% 0x000001d67bdd3d1f: imul esi,dword ptr [rcx+rax*4+1ch],0e1781h ; multiply by 923521
7.02% 0x000001d67bdd3d27: imul ebx,dword ptr [rcx+rax*4+18h],1b4d89fh ; multiply by 28629151
0.57% 0x000001d67bdd3d2f: add r8d,ebx
6.99% 0x000001d67bdd3d32: add r8d,esi
0.66% 0x000001d67bdd3d35: add r8d,ebp
0.14% 0x000001d67bdd3d38: add r8d,edx
0.91% 0x000001d67bdd3d3b: add r8d,r10d
7.04% 0x000001d67bdd3d3e: add r8d,dword ptr [rcx+rax*4+2ch] ; add the data value at offset 7
1.73% 0x000001d67bdd3d43: test dword ptr [r14],eax
0.06% 0x000001d67bdd3d46: cmp edi,r9d
0x000001d67bdd3d49: jnl 1d67bdd3c15h
0.45% 0x000001d67bdd3d4f: imul r8d,r8d,94446f01h ; multiply by 2487512833 (coprime to 31, follow r8d backwards)
11.90% 0x000001d67bdd3d56: cmp r11d,r9d
It’s probably not worth deciphering all the tricks in the code above, but notice that there is a lot of parallelism: the signed multiplications target different registers and are independent of one another. This code is much faster.
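If you want to convince yourself that the hexadecimal constants above really are powers of 31, a quick check (my own snippet, not from the original post) shows how they wrap modulo 2^32, just as the hash code itself does:

public class PowersOf31 {
    public static void main(String[] args) {
        int power = 1;
        for (int i = 1; i <= 8; ++i) {
            power *= 31; // int multiplication silently wraps modulo 2^32
            System.out.printf("31^%d = 0x%X%n", i, power);
        }
        // prints ..., 31^6 = 0x34E63B41, 31^7 = 0x67E12CDF, 31^8 = 0x94446F01:
        // exactly the constants in the imul instructions above
    }
}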
I wrote the code below in July last year to try to parallelise this calculation. At the cost of precomputing the powers of 31 up to some fixed maximum length, such as the maximum length of strings in your database, it’s quite fast.
private final int[] coefficients;

public FixedLengthHashCode(int maxLength) {
    this.coefficients = new int[maxLength + 1];
    coefficients[0] = 1;
    for (int i = 1; i <= maxLength; ++i) {
        coefficients[i] = 31 * coefficients[i - 1];
    }
}

public int hashCode(int[] value) {
    final int max = value.length;
    int result = coefficients[max];
    for (int i = 0; i < value.length && i < coefficients.length - 1; ++i) {
        result += coefficients[max - i - 1] * value[i];
    }
    return result;
}
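The intended usage, to be clear, is to pay the setup cost once and amortise it over many calls (an illustration of mine, not from the original post):

FixedLengthHashCode hasher = new FixedLengthHashCode(8192); // longest array expected
int hash = hasher.hashCode(data); // reusable for any array up to that length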
I was non-committal in the original post, but I sort of claimed this code was vectorised, without even bothering to look at the disassembly. In fact it’s scalar, but it’s much more parallel than the unrolled version, and all the clever strength reductions and dependencies are gone.
0.15% 0x0000022d8e6825e0: movsxd rbx,ecx
0.07% 0x0000022d8e6825e3: mov rdx,rsi
3.57% 0x0000022d8e6825e6: sub rdx,rbx
0.08% 0x0000022d8e6825e9: mov r10d,dword ptr [r9+rbx*4+10h]
0.18% 0x0000022d8e6825ee: imul r10d,dword ptr [rdi+rdx*4+0ch]
0.15% 0x0000022d8e6825f4: add r10d,r8d
4.25% 0x0000022d8e6825f7: mov r11d,dword ptr [r9+rbx*4+14h]
0.14% 0x0000022d8e6825fc: imul r11d,dword ptr [rdi+rdx*4+8h]
0.19% 0x0000022d8e682602: add r11d,r10d
1.31% 0x0000022d8e682605: mov r10d,dword ptr [r9+rbx*4+18h]
3.93% 0x0000022d8e68260a: imul r10d,dword ptr [rdi+rdx*4+4h]
0.22% 0x0000022d8e682610: add r10d,r11d
0.94% 0x0000022d8e682613: mov r8d,dword ptr [r9+rbx*4+1ch]
0.05% 0x0000022d8e682618: imul r8d,dword ptr [rdi+rdx*4]
3.68% 0x0000022d8e68261d: add r8d,r10d
1.02% 0x0000022d8e682620: mov r10d,dword ptr [r9+rbx*4+20h]
0.19% 0x0000022d8e682625: imul r10d,dword ptr [rdi+rdx*4+0fffffffffffffffch]
0.61% 0x0000022d8e68262b: add r10d,r8d
4.71% 0x0000022d8e68262e: mov r11d,dword ptr [r9+rbx*4+24h]
0.08% 0x0000022d8e682633: imul r11d,dword ptr [rdi+rdx*4+0fffffffffffffff8h]
0.82% 0x0000022d8e682639: add r11d,r10d
5.57% 0x0000022d8e68263c: mov r10d,dword ptr [r9+rbx*4+28h]
0.04% 0x0000022d8e682641: imul r10d,dword ptr [rdi+rdx*4+0fffffffffffffff4h]
0.68% 0x0000022d8e682647: add r10d,r11d
4.67% 0x0000022d8e68264a: mov r11d,dword ptr [r9+rbx*4+2ch]
0.08% 0x0000022d8e68264f: imul r11d,dword ptr [rdi+rdx*4+0fffffffffffffff0h]
0.45% 0x0000022d8e682655: add r11d,r10d
4.50% 0x0000022d8e682658: mov r10d,dword ptr [r9+rbx*4+30h]
0.20% 0x0000022d8e68265d: imul r10d,dword ptr [rdi+rdx*4+0ffffffffffffffech]
0.37% 0x0000022d8e682663: add r10d,r11d
3.82% 0x0000022d8e682666: mov r8d,dword ptr [r9+rbx*4+34h]
0.05% 0x0000022d8e68266b: imul r8d,dword ptr [rdi+rdx*4+0ffffffffffffffe8h]
0.42% 0x0000022d8e682671: add r8d,r10d
4.18% 0x0000022d8e682674: mov r10d,dword ptr [r9+rbx*4+38h]
0.02% 0x0000022d8e682679: imul r10d,dword ptr [rdi+rdx*4+0ffffffffffffffe4h]
0.25% 0x0000022d8e68267f: add r10d,r8d
5.11% 0x0000022d8e682682: mov r11d,dword ptr [r9+rbx*4+3ch]
0.03% 0x0000022d8e682687: imul r11d,dword ptr [rdi+rdx*4+0ffffffffffffffe0h]
0.28% 0x0000022d8e68268d: add r11d,r10d
4.88% 0x0000022d8e682690: mov r10d,dword ptr [r9+rbx*4+40h]
0.09% 0x0000022d8e682695: imul r10d,dword ptr [rdi+rdx*4+0ffffffffffffffdch]
0.21% 0x0000022d8e68269b: add r10d,r11d
4.56% 0x0000022d8e68269e: mov r8d,dword ptr [r9+rbx*4+44h]
0.02% 0x0000022d8e6826a3: imul r8d,dword ptr [rdi+rdx*4+0ffffffffffffffd8h]
0.18% 0x0000022d8e6826a9: add r8d,r10d
4.73% 0x0000022d8e6826ac: mov r10d,dword ptr [r9+rbx*4+48h]
0.06% 0x0000022d8e6826b1: imul r10d,dword ptr [rdi+rdx*4+0ffffffffffffffd4h]
0.10% 0x0000022d8e6826b7: add r10d,r8d
4.12% 0x0000022d8e6826ba: mov r8d,dword ptr [r9+rbx*4+4ch]
That blog post really was lazy. There’s a bit of a problem with the access pattern: the coefficients are read in reverse order, and at an offset, which is too complicated for the optimiser. The code below is just a dot product, and it should come as no surprise that it’s faster.
private int[] coefficients;
private int seed;

void init(int size) {
    coefficients = new int[size];
    coefficients[size - 1] = 1;
    for (int i = size - 2; i >= 0; --i) {
        coefficients[i] = 31 * coefficients[i + 1];
    }
    seed = 31 * coefficients[0];
}

@Benchmark
public int Vectorised() {
    int result = seed;
    for (int i = 0; i < data.length && i < coefficients.length; ++i) {
        result += coefficients[i] * data[i];
    }
    return result;
}
The code vectorises, but the reduction isn’t as good as it could be with handwritten code.
0.22% 0x000001d9c2e0f320: vmovdqu ymm0,ymmword ptr [rdi+rsi*4+70h]
2.31% 0x000001d9c2e0f326: vpmulld ymm0,ymm0,ymmword ptr [r11+rsi*4+70h]
0.61% 0x000001d9c2e0f32d: vmovdqu ymm1,ymmword ptr [rdi+rsi*4+50h]
2.61% 0x000001d9c2e0f333: vpmulld ymm9,ymm1,ymmword ptr [r11+rsi*4+50h]
0.53% 0x000001d9c2e0f33a: vmovdqu ymm1,ymmword ptr [rdi+rsi*4+30h]
2.07% 0x000001d9c2e0f340: vpmulld ymm10,ymm1,ymmword ptr [r11+rsi*4+30h]
0.60% 0x000001d9c2e0f347: vmovdqu ymm1,ymmword ptr [rdi+rsi*4+10h]
2.33% 0x000001d9c2e0f34d: vpmulld ymm11,ymm1,ymmword ptr [r11+rsi*4+10h]
0.61% 0x000001d9c2e0f354: vphaddd ymm7,ymm11,ymm11
3.04% 0x000001d9c2e0f359: vphaddd ymm7,ymm7,ymm8
3.56% 0x000001d9c2e0f35e: vextracti128 xmm8,ymm7,1h
0.53% 0x000001d9c2e0f364: vpaddd xmm7,xmm7,xmm8
1.56% 0x000001d9c2e0f369: vmovd xmm8,r8d
1.77% 0x000001d9c2e0f36e: vpaddd xmm8,xmm8,xmm7
0.93% 0x000001d9c2e0f372: vmovd edx,xmm8
0.27% 0x000001d9c2e0f376: vphaddd ymm2,ymm10,ymm10
2.75% 0x000001d9c2e0f37b: vphaddd ymm2,ymm2,ymm6
2.32% 0x000001d9c2e0f380: vextracti128 xmm6,ymm2,1h
1.95% 0x000001d9c2e0f386: vpaddd xmm2,xmm2,xmm6
0.63% 0x000001d9c2e0f38a: vmovd xmm6,edx
0.50% 0x000001d9c2e0f38e: vpaddd xmm6,xmm6,xmm2
7.76% 0x000001d9c2e0f392: vmovd edx,xmm6
0.22% 0x000001d9c2e0f396: vphaddd ymm5,ymm9,ymm9
2.68% 0x000001d9c2e0f39b: vphaddd ymm5,ymm5,ymm1
0.34% 0x000001d9c2e0f3a0: vextracti128 xmm1,ymm5,1h
6.27% 0x000001d9c2e0f3a6: vpaddd xmm5,xmm5,xmm1
0.88% 0x000001d9c2e0f3aa: vmovd xmm1,edx
0.92% 0x000001d9c2e0f3ae: vpaddd xmm1,xmm1,xmm5
7.85% 0x000001d9c2e0f3b2: vmovd edx,xmm1
0.43% 0x000001d9c2e0f3b6: vphaddd ymm4,ymm0,ymm0
2.59% 0x000001d9c2e0f3bb: vphaddd ymm4,ymm4,ymm3
0.34% 0x000001d9c2e0f3c0: vextracti128 xmm3,ymm4,1h
5.72% 0x000001d9c2e0f3c6: vpaddd xmm4,xmm4,xmm3
0.80% 0x000001d9c2e0f3ca: vmovd xmm3,edx
0.58% 0x000001d9c2e0f3ce: vpaddd xmm3,xmm3,xmm4
8.09% 0x000001d9c2e0f3d2: vmovd r8d,xmm3
Using JDK 9, this results in a 3x throughput gain over the built-in Arrays.hashCode, and that includes the cost of doubling the number of bytes to process (the coefficients as well as the data) and a suboptimal reduction phase. This is going to be a prime candidate for the Vector API, where a vector of powers of 31 could be multiplied by 31^8 on each iteration, before multiplying by a vector of the next eight data elements.
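A rough sketch of that idea, assuming the incubating jdk.incubator.vector API (which post-dates this post; the class, constants, and method below are my own, and it needs --add-modules jdk.incubator.vector):

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorisedHashCode {

    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 ints
    private static final int POW8 = 0x94446F01; // 31^8 wrapped to 32 bits
    // 31^7 down to 31^0, one coefficient per lane
    private static final int[] POWERS = {
        0x67E12CDF, 0x34E63B41, 0x01B4D89F, 0x000E1781,
        0x0000745F, 0x000003C1, 0x0000001F, 0x00000001
    };

    public static int hashCode(int[] data) {
        IntVector powers = IntVector.fromArray(SPECIES, POWERS, 0);
        IntVector acc = IntVector.zero(SPECIES);
        int result = 1;
        int i = 0;
        for (; i + 8 <= data.length; i += 8) {
            result *= POW8; // carry the leading 31^n term from the initial value 1
            // shift the accumulated partial sums up by 31^8, then add the
            // next eight elements weighted by 31^7..31^0
            acc = acc.mul(POW8)
                     .add(IntVector.fromArray(SPECIES, data, i).mul(powers));
        }
        result += acc.reduceLanes(VectorOperators.ADD);
        for (; i < data.length; ++i) {
            result = 31 * result + data[i]; // scalar tail
        }
        return result;
    }
}

Keeping the partial sums in a vector accumulator defers the horizontal reduction to a single reduceLanes at the end, which is exactly what the auto-vectorised loop above fails to do.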
| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size |
|---|---|---|---|---|---|---|---|
| BuiltIn | thrpt | 1 | 10 | 3.439980 | 0.024231 | ops/us | 256 |
| BuiltIn | thrpt | 1 | 10 | 0.842664 | 0.009019 | ops/us | 1024 |
| BuiltIn | thrpt | 1 | 10 | 0.103638 | 0.003070 | ops/us | 8192 |
| FixedLength | thrpt | 1 | 10 | 7.421648 | 0.169558 | ops/us | 256 |
| FixedLength | thrpt | 1 | 10 | 1.966116 | 0.044398 | ops/us | 1024 |
| FixedLength | thrpt | 1 | 10 | 0.249666 | 0.009994 | ops/us | 8192 |
| StrengthReduction | thrpt | 1 | 10 | 3.417248 | 0.045733 | ops/us | 256 |
| StrengthReduction | thrpt | 1 | 10 | 0.840830 | 0.011091 | ops/us | 1024 |
| StrengthReduction | thrpt | 1 | 10 | 0.104265 | 0.001537 | ops/us | 8192 |
| Unrolled | thrpt | 1 | 10 | 6.353271 | 0.042330 | ops/us | 256 |
| Unrolled | thrpt | 1 | 10 | 1.672845 | 0.035389 | ops/us | 1024 |
| Unrolled | thrpt | 1 | 10 | 0.205575 | 0.009375 | ops/us | 8192 |
| Vectorised | thrpt | 1 | 10 | 10.958270 | 0.109993 | ops/us | 256 |
| Vectorised | thrpt | 1 | 10 | 2.948918 | 0.081830 | ops/us | 1024 |
| Vectorised | thrpt | 1 | 10 | 0.358819 | 0.027316 | ops/us | 8192 |