The conventional wisdom is that the stages of a task should be pipelined, so you don't need to wait for one stage to complete before the next one starts. Surprisingly, it seems you can sometimes do better by performing each stage of a pipeline over a short batch. Useful optimisation opportunities can arise from this phenomenon, with only minor code changes. I recently applied this principle to implement fast batch iterators in RoaringBitmap.

I came across a discussion about shuffling arrays on Twitter, prompted by a blog post by Daniel Lemire. Imagine you want to randomly shuffle the contents of an array. One approach is to iterate over the array in reverse and, at each index i, generate a random index j smaller than i, then swap the elements at i and j. Here's some benchmark code to measure how long this takes for an assortment of swapping strategies, including one where the swaps are precomputed and simply looked up in an array.

  @Benchmark
  public void shuffle(Blackhole bh) {
    for (int i = data.length; i > 1; i--) {
      swap(data, i - 1, op.applyAsInt(i));
    }
    bh.consume(data);
  }

  private static void swap(int[] arr, int i, int j) {
    if (i == j) {
      return; // guard: the XOR trick below zeroes the element when i == j,
              // which can happen when the random index equals i - 1
    }
    arr[i] ^= arr[j];
    arr[j] ^= arr[i];
    arr[i] ^= arr[j];
  }
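
The benchmark harness isn't shown in full: `op` is an IntUnaryOperator chosen by the mode parameter. The sketch below is my guess at how the two modes might be wired up (the names THREAD_LOCAL_RANDOM and precomputed are mine, and a plain temp-variable swap is used to sidestep the XOR self-swap hazard):

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.IntUnaryOperator;

public class ShuffleStrategies {

    // On-the-fly mode: ask ThreadLocalRandom for an index below i every iteration.
    static final IntUnaryOperator THREAD_LOCAL_RANDOM =
            i -> ThreadLocalRandom.current().nextInt(i);

    // Precomputed mode: all random indices are generated up front,
    // so the hot loop only does an array lookup.
    static IntUnaryOperator precomputed(int size) {
        int[] indices = new int[size + 1];
        for (int i = size; i > 1; i--) {
            indices[i] = ThreadLocalRandom.current().nextInt(i);
        }
        return i -> indices[i];
    }

    static void shuffle(int[] data, IntUnaryOperator op) {
        for (int i = data.length; i > 1; i--) {
            swap(data, i - 1, op.applyAsInt(i));
        }
    }

    static void swap(int[] arr, int i, int j) {
        int tmp = arr[i]; // plain swap; the XOR trick breaks when i == j
        arr[i] = arr[j];
        arr[j] = tmp;
    }

    public static void main(String[] args) {
        int[] data = new int[1000];
        Arrays.setAll(data, i -> i);
        shuffle(data, precomputed(data.length));
        int[] sorted = data.clone();
        Arrays.sort(sorted);
        for (int i = 0; i < sorted.length; i++) {
            if (sorted[i] != i) throw new AssertionError("not a permutation");
        }
        System.out.println("shuffle output is a permutation");
    }
}
```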


There is a large difference between the version where the random swap index is precomputed and the version where it is computed on the fly with ThreadLocalRandom.nextInt.

Benchmark  Mode   Threads  Samples  Score        Score Error (99.9%)  Unit   Param: mode          Param: size
shuffle    thrpt  1        10       2198.459182  274.965189           ops/s  THREAD_LOCAL_RANDOM  65536
shuffle    thrpt  1        10       1015.796005  16.225480            ops/s  THREAD_LOCAL_RANDOM  131072
shuffle    thrpt  1        10       7300.732038  46.788234            ops/s  PRECOMPUTED          65536
shuffle    thrpt  1        10       3828.021096  450.874537           ops/s  PRECOMPUTED          131072

The difference is large, but a lot more work is being done when the random indices are computed on the fly. A good measure of efficiency per unit of work is cycles per instruction (CPI). Running the benchmark with -prof perfnorm shows that these benchmarks are at parity for cycles per instruction: if throughput is lower when the random numbers are generated on the fly, it’s because there are more instructions to execute.

Benchmark    Mode   Threads  Samples  Score     Score Error (99.9%)  Unit  Param: mode          Param: size
shuffle:CPI  thrpt  1        1        0.427028  NaN                  #/op  THREAD_LOCAL_RANDOM  65536
shuffle:CPI  thrpt  1        1        0.447793  NaN                  #/op  THREAD_LOCAL_RANDOM  131072
shuffle:CPI  thrpt  1        1        0.477202  NaN                  #/op  PRECOMPUTED          65536
shuffle:CPI  thrpt  1        1        0.565153  NaN                  #/op  PRECOMPUTED          131072

Nevertheless, instruction profiling with -prof perfasm shows that the generated code is qualitatively different when computing the next swap index is cheap. When there is random number generation to do, most of the time is attributed to the mov instructions in the swap, or to the instructions just after them (probably because of profiling skid). For example, with the smaller array:

  0.04%    0.00%  │   ││  0x00007fa009c0a8f9: xor    0x10(%rsi,%rdx,4),%r10d
 15.31%   13.18%  │   ││  0x00007fa009c0a8fe: mov    %r10d,0xc(%rsi,%rcx,4)
  3.43%    3.05%  │   ││  0x00007fa009c0a903: xor    0x10(%rsi,%rdx,4),%r10d
  5.37%    5.92%  │   ││  0x00007fa009c0a908: mov    %r10d,0x10(%rsi,%rdx,4)
  4.15%    4.22%  │   ││  0x00007fa009c0a90d: xor    %r10d,0xc(%rsi,%rcx,4)
 10.80%    8.80%  │   ││  0x00007fa009c0a912: cmp    $0x1,%r9d ; probably skid

The key difference in the precomputed case is that the loop is unrolled into several isomorphic chains of instructions, and none of the loads looks quite so expensive according to the sampled frequencies.

  0.08%    0.16%  │      0x00007fda2dc0dfb2: cmp    %r10d,%r9d
                  │      0x00007fda2dc0dfb5: jae    0x00007fda2dc0e264
  0.00%    0.00%  │      0x00007fda2dc0dfbb: xor    0x10(%rdx,%r9,4),%edi
  2.90%    2.89%  │      0x00007fda2dc0dfc0: mov    %edi,0xc(%rdx,%r11,4)
  0.48%    0.33%  │      0x00007fda2dc0dfc5: xor    0x10(%rdx,%r9,4),%edi
  0.45%    0.48%  │      0x00007fda2dc0dfca: mov    %edi,0x10(%rdx,%r9,4)
  0.56%    0.46%  │      0x00007fda2dc0dfcf: xor    %edi,0xc(%rdx,%r11,4)
  4.29%    3.88%  │      0x00007fda2dc0dfd4: mov    0x8(%rdx,%r11,4),%edi
  0.03%    0.01%  │      0x00007fda2dc0dfd9: mov    0x8(%rsi,%r11,4),%r9d
  1.38%    1.46%  │      0x00007fda2dc0dfde: mov    %r11d,%ebx
  0.02%    0.01%  │      0x00007fda2dc0dfe1: add    $0xfffffffe,%ebx

0.63%    0.61%  │      0x00007fda2dc0dfe4: cmp    %r10d,%r9d
│      0x00007fda2dc0dfe7: jae    0x00007fda2dc0e26f
0.00%    0.01%  │      0x00007fda2dc0dfed: xor    0x10(%rdx,%r9,4),%edi
2.60%    2.38%  │      0x00007fda2dc0dff2: mov    %edi,0x8(%rdx,%r11,4)
0.58%    0.51%  │      0x00007fda2dc0dff7: xor    0x10(%rdx,%r9,4),%edi
0.90%    0.96%  │      0x00007fda2dc0dffc: mov    %edi,0x10(%rdx,%r9,4)
0.68%    0.66%  │      0x00007fda2dc0e001: xor    %edi,0x8(%rdx,%r11,4)
4.85%    4.17%  │      0x00007fda2dc0e006: mov    0x4(%rdx,%r11,4),%edi
0.01%    0.02%  │      0x00007fda2dc0e00b: mov    0x4(%rsi,%r11,4),%r9d
1.12%    0.95%  │      0x00007fda2dc0e010: mov    %r11d,%ecx
  0.01%    0.00%  │      0x00007fda2dc0e013: add    $0xfffffffd,%ecx
  1.02%    1.02%  │      0x00007fda2dc0e016: cmp    %r10d,%r9d
                  │      0x00007fda2dc0e019: jae    0x00007fda2dc0e267
  0.01%    0.01%  │      0x00007fda2dc0e01f: xor    0x10(%rdx,%r9,4),%edi
  2.47%    2.10%  │      0x00007fda2dc0e024: mov    %edi,0x4(%rdx,%r11,4)
  0.69%    0.57%  │      0x00007fda2dc0e029: xor    0x10(%rdx,%r9,4),%edi
  1.37%    1.50%  │      0x00007fda2dc0e02e: mov    %edi,0x10(%rdx,%r9,4)
  0.77%    0.83%  │      0x00007fda2dc0e033: xor    %edi,0x4(%rdx,%r11,4)
  4.28%    3.85%  │      0x00007fda2dc0e038: mov    (%rdx,%r11,4),%edi
  0.03%    0.02%  │      0x00007fda2dc0e03c: mov    (%rsi,%r11,4),%r9d
  1.14%    0.97%  │      0x00007fda2dc0e040: mov    %r11d,%ebx
  0.01%    0.00%  │      0x00007fda2dc0e043: add    $0xfffffffc,%ebx



With unrolling, parts of each chain can execute concurrently, and a cache miss in one chain won't stall the progress of the others. Without this capacity for parallelism, a cache miss during the swap stops all work from progressing. As the probability of a cache miss increases, the cost of the load bottleneck in the swap should increase: this can be stressed by increasing the size of the array. With a large (100M element) array, virtually every access is likely to miss cache. CPI increases in both cases, markedly so with the precomputed swaps, but throughput converges: access to main memory has become the bottleneck.

Benchmark    Mode   Threads  Samples  Score     Score Error (99.9%)  Unit  Param: mode          Param: size
shuffle:CPI  thrpt  1        1        1.354325  NaN                  #/op  THREAD_LOCAL_RANDOM  100000000
shuffle:CPI  thrpt  1        1        3.854150  NaN                  #/op  PRECOMPUTED          100000000

The perfasm output points to the first load in the generated swap as the bottleneck: notice the large cost on the mov instruction.

  0.15%    0.24%  │      ││  0x00007f8405c0a264: cmp    %r9d,%edx
│      ││  0x00007f8405c0a267: jae    0x00007f8405c0a350
0.10%    0.11%  │      ││  0x00007f8405c0a26d: xor    0x10(%r11,%rdx,4),%eax
73.97%   63.58%  │      ││  0x00007f8405c0a272: mov    %eax,0xc(%r11,%rcx,4)
2.46%    1.87%  │      ││  0x00007f8405c0a277: xor    0x10(%r11,%rdx,4),%eax
1.42%    0.67%  │      ││  0x00007f8405c0a27c: mov    %eax,0x10(%r11,%rdx,4)
2.19%    1.44%  │      ││  0x00007f8405c0a281: xor    %eax,0xc(%r11,%rcx,4)
  2.16%    1.37%  │      ││  0x00007f8405c0a286: cmp    $0x1,%edi

With precomputed swaps, there is no single bottleneck, and my intuition is that there is some concurrency despite the higher CPI. This is a long way from being proven.

 10.33%   11.23%   ││  0x00007fdb35c09250: mov    %r9d,0xc(%rsi,%r10,4)
  0.41%    0.45%   ││  0x00007fdb35c09255: xor    0x10(%rsi,%r11,4),%r9d
  0.36%    0.25%   ││  0x00007fdb35c0925a: mov    %r9d,0x10(%rsi,%r11,4)
  0.42%    0.42%   ││  0x00007fdb35c0925f: xor    %r9d,0xc(%rsi,%r10,4)
  0.51%    0.66%   ││  0x00007fdb35c09264: mov    0x8(%rsi,%r10,4),%r9d
  0.03%    0.09%   ││  0x00007fdb35c09269: mov    0x8(%r13,%r10,4),%r11d
  0.25%    0.20%   ││  0x00007fdb35c0926e: mov    %r10d,%r8d
  0.03%    0.15%   ││  0x00007fdb35c09271: add    $0xfffffffe,%r8d
0.19%    0.17%   ││  0x00007fdb35c09275: cmp    %ebx,%r11d
││  0x00007fdb35c09278: jae    0x00007fdb35c09440
0.02%    0.06%   ││  0x00007fdb35c0927e: xor    0x10(%rsi,%r11,4),%r9d
10.40%   10.66%   ││  0x00007fdb35c09283: mov    %r9d,0x8(%rsi,%r10,4)
0.41%    0.35%   ││  0x00007fdb35c09288: xor    0x10(%rsi,%r11,4),%r9d
0.41%    0.30%   ││  0x00007fdb35c0928d: mov    %r9d,0x10(%rsi,%r11,4)
0.45%    0.39%   ││  0x00007fdb35c09292: xor    %r9d,0x8(%rsi,%r10,4)
0.48%    0.60%   ││  0x00007fdb35c09297: mov    0x4(%rsi,%r10,4),%r9d
0.03%    0.06%   ││  0x00007fdb35c0929c: mov    0x4(%r13,%r10,4),%r11d
0.06%    0.11%   ││  0x00007fdb35c092a1: mov    %r10d,%edi
0.02%    0.16%   ││  0x00007fdb35c092a4: add    $0xfffffffd,%edi
0.25%    0.20%   ││  0x00007fdb35c092a7: cmp    %ebx,%r11d


This can be exploited with a hybrid approach, so the random numbers are generated on the fly without creating a single bottleneck. The random swap indices are generated on the fly and written into a small buffer; once the buffer is full, the swaps are performed. This should "decouple" the random number generation from the swapping code, allowing some of the swaps to proceed independently. Concretely:

  @Benchmark
  public void shuffleBuffered(Blackhole bh) {
    for (int i = data.length; i - unroll > 1; i -= unroll) {
      for (int j = 0; j < buffer.length; ++j) {
        buffer[j] = op.applyAsInt(i - j);
      }
      for (int j = 0; j < buffer.length; ++j) {
        swap(data, i - j - 1, buffer[j]);
      }
    }
    bh.consume(data);
  }
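
As an aside, the benchmark loop stops once fewer than unroll elements remain, leaving a short prefix of the array unshuffled; that's harmless for measuring throughput, but real code would need to finish the tail. A sketch of a complete version (the plain-swap helper and method names are mine):

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.IntUnaryOperator;

public class BufferedShuffle {

    // Buffered Fisher-Yates: batch-generate swap indices, then batch-apply them,
    // finishing the remainder with the plain per-element loop.
    static void shuffle(int[] data, int[] buffer, IntUnaryOperator op) {
        int unroll = buffer.length;
        int i = data.length;
        for (; i - unroll > 1; i -= unroll) {
            for (int j = 0; j < unroll; ++j) {
                buffer[j] = op.applyAsInt(i - j); // stage 1: generate a batch of indices
            }
            for (int j = 0; j < unroll; ++j) {
                swap(data, i - j - 1, buffer[j]); // stage 2: apply the batch of swaps
            }
        }
        for (; i > 1; i--) {
            swap(data, i - 1, op.applyAsInt(i)); // tail: too few elements left to batch
        }
    }

    static void swap(int[] arr, int i, int j) {
        int tmp = arr[i];
        arr[i] = arr[j];
        arr[j] = tmp;
    }

    public static void main(String[] args) {
        int[] data = new int[1003]; // deliberately not a multiple of the buffer size
        Arrays.setAll(data, i -> i);
        shuffle(data, new int[64], i -> ThreadLocalRandom.current().nextInt(i));
        int[] sorted = data.clone();
        Arrays.sort(sorted);
        for (int i = 0; i < sorted.length; i++) {
            if (sorted[i] != i) throw new AssertionError("not a permutation");
        }
        System.out.println("complete shuffle is a permutation");
    }
}
```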


There’s not much to be gained (or lost) from this until the array gets quite large, but it’s a relatively interesting outcome. CPI is on the whole improved, and throughput improves as a function of buffer size, so long as the buffer is small.

Mode                 Benchmark / buffer size        16        32        64       128       256
PRECOMPUTED          shuffle                   0.30639  0.296566  0.309829  0.312449  0.311183
PRECOMPUTED          shuffle:CPI              3.004183  3.126903  2.989748  2.987508  3.000369
THREAD_LOCAL_RANDOM  shuffle                  0.271536  0.266418  0.271858  0.265593  0.264507
THREAD_LOCAL_RANDOM  shuffle:CPI              1.303454  1.328127  1.300731   1.32857  1.377559
THREAD_LOCAL_RANDOM  shuffleBuffered          0.296098  0.324416  0.346934  0.353246   0.35277
THREAD_LOCAL_RANDOM  shuffleBuffered:CPI       0.96738  0.937101  0.893673   0.87786  0.874607

Frankly, I can't think of anything less interesting than scrambling an array, but this observation made me think there may be something in the idea of decoupling stages of work in certain scenarios. After going through the otherwise pointless exercise above, I decided it might be worthwhile to port a batch iteration feature from the Go version of RoaringBitmap to Java. This turned out to be hugely profitable, speeding up iteration by between 2x and 10x. If you use RoaringBitmap, it might be worth switching to these new batch iterators.
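
The RoaringBitmap API itself isn't shown here, but the toy sketch below applies the same decoupling idea to iterating the set bits of a plain long[] bitmap: instead of yielding one bit per call, it decodes positions into a caller-supplied buffer in batches, separating the decoding stage from whatever consumes the values. All names are mine.

```java
import java.util.Arrays;

// A toy batch iterator over the set bits of a long[] bitmap.
public class ToyBatchIterator {
    private final long[] words;
    private int wordIndex = 0;
    private long current;

    public ToyBatchIterator(long[] words) {
        this.words = words;
        this.current = words.length > 0 ? words[0] : 0L;
    }

    /** Fills buffer with the next set-bit positions; returns how many were written. */
    public int nextBatch(int[] buffer) {
        int count = 0;
        while (count < buffer.length && wordIndex < words.length) {
            while (current != 0 && count < buffer.length) {
                int bit = Long.numberOfTrailingZeros(current);
                buffer[count++] = wordIndex * 64 + bit;
                current &= current - 1; // clear the lowest set bit
            }
            if (current == 0 && ++wordIndex < words.length) {
                current = words[wordIndex];
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long[] bitmap = new long[2];
        bitmap[0] |= 1L << 3;   // bit 3
        bitmap[0] |= 1L << 40;  // bit 40
        bitmap[1] |= 1L << 1;   // bit 65
        ToyBatchIterator it = new ToyBatchIterator(bitmap);
        int[] buffer = new int[2];
        int n = it.nextBatch(buffer);
        System.out.println(n + " " + Arrays.toString(buffer)); // 2 [3, 40]
        n = it.nextBatch(buffer);
        System.out.println(n + " " + buffer[0]); // 1 65
    }
}
```

The consumer only ever touches a dense int[] of positions, so the decode loop and the consuming loop each stay simple and amenable to the kind of instruction-level parallelism discussed above.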

This topic is explored in more depth by Daniel Lemire in Fast Random Integer Generation in an Interval.