Observing Memory Level Parallelism with JMH
Quite some time ago I observed an effect where breaking a cache-inefficient shuffle algorithm into short stages could improve throughput: when cache misses were likely, throughput improved with stage length. The implementations benchmarked were as follows, where op is either precomputed (a closure over an array of indices to swap) or a call to ThreadLocalRandom; a sketch of the benchmark state these methods close over follows the listings.
@Benchmark
public void shuffle(Blackhole bh) {
    for (int i = data.length; i > 1; i--)
        swap(data, i - 1, op.applyAsInt(i));
    bh.consume(data);
}

@Benchmark
public void shuffleBuffered(Blackhole bh) {
    for (int i = data.length; i - unroll > 1; i -= unroll) {
        for (int j = 0; j < buffer.length; ++j) {
            buffer[j] = op.applyAsInt(i - j);
        }
        for (int j = 0; j < buffer.length; ++j) {
            swap(data, i - j - 1, buffer[j]);
        }
    }
    bh.consume(data);
}
// swap without a temporary using XOR; note that an XOR swap zeroes the element
// whenever i == j, so the caller must never request a self-swap
private static void swap(int[] arr, int i, int j) {
    arr[i] ^= arr[j];
    arr[j] ^= arr[i];
    arr[i] ^= arr[j];
}
I couldn’t explain the observed effect in terms of the default performance counters made available by JMH, but offered the intuitive explanation that the cost of a cache miss could be shared between four independent chains of execution, so that a cache miss in one chain would not stall the others. This intuition was gleaned from perfasm output: I guessed the bottleneck in this load was cache misses. In the simple shuffle, there was one big hit:
73.97% 63.58% │ ││ 0x00007f8405c0a272: mov %eax,0xc(%r11,%rcx,4)
Executing the staged shuffle, I saw several smaller bottlenecks and could only guess that the simpler code within each stage had more parallelism: these cache misses were happening at the same time, independently of one another.
10.33% 11.23% ││ 0x00007fdb35c09250: mov %r9d,0xc(%rsi,%r10,4)
...
10.40% 10.66% ││ 0x00007fdb35c09283: mov %r9d,0x8(%rsi,%r10,4)
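For reference, annotated assembly like this comes from JMH's perfasm profiler (-prof perfasm); the two percentage columns are the shares of the sampled events, typically cycles and instructions, attributed to each instruction.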
Since then, I have found out about the counters l1d_pend_miss.pending and l1d_pend_miss.pending_cycles. What do these counters mean? Many performance counter descriptions are infuriating, and the one for l1d_pend_miss.pending especially so:
"This event counts duration of L1D miss outstanding, that is each cycle number of Fill Buffers (FB) outstanding required by Demand Reads. FB either is held by demand loads, or it is held by non-demand loads and gets hit at least once by demand. The valid outstanding interval is defined until the FB deallocation by one of the following ways: from FB allocation, if FB is allocated by demand; from the demand Hit FB, if it is allocated by hardware or software prefetch. Note: In the L1D, a Demand Read contains cacheable or noncacheable demand loads, including ones causing cache-line splits and reads due to page walks resulted from any request type." (source)
In the words of the Virgin Mary, come again? There is clarity in the terseness of the description in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
Counter | Description |
---|---|
L1D_PEND_MISS.PENDING | Increments the number of outstanding L1D misses every cycle. |
L1D_PEND_MISS.PENDING_CYCLES | Cycles with at least one outstanding L1D misses from this logical processor. |
The first counter records how many loads from beyond L1 are in flight during each cycle (that is, how many L1 cache misses are outstanding right now): it increases in every cycle with at least one outstanding miss, by the number of misses outstanding, until the loads complete. The second counter records how many cycles had at least one outstanding cache miss in flight. If misses overlap, the first counter can increase by more than one per cycle; if at least some work is done without an outstanding cache miss, the second counter will be less than the total number of cycles; and if two or more cache misses are outstanding at the same time, the second counter takes a smaller value than it would had the misses occurred sequentially. Therefore their ratio, l1d_pend_miss.pending / l1d_pend_miss.pending_cycles, indicates how much memory level parallelism exists, that is, to what extent loads take place at the same time.
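As a toy example with made-up numbers: suppose that over an interval of 1,000 cycles, 800 cycles have at least one L1D miss outstanding, and that summing the number of outstanding misses over those cycles gives 3,200. Then l1d_pend_miss.pending = 3200, l1d_pend_miss.pending_cycles = 800, and the ratio 3200 / 800 = 4 means that whenever any miss is in flight, four misses are in flight on average.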
Can this be measured in JMH with the perfnorm profiler? Yes: I couldn’t find any documentation for it, but reverse engineered the following from the LinuxPerfNormProfiler source code:
-prof perfnorm:events=l1d_pend_miss.pending,l1d_pend_miss.pending_cycles
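For instance, a full run keeping cycles and instructions alongside the new counters might look something like the sketch below; the jar and benchmark names are placeholders.

java -jar benchmarks.jar Shuffle \
  -prof perfnorm:events=cycles,instructions,l1d_pend_miss.pending,l1d_pend_miss.pending_cycles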
Note that this argument will override the standard events, so CPI, cycles and so on need to be added at the command line explicitly, which is why the example above lists cycles and instructions as well. Now the hypothetical parallel cache misses can be quantified. The figures for shuffle (the implementation without any staging) are reassuringly flat as a function of stage length, whereas a clear positive trend can be seen in both throughput and MLP ratio for the staged implementation.
Mode | Benchmark | unroll = 8 | unroll = 16 | unroll = 32 | unroll = 64 |
---|---|---|---|---|---|
PRECOMPUTED | shuffle | 0.347 | 0.352 | 0.345 | 0.37 |
PRECOMPUTED | shuffle:l1d_pend_miss.pending | 17390603073 | 17718936860 | 15979073823 | 20057689191 |
PRECOMPUTED | shuffle:l1d_pend_miss.pending_cycles | 3657159215 | 3706319384 | 3489256633 | 3930306563 |
PRECOMPUTED | shuffle:ratio | 4.76 | 4.78 | 4.58 | 5.10 |
THREAD_LOCAL_RANDOM | shuffle | 0.217 | 0.233 | 0.231 | 0.214 |
THREAD_LOCAL_RANDOM | shuffle:l1d_pend_miss.pending | 18246771955 | 17801360193 | 17736302365 | 19638836068 |
THREAD_LOCAL_RANDOM | shuffle:l1d_pend_miss.pending_cycles | 7280468758 | 7093396781 | 7086435578 | 7843415714 |
THREAD_LOCAL_RANDOM | shuffle:ratio | 2.51 | 2.51 | 2.50 | 2.50 |
THREAD_LOCAL_RANDOM | shuffleBuffered | 0.248 | 0.307 | 0.326 | 0.345 |
THREAD_LOCAL_RANDOM | shuffleBuffered:l1d_pend_miss.pending | 21899069718 | 23064517091 | 23320550954 | 22387833224 |
THREAD_LOCAL_RANDOM | shuffleBuffered:l1d_pend_miss.pending_cycles | 6203611528 | 5021906699 | 4539979273 | 4132226201 |
THREAD_LOCAL_RANDOM | shuffleBuffered:ratio | 3.53 | 4.59 | 5.14 | 5.42 |
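The ratio rows are just l1d_pend_miss.pending divided by l1d_pend_miss.pending_cycles; for example, the first PRECOMPUTED column is 17390603073 / 3657159215 ≈ 4.76.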