Quite some time ago I observed that breaking a cache-inefficient shuffle algorithm into short stages could improve throughput: when cache misses were likely, throughput increased as a function of stage length. The implementations benchmarked were as follows, where op is either precomputed (a closure over an array of indices to swap) or a call to ThreadLocalRandom:

  // simple shuffle: one swap per iteration, using the index from op directly
  @Benchmark
  public void shuffle(Blackhole bh) {
    for (int i = data.length; i > 1; i--)
      swap(data, i - 1, op.applyAsInt(i));
    bh.consume(data);
  }

  // staged shuffle: generate a stage of swap indices into buffer, then apply them
  @Benchmark
  public void shuffleBuffered(Blackhole bh) {
    for (int i = data.length; i - unroll > 1; i -= unroll) {
      for (int j = 0; j < buffer.length; ++j) {
        buffer[j] = op.applyAsInt(i - j);
      }
      for (int j = 0; j < buffer.length; ++j) {
        swap(data, i - j - 1, buffer[j]);
      }
    }
    bh.consume(data);
  }

  // XOR swap, avoiding a temporary (note: it zeroes the element when i == j)
  private static void swap(int[] arr, int i, int j) {
    arr[i] ^= arr[j];
    arr[j] ^= arr[i];
    arr[i] ^= arr[j];
  }
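
For reference, the two flavours of op might be wired up roughly as below. This is only a sketch: the class and method names are assumptions rather than the original benchmark code, and the PRECOMPUTED variant simply closes over indices generated before the measured loop.

  import java.util.concurrent.ThreadLocalRandom;
  import java.util.function.IntUnaryOperator;

  public class SwapIndexOps {

    // PRECOMPUTED: generate all of the swap indices up front and close over them,
    // so the shuffle loop itself does no RNG work
    static IntUnaryOperator precomputed(int length) {
      int[] indices = new int[length];
      for (int i = length; i > 1; i--) {
        indices[i - 1] = ThreadLocalRandom.current().nextInt(i);
      }
      return i -> indices[i - 1];
    }

    // THREAD_LOCAL_RANDOM: generate each swap index on the fly inside the loop
    static IntUnaryOperator threadLocalRandom() {
      return i -> ThreadLocalRandom.current().nextInt(i);
    }
  }

Either operator satisfies the op.applyAsInt(i) call in the benchmarks above, returning a swap index in [0, i) for position i - 1.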

I couldn’t explain the observed effect in terms of the default performance counters JMH makes available, but offered the intuitive explanation that the cost of a cache miss could be shared between four independent chains of execution, so that a cache miss in one chain would not stall the others. This intuition was gleaned from perfasm: I guessed that the bottleneck on this load was cache misses. In the simple shuffle, there was one big hit:

 73.97%   63.58%  │      ││  0x00007f8405c0a272: mov    %eax,0xc(%r11,%rcx,4) 

Executing the staged shuffle, I saw several smaller bottlenecks and could only guess that the simpler code within each stage had more parallelism: the cache misses were happening at the same time, independently of one another.

 10.33%   11.23%   ││  0x00007fdb35c09250: mov    %r9d,0xc(%rsi,%r10,4)  
 ...  
 10.40%   10.66%   ││  0x00007fdb35c09283: mov    %r9d,0x8(%rsi,%r10,4) 

Since then, I have found out about the counters l1d_pend_miss.pending and l1d_pend_miss.pending_cycles. What do these counters mean? Many counter descriptions are infuriating, and that of l1d_pend_miss.pending especially so:

"This event counts duration of L1D miss outstanding, that is each cycle number of Fill Buffers (FB) outstanding required by Demand Reads. FB either is held by demand loads, or it is held by non-demand loads and gets hit at least once by demand. The valid outstanding interval is defined until the FB deallocation by one of the following ways: from FB allocation, if FB is allocated by demand; from the demand Hit FB, if it is allocated by hardware or software prefetch. Note: In the L1D, a Demand Read contains cacheable or noncacheable demand loads, including ones causing cache-line splits and reads due to page walks resulted from any request type." (source)

In the words of the Virgin Mary, come again? There is clarity in the terseness of the description in the Intel® 64 and IA-32 Architectures Software Developer’s Manual:

L1D_PEND_MISS.PENDING          Increments the number of outstanding L1D misses every cycle.
L1D_PEND_MISS.PENDING_CYCLES   Cycles with at least one outstanding L1D miss from this logical processor.

The first counter records how many loads from outside L1 are in flight during a cycle, that is, how many L1 cache misses are outstanding right now; it increases every cycle by the number of misses currently in flight, for as long as those loads remain incomplete. The second counter records how many cycles had at least one outstanding L1D miss in flight. If loads are pipelined, the first counter can increase by more than one per cycle; if at least some work is done without experiencing a cache miss, the second counter will be less than the total number of cycles; and if two or more cache misses are outstanding at the same time, the second counter will be smaller than if those misses had taken place sequentially. Therefore, the ratio l1d_pend_miss.pending / l1d_pend_miss.pending_cycles indicates how much memory level parallelism exists, that is, to what extent loads take place at the same time.
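
As a hypothetical worked example (invented numbers, purely for illustration): suppose 16 loads each miss L1 and each miss takes 100 cycles to resolve. Whether the misses are serialised or overlapped, pending accumulates the same total, but pending_cycles does not:

  pending        = 16 * 100 = 1600             in both cases
  pending_cycles = 1600 (fully serialised)     vs.  100 (fully overlapped)
  ratio          = 1600 / 1600 = 1             vs.  1600 / 100 = 16

A ratio near 1 means the misses are essentially serialised; a larger ratio means more of them are being serviced at the same time.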

Can this be measured in JMH with the perfnorm profiler? Yes: I couldn’t find any documentation for it, but reverse engineered the following from the LinuxPerfNormProfiler source code:

-prof perfnorm:events=l1d_pend_miss.pending,l1d_pend_miss.pending_cycles
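
For context, a full invocation might look something like the line below; the jar path and benchmark selector are placeholders rather than the original command:

  java -jar target/benchmarks.jar ".*Shuffle.*" \
    -prof perfnorm:events=l1d_pend_miss.pending,l1d_pend_miss.pending_cycles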

Note that this argument overrides the standard events, so CPI, cycles and so on need to be added on the command line explicitly. Now the hypothetical parallel cache misses can be quantified. The figures for shuffle (the implementation without any staging) are reassuringly flat as a function of stage length, whereas a clear positive trend can be seen in both throughput and MLP ratio for the staged implementation.

Mode                 Benchmark                                     8            16           32           64
PRECOMPUTED          shuffle                                       0.347        0.352        0.345        0.37
PRECOMPUTED          shuffle:l1d_pend_miss.pending                 17390603073  17718936860  15979073823  20057689191
PRECOMPUTED          shuffle:l1d_pend_miss.pending_cycles          3657159215   3706319384   3489256633   3930306563
PRECOMPUTED          shuffle:ratio                                 4.76         4.78         4.58         5.10
THREAD_LOCAL_RANDOM  shuffle                                       0.217        0.233        0.231        0.214
THREAD_LOCAL_RANDOM  shuffle:l1d_pend_miss.pending                 18246771955  17801360193  17736302365  19638836068
THREAD_LOCAL_RANDOM  shuffle:l1d_pend_miss.pending_cycles          7280468758   7093396781   7086435578   7843415714
THREAD_LOCAL_RANDOM  shuffle:ratio                                 2.51         2.51         2.50         2.50
THREAD_LOCAL_RANDOM  shuffleBuffered                               0.248        0.307        0.326        0.345
THREAD_LOCAL_RANDOM  shuffleBuffered:l1d_pend_miss.pending         21899069718  23064517091  23320550954  22387833224
THREAD_LOCAL_RANDOM  shuffleBuffered:l1d_pend_miss.pending_cycles  6203611528   5021906699   4539979273   4132226201
THREAD_LOCAL_RANDOM  shuffleBuffered:ratio                         3.53         4.59         5.14         5.42
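
The ratio rows are simply the pending counter divided by the pending_cycles counter in the rows above. For example, for shuffleBuffered at stage length 64:

  22387833224 / 4132226201 ≈ 5.42

so at the longest stage length the staged shuffle keeps more than five L1D misses in flight, on average, during cycles in which any miss is outstanding, compared with 2.50 for the unstaged ThreadLocalRandom shuffle.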