Observing Memory Level Parallelism with JMH

Quite some time ago I observed an effect where breaking a cache-inefficient shuffle algorithm into short stages could improve throughput: when cache misses were likely, an improvement could be seen in throughput as a function of stage length. The implementations benchmarked were as follows, where op is either precomputed (a closure over an array of indices to swap) or a call to ThreadLocalRandom:

  @Benchmark
  public void shuffle(Blackhole bh) {
    for (int i = data.length; i > 1; i--)
      swap(data, i - 1, op.applyAsInt(i));
    bh.consume(data);
  }

  @Benchmark
  public void shuffleBuffered(Blackhole bh) {
    for (int i = data.length; i - unroll > 1; i -= unroll) {
      for (int j = 0; j < buffer.length; ++j) {
        buffer[j] = op.applyAsInt(i - j);
      }
      for (int j = 0; j < buffer.length; ++j) {
        swap(data, i - j - 1, buffer[j]);
      }
    }
    bh.consume(data);
  }

  private static void swap(int[] arr, int i, int j) {
    arr[i] ^= arr[j];
    arr[j] ^= arr[i];
    arr[i] ^= arr[j];
  }

I couldn’t explain the effect observed in terms of the default performance counters made available by JMH, but offered an intuitive explanation that the cache miss could be shared between four independent chains of execution so that cache misses in a given chain would not stall the others. This intuition was gleaned from perfasm: I guessed the bottleneck on this load was due to cache misses. In the simple shuffle, there was one big hit:

 73.97%   63.58%  │      ││  0x00007f8405c0a272: mov    %eax,0xc(%r11,%rcx,4)

Executing the staged shuffle, I saw several smaller bottlenecks and could only guess the simpler code within each stage had more parallelism; these cache misses were happening at the same time and independently.

 10.33%   11.23%   ││  0x00007fdb35c09250: mov    %r9d,0xc(%rsi,%r10,4)  
 ...  
 10.40%   10.66%   ││  0x00007fdb35c09283: mov    %r9d,0x8(%rsi,%r10,4)

Since then, I found out about the counters l1d_pend_miss.pending and l1d_pend_miss.pending_cycles. What do these counters mean? Many descriptions for counters are infuriating, l1d_pend_miss.pending especially so:

"This event counts duration of L1D miss outstanding, that is each cycle number of Fill Buffers (FB) outstanding required by Demand Reads. FB either is held by demand loads, or it is held by non-demand loads and gets hit at least once by demand. The valid outstanding interval is defined until the FB deallocation by one of the following ways: from FB allocation, if FB is allocated by demand; from the demand Hit FB, if it is allocated by hardware or software prefetch. Note: In the L1D, a Demand Read contains cacheable or noncacheable demand loads, including ones causing cache-line splits and reads due to page walks resulted from any request type." (source)

In the words of the Virgin Mary, come again? There is clarity in the terseness of the description in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.

L1D_PEND_MISS.PENDING	Increments the number of outstanding L1D misses every cycle.
L1D_PEND_MISS.PENDING_CYCLES	Cycles with at least one outstanding L1D misses from this logical processor.

The first counter records how many loads from non L1 memory locations are in flight during a cycle (that is, how many L1 cache misses are happening right now), increasing whenever there is at least one cache miss happening, and increasing until the load is complete. The second counter records how many cycles have some kind of outstanding cache miss in flight during the cycle. If there’s pipelining taking place, the first counter can increase by more than one per cycle, and if at least some work is done without experiencing a cache miss, the second counter will be less than the total number of cycles, and if there are two or more cache misses outstanding at the same time, the counter will take a smaller value than if the cache misses had taken place sequentially. Therefore, their ratio l1d_pend_miss.pending / l1d_pend_miss.pending_cyclesindicates how much memory level parallelism exists, that is, to what extent loads take place at the same time.

Can this be measured in JMH with the perfnorm profiler? Yes, I couldn’t find any documentation for it but reverse engineered this from the LinuxPerfNormProfiler source code:

-prof perfnorm:events=l1d_pend_miss.pending,l1d_pend_miss.pending_cycles

Note that this argument will override the standard events, so CPI, cycles and so on need to be added at the command line explicitly. Now the hypothetical parallel cache misses can be quantified. The figures for shuffle (the implementation without any staging) is reassuringly flat as a function of stage length, whereas a clear positive trend can be seen in both the throughput and MLP ratio for the staged implementation.

Mode	Benchmark	8	16	32	64
PRECOMPUTED	shuffle	0.347	0.352	0.345	0.37
PRECOMPUTED	shuffle:l1d_pend_miss.pending	17390603073	17718936860	15979073823	20057689191
PRECOMPUTED	shuffle:l1d_pend_miss.pending_cycles	3657159215	3706319384	3489256633	3930306563
PRECOMPUTED	shuffle:ratio	4.76	4.78	4.58	5.10
THREAD_LOCAL_RANDOM	shuffle	0.217	0.233	0.231	0.214
THREAD_LOCAL_RANDOM	shuffle:l1d_pend_miss.pending	18246771955	17801360193	17736302365	19638836068
THREAD_LOCAL_RANDOM	shuffle:l1d_pend_miss.pending_cycles	7280468758	7093396781	7086435578	7843415714
THREAD_LOCAL_RANDOM	shuffle:ratio	2.51	2.51	2.50	2.50
THREAD_LOCAL_RANDOM	shuffleBuffered	0.248	0.307	0.326	0.345
THREAD_LOCAL_RANDOM	shuffleBuffered:l1d_pend_miss.pending	21899069718	23064517091	23320550954	22387833224
THREAD_LOCAL_RANDOM	shuffleBuffered:l1d_pend_miss.pending_cycles	6203611528	5021906699	4539979273	4132226201
THREAD_LOCAL_RANDOM	shuffleBuffered:ratio	3.53	4.59	5.14	5.42