Garbage Collectors Affect Microbenchmarks
When comparing garbage collectors there are two key metrics: how much time is spent collecting garbage, and the maximum pause time. There is another dimension to the choice of garbage collector, though: how it instruments JIT-compiled code and what that instrumentation costs. The cost is usually a tiny price to pay for improved pause times, which only matter to some applications, but it makes writing benchmarks for code which assigns and reads references potentially error prone: sometimes the effect of changing the garbage collector is larger than the difference between two competing implementations. To illustrate this, I compare a microbenchmark for a document cursor under three garbage collectors: ParallelOld (the default in OpenJDK 8), G1 (the default from OpenJDK 9 onwards) and the experimental ZGC, available from JDK 11 onwards.
The code being benchmarked is simple. Imagine a stream of JSON-like documents which need to be translated into another format. The documents contain a special field called the cursor, for which, for some reason, the last-encountered value must always be known. There is a callback which will be invoked whenever a value of a certain type is encountered (e.g. `writeLong(long value)`) and a callback which will be invoked whenever the name of an attribute is encountered: `writeName(String name)`. The interface being implemented cannot be changed to include a method `writeLong(String name, long value)` because it is owned by a third party, so state must be saved between the two invocations. On each invocation of the `writeName` callback, we could save the name in the cursor object.
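The `CursoredScanner` interface itself isn't reproduced here; a minimal sketch consistent with the callbacks just described (the exact third-party interface is an assumption) would be:

```java
// A minimal sketch of the third-party interface described above: these are the
// callbacks the implementations below provide. The exact interface is assumed.
public interface CursoredScanner {
  void writeName(String name); // invoked whenever an attribute name is encountered
  void writeLong(long value);  // invoked whenever a long value is encountered
  long getCursor();            // the last value written under the trigger name
}
```

The first implementation, `CursoredScanner2`, saves the last name and compares it against the trigger whenever a value arrives: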
```java
public class CursoredScanner2 implements CursoredScanner {

  private final String trigger;
  private String current;
  private long cursor;

  public CursoredScanner2(String trigger) {
    this.trigger = trigger;
  }

  @Override
  public void writeName(String name) {
    this.current = name;
  }

  @Override
  public void writeLong(long value) {
    if (trigger.equals(current)) {
      this.cursor = value;
    }
  }

  @Override
  public long getCursor() {
    return cursor;
  }
}
```
Alternatively, we could do the same number of comparisons by storing whether the last name was the name of the cursor or not:
```java
public class CursoredScanner1 implements CursoredScanner {

  private final String trigger;
  private boolean atCursor;
  private long cursor;

  public CursoredScanner1(String trigger) {
    this.trigger = trigger;
  }

  @Override
  public void writeName(String name) {
    this.atCursor = trigger.equals(name);
  }

  @Override
  public void writeLong(long value) {
    if (atCursor) {
      this.cursor = value;
    }
  }

  @Override
  public long getCursor() {
    return cursor;
  }
}
```
Each implementation performs the same number of string comparisons. Supposing performance matters, how should we choose between them? I wrote a benchmark which captures the cursor value from documents of varying sizes, and ran it on JDK 11 with different garbage collector settings. With ParallelOld, I saw that `CursoredScanner2` was slightly slower.
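The benchmark itself isn't reproduced here; assuming JMH, a sketch of the kind of benchmark described, with parameter names chosen to match the result tables below (the document generation is illustrative, not the original code), might look like this:

```java
import org.openjdk.jmh.annotations.*;

// Illustrative sketch only: scan a synthetic "document" of `size` name/value
// pairs and capture the cursor, parameterised by scanner implementation.
@State(Scope.Thread)
public class CursoredScannerBenchmark {

  @Param({"SCANNER1", "SCANNER2"})
  String scannerType;

  @Param({"10", "100"})
  int size;

  @Param({"trigger1"})
  String triggerName;

  private String[] names;
  private CursoredScanner scanner;

  @Setup(Level.Trial)
  public void setup() {
    scanner = "SCANNER1".equals(scannerType)
        ? new CursoredScanner1(triggerName)
        : new CursoredScanner2(triggerName);
    names = new String[size];
    for (int i = 0; i < size; ++i) {
      // make sure the trigger name actually occurs in the document
      names[i] = i % 10 == 0 ? triggerName : "attribute" + i;
    }
  }

  @Benchmark
  public long scan() {
    for (int i = 0; i < names.length; ++i) {
      scanner.writeName(names[i]);
      scanner.writeLong(i);
    }
    return scanner.getCursor();
  }
}
```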
```
-XX:+UseCondCardMark -XX:+UseParallelOldGC -mx1G -XX:+AlwaysPreTouch
```
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: scannerType | Param: size | Param: triggerName |
---|---|---|---|---|---|---|---|---|---|
scan | thrpt | 1 | 15 | 58.081438 | 1.008727 | ops/us | SCANNER1 | 10 | trigger1 |
scan | thrpt | 1 | 15 | 6.586134 | 0.173920 | ops/us | SCANNER1 | 100 | trigger1 |
scan | thrpt | 1 | 15 | 49.402537 | 0.943554 | ops/us | SCANNER2 | 10 | trigger1 |
scan | thrpt | 1 | 15 | 5.248657 | 0.135281 | ops/us | SCANNER2 | 100 | trigger1 |
The cost here can be attributed to the card marking which keeps the approximation of inter-generational references up to date when references are assigned (see here). Because `CursoredScanner1` avoids assigning the reference, the garbage collector doesn't need to instrument anything at all: the object graph isn't being mutated.
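To make that cost concrete, here is a toy model of what a conditional card-marking post-write barrier does; this is a sketch of the mechanism, not Hotspot's actual code, and the 512-byte card size is the usual Hotspot default.

```java
// Toy model of a conditional card-marking post-write barrier (what
// -XX:+UseCondCardMark asks for): after every reference store, the card
// covering the written object is dirtied so the collector can find
// old-to-young pointers by scanning dirty cards rather than the whole old
// generation. The conditional check avoids redundant writes to the card
// table, which can cause false sharing between threads.
final class CardMarkingModel {
  static final int CARD_SHIFT = 9; // 2^9 = 512-byte cards (Hotspot default)
  static final byte DIRTY = 0;
  static final byte CLEAN = 1;

  final byte[] cardTable;

  CardMarkingModel(long heapSizeInBytes) {
    cardTable = new byte[(int) (heapSizeInBytes >>> CARD_SHIFT) + 1];
    java.util.Arrays.fill(cardTable, CLEAN);
  }

  // conceptually emitted by the JIT after each reference field assignment
  void postWriteBarrier(long addressOfHolder) {
    int card = (int) (addressOfHolder >>> CARD_SHIFT);
    if (cardTable[card] != DIRTY) { // the "Cond" in UseCondCardMark
      cardTable[card] = DIRTY;
    }
  }
}
```

`CursoredScanner2` pays for one of these barriers on every `writeName` call; `CursoredScanner1` stores only a boolean and a long, so no barrier is emitted.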
G1 offers significant reductions in maximum pause times by structuring the heap and tracking references differently, but it also instruments reference assignments to keep its book-keeping data structures up to date. The effect is pronounced in this benchmark; the barrier shows up in the profile, with some skid implicating the innocent adjacent instruction.
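For a sense of why G1's barrier costs more, here is a similarly rough model of its post-write barrier; this is heavily simplified, not Hotspot code, and the region size is just an assumed example.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Heavily simplified model of G1's post-write barrier: beyond dirtying a card,
// G1 filters out stores that cannot create cross-region references and
// enqueues the dirty card so that refinement can update the remembered set of
// the region the reference points into.
final class G1PostWriteBarrierModel {
  static final int LOG_REGION_SIZE = 21; // e.g. 2 MB regions (assumed example)
  static final int CARD_SHIFT = 9;       // 512-byte cards

  final byte[] cardTable;
  final Queue<Integer> dirtyCardQueue = new ArrayDeque<>(); // refined concurrently in real G1

  G1PostWriteBarrierModel(long heapSizeInBytes) {
    cardTable = new byte[(int) (heapSizeInBytes >>> CARD_SHIFT) + 1];
    java.util.Arrays.fill(cardTable, (byte) 1); // 1 = clean in this model
  }

  void postWriteBarrier(long addressOfHolder, long addressOfValue) {
    // stores within one region, and null stores, cannot create cross-region references
    if (((addressOfHolder ^ addressOfValue) >>> LOG_REGION_SIZE) == 0 || addressOfValue == 0) {
      return;
    }
    int card = (int) (addressOfHolder >>> CARD_SHIFT);
    if (cardTable[card] == 0) {
      return; // already dirty, nothing more to do
    }
    cardTable[card] = 0;      // dirty the card
    dirtyCardQueue.add(card); // hand it off for remembered set maintenance
  }
}
```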
```
-XX:+UseG1GC -mx1G -XX:+AlwaysPreTouch
```
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: scannerType | Param: size | Param: triggerName |
---|---|---|---|---|---|---|---|---|---|
scan | thrpt | 1 | 15 | 62.633572 | 0.995514 | ops/us | SCANNER1 | 10 | trigger1 |
scan | thrpt | 1 | 15 | 7.660122 | 0.231402 | ops/us | SCANNER1 | 100 | trigger1 |
scan | thrpt | 1 | 15 | 23.833586 | 0.379903 | ops/us | SCANNER2 | 10 | trigger1 |
scan | thrpt | 1 | 15 | 2.757419 | 0.148344 | ops/us | SCANNER2 | 100 | trigger1 |
What about ZGC, one of the two upcoming ultra-low-pause garbage collectors? I can't pretend to understand in detail how ZGC works (beyond what I can glean from profilers), but suffice it to say: it works differently, and instruments application code differently. Rather than intercepting reference assignments, it seems to intercept reads. It's not clear why both implementations perform slightly worse than `CursoredScanner1` does with G1 or ParallelOld, but there's not much to choose between the two when using ZGC.
```
-XX:+UnlockExperimentalVMOptions -XX:+UseZGC -mx1G -XX:+AlwaysPreTouch
```
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: scannerType | Param: size | Param: triggerName |
---|---|---|---|---|---|---|---|---|---|
scan | thrpt | 1 | 15 | 43.761915 | 1.160516 | ops/us | SCANNER1 | 10 | trigger1 |
scan | thrpt | 1 | 15 | 6.190803 | 0.101114 | ops/us | SCANNER1 | 100 | trigger1 |
scan | thrpt | 1 | 15 | 39.080922 | 0.826591 | ops/us | SCANNER2 | 10 | trigger1 |
scan | thrpt | 1 | 15 | 4.763075 | 0.126938 | ops/us | SCANNER2 | 100 | trigger1 |
Am I making a case for using ParallelOld and avoiding reference assignments because the throughput is slightly better? Not really: while that may be appropriate in some applications, the point is that unless a benchmark focuses exclusively on primitive types, the garbage collector has to be considered, and results need to be qualified by this choice. It would be very hard to choose between these implementations without knowing in advance which garbage collector will be in use.
As an aside, this is the first time I have run ZGC, so I'm keen to track down the read barrier I have heard about. It looks like the sequence of instructions `mov`, `test`, `jne` occurs on each read:
```
0x00007fe47c765295: mov    0x30(%r10),%r9
0x00007fe47c765299: test   %r9,0x20(%r15)
0x00007fe47c76529d: jne    0x00007fe47c765b68
0x00007fe47c7652a3: mov    0x10(%r9),%r14
0x00007fe47c7652a7: test   %r14,0x20(%r15)
0x00007fe47c7652ab: jne    0x00007fe47c765b76
```
The assembly above can be seen whenever a reference is read, and sets up a (predictable) data dependency between reads: the `mov` instruction must happen before the `test` instruction, which may trigger the `jne`, so the second `mov` depends on the jump and can't be reordered. I wondered what the purpose of this was, whether the data dependency was the means or the end, and what is in `r15`, and found a decent article about this. Aleksey Shipilëv, who writes garbage collectors for a living and is better placed to interpret this output, gave some feedback: in Hotspot, `r15` is the base address for thread-local storage. Here, `r15 + 0x20` is ZGC's so-called bad mask, and failing the test means that the object needs to be marked or relocated. Neither marking nor relocation actually shows up in this profile because not enough garbage is generated to trigger either, so the code at `0x00007fe47c765b68` can't be seen. If the test passes, nothing needs to happen and the next reference is read (and intercepted itself). What jumps out at me here is the data dependency, but there's also no obvious bottleneck in the profile.
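Putting that explanation together, the barrier can be modelled roughly as follows; this is a conceptual sketch of the mechanism described above, not ZGC's actual code, and the names are mine.

```java
// Conceptual model of the ZGC load barrier shown in the assembly above: each
// loaded reference carries colour bits which are tested against a per-thread
// bad mask (the value at r15 + 0x20). A "good" reference falls through; a
// "bad" one takes the slow path, where the object is marked or relocated and
// the field is healed so subsequent loads stay on the fast path.
final class ZgcLoadBarrierModel {
  long badMask; // flips between collection phases in real ZGC

  // stands in for the mov/test/jne sequence emitted at each reference load
  long loadReference(long[] holder, int fieldIndex) {
    long reference = holder[fieldIndex];                    // mov: load the reference
    if ((reference & badMask) != 0) {                       // test against the bad mask
      reference = slowPath(holder, fieldIndex, reference);  // jne: out-of-line stub
    }
    return reference;
  }

  private long slowPath(long[] holder, int fieldIndex, long reference) {
    // mark or relocate the object as required, then heal the field with the
    // remapped reference; the remapping itself is elided in this model.
    long healed = reference & ~badMask;
    holder[fieldIndex] = healed;
    return healed;
  }
}
```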