Garbage Collectors Affect Microbenchmarks
When comparing garbage collectors there are two key metrics: how much time is spent collecting garbage, and the maximum pause time. There is another dimension to the choice of garbage collector, though: how it instruments JIT-compiled code and what that instrumentation costs. The cost is usually a tiny price to pay for improved pause times, which only matter to some applications, but it makes writing benchmarks for code which assigns and reads references potentially error prone: sometimes the effect of changing the garbage collector is larger than the difference between two competing implementations. To illustrate this, I compare a microbenchmark for a document cursor under three garbage collectors: ParallelOld (the default in OpenJDK 8), G1 (the default from OpenJDK 9 onwards) and the experimental ZGC, available from JDK 11 onwards.
The code being benchmarked is simple. Imagine a stream of JSON-like documents which need to be translated into another format. The documents contain a special field called the cursor, for which, for some reason, the last-encountered value must always be known. There is a callback which will be invoked whenever a value of a certain type is encountered (e.g. `writeLong(long value)`) and a callback which will be invoked whenever the name of an attribute is encountered: `writeName(String name)`. The interface being implemented cannot be changed to include a method `writeLong(String name, long value)` because it is owned by a third party, so state must be saved between the two invocations. On each invocation of the `writeName` callback, we could save the name in the cursor object.
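The `CursoredScanner` interface itself isn't reproduced here; a minimal sketch consistent with the callbacks just described (the exact third-party interface is an assumption) would be:

```java
// A minimal sketch of the third-party interface described above: these are the
// callbacks the implementations below provide. The exact interface is assumed.
public interface CursoredScanner {
  void writeName(String name); // invoked whenever an attribute name is encountered
  void writeLong(long value);  // invoked whenever a long value is encountered
  long getCursor();            // the last value written under the trigger name
}
```

The first implementation, `CursoredScanner2`, saves the last name and compares it against the trigger whenever a value arrives: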
```java
public class CursoredScanner2 implements CursoredScanner {

  private final String trigger;
  private String current;
  private long cursor;

  public CursoredScanner2(String trigger) {
    this.trigger = trigger;
  }

  @Override
  public void writeName(String name) {
    this.current = name;
  }

  @Override
  public void writeLong(long value) {
    if (trigger.equals(current)) {
      this.cursor = value;
    }
  }

  @Override
  public long getCursor() {
    return cursor;
  }
}
```
Alternatively, we could do the same number of comparisons by storing whether the last name was the name of the cursor or not:
```java
public class CursoredScanner1 implements CursoredScanner {

  private final String trigger;
  private boolean atCursor;
  private long cursor;

  public CursoredScanner1(String trigger) {
    this.trigger = trigger;
  }

  @Override
  public void writeName(String name) {
    this.atCursor = trigger.equals(name);
  }

  @Override
  public void writeLong(long value) {
    if (atCursor) {
      this.cursor = value;
    }
  }

  @Override
  public long getCursor() {
    return cursor;
  }
}
```
Each implementation performs the same number of string comparisons. Supposing performance matters, how should we choose between them? I wrote a benchmark which captures the cursor value from documents of varying sizes, and ran it on JDK 11 with different garbage collector settings. With ParallelOld, I saw that `CursoredScanner2` was slightly slower.
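The benchmark itself isn't reproduced here; assuming JMH, a sketch of the kind of benchmark described, with parameter names chosen to match the result tables below (the document generation is illustrative, not the original code), might look like this:

```java
import org.openjdk.jmh.annotations.*;

// Illustrative sketch only: scan a synthetic "document" of `size` name/value
// pairs and capture the cursor, parameterised by scanner implementation.
@State(Scope.Thread)
public class CursoredScannerBenchmark {

  @Param({"SCANNER1", "SCANNER2"})
  String scannerType;

  @Param({"10", "100"})
  int size;

  @Param({"trigger1"})
  String triggerName;

  private String[] names;
  private CursoredScanner scanner;

  @Setup(Level.Trial)
  public void setup() {
    scanner = "SCANNER1".equals(scannerType)
        ? new CursoredScanner1(triggerName)
        : new CursoredScanner2(triggerName);
    names = new String[size];
    for (int i = 0; i < size; ++i) {
      // make sure the trigger name actually occurs in the document
      names[i] = i % 10 == 0 ? triggerName : "attribute" + i;
    }
  }

  @Benchmark
  public long scan() {
    for (int i = 0; i < names.length; ++i) {
      scanner.writeName(names[i]);
      scanner.writeLong(i);
    }
    return scanner.getCursor();
  }
}
```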
```
-XX:+UseCondCardMark -XX:+UseParallelOldGC -mx1G -XX:+AlwaysPreTouch
```
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: scannerType | Param: size | Param: triggerName |
---|---|---|---|---|---|---|---|---|---|
scan | thrpt | 1 | 15 | 58.081438 | 1.008727 | ops/us | SCANNER1 | 10 | trigger1 |
scan | thrpt | 1 | 15 | 6.586134 | 0.173920 | ops/us | SCANNER1 | 100 | trigger1 |
scan | thrpt | 1 | 15 | 49.402537 | 0.943554 | ops/us | SCANNER2 | 10 | trigger1 |
scan | thrpt | 1 | 15 | 5.248657 | 0.135281 | ops/us | SCANNER2 | 100 | trigger1 |
The cost here can be attributed to the card marking which keeps the approximation of inter-generational references up to date when references are assigned (see here). Because `CursoredScanner1` avoids assigning the reference, the garbage collector doesn't need to instrument anything at all: the object graph isn't being mutated.
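To make that cost concrete, here is a toy model of what a conditional card-marking post-write barrier does; this is a sketch of the mechanism, not Hotspot's actual code, and the 512-byte card size is the usual Hotspot default.

```java
// Toy model of a conditional card-marking post-write barrier (what
// -XX:+UseCondCardMark asks for): after every reference store, the card
// covering the written object is dirtied so the collector can find
// old-to-young pointers by scanning dirty cards rather than the whole old
// generation. The conditional check avoids redundant writes to the card
// table, which can cause false sharing between threads.
final class CardMarkingModel {
  static final int CARD_SHIFT = 9; // 2^9 = 512-byte cards (Hotspot default)
  static final byte DIRTY = 0;
  static final byte CLEAN = 1;

  final byte[] cardTable;

  CardMarkingModel(long heapSizeInBytes) {
    cardTable = new byte[(int) (heapSizeInBytes >>> CARD_SHIFT) + 1];
    java.util.Arrays.fill(cardTable, CLEAN);
  }

  // conceptually emitted by the JIT after each reference field assignment
  void postWriteBarrier(long addressOfHolder) {
    int card = (int) (addressOfHolder >>> CARD_SHIFT);
    if (cardTable[card] != DIRTY) { // the "Cond" in UseCondCardMark
      cardTable[card] = DIRTY;
    }
  }
}
```

`CursoredScanner2` pays for one of these barriers on every `writeName` call; `CursoredScanner1` stores only a boolean and a long, so no barrier is emitted.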
G1 offers significant reductions in maximum pause times by structuring the heap and tracking references differently, but it also instruments reference assignments to keep its book-keeping data structures up to date. The effect is pronounced in this benchmark; the barrier shows up in the profile, with some skid implicating the innocent adjacent instruction.
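For a sense of why G1's barrier costs more, here is a similarly rough model of its post-write barrier; this is heavily simplified, not Hotspot code, and the region size is just an assumed example.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Heavily simplified model of G1's post-write barrier: beyond dirtying a card,
// G1 filters out stores that cannot create cross-region references and
// enqueues the dirty card so that refinement can update the remembered set of
// the region the reference points into.
final class G1PostWriteBarrierModel {
  static final int LOG_REGION_SIZE = 21; // e.g. 2 MB regions (assumed example)
  static final int CARD_SHIFT = 9;       // 512-byte cards

  final byte[] cardTable;
  final Queue<Integer> dirtyCardQueue = new ArrayDeque<>(); // refined concurrently in real G1

  G1PostWriteBarrierModel(long heapSizeInBytes) {
    cardTable = new byte[(int) (heapSizeInBytes >>> CARD_SHIFT) + 1];
    java.util.Arrays.fill(cardTable, (byte) 1); // 1 = clean in this model
  }

  void postWriteBarrier(long addressOfHolder, long addressOfValue) {
    // stores within one region, and null stores, cannot create cross-region references
    if (((addressOfHolder ^ addressOfValue) >>> LOG_REGION_SIZE) == 0 || addressOfValue == 0) {
      return;
    }
    int card = (int) (addressOfHolder >>> CARD_SHIFT);
    if (cardTable[card] == 0) {
      return; // already dirty, nothing more to do
    }
    cardTable[card] = 0;      // dirty the card
    dirtyCardQueue.add(card); // hand it off for remembered set maintenance
  }
}
```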
```
-XX:+UseG1GC -mx1G -XX:+AlwaysPreTouch
```
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: scannerType | Param: size | Param: triggerName |
---|---|---|---|---|---|---|---|---|---|
scan | thrpt | 1 | 15 | 62.633572 | 0.995514 | ops/us | SCANNER1 | 10 | trigger1 |
scan | thrpt | 1 | 15 | 7.660122 | 0.231402 | ops/us | SCANNER1 | 100 | trigger1 |
scan | thrpt | 1 | 15 | 23.833586 | 0.379903 | ops/us | SCANNER2 | 10 | trigger1 |
scan | thrpt | 1 | 15 | 2.757419 | 0.148344 | ops/us | SCANNER2 | 100 | trigger1 |
What about ZGC, one of the two upcoming ultra-low-pause garbage collectors? I can't pretend to understand in detail how ZGC works (beyond what I can glean from profilers), but suffice it to say: it works differently, and instruments application code differently. Rather than intercepting reference assignments, it seems to intercept reads. It's not clear why both implementations perform slightly worse than `CursoredScanner1` does with G1 or ParallelOld, but there's not much to choose between the two when using ZGC.
```
-XX:+UnlockExperimentalVMOptions -XX:+UseZGC -mx1G -XX:+AlwaysPreTouch
```
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: scannerType | Param: size | Param: triggerName |
---|---|---|---|---|---|---|---|---|---|
scan | thrpt | 1 | 15 | 43.761915 | 1.160516 | ops/us | SCANNER1 | 10 | trigger1 |
scan | thrpt | 1 | 15 | 6.190803 | 0.101114 | ops/us | SCANNER1 | 100 | trigger1 |
scan | thrpt | 1 | 15 | 39.080922 | 0.826591 | ops/us | SCANNER2 | 10 | trigger1 |
scan | thrpt | 1 | 15 | 4.763075 | 0.126938 | ops/us | SCANNER2 | 100 | trigger1 |
Am I making a case for using ParallelOld and avoiding reference assignments because the throughput is slightly better? Not really: while that may be appropriate in some applications, the point is that unless a benchmark focuses exclusively on primitive types, the garbage collector has to be considered, and results need to be qualified by this choice. It would be very hard to choose between these implementations without knowing in advance which garbage collector will be in use.
As an aside, this is the first time I have run ZGC, so I'm keen to track down the read barrier I have heard about. It looks like the sequence of instructions `mov`, `test`, `jne` occurs on each read:
```
0x00007fe47c765295: mov    0x30(%r10),%r9
0x00007fe47c765299: test   %r9,0x20(%r15)
0x00007fe47c76529d: jne    0x00007fe47c765b68
0x00007fe47c7652a3: mov    0x10(%r9),%r14
0x00007fe47c7652a7: test   %r14,0x20(%r15)
0x00007fe47c7652ab: jne    0x00007fe47c765b76
```
The assembly above can be seen whenever a reference is read, and sets up a (predictable) data dependency between reads: the `mov` instruction must happen before the `test` instruction, which may trigger the `jne`, so the second `mov` depends on the jump and can't be reordered. I wondered what the purpose of this was, whether the data dependency was the means or the end, and what is in `r15`, and found a decent article about this. Aleksey Shipilëv, who writes garbage collectors for a living and is better placed to interpret this output, gave some feedback: in Hotspot, `r15` is the base address for thread-local storage. Here, `r15 + 0x20` is ZGC's so-called bad mask, and failing the test means that the object needs to be marked or relocated. Neither marking nor relocation actually shows up in this profile because not enough garbage is generated to trigger either, so the code at `0x00007fe47c765b68` can't be seen. If the test passes, nothing needs to happen and the next reference is read (and intercepted itself). What jumps out at me here is the data dependency, but there's also no obvious bottleneck in the profile.
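Putting that explanation together, the barrier can be modelled roughly as follows; this is a conceptual sketch of the mechanism described above, not ZGC's actual code, and the names are mine.

```java
// Conceptual model of the ZGC load barrier shown in the assembly above: each
// loaded reference carries colour bits which are tested against a per-thread
// bad mask (the value at r15 + 0x20). A "good" reference falls through; a
// "bad" one takes the slow path, where the object is marked or relocated and
// the field is healed so subsequent loads stay on the fast path.
final class ZgcLoadBarrierModel {
  long badMask; // flips between collection phases in real ZGC

  // stands in for the mov/test/jne sequence emitted at each reference load
  long loadReference(long[] holder, int fieldIndex) {
    long reference = holder[fieldIndex];                    // mov: load the reference
    if ((reference & badMask) != 0) {                       // test against the bad mask
      reference = slowPath(holder, fieldIndex, reference);  // jne: out-of-line stub
    }
    return reference;
  }

  private long slowPath(long[] holder, int fieldIndex, long reference) {
    // mark or relocate the object as required, then heal the field with the
    // remapped reference; the remapping itself is elided in this model.
    long healed = reference & ~badMask;
    holder[fieldIndex] = healed;
    return healed;
  }
}
```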