Does Inlined Mean Streamlined? Part 1: Escape Analysis
There’s a lot of folklore about the importance of inlining in the JVM. Undoubtedly, inlining can improve performance by removing the overhead of function calls, but, more importantly, various optimisations depend on it and just don’t happen without it. However, I think the importance of inlining is often overstated, especially considering the trade-off between flexibility and the ability to inline. This post is the first in a series where I use JMH to run simple experiments to assess the impact of failure to inline on C2’s ability to optimise programs. This post is about how inlining affects escape analysis, and whether you should care.
Inlining is the process of replacing function calls with the function’s code, much like denormalisation of databases. Just as database denormalisation can eliminate the overhead of performing joins at the expense of increasing the level of data duplication and therefore database size, inlining removes the overhead of function calls, at the expense of the amount of space required to represent the program. The analogy breaks down because copying the function’s code into the call site also aids an optimising compiler like C2 by increasing the scope of what can be optimised within a method, so C2 does this aggressively. It’s well known that there are two ways to confound inlining: code size (`InlineSmallCode` sets the limit of what can be inlined to 2KB by default), and having lots of polymorphism. Failure to inline can also be provoked by the JMH annotation `@CompilerControl(DONT_INLINE)`.
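For reference, here is a minimal sketch of how that annotation is applied (the class and method below are hypothetical examples of mine, not code from this post’s benchmarks). JMH translates the annotation into a compiler directive for the forked JVM, so C2 treats calls to the method as opaque regardless of its size or the call site’s morphism.

```java
import java.util.Optional;
import org.openjdk.jmh.annotations.CompilerControl;

public class InliningControlExample {
    // Hypothetical method: C2 will never inline calls to it, so any Optional it
    // allocates cannot be scalar-replaced in its callers.
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public Optional<String> wrap(String value) {
        return Optional.ofNullable(value);
    }
}
```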
In the first benchmark, I will look at a contrived example of the kind of small method you may find in Java code written in a functional style. Functional programming exploits monads, which represent a generic computation as a wrapper type, a wrapping operation known as the unit function, and a way to compose functions applied to the wrapper type, known as the bind function. You can also think of them as burritos. Some monadic types common in functionally tinged Java are `Either` (contains an instance of one type or another), `Try` (produces an output or an exception) and `Optional`, which exists in the JDK. One drawback of monadic types in Java is that the wrapper type needs to be materialised (rather than exist only as a figment of the compiler’s imagination) and risks being allocated.
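To make the wrapper/unit/bind terminology concrete, here is a minimal sketch of an `Either`-style type; this is my own illustration rather than code from any library that provides one. The static factories play the role of the unit function and `flatMap` the role of bind, and every composed step materialises another wrapper.

```java
import java.util.function.Function;

public abstract class Either<L, R> {

    public static <L, R> Either<L, R> left(L value) { return new Left<>(value); }   // unit
    public static <L, R> Either<L, R> right(R value) { return new Right<>(value); } // unit

    // bind: applies a function producing a wrapped result to the wrapped value
    public abstract <T> Either<L, T> flatMap(Function<? super R, Either<L, T>> mapper);

    private static final class Left<L, R> extends Either<L, R> {
        private final L value;
        Left(L value) { this.value = value; }

        @Override
        public <T> Either<L, T> flatMap(Function<? super R, Either<L, T>> mapper) {
            return new Left<>(value); // short-circuits, but still materialises a wrapper
        }
    }

    private static final class Right<L, R> extends Either<L, R> {
        private final R value;
        Right(R value) { this.value = value; }

        @Override
        public <T> Either<L, T> flatMap(Function<? super R, Either<L, T>> mapper) {
            return mapper.apply(value); // each composed step may allocate another wrapper
        }
    }
}
```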
Here is an interface exposing a method returning an `Optional`, intended to safely map a potentially null value of type `S` to type `Optional<T>` via a mapping between the unwrapped types `S` and `T`. To avoid measuring the cost of different implementations, it is implemented identically four times: HotSpot gives up on inlining calls to the escapee once it has seen more than two implementations.
```java
public interface Escapee<T> {
    <S> Optional<T> map(S value, Function<S, T> mapper);
}

public class Escapee1<T> implements Escapee<T> {
    @Override
    public <S> Optional<T> map(S value, Function<S, T> mapper) {
        return Optional.ofNullable(value).map(mapper);
    }
}
```
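The `Scenario` enum referenced by `scenario.fill(escapees)` below isn’t reproduced in this post; here is a plausible reconstruction of mine, assuming `Escapee2` to `Escapee4` are verbatim copies of `Escapee1`, which cycles the requested number of receiver types into the array.

```java
public enum Scenario {
    ONE(1), TWO(2), THREE(3), FOUR(4);

    private final int implementations;

    Scenario(int implementations) {
        this.implementations = implementations;
    }

    // Fills the array with between one and four distinct (but behaviourally identical)
    // implementations, so the call site in the benchmark loop sees the corresponding
    // number of receiver types.
    public <T> void fill(Escapee<T>[] escapees) {
        for (int i = 0; i < escapees.length; ++i) {
            switch (i % implementations) {
                case 0:  escapees[i] = new Escapee1<>(); break;
                case 1:  escapees[i] = new Escapee2<>(); break;
                case 2:  escapees[i] = new Escapee3<>(); break;
                default: escapees[i] = new Escapee4<>(); break;
            }
        }
    }
}
```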
In the benchmark, we can simulate conditions where we call between one and four implementations. We should probably expect the benchmark to behave differently when the input value is null because a different branch will be taken. To isolate the difference in throughput just for taking the other branch, the same function, which allocates an `Instant`, is evaluated on either branch. No attempt is made to make the branch unpredictable since it’s beside the point. `Instant.now()` is chosen because it is volatile and impure, meaning that its evaluation shouldn’t be eliminated by some other optimisation.
```java
@State(Scope.Benchmark)
public static class InstantEscapeeState {

    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;

    @Param({"true", "false"})
    boolean isPresent;

    Escapee<Instant>[] escapees;
    int size = 4;
    String input;

    @Setup(Level.Trial)
    public void init() {
        escapees = new Escapee[size];
        scenario.fill(escapees);
        input = isPresent ? "" : null;
    }
}
```
```java
@Benchmark
@OperationsPerInvocation(4)
public void mapValue(InstantEscapeeState state, Blackhole bh) {
    for (Escapee<Instant> escapee : state.escapees) {
        bh.consume(escapee.map(state.input, x -> Instant.now()).orElseGet(Instant::now));
    }
}
```
Based on common knowledge about C2’s inlining capabilities, we should expect scenarios THREE and FOUR not to inline, whereas ONE should be inlined, and TWO should be inlined with a conditional. Verifying this well-known outcome by printing inlining decisions with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining` is trivial; see Aleksey Shipilёv’s authoritative post for reference.
The benchmark is run with the following arguments. Tiered compilation is disabled to bypass C1. A large heap is allocated to avoid measuring garbage collection pauses, and the low-overhead SerialGC is selected to minimise interference from instrumented write barriers.
```
taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee.csv -prof gc -jvmArgs="-XX:-TieredCompilation -XX:+UseSerialGC -mx8G" EscapeeBenchmark.mapValue$
```
Despite there being little absolute difference in throughput (the scenarios where we expect inlining to occur have slightly higher throughputs than when we expect inlining not to take place), the results are quite interesting.
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: isPresent | Param: scenario |
---|---|---|---|---|---|---|---|---|
mapValue | thrpt | 1 | 15 | 24.013132 | 0.459482 | ops/us | true | ONE |
mapValue | thrpt | 1 | 15 | 22.448583 | 0.430733 | ops/us | true | TWO |
mapValue | thrpt | 1 | 15 | 20.291617 | 0.898656 | ops/us | true | THREE |
mapValue | thrpt | 1 | 15 | 20.651088 | 0.552091 | ops/us | true | FOUR |
mapValue | thrpt | 1 | 15 | 24.625237 | 0.535002 | ops/us | false | ONE |
mapValue | thrpt | 1 | 15 | 24.039407 | 0.432007 | ops/us | false | TWO |
mapValue | thrpt | 1 | 15 | 21.976675 | 0.741998 | ops/us | false | THREE |
mapValue | thrpt | 1 | 15 | 22.183469 | 0.43514 | ops/us | false | FOUR |
The megamorphic cases are slightly faster when the input value is null, which highlights how easy it would be not to capture the relevant effects at all. When the input value is always null, and when there is only one implementation and the input value is not null, the normalised allocation rates are all 24B/op, just over half that of the non-null input, multi-implementation scenarios, which are all about the same at 40B/op.
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: isPresent | Param: scenario |
---|---|---|---|---|---|---|---|---|
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 24.000017 | 0.000001 | B/op | true | ONE |
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 40.000018 | 0.000001 | B/op | true | TWO |
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 40.00002 | 0.000001 | B/op | true | THREE |
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 40.00002 | 0.000001 | B/op | true | FOUR |
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 24.000017 | 0.000001 | B/op | false | ONE |
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 24.000017 | 0.000001 | B/op | false | TWO |
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 24.000019 | 0.000001 | B/op | false | THREE |
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 24.000019 | 0.000001 | B/op | false | FOUR |
24B/op is the size of an instance of the `Instant` class (when a simple garbage collector like SerialGC is used), which contains an 8 byte number of seconds since 1970 and a 4 byte number of nanoseconds, plus a 12 byte object header. So the wrapper type can’t have been allocated in these cases! 40B/op includes the 16 bytes taken up by the materialised `Optional` (12 bytes for the header and 4 bytes for a compressed reference to the `Instant`). The difference is caused by the limitations of escape analysis: it can’t prove an allocation is unnecessary whenever the allocating method can’t be inlined, because the call is opaque to the caller, and it incidentally gives up when the allocation takes place within a conditional statement. In scenario TWO, a conditional statement is introduced by inlining two possible implementations, which means each operation allocates the 16 bytes required for the `Optional`.
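The conditional limitation is easy to demonstrate in isolation. The method below is an illustrative sketch of mine, not benchmark code: if profiling has seen both branches, the two `Optional` values merge at a phi node before `orElse`, and scalar replacement requires an allocation that dominates all of its uses, so the allocation survives. A branch that has never been taken is instead pruned with an uncommon trap, which is why scenario ONE (where the input is constant within a trial) allocates nothing despite the branch inside `Optional.ofNullable`.

```java
import java.util.Optional;

public class PhiMergeExample {
    // Illustrative method (not from the benchmark). When both branches have been taken,
    // the empty and non-empty Optionals merge at a phi node, so C2 cannot scalar-replace
    // the Optional.of allocation; when one branch has never been taken, it is pruned with
    // an uncommon trap, the merge disappears, and the allocation is eliminated.
    static int unwrap(Integer boxed) {
        Optional<Integer> wrapper = boxed == null ? Optional.empty() : Optional.of(boxed);
        return wrapper.orElse(0);
    }
}
```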
The signal is fairly weak in this benchmark, and is almost entirely masked by the fact that the benchmark allocates a 24 byte `Instant` per invocation. To accentuate the difference, we can remove the background allocation from the benchmark and track the same metrics.
```java
@State(Scope.Benchmark)
public static class StringEscapeeState {

    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;

    @Param({"true", "false"})
    boolean isPresent;

    Escapee<String>[] escapees;
    int size = 4;
    String input;
    String ifPresent;
    String ifAbsent;

    @Setup(Level.Trial)
    public void init() {
        escapees = new Escapee[size];
        scenario.fill(escapees);
        ifPresent = UUID.randomUUID().toString();
        ifAbsent = UUID.randomUUID().toString();
        input = isPresent ? "" : null;
    }
}
```
```java
@Benchmark
@OperationsPerInvocation(4)
public void mapValueNoAllocation(StringEscapeeState state, Blackhole bh) {
    for (Escapee<String> escapee : state.escapees) {
        bh.consume(escapee.map(state.input, x -> state.ifPresent).orElseGet(() -> state.ifAbsent));
    }
}
```
```
taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee-string.csv -prof gc -jvmArgs="-XX:-TieredCompilation -XX:+UseSerialGC -mx8G" EscapeeBenchmark.mapValueNoAllocation
```
While even the cost of very low intensity realistic work (allocating a timestamp) is enough to mask the cost of failing to inline, when the virtual call is a no-op we can make its impact look quite severe. ONE and TWO are much faster because they at least eliminate the virtual function call, whether or not the input is null.
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: isPresent | Param: scenario |
---|---|---|---|---|---|---|---|---|
mapValueNoAllocation | thrpt | 1 | 15 | 206.913491 | 3.003555 | ops/us | true | ONE |
mapValueNoAllocation | thrpt | 1 | 15 | 162.014816 | 4.353872 | ops/us | true | TWO |
mapValueNoAllocation | thrpt | 1 | 15 | 77.959095 | 2.174789 | ops/us | true | THREE |
mapValueNoAllocation | thrpt | 1 | 15 | 77.845562 | 3.592952 | ops/us | true | FOUR |
mapValueNoAllocation | thrpt | 1 | 15 | 202.016045 | 2.830117 | ops/us | false | ONE |
mapValueNoAllocation | thrpt | 1 | 15 | 198.241125 | 2.351662 | ops/us | false | TWO |
mapValueNoAllocation | thrpt | 1 | 15 | 88.187145 | 3.908423 | ops/us | false | THREE |
mapValueNoAllocation | thrpt | 1 | 15 | 89.715024 | 2.234652 | ops/us | false | FOUR |
It’s easy to imagine that allocation has been curtailed, only to be caught out by the limitations of escape analysis in the presence of polymorphism. In scenario ONE, there is never any allocation: escape analysis must have worked. In scenario TWO, because of the inlined conditional, the 16 byte `Optional` is allocated once per invocation with non-null input, and when the input is always null there are no allocations at all. However, when the inlining doesn’t work in scenarios THREE and FOUR, an extra 16 bytes is allocated once per invocation, and it isn’t the `Optional`: the unintentional 16 bytes comes from the lambda capturing `state` in each case (a 12 byte object header and a 4 byte compressed reference to the captured state). How often do you check your benchmarks to ensure you are measuring what you think you are?
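The capture cost is easy to see in isolation. Here is a sketch of mine (not benchmark code): a lambda that captures a variable compiles to a class with a field per captured value, and a fresh instance may be allocated each time the expression is evaluated unless escape analysis removes it, whereas a non-capturing lambda is instantiated once and cached.

```java
import java.util.function.Supplier;

public class CaptureSketch {
    String ifAbsent = "absent";

    Supplier<String> capturing() {
        // Captures this, so each call allocates a 16 byte instance (12 byte header plus a
        // 4 byte compressed reference) unless escape analysis proves it doesn't escape;
        // escaping into a call that wasn't inlined defeats that, as in THREE and FOUR.
        return () -> this.ifAbsent;
    }

    Supplier<String> nonCapturing() {
        // Captures nothing: the LambdaMetafactory hands out the same cached instance.
        return () -> "absent";
    }
}
```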
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: isPresent | Param: scenario |
---|---|---|---|---|---|---|---|---|
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 0.000002 | 0 | B/op | true | ONE |
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 16.000003 | 0 | B/op | true | TWO |
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 32.000005 | 0 | B/op | true | THREE |
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 32.000005 | 0 | B/op | true | FOUR |
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 0.000002 | 0 | B/op | false | ONE |
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 0.000002 | 0 | B/op | false | TWO |
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 16.000005 | 0 | B/op | false | THREE |
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 16.000005 | 0 | B/op | false | FOUR |
It’s not the sort of thing that can be exploited in real programs, but it looks as if allocations are better eliminated when the method, be it virtual or inlined, only ever sees a null value. Actually, `Optional.empty()` always returns the same instance, so there were no allocations in the first place.
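This is easy to verify; a minimal check of mine, relying on the fact that OpenJDK’s `Optional` keeps the empty instance in a static final field:

```java
import java.util.Optional;

public class EmptySingleton {
    public static void main(String[] args) {
        // In OpenJDK, Optional.empty() returns the same shared constant every time,
        // so the "always null" path never allocates a wrapper, inlined or not.
        System.out.println(Optional.empty() == Optional.empty()); // prints true
    }
}
```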
Having contrived a case to accentuate the effect, it’s worth noting that the impact of failure to inline is smaller than the difference between garbage collectors in the cost of allocating an instance and storing the value, a cost some developers seem to be unaware of.
```java
@State(Scope.Benchmark)
public static class InstantStoreEscapeeState {

    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;

    @Param({"true", "false"})
    boolean isPresent;

    int size = 4;
    String input;
    Escapee<Instant>[] escapees;
    Instant[] target;

    @Setup(Level.Trial)
    public void init() {
        escapees = new Escapee[size];
        target = new Instant[size];
        scenario.fill(escapees);
        input = isPresent ? "" : null;
    }
}
```
```java
@Benchmark
@OperationsPerInvocation(4)
public void mapAndStoreValue(InstantStoreEscapeeState state, Blackhole bh) {
    for (int i = 0; i < state.escapees.length; ++i) {
        state.target[i] = state.escapees[i].map(state.input, x -> Instant.now()).orElseGet(Instant::now);
    }
    bh.consume(state.target);
}
```
I run the same benchmark in two modes:
```
taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee-store-serial.csv -prof gc -jvmArgs="-XX:-TieredCompilation -XX:+UseSerialGC -mx8G" EscapeeBenchmark.mapAndStoreValue$
taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee-store-g1.csv -prof gc -jvmArgs="-XX:-TieredCompilation -XX:+UseG1GC -mx8G" EscapeeBenchmark.mapAndStoreValue$
```
Triggering the write barriers when storing the value (a simple card mark in the case of the serial collector, a much more complex barrier in the case of G1) costs about as much as missing out on inlining; a rough sketch of the two barrier shapes follows the results table. Note that this is not an argument that garbage collector overhead is unacceptable!
Benchmark | GC | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: isPresent | Param: scenario |
---|---|---|---|---|---|---|---|---|---|
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 23.739993 | 0.297493 | ops/us | true | ONE |
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 22.41715 | 0.502928 | ops/us | true | TWO |
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 21.096494 | 0.629228 | ops/us | true | THREE |
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 20.656528 | 0.604725 | ops/us | true | FOUR |
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 24.098976 | 0.479819 | ops/us | false | ONE |
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 23.759017 | 0.460972 | ops/us | false | TWO |
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 21.473803 | 0.411786 | ops/us | false | THREE |
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 21.524173 | 0.393322 | ops/us | false | FOUR |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 20.522258 | 0.463444 | ops/us | true | ONE |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 18.520677 | 0.229133 | ops/us | true | TWO |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 18.359042 | 0.276809 | ops/us | true | THREE |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 18.446654 | 0.272189 | ops/us | true | FOUR |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 20.768856 | 0.496087 | ops/us | false | ONE |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 20.277051 | 0.411466 | ops/us | false | TWO |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 18.875519 | 0.399535 | ops/us | false | THREE |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 18.824234 | 0.657469 | ops/us | false | FOUR |
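For intuition about why the barriers differ in cost, here is the rough model promised above. This is a simplification of mine, with flat arrays standing in for the card table and made-up card and region sizes; it models the idea, not HotSpot’s actual code.

```java
public class WriteBarrierSketch {
    static final byte DIRTY = 0;
    static final byte[] CARD_TABLE = new byte[1 << 20];

    // Serial/Parallel style: one unconditional store per reference write. Cheap.
    static void serialPostBarrier(long fieldAddress) {
        CARD_TABLE[(int) ((fieldAddress >>> 9) & (CARD_TABLE.length - 1))] = DIRTY; // 512 byte cards
    }

    // G1 style: filter same-region and null stores, check the card, maybe enqueue it for
    // refinement. The extra branches and queue traffic are what the benchmark pays for.
    static void g1PostBarrier(long fieldAddress, long valueAddress) {
        if ((fieldAddress >>> 20) == (valueAddress >>> 20)) return; // same (1MB) region
        if (valueAddress == 0) return;                              // null store
        int card = (int) ((fieldAddress >>> 9) & (CARD_TABLE.length - 1));
        if (CARD_TABLE[card] != DIRTY) {
            CARD_TABLE[card] = DIRTY;
            enqueue(card); // stand-in for G1's per-thread dirty card queue
        }
    }

    static void enqueue(int card) { /* processed by concurrent refinement threads */ }
}
```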
Inlining makes escape analysis more effective by allowing the analysis to apply to larger blocks of code, but it is only effective when the call site sees a single implementation. The marginal benefit decreases in the presence of even trivial allocation, but can be expected to increase with the size of the eliminated allocation. The difference can even be smaller than the runtime cost of write barriers in some garbage collectors. My benchmarks are on GitHub; they were run with OpenJDK 11+28 on Ubuntu 18.04.2 LTS.
Perhaps this analysis is facile; many optimisations more powerful than escape analysis depend on inlining. The next post in the series will be on the benefits, or lack thereof, of inlining a reduction operation such as a hash code.