There’s a lot of folklore about the importance of inlining in the JVM. Undoubtedly, inlining can improve performance by removing the overhead of function calls, but, more importantly, various optimisations depend on it and simply don’t happen without it. However, I think the importance of inlining is often overstated, especially considering the trade-off between flexibility and the ability to inline. This post is the first in a series where I use JMH to run simple experiments to assess the impact of failure to inline on C2’s ability to optimise programs. This post is about how inlining affects escape analysis, and whether you should care.

Inlining is the process of replacing function calls with the function’s code, much like denormalisation of databases. Just as database denormalisation can eliminate the overhead of performing joins at the expense of increasing the level of data duplication and therefore database size, inlining removes the overhead of function calls, at the expense of the amount of space required to represent the program. The analogy breaks down because copying the function’s code into the call site also aids an optimising compiler like C2 by increasing the scope of what can be optimised within a method, so C2 does this aggressively. It’s well known that there are two ways to confound inlining: code size (InlineSmallCode sets the limit of what can be inlined to 2KB by default), and having lots of polymorphism. Failure to inline can also be provoked by the JMH annotation @CompilerControl(DONT_INLINE).
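
For reference, this is how the JMH annotation is applied to exclude a method from inlining (a minimal sketch; the class and method names here are illustrative, not from the benchmark code below):

import org.openjdk.jmh.annotations.CompilerControl;

public class Target {
  // C2 will never inline calls to this method while the benchmark runs
  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  public int compute(int x) {
    return x * x;
  }
}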

In the first benchmark, I will look at a contrived example of the kind of small method you may find in Java code written in a functional style. Functional programming exploits monads, which represent a generic computation as a wrapper type, a wrapping operation known as the unit function, and a way to compose functions applied to the wrapper type, known as the bind function. You can also think of them as burritos. Some monadic types common in functionally tinged Java are Either (contains an instance of one type or another), Try (produces an output or an exception) and Optional, which exists in the JDK. One drawback of monadic types in Java is that the wrapper type needs to be materialised (rather than exist only as a figment of the compiler’s imagination) and risks being allocated.
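
To make the terminology concrete, Optional can be viewed through this lens: Optional.of plays the role of the unit function, and flatMap plays the role of the bind function (a minimal sketch for illustration):

import java.util.Optional;

public class OptionalAsMonad {
  public static void main(String[] args) {
    // unit: wrap a plain value in the monadic type
    Optional<String> wrapped = Optional.of("42");
    // bind (flatMap in the JDK): compose a function returning the wrapper type
    Optional<Integer> parsed = wrapped.flatMap(s -> Optional.of(Integer.parseInt(s)));
    System.out.println(parsed); // Optional[42]
  }
}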

Here is an interface exposing a method returning an Optional, intended to safely map a potentially null value of type S to Optional<T> via a mapping between the unwrapped types S and T. To avoid measuring the cost of different implementations, it is implemented identically four times, enough to reach the threshold at which HotSpot gives up on inlining calls to the escapee.

import java.util.Optional;
import java.util.function.Function;

public interface Escapee<T> {
  <S> Optional<T> map(S value, Function<S, T> mapper);
}

public class Escapee1<T> implements Escapee<T> {
  @Override
  public <S> Optional<T> map(S value, Function<S, T> mapper) {
    return Optional.ofNullable(value).map(mapper);
  }
}

In the benchmark, we can simulate conditions where we call between one and four implementations. We should probably expect the benchmark to behave differently when the input value is null, because a different branch will be taken. To isolate the difference in throughput to just taking the other branch, the same function, which allocates an Instant, is evaluated on either branch. No attempt is made to make the branch unpredictable, since that’s beside the point. Instant.now() is chosen because it is impure and its result changes on every call, meaning that its evaluation shouldn’t be eliminated by some other optimisation.

  @State(Scope.Benchmark)
  public static class InstantEscapeeState {
    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;
    @Param({"true", "false"})
    boolean isPresent;
    Escapee<Instant>[] escapees;
    int size = 4;
    String input;

    @Setup(Level.Trial)
    public void init() {
      escapees = new Escapee[size];
      scenario.fill(escapees);
      input = isPresent ? "" : null;
    }
  }

  @Benchmark
  @OperationsPerInvocation(4)
  public void mapValue(InstantEscapeeState state, Blackhole bh) {
    for (Escapee<Instant> escapee : state.escapees) {
      bh.consume(escapee.map(state.input, x -> Instant.now()).orElseGet(Instant::now));
    }
  }
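
The Scenario enum isn’t shown here; a minimal sketch consistent with its use above fills the array with between one and four distinct implementations (Escapee2 to Escapee4 are assumed to be the identical copies of Escapee1 described earlier):

enum Scenario {
  ONE, TWO, THREE, FOUR;

  <T> void fill(Escapee<T>[] escapees) {
    // cycle through as many distinct implementations as the scenario name implies
    for (int i = 0; i < escapees.length; ++i) {
      switch (i % (ordinal() + 1)) {
        case 0:  escapees[i] = new Escapee1<>(); break;
        case 1:  escapees[i] = new Escapee2<>(); break;
        case 2:  escapees[i] = new Escapee3<>(); break;
        default: escapees[i] = new Escapee4<>(); break;
      }
    }
  }
}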

Based on common knowledge about C2’s inlining capabilities, we should expect scenarios THREE and FOUR not to inline, whereas ONE should be inlined, and TWO should be inlined with a conditional. Verifying this well-known outcome by printing inlining decisions with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining is trivial. See Aleksey Shipilёv’s authoritative post for reference.

The benchmark is run with the following arguments. Tiered compilation is disabled to bypass C1. A large heap is allocated to avoid measuring garbage collection pauses, and the low-overhead SerialGC is selected to minimise interference from instrumented write barriers.

taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee.csv -prof gc \
  -jvmArgs="-XX:-TieredCompilation -XX:+UseSerialGC -mx8G" EscapeeBenchmark.mapValue$

There is little absolute difference in throughput (the scenarios where we expect inlining to occur have slightly higher throughput than those where we expect it not to), but the results are quite interesting.

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: isPresent | Param: scenario |
|---|---|---|---|---|---|---|---|---|
| mapValue | thrpt | 1 | 15 | 24.013132 | 0.459482 | ops/us | true | ONE |
| mapValue | thrpt | 1 | 15 | 22.448583 | 0.430733 | ops/us | true | TWO |
| mapValue | thrpt | 1 | 15 | 20.291617 | 0.898656 | ops/us | true | THREE |
| mapValue | thrpt | 1 | 15 | 20.651088 | 0.552091 | ops/us | true | FOUR |
| mapValue | thrpt | 1 | 15 | 24.625237 | 0.535002 | ops/us | false | ONE |
| mapValue | thrpt | 1 | 15 | 24.039407 | 0.432007 | ops/us | false | TWO |
| mapValue | thrpt | 1 | 15 | 21.976675 | 0.741998 | ops/us | false | THREE |
| mapValue | thrpt | 1 | 15 | 22.183469 | 0.43514 | ops/us | false | FOUR |

The megamorphic cases are slightly faster when the input value is null, which highlights how easy it would be to fail to capture the relevant effects at all. When the input value is always null, and when there is only one implementation and the input value is not null, the normalised allocation rates are all 24 B/op: just over half that of the non-null input, multi-implementation scenarios, which are all about the same at 40 B/op.

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: isPresent | Param: scenario |
|---|---|---|---|---|---|---|---|---|
| mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 24.000017 | 0.000001 | B/op | true | ONE |
| mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 40.000018 | 0.000001 | B/op | true | TWO |
| mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 40.00002 | 0.000001 | B/op | true | THREE |
| mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 40.00002 | 0.000001 | B/op | true | FOUR |
| mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 24.000017 | 0.000001 | B/op | false | ONE |
| mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 24.000017 | 0.000001 | B/op | false | TWO |
| mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 24.000019 | 0.000001 | B/op | false | THREE |
| mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 24.000019 | 0.000001 | B/op | false | FOUR |

24 B/op is the size of an instance of the Instant class (when a simple garbage collector like SerialGC is used): an 8-byte number of seconds since 1970 and a 4-byte number of nanoseconds, plus a 12-byte object header. So the wrapper type can’t have been allocated in these cases! 40 B/op includes the extra 16 bytes taken up by the materialised Optional (12 bytes for the header and 4 bytes for a compressed reference to the Instant). The difference is caused by the limitations of escape analysis: it can’t prove allocation is unnecessary whenever the allocating method can’t be inlined, because the call is opaque to the caller, and it incidentally gives up when the allocation takes place within a conditional statement. In scenario TWO, a conditional statement is introduced by inlining two possible implementations, which means each operation allocates the 16 bytes required for the Optional.
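
These sizes can be checked empirically with the OpenJDK JOL tool (a sketch, not part of the benchmark; it assumes the org.openjdk.jol:jol-core dependency is on the classpath):

import java.time.Instant;
import java.util.Optional;
import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.info.GraphLayout;

public class Sizes {
  public static void main(String[] args) {
    Instant instant = Instant.now();
    // field layout: 12-byte header + 8-byte seconds + 4-byte nanos = 24 bytes
    System.out.println(ClassLayout.parseInstance(instant).toPrintable());
    // total footprint of the wrapper plus its referent: 16 + 24 = 40 bytes
    System.out.println(GraphLayout.parseInstance(Optional.of(instant)).toFootprint());
  }
}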

The signal is fairly weak in this benchmark, and is almost entirely masked by the fact that the benchmark allocates a 24-byte Instant per invocation. To accentuate the difference, we can remove the background allocation from the benchmark and track the same metrics.

  @State(Scope.Benchmark)
  public static class StringEscapeeState {
    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;
    @Param({"true", "false"})
    boolean isPresent;
    Escapee<String>[] escapees;
    int size = 4;
    String input;
    String ifPresent;
    String ifAbsent;

    @Setup(Level.Trial)
    public void init() {
      escapees = new Escapee[size];
      scenario.fill(escapees);
      ifPresent = UUID.randomUUID().toString();
      ifAbsent = UUID.randomUUID().toString();
      input = isPresent ? "" : null;
    }
  }

  @Benchmark
  @OperationsPerInvocation(4)
  public void mapValueNoAllocation(StringEscapeeState state, Blackhole bh) {
    for (Escapee<String> escapee : state.escapees) {
      bh.consume(escapee.map(state.input, x -> state.ifPresent).orElseGet(() -> state.ifAbsent));
    }
  }

taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee-string.csv -prof gc \
  -jvmArgs="-XX:-TieredCompilation -XX:+UseSerialGC -mx8G" EscapeeBenchmark.mapValueNoAllocation

While even the cost of very low-intensity realistic work (allocating a timestamp) is enough to mask the cost of failing to inline, when the virtual call is effectively a no-op we can make its impact look quite severe. ONE and TWO are much faster because they at least eliminate the virtual function call, no matter whether the input is null or not.

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: isPresent | Param: scenario |
|---|---|---|---|---|---|---|---|---|
| mapValueNoAllocation | thrpt | 1 | 15 | 206.913491 | 3.003555 | ops/us | true | ONE |
| mapValueNoAllocation | thrpt | 1 | 15 | 162.014816 | 4.353872 | ops/us | true | TWO |
| mapValueNoAllocation | thrpt | 1 | 15 | 77.959095 | 2.174789 | ops/us | true | THREE |
| mapValueNoAllocation | thrpt | 1 | 15 | 77.845562 | 3.592952 | ops/us | true | FOUR |
| mapValueNoAllocation | thrpt | 1 | 15 | 202.016045 | 2.830117 | ops/us | false | ONE |
| mapValueNoAllocation | thrpt | 1 | 15 | 198.241125 | 2.351662 | ops/us | false | TWO |
| mapValueNoAllocation | thrpt | 1 | 15 | 88.187145 | 3.908423 | ops/us | false | THREE |
| mapValueNoAllocation | thrpt | 1 | 15 | 89.715024 | 2.234652 | ops/us | false | FOUR |

It’s easy to imagine that allocation has been curtailed, only to be caught out by the limitations of escape analysis in the presence of polymorphism. In scenario ONE, there is never any allocation: escape analysis must have worked. In scenario TWO, because of the inlined conditional, the 16-byte Optional is allocated once per invocation with non-null input, and there is no allocation at all when the input is always null. However, when inlining fails in scenarios THREE and FOUR, an extra 16 bytes is allocated once per invocation, and it’s not related to inlining: the unintentional 16 bytes come from the lambda capturing state in each case (a 12-byte header and a 4-byte compressed reference to the captured state). How often do you check your benchmarks to ensure you are measuring what you think you are?

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: isPresent | Param: scenario |
|---|---|---|---|---|---|---|---|---|
| mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 0.000002 | 0 | B/op | true | ONE |
| mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 16.000003 | 0 | B/op | true | TWO |
| mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 32.000005 | 0 | B/op | true | THREE |
| mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 32.000005 | 0 | B/op | true | FOUR |
| mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 0.000002 | 0 | B/op | false | ONE |
| mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 0.000002 | 0 | B/op | false | TWO |
| mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 16.000005 | 0 | B/op | false | THREE |
| mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 16.000005 | 0 | B/op | false | FOUR |
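
The distinction is that a lambda which captures nothing can be reused, whereas a capturing lambda needs a fresh instance on each evaluation whenever escape analysis can’t eliminate it (a sketch to illustrate the difference; not part of the benchmark):

import java.util.function.Supplier;

public class Capture {
  public static void main(String[] args) {
    // non-capturing: the JVM can hand back the same instance on every evaluation
    Supplier<String> constant = () -> "constant";
    String local = "captured";
    // capturing: each evaluation may allocate a 16-byte instance holding the
    // 12-byte header and a 4-byte compressed reference to the captured value
    Supplier<String> capturing = () -> local;
    System.out.println(constant.get() + " " + capturing.get());
  }
}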

It’s not the sort of thing that can be exploited in real programs, but it looks as if allocations are better eliminated when the method, be it virtual or inlined, only ever sees a null value. Actually, Optional.empty() always returns the same instance, so there were no allocations in the first place.
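
This is easy to verify: in OpenJDK, Optional.ofNullable(null) routes to Optional.empty(), which returns a shared instance (although the Javadoc explicitly declines to guarantee this):

import java.util.Optional;

public class EmptyIdentity {
  public static void main(String[] args) {
    // both expressions return the cached EMPTY instance, so no allocation occurs
    System.out.println(Optional.ofNullable(null) == Optional.empty()); // true on OpenJDK
  }
}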

Having contrived a case to accentuate the effect, it’s worth noting that the impact of failing to inline is smaller than the difference between garbage collectors in the cost of allocating an instance and storing the value, a cost some developers seem to be unaware of.

  @State(Scope.Benchmark)
  public static class InstantStoreEscapeeState {
    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;
    @Param({"true", "false"})
    boolean isPresent;
    int size = 4;
    String input;
    Escapee<Instant>[] escapees;
    Instant[] target;

    @Setup(Level.Trial)
    public void init() {
      escapees = new Escapee[size];
      target = new Instant[size];
      scenario.fill(escapees);
      input = isPresent ? "" : null;
    }
  }

  @Benchmark
  @OperationsPerInvocation(4)
  public void mapAndStoreValue(InstantStoreEscapeeState state, Blackhole bh) {
    for (int i = 0; i < state.escapees.length; ++i) {
      state.target[i] = state.escapees[i].map(state.input, x -> Instant.now()).orElseGet(Instant::now);
    }
    bh.consume(state.target);
  }

I run the same benchmark in two modes:

taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee-store-serial.csv \
  -prof gc -jvmArgs="-XX:-TieredCompilation -XX:+UseSerialGC -mx8G" EscapeeBenchmark.mapAndStoreValue$

taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee-store-g1.csv \
  -prof gc -jvmArgs="-XX:-TieredCompilation -XX:+UseG1GC -mx8G" EscapeeBenchmark.mapAndStoreValue$

The cost of changing the garbage collector, and hence the write barriers triggered by the stores (simple card marking for the serial collector, more complex barriers for G1), is about as large as the cost of missing out on inlining. Note that this is not an argument that garbage collector overhead is unacceptable!

| Benchmark | GC | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: isPresent | Param: scenario |
|---|---|---|---|---|---|---|---|---|---|
| mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 23.739993 | 0.297493 | ops/us | true | ONE |
| mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 22.41715 | 0.502928 | ops/us | true | TWO |
| mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 21.096494 | 0.629228 | ops/us | true | THREE |
| mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 20.656528 | 0.604725 | ops/us | true | FOUR |
| mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 24.098976 | 0.479819 | ops/us | false | ONE |
| mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 23.759017 | 0.460972 | ops/us | false | TWO |
| mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 21.473803 | 0.411786 | ops/us | false | THREE |
| mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 21.524173 | 0.393322 | ops/us | false | FOUR |
| mapAndStoreValue | G1GC | thrpt | 1 | 15 | 20.522258 | 0.463444 | ops/us | true | ONE |
| mapAndStoreValue | G1GC | thrpt | 1 | 15 | 18.520677 | 0.229133 | ops/us | true | TWO |
| mapAndStoreValue | G1GC | thrpt | 1 | 15 | 18.359042 | 0.276809 | ops/us | true | THREE |
| mapAndStoreValue | G1GC | thrpt | 1 | 15 | 18.446654 | 0.272189 | ops/us | true | FOUR |
| mapAndStoreValue | G1GC | thrpt | 1 | 15 | 20.768856 | 0.496087 | ops/us | false | ONE |
| mapAndStoreValue | G1GC | thrpt | 1 | 15 | 20.277051 | 0.411466 | ops/us | false | TWO |
| mapAndStoreValue | G1GC | thrpt | 1 | 15 | 18.875519 | 0.399535 | ops/us | false | THREE |
| mapAndStoreValue | G1GC | thrpt | 1 | 15 | 18.824234 | 0.657469 | ops/us | false | FOUR |
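
For intuition about where that difference comes from: SerialGC’s post-write barrier just marks a card in a byte array, whereas G1’s also filters the store and enqueues it for remembered set maintenance. Here is a conceptual sketch of card marking with simulated addresses (a real barrier is emitted by the JIT and operates on raw heap addresses inside the JVM):

public class CardTableSketch {
  static final int CARD_SHIFT = 9; // 512-byte cards, as in HotSpot
  static final byte DIRTY = 0;     // HotSpot uses 0 to mean dirty

  final byte[] cardTable;

  CardTableSketch(long heapSize) {
    cardTable = new byte[(int) (heapSize >>> CARD_SHIFT)];
  }

  // executed after every reference store, so the collector can find
  // old-to-young pointers without scanning the whole old generation
  void postWriteBarrier(long fieldAddress) {
    cardTable[(int) (fieldAddress >>> CARD_SHIFT)] = DIRTY;
  }
}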

Inlining makes escape analysis more effective by allowing the analysis to apply to larger blocks of code, but it is only fully effective when a single implementation is used. The marginal benefit decreases in the presence of even trivial allocation, but can be expected to increase with the size of the eliminated allocation. The difference can even be smaller than the runtime cost of write barriers in some garbage collectors. My benchmarks are on GitHub; they were run with OpenJDK 11+28 on Ubuntu 18.04.2 LTS.

Perhaps this analysis is facile; many optimisations more powerful than escape analysis depend on inlining. The next post in the series will be on the benefits, or lack thereof, of inlining a reduction operation such as a hash code.