Does Inlined Mean Streamlined? Part 1: Escape Analysis
There’s a lot of folklore about the importance of inlining in the JVM. Undoubtedly, inlining can improve performance by removing the overhead of function calls, but, more importantly, various optimisations depend on it and just don’t happen without it. However, I think the importance of inlining is often overstated, especially considering the trade-off between flexibility and the ability to inline. This post is the first in a series where I use JMH to run simple experiments to assess the impact of failure to inline on C2’s ability to optimise programs. This post is about how inlining affects escape analysis, and whether you should care.
Inlining is the process of replacing function calls with the function’s code, much like denormalisation of databases. Just as database denormalisation can eliminate the overhead of performing joins at the expense of increasing the level of data duplication and therefore database size, inlining removes the overhead of function calls, at the expense of the amount of space required to represent the program. The analogy breaks down because copying the function’s code into the call site also aids an optimising compiler like C2 by increasing the scope of what can be optimised within a method, so C2 does this aggressively. It’s well known that there are two ways to confound inlining: code size (`InlineSmallCode` sets the limit of what can be inlined to 2KB by default), and having lots of polymorphism. Failure to inline can also be provoked by the JMH annotation `@CompilerControl(DONT_INLINE)`.
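For reference, here is a minimal sketch of how that annotation is applied (the class and method below are hypothetical examples of mine, not code from this post’s benchmarks). JMH translates the annotation into a compiler directive for the forked JVM, so C2 treats calls to the method as opaque regardless of its size or the call site’s morphism.

```java
import java.util.Optional;
import org.openjdk.jmh.annotations.CompilerControl;

public class InliningControlExample {
    // Hypothetical method: C2 will never inline calls to it, so any Optional it
    // allocates cannot be scalar-replaced in its callers.
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public Optional<String> wrap(String value) {
        return Optional.ofNullable(value);
    }
}
```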
In the first benchmark, I will look at a contrived example of the kind of small method you may find in Java code written in a functional style. Functional programming exploits monads, which represent a generic computation as a wrapper type, a wrapping operation known as the unit function, and a way to compose functions applied to the wrapper type, known as the bind function. You can also think of them as burritos. Some monadic types common in functionally tinged Java are `Either` (contains an instance of one type or another), `Try` (produces an output or an exception) and `Optional`, which exists in the JDK. One drawback of monadic types in Java is that the wrapper type needs to be materialised (rather than exist only as a figment of the compiler’s imagination) and risks being allocated.
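To make the wrapper/unit/bind terminology concrete, here is a minimal sketch of an `Either`-style type; this is my own illustration rather than code from any library that provides one. The static factories play the role of the unit function and `flatMap` the role of bind, and every composed step materialises another wrapper.

```java
import java.util.function.Function;

public abstract class Either<L, R> {

    public static <L, R> Either<L, R> left(L value) { return new Left<>(value); }   // unit
    public static <L, R> Either<L, R> right(R value) { return new Right<>(value); } // unit

    // bind: applies a function producing a wrapped result to the wrapped value
    public abstract <T> Either<L, T> flatMap(Function<? super R, Either<L, T>> mapper);

    private static final class Left<L, R> extends Either<L, R> {
        private final L value;
        Left(L value) { this.value = value; }

        @Override
        public <T> Either<L, T> flatMap(Function<? super R, Either<L, T>> mapper) {
            return new Left<>(value); // short-circuits, but still materialises a wrapper
        }
    }

    private static final class Right<L, R> extends Either<L, R> {
        private final R value;
        Right(R value) { this.value = value; }

        @Override
        public <T> Either<L, T> flatMap(Function<? super R, Either<L, T>> mapper) {
            return mapper.apply(value); // each composed step may allocate another wrapper
        }
    }
}
```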
Here is an interface exposing a method returning an `Optional`, intended to safely map a potentially null value of type `S` to type `Optional<T>` via a mapping between the unwrapped types `S` and `T`. To avoid measuring the cost of different implementations, it is implemented identically four times: HotSpot gives up on inlining calls to the escapee once it has seen more than two implementations.
```java
public interface Escapee<T> {
    <S> Optional<T> map(S value, Function<S, T> mapper);
}

public class Escapee1<T> implements Escapee<T> {
    @Override
    public <S> Optional<T> map(S value, Function<S, T> mapper) {
        return Optional.ofNullable(value).map(mapper);
    }
}
```
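The `Scenario` enum referenced by `scenario.fill(escapees)` below isn’t reproduced in this post; here is a plausible reconstruction of mine, assuming `Escapee2` to `Escapee4` are verbatim copies of `Escapee1`, which cycles the requested number of receiver types into the array.

```java
public enum Scenario {
    ONE(1), TWO(2), THREE(3), FOUR(4);

    private final int implementations;

    Scenario(int implementations) {
        this.implementations = implementations;
    }

    // Fills the array with between one and four distinct (but behaviourally identical)
    // implementations, so the call site in the benchmark loop sees the corresponding
    // number of receiver types.
    public <T> void fill(Escapee<T>[] escapees) {
        for (int i = 0; i < escapees.length; ++i) {
            switch (i % implementations) {
                case 0:  escapees[i] = new Escapee1<>(); break;
                case 1:  escapees[i] = new Escapee2<>(); break;
                case 2:  escapees[i] = new Escapee3<>(); break;
                default: escapees[i] = new Escapee4<>(); break;
            }
        }
    }
}
```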
In the benchmark, we can simulate conditions where we call between one and four implementations. We should probably expect the benchmark to behave differently when the input value is null because a different branch will be taken. To isolate the difference in throughput just for taking the other branch, the same function, which allocates an `Instant`, is evaluated on either branch. No attempt is made to make the branch unpredictable since it’s beside the point. `Instant.now()` is chosen because it is volatile and impure, meaning that its evaluation shouldn’t be eliminated by some other optimisation.
```java
@State(Scope.Benchmark)
public static class InstantEscapeeState {

    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;

    @Param({"true", "false"})
    boolean isPresent;

    Escapee<Instant>[] escapees;
    int size = 4;
    String input;

    @Setup(Level.Trial)
    public void init() {
        escapees = new Escapee[size];
        scenario.fill(escapees);
        input = isPresent ? "" : null;
    }
}
```
```java
@Benchmark
@OperationsPerInvocation(4)
public void mapValue(InstantEscapeeState state, Blackhole bh) {
    for (Escapee<Instant> escapee : state.escapees) {
        bh.consume(escapee.map(state.input, x -> Instant.now()).orElseGet(Instant::now));
    }
}
```
Based on common knowledge about C2’s inlining capabilities, we should expect scenarios THREE and FOUR not to inline, whereas ONE should be inlined, and TWO should be inlined with a conditional. Verifying this well-known outcome by printing inlining decisions with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining` is trivial; see Aleksey Shipilёv’s authoritative post for reference.
The benchmark is run with the following arguments. Tiered compilation is disabled to bypass C1. A large heap is allocated to avoid measuring garbage collection pauses, and the low-overhead SerialGC is selected to minimise interference from instrumented write barriers.
```
taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee.csv -prof gc -jvmArgs="-XX:-TieredCompilation -XX:+UseSerialGC -mx8G" EscapeeBenchmark.mapValue$
```
Despite there being little absolute difference in throughput (the scenarios where we expect inlining to occur have slightly higher throughputs than when we expect inlining not to take place), the results are quite interesting.
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: isPresent | Param: scenario |
---|---|---|---|---|---|---|---|---|
mapValue | thrpt | 1 | 15 | 24.013132 | 0.459482 | ops/us | true | ONE |
mapValue | thrpt | 1 | 15 | 22.448583 | 0.430733 | ops/us | true | TWO |
mapValue | thrpt | 1 | 15 | 20.291617 | 0.898656 | ops/us | true | THREE |
mapValue | thrpt | 1 | 15 | 20.651088 | 0.552091 | ops/us | true | FOUR |
mapValue | thrpt | 1 | 15 | 24.625237 | 0.535002 | ops/us | false | ONE |
mapValue | thrpt | 1 | 15 | 24.039407 | 0.432007 | ops/us | false | TWO |
mapValue | thrpt | 1 | 15 | 21.976675 | 0.741998 | ops/us | false | THREE |
mapValue | thrpt | 1 | 15 | 22.183469 | 0.43514 | ops/us | false | FOUR |
The megamorphic cases are slightly faster when the input value is null, which highlights how easy it would be not to capture the relevant effects at all. When the input value is always null, and when there is only one implementation and the input value is not null, the normalised allocation rates are all 24B/op, just over half that of the non-null input, multi-implementation scenarios, which are all about the same at 40B/op.
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: isPresent | Param: scenario |
---|---|---|---|---|---|---|---|---|
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 24.000017 | 0.000001 | B/op | true | ONE |
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 40.000018 | 0.000001 | B/op | true | TWO |
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 40.00002 | 0.000001 | B/op | true | THREE |
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 40.00002 | 0.000001 | B/op | true | FOUR |
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 24.000017 | 0.000001 | B/op | false | ONE |
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 24.000017 | 0.000001 | B/op | false | TWO |
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 24.000019 | 0.000001 | B/op | false | THREE |
mapValue:·gc.alloc.rate.norm | thrpt | 1 | 15 | 24.000019 | 0.000001 | B/op | false | FOUR |
24B/op is the size of an instance of the `Instant` class (when a simple garbage collector like SerialGC is used), which contains an 8 byte number of seconds since 1970 and a 4 byte number of nanoseconds, plus a 12 byte object header. So the wrapper type can’t have been allocated in these cases! 40B/op includes the 16 bytes taken up by the materialised `Optional` (12 bytes for the header and 4 bytes for a compressed reference to the `Instant`). The difference is caused by the limitations of escape analysis: it can’t prove an allocation is unnecessary whenever the allocating method can’t be inlined, because the call is opaque to the caller, and it incidentally gives up when the allocation takes place within a conditional statement. In scenario TWO, a conditional statement is introduced by inlining two possible implementations, which means each operation allocates the 16 bytes required for the `Optional`.
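The conditional limitation is easy to demonstrate in isolation. The method below is an illustrative sketch of mine, not benchmark code: if profiling has seen both branches, the two `Optional` values merge at a phi node before `orElse`, and scalar replacement requires an allocation that dominates all of its uses, so the allocation survives. A branch that has never been taken is instead pruned with an uncommon trap, which is why scenario ONE (where the input is constant within a trial) allocates nothing despite the branch inside `Optional.ofNullable`.

```java
import java.util.Optional;

public class PhiMergeExample {
    // Illustrative method (not from the benchmark). When both branches have been taken,
    // the empty and non-empty Optionals merge at a phi node, so C2 cannot scalar-replace
    // the Optional.of allocation; when one branch has never been taken, it is pruned with
    // an uncommon trap, the merge disappears, and the allocation is eliminated.
    static int unwrap(Integer boxed) {
        Optional<Integer> wrapper = boxed == null ? Optional.empty() : Optional.of(boxed);
        return wrapper.orElse(0);
    }
}
```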
The signal is fairly weak in this benchmark, and is almost entirely masked by the fact that the benchmark allocates a 24 byte `Instant` per invocation. To accentuate the difference, we can remove the background allocation from the benchmark and track the same metrics.
```java
@State(Scope.Benchmark)
public static class StringEscapeeState {

    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;

    @Param({"true", "false"})
    boolean isPresent;

    Escapee<String>[] escapees;
    int size = 4;
    String input;
    String ifPresent;
    String ifAbsent;

    @Setup(Level.Trial)
    public void init() {
        escapees = new Escapee[size];
        scenario.fill(escapees);
        ifPresent = UUID.randomUUID().toString();
        ifAbsent = UUID.randomUUID().toString();
        input = isPresent ? "" : null;
    }
}
```
```java
@Benchmark
@OperationsPerInvocation(4)
public void mapValueNoAllocation(StringEscapeeState state, Blackhole bh) {
    for (Escapee<String> escapee : state.escapees) {
        bh.consume(escapee.map(state.input, x -> state.ifPresent).orElseGet(() -> state.ifAbsent));
    }
}
```
```
taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee-string.csv -prof gc -jvmArgs="-XX:-TieredCompilation -XX:+UseSerialGC -mx8G" EscapeeBenchmark.mapValueNoAllocation
```
While even the cost of very low intensity realistic work (allocating a timestamp) is enough to mask the cost of failing to inline, when the virtual call is a no-op we can make its impact look quite severe. ONE and TWO are much faster because they at least eliminate the virtual function call, whether or not the input is null.
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: isPresent | Param: scenario |
---|---|---|---|---|---|---|---|---|
mapValueNoAllocation | thrpt | 1 | 15 | 206.913491 | 3.003555 | ops/us | true | ONE |
mapValueNoAllocation | thrpt | 1 | 15 | 162.014816 | 4.353872 | ops/us | true | TWO |
mapValueNoAllocation | thrpt | 1 | 15 | 77.959095 | 2.174789 | ops/us | true | THREE |
mapValueNoAllocation | thrpt | 1 | 15 | 77.845562 | 3.592952 | ops/us | true | FOUR |
mapValueNoAllocation | thrpt | 1 | 15 | 202.016045 | 2.830117 | ops/us | false | ONE |
mapValueNoAllocation | thrpt | 1 | 15 | 198.241125 | 2.351662 | ops/us | false | TWO |
mapValueNoAllocation | thrpt | 1 | 15 | 88.187145 | 3.908423 | ops/us | false | THREE |
mapValueNoAllocation | thrpt | 1 | 15 | 89.715024 | 2.234652 | ops/us | false | FOUR |
It’s easy to imagine that allocation has been curtailed, only to be caught out by the limitations of escape analysis in the presence of polymorphism. In scenario ONE, there is never any allocation: escape analysis must have worked. In scenario TWO, because of the inlined conditional, the 16 byte `Optional` is allocated once per invocation with non-null input, and when the input is always null there are no allocations at all. However, when the inlining doesn’t work in scenarios THREE and FOUR, an extra 16 bytes is allocated once per invocation, and it isn’t the `Optional`: the unintentional 16 bytes comes from the lambda capturing `state` in each case (a 12 byte object header and a 4 byte compressed reference to the captured state). How often do you check your benchmarks to ensure you are measuring what you think you are?
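The capture cost is easy to see in isolation. Here is a sketch of mine (not benchmark code): a lambda that captures a variable compiles to a class with a field per captured value, and a fresh instance may be allocated each time the expression is evaluated unless escape analysis removes it, whereas a non-capturing lambda is instantiated once and cached.

```java
import java.util.function.Supplier;

public class CaptureSketch {
    String ifAbsent = "absent";

    Supplier<String> capturing() {
        // Captures this, so each call allocates a 16 byte instance (12 byte header plus a
        // 4 byte compressed reference) unless escape analysis proves it doesn't escape;
        // escaping into a call that wasn't inlined defeats that, as in THREE and FOUR.
        return () -> this.ifAbsent;
    }

    Supplier<String> nonCapturing() {
        // Captures nothing: the LambdaMetafactory hands out the same cached instance.
        return () -> "absent";
    }
}
```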
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: isPresent | Param: scenario |
---|---|---|---|---|---|---|---|---|
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 0.000002 | 0 | B/op | true | ONE |
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 16.000003 | 0 | B/op | true | TWO |
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 32.000005 | 0 | B/op | true | THREE |
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 32.000005 | 0 | B/op | true | FOUR |
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 0.000002 | 0 | B/op | false | ONE |
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 0.000002 | 0 | B/op | false | TWO |
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 16.000005 | 0 | B/op | false | THREE |
mapValueNoAllocation:·gc.alloc.rate.norm | thrpt | 1 | 15 | 16.000005 | 0 | B/op | false | FOUR |
It’s not the sort of thing that can be exploited in real programs, but it looks as if allocations are better eliminated when the method, be it virtual or inlined, only ever sees a null value. Actually, `Optional.empty()` always returns the same instance, so there were no allocations in the first place.
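This is easy to verify; a minimal check of mine, relying on the fact that OpenJDK’s `Optional` keeps the empty instance in a static final field:

```java
import java.util.Optional;

public class EmptySingleton {
    public static void main(String[] args) {
        // In OpenJDK, Optional.empty() returns the same shared constant every time,
        // so the "always null" path never allocates a wrapper, inlined or not.
        System.out.println(Optional.empty() == Optional.empty()); // prints true
    }
}
```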
Having contrived a case to accentuate the effect, it’s worth noting that the impact of failure to inline is smaller than the difference between garbage collectors in the cost of allocating an instance and storing the value, a cost some developers seem to be unaware of.
```java
@State(Scope.Benchmark)
public static class InstantStoreEscapeeState {

    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;

    @Param({"true", "false"})
    boolean isPresent;

    int size = 4;
    String input;
    Escapee<Instant>[] escapees;
    Instant[] target;

    @Setup(Level.Trial)
    public void init() {
        escapees = new Escapee[size];
        target = new Instant[size];
        scenario.fill(escapees);
        input = isPresent ? "" : null;
    }
}
```
```java
@Benchmark
@OperationsPerInvocation(4)
public void mapAndStoreValue(InstantStoreEscapeeState state, Blackhole bh) {
    for (int i = 0; i < state.escapees.length; ++i) {
        state.target[i] = state.escapees[i].map(state.input, x -> Instant.now()).orElseGet(Instant::now);
    }
    bh.consume(state.target);
}
```
I run the same benchmark in two modes:
```
taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee-store-serial.csv -prof gc -jvmArgs="-XX:-TieredCompilation -XX:+UseSerialGC -mx8G" EscapeeBenchmark.mapAndStoreValue$
taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee-store-g1.csv -prof gc -jvmArgs="-XX:-TieredCompilation -XX:+UseG1GC -mx8G" EscapeeBenchmark.mapAndStoreValue$
```
Triggering the write barriers when storing the value (a simple card mark in the case of the serial collector, a much more complex barrier in the case of G1) costs about as much as missing out on inlining; a rough sketch of the two barrier shapes follows the results table. Note that this is not an argument that garbage collector overhead is unacceptable!
Benchmark | GC | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: isPresent | Param: scenario |
---|---|---|---|---|---|---|---|---|---|
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 23.739993 | 0.297493 | ops/us | true | ONE |
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 22.41715 | 0.502928 | ops/us | true | TWO |
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 21.096494 | 0.629228 | ops/us | true | THREE |
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 20.656528 | 0.604725 | ops/us | true | FOUR |
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 24.098976 | 0.479819 | ops/us | false | ONE |
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 23.759017 | 0.460972 | ops/us | false | TWO |
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 21.473803 | 0.411786 | ops/us | false | THREE |
mapAndStoreValue | SerialGC | thrpt | 1 | 15 | 21.524173 | 0.393322 | ops/us | false | FOUR |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 20.522258 | 0.463444 | ops/us | true | ONE |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 18.520677 | 0.229133 | ops/us | true | TWO |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 18.359042 | 0.276809 | ops/us | true | THREE |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 18.446654 | 0.272189 | ops/us | true | FOUR |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 20.768856 | 0.496087 | ops/us | false | ONE |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 20.277051 | 0.411466 | ops/us | false | TWO |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 18.875519 | 0.399535 | ops/us | false | THREE |
mapAndStoreValue | G1GC | thrpt | 1 | 15 | 18.824234 | 0.657469 | ops/us | false | FOUR |
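For intuition about why the barriers differ in cost, here is the rough model promised above. This is a simplification of mine, with flat arrays standing in for the card table and made-up card and region sizes; it models the idea, not HotSpot’s actual code.

```java
public class WriteBarrierSketch {
    static final byte DIRTY = 0;
    static final byte[] CARD_TABLE = new byte[1 << 20];

    // Serial/Parallel style: one unconditional store per reference write. Cheap.
    static void serialPostBarrier(long fieldAddress) {
        CARD_TABLE[(int) ((fieldAddress >>> 9) & (CARD_TABLE.length - 1))] = DIRTY; // 512 byte cards
    }

    // G1 style: filter same-region and null stores, check the card, maybe enqueue it for
    // refinement. The extra branches and queue traffic are what the benchmark pays for.
    static void g1PostBarrier(long fieldAddress, long valueAddress) {
        if ((fieldAddress >>> 20) == (valueAddress >>> 20)) return; // same (1MB) region
        if (valueAddress == 0) return;                              // null store
        int card = (int) ((fieldAddress >>> 9) & (CARD_TABLE.length - 1));
        if (CARD_TABLE[card] != DIRTY) {
            CARD_TABLE[card] = DIRTY;
            enqueue(card); // stand-in for G1's per-thread dirty card queue
        }
    }

    static void enqueue(int card) { /* processed by concurrent refinement threads */ }
}
```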
Inlining makes escape analysis more effective by allowing the analysis to apply to larger blocks of code, but it is only effective when the call site sees a single implementation. The marginal benefit decreases in the presence of even trivial allocation, but can be expected to increase with the size of the eliminated allocation. The difference can even be smaller than the runtime cost of write barriers in some garbage collectors. My benchmarks are on GitHub; they were run with OpenJDK 11+28 on Ubuntu 18.04.2 LTS.
Perhaps this analysis is facile; many optimisations more powerful than escape analysis depend on inlining. The next post in the series will be on the benefits, or lack thereof, of inlining a reduction operation such as a hash code.