Performance Myths and Continuous Profiling

My last post was about five very simple things you can do to avoid Java programs from being slower than they need to be. The reception to this post was mixed. Some readers agreed that the problems mentioned in the post were indeed very common, while others suggested more common inefficient patterns to avoid. For example, I could have suggested to precompile regular expressions, or not to program by exception, or to avoid String.format, but I felt this was all well covered already. The aim of the post was to help Java programmers to program defensively for efficiency - just as many do for correctness - by giving an idea of what some common things cost. However, several readers dismissed the content of the post as premature optimisation.

It’s interesting how negative responses to posts about minor software efficiencies can be. There will be various causes for this reaction, but a common theme is the existence of performance myths perpetuated by bad blog posts. I certainly hope that I’m not perpetuating performance myths but this position is understandable.

Performance myths can be split into two categories, and I’ll give an example of each:

Those that were never true
Those that used to be true

I think it’s the second category which cause the strongest reactions because they counteract future improvements. In this post I take a look at examples of both before discussing continuous profiling since the items I chose were based on things I’ve seen show up in profiles a lot.

Never true: prefer pre-increment to post-increment for loop induction variables
True once: prefer StringBuilder to string concatenation
Continuous profiling

Never true: prefer pre-increment to post-increment for loop induction variables

I have encountered claims that loops which use pre-increment for loop induction variables are faster than those which use post-increment. So:

for (int i = 0; i < size; i++) {
   doIt();  
}

would be slightly slower than:

for (int i = 0; i < size; ++i) {
   doIt();  
}

Even if this were true, the loop body would dominate the cost of control flow in most cases. It’s worth looking at the origin of the myth and the reasoning behind it though. The myth comes from C++, where it’s possible to overload increment operators. As you are probably aware, post-increment yields the value before performing the increment. This may lead to copying the value, so unless the compiler eliminates the copy because it’s not observable, pre-increment would have an advantage. I’m not going to dig in to how often or when this is true in C++ because it has no relevance to programs written in Java.

Since the JIT compiler optimises loops, it’s very unlikely that it will generate code corresponding to either i++ or ++i. This can be seen by analysing a simple benchmark:

@State(Scope.Benchmark)
public class Increments {

  @Param("1024")
  int size;

  private int[] input;
  private int[] output;

  @Setup(Level.Trial)
  public void setup() {
    input = ThreadLocalRandom.current().ints(size).toArray();
    output = new int[size];
  }

  @Benchmark
  public void autovecPre(Blackhole bh) {
    for (int i = 0; i < input.length; ++i) {
      output[i] += input[i];
    }
    bh.consume(output);
  }

  @Benchmark
  public void autovecPost(Blackhole bh) {
    for (int i = 0; i < input.length; i++) {
      output[i] += input[i];
    }
    bh.consume(output);
  }

  @Benchmark
  public int reducePre(Blackhole bh) {
    int sum = 0;
    for (int i = 0; i < input.length; ++i) {
      sum += Integer.bitCount(input[i]);
    }
    return sum;
  }

  @Benchmark
  public int reducePost(Blackhole bh) {
    int sum = 0;
    for (int i = 0; i < input.length; i++) {
      sum += Integer.bitCount(input[i]);
    }
    return sum;
  }

  @Benchmark
  public void blackholedPre(Blackhole bh) {
    for (int i = 0; i < input.length; ++i) {
      bh.consume(i);
    }
  }

  @Benchmark
  public void blackholedPost(Blackhole bh) {
    for (int i = 0; i < input.length; i++) {
      bh.consume(i);
    }
  }
}

The benchmark consists of a pre- and post-increment version of three loops selected carefully. Loops autovecPre and autovecPost can be autovectorised. Loops reducePre and reducePost can’t be autovectorised (because the operation in the loop body can’t, but fortunately this will no longer be the case on some hardware, see JDK-8278868) but can be unrolled. Using Blackhold::consume prevents loop unrolling, so different code will be generated in blackholedPre and blackholedPost than in the other loops.

Firstly, there is no difference in the scores for these benchmarks (pre is slower once but it’s just noise):

Benchmark                  (size)  Mode  Cnt     Score   Error  Units
Increments.autovecPost       1024  avgt    5   156.744 ± 0.416  ns/op
Increments.autovecPre        1024  avgt    5   158.390 ± 0.096  ns/op
Increments.blackholedPost    1024  avgt    5  2798.470 ± 2.657  ns/op
Increments.blackholedPre     1024  avgt    5  2798.259 ± 2.553  ns/op
Increments.reducePost        1024  avgt    5   456.012 ± 0.267  ns/op
Increments.reducePre         1024  avgt    5   455.922 ± 0.299  ns/op

This is because the loops are JIT-compiled exactly the same way.

Here’s autovecPre and autovecPost, where i is incremented by 64 at a time: add $0x40,%edx.

....[Hottest Region 1]..............................................................................
c2, level 4, inc.generated.Increments_autovecPost_jmhTest::autovecPost_avgt_jmhStub, version 472 (464 bytes) 

                       0x00007fd6f435aff4: vzeroupper 
                       0x00007fd6f435aff7: add    $0x30,%rsp
                       0x00007fd6f435affb: pop    %rbp
                       0x00007fd6f435affc: mov    0x108(%r15),%r10
                       0x00007fd6f435b003: test   %eax,(%r10)        
                       0x00007fd6f435b006: retq   
                       0x00007fd6f435b007: nop                       
                                                                     
                                                                     
         ↗         ↗   0x00007fd6f435b008: vmovdqu 0x10(%r8,%rdx,4),%ymm0
  0.73%  │         │   0x00007fd6f435b00f: vpaddd 0x10(%r10,%rdx,4),%ymm0,%ymm0
  0.75%  │         │   0x00007fd6f435b016: vmovdqu %ymm0,0x10(%r8,%rdx,4)
         │         │                                                 
         │         │                                                 
         │         │   0x00007fd6f435b01d: add    $0x8,%edx          
         │         │                                                 
         │         │                                                 
  0.66%  │         │   0x00007fd6f435b020: cmp    %r9d,%edx
         ╰         │   0x00007fd6f435b023: jl     0x00007fd6f435b008 
                   │                                                 
                   │                                                 
                 ↗ │↗  0x00007fd6f435b025: cmp    %r11d,%edx
          ╭      │ ││  0x00007fd6f435b028: jge    0x00007fd6f435b03d
  0.02%   │      │ ││  0x00007fd6f435b02a: xchg   %ax,%ax            
          │      │ ││                                                
          │      │ ││                                                
          │↗     │ ││  0x00007fd6f435b02c: mov    0x10(%r10,%rdx,4),%ebx
  0.54%   ││     │ ││  0x00007fd6f435b031: add    %ebx,0x10(%r8,%rdx,4)
          ││     │ ││                                                
          ││     │ ││                                                
  0.58%   ││     │ ││  0x00007fd6f435b036: inc    %edx               
          ││     │ ││                                                
          ││     │ ││                                                
  0.54%   ││     │ ││  0x00007fd6f435b038: cmp    %r11d,%edx
          │╰     │ ││  0x00007fd6f435b03b: jl     0x00007fd6f435b02c 
          │      │ ││                                                
          │      │ ││                                                
          ↘ ↗    │ ││  0x00007fd6f435b03d: mov    %rcx,%rdx
  0.23%     │    │ ││  0x00007fd6f435b040: shl    $0x3,%rdx          
            │    │ ││                                                
            │    │ ││                                                
            │    │ ││  0x00007fd6f435b044: mov    0x10(%rsp),%rsi
  0.10%     │    │ ││  0x00007fd6f435b049: data16 xchg %ax,%ax
            │    │ ││  0x00007fd6f435b04c: vzeroupper 
  0.29%     │    │ ││  0x00007fd6f435b04f: callq  0x00007fd6ec89cb00 
            │    │ ││                                                
            │    │ ││                                                
            │    │ ││                                                
            │    │ ││                                                
            │    │ ││  0x00007fd6f435b054: mov    0x40(%rsp),%r10
  0.17%     │    │ ││  0x00007fd6f435b059: movzbl 0x94(%r10),%r10d   
            │    │ ││                                                
            │    │ ││                                                
  0.10%     │    │ ││  0x00007fd6f435b061: mov    0x108(%r15),%r11
            │    │ ││  0x00007fd6f435b068: add    $0x1,%rbp          
            │    │ ││                                                
            │    │ ││                                                
  0.08%     │    │ ││  0x00007fd6f435b06c: test   %eax,(%r11)        
            │    │ ││  0x00007fd6f435b06f: test   %r10d,%r10d
            │    │ ││  0x00007fd6f435b072: jne    0x00007fd6f435afcf 
            │    │ ││                                                
  0.02%     │    │ ││  0x00007fd6f435b078: mov    0x50(%rsp),%r10
            │    │ ││  0x00007fd6f435b07d: mov    0x10(%r10),%r10d   
            │    │ ││                                                
            │    │ ││                                                
  0.12%     │    │ ││  0x00007fd6f435b081: mov    0xc(%r12,%r10,8),%r11d  
            │    │ ││                                                
            │    │ ││                                                
            │    │ ││                                                
            │    │ ││  0x00007fd6f435b086: mov    0x50(%rsp),%r8
  0.06%     │    │ ││  0x00007fd6f435b08b: mov    0x14(%r8),%ecx     
            │    │ ││                                                
            │    │ ││                                                
            │    │ ││  0x00007fd6f435b08f: test   %r11d,%r11d
            ╰    │ ││  0x00007fd6f435b092: jbe    0x00007fd6f435b03d  
                 │ ││                                                
                 │ ││                                                
  0.23%          │ ││  0x00007fd6f435b094: mov    0xc(%r12,%rcx,8),%r8d  
                 │ ││                                                
                 │ ││                                                
                 │ ││                                                
                 │ ││  0x00007fd6f435b099: test   %r8d,%r8d
             ╭   │ ││  0x00007fd6f435b09c: jbe    0x00007fd6f435b1ed
  0.08%      │   │ ││  0x00007fd6f435b0a2: mov    %r11d,%r9d
             │   │ ││  0x00007fd6f435b0a5: dec    %r9d
  0.17%      │   │ ││  0x00007fd6f435b0a8: cmp    %r8d,%r9d
             │╭  │ ││  0x00007fd6f435b0ab: jae    0x00007fd6f435b1ed
             ││  │ ││  0x00007fd6f435b0b1: cmp    %r11d,%r9d
             ││╭ │ ││  0x00007fd6f435b0b4: jae    0x00007fd6f435b1ed
  0.08%      │││ │ ││  0x00007fd6f435b0ba: shl    $0x3,%r10
             │││ │ ││  0x00007fd6f435b0be: lea    (%r12,%rcx,8),%r8
  0.19%      │││ │ ││  0x00007fd6f435b0c2: mov    %r8d,%r9d
             │││ │ ││  0x00007fd6f435b0c5: shr    $0x2,%r9d
  0.06%      │││ │ ││  0x00007fd6f435b0c9: and    $0x7,%r9d
             │││ │ ││  0x00007fd6f435b0cd: mov    $0x3,%ebx
  0.19%      │││ │ ││  0x00007fd6f435b0d2: sub    %r9d,%ebx
             │││ │ ││  0x00007fd6f435b0d5: and    $0x7,%ebx
  0.06%      │││ │ ││  0x00007fd6f435b0d8: inc    %ebx
             │││ │ ││  0x00007fd6f435b0da: cmp    %r11d,%ebx
  0.08%      │││ │ ││  0x00007fd6f435b0dd: cmovg  %r11d,%ebx
             │││ │ ││  0x00007fd6f435b0e1: xor    %edx,%edx          
             │││ │ ││                                                
             │││ │ ││                                                
  0.48%      │││↗│ ││  0x00007fd6f435b0e3: mov    0x10(%r10,%rdx,4),%edi
             │││││ ││  0x00007fd6f435b0e8: add    %edi,0x10(%r8,%rdx,4)  
             │││││ ││                                                
             │││││ ││                                                
  0.95%      │││││ ││  0x00007fd6f435b0ed: inc    %edx               
             │││││ ││                                                
             │││││ ││                                                
             │││││ ││  0x00007fd6f435b0ef: cmp    %ebx,%edx
             │││╰│ ││  0x00007fd6f435b0f1: jl     0x00007fd6f435b0e3  
             │││ │ ││                                                
             │││ │ ││                                                
  0.02%      │││ │ ││  0x00007fd6f435b0f3: mov    %r11d,%ebx
             │││ │ ││  0x00007fd6f435b0f6: add    $0xffffffc1,%ebx
  0.21%      │││ │ ││  0x00007fd6f435b0f9: cmp    %ebx,%edx
             │││ ╰ ││  0x00007fd6f435b0fb: jge    0x00007fd6f435b025  
             │││   ││                                                
             │││   ││                                                
  2.90%      │││  ↗││  0x00007fd6f435b101: vmovdqu 0x10(%r8,%rdx,4),%ymm0
  0.04%      │││  │││  0x00007fd6f435b108: vpaddd 0x10(%r10,%rdx,4),%ymm0,%ymm0
 10.08%      │││  │││  0x00007fd6f435b10f: vmovdqu %ymm0,0x10(%r8,%rdx,4)
  1.00%      │││  │││  0x00007fd6f435b116: vmovdqu 0x30(%r8,%rdx,4),%ymm0
             │││  │││  0x00007fd6f435b11d: vpaddd 0x30(%r10,%rdx,4),%ymm0,%ymm0
  9.27%      │││  │││  0x00007fd6f435b124: vmovdqu %ymm0,0x30(%r8,%rdx,4)
  0.79%      │││  │││  0x00007fd6f435b12b: vmovdqu 0x50(%r8,%rdx,4),%ymm0
             │││  │││  0x00007fd6f435b132: vpaddd 0x50(%r10,%rdx,4),%ymm0,%ymm0
  8.07%      │││  │││  0x00007fd6f435b139: vmovdqu %ymm0,0x50(%r8,%rdx,4)
  0.89%      │││  │││  0x00007fd6f435b140: vmovdqu 0x70(%r8,%rdx,4),%ymm0
             │││  │││  0x00007fd6f435b147: vpaddd 0x70(%r10,%rdx,4),%ymm0,%ymm0
  9.58%      │││  │││  0x00007fd6f435b14e: vmovdqu %ymm0,0x70(%r8,%rdx,4)
  0.52%      │││  │││  0x00007fd6f435b155: vmovdqu 0x90(%r8,%rdx,4),%ymm0
             │││  │││  0x00007fd6f435b15f: vpaddd 0x90(%r10,%rdx,4),%ymm0,%ymm0
  8.34%      │││  │││  0x00007fd6f435b169: vmovdqu %ymm0,0x90(%r8,%rdx,4)
  0.64%      │││  │││  0x00007fd6f435b173: vmovdqu 0xb0(%r8,%rdx,4),%ymm0
             │││  │││  0x00007fd6f435b17d: vpaddd 0xb0(%r10,%rdx,4),%ymm0,%ymm0
  9.71%      │││  │││  0x00007fd6f435b187: vmovdqu %ymm0,0xb0(%r8,%rdx,4)
  0.58%      │││  │││  0x00007fd6f435b191: vmovdqu 0xd0(%r8,%rdx,4),%ymm0
             │││  │││  0x00007fd6f435b19b: vpaddd 0xd0(%r10,%rdx,4),%ymm0,%ymm0
 14.68%      │││  │││  0x00007fd6f435b1a5: vmovdqu %ymm0,0xd0(%r8,%rdx,4)
  0.54%      │││  │││  0x00007fd6f435b1af: vmovdqu 0xf0(%r8,%rdx,4),%ymm0
             │││  │││  0x00007fd6f435b1b9: vpaddd 0xf0(%r10,%rdx,4),%ymm0,%ymm0
  9.46%      │││  │││  0x00007fd6f435b1c3: vmovdqu %ymm0,0xf0(%r8,%rdx,4)  
             │││  │││                                                
             │││  │││                                                
  0.52%      │││  │││  0x00007fd6f435b1cd: add    $0x40,%edx         
             │││  │││                                                
             │││  │││                                                
             │││  │││  0x00007fd6f435b1d0: cmp    %ebx,%edx
             │││  ╰││  0x00007fd6f435b1d2: jl     0x00007fd6f435b101
  0.23%      │││   ││  0x00007fd6f435b1d8: mov    %r11d,%r9d
             │││   ││  0x00007fd6f435b1db: add    $0xfffffff9,%r9d
  0.10%      │││   ││  0x00007fd6f435b1df: cmp    %r9d,%edx
             │││   ╰│  0x00007fd6f435b1e2: jl     0x00007fd6f435b008
             │││    ╰  0x00007fd6f435b1e8: jmpq   0x00007fd6f435b025 
             │││                                                     
             │││                                                     
             ↘↘↘       0x00007fd6f435b1ed: mov    $0xffffff7e,%esi
                       0x00007fd6f435b1f2: mov    %r11d,0x18(%rsp)
                       0x00007fd6f435b1f7: nop
                       0x00007fd6f435b1f8: vzeroupper 
                       0x00007fd6f435b1fb: callq  0x00007fd6ec89d280  
                                                                     
....................................................................................................
 95.73%  <total for region 1>

....[Hottest Region 1]..............................................................................
c2, level 4, inc.generated.Increments_autovecPre_jmhTest::autovecPre_avgt_jmhStub, version 460 (464 bytes) 
             
                       0x00007f44cc35d2f4: vzeroupper 
                       0x00007f44cc35d2f7: add    $0x30,%rsp
                       0x00007f44cc35d2fb: pop    %rbp
                       0x00007f44cc35d2fc: mov    0x108(%r15),%r10
                       0x00007f44cc35d303: test   %eax,(%r10)        
                       0x00007f44cc35d306: retq   
                       0x00007f44cc35d307: nop                       
                                                                     
                                                                     
  0.77%  ↗         ↗   0x00007f44cc35d308: vmovdqu 0x10(%r8,%rdx,4),%ymm0
  0.25%  │         │   0x00007f44cc35d30f: vpaddd 0x10(%r10,%rdx,4),%ymm0,%ymm0
  3.27%  │         │   0x00007f44cc35d316: vmovdqu %ymm0,0x10(%r8,%rdx,4)
         │         │                                                 
         │         │                                                 
  0.29%  │         │   0x00007f44cc35d31d: add    $0x8,%edx          
         │         │                                                 
         │         │                                                 
  0.10%  │         │   0x00007f44cc35d320: cmp    %r9d,%edx
         ╰         │   0x00007f44cc35d323: jl     0x00007f44cc35d308  
                   │                                                 
                   │                                                 
  0.19%          ↗ │↗  0x00007f44cc35d325: cmp    %r11d,%edx
          ╭      │ ││  0x00007f44cc35d328: jge    0x00007f44cc35d33d
          │      │ ││  0x00007f44cc35d32a: xchg   %ax,%ax            
          │      │ ││                                                
          │      │ ││                                                
  0.50%   │↗     │ ││  0x00007f44cc35d32c: mov    0x10(%r10,%rdx,4),%ebx
          ││     │ ││  0x00007f44cc35d331: add    %ebx,0x10(%r8,%rdx,4)
          ││     │ ││                                                
          ││     │ ││                                                
  1.06%   ││     │ ││  0x00007f44cc35d336: inc    %edx               
          ││     │ ││                                                
          ││     │ ││                                                
          ││     │ ││  0x00007f44cc35d338: cmp    %r11d,%edx
          │╰     │ ││  0x00007f44cc35d33b: jl     0x00007f44cc35d32c 
          │      │ ││                                                
          │      │ ││                                                
  0.10%   ↘ ↗    │ ││  0x00007f44cc35d33d: mov    %rcx,%rdx
            │    │ ││  0x00007f44cc35d340: shl    $0x3,%rdx          
            │    │ ││                                                
            │    │ ││                                                
  0.17%     │    │ ││  0x00007f44cc35d344: mov    0x10(%rsp),%rsi
            │    │ ││  0x00007f44cc35d349: data16 xchg %ax,%ax
  0.10%     │    │ ││  0x00007f44cc35d34c: vzeroupper 
  0.10%     │    │ ││  0x00007f44cc35d34f: callq  0x00007f44c489cb00 
            │    │ ││                                                
            │    │ ││                                               
            │    │ ││                                               
            │    │ ││                                               
  0.02%     │    │ ││  0x00007f44cc35d354: mov    0x40(%rsp),%r10
  0.08%     │    │ ││  0x00007f44cc35d359: movzbl 0x94(%r10),%r10d   
            │    │ ││                                                
            │    │ ││                                                
            │    │ ││  0x00007f44cc35d361: mov    0x108(%r15),%r11
  0.04%     │    │ ││  0x00007f44cc35d368: add    $0x1,%rbp          
            │    │ ││                                                
            │    │ ││                                                
            │    │ ││  0x00007f44cc35d36c: test   %eax,(%r11)        
  0.25%     │    │ ││  0x00007f44cc35d36f: test   %r10d,%r10d
            │    │ ││  0x00007f44cc35d372: jne    0x00007f44cc35d2cf 
            │    │ ││                                                
            │    │ ││  0x00007f44cc35d378: mov    0x50(%rsp),%r10
  0.04%     │    │ ││  0x00007f44cc35d37d: mov    0x10(%r10),%r10d   
            │    │ ││                                                
            │    │ ││                                                
            │    │ ││  0x00007f44cc35d381: mov    0xc(%r12,%r10,8),%r11d
            │    │ ││                                                
            │    │ ││                                                
            │    │ ││                                                
  0.15%     │    │ ││  0x00007f44cc35d386: mov    0x50(%rsp),%r8
            │    │ ││  0x00007f44cc35d38b: mov    0x14(%r8),%ecx     
            │    │ ││                                                
            │    │ ││                                                
  0.04%     │    │ ││  0x00007f44cc35d38f: test   %r11d,%r11d
            ╰    │ ││  0x00007f44cc35d392: jbe    0x00007f44cc35d33d 
                 │ ││                                                
                 │ ││                                                
  0.06%          │ ││  0x00007f44cc35d394: mov    0xc(%r12,%rcx,8),%r8d
                 │ ││                                                
                 │ ││                                                
                 │ ││                                                
  0.27%          │ ││  0x00007f44cc35d399: test   %r8d,%r8d
             ╭   │ ││  0x00007f44cc35d39c: jbe    0x00007f44cc35d4ed
  0.15%      │   │ ││  0x00007f44cc35d3a2: mov    %r11d,%r9d
             │   │ ││  0x00007f44cc35d3a5: dec    %r9d
  0.17%      │   │ ││  0x00007f44cc35d3a8: cmp    %r8d,%r9d
             │╭  │ ││  0x00007f44cc35d3ab: jae    0x00007f44cc35d4ed
  0.08%      ││  │ ││  0x00007f44cc35d3b1: cmp    %r11d,%r9d
             ││╭ │ ││  0x00007f44cc35d3b4: jae    0x00007f44cc35d4ed
  0.19%      │││ │ ││  0x00007f44cc35d3ba: shl    $0x3,%r10
             │││ │ ││  0x00007f44cc35d3be: lea    (%r12,%rcx,8),%r8
             │││ │ ││  0x00007f44cc35d3c2: mov    %r8d,%r9d
  0.12%      │││ │ ││  0x00007f44cc35d3c5: shr    $0x2,%r9d
  0.04%      │││ │ ││  0x00007f44cc35d3c9: and    $0x7,%r9d
             │││ │ ││  0x00007f44cc35d3cd: mov    $0x3,%ebx
             │││ │ ││  0x00007f44cc35d3d2: sub    %r9d,%ebx
  0.10%      │││ │ ││  0x00007f44cc35d3d5: and    $0x7,%ebx
  0.17%      │││ │ ││  0x00007f44cc35d3d8: inc    %ebx
             │││ │ ││  0x00007f44cc35d3da: cmp    %r11d,%ebx
             │││ │ ││  0x00007f44cc35d3dd: cmovg  %r11d,%ebx
  0.06%      │││ │ ││  0x00007f44cc35d3e1: xor    %edx,%edx          
             │││ │ ││                                                
             │││ │ ││                                                
  0.19%      │││↗│ ││  0x00007f44cc35d3e3: mov    0x10(%r10,%rdx,4),%edi
  0.21%      │││││ ││  0x00007f44cc35d3e8: add    %edi,0x10(%r8,%rdx,4)
             │││││ ││                                                
             │││││ ││                                                
  0.50%      │││││ ││  0x00007f44cc35d3ed: inc    %edx               
             │││││ ││                                                
             │││││ ││                                                
  0.21%      │││││ ││  0x00007f44cc35d3ef: cmp    %ebx,%edx
             │││╰│ ││  0x00007f44cc35d3f1: jl     0x00007f44cc35d3e3 
             │││ │ ││                                                
             │││ │ ││                                                
  0.08%      │││ │ ││  0x00007f44cc35d3f3: mov    %r11d,%ebx
             │││ │ ││  0x00007f44cc35d3f6: add    $0xffffffc1,%ebx
             │││ │ ││  0x00007f44cc35d3f9: cmp    %ebx,%edx
             │││ ╰ ││  0x00007f44cc35d3fb: jge    0x00007f44cc35d325  
             │││   ││                                                
             │││   ││                                                
  3.25%      │││  ↗││  0x00007f44cc35d401: vmovdqu 0x10(%r8,%rdx,4),%ymm0
  0.42%      │││  │││  0x00007f44cc35d408: vpaddd 0x10(%r10,%rdx,4),%ymm0,%ymm0
  9.16%      │││  │││  0x00007f44cc35d40f: vmovdqu %ymm0,0x10(%r8,%rdx,4)
  1.29%      │││  │││  0x00007f44cc35d416: vmovdqu 0x30(%r8,%rdx,4),%ymm0
             │││  │││  0x00007f44cc35d41d: vpaddd 0x30(%r10,%rdx,4),%ymm0,%ymm0
  8.47%      │││  │││  0x00007f44cc35d424: vmovdqu %ymm0,0x30(%r8,%rdx,4)
  0.60%      │││  │││  0x00007f44cc35d42b: vmovdqu 0x50(%r8,%rdx,4),%ymm0
             │││  │││  0x00007f44cc35d432: vpaddd 0x50(%r10,%rdx,4),%ymm0,%ymm0
 13.43%      │││  │││  0x00007f44cc35d439: vmovdqu %ymm0,0x50(%r8,%rdx,4)
  1.04%      │││  │││  0x00007f44cc35d440: vmovdqu 0x70(%r8,%rdx,4),%ymm0
             │││  │││  0x00007f44cc35d447: vpaddd 0x70(%r10,%rdx,4),%ymm0,%ymm0
  7.91%      │││  │││  0x00007f44cc35d44e: vmovdqu %ymm0,0x70(%r8,%rdx,4)
  0.58%      │││  │││  0x00007f44cc35d455: vmovdqu 0x90(%r8,%rdx,4),%ymm0
             │││  │││  0x00007f44cc35d45f: vpaddd 0x90(%r10,%rdx,4),%ymm0,%ymm0
  9.66%      │││  │││  0x00007f44cc35d469: vmovdqu %ymm0,0x90(%r8,%rdx,4)
  0.58%      │││  │││  0x00007f44cc35d473: vmovdqu 0xb0(%r8,%rdx,4),%ymm0
             │││  │││  0x00007f44cc35d47d: vpaddd 0xb0(%r10,%rdx,4),%ymm0,%ymm0
  9.14%      │││  │││  0x00007f44cc35d487: vmovdqu %ymm0,0xb0(%r8,%rdx,4)
  0.50%      │││  │││  0x00007f44cc35d491: vmovdqu 0xd0(%r8,%rdx,4),%ymm0
             │││  │││  0x00007f44cc35d49b: vpaddd 0xd0(%r10,%rdx,4),%ymm0,%ymm0
  9.08%      │││  │││  0x00007f44cc35d4a5: vmovdqu %ymm0,0xd0(%r8,%rdx,4)
  0.83%      │││  │││  0x00007f44cc35d4af: vmovdqu 0xf0(%r8,%rdx,4),%ymm0
             │││  │││  0x00007f44cc35d4b9: vpaddd 0xf0(%r10,%rdx,4),%ymm0,%ymm0
  8.52%      │││  │││  0x00007f44cc35d4c3: vmovdqu %ymm0,0xf0(%r8,%rdx,4)
             │││  │││                                                
             │││  │││                                                
  0.37%      │││  │││  0x00007f44cc35d4cd: add    $0x40,%edx         
             │││  │││                                                
             │││  │││                                                
             │││  │││  0x00007f44cc35d4d0: cmp    %ebx,%edx
             │││  ╰││  0x00007f44cc35d4d2: jl     0x00007f44cc35d401
  0.15%      │││   ││  0x00007f44cc35d4d8: mov    %r11d,%r9d
             │││   ││  0x00007f44cc35d4db: add    $0xfffffff9,%r9d
             │││   ││  0x00007f44cc35d4df: cmp    %r9d,%edx
             │││   ╰│  0x00007f44cc35d4e2: jl     0x00007f44cc35d308
             │││    ╰  0x00007f44cc35d4e8: jmpq   0x00007f44cc35d325 
             │││                                                     
             │││                                                     
             ↘↘↘       0x00007f44cc35d4ed: mov    $0xffffff7e,%esi
                       0x00007f44cc35d4f2: mov    %r11d,0x18(%rsp)
                       0x00007f44cc35d4f7: nop
                       0x00007f44cc35d4f8: vzeroupper 
....................................................................................................
 95.13%  <total for region 1>

I haven’t included the perfasm output for the other loops, but the compiled code is also identical. In reducePre and reducePost the loop is unrolled so i is incremented 8 at a time (add $0x8,%esi). In the loops contrived not to be unrolled, the increments are performed one at a time using inc %ebp, but the code is still identical.

Even though this is a myth, does it actually cause any harm? Of course not, loops optimised under this delusion are compiled the same way. The only negative scenario would be if a misguided team member were to insist on this practice being followed.

True once: prefer StringBuilder to string concatenation

A very long time ago, in JDK1.4, when concatenating Strings, a StringBuffer was used.

public class StringExample {
  
  public static String concatenate(String left, String right) {
    return left + right;
  }

  public static String concatenateStringBuilder(String left, String right) {
    return new StringBuilder(left.length() + right.length())
            .append(left).append(right).toString();
  }
}

The top method above would use StringBuffer, which has synchronized methods:

  public static java.lang.String concatenate(java.lang.String, java.lang.String);
    Code:
       0: new           #2                  // class java/lang/StringBuffer
       3: dup
       4: invokespecial #3                  // Method java/lang/StringBuffer."<init>":()V
       7: aload_0
       8: invokevirtual #4                  // Method java/lang/StringBuffer.append:(Ljava/lang/String;)Ljava/lang/StringBuffer;
      11: aload_1
      12: invokevirtual #4                  // Method java/lang/StringBuffer.append:(Ljava/lang/String;)Ljava/lang/StringBuffer;
      15: invokevirtual #5                  // Method java/lang/StringBuffer.toString:()Ljava/lang/String;
      18: areturn

  public static java.lang.String concatenateStringBuilder(java.lang.String, java.lang.String);
    Code:
       0: new           #6                  // class java/lang/StringBuilder
       3: dup
       4: aload_0
       5: invokevirtual #7                  // Method java/lang/String.length:()I
       8: aload_1
       9: invokevirtual #7                  // Method java/lang/String.length:()I
      12: iadd
      13: invokespecial #8                  // Method java/lang/StringBuilder."<init>":(I)V
      16: aload_0
      17: invokevirtual #9                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      20: aload_1
      21: invokevirtual #9                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      24: invokevirtual #10                 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;

This meant that using a StringBuilder at language level 1.4 would be beneficial. If you learnt Java when 1.4 was commonly used, you might think this is an essential optimisation.

Later, the generated bytecode was improved, so that StringBuilder would be used instead. In JDK1.8 (and earlier) the bytecode looks like this:

  public static java.lang.String concatenate(java.lang.String, java.lang.String);
    Code:
       0: new           #2                  // class java/lang/StringBuilder
       3: dup
       4: invokespecial #3                  // Method java/lang/StringBuilder."<init>":()V
       7: aload_0
       8: invokevirtual #4                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      11: aload_1
      12: invokevirtual #4                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      15: invokevirtual #5                  // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
      18: areturn

  public static java.lang.String concatenateStringBuilder(java.lang.String, java.lang.String);
    Code:
       0: new           #2                  // class java/lang/StringBuilder
       3: dup
       4: aload_0
       5: invokevirtual #6                  // Method java/lang/String.length:()I
       8: aload_1
       9: invokevirtual #6                  // Method java/lang/String.length:()I
      12: iadd
      13: invokespecial #7                  // Method java/lang/StringBuilder."<init>":(I)V
      16: aload_0
      17: invokevirtual #4                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      20: aload_1
      21: invokevirtual #4                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      24: invokevirtual #5                  // Method java/lang/StringBuilder.toString:()Ljava/lang/String;

Maybe there’s a tiny benefit in sizing the StringBuilder, but now the code is harder to read for marginal gain. In JDK9, String concatenation was reworked, so the StringBuilder code is at a disadvantage to the more readable concatenation code:

  public static java.lang.String concatenate(java.lang.String, java.lang.String);
    Code:
       0: aload_0
       1: aload_1
       2: invokedynamic #2,  0              // InvokeDynamic #0:makeConcatWithConstants:(Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
       7: areturn

  public static java.lang.String concatenateStringBuilder(java.lang.String, java.lang.String);
    Code:
       0: new           #3                  // class java/lang/StringBuilder
       3: dup
       4: aload_0
       5: invokevirtual #4                  // Method java/lang/String.length:()I
       8: aload_1
       9: invokevirtual #4                  // Method java/lang/String.length:()I
      12: iadd
      13: invokespecial #5                  // Method java/lang/StringBuilder."<init>":(I)V
      16: aload_0
      17: invokevirtual #6                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      20: aload_1
      21: invokevirtual #6                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      24: invokevirtual #7                  // Method java/lang/StringBuilder.toString:()Ljava/lang/String;

The invokedynamic bytecode instruction, which can be used to drive better optimisations, is used instead of the StringBuilder. The confusing thing is this only applies when the number of strings is known, so not in loops, but in most cases it’s now better just to use +. I won’t bother to measure this progress because the people who made these changes to javac did that at the time, but the point is that even the bytecode can change in such a way to render old optimisations neutral or negative.

In my last post, I tried to avoid putting forward items which might become untrue in the future, but didn’t quite manage this. I predict that four of the five items will stand the test of time, but item three (avoid iterating over Enum.values()) may eventually become a myth if frozen arrays become a reality in the future.

Continuous profiling

What these myths have in common is they focus on low level technical trivia rather than fundamental algorithmic or engineering constraints, and trivial details are as likely to be misunderstood as they are to change. There are more important factors to system efficiency than low-level programming details. For instance, being required by specification to do something wasteful by an external system is a problem which should be solved architecturally. If architectural blunders create bottlenecks, it’s very costly to fix.

Assuming a sensible architecture, the causes of many costs are neither dictated by nor are observable to external systems, so can be changed easily. Efficiency at this level depends on choices of algorithm, data structure, and level of mechanical sympathy. These kinds of problems can be surfaced by using continuous profiling, and are often surprising. Continuous profiling, by its nature, tends to highlight what’s important to reducing overall cost, and filters out distracting and obvious fixed costs in favour of unexpected unit costs. My impression is that many engineers believe profiling to be too expensive to run in production, but I think these engineers’ minds would be blown by how much CPU time is spent collecting logs, metrics, and traces if they tried it. The reality is that it costs about 3% CPU utilisation and will lead to double-digit savings which translate to significantly reduced running costs for all but the leanest applications.

I think a diagram, drawn by Aleksey Shipilёv several years ago, explains where continuous profiling is useful:

I relax by drawing stuff on my large whiteboard. Here's "Perf Work Phase Diagram" for you. pic.twitter.com/RnITiLoNFH
— Aleksey Shipilëv (@shipilev) March 18, 2015

Essentially, continuous profiling will help you get from an Amazingly stupid program to Beauty Peak by making it obvious where the costs are; the majority of programs needn’t go beyond Beauty Peak. This process doesn’t consist entirely of making the kinds of mundane point substitutions in my last post. It’s often necessary to take a step back and ask why something costly is happening at all, and rework the algorithm, rather than think about cost reduction.

Despite all this, once you have data about what your program spends its time doing, many of the culprits will be mundane unless you have a very interesting program. When there’s no increase in complexity incurred in using a more efficient idiom, there’s no need to wait for its necessity to be proven by a profiler. In any case, continuous profiling should help calibrate your sense of which idioms are costly, and avoid the myths.