Loop Fission

Loop fission is a process normally applied by a compiler to loops to make them faster. The idea is to split larger loop bodies which perform several distinct tasks into separate loops, in the hope that the individual loops can be optimised more effectively in isolation. As far as I’m aware, C2, Hotspot’s JIT compiler, doesn’t do this so you have to do it yourself. I came across a couple of cases where this was profitable recently, and this post uses these as examples.

XOR
Group By

XOR

One of the operations which can be performed between two RoaringBitmap objects is an XOR or symmetric difference. After applying a destructive (i.e. one of the bitmaps will be mutated) XOR between two bitmaps, the mutated bitmap will contain a bit wherever the bitmaps differed before the XOR. One of the three container types in a RoaringBitmap is a BitmapContainer which contains a bitmap, a dense collection of bits. When a destructive XOR is applied between two BitmapContainers, the mutated container must have its bits updated, but they also need to be counted. This can be done in one loop or two separate loops, as in the representative benchmark below:

@State(Scope.Benchmark)
public class XorCount {

  @Param({"256", "512", "1024", "2048", "4096"})
  int size;

  private long[] left;
  private long[] right;

  @Setup(Level.Trial)
  public void setup() {
    left = new long[size];
    right = new long[size];
    for (int i = 0; i < size; i++) {
      left[i] = ThreadLocalRandom.current().nextLong();
      right[i] = ThreadLocalRandom.current().nextLong();
    }
  }


  @Benchmark
  public int fused() {
    int count = 0;
    for (int i = 0; i < left.length & i < right.length; i++) {
      left[i] ^= right[i];
      count += Long.bitCount(left[i]);
    }
    return count;
  }

  @Benchmark
  public int fissured() {
    for (int i = 0; i < left.length & i < right.length; i++) {
      left[i] ^= right[i];
    }
    int count = 0;
    for (long l : left) {
      count += Long.bitCount(l);
    }
    return count;
  }
}

It seems intuitive that fused would be faster; there is only one pass over the data, but this isn’t necessarily the case.

Benchmark	Mode	Threads	Samples	Score	Score Error (99.9%)	Unit	Param: size
XorCount.fissured	avgt	1	5	146.324870	0.166643	ns/op	256
XorCount.fissured	avgt	1	5	283.583423	0.156367	ns/op	512
XorCount.fissured	avgt	1	5	564.507047	0.512299	ns/op	1024
XorCount.fissured	avgt	1	5	1154.083245	6.005591	ns/op	2048
XorCount.fissured	avgt	1	5	2805.933619	10.880235	ns/op	4096
XorCount.fused	avgt	1	5	229.713707	0.138824	ns/op	256
XorCount.fused	avgt	1	5	462.435821	1.015399	ns/op	512
XorCount.fused	avgt	1	5	908.624781	0.759116	ns/op	1024
XorCount.fused	avgt	1	5	1802.717233	3.106054	ns/op	2048
XorCount.fused	avgt	1	5	3585.279127	6.716832	ns/op	4096

XorCount

For small sizes, the fissured loop is noticeably faster, with convergence as the lengths grow. Eventually, the bitset would get so large that doing a single pass would make more sense as it makes better use of cache, but in a BitmapContainer the length is fixed to 1024 elements, so this is immaterial.

Why is the fissured loop faster? This is because of the loop which performs the XOR operation:

    for (int i = 0; i < left.length & i < right.length; i++) {
      left[i] ^= right[i];
    }

This loop can be autovectorised - many XORs can be computed in a single instruction. On my laptop, running JDK11 on Ubuntu, 256 bits or 4 longs are operated on from each array at a time.

  1.11%  │    │││ │↗      ││  0x00007f1bc835b892: vmovdqu 0x10(%rdi,%rdx,8),%ymm0
  0.31%  │    │││ ││      ││  0x00007f1bc835b898: vpxor  0x10(%r11,%rdx,8),%ymm0,%ymm0
  0.40%  │    │││ ││      ││  0x00007f1bc835b89f: vmovdqu %ymm0,0x10(%r11,%rdx,8)
  1.09%  │    │││ ││      ││  0x00007f1bc835b8a6: vmovdqu 0x30(%rdi,%rdx,8),%ymm0
  0.44%  │    │││ ││      ││  0x00007f1bc835b8ac: vpxor  0x30(%r11,%rdx,8),%ymm0,%ymm0
  1.69%  │    │││ ││      ││  0x00007f1bc835b8b3: vmovdqu %ymm0,0x30(%r11,%rdx,8)
  0.42%  │    │││ ││      ││  0x00007f1bc835b8ba: vmovdqu 0x50(%rdi,%rdx,8),%ymm0
  0.31%  │    │││ ││      ││  0x00007f1bc835b8c0: vpxor  0x50(%r11,%rdx,8),%ymm0,%ymm0
  0.77%  │    │││ ││      ││  0x00007f1bc835b8c7: vmovdqu %ymm0,0x50(%r11,%rdx,8)
  0.54%  │    │││ ││      ││  0x00007f1bc835b8ce: vmovdqu 0x70(%rdi,%rdx,8),%ymm0
  0.42%  │    │││ ││      ││  0x00007f1bc835b8d4: vpxor  0x70(%r11,%rdx,8),%ymm0,%ymm0
  2.28%  │    │││ ││      ││  0x00007f1bc835b8db: vmovdqu %ymm0,0x70(%r11,%rdx,8)
  0.33%  │    │││ ││      ││  0x00007f1bc835b8e2: vmovdqu 0x90(%rdi,%rdx,8),%ymm0
  0.46%  │    │││ ││      ││  0x00007f1bc835b8eb: vpxor  0x90(%r11,%rdx,8),%ymm0,%ymm0
  1.13%  │    │││ ││      ││  0x00007f1bc835b8f5: vmovdqu %ymm0,0x90(%r11,%rdx,8)
  0.19%  │    │││ ││      ││  0x00007f1bc835b8ff: vmovdqu 0xb0(%rdi,%rdx,8),%ymm0
  0.36%  │    │││ ││      ││  0x00007f1bc835b908: vpxor  0xb0(%r11,%rdx,8),%ymm0,%ymm0
  2.38%  │    │││ ││      ││  0x00007f1bc835b912: vmovdqu %ymm0,0xb0(%r11,%rdx,8)
  0.13%  │    │││ ││      ││  0x00007f1bc835b91c: vmovdqu 0xd0(%rdi,%rdx,8),%ymm0
  0.40%  │    │││ ││      ││  0x00007f1bc835b925: vpxor  0xd0(%r11,%rdx,8),%ymm0,%ymm0
  1.44%  │    │││ ││      ││  0x00007f1bc835b92f: vmovdqu %ymm0,0xd0(%r11,%rdx,8)
  0.15%  │    │││ ││      ││  0x00007f1bc835b939: vmovdqu 0xf0(%rdi,%rdx,8),%ymm0
  0.23%  │    │││ ││      ││  0x00007f1bc835b942: vpxor  0xf0(%r11,%rdx,8),%ymm0,%ymm0
  2.84%  │    │││ ││      ││  0x00007f1bc835b94c: vmovdqu %ymm0,0xf0(%r11,%rdx,8)

The other loop doesn’t get the same treatment from the compiler:

    int count = 0;
    for (long l : left) {
      count += Long.bitCount(l);
    }
    return count;

Each of the popcnt instructions below operate on 64 bits each.

  0.09%  │││              ↗   0x00007f3b1c35bd02: popcnt 0x28(%r11,%rbx,8),%r8
 12.50%  │││              │   0x00007f3b1c35bd09: popcnt 0x20(%r11,%rbx,8),%rdi
  0.02%  │││              │   0x00007f3b1c35bd10: popcnt 0x48(%r11,%rbx,8),%rdx
 12.86%  │││              │   0x00007f3b1c35bd17: popcnt 0x40(%r11,%rbx,8),%r9
  0.02%  │││              │   0x00007f3b1c35bd1e: popcnt 0x38(%r11,%rbx,8),%rax
  0.02%  │││              │   0x00007f3b1c35bd25: popcnt 0x30(%r11,%rbx,8),%rsi
         │││              │   0x00007f3b1c35bd2c: popcnt 0x18(%r11,%rbx,8),%r13
  7.61%  │││              │   0x00007f3b1c35bd33: popcnt 0x10(%r11,%rbx,8),%rbp
         │││              │   0x00007f3b1c35bd3a: add    %ecx,%ebp
  0.07%  │││              │   0x00007f3b1c35bd3c: add    %ebp,%r13d
  0.02%  │││              │   0x00007f3b1c35bd3f: add    %edi,%r13d
  8.25%  │││              │   0x00007f3b1c35bd42: add    %r8d,%r13d
  0.07%  │││              │   0x00007f3b1c35bd45: add    %r13d,%esi
  0.07%  │││              │   0x00007f3b1c35bd48: add    %esi,%eax
  7.94%  │││              │   0x00007f3b1c35bd4a: add    %eax,%r9d
  8.10%  │││              │   0x00007f3b1c35bd4d: add    %r9d,%edx

When the loops are fused (or aren’t fissured) the XORs aren’t vectorised and only operate on 64 bits at a time.

  0.31%    ↗│││   0x00007f6eac35baa0: mov    0x10(%r10,%r11,8),%rsi
  2.28%    ││││   0x00007f6eac35baa5: xor    0x10(%rbx,%r11,8),%rsi  
           ││││                                                 
           ││││                                                 
  7.69%    ││││   0x00007f6eac35baaa: mov    %rsi,0x10(%rbx,%r11,8)
           ││││                                                 
           ││││                                                 
  2.06%    ││││   0x00007f6eac35baaf: popcnt %rsi,%rsi
  8.92%    ││││   0x00007f6eac35bab4: add    %esi,%edx

In the fused loop, the loop advances at the pace of the slowest operation. At some point, the array would get so large that not doing two passes over the array would trump better code generation, which might be why C2 is cautious here. Notably, the popcount operation can be vectorised and clang does so, so the fused loop written in C++ and compiled with clang would be much faster. clang will also perform loop fission (which it calls distribution).
This is a good example of performance intuition from one language not carrying over into another.

Group By

I recently started working on Apache Pinot, which is a scalable OLAP store. In a group by query, values from one column are aggregated by values in another, and in Pinot this is performed on a block of data at a time. One of the steps in the aggregation is to copy a set of dictionary ids corresponding to the groups and to mark which groups are present in the block so they can be iterated over when aggregating the other column. The total number of groups for the entire column is known from statistics collected about the values, but not at the block level. The fused and fissured operations look like this (resetting the array would not happen for real but is necessary to make the benchmark stable):

@State(Scope.Benchmark)
public class GroupCount {

  @Param({"8", "64", "128"})
  int groups;
  @Param({"256", "512", "1024", "2048", "4096"})
  int length;

  private byte[] source;
  private byte[] dest;
  private boolean[] presence;

  @Setup(Level.Trial)
  public void setup() {
    source = new byte[length];
    dest = new byte[length];
    presence = new boolean[groups];
    SplittableRandom random = new SplittableRandom(42);
    for (int i = 0; i < source.length; i++) {
      source[i] = (byte) random.nextInt(groups);
    }
  }

  @Benchmark
  public void fused(Blackhole bh) {
    int numGroups = 0;
    for (int i = 0; i < source.length & i < dest.length; i++) {
      dest[i] = source[i];
      if (numGroups < groups && !presence[source[i]]) {
        presence[source[i] & 0xFF] = true;
        numGroups++;
      }
    }
    bh.consume(presence);
    Arrays.fill(presence, false);
  }

  @Benchmark
  public void fissured(Blackhole bh) {
    System.arraycopy(source, 0, dest, 0, source.length);
    int numGroups = 0;
    for (int i = 0; i < source.length & i < dest.length & numGroups < groups; i++) {
      if (!presence[source[i]]) {
        presence[source[i] & 0xFF] = true;
        numGroups++;
      }
    }
    bh.consume(presence);
    Arrays.fill(presence, false);
  }
}

This is a little more interesting than the other loop. Firstly, splitting the loops makes it obvious that the copy part is a manual System.arraycopy. The second loop can terminate whenever numGroups == groups without consuming the entire input, but in the fused loop has to carry on to ensure the entire (slow) manual array copy can take place. Whilst fission makes the copy fast, and an obvious candidate for removal, fission allows the second loop to terminate early.

Early termination isn’t necessarily a good thing because if the number of groups is high enough for it to be likely that no single block will contain all the groups, the loop won’t exit early. Worse, with early termination, the exit condition from the loop is data dependent, which prevents aggressive unrolling, so the loop will actually be slower when it doesn’t exit early.

Benchmark	Mode	Threads	Samples	Score	Score Error (99.9%)	Unit	Param: groups	Param: length
GroupCount.fissured	avgt	1	5	41.606830	0.187490	ns/op	8	256
GroupCount.fissured	avgt	1	5	44.518117	0.183473	ns/op	8	512
GroupCount.fissured	avgt	1	5	50.099552	0.263780	ns/op	8	1024
GroupCount.fissured	avgt	1	5	61.991607	0.254863	ns/op	8	2048
GroupCount.fissured	avgt	1	5	134.196339	0.138997	ns/op	8	4096
GroupCount.fissured	avgt	1	5	542.240261	1.993239	ns/op	64	256
GroupCount.fissured	avgt	1	5	494.577003	0.331404	ns/op	64	512
GroupCount.fissured	avgt	1	5	501.147187	0.265733	ns/op	64	1024
GroupCount.fissured	avgt	1	5	510.997369	1.264461	ns/op	64	2048
GroupCount.fissured	avgt	1	5	575.261847	0.583042	ns/op	64	4096
GroupCount.fissured	avgt	1	5	603.379813	1.389718	ns/op	128	256
GroupCount.fissured	avgt	1	5	1121.185683	26.927230	ns/op	128	512
GroupCount.fissured	avgt	1	5	1132.562254	36.868698	ns/op	128	1024
GroupCount.fissured	avgt	1	5	1141.871722	18.612546	ns/op	128	2048
GroupCount.fissured	avgt	1	5	1198.111175	49.512727	ns/op	128	4096
GroupCount.fused	avgt	1	5	197.811972	0.352569	ns/op	8	256
GroupCount.fused	avgt	1	5	344.481666	0.602220	ns/op	8	512
GroupCount.fused	avgt	1	5	642.021575	2.235174	ns/op	8	1024
GroupCount.fused	avgt	1	5	1236.629115	7.446193	ns/op	8	2048
GroupCount.fused	avgt	1	5	2478.591752	2.470746	ns/op	8	4096
GroupCount.fused	avgt	1	5	380.130086	1.642884	ns/op	64	256
GroupCount.fused	avgt	1	5	441.293203	1.776462	ns/op	64	512
GroupCount.fused	avgt	1	5	793.848668	17.267290	ns/op	64	1024
GroupCount.fused	avgt	1	5	1265.961589	5.980144	ns/op	64	2048
GroupCount.fused	avgt	1	5	2340.019651	4.249029	ns/op	64	4096
GroupCount.fused	avgt	1	5	290.768967	9.365530	ns/op	128	256
GroupCount.fused	avgt	1	5	752.002294	0.905900	ns/op	128	512
GroupCount.fused	avgt	1	5	1110.519415	1.109794	ns/op	128	1024
GroupCount.fused	avgt	1	5	1887.382144	48.908744	ns/op	128	2048
GroupCount.fused	avgt	1	5	2550.872751	23.857060	ns/op	128	4096

results

As can be seen above, loop fission changes things a lot, and the effect of the number of groups is more pronounced in the fissured implementation. When the ratio between length and groups is high, the fissured implementation wins (see 4096/8 134ns vs 2479ns). When the ratio is low, the effect is less extreme but in the opposite direction (see 512/128 1121ns vs 752ns). If you like looking at things like perfasm output to compare the counted and non–counted loop, it is here. Given that there are statistics about the number of groups, and the block size is a small and bounded value, fission allows a data driven decision to be made for whether to attempt to exit the second loop early.