I recently read a very interesting blog post about exposing Intel SIMD intrinsics via a fork of the Scala compiler (scala-virtualized), which reports multiplicative improvements in throughput over HotSpot JIT-compiled code. The academic paper (SIMD Intrinsics on Managed Language Runtimes), which has been accepted at CGO 2018, proposes a powerful alternative to the traditional JVM approach of pairing dumb programmers with a (hopefully) smart JIT compiler. Lightweight Modular Staging (LMS) allows the generation of an executable binary from a high-level representation: handcrafted representations of vectorised algorithms, written in a dialect of Scala, can be compiled natively and later invoked with a single JNI call. This approach bypasses C2 without incurring excessive JNI costs. The freely available benchmarks can be easily run to reproduce the results in the paper, which is an achievement in itself, but some of the Java implementations used as baselines look less efficient than they could be. This post is about improving the efficiency of the Java matrix multiplication the LMS generated code is benchmarked against. Despite hitting edge cases where autovectorisation fails, I find it is possible to get performance comparable to LMS with plain Java (and a JDK upgrade).

Two implementations of Java matrix multiplication are provided in the NGen benchmarks: JMMM.baseline, a naive and cache-unfriendly matrix multiplication, and JMMM.blocked, which is supplied as an improvement. JMMM.blocked is something of a local maximum because it does manual loop unrolling: this actually removes the trigger for autovectorisation analysis. I provide a simple and cache-efficient Java implementation (it has the same asymptotic complexity; the improvement is purely mechanical) and benchmark these implementations separately on JDK8 and the soon-to-be-released JDK10.

public void fast(float[] a, float[] b, float[] c, int n) {
  int in = 0;
  for (int i = 0; i < n; ++i) {
    int kn = 0;
    for (int k = 0; k < n; ++k) {
      float aik = a[in + k];
      for (int j = 0; j < n; ++j) {
        c[in + j] += aik * b[kn + j];
      }
      kn += n;
    }
    in += n;
  }
}

With JDK 1.8.0_131, the “fast” implementation is only 2x faster than the blocked algorithm; this is nowhere near fast enough to match LMS. In fact, on my Skylake laptop at 2.6GHz, LMS does a lot better than 5x the blocked throughput (more like 6x-8x), and performs between 2x and 4x better than the improved implementation. Flops/cycle is calculated as 2 * N^3 / (run time in seconds * CPU frequency in Hz).
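
A minimal sketch of that arithmetic (the helper and the example figures are mine, purely for illustration):

public static double flopsPerCycle(int n, double runTimeSeconds, double frequencyHz) {
  // one multiply and one add per inner-loop step: 2 * n^3 floating point operations
  double flops = 2.0 * n * n * n;
  // cycles elapsed = run time * clock frequency
  return flops / (runTimeSeconds * frequencyHz);
}

For example, a 1024x1024 multiplication taking about 0.9s at 2.6GHz works out at roughly 0.92 flops/cycle, in line with the first table below.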

====================================================
Benchmarking MMM.jMMM.fast (JVM implementation)
----------------------------------------------------
    Size (N) | Flops / Cycle
----------------------------------------------------
           8 | 0.4994459272
          32 | 1.0666533335
          64 | 0.9429120397
         128 | 0.9692385519
         192 | 0.9796619688
         256 | 1.0141446247
         320 | 0.9894415771
         384 | 1.0046245750
         448 | 1.0221353392
         512 | 0.9943527764
         576 | 0.9952093603
         640 | 0.9854689714
         704 | 0.9947153752
         768 | 1.0197765248
         832 | 1.0479691069
         896 | 1.0060121097
         960 | 0.9937347412
        1024 | 0.9056494897
====================================================

====================================================
Benchmarking MMM.nMMM.blocked (LMS generated)
----------------------------------------------------
    Size (N) | Flops / Cycle
----------------------------------------------------
           8 | 0.2500390686
          32 | 3.9999921875
          64 | 4.1626523901
         128 | 4.4618695374
         192 | 3.9598982956
         256 | 4.3737341517
         320 | 4.2412225389
         384 | 3.9640163416
         448 | 4.0957167537
         512 | 3.3801071278
         576 | 4.1869326167
         640 | 3.8225244883
         704 | 3.8648224140
         768 | 3.5240611589
         832 | 3.7941562681
         896 | 3.1735179981
         960 | 2.5856903789
        1024 | 1.7817152313
====================================================

====================================================
Benchmarking MMM.jMMM.blocked (JVM implementation)
----------------------------------------------------
    Size (N) | Flops / Cycle
----------------------------------------------------
           8 | 0.3333854248
          32 | 0.6336670915
          64 | 0.5733484649
         128 | 0.5987433798
         192 | 0.5819900921
         256 | 0.5473562109
         320 | 0.5623263520
         384 | 0.5583823292
         448 | 0.5657882256
         512 | 0.5430879470
         576 | 0.5269635678
         640 | 0.5595204791
         704 | 0.5297557807
         768 | 0.5493631388
         832 | 0.5471832673
         896 | 0.4769554752
         960 | 0.4985080443
        1024 | 0.4014589400
====================================================

JDK10 is about to be released, so it’s worth looking at the effect of recent improvements to C2, including better use of AVX2 and support for vectorised FMA. Since LMS depends on scala-virtualized, which currently only supports Scala 2.11, the LMS implementation cannot be run on a more recent JDK, so its JDK10 performance can only be extrapolated. Since its raison d’être is to bypass C2, it can reasonably be assumed to be insulated from JVM performance improvements (or regressions). Measurements of floating point operations per cycle provide a sensible comparison, in any case.

Moving away from ScalaMeter, I created a JMH benchmark to see how matrix multiplication behaves in JDK10.

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class MMM {

  @Param({"8", "32", "64", "128", "192", "256", "320", "384", "448", "512", "576", "640", "704", "768", "832", "896", "960", "1024"})
  int size;

  private float[] a;
  private float[] b;
  private float[] c;

  @Setup(Level.Trial)
  public void init() {
    a = DataUtil.createFloatArray(size * size);
    b = DataUtil.createFloatArray(size * size);
    c = new float[size * size];
  }

  @Benchmark
  public void fast(Blackhole bh) {
    fast(a, b, c, size);
    bh.consume(c);
  }

  @Benchmark
  public void baseline(Blackhole bh) {
    baseline(a, b, c, size);
    bh.consume(c);
  }


  @Benchmark
  public void blocked(Blackhole bh) {
    blocked(a, b, c, size);
    bh.consume(c);
  }

  //
  // Baseline implementation of a Matrix-Matrix-Multiplication
  //
  public void baseline (float[] a, float[] b, float[] c, int n){
    for (int i = 0; i < n; i += 1) {
      for (int j = 0; j < n; j += 1) {
        float sum = 0.0f;
        for (int k = 0; k < n; k += 1) {
          sum += a[i * n + k] * b[k * n + j];
        }
        c[i * n + j] = sum;
      }
    }
  }

  //
  // Blocked version of MMM, reference implementation available at:
  // http://csapp.cs.cmu.edu/2e/waside/waside-blocking.pdf
  //
  public void blocked(float[] a, float[] b, float[] c, int n) {
    int BLOCK_SIZE = 8;
    for (int kk = 0; kk < n; kk += BLOCK_SIZE) {
      for (int jj = 0; jj < n; jj += BLOCK_SIZE) {
        for (int i = 0; i < n; i++) {
          for (int j = jj; j < jj + BLOCK_SIZE; ++j) {
            float sum = c[i * n + j];
            for (int k = kk; k < kk + BLOCK_SIZE; ++k) {
              sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
          }
        }
      }
    }
  }

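  //
  // Cache-friendly MMM: each element a[i][k] scales row k of b and is accumulated
  // into row i of c, so all three arrays are accessed sequentially.
  //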
  public void fast(float[] a, float[] b, float[] c, int n) {
    int in = 0;
    for (int i = 0; i < n; ++i) {
      int kn = 0;
      for (int k = 0; k < n; ++k) {
        float aik = a[in + k];
        for (int j = 0; j < n; ++j) {
          c[in + j] = Math.fma(aik,  b[kn + j], c[in + j]);
        }
        kn += n;
      }
      in += n;
    }
  }
}
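
For completeness, here is a sketch of how this benchmark might be launched with the JMH runner API. The fork and iteration settings are assumptions rather than a record of the exact configuration, and the commented-out profiler line is just an option for dumping hot assembly (it needs Linux perf and the hsdis library).

// imports assumed: org.openjdk.jmh.runner.Runner, org.openjdk.jmh.runner.RunnerException,
// org.openjdk.jmh.runner.options.Options, org.openjdk.jmh.runner.options.OptionsBuilder
public static void main(String[] args) throws RunnerException {
  Options options = new OptionsBuilder()
      .include(MMM.class.getSimpleName())
      .forks(1)                   // assumption
      .warmupIterations(10)       // assumption
      .measurementIterations(10)  // matches the 10 samples in the results below
      // .addProfiler("perfasm")  // optional: hot assembly, Linux only
      .build();
  new Runner(options).run();
}

The JDK10 results were as follows.
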
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size | Ratio to blocked | Flops/Cycle
baseline thrpt 1 10 1228544.82 38793.17392 ops/s 8 1.061598336 0.483857652
baseline thrpt 1 10 22973.03402 1012.043446 ops/s 32 1.302266947 0.57906183
baseline thrpt 1 10 2943.088879 221.57475 ops/s 64 1.301414733 0.593471609
baseline thrpt 1 10 358.010135 9.342801 ops/s 128 1.292889618 0.577539747
baseline thrpt 1 10 105.758366 4.275503 ops/s 192 1.246415143 0.575804515
baseline thrpt 1 10 41.465557 1.112753 ops/s 256 1.430003946 0.535135851
baseline thrpt 1 10 20.479081 0.462547 ops/s 320 1.154267894 0.516198866
baseline thrpt 1 10 11.686685 0.263476 ops/s 384 1.186535349 0.509027985
baseline thrpt 1 10 7.344184 0.269656 ops/s 448 1.166421127 0.507965526
baseline thrpt 1 10 3.545153 0.108086 ops/s 512 0.81796657 0.366017216
baseline thrpt 1 10 3.789384 0.130934 ops/s 576 1.327168294 0.557048123
baseline thrpt 1 10 1.981957 0.040136 ops/s 640 1.020965271 0.399660104
baseline thrpt 1 10 1.76672 0.036386 ops/s 704 1.168272442 0.474179037
baseline thrpt 1 10 1.01026 0.049853 ops/s 768 0.845514112 0.352024966
baseline thrpt 1 10 1.115814 0.03803 ops/s 832 1.148752171 0.494331667
baseline thrpt 1 10 0.703561 0.110626 ops/s 896 0.938435436 0.389298235
baseline thrpt 1 10 0.629896 0.052448 ops/s 960 1.081741651 0.428685898
baseline thrpt 1 10 0.407772 0.019079 ops/s 1024 1.025356561 0.336801424
blocked thrpt 1 10 1157259.558 49097.48711 ops/s 8 1 0.455782226
blocked thrpt 1 10 17640.8025 1226.401298 ops/s 32 1 0.444656782
blocked thrpt 1 10 2261.453481 98.937035 ops/s 64 1 0.456020355
blocked thrpt 1 10 276.906961 22.851857 ops/s 128 1 0.446704605
blocked thrpt 1 10 84.850033 4.441454 ops/s 192 1 0.461968485
blocked thrpt 1 10 28.996813 7.585551 ops/s 256 1 0.374219842
blocked thrpt 1 10 17.742052 0.627629 ops/s 320 1 0.447208892
blocked thrpt 1 10 9.84942 0.367603 ops/s 384 1 0.429003641
blocked thrpt 1 10 6.29634 0.402846 ops/s 448 1 0.435490676
blocked thrpt 1 10 4.334105 0.384849 ops/s 512 1 0.447472097
blocked thrpt 1 10 2.85524 0.199102 ops/s 576 1 0.419726816
blocked thrpt 1 10 1.941258 0.10915 ops/s 640 1 0.391453182
blocked thrpt 1 10 1.51225 0.076621 ops/s 704 1 0.40588053
blocked thrpt 1 10 1.194847 0.063147 ops/s 768 1 0.416344283
blocked thrpt 1 10 0.971327 0.040421 ops/s 832 1 0.430320551
blocked thrpt 1 10 0.749717 0.042997 ops/s 896 1 0.414837526
blocked thrpt 1 10 0.582298 0.016725 ops/s 960 1 0.39629231
blocked thrpt 1 10 0.397688 0.043639 ops/s 1024 1 0.328472491
fast thrpt 1 10 1869676.345 76416.50848 ops/s 8 1.615606743 0.736364837
fast thrpt 1 10 48485.47216 1301.926828 ops/s 32 2.748484496 1.222132271
fast thrpt 1 10 6431.341657 153.905413 ops/s 64 2.843897392 1.296875098
fast thrpt 1 10 840.601821 45.998723 ops/s 128 3.035683242 1.356053685
fast thrpt 1 10 260.386996 13.022418 ops/s 192 3.068790745 1.417684611
fast thrpt 1 10 107.895708 6.584674 ops/s 256 3.720950575 1.392453537
fast thrpt 1 10 56.245336 2.729061 ops/s 320 3.170170846 1.417728592
fast thrpt 1 10 32.917996 2.196624 ops/s 384 3.342125323 1.433783932
fast thrpt 1 10 20.960189 2.077684 ops/s 448 3.328948087 1.449725854
fast thrpt 1 10 14.005186 0.7839 ops/s 512 3.231390564 1.445957112
fast thrpt 1 10 8.827584 0.883654 ops/s 576 3.091713481 1.297675056
fast thrpt 1 10 7.455607 0.442882 ops/s 640 3.840605937 1.503417416
fast thrpt 1 10 5.322894 0.464362 ops/s 704 3.519850554 1.428638807
fast thrpt 1 10 4.308522 0.153846 ops/s 768 3.605919419 1.501303934
fast thrpt 1 10 3.375274 0.106715 ops/s 832 3.474910097 1.495325228
fast thrpt 1 10 2.320152 0.367881 ops/s 896 3.094703735 1.28379924
fast thrpt 1 10 2.057478 0.150198 ops/s 960 3.533376381 1.400249889
fast thrpt 1 10 1.66255 0.181116 ops/s 1024 4.180538513 1.3731919

Interestingly, the blocked algorithm is now, at most sizes, the worst of the Java implementations. The code generated by C2 got a lot faster, but the fast implementation peaks at about 1.5 flops/cycle, which still doesn’t compete with LMS. Why? Taking a look at the assembly, it’s clear that the autovectoriser choked on the array offsets and produced scalar code, just like the implementations in the paper: the ss suffix below means each instruction operates on a single float. I wasn’t expecting this.

vmovss  xmm5,dword ptr [rdi+rcx*4+10h]
vfmadd231ss xmm5,xmm6,xmm2
vmovss  dword ptr [rdi+rcx*4+10h],xmm5

Is this the end of the story? No. With some hacks, and the cost of array allocation and a copy or two, autovectorisation can be tricked into working again. Copying each row of b into a scratch buffer and accumulating into a scratch row for c reduces the inner loop to a simple zero-offset saxpy, which C2 is happy to vectorise:

    public void fast(float[] a, float[] b, float[] c, int n) {
        float[] bBuffer = new float[n];
        float[] cBuffer = new float[n];
        int in = 0;
        for (int i = 0; i < n; ++i) {
            int kn = 0;
            for (int k = 0; k < n; ++k) {
                float aik = a[in + k];
                System.arraycopy(b, kn, bBuffer, 0, n);
                saxpy(n, aik, bBuffer, cBuffer);
                kn += n;
            }
            System.arraycopy(cBuffer, 0, c, in, n); 
            Arrays.fill(cBuffer, 0f);
            in += n;
        }
    }

    private void saxpy(int n, float aik, float[] b, float[] c) {
        for (int i = 0; i < n; ++i) {
            c[i] += aik * b[i];
        }
    }

Adding this hack into the NGen benchmark (back in JDK 1.8.0_131), I get closer to the LMS generated code, and beat it beyond L3 cache residency (6MB). LMS is still faster when both matrices fit in L3 concurrently, but by percentage points rather than a multiple. For small matrices, the cost of allocating and copying the scratch buffers wipes out the gains.
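
As a back-of-the-envelope check on where that boundary lies (my own arithmetic, counting only the two input matrices and ignoring c and the scratch buffers):

// Combined footprint of the two n x n float input matrices, in MB.
static double inputFootprintMB(int n) {
  return 2.0 * n * n * 4 / (1 << 20);
}

inputFootprintMB(832) is about 5.3MB, which still fits in a 6MB L3; inputFootprintMB(896) is about 6.1MB, which does not.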

====================================================
Benchmarking MMM.jMMM.fast (JVM implementation)
----------------------------------------------------
    Size (N) | Flops / Cycle
----------------------------------------------------
           8 | 0.2500390686
          32 | 0.7710872405
          64 | 1.1302489072
         128 | 2.5113453810
         192 | 2.9525859816
         256 | 3.1180920385
         320 | 3.1081563593
         384 | 3.1458423577
         448 | 3.0493148252
         512 | 3.0551158263
         576 | 3.1430376938
         640 | 3.2169923048
         704 | 3.1026513283
         768 | 2.4190053777
         832 | 3.3358586705
         896 | 3.0755689237
         960 | 2.9996690697
        1024 | 2.2935654309
====================================================

====================================================
Benchmarking MMM.nMMM.blocked (LMS generated)
----------------------------------------------------
    Size (N) | Flops / Cycle
----------------------------------------------------
           8 | 1.0001562744
          32 | 5.3330416826
          64 | 5.8180867784
         128 | 5.1717318641
         192 | 5.1639907462
         256 | 4.3418618628
         320 | 5.2536572701
         384 | 4.0801359215
         448 | 4.1337007093
         512 | 3.2678160754
         576 | 3.7973028890
         640 | 3.3557513664
         704 | 4.0103133240
         768 | 3.4188362575
         832 | 3.2189488327
         896 | 3.2316685219
         960 | 2.9985655539
        1024 | 1.7750946796
====================================================

With the benchmark below I calculate flops/cycle with improved JDK10 autovectorisation.

  @Benchmark
  public void fastBuffered(Blackhole bh) {
    fastBuffered(a, b, c, size);
    bh.consume(c);
  }

  public void fastBuffered(float[] a, float[] b, float[] c, int n) {
    float[] bBuffer = new float[n];
    float[] cBuffer = new float[n];
    int in = 0;
    for (int i = 0; i < n; ++i) {
      int kn = 0;
      for (int k = 0; k < n; ++k) {
        float aik = a[in + k];
        System.arraycopy(b, kn, bBuffer, 0, n);
        saxpy(n, aik, bBuffer, cBuffer);
        kn += n;
      }
      System.arraycopy(cBuffer, 0, c, in, n);
      Arrays.fill(cBuffer, 0f);
      in += n;
    }
  }

  private void saxpy(int n, float aik, float[] b, float[] c) {
    for (int i = 0; i < n; ++i) {
      c[i] = Math.fma(aik, b[i], c[i]);
    }
  }

Just as in the modified NGen benchmark, this starts paying off once the matrices have 64 rows and columns. Finally, though it took a JDK upgrade and a hack, I breached 4 flops per cycle:

Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit | Param: size | Flops / Cycle
fastBuffered thrpt 1 10 1047184.034 63532.95095 ops/s 8 0.412429404
fastBuffered thrpt 1 10 58373.56367 3239.615866 ops/s 32 1.471373026
fastBuffered thrpt 1 10 12099.41654 497.33988 ops/s 64 2.439838038
fastBuffered thrpt 1 10 2136.50264 105.038006 ops/s 128 3.446592911
fastBuffered thrpt 1 10 673.470622 102.577237 ops/s 192 3.666730488
fastBuffered thrpt 1 10 305.541519 25.959163 ops/s 256 3.943181586
fastBuffered thrpt 1 10 158.437372 6.708384 ops/s 320 3.993596774
fastBuffered thrpt 1 10 88.283718 7.58883 ops/s 384 3.845306266
fastBuffered thrpt 1 10 58.574507 4.248521 ops/s 448 4.051345968
fastBuffered thrpt 1 10 37.183635 4.360319 ops/s 512 3.839002314
fastBuffered thrpt 1 10 29.949884 0.63346 ops/s 576 4.40270151
fastBuffered thrpt 1 10 20.715833 4.175897 ops/s 640 4.177331789
fastBuffered thrpt 1 10 10.824837 0.902983 ops/s 704 2.905333492
fastBuffered thrpt 1 10 8.285254 1.438701 ops/s 768 2.886995686
fastBuffered thrpt 1 10 6.17029 0.746537 ops/s 832 2.733582608
fastBuffered thrpt 1 10 4.828872 1.316901 ops/s 896 2.671937962
fastBuffered thrpt 1 10 3.6343 1.293923 ops/s 960 2.473381573
fastBuffered thrpt 1 10 2.458296 0.171224 ops/s 1024 2.030442485

The code generated for the core of the loop looks better now:

vmovdqu ymm1,ymmword ptr [r13+r11*4+10h]
vfmadd231ps ymm1,ymm3,ymmword ptr [r14+r11*4+10h]
vmovdqu ymmword ptr [r13+r11*4+10h],ymm1                                               

These benchmark results can be compared on a line chart.

Given this improvement, it would be exciting to see how LMS can profit from JDK9 or JDK10 - does LMS provide the impetus to resume maintenance of scala-virtualized? L3 cache, which the LMS generated code seems to depend on for throughput, is typically shared between cores: a single thread rarely enjoys exclusive access. I would like to see benchmarks for the LMS generated code in the presence of concurrency.
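
For the Java side at least, such an experiment is easy to sketch with JMH. Everything below is a hypothetical addition of mine rather than part of either benchmark harness: the thread count and sizes are arbitrary, and fastBuffered is assumed to be copied into the class.

@State(Scope.Thread) // each measurement thread gets its own matrices
@OutputTimeUnit(TimeUnit.SECONDS)
public class ContendedMMM {

  @Param({"512", "768", "1024"}) // arbitrary sizes straddling L3 residency
  int size;

  private float[] a;
  private float[] b;
  private float[] c;

  @Setup(Level.Trial)
  public void init() {
    a = DataUtil.createFloatArray(size * size);
    b = DataUtil.createFloatArray(size * size);
    c = new float[size * size];
  }

  @Benchmark
  @Threads(4) // arbitrary; ideally matched to the number of physical cores sharing the L3
  public void contended(Blackhole bh) {
    fastBuffered(a, b, c, size); // the fastBuffered method shown earlier
    bh.consume(c);
  }
}

With Scope.Thread state the threads do not share data, so any slowdown relative to the single-threaded numbers would come from contention for shared cache and memory bandwidth.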