Spliterator Characteristics and Performance

The streams API has been around for a while now, and I’m a big fan of it. It allows for a clean declarative programming style, which permits various optimisations to occur, and keeps the pastafarians at bay. I also think the Stream is the perfect abstraction for data interchange across API boundaries. This is partly because a Stream is lazy, meaning you don’t need to pay for consumption until you actually need to, and partly because a Stream can only be used once and there can be no ambiguity about ownership. If you supply a Stream to an API, you must expect that it has been used and so must discard it. This almost entirely eradicates defensive copies and can mean that no intermediate data structures need ever exist. Despite my enthusiasm for this abstraction, there’s some weirdness in this API when you scratch beneath surface.

I wanted to find a way to quickly run length encode an IntStream and found it difficult to make this code as fast as I’d like it to be. The code is too slow because it’s necessary to inspect each int, even when there is enough context available to potentially apply optimisations such as skipping over ranges. It’s likely that I am experiencing the friction of treating Stream as a data interchange format, which wasn’t one of its design goals, but this led me to investigate spliterator characteristics (there is no contiguous characteristic, which could speed up RLE greatly) and their relationship with performance.

Spliterator Characteristics

Streams have spliterators, which control iteration and splitting behaviour. If you want to process a stream in parallel, it is the spliterator which dictates how to split the stream, if possible. There’s more to a spliterator than parallel execution though, and each single threaded execution can be optimised based on the characteristics bit mask. The different characteristics are as follows:

ORDERED promises that there is an order. For instance, trySplit is guaranteed to give a prefix of elements.
DISTINCT a promise that each element in the stream is unique.
SORTED a promise that the stream is already sorted.
SIZED promises the size of the stream is known. This is not true when a call to iterate generates the stream.
NONNULL promises that no elements in the stream are null.
IMMUTABLE promises the underlying data will not change.
CONCURRENT promises that the underlying data can be modified concurrently. Must not also be IMMUTABLE.
SUBSIZED promises that the sizes of splits are known, must also be SIZED.

There’s javadoc for all of these flags, which should be your point of reference, and you need to read it because you wouldn’t guess based on relative performance. For instance, IntStream.range(inclusive, exclusive) creates an RangeIntSpliterator with the characteristics ORDERED | SIZED | SUBSIZED | IMMUTABLE | NONNULL | DISTINCT | SORTED. This means that this stream has no duplicates, no nulls, is already sorted in natural order, the size is known, and it will be chunked deterministically. The data and the iteration order never change, and if we split it, we will always get the same first chunk. So these code snippets should have virtually the same performance:

    @Benchmark
    public long countRange() {
        return IntStream.range(0, size).count();
    }

    @Benchmark
    public long countRangeDistinct() {
        return IntStream.range(0, size).distinct().count();
    }

This is completely at odds with observations. Even though the elements are already distinct, and metadata exists to support this, requesting the distinct elements decimates performance.

Benchmark	Mode	Threads	Samples	Score	Score Error (99.9%)	Unit	Param: size
countRange	thrpt	1	10	49.465729	1.804123	ops/us	262144
countRangeDistinct	thrpt	1	10	0.000395	0.000002	ops/us	262144

It turns out this is because IntStream.distinct has a one-size-fits-all implementation which completely ignores the DISTINCT characteristic, and goes ahead and boxes the entire range.

    // from java.util.stream.IntPipeline
    @Override
    public final IntStream distinct() {
        // While functional and quick to implement, this approach is not very efficient.
        // An efficient version requires an int-specific map/set implementation.
        return boxed().distinct().mapToInt(i -> i);
    }

JDK-8193641 is related to this issue.

There is even more observable weirdness. If we wanted to calculate the sum of the first 1000 natural numbers, these two snippets should have the same performance. Requesting what should be a redundant sort doubles the throughput.

    @Benchmark 
    public long headSum() {
        return IntStream.range(0, size).limit(1000).sum();
    }

    @Benchmark
    public long sortedHeadSum() {
        return IntStream.range(0, size).sorted().limit(1000).sum();
    }

Benchmark	Mode	Threads	Samples	Score	Score Error (99.9%)	Unit	Param: size
headSum	thrpt	1	10	0.209763	0.002478	ops/us	262144
sortedHeadSum	thrpt	1	10	0.584227	0.006004	ops/us	262144

In fact, you would have a hard time finding a relationship between Spliterator characteristics and performance, but you can see cases of characteristics driving optimisations if you look hard enough, such as in IntStream.count, where the SIZED characteristic is used.

    // see java.util.stream.ReduceOps.makeIntCounting
    @Override
    public <P_IN> Long evaluateSequential(PipelineHelper<Integer> helper, Spliterator<P_IN> spliterator) {
        if (StreamOpFlag.SIZED.isKnown(helper.getStreamAndOpFlags()))
            return spliterator.getExactSizeIfKnown();
        return super.evaluateSequential(helper, spliterator);
    }

This is a measurably worthwhile optimisation, when benchmarked against the unsized spliterator created by IntStream.iterate:

    @Benchmark
    public long countIterator() {
        return IntStream.iterate(0, i -> i < size, i -> i + 1).count();
    }

    @Benchmark
    public long countRange() {
        return IntStream.range(0, size).count();
    }

Benchmark	Mode	Threads	Samples	Score	Score Error (99.9%)	Unit	Param: size
countIterator	thrpt	1	10	0.001198	0.001629	ops/us	262144
countRange	thrpt	1	10	43.166065	4.628715	ops/us	262144

What about limit, that’s supposed to be useful for speeding up streams by limiting the amount of work done? Unfortunately not. It actually makes things potentially much worse. In SliceOps.flags, we see it will actually disable SIZED operations:

    //see java.util.stream.SliceOps
    private static int flags(long limit) {
        return StreamOpFlag.NOT_SIZED | ((limit != -1) ? StreamOpFlag.IS_SHORT_CIRCUIT : 0);
    }

This has a significant effect on performance, as can be seen in the following benchmark:

    @Benchmark
    public long countRange() {
        return IntStream.range(0, size).count();
    }

    @Benchmark
    public long countHalfRange() {
        return IntStream.range(0, size).limit(size / 2).count();
    }

Benchmark	Mode	Threads	Samples	Score	Score Error (99.9%)	Unit	Param: size
countHalfRange	thrpt	1	10	0.003632	0.003363	ops/us	262144
countRange	thrpt	1	10	44.859998	6.191411	ops/us	262144

This is captured in JDK-8265029 and fixed by 3427

It’s almost as if there were grand plans involving characteristic based optimisation, and perhaps time ran out (IntStream.distinct has a very apologetic comment) or others were better on paper than in reality. In any case, it looks like they aren’t as influential as you might expect. Given that the relationship between the characteristics which exist and performance is flaky at best, it’s unlikely that a new one would get implemented, but I think the characteristic CONTIGUOUS is missing.