Building RoaringBitmaps from Streams
RoaringBitmap is a fast compressed bitset format. In the Java implementation of Roaring, it was until recently preferable to build a bitmap in one go from sorted data; incremental or unordered insertion incurred performance penalties of varying magnitude. In a recent pull request, I wanted to improve incremental monotonic insertion so I could build bitmaps from streams, but ended up speeding up unsorted batch creation significantly by accident.
Incremental Ordered Insertion
If you want to build a bitmap, you can do so efficiently with the RoaringBitmap.bitmapOf factory method.
int[] data = ...
RoaringBitmap bitmap = RoaringBitmap.bitmapOf(data);
However, I often find that I want to stream integers into a bitmap. Given that the integers being inserted into a bitmap often represent indices into an array, such a stream is likely to be monotonic. You might implement this like so:
IntStream stream = ...
RoaringBitmap bitmap = new RoaringBitmap();
stream.forEach(bitmap::add);
While this is OK, it has a few inefficiencies compared to the batch creation method.
- Indirection: the container being written to must be located on each insertion
- Eagerness: the cardinality must be kept up to date on each insertion
- Allocation pressure: the best container type can't be known in advance. The choice of container may change as data is inserted, which requires allocating new instances.
You could also collect the stream into an int[] and use the batch method, but that means materialising a potentially large temporary array, with obvious drawbacks.
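For example, something along these lines (a minimal sketch of that alternative):
IntStream stream = ...
int[] data = stream.toArray(); // materialises the whole stream in memory
RoaringBitmap bitmap = RoaringBitmap.bitmapOf(data);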
OrderedWriter
The solution I proposed is a writer object (OrderedWriter) which allocates a small 8KB buffer (1024 longs), used as a plain bitmap large enough to cover the entire 16-bit range of a single key. The stream-to-bitmap code becomes:
IntStream stream = ...
RoaringBitmap bitmap = new RoaringBitmap();
OrderedWriter writer = new OrderedWriter(bitmap);
stream.forEach(writer::add);
writer.flush(); // clear the buffer out
This is implemented so that a change in key (the most significant 16 bits of each integer) triggers a flush of the buffer.
public void add(int value) {
  short key = Util.highbits(value);
  short low = Util.lowbits(value);
  if (key != currentKey) {
    if (Util.compareUnsigned(key, currentKey) < 0) {
      throw new IllegalStateException("Must write in ascending key order");
    }
    // new key: append the buffered container before starting the next one
    flush();
  }
  int ulow = low & 0xFFFF;
  // set bit ulow in the buffer: word ulow / 64, bit ulow % 64
  bitmap[(ulow >>> 6)] |= (1L << ulow);
  currentKey = key;
  dirty = true;
}
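As a concrete illustration (values chosen arbitrarily): inserting 131074 = 2 * 65536 + 2 gives key = 2 and ulow = 2, so bit 2 of word 0 (ulow >>> 6 == 0) of the buffer is set. A later insert of 196608 = 3 * 65536 changes the key to 3, which flushes the key-2 container before the new value is buffered.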
When a flush occurs, a container type is chosen and appended to the bitmap’s prefix index.
public void flush() {
  if (dirty) {
    RoaringArray highLowContainer = underlying.highLowContainer;
    // we check that it's safe to append since RoaringArray.append does no validation
    if (highLowContainer.size > 0) {
      short key = highLowContainer.getKeyAtIndex(highLowContainer.size - 1);
      if (Util.compareUnsigned(currentKey, key) <= 0) {
        throw new IllegalStateException("Cannot write " + currentKey + " after " + key);
      }
    }
    highLowContainer.append(currentKey, chooseBestContainer());
    clearBitmap();
    dirty = false;
  }
}
There are significant performance advantages to this approach. There is no indirection cost and no searching the prefix index for containers: the writes are just buffered. The buffer is small enough to fit in cache, and containers only need to be created when the writer is flushed, which happens whenever a new key is seen or when flush is called manually. During a flush, the cardinality can be computed in one go, the best container type can be chosen, and run optimisation only has to happen once. Computing the cardinality is the only bottleneck: it requires 1024 calls to Long.bitCount, which can't be vectorised in a language like Java, and the cardinality can't be maintained incrementally on insertion without either sacrificing idempotence or incurring the cost of a membership check. After the flush, the buffer is cleared with a call to Arrays.fill, which is vectorised. So, despite the cost of the buffer, this can be quite efficient.
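For illustration, the per-flush work on the 1024-long buffer might look something like this (a sketch; the helper name computeCardinality is mine, not the library's API):
private int computeCardinality() {
  int cardinality = 0;
  for (int i = 0; i < bitmap.length; ++i) {
    cardinality += Long.bitCount(bitmap[i]); // 1024 scalar popcounts per flush
  }
  return cardinality;
}

private void clearBitmap() {
  Arrays.fill(bitmap, 0L); // bulk clear, which the JIT can vectorise
}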
This approach isn't universally applicable. For instance, you must write data in ascending order of the most significant 16 bits. You must also remember to flush the writer when you're finished: until you've called flush, the data in the last container may not be in the bitmap. For my particular use case this is reasonable, but there are times when it isn't fit for purpose, such as when you insert values occasionally and expect them to be visible to queries immediately. In general, if you don't know when you'll stop adding data to the bitmap, this isn't a good fit, because you won't know when to call flush.
Benchmark
I benchmarked the two approaches, varying bitmap sizes and randomness (the likelihood of there not being a compressible run), and was amazed to find that this approach actually beats having a sorted array and using RoaringBitmap.bitmapOf. Less surprising was beating the existing API for incremental adds (this was the goal in the first place). Lower is better:
Benchmark | (randomness) | (size) | Mode | Cnt | Score | Error | Units |
---|---|---|---|---|---|---|---|
buildRoaringBitmap | 0.1 | 10000 | avgt | 5 | 54.263 | 3.393 | us/op |
buildRoaringBitmap | 0.1 | 100000 | avgt | 5 | 355.188 | 15.234 | us/op |
buildRoaringBitmap | 0.1 | 1000000 | avgt | 5 | 3567.839 | 135.149 | us/op |
buildRoaringBitmap | 0.1 | 10000000 | avgt | 5 | 31982.046 | 1227.325 | us/op |
buildRoaringBitmap | 0.5 | 10000 | avgt | 5 | 53.855 | 0.887 | us/op |
buildRoaringBitmap | 0.5 | 100000 | avgt | 5 | 357.671 | 14.111 | us/op |
buildRoaringBitmap | 0.5 | 1000000 | avgt | 5 | 3556.152 | 243.671 | us/op |
buildRoaringBitmap | 0.5 | 10000000 | avgt | 5 | 34385.971 | 3864.143 | us/op |
buildRoaringBitmap | 0.9 | 10000 | avgt | 5 | 59.354 | 10.385 | us/op |
buildRoaringBitmap | 0.9 | 100000 | avgt | 5 | 374.245 | 54.485 | us/op |
buildRoaringBitmap | 0.9 | 1000000 | avgt | 5 | 3712.684 | 657.964 | us/op |
buildRoaringBitmap | 0.9 | 10000000 | avgt | 5 | 37223.976 | 4691.297 | us/op |
incrementalNativeAdd | 0.1 | 10000 | avgt | 5 | 115.213 | 31.909 | us/op |
incrementalNativeAdd | 0.1 | 100000 | avgt | 5 | 911.925 | 127.922 | us/op |
incrementalNativeAdd | 0.1 | 1000000 | avgt | 5 | 8889.49 | 320.821 | us/op |
incrementalNativeAdd | 0.1 | 10000000 | avgt | 5 | 102819.877 | 14247.868 | us/op |
incrementalNativeAdd | 0.5 | 10000 | avgt | 5 | 116.878 | 28.232 | us/op |
incrementalNativeAdd | 0.5 | 100000 | avgt | 5 | 947.076 | 128.255 | us/op |
incrementalNativeAdd | 0.5 | 1000000 | avgt | 5 | 7190.443 | 202.012 | us/op |
incrementalNativeAdd | 0.5 | 10000000 | avgt | 5 | 98843.303 | 4325.924 | us/op |
incrementalNativeAdd | 0.9 | 10000 | avgt | 5 | 101.694 | 6.579 | us/op |
incrementalNativeAdd | 0.9 | 100000 | avgt | 5 | 816.411 | 65.678 | us/op |
incrementalNativeAdd | 0.9 | 1000000 | avgt | 5 | 9114.624 | 412.152 | us/op |
incrementalNativeAdd | 0.9 | 10000000 | avgt | 5 | 108793.694 | 22562.527 | us/op |
incrementalUseOrderedWriter | 0.1 | 10000 | avgt | 5 | 23.573 | 5.962 | us/op |
incrementalUseOrderedWriter | 0.1 | 100000 | avgt | 5 | 289.588 | 36.814 | us/op |
incrementalUseOrderedWriter | 0.1 | 1000000 | avgt | 5 | 2785.659 | 49.385 | us/op |
incrementalUseOrderedWriter | 0.1 | 10000000 | avgt | 5 | 29489.758 | 2601.39 | us/op |
incrementalUseOrderedWriter | 0.5 | 10000 | avgt | 5 | 23.57 | 1.536 | us/op |
incrementalUseOrderedWriter | 0.5 | 100000 | avgt | 5 | 276.488 | 9.662 | us/op |
incrementalUseOrderedWriter | 0.5 | 1000000 | avgt | 5 | 2799.408 | 198.77 | us/op |
incrementalUseOrderedWriter | 0.5 | 10000000 | avgt | 5 | 28313.626 | 1976.042 | us/op |
incrementalUseOrderedWriter | 0.9 | 10000 | avgt | 5 | 22.345 | 1.574 | us/op |
incrementalUseOrderedWriter | 0.9 | 100000 | avgt | 5 | 280.205 | 36.987 | us/op |
incrementalUseOrderedWriter | 0.9 | 1000000 | avgt | 5 | 2779.732 | 93.456 | us/op |
incrementalUseOrderedWriter | 0.9 | 10000000 | avgt | 5 | 30568.591 | 2140.826 | us/op |
These benchmarks don't go far enough to support replacing RoaringBitmap.bitmapOf.
Unsorted Input Data
In the cases benchmarked, this approach seems to be worthwhile. I can't actually think of a case where someone would want to build a bitmap from unsorted data, but it occurred to me that this approach might be fast enough to cover the cost of a sort. OrderedWriter is also relaxed enough that it only needs the most significant 16 bits to be monotonic, so a full sort isn't necessary. Implementing a radix sort on the most significant 16 bits (stable in the least significant 16 bits), prior to incremental insertion via an OrderedWriter, leads to huge increases in performance over RoaringBitmap.bitmapOf. The implementation is as follows:
public static RoaringBitmap bitmapOfUnordered(final int... data) {
  partialRadixSort(data);
  RoaringBitmap bitmap = new RoaringBitmap();
  OrderedWriter writer = new OrderedWriter(bitmap);
  for (int i : data) {
    writer.add(i);
  }
  writer.flush();
  return bitmap;
}
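For illustration, the partial sort only needs to group values by their upper 16 bits while preserving the original order within each group, so a single stable counting-sort pass along these lines would do (a sketch; the library's actual partialRadixSort may be implemented differently):
private static void partialRadixSortSketch(int[] data) {
  int[] counts = new int[1 << 16];
  for (int value : data) {
    counts[value >>> 16]++; // histogram of the 16-bit keys
  }
  int offset = 0;
  for (int i = 0; i < counts.length; ++i) {
    int count = counts[i];
    counts[i] = offset; // prefix sum: start offset of each key's bucket
    offset += count;
  }
  int[] sorted = new int[data.length];
  for (int value : data) {
    sorted[counts[value >>> 16]++] = value; // stable scatter into buckets
  }
  System.arraycopy(sorted, 0, data, 0, data.length);
}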
It did very well, according to benchmarks, even against various implementations of sorting prior to RoaringBitmap.bitmapOf. Lower is better:
Benchmark | (randomness) | (size) | Mode | Cnt | Score | Error | Units |
---|---|---|---|---|---|---|---|
bitmapOf | 0.1 | 10000 | avgt | 5 | 1058.106 | 76.013 | us/op |
bitmapOf | 0.1 | 100000 | avgt | 5 | 12323.905 | 976.68 | us/op |
bitmapOf | 0.1 | 1000000 | avgt | 5 | 171812.526 | 9593.879 | us/op |
bitmapOf | 0.1 | 10000000 | avgt | 5 | 3376296.157 | 170362.195 | us/op |
bitmapOf | 0.5 | 10000 | avgt | 5 | 1096.663 | 477.795 | us/op |
bitmapOf | 0.5 | 100000 | avgt | 5 | 12836.177 | 1674.54 | us/op |
bitmapOf | 0.5 | 1000000 | avgt | 5 | 171998.126 | 4176 | us/op |
bitmapOf | 0.5 | 10000000 | avgt | 5 | 3707804.439 | 974532.361 | us/op |
bitmapOf | 0.9 | 10000 | avgt | 5 | 1124.881 | 65.673 | us/op |
bitmapOf | 0.9 | 100000 | avgt | 5 | 14585.589 | 1894.788 | us/op |
bitmapOf | 0.9 | 1000000 | avgt | 5 | 198506.813 | 8552.218 | us/op |
bitmapOf | 0.9 | 10000000 | avgt | 5 | 3723942.934 | 423704.363 | us/op |
bitmapOfUnordered | 0.1 | 10000 | avgt | 5 | 174.583 | 17.475 | us/op |
bitmapOfUnordered | 0.1 | 100000 | avgt | 5 | 1768.613 | 86.543 | us/op |
bitmapOfUnordered | 0.1 | 1000000 | avgt | 5 | 17889.705 | 135.714 | us/op |
bitmapOfUnordered | 0.1 | 10000000 | avgt | 5 | 192645.352 | 6482.726 | us/op |
bitmapOfUnordered | 0.5 | 10000 | avgt | 5 | 157.351 | 3.254 | us/op |
bitmapOfUnordered | 0.5 | 100000 | avgt | 5 | 1674.919 | 90.138 | us/op |
bitmapOfUnordered | 0.5 | 1000000 | avgt | 5 | 16900.458 | 778.999 | us/op |
bitmapOfUnordered | 0.5 | 10000000 | avgt | 5 | 185399.32 | 4383.485 | us/op |
bitmapOfUnordered | 0.9 | 10000 | avgt | 5 | 145.642 | 1.257 | us/op |
bitmapOfUnordered | 0.9 | 100000 | avgt | 5 | 1515.845 | 82.914 | us/op |
bitmapOfUnordered | 0.9 | 1000000 | avgt | 5 | 15807.597 | 811.048 | us/op |
bitmapOfUnordered | 0.9 | 10000000 | avgt | 5 | 167863.49 | 3501.132 | us/op |
partialSortThenBitmapOf | 0.1 | 10000 | avgt | 5 | 1060.152 | 168.802 | us/op |
partialSortThenBitmapOf | 0.1 | 100000 | avgt | 5 | 10942.731 | 347.583 | us/op |
partialSortThenBitmapOf | 0.1 | 1000000 | avgt | 5 | 100606.506 | 24705.341 | us/op |
partialSortThenBitmapOf | 0.1 | 10000000 | avgt | 5 | 1035448.545 | 157383.713 | us/op |
partialSortThenBitmapOf | 0.5 | 10000 | avgt | 5 | 1029.883 | 100.291 | us/op |
partialSortThenBitmapOf | 0.5 | 100000 | avgt | 5 | 10472.509 | 832.719 | us/op |
partialSortThenBitmapOf | 0.5 | 1000000 | avgt | 5 | 101144.032 | 16908.087 | us/op |
partialSortThenBitmapOf | 0.5 | 10000000 | avgt | 5 | 958242.087 | 39650.946 | us/op |
partialSortThenBitmapOf | 0.9 | 10000 | avgt | 5 | 1008.413 | 70.999 | us/op |
partialSortThenBitmapOf | 0.9 | 100000 | avgt | 5 | 10458.34 | 600.416 | us/op |
partialSortThenBitmapOf | 0.9 | 1000000 | avgt | 5 | 103945.644 | 2026.26 | us/op |
partialSortThenBitmapOf | 0.9 | 10000000 | avgt | 5 | 1065638.269 | 102257.059 | us/op |
setupCost | 0.1 | 10000 | avgt | 5 | 6.577 | 0.121 | us/op |
setupCost | 0.1 | 100000 | avgt | 5 | 61.378 | 24.113 | us/op |
setupCost | 0.1 | 1000000 | avgt | 5 | 1021.588 | 536.68 | us/op |
setupCost | 0.1 | 10000000 | avgt | 5 | 13182.341 | 196.773 | us/op |
setupCost | 0.5 | 10000 | avgt | 5 | 7.139 | 2.216 | us/op |
setupCost | 0.5 | 100000 | avgt | 5 | 60.847 | 23.395 | us/op |
setupCost | 0.5 | 1000000 | avgt | 5 | 800.888 | 14.711 | us/op |
setupCost | 0.5 | 10000000 | avgt | 5 | 13431.625 | 553.44 | us/op |
setupCost | 0.9 | 10000 | avgt | 5 | 6.599 | 0.09 | us/op |
setupCost | 0.9 | 100000 | avgt | 5 | 60.946 | 22.511 | us/op |
setupCost | 0.9 | 1000000 | avgt | 5 | 813.445 | 4.896 | us/op |
setupCost | 0.9 | 10000000 | avgt | 5 | 13374.943 | 349.314 | us/op |
sortThenBitmapOf | 0.1 | 10000 | avgt | 5 | 636.23 | 13.423 | us/op |
sortThenBitmapOf | 0.1 | 100000 | avgt | 5 | 7411.756 | 174.264 | us/op |
sortThenBitmapOf | 0.1 | 1000000 | avgt | 5 | 92299.305 | 3651.161 | us/op |
sortThenBitmapOf | 0.1 | 10000000 | avgt | 5 | 1096374.443 | 162575.234 | us/op |
sortThenBitmapOf | 0.5 | 10000 | avgt | 5 | 634.957 | 47.447 | us/op |
sortThenBitmapOf | 0.5 | 100000 | avgt | 5 | 7939.074 | 409.328 | us/op |
sortThenBitmapOf | 0.5 | 1000000 | avgt | 5 | 93505.427 | 5409.749 | us/op |
sortThenBitmapOf | 0.5 | 10000000 | avgt | 5 | 1147933.592 | 57485.51 | us/op |
sortThenBitmapOf | 0.9 | 10000 | avgt | 5 | 661.072 | 6.717 | us/op |
sortThenBitmapOf | 0.9 | 100000 | avgt | 5 | 7915.506 | 356.148 | us/op |
sortThenBitmapOf | 0.9 | 1000000 | avgt | 5 | 93403.343 | 5454.583 | us/op |
sortThenBitmapOf | 0.9 | 10000000 | avgt | 5 | 1095960.734 | 85753.917 | us/op |
It looks like there are good performance gains available here, but these things tend to depend on particular data sets. I would be interested in hearing from anyone who has tried to use this class in a real application.