I discovered a curious resource for audio-visualising sort algorithms, which is exciting for two reasons. The first is that I finally feel like I understand Alexander Scriabin: he was not a composer. He discovered Tim Sort 80 years before Tim Peters and called it Black Mass. (If you aren’t familiar with the piece, fast-forward to 1:40 to hear the congruence.)

The second reason was that I noticed Radix Sort (LSD). While it was an affront to my senses, it used a mere 800 array accesses and no comparisons! I was unaware of this algorithm so delved deeper and implemented it for integers, and benchmarked my code against Arrays.sort.

It is taken as given by many (myself included, or am I just projecting my thoughts on to others?) that $latex O(n \log n)$ is the best you can do in a sort algorithm. But this is actually only true for sort algorithms which depend on comparison. If you can afford to restrict the data types your sort algorithm supports to types with a positional interpretation (java.util can’t because it needs to be ubiquitous and maintainable), you can get away with a linear time algorithm.

Radix sort, along with the closely related counting sort, does not use comparisons. Instead, the data is interpreted as a fixed length string of symbols. For each position, the cumulative histogram of symbols is computed to calculate sort indices. While the data needs to be scanned several times, the algorithm scales linearly and the overhead of the multiple scans is amortised for large arrays.

As you can see on Wikipedia, there are two kinds of radix sort: Least Significant Digit and Most Significant Digit. This dichotomy relates to the order the (representational) string of symbols is traversed in. I implemented and benchmarked the LSD version for integers.

#### Implementation

The implementation interprets an integer as the concatenation of n bit string symbols of fixed size size 32/n. It performs n passes over the array, starting with the least significant bits, which it modifies in place. For each pass the data is scanned three times, in order to:

1. Compute the cumulative histogram over the symbols in their natural sort order.
2. Copy the value with symbol k to the mth position in a buffer, where m is defined by the cumulative density of k.
3. Copy the buffer back into the original array.

The implementation, which won’t work unless the chunks are proper divisors of 32, is below. The bonus (or caveat) is that it automatically supports unsigned integers. The code could be modified slightly to work with signed integers at a performance cost.

import java.util.Arrays;

this(Byte.SIZE);
}

}

public void sort(int[] data) {
int[] histogram = new int[(1 << radix) + 1];
int shift = 0;
int[] copy = new int[data.length];
while (shift < Integer.SIZE) {
Arrays.fill(histogram, 0);
for (int i = 0; i < data.length; ++i) {
++histogram[((data[i] & mask) >> shift) + 1];
}
for (int i = 0; i < 1 << radix; ++i) {
histogram[i + 1] += histogram[i];
}
for (int i = 0; i < data.length; ++i) {
copy[histogram[(data[i] & mask) >> shift]++] = data[i];
}
for (int i = 0; i < data.length; ++i) {
data[i] = copy[i];
}
}
}
}


The time complexity is obviously linear, a temporary buffer is allocated, but in comparison to Arrays.sort it looks fairly spartan. Instinctively, cache locality looks fairly poor because the second inner loop of the three jumps all over the place. Will this implementation beat Arrays.sort (for integers)?

#### Benchmark

The algorithm is measured using arrays of random positive integers, for which both algorithms are equivalent, from a range of sizes. This isn’t always the best idea (the Tim Sort algorithm comes into its own on nearly sorted data), so take the result below with a pinch of salt. Care must be taken to copy the array in the benchmark since both algorithms are in-place.

public void launchBenchmark(String... jvmArgs) throws Exception {
Options opt = new OptionsBuilder()
.include(this.getClass().getName() + ".*")
.mode(Mode.SampleTime)
.mode(Mode.Throughput)
.timeUnit(TimeUnit.MILLISECONDS)
.measurementTime(TimeValue.seconds(10))
.warmupIterations(10)
.measurementIterations(10)
.forks(1)
.shouldFailOnError(true)
.shouldDoGC(true)
.jvmArgs(jvmArgs)
.resultFormat(ResultFormatType.CSV)
.build();

new Runner(opt).run();
}

@Benchmark
public void Arrays_Sort(Data data, Blackhole bh) {
int[] array = Arrays.copyOf(data.data, data.size);
Arrays.sort(array);
bh.consume(array);
}

@Benchmark
public void Radix_Sort(Data data, Blackhole bh) {
int[] array = Arrays.copyOf(data.data, data.size);
bh.consume(array);
}

public static class Data {

@Param({"100", "1000", "10000", "100000", "1000000"})
int size;

int[] data;

@Setup(Level.Trial)
public void init() {
data = createArray(size);
}
}

public static int[] createArray(int size) {
int[] array = new int[size];
Random random = new Random(0);
for (int i = 0; i < size; ++i) {
array[i] = Math.abs(random.nextInt());
}
return array;
}

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
Arrays_Sort thrpt 1 10 1304.687189 147.380334 ops/ms 100
Arrays_Sort thrpt 1 10 78.518664 9.339994 ops/ms 1000
Arrays_Sort thrpt 1 10 1.700208 0.091836 ops/ms 10000
Arrays_Sort thrpt 1 10 0.133835 0.007146 ops/ms 100000
Arrays_Sort thrpt 1 10 0.010560 0.000409 ops/ms 1000000
Radix_Sort thrpt 1 10 404.807772 24.930898 ops/ms 100
Radix_Sort thrpt 1 10 51.787409 4.881181 ops/ms 1000
Radix_Sort thrpt 1 10 6.065590 0.576709 ops/ms 10000
Radix_Sort thrpt 1 10 0.620338 0.068776 ops/ms 100000
Radix_Sort thrpt 1 10 0.043098 0.004481 ops/ms 1000000
Arrays_Sort sample 1 3088586 0.000902 0.000018 ms/op 100
Arrays_Sort·p0.00 sample 1 1 0.000394 ms/op 100
Arrays_Sort·p0.50 sample 1 1 0.000790 ms/op 100
Arrays_Sort·p0.90 sample 1 1 0.000791 ms/op 100
Arrays_Sort·p0.95 sample 1 1 0.001186 ms/op 100
Arrays_Sort·p0.99 sample 1 1 0.001974 ms/op 100
Arrays_Sort·p0.999 sample 1 1 0.020128 ms/op 100
Arrays_Sort·p0.9999 sample 1 1 0.084096 ms/op 100
Arrays_Sort·p1.00 sample 1 1 4.096000 ms/op 100
Arrays_Sort sample 1 2127794 0.011876 0.000037 ms/op 1000
Arrays_Sort·p0.00 sample 1 1 0.007896 ms/op 1000
Arrays_Sort·p0.50 sample 1 1 0.009872 ms/op 1000
Arrays_Sort·p0.90 sample 1 1 0.015408 ms/op 1000
Arrays_Sort·p0.95 sample 1 1 0.024096 ms/op 1000
Arrays_Sort·p0.99 sample 1 1 0.033920 ms/op 1000
Arrays_Sort·p0.999 sample 1 1 0.061568 ms/op 1000
Arrays_Sort·p0.9999 sample 1 1 0.894976 ms/op 1000
Arrays_Sort·p1.00 sample 1 1 4.448256 ms/op 1000
Arrays_Sort sample 1 168991 0.591169 0.001671 ms/op 10000
Arrays_Sort·p0.00 sample 1 1 0.483840 ms/op 10000
Arrays_Sort·p0.50 sample 1 1 0.563200 ms/op 10000
Arrays_Sort·p0.90 sample 1 1 0.707584 ms/op 10000
Arrays_Sort·p0.95 sample 1 1 0.766976 ms/op 10000
Arrays_Sort·p0.99 sample 1 1 0.942080 ms/op 10000
Arrays_Sort·p0.999 sample 1 1 2.058273 ms/op 10000
Arrays_Sort·p0.9999 sample 1 1 7.526102 ms/op 10000
Arrays_Sort·p1.00 sample 1 1 46.333952 ms/op 10000
Arrays_Sort sample 1 13027 7.670135 0.021512 ms/op 100000
Arrays_Sort·p0.00 sample 1 1 6.356992 ms/op 100000
Arrays_Sort·p0.50 sample 1 1 7.634944 ms/op 100000
Arrays_Sort·p0.90 sample 1 1 8.454144 ms/op 100000
Arrays_Sort·p0.95 sample 1 1 8.742502 ms/op 100000
Arrays_Sort·p0.99 sample 1 1 9.666560 ms/op 100000
Arrays_Sort·p0.999 sample 1 1 12.916883 ms/op 100000
Arrays_Sort·p0.9999 sample 1 1 28.037900 ms/op 100000
Arrays_Sort·p1.00 sample 1 1 28.573696 ms/op 100000
Arrays_Sort sample 1 1042 96.278673 0.603645 ms/op 1000000
Arrays_Sort·p0.00 sample 1 1 86.114304 ms/op 1000000
Arrays_Sort·p0.50 sample 1 1 94.896128 ms/op 1000000
Arrays_Sort·p0.90 sample 1 1 104.293990 ms/op 1000000
Arrays_Sort·p0.95 sample 1 1 106.430464 ms/op 1000000
Arrays_Sort·p0.99 sample 1 1 111.223767 ms/op 1000000
Arrays_Sort·p0.999 sample 1 1 134.172770 ms/op 1000000
Arrays_Sort·p0.9999 sample 1 1 134.742016 ms/op 1000000
Arrays_Sort·p1.00 sample 1 1 134.742016 ms/op 1000000
Radix_Sort sample 1 2240042 0.002941 0.000033 ms/op 100
Radix_Sort·p0.00 sample 1 1 0.001578 ms/op 100
Radix_Sort·p0.50 sample 1 1 0.002368 ms/op 100
Radix_Sort·p0.90 sample 1 1 0.003556 ms/op 100
Radix_Sort·p0.95 sample 1 1 0.004344 ms/op 100
Radix_Sort·p0.99 sample 1 1 0.011056 ms/op 100
Radix_Sort·p0.999 sample 1 1 0.027232 ms/op 100
Radix_Sort·p0.9999 sample 1 1 0.731127 ms/op 100
Radix_Sort·p1.00 sample 1 1 5.660672 ms/op 100
Radix_Sort sample 1 2695825 0.018553 0.000038 ms/op 1000
Radix_Sort·p0.00 sample 1 1 0.013424 ms/op 1000
Radix_Sort·p0.50 sample 1 1 0.016576 ms/op 1000
Radix_Sort·p0.90 sample 1 1 0.025280 ms/op 1000
Radix_Sort·p0.95 sample 1 1 0.031200 ms/op 1000
Radix_Sort·p0.99 sample 1 1 0.050944 ms/op 1000
Radix_Sort·p0.999 sample 1 1 0.082944 ms/op 1000
Radix_Sort·p0.9999 sample 1 1 0.830295 ms/op 1000
Radix_Sort·p1.00 sample 1 1 6.660096 ms/op 1000
Radix_Sort sample 1 685589 0.145695 0.000234 ms/op 10000
Radix_Sort·p0.00 sample 1 1 0.112512 ms/op 10000
Radix_Sort·p0.50 sample 1 1 0.128000 ms/op 10000
Radix_Sort·p0.90 sample 1 1 0.196608 ms/op 10000
Radix_Sort·p0.95 sample 1 1 0.225792 ms/op 10000
Radix_Sort·p0.99 sample 1 1 0.309248 ms/op 10000
Radix_Sort·p0.999 sample 1 1 0.805888 ms/op 10000
Radix_Sort·p0.9999 sample 1 1 1.818141 ms/op 10000
Radix_Sort·p1.00 sample 1 1 14.401536 ms/op 10000
Radix_Sort sample 1 60843 1.641961 0.005783 ms/op 100000
Radix_Sort·p0.00 sample 1 1 1.251328 ms/op 100000
Radix_Sort·p0.50 sample 1 1 1.542144 ms/op 100000
Radix_Sort·p0.90 sample 1 1 2.002944 ms/op 100000
Radix_Sort·p0.95 sample 1 1 2.375680 ms/op 100000
Radix_Sort·p0.99 sample 1 1 3.447030 ms/op 100000
Radix_Sort·p0.999 sample 1 1 5.719294 ms/op 100000
Radix_Sort·p0.9999 sample 1 1 8.724165 ms/op 100000
Radix_Sort·p1.00 sample 1 1 13.074432 ms/op 100000
Radix_Sort sample 1 4846 20.640787 0.260926 ms/op 1000000
Radix_Sort·p0.00 sample 1 1 14.893056 ms/op 1000000
Radix_Sort·p0.50 sample 1 1 18.743296 ms/op 1000000
Radix_Sort·p0.90 sample 1 1 26.673152 ms/op 1000000
Radix_Sort·p0.95 sample 1 1 30.724915 ms/op 1000000
Radix_Sort·p0.99 sample 1 1 40.470446 ms/op 1000000
Radix_Sort·p0.999 sample 1 1 63.016600 ms/op 1000000
Radix_Sort·p0.9999 sample 1 1 136.052736 ms/op 1000000
Radix_Sort·p1.00 sample 1 1 136.052736 ms/op 1000000

The table tells an interesting story. Arrays.sort is vastly superior for small arrays (the arrays most people have), but for large arrays the custom implementation comes into its own. Interestingly, this is consistent with the computer science. If you need to sort large arrays of (unsigned) integers and care about performance, think about implementing radix sort.