# Sorting Unsigned Integers Faster in Java

I discovered a curious resource for audio-visualising sort algorithms, which is exciting for two reasons. The first is that I finally feel like I understand Alexander Scriabin: he was not a composer. He discovered Tim Sort 80 years before Tim Peters and called it Black Mass. (If you aren’t familiar with the piece, fast-forward to 1:40 to hear the congruence.)

The second reason was that I noticed Radix Sort (LSD). While it was an affront to my senses, it used a mere 800 array accesses and no comparisons! I was unaware of this algorithm so delved deeper and implemented it for integers, and benchmarked my code against `Arrays.sort`.

It is taken as given by many (myself included, or am I just projecting my thoughts on to others?) that $O(n \log n)$ is the best you can do in a sort algorithm. But this is actually only true for sort algorithms which depend on comparison. If you can afford to restrict the data types your sort algorithm supports to types with a positional interpretation (`java.util` can’t because it needs to be ubiquitous and maintainable), you can get away with a linear time algorithm.

Radix sort, along with the closely related counting sort, does not use comparisons. Instead, the data is interpreted as a fixed length string of symbols. For each position, the cumulative histogram of symbols is computed to calculate sort indices. While the data needs to be scanned several times, the algorithm scales linearly and the overhead of the multiple scans is amortised for large arrays.

As you can see on Wikipedia, there are two kinds of radix sort: Least Significant Digit and Most Significant Digit. This dichotomy relates to the order the (representational) string of symbols is traversed in. I implemented and benchmarked the LSD version for integers.

### Implementation

The implementation interprets an integer as the concatenation of n bit string symbols of fixed size size 32/n. It performs n passes over the array, starting with the least significant bits, which it modifies in place. For each pass the data is scanned three times, in order to:

1. Compute the cumulative histogram over the symbols in their natural sort order
2. Copy the value with symbol k to the mth position in a buffer, where m is defined by the cumulative density of k.
3. Copy the buffer back into the original array

The implementation, which won’t work unless the chunks are proper divisors of 32, is below. The bonus (or caveat) is that it automatically supports unsigned integers. The code could be modified slightly to work with signed integers at a performance cost.

```import java.util.Arrays;

this(Byte.SIZE);
}

}

public void sort(int[] data) {
int[] histogram = new int[(1 << radix) + 1];
int shift = 0;
int[] copy = new int[data.length];
while (shift < Integer.SIZE) {
Arrays.fill(histogram, 0);
for (int i = 0; i < data.length; ++i) {
++histogram[((data[i] & mask) >> shift) + 1];
}
for (int i = 0; i < 1 << radix; ++i) {
histogram[i + 1] += histogram[i];
}
for (int i = 0; i < data.length; ++i) {
copy[histogram[(data[i] & mask) >> shift]++] = data[i];
}
for (int i = 0; i < data.length; ++i) {
data[i] = copy[i];
}
}
}
}
```

The time complexity is obviously linear, a temporary buffer is allocated, but in comparison to `Arrays.sort` it looks fairly spartan. Instinctively, cache locality looks fairly poor because the second inner loop of the three jumps all over the place. Will this implementation beat `Arrays.sort` (for integers)?

### Benchmark

The algorithm is measured using arrays of random positive integers, for which both algorithms are equivalent, from a range of sizes. This isn’t always the best idea (the Tim Sort algorithm comes into its own on nearly sorted data), so take the result below with a pinch of salt. Care must be taken to copy the array in the benchmark since both algorithms are in-place.

```public void launchBenchmark(String... jvmArgs) throws Exception {
Options opt = new OptionsBuilder()
.include(this.getClass().getName() + ".*")
.mode(Mode.SampleTime)
.mode(Mode.Throughput)
.timeUnit(TimeUnit.MILLISECONDS)
.measurementTime(TimeValue.seconds(10))
.warmupIterations(10)
.measurementIterations(10)
.forks(1)
.shouldFailOnError(true)
.shouldDoGC(true)
.jvmArgs(jvmArgs)
.resultFormat(ResultFormatType.CSV)
.build();

new Runner(opt).run();
}

@Benchmark
public void Arrays_Sort(Data data, Blackhole bh) {
int[] array = Arrays.copyOf(data.data, data.size);
Arrays.sort(array);
bh.consume(array);
}

@Benchmark
public void Radix_Sort(Data data, Blackhole bh) {
int[] array = Arrays.copyOf(data.data, data.size);
bh.consume(array);
}

public static class Data {

@Param({"100", "1000", "10000", "100000", "1000000"})
int size;

int[] data;

@Setup(Level.Trial)
public void init() {
data = createArray(size);
}
}

public static int[] createArray(int size) {
int[] array = new int[size];
Random random = new Random(0);
for (int i = 0; i < size; ++i) {
array[i] = Math.abs(random.nextInt());
}
return array;
}
```
Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
Arrays_Sort thrpt 1 10 1304.687189 147.380334 ops/ms 100
Arrays_Sort thrpt 1 10 78.518664 9.339994 ops/ms 1000
Arrays_Sort thrpt 1 10 1.700208 0.091836 ops/ms 10000
Arrays_Sort thrpt 1 10 0.133835 0.007146 ops/ms 100000
Arrays_Sort thrpt 1 10 0.010560 0.000409 ops/ms 1000000
Radix_Sort thrpt 1 10 404.807772 24.930898 ops/ms 100
Radix_Sort thrpt 1 10 51.787409 4.881181 ops/ms 1000
Radix_Sort thrpt 1 10 6.065590 0.576709 ops/ms 10000
Radix_Sort thrpt 1 10 0.620338 0.068776 ops/ms 100000
Radix_Sort thrpt 1 10 0.043098 0.004481 ops/ms 1000000
Arrays_Sort sample 1 3088586 0.000902 0.000018 ms/op 100
Arrays_Sort·p0.00 sample 1 1 0.000394 ms/op 100
Arrays_Sort·p0.50 sample 1 1 0.000790 ms/op 100
Arrays_Sort·p0.90 sample 1 1 0.000791 ms/op 100
Arrays_Sort·p0.95 sample 1 1 0.001186 ms/op 100
Arrays_Sort·p0.99 sample 1 1 0.001974 ms/op 100
Arrays_Sort·p0.999 sample 1 1 0.020128 ms/op 100
Arrays_Sort·p0.9999 sample 1 1 0.084096 ms/op 100
Arrays_Sort·p1.00 sample 1 1 4.096000 ms/op 100
Arrays_Sort sample 1 2127794 0.011876 0.000037 ms/op 1000
Arrays_Sort·p0.00 sample 1 1 0.007896 ms/op 1000
Arrays_Sort·p0.50 sample 1 1 0.009872 ms/op 1000
Arrays_Sort·p0.90 sample 1 1 0.015408 ms/op 1000
Arrays_Sort·p0.95 sample 1 1 0.024096 ms/op 1000
Arrays_Sort·p0.99 sample 1 1 0.033920 ms/op 1000
Arrays_Sort·p0.999 sample 1 1 0.061568 ms/op 1000
Arrays_Sort·p0.9999 sample 1 1 0.894976 ms/op 1000
Arrays_Sort·p1.00 sample 1 1 4.448256 ms/op 1000
Arrays_Sort sample 1 168991 0.591169 0.001671 ms/op 10000
Arrays_Sort·p0.00 sample 1 1 0.483840 ms/op 10000
Arrays_Sort·p0.50 sample 1 1 0.563200 ms/op 10000
Arrays_Sort·p0.90 sample 1 1 0.707584 ms/op 10000
Arrays_Sort·p0.95 sample 1 1 0.766976 ms/op 10000
Arrays_Sort·p0.99 sample 1 1 0.942080 ms/op 10000
Arrays_Sort·p0.999 sample 1 1 2.058273 ms/op 10000
Arrays_Sort·p0.9999 sample 1 1 7.526102 ms/op 10000
Arrays_Sort·p1.00 sample 1 1 46.333952 ms/op 10000
Arrays_Sort sample 1 13027 7.670135 0.021512 ms/op 100000
Arrays_Sort·p0.00 sample 1 1 6.356992 ms/op 100000
Arrays_Sort·p0.50 sample 1 1 7.634944 ms/op 100000
Arrays_Sort·p0.90 sample 1 1 8.454144 ms/op 100000
Arrays_Sort·p0.95 sample 1 1 8.742502 ms/op 100000
Arrays_Sort·p0.99 sample 1 1 9.666560 ms/op 100000
Arrays_Sort·p0.999 sample 1 1 12.916883 ms/op 100000
Arrays_Sort·p0.9999 sample 1 1 28.037900 ms/op 100000
Arrays_Sort·p1.00 sample 1 1 28.573696 ms/op 100000
Arrays_Sort sample 1 1042 96.278673 0.603645 ms/op 1000000
Arrays_Sort·p0.00 sample 1 1 86.114304 ms/op 1000000
Arrays_Sort·p0.50 sample 1 1 94.896128 ms/op 1000000
Arrays_Sort·p0.90 sample 1 1 104.293990 ms/op 1000000
Arrays_Sort·p0.95 sample 1 1 106.430464 ms/op 1000000
Arrays_Sort·p0.99 sample 1 1 111.223767 ms/op 1000000
Arrays_Sort·p0.999 sample 1 1 134.172770 ms/op 1000000
Arrays_Sort·p0.9999 sample 1 1 134.742016 ms/op 1000000
Arrays_Sort·p1.00 sample 1 1 134.742016 ms/op 1000000
Radix_Sort sample 1 2240042 0.002941 0.000033 ms/op 100
Radix_Sort·p0.00 sample 1 1 0.001578 ms/op 100
Radix_Sort·p0.50 sample 1 1 0.002368 ms/op 100
Radix_Sort·p0.90 sample 1 1 0.003556 ms/op 100
Radix_Sort·p0.95 sample 1 1 0.004344 ms/op 100
Radix_Sort·p0.99 sample 1 1 0.011056 ms/op 100
Radix_Sort·p0.999 sample 1 1 0.027232 ms/op 100
Radix_Sort·p0.9999 sample 1 1 0.731127 ms/op 100
Radix_Sort·p1.00 sample 1 1 5.660672 ms/op 100
Radix_Sort sample 1 2695825 0.018553 0.000038 ms/op 1000
Radix_Sort·p0.00 sample 1 1 0.013424 ms/op 1000
Radix_Sort·p0.50 sample 1 1 0.016576 ms/op 1000
Radix_Sort·p0.90 sample 1 1 0.025280 ms/op 1000
Radix_Sort·p0.95 sample 1 1 0.031200 ms/op 1000
Radix_Sort·p0.99 sample 1 1 0.050944 ms/op 1000
Radix_Sort·p0.999 sample 1 1 0.082944 ms/op 1000
Radix_Sort·p0.9999 sample 1 1 0.830295 ms/op 1000
Radix_Sort·p1.00 sample 1 1 6.660096 ms/op 1000
Radix_Sort sample 1 685589 0.145695 0.000234 ms/op 10000
Radix_Sort·p0.00 sample 1 1 0.112512 ms/op 10000
Radix_Sort·p0.50 sample 1 1 0.128000 ms/op 10000
Radix_Sort·p0.90 sample 1 1 0.196608 ms/op 10000
Radix_Sort·p0.95 sample 1 1 0.225792 ms/op 10000
Radix_Sort·p0.99 sample 1 1 0.309248 ms/op 10000
Radix_Sort·p0.999 sample 1 1 0.805888 ms/op 10000
Radix_Sort·p0.9999 sample 1 1 1.818141 ms/op 10000
Radix_Sort·p1.00 sample 1 1 14.401536 ms/op 10000
Radix_Sort sample 1 60843 1.641961 0.005783 ms/op 100000
Radix_Sort·p0.00 sample 1 1 1.251328 ms/op 100000
Radix_Sort·p0.50 sample 1 1 1.542144 ms/op 100000
Radix_Sort·p0.90 sample 1 1 2.002944 ms/op 100000
Radix_Sort·p0.95 sample 1 1 2.375680 ms/op 100000
Radix_Sort·p0.99 sample 1 1 3.447030 ms/op 100000
Radix_Sort·p0.999 sample 1 1 5.719294 ms/op 100000
Radix_Sort·p0.9999 sample 1 1 8.724165 ms/op 100000
Radix_Sort·p1.00 sample 1 1 13.074432 ms/op 100000
Radix_Sort sample 1 4846 20.640787 0.260926 ms/op 1000000
Radix_Sort·p0.00 sample 1 1 14.893056 ms/op 1000000
Radix_Sort·p0.50 sample 1 1 18.743296 ms/op 1000000
Radix_Sort·p0.90 sample 1 1 26.673152 ms/op 1000000
Radix_Sort·p0.95 sample 1 1 30.724915 ms/op 1000000
Radix_Sort·p0.99 sample 1 1 40.470446 ms/op 1000000
Radix_Sort·p0.999 sample 1 1 63.016600 ms/op 1000000
Radix_Sort·p0.9999 sample 1 1 136.052736 ms/op 1000000
Radix_Sort·p1.00 sample 1 1 136.052736 ms/op 1000000

The table tells an interesting story. `Arrays.sort` is vastly superior for small arrays (the arrays most people have), but for large arrays the custom implementation comes into its own. Interestingly, this is consistent with the computer science. If you need to sort large arrays of (unsigned) integers and care about performance, think about implementing radix sort.

# Comparing Streams with Arrays

Java 8 added some great features which, combined with the vigilance of a responsible adult, make it easier to keep a Java code base civilised. Streams and lambdas, especially the limited support offered for primitive types, are a fantastic addition. Is there an observable performance cost to using these features? For a simple calculation amenable to instruction level parallelism, I compare modern and traditional implementations and observe the differences in instructions generated.

### Sum of Squares

The sum of squares is the building block of a linear regression analysis so is ubiquitous in statistical computing. It is associative and therefore data-parallel. I compare four implementations: a sequential stream wrapping an array, a parallel stream wrapping an array, a generative sequential stream and a traditional `for` loop. The benchmark code is on github.

```    @Benchmark
public void SS_SequentialStream(Data state, Blackhole bh) {
bh.consume(DoubleStream.of(state.data)
.map(x -> x * x)
.reduce((x, y) -> x + y));
}

@Benchmark
public void SS_ParallelStream(Data state, Blackhole bh) {
bh.consume(DoubleStream.of(state.data)
.parallel()
.map(x -> x * x)
.reduce((x, y) -> x + y));
}

@Benchmark
public void SS_ForLoop(Data state, Blackhole bh) {
double result = 0D;
final double[] data = state.data;
for (int i = 0; i < data.length; ++i) {
result += data[i] * data[i];
}
bh.consume(result);
}

@Benchmark
public void SS_GenerativeSequentialStream(Data state, Blackhole bh) {
double[] data = state.data;
bh.consume(IntStream.iterate(0, i -> i + 1)
.limit(1_000_000)
.mapToDouble(i -> data[i])
.map(x -> x * x)
.reduce((x, y) -> x + y));
}
```

I must admit I prefer the readability of the stream versions, but let’s see if there is a comedown after the syntactic sugar rush.

### Running a Benchmark

I compare the three implementations on an array of one million doubles. I am using `JDK 9-ea, VM 9-ea+166` on a fairly powerful laptop with 8 processors:

```\$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping        : 3
cpu MHz         : 2592.000
cache size      : 256 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes xsave osxsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping        : 3
cpu MHz         : 2592.000
cache size      : 256 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes xsave osxsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping        : 3
cpu MHz         : 2592.000
cache size      : 256 KB
physical id     : 0
siblings        : 8
core id         : 1
cpu cores       : 4
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes xsave osxsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping        : 3
cpu MHz         : 2592.000
cache size      : 256 KB
physical id     : 0
siblings        : 8
core id         : 1
cpu cores       : 4
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes xsave osxsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 4
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping        : 3
cpu MHz         : 2592.000
cache size      : 256 KB
physical id     : 0
siblings        : 8
core id         : 2
cpu cores       : 4
apicid          : 4
initial apicid  : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes xsave osxsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 5
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping        : 3
cpu MHz         : 2592.000
cache size      : 256 KB
physical id     : 0
siblings        : 8
core id         : 2
cpu cores       : 4
apicid          : 5
initial apicid  : 5
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes xsave osxsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 6
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping        : 3
cpu MHz         : 2592.000
cache size      : 256 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 6
initial apicid  : 6
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes xsave osxsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping        : 3
cpu MHz         : 2592.000
cache size      : 256 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes xsave osxsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
```

Before running the benchmark we might expect the `for` loop and stream to have similar performance, and the parallel version to be about eight times faster. The generative version is very similar to the `for` loop so a slow down might not be expected. To isolate vectorised optimisations, I first run with SuperWord disabled.

Benchmark Mode Count Score Units
SS_ForLoop thrpt 10 0.540 ± 0.072 ops/ms
SS_ParallelStream thrpt 10 2.336 ± 0.412 ops/ms
SS_SequentialStream thrpt 10 0.565 ± 0.052 ops/ms
SS_GenerativeSequentialStream thrpt 10 0.247 ± 0.036 ops/ms

The `for` loop and stream are similar but the stream is better. The parallel version is faster but is either not utilising all of the cores or incurring a cost; the improvement in throughput is only a factor of 4.3. It has never felt like a good idea to execute code blindly on a default fork join pool without safe guards. There is a huge difference between streams which wrap arrays and a stream which iterates over an array, most likely related to cache locality.

What happens when SuperWord optimisations are enabled? Throughput improves except for the generative stream (suggesting no SIMD execution at all), but the `for` loop improves the most.

Benchmark Mode Count Score Units
SS_ForLoop thrpt 10 0.614 ± 0.035 ops/ms
SS_ParallelStream thrpt 10 2.551 ± 0.095 ops/ms
SS_SequentialStream thrpt 10 0.596 ± 0.043 ops/ms
SS_GenerativeSequentialStream thrpt 10 0.240 ± 0.049 ops/ms

The three nines sample is actually worse than the `for` loop in the parallel stream version, whereas the generative stream is off the chart.

Benchmark Mode Score Units
SS_ForLoop·p0.999 sample 5.424 ms/op
SS_ParallelStream·p0.999 sample 6.077 ms/op
SS_SequentialStream·p0.999 sample 6.340 ms/op
SS_GenerativeSequentialStream·p0.999 sample 14.516 ms/op

Inspecting the assembly code generated, it is clear that the `for` loop body is actually only compiling down to SSE here: it’s only loading one double at a time into registers.

```0x000001aada783fca: vmovsd  xmm0,qword ptr [rbp+r13*8+10h] ...
```

But so is the sequential stream – here is the lambda implementing the square:

```0x00000170c11228be: vmulsd  xmm1,xmm0,xmm0    ;...c5
;...fb
;...59
;...c8
;*dmul {reexecute=0 rethrow=0 return_oop=0}
; - com.openkappa.mythbusters.streams.StreamReduction::lambda\$SS_SequentialStream\$0@2 (line 66)
; - com.openkappa.mythbusters.streams.StreamReduction\$\$Lambda\$40/1123395594::applyAsDouble@1
; - java.util.stream.DoublePipeline\$2\$1::accept@12 (line 213)
; - java.util.Spliterators\$DoubleArraySpliterator::forEachRemaining@53 (line 1198)
; - java.util.Spliterator\$OfDouble::forEachRemaining@12 (line 828)
; - java.util.stream.AbstractPipeline::copyInto@32 (line 484)
; - java.util.stream.AbstractPipeline::wrapAndCopyInto@13 (line 474)
; - java.util.stream.ReduceOps\$ReduceOp::evaluateSequential@6 (line 913)
; - java.util.stream.AbstractPipeline::evaluate@88 (line 234)
```

And here is the reduce:

```  0x00000170c1118610: vaddsd  xmm1,xmm1,xmm2    ;...c5
;...f3
;...58
;...ca
Is it mechanically sympathetic to separate these instructions like this? Probably not, which explains the degradation in performance relative to the `for` loop, but it’s not so drastic that a completely different instruction set is being used. Use of the same SSE instructions can be seen with `Stream.parallelStream`.
The same optimisations are applied to streams when they are countable. Arrays seems to do slightly better, but not well enough to eschew streams. Uncountable or opaquely countable streams perform significantly worse than countable streams or traditional code. `Stream.parallelStream` can improve performance but does not make the best use of the cores available and using a global thread pool leads to unpredictability. Balancing performance and readability leads to an approach where data is stored in paginated arrays and accessed and transformed via the streams API.