Comparing Streams with Arrays

Java 8 added some great features which, combined with the vigilance of a responsible adult, make it easier to keep a Java code base civilised. Streams and lambdas, especially the limited support offered for primitive types, are a fantastic addition. Is there an observable performance cost to using these features? For a simple calculation amenable to instruction level parallelism, I compare modern and traditional implementations and observe the differences in instructions generated.

Sum of Squares

The sum of squares is a building block of linear regression analysis, so it is ubiquitous in statistical computing. Addition is associative, which makes the computation data-parallel: partial sums computed over slices of the array can be combined in any order. I compare four implementations: a sequential stream wrapping an array, a parallel stream wrapping an array, a generative sequential stream, and a traditional for loop.

    @Benchmark
    public void SS_SequentialStream(State state, Blackhole bh) {
        bh.consume(DoubleStream.of(state.data)
                               .map(x -> x * x)
                               .reduce((x, y) -> x + y));
    }

    @Benchmark
    public void SS_ParallelStream(State state, Blackhole bh) {
        bh.consume(DoubleStream.of(state.data)
                               .parallel()
                               .map(x -> x * x)
                               .reduce((x, y) -> x + y));
    }

    @Benchmark
    public void SS_ForLoop(State state, Blackhole bh) {
        double result = 0D;
        final double[] data = state.data;
        for (int i = 0; i < data.length; ++i) {
            result += data[i] * data[i];
        }
        bh.consume(result);
    }

    @Benchmark
    public void SS_GenerativeSequentialStream(State state, Blackhole bh) {
        double[] data = state.data;
        bh.consume(IntStream.iterate(0, i -> i + 1)
                            .limit(1_000_000)
                            .mapToDouble(i -> data[i])
                            .map(x -> x * x)
                            .reduce((x, y) -> x + y));
    }
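
The State class supplying the data is not shown in the original post; a minimal sketch might look like the following, assuming one million uniformly random doubles. Note that the class name shadows JMH's @State annotation, hence the fully qualified form:

    import java.util.concurrent.ThreadLocalRandom;
    import org.openjdk.jmh.annotations.Level;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;

    @org.openjdk.jmh.annotations.State(Scope.Benchmark)
    public static class State {
        double[] data;

        @Setup(Level.Trial)
        public void init() {
            // one million doubles, matching the array size used throughout
            data = new double[1_000_000];
            for (int i = 0; i < data.length; ++i) {
                data[i] = ThreadLocalRandom.current().nextDouble();
            }
        }
    }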

I must admit I prefer the readability of the stream versions, but let’s see if there is a comedown after the syntactic sugar rush.

Running a Benchmark

I compare the four implementations on an array of one million doubles. I am using JDK 9-ea, VM 9-ea+166 on a fairly powerful laptop with eight logical processors (four hyperthreaded cores):

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping        : 3
cpu MHz         : 2592.000
cache size      : 256 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes xsave osxsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

(Processors 1-7 are identical apart from the processor, core id and apicid fields.)

Before running the benchmarks we might expect the for loop and the sequential stream to have similar performance, and the parallel version to be faster, perhaps by as much as the factor of eight suggested by the logical processor count (though there are only four physical cores). The generative version is logically very similar to the for loop, so we might not expect a slowdown there either. To isolate vectorised optimisations, I first run with SuperWord disabled (-XX:-UseSuperWord).
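
For reference, here is a minimal sketch of a JMH runner with SuperWord disabled; the runner class name is my own invention, and the include pattern assumes the benchmark method names above:

    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public class RunSumOfSquares {
        public static void main(String[] args) throws RunnerException {
            Options opt = new OptionsBuilder()
                    .include("SS_.*")                   // the four benchmarks above
                    .jvmArgsAppend("-XX:-UseSuperWord") // disable auto-vectorisation
                    .build();
            new Runner(opt).run();
        }
    }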

Benchmark                       Mode  Cnt  Score    Error  Units
SS_ForLoop                     thrpt   10  0.540 ±  0.072  ops/ms
SS_ParallelStream              thrpt   10  2.336 ±  0.412  ops/ms
SS_SequentialStream            thrpt   10  0.565 ±  0.052  ops/ms
SS_GenerativeSequentialStream  thrpt   10  0.247 ±  0.036  ops/ms

The for loop and the sequential stream are similar; the stream comes out slightly ahead, though the error bars overlap. The parallel version is faster, but it is either not utilising all of the cores or incurring a coordination cost: the improvement in throughput is only a factor of 4.3, not the hoped-for eight. It has never felt like a good idea to execute code blindly on the default fork join pool without safeguards. There is a huge difference between streams which wrap arrays and a stream which iterates over an array by index, most likely because the array-backed stream knows its size and splits cheaply, whereas the generated stream does not.
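
If the shared pool is a concern, one common workaround, not used in these benchmarks and reliant on an undocumented detail of the parallel stream implementation, is to submit the pipeline to a pool you own so its parallelism is bounded explicitly:

    import java.util.concurrent.ForkJoinPool;
    import java.util.stream.DoubleStream;

    // a sketch: run the parallel pipeline in a dedicated pool instead of commonPool()
    static double parallelSumOfSquares(double[] data) {
        ForkJoinPool pool = new ForkJoinPool(4);
        try {
            return pool.submit(() -> DoubleStream.of(data)
                                                 .parallel()
                                                 .map(x -> x * x)
                                                 .sum())
                       .join();
        } finally {
            pool.shutdown();
        }
    }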

What happens when SuperWord optimisations are enabled? Throughput improves across the board except for the generative stream (suggesting it gets no SIMD execution at all), and the for loop benefits the most, presumably because of AVX instructions.

Benchmark                       Mode  Cnt  Score    Error  Units
SS_ForLoop                     thrpt   10  0.614 ±  0.035  ops/ms
SS_ParallelStream              thrpt   10  2.551 ±  0.095  ops/ms
SS_SequentialStream            thrpt   10  0.596 ±  0.043  ops/ms
SS_GenerativeSequentialStream  thrpt   10  0.240 ±  0.049  ops/ms

At the 99.9th percentile, the parallel stream is actually slower than the for loop, and the generative stream is off the chart.

Benchmark                             Mode     Score  Units
SS_ForLoop·p0.999                     sample   5.424  ms/op
SS_ParallelStream·p0.999              sample   6.077  ms/op
SS_SequentialStream·p0.999            sample   6.340  ms/op
SS_GenerativeSequentialStream·p0.999  sample  14.516  ms/op

Inspecting the generated assembly, it is clear that the body of the for loop compiles down to AVX instructions, albeit scalar (VEX-encoded) operations rather than packed ones:

0x000001aada783fca: vmovsd  xmm0,qword ptr [rbp+r13*8+10h] ...
0x000001aada783fd1: vmulsd  xmm0,xmm0,xmm0 ...
0x000001aada783fd5: vmovq   xmm1,rbx ...
0x000001aada783fda: vaddsd  xmm0,xmm1,xmm0 ...

But so does the sequential stream; here is the lambda implementing the square:

0x00000170c11228be: vmulsd  xmm1,xmm0,xmm0    ;...c5
                                                ;...fb
                                                ;...59
                                                ;...c8
                                                ;*dmul {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.mythbusters.streams.StreamReduction::lambda$SS_SequentialStream$0@2 (line 66)
                                                ; - com.openkappa.mythbusters.streams.StreamReduction$$Lambda$40/1123395594::applyAsDouble@1
                                                ; - java.util.stream.DoublePipeline$2$1::accept@12 (line 213)
                                                ; - java.util.Spliterators$DoubleArraySpliterator::forEachRemaining@53 (line 1198)
                                                ; - java.util.Spliterator$OfDouble::forEachRemaining@12 (line 828)
                                                ; - java.util.stream.AbstractPipeline::copyInto@32 (line 484)
                                                ; - java.util.stream.AbstractPipeline::wrapAndCopyInto@13 (line 474)
                                                ; - java.util.stream.ReduceOps$ReduceOp::evaluateSequential@6 (line 913)
                                                ; - java.util.stream.AbstractPipeline::evaluate@88 (line 234)

And here is the reduce:

  0x00000170c1118610: vaddsd  xmm1,xmm1,xmm2    ;...c5
                                                ;...f3
                                                ;...58
                                                ;...ca
                                                ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.openkappa.mythbusters.streams.StreamReduction::lambda$SS_SequentialStream$1@2 (line 66)
                                                ; - com.openkappa.mythbusters.streams.StreamReduction$$Lambda$41/1355169748::applyAsDouble@2
                                                ; - java.util.stream.ReduceOps$12ReducingSink::accept@30 (line 693)
                                                ; - java.util.stream.DoublePipeline$2$1::accept@17 (line 213)
                                                ; - java.util.Spliterators$DoubleArraySpliterator::forEachRemaining@53 (line 1198)

Is it mechanically sympathetic to separate the multiply and the add like this? Probably not, which explains the degradation in performance relative to the for loop, but it is not so drastic that a completely different instruction set is being used. The same AVX instructions show up in the parallel stream version too.

Conclusion

The same optimisations are applied to streams when they are countable, that is, when their size is known up front. The raw array seems to do slightly better, but not by enough to justify eschewing streams. Uncountable, or opaquely countable, streams perform significantly worse than countable streams or traditional code. Parallel streams can improve performance, but they do not make the best use of the cores available, and running on the global fork join pool leads to unpredictability. Balancing performance and readability leads to an approach where data is stored in paginated arrays and accessed and transformed via the streams API.
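
To illustrate that last point, here is a sketch of what such access might look like, assuming the data set is held as a list of fixed-size pages (the layout and method are hypothetical, not from the benchmarks above):

    import java.util.List;
    import java.util.stream.DoubleStream;

    // each page is an array, so each page contributes a sized, array-backed stream
    static double sumOfSquares(List<double[]> pages) {
        return pages.stream()
                    .flatMapToDouble(DoubleStream::of)
                    .map(x -> x * x)
                    .sum();
    }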
