Advanced Vector Extensions (AVX) has gone through several iterations, doubling the size of the available SIMD registers along the way. The state of the art is currently AVX-512, which offers 512-bit registers. In parallel, support for auto-vectorisation of handwritten Java has been improving. With an early access JDK9 on an Intel(R) Core(TM) i7-6700HQ CPU I have observed usage of both 256-bit (vector addition) and 128-bit registers (sum of squares). I was very kindly granted access to a Knights Landing server where I ran the same code. I was hoping to see evidence of usage of 512-bit registers. I didn’t see it directly, but I did see a strong indication that it occurs.
There is a hierarchy of register types from several iterations of SIMD instruction sets: 128 bits, named xmm, originally from Streaming SIMD Extensions (SSE); 256 bits, named ymm, originally from AVX; and 512 bits, named zmm, introduced with AVX-512. A 512-bit register is enormous; it can contain 16 ints and can be used as a single operand to various vector instructions, allowing up to 16 ints to be operated on in parallel. Each of these registers can be used as if it were its predecessor by ignoring the upper bits: xmm0 is the lower half of ymm0, which in turn is the lower half of zmm0.
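The lane counts follow directly from dividing the register width by the element width. A minimal sketch of that arithmetic (the class and method names are made up for illustration):

```java
final class RegisterLanes {

    // Number of elements of the given bit width that fit in one register.
    static int lanes(int registerBits, int elementBits) {
        return registerBits / elementBits;
    }

    public static void main(String[] args) {
        int[] widths = {128, 256, 512}; // xmm, ymm, zmm
        for (int bits : widths) {
            System.out.printf("%d-bit register: %d ints, %d longs, %d floats, %d doubles%n",
                    bits,
                    lanes(bits, Integer.SIZE),
                    lanes(bits, Long.SIZE),
                    lanes(bits, Float.SIZE),
                    lanes(bits, Double.SIZE));
        }
    }
}
```

So a zmm register holds 16 ints or floats, but only 8 longs or doubles.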
On my laptop I have seen Java 9 use 128-bit xmm registers in a sum of squares calculation on doubles:
0x000001aada783fca: vmovsd xmm0,qword ptr [rbp+r13*8+10h] ...
0x000001aada783fd1: vmulsd xmm0,xmm0,xmm0 ...
0x000001aada783fd5: vmovq xmm1,rbx ...
0x000001aada783fda: vaddsd xmm0,xmm1,xmm0 ...
As an aside, the weird thing about this code is that only a 64-bit qword, as opposed to an xmmword, is being loaded into xmm0 here: the sd suffix on these instructions means scalar double, so each one handles a single double at a time. The code generated for floats is very interesting in this respect because it seems to load a full 256-bit ymmword of the array and supply it as an operand to vmulps.
0x00000245b32aa0dc: vmovdqu ymm1,ymmword ptr [r8+r9*4+10h]
0x00000245b32aa0e3: vmulps ymm1,ymm1,ymm1
0x00000245b32aa0e7: vaddss xmm0,xmm0,xmm1
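For orientation, the kind of loop that produces listings like these is sketched below. This is an assumption about the shape of the sum of squares kernel, not the benchmark code from the repository, and the names are hypothetical. One plausible reason the accumulation stays scalar (vaddsd, vaddss) even when the multiplication is packed is that floating point addition is not associative, so reordering the reduction could change the result.

```java
final class SumOfSquaresSketch {

    // Hypothetical kernel: multiply each element by itself and accumulate.
    static double sumOfSquares(double[] data) {
        double sum = 0d;
        for (int i = 0; i < data.length; ++i) {
            sum += data[i] * data[i];
        }
        return sum;
    }

    // The same reduction over floats.
    static float sumOfSquares(float[] data) {
        float sum = 0f;
        for (int i = 0; i < data.length; ++i) {
            sum += data[i] * data[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(new double[] {1d, 2d, 3d}));
        System.out.println(sumOfSquares(new float[] {1f, 2f, 3f}));
    }
}
```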
I have also seen 256-bit ymm registers in use in the simpler float vector addition:
0x000002759215a17a: vmovdqu ymm0,ymmword ptr [rcx+rax*4+50h]
0x000002759215a180: vaddps ymm0,ymm0,ymmword ptr [rbp+rax*4+50h]
0x000002759215a186: vmovdqu ymmword ptr [rbx+rax*4+50h],ymm0
Here a 256-bit ymmword is being used to load chunks of the array into the register. The interesting thing about running the float vector addition case is that you can trace the optimisations being applied as the code runs: it starts off with scalar instructions, graduates to AVX instructions on xmm registers, and finally learns to utilise the ymm registers. This is quite an impressive feat of the optimiser.
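A sketch of the sort of loop involved is below. It is an assumed shape for the vector addition kernel, with hypothetical names, rather than the benchmark itself. The three different base registers in the listing above are consistent with two input arrays and one output array. To watch the generated code evolve, HotSpot can be started with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly, which requires the hsdis disassembler library to be available to the JVM.

```java
final class VectorAdditionSketch {

    // Hypothetical kernel: element-wise addition of two arrays into a third.
    static void add(float[] left, float[] right, float[] result) {
        for (int i = 0; i < result.length; ++i) {
            result[i] = left[i] + right[i];
        }
    }

    public static void main(String[] args) {
        float[] left = new float[1024];
        float[] right = new float[1024];
        float[] result = new float[1024];
        java.util.Arrays.fill(left, 1f);
        java.util.Arrays.fill(right, 2f);
        // Call the method often enough to give the JIT compiler a chance to
        // compile it; the iteration count here is arbitrary.
        for (int i = 0; i < 20_000; ++i) {
            add(left, right, result);
        }
        System.out.println(result[0]);
    }
}
```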
What’s next? 512-bit zmm registers!
Knights Landing has 32 512-bit zmm registers. I tried to use them by running the code on GitHub and observing the assembly code emitted.
About the Box
The Knights Landing server has 256 logical processors (64 cores with four hardware threads each, going by the cpu cores and siblings fields below), each with a modest clock speed, but with support for the core AVX-512 instructions and several extensions (see the flags). It is designed for massively parallel workloads and, for single threaded code, would not compare favourably with a commodity desktop. But if you can write code to keep 256 hardware threads busy concurrently, it will likely profit from running on a Knights Landing.
processor : 255
vendor_id : GenuineIntel
cpu family : 6
model : 87
model name : Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz
stepping : 1
microcode : 0x130
cpu MHz : 1299.695
cache size : 1024 KB
physical id : 0
siblings : 256
core id : 73
cpu cores : 64
apicid : 295
initial apicid : 295
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt
bogomips : 2600.02
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
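The AVX-512 support is easy to miss in that wall of flags. A small sketch that picks the relevant entries out of a flags line is below (the class and method names are hypothetical; the input in main is an abbreviated copy of the line above). On this machine the matches are avx512f (foundation), avx512pf (prefetch), avx512er (exponential and reciprocal) and avx512cd (conflict detection).

```java
import java.util.ArrayList;
import java.util.List;

final class CpuFlags {

    // Returns the flags starting with "avx512" from a /proc/cpuinfo "flags" line.
    static List<String> avx512Features(String flagsLine) {
        int colon = flagsLine.indexOf(':');
        String flags = colon >= 0 ? flagsLine.substring(colon + 1) : flagsLine;
        List<String> features = new ArrayList<>();
        for (String flag : flags.trim().split("\\s+")) {
            if (flag.startsWith("avx512")) {
                features.add(flag);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // Abbreviated copy of the flags line shown above.
        String line = "flags : fpu sse sse2 avx f16c avx2 smep bmi2 erms "
                + "avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt";
        System.out.println(avx512Features(line));
    }
}
```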
I am running on this server simply because it supports AVX-512, and I am interested to see if Java can use it. The code is purposefully single threaded. The point is not to confirm that one of 256 1.3GHz processors is twice as slow as one of my eight 2.7GHz processors when measured on a single threaded workload; the interesting aspect is the instructions generated.
The operating system is CentOS Linux release 7.2.1511 (Core) and the Java version used was Java HotSpot(TM) 64-Bit Server VM (build 9+177, mixed mode).
My observations were mixed: I didn’t see any 512-bit registers being used, and some code I expected to vectorise didn’t. Critically, though, I also saw evidence that the hsdis-amd64.so disassembler did not understand some of the assembly used. I hope, just because I like things to work, that this is hiding evidence of 512-bit register use. As you can see, there are a lot of new instructions in AVX-512, which may explain these black holes in the output.
I ran vector addition and sum of squares across a range of vector sizes and primitive types and saw little use of AVX instructions in my handwritten code, let alone evidence of 512-bit zmm register operands, yet extensive use of 256-bit registers for String intrinsics.
The first data point to look at is the relative timings, which are interesting: some kind of optimisation is being applied to sum of squares only for ints and longs. (Remember that absolute timings are meaningless because of the Knights Landing design motivations, even more so because assembly was being printed during this run.)
Benchmark (size) Mode Cnt Score Error Units
c.o.s.add.Addition.Add_Doubles 100000 thrpt 10 3.022 ± 0.262 ops/ms
c.o.s.add.Addition.Add_Doubles 1000000 thrpt 10 0.284 ± 0.006 ops/ms
c.o.s.add.Addition.Add_Floats 100000 thrpt 10 6.212 ± 0.521 ops/ms
c.o.s.add.Addition.Add_Floats 1000000 thrpt 10 0.572 ± 0.019 ops/ms
c.o.s.add.Addition.Add_Ints 100000 thrpt 10 6.383 ± 0.445 ops/ms
c.o.s.add.Addition.Add_Ints 1000000 thrpt 10 0.573 ± 0.019 ops/ms
c.o.s.add.Addition.Add_Longs 100000 thrpt 10 3.022 ± 0.241 ops/ms
c.o.s.add.Addition.Add_Longs 1000000 thrpt 10 0.281 ± 0.025 ops/ms
c.o.s.ss.SumOfSquares.SS_Doubles 100000 thrpt 10 2.145 ± 0.004 ops/ms
c.o.s.ss.SumOfSquares.SS_Doubles 1000000 thrpt 10 0.206 ± 0.001 ops/ms
c.o.s.ss.SumOfSquares.SS_Floats 100000 thrpt 10 2.150 ± 0.002 ops/ms
c.o.s.ss.SumOfSquares.SS_Floats 1000000 thrpt 10 0.212 ± 0.001 ops/ms
c.o.s.ss.SumOfSquares.SS_Ints 100000 thrpt 10 16.960 ± 0.043 ops/ms
c.o.s.ss.SumOfSquares.SS_Ints 1000000 thrpt 10 1.015 ± 0.019 ops/ms
c.o.s.ss.SumOfSquares.SS_Longs 100000 thrpt 10 6.379 ± 0.014 ops/ms
c.o.s.ss.SumOfSquares.SS_Longs 1000000 thrpt 10 0.429 ± 0.033 ops/ms
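Since only relative timings mean anything here, it helps to express the sum of squares scores as multiples of the doubles score. The sketch below does this for the size 100000 rows, with the scores copied from the table above (the class and method names are illustrative). Ints come out at roughly 7.9 times the doubles throughput and longs at roughly 3 times, while floats are level with doubles.

```java
final class RelativeThroughput {

    // Throughput of one benchmark expressed as a multiple of a baseline.
    static double ratio(double score, double baseline) {
        return score / baseline;
    }

    public static void main(String[] args) {
        // Scores in ops/ms for size 100000, copied from the JMH output above.
        double doubles = 2.145;
        double floats = 2.150;
        double ints = 16.960;
        double longs = 6.379;
        System.out.printf("SS_Floats / SS_Doubles = %.2f%n", ratio(floats, doubles));
        System.out.printf("SS_Ints   / SS_Doubles = %.2f%n", ratio(ints, doubles));
        System.out.printf("SS_Longs  / SS_Doubles = %.2f%n", ratio(longs, doubles));
    }
}
```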
I looked at how SS_Ints was being executed and saw the usage of the vpaddd (addition of packed integers) instruction with xmm register operands, in between two instructions the disassembler (hsdis) seemingly could not read. Maybe these unprintable instructions are from AVX-512 or its extensions? It’s impossible to say. The same mechanism, using vpaddq, was not used for longs, but is used on my laptop.
0x00007fc31576edf1: (bad) ;...0e
0x00007fc31576edf2: vpaddd %xmm4,%xmm1,%xmm1 ;...c5f1fecc
0x00007fc31576edf6: (bad) ;...62
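One hint is available from the raw bytes printed after each instruction. In 64-bit mode, an instruction whose first byte is 0xC5 or 0xC4 uses a VEX prefix (AVX and AVX2), while a first byte of 0x62 is the start of the EVEX prefix used to encode AVX-512 instructions. The vpaddd above starts with c5, and the second unreadable instruction starts with 62. This is suggestive rather than conclusive, because only one byte is shown and the disassembler may simply have lost its place. A minimal sketch of the classification (names are hypothetical):

```java
final class VectorPrefix {

    // Classifies the first byte of an x86-64 instruction encoding.
    static String classify(int firstByte) {
        switch (firstByte & 0xFF) {
            case 0xC5:
                return "2-byte VEX";
            case 0xC4:
                return "3-byte VEX";
            case 0x62:
                return "EVEX";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        // First bytes taken from the three lines of disassembly above.
        int[] firstBytes = {0x0e, 0xc5, 0x62};
        for (int b : firstBytes) {
            System.out.printf("0x%02x: %s%n", b, classify(b));
        }
    }
}
```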
The case of vector addition is less clear: there are many uninterpreted instructions in the output, whereas it clearly vectorises on my laptop using the largest registers available. The only vector instruction present was vzeroupper, which zeroes the upper bits (bit 128 and above) of every ymm register, and is usually used to avoid a penalty when switching between AVX code and legacy SSE code. Incidentally, there is explicit advice from Intel against using this instruction on Knights Landing. I saw assembly for the scalar implementation of this code (this will always happen prior to optimisation), but there’s no evidence the code was vectorised. However, there were 110 (bad) instructions in the output for float array addition alone.
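Counting the unreadable instructions is just a matter of counting lines containing the (bad) marker in the captured output. A sketch, assuming the disassembly has been saved as text (the class and method names are hypothetical, and the sample input is the SS_Ints excerpt from earlier):

```java
final class BadInstructionCounter {

    // Counts the lines of disassembler output that contain the "(bad)" marker.
    static long countBad(String disassembly) {
        long count = 0;
        for (String line : disassembly.split("\\R")) {
            if (line.contains("(bad)")) {
                ++count;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String excerpt = "0x00007fc31576edf1: (bad) ;...0e\n"
                + "0x00007fc31576edf2: vpaddd %xmm4,%xmm1,%xmm1 ;...c5f1fecc\n"
                + "0x00007fc31576edf6: (bad) ;...62\n";
        System.out.println(countBad(excerpt));
    }
}
```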
An interesting presentation, by one of the key Hotspot compiler engineers, outlines some of the limits of SIMD support in Java. It neither includes examples of 512-bit register usage nor states that this is unsupported. Possibly Project Panama’s Long8 type will utilise 512-bit registers. I will recompile the disassembler and give JDK9 the benefit of the doubt for now.