Can Java 9 Leverage AVX-512 Yet?


Advanced Vector Extensions (AVX) has gone through several iterations, doubling the size of SIMD registers available several times. The state of the art is currently AVX-512 which offers 512-bit registers. In parallel, support for auto-vectorisation of handwritten Java has been improving. With an early access JDK9 on an Intel(R) Core(TM) i7-6700HQ CPU I have observed usage of both 256-bit (vector addition) and 128-bit registers (sum of squares). I was very kindly granted access to a Knights Landing server where I ran the same code. I was hoping to see evidence of usage of 512-bit registers, though I didn’t see it directly I did see a strong indication that this occurs.

SIMD Registers

There is a hierarchy of register types from several iterations of SIMD instruction sets: 128 bits, named xmm, originally from Streaming SIMD Extensions (SSE), 256-bits, named ymm, originally from AVX, and 512-bits, named zmm, introduced with AVX-512. A 512-bit register is enormous; it can contain 16 ints and can be used as a single operand to various vector instructions allowing up to 16 ints to be operated on in parallel. Each of these registers can be used as if it is the predecessor by masking the upper bits.

On my laptop I have seen Java 9 use 128-bit xmm registers in a sum of squares calculation on doubles:

0x000001aada783fca: vmovsd  xmm0,qword ptr [rbp+r13*8+10h] ...
0x000001aada783fd1: vmulsd  xmm0,xmm0,xmm0 ...
0x000001aada783fd5: vmovq   xmm1,rbx ...
0x000001aada783fda: vaddsd  xmm0,xmm1,xmm0 ...

As an aside, the weird thing about this code is that only a 64-bit qword, as opposed to an xmmword, is being loaded into xmm0 here. The code generated for floats is very interesting in this respect because it seems to load a full 256-bit ymmword of the array and supplies it as an operand to vmulps.

  0x00000245b32aa0dc: vmovdqu ymm1,ymmword ptr [r8+r9*4+10h]                                           
  0x00000245b32aa0e3: vmulps  ymm1,ymm1,ymm1
  0x00000245b32aa0e7: vaddss  xmm0,xmm0,xmm1

I have also seen 256-bit ymm registers in use in the simpler float vector addition:

  0x000002759215a17a: vmovdqu ymm0,ymmword ptr [rcx+rax*4+50h]
  0x000002759215a180: vaddps  ymm0,ymm0,ymmword ptr [rbp+rax*4+50h]
  0x000002759215a186: vmovdqu ymmword ptr [rbx+rax*4+50h],ymm0

Where a 256-bit ymmword is being used to load chunks of the array into the register. The interesting thing about running the float vector addition case is that you can trace the optimisations being applied as the code runs; starting off with scalar instructions, graduating to AVX instructions on xmm registers, finally learning to utilise the ymm registers. This is quite an impressive feat of the optimiser.

What’s next? 512-bit zmm registers!

Knights Landing

Knights Landing has 32 512-bit zmm registers, I try to use them by running the code at github, and observing the assembly code emitted.

About the Box

The Knights Landing server has 256 processors, each with a modest clock speed, but support for AVX-512 core and several extensions (see the flags). It is designed for massively parallel workloads, and, for single threaded code, would not compare favourably with a commodity desktop. But if you can write code to keep 256 cores busy concurrently, it will likely profit from running on a Knights Landing.

processor       : 255
vendor_id       : GenuineIntel
cpu family      : 6
model           : 87
model name      : Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz
stepping        : 1
microcode       : 0x130
cpu MHz         : 1299.695
cache size      : 1024 KB
physical id     : 0
siblings        : 256
core id         : 73
cpu cores       : 64
apicid          : 295
initial apicid  : 295
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt
bogomips        : 2600.02
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

I am running on this server simply because it supports AVX-512, and am interested to see if Java can use it. The code is purposefully single threaded, and the point is not to confirm that one of 256 1.3Ghz processors is twice as slow as one of my eight 2.7GHz processors when measured on a single threaded workload, the interesting aspect is the instructions generated.

The operating system is CentOS Linux release 7.2.1511 (Core), the Java version used was Java HotSpot(TM) 64-Bit Server VM (build 9+177, mixed mode).

Observations

My observations were mixed: I didn’t see any 512-bit registers being used, some code expected to vectorise didn’t, but, critically, also saw evidence that the hsdis-amd64.so disassembler did not understand some of the assembly used. I hope, just because I like things to work, that this is hiding evidence of 512-bit register use. As you can see, there are a lot of new instructions in AVX-512 which may explain these black holes in the output.

I ran vector addition and sum of squares across a range of vector sizes and primitive types and saw little use of AVX instructions in my handwritten code, let alone evidence of 512-bit zmm register operands, yet extensive use of 256-bit registers for String intrinsics.

The first data point to look at is the relative timings (remember: absolute timings are meaningless because of Knights Landing design motivations, even more so because assembly was being printed during this run) which are interesting: some kind of optimisation is being applied to sum of squares only for ints.

Benchmark                          (size)   Mode  Cnt   Score    Error   Units
c.o.s.add.Addition.Add_Doubles     100000  thrpt   10   3.022 ±  0.262  ops/ms
c.o.s.add.Addition.Add_Doubles    1000000  thrpt   10   0.284 ±  0.006  ops/ms
c.o.s.add.Addition.Add_Floats      100000  thrpt   10   6.212 ±  0.521  ops/ms
c.o.s.add.Addition.Add_Floats     1000000  thrpt   10   0.572 ±  0.019  ops/ms
c.o.s.add.Addition.Add_Ints        100000  thrpt   10   6.383 ±  0.445  ops/ms
c.o.s.add.Addition.Add_Ints       1000000  thrpt   10   0.573 ±  0.019  ops/ms
c.o.s.add.Addition.Add_Longs       100000  thrpt   10   3.022 ±  0.241  ops/ms
c.o.s.add.Addition.Add_Longs      1000000  thrpt   10   0.281 ±  0.025  ops/ms
c.o.s.ss.SumOfSquares.SS_Doubles   100000  thrpt   10   2.145 ±  0.004  ops/ms
c.o.s.ss.SumOfSquares.SS_Doubles  1000000  thrpt   10   0.206 ±  0.001  ops/ms
c.o.s.ss.SumOfSquares.SS_Floats    100000  thrpt   10   2.150 ±  0.002  ops/ms
c.o.s.ss.SumOfSquares.SS_Floats   1000000  thrpt   10   0.212 ±  0.001  ops/ms
c.o.s.ss.SumOfSquares.SS_Ints      100000  thrpt   10  16.960 ±  0.043  ops/ms
c.o.s.ss.SumOfSquares.SS_Ints     1000000  thrpt   10   1.015 ±  0.019  ops/ms
c.o.s.ss.SumOfSquares.SS_Longs     100000  thrpt   10   6.379 ±  0.014  ops/ms
c.o.s.ss.SumOfSquares.SS_Longs    1000000  thrpt   10   0.429 ±  0.033  ops/ms

I looked at how SS_Ints was being executed and saw the usage of the vpaddd (addition of packed integers) instruction with xmm register operands, in between two instructions the disassembler (hsdis) seemingly could not read. Maybe these unprintable instructions are from AVX-512 or extensions? It’s impossible to say. The same mechanism, using vpaddq, was not used for longs, but is used on my laptop.

  0x00007fc31576edf1: (bad)                     ;...0e

  0x00007fc31576edf2: vpaddd %xmm4,%xmm1,%xmm1  ;...c5f1fecc

  0x00007fc31576edf6: (bad)                     ;...62

The case of vector addition is less clear; there are many uninterpreted instructions in the output, while it clearly vectorises on my laptop using the largest registers available. The only vector instruction present was vzeroupper, which zeroes the upper 128 bits of a 256-bit register, and is usually used for optimising use of SSE on more modern architectures. Incidentally, there is explicit advice from Intel against using this instruction on Knights Landing. I saw assembly for the scalar implementation of this code (this will always happen prior to optimisation), but there’s no evidence the code was vectorised. However, there were 110 (bad) instructions in the output for float array addition alone.

An interesting presentation, by one of the key Hotspot compiler engineers, outlines some of the limits of SIMD support in Java. It neither includes examples of 512-bit register usage nor states that this is unsupported. Possibly Project Panama‘s Long8 type will utilise 512-bit registers. I will recompile the disassembler and give JDK9 the benefit of the doubt for now.

2 thoughts on “Can Java 9 Leverage AVX-512 Yet?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s