Still True in Java 9: Handwritten Hash Codes are Faster

One of the things to keep in mind with Java is that the best performance advice for one version may not apply to the next. The JVM has an optimiser which works by detecting intent in byte-code; it does a better job when the programmer is skilled enough to make that intent clear. There have been times when the optimiser has done a bad job, either through bugs or feature gaps, and compensatory idioms emerge. The value of these idioms degrades over time as the optimiser improves; a stroke of ingenuity in one version can become ritualistic nonsense in the next. It’s important not to be that guy who fears adding strings together because it was costly a decade ago, but does it always get better, even with low hanging fruit?

I reproduced the results of an extremely astute optimisation presented in 2015 by Daniel Lemire. I was hoping to see that an improved optimiser in Java 9, having observed several cases of automatic vectorisation, would render this optimisation null.

Hash Codes for Arrays

I encourage you to read the original blog post because it is informative, but for the sake of coherence I will summarise the key point here. Arrays.hashCode implements the following hash code:

    public static int hashCode(int[] a) {
        if (a == null)
            return 0;

        int result = 1;
        for (int element : a)
            result = 31 * result + element;

        return result;

This results in a good hash code, but a scalar internal representation of this code has a problem: a data dependency on the multiplication, which is slower than moving data into registers. A significant and reproducible speed up can be observed when unrolling the loop, which enforces a prefetch of 128 bits of the array into registers before doing the slower multiplications:

    public static int hashCode(int[] a) {
        if (a == null)
            return 0;

        int result = 1;
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            result = 31 * 31 * 31 * 31 * result
                   + 31 * 31 * 31 * a[i]
                   + 31 * 31 * a[i + 1]
                   + 31 * a[i + 2]
                   + a[i + 3];
        for (; i < a.length; i++) {
            result = 31 * result + a[i];
        return result;

The improvement in performance from this simple change can be confirmed with Java 8 by running a simple benchmark.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
BuiltIn thrpt 1 10 9.537736 0.382617 ops/us 100
BuiltIn thrpt 1 10 0.804620 0.103037 ops/us 1000
BuiltIn thrpt 1 10 0.092297 0.003947 ops/us 10000
Unrolled thrpt 1 10 14.974398 0.628522 ops/us 100
Unrolled thrpt 1 10 1.514986 0.035759 ops/us 1000
Unrolled thrpt 1 10 0.137408 0.010200 ops/us 10000

The performance improvement is so obvious, and the change so easy to make, that one wonders why JVM vendors didn’t make the change themselves.

Java 9: Universally Better Automatic Optimisations?

As the comments on the blog post suggest, this is a prime candidate for vectorisation. Auto-vectorisation is a thing in Java 9. Using intrinsics or code clean enough to express intent clearly, you can really expect to see good usage of SIMD in Java 9. I was really expecting the situation to reverse in Java 9; but it doesn’t.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
BuiltIn thrpt 1 10 9.822568 0.381087 ops/us 100
BuiltIn thrpt 1 10 0.949273 0.024021 ops/us 1000
BuiltIn thrpt 1 10 0.093171 0.004502 ops/us 10000
Unrolled thrpt 1 10 13.762617 0.440135 ops/us 100
Unrolled thrpt 1 10 1.501106 0.094897 ops/us 1000
Unrolled thrpt 1 10 0.139963 0.011487 ops/us 10000

This is a still a smart optimisation two years later, but it offends my sensibilities in the same way hints do in SQL – a layer of abstraction has been punctured. I will have to try again in Java 10.

New Methods in Java 9: Math.fma and Arrays.mismatch

There are two noteworthy new methods in Java 9: Arrays.mismatch and Math.fma.


This method takes two primitive arrays, and returns the index of the first differing values. This effectively computes the longest common prefix of the two arrays. This is really quite useful, mostly for text processing but also for Bioinformatics (protein sequencing and so on, much more interesting than the sort of thing I work on). Having worked extensively with Apache HBase (where a vast majority of the API involves manipulating byte arrays) I can think of lots of less interesting use cases for this method.

Looking carefully, you can see that the method calls into the internal ArraysSupport utility class, which will try to perform a vectorised mismatch (an intrinsic candidate). Since this will use AVX instructions, this is very fast; much faster than a handwritten loop.

Let’s measure the boost versus a handwritten loop, testing across a range of common prefices and array lengths of byte[].

    public void Mismatch_Intrinsic(BytePrefixData data, Blackhole bh) {
        bh.consume(Arrays.mismatch(data.data1, data.data2));

    public void Mismatch_Handwritten(BytePrefixData data, Blackhole bh) {
        byte[] data1 = data.data1;
        byte[] data2 = data.data2;
        int length = Math.min(data1.length, data2.length);
        int mismatch = -1;
        for (int i = 0; i < length; ++i) {
            if (data1[i] != data2[i]) {
                mismatch = i;

The results speak for themselves. Irritatingly, there is some duplication in output because I haven’t figured out how to make JMH use a subset of the Cartesian product of its parameters.

Benchmark (prefix) (size) Mode Cnt Score Error Units
Mismatch_Handwritten 10 100 thrpt 10 22.360 ± 0.938 ops/us
Mismatch_Handwritten 10 1000 thrpt 10 2.459 ± 0.256 ops/us
Mismatch_Handwritten 10 10000 thrpt 10 0.255 ± 0.009 ops/us
Mismatch_Handwritten 100 100 thrpt 10 22.763 ± 0.869 ops/us
Mismatch_Handwritten 100 1000 thrpt 10 2.690 ± 0.044 ops/us
Mismatch_Handwritten 100 10000 thrpt 10 0.273 ± 0.008 ops/us
Mismatch_Handwritten 1000 100 thrpt 10 24.970 ± 0.713 ops/us
Mismatch_Handwritten 1000 1000 thrpt 10 2.791 ± 0.066 ops/us
Mismatch_Handwritten 1000 10000 thrpt 10 0.281 ± 0.007 ops/us
Mismatch_Intrinsic 10 100 thrpt 10 89.169 ± 2.759 ops/us
Mismatch_Intrinsic 10 1000 thrpt 10 26.995 ± 0.501 ops/us
Mismatch_Intrinsic 10 10000 thrpt 10 3.553 ± 0.065 ops/us
Mismatch_Intrinsic 100 100 thrpt 10 83.037 ± 5.590 ops/us
Mismatch_Intrinsic 100 1000 thrpt 10 26.249 ± 0.714 ops/us
Mismatch_Intrinsic 100 10000 thrpt 10 3.523 ± 0.122 ops/us
Mismatch_Intrinsic 1000 100 thrpt 10 87.921 ± 6.566 ops/us
Mismatch_Intrinsic 1000 1000 thrpt 10 25.812 ± 0.442 ops/us
Mismatch_Intrinsic 1000 10000 thrpt 10 4.177 ± 0.059 ops/us

Why is there such a big difference? Look at how the score decreases as a function of array length, even when the common prefix, and therefore the number of comparisons required, is small: clearly the performance of this algorithm depends on the efficiency of memory access. Arrays.mismatch optimises this, reading qwords of the array into SIMD registers. Working one long at a time, it is possible to compute the XOR in a single instruction to determine if it’s even necessary to look at each byte.

java/util/ArraysSupport.vectorizedMismatch(Ljava/lang/Object;JLjava/lang/Object;JII)I  [0x000002bd9215a820, 0x000002bd9215aa78]  600 bytes
Argument 0 is unknown.RIP: 0x2bd9215a820 Code size: 0x00000258
[Entry Point]
[Verified Entry Point]
  # {method} {0x000002bda79cbf68} 'vectorizedMismatch' '(Ljava/lang/Object;JLjava/lang/Object;JII)I' in 'java/util/ArraysSupport'
  # parm0:    rdx:rdx   = 'java/lang/Object'
  # parm1:    r8:r8     = long
  # parm2:    r9:r9     = 'java/lang/Object'
  # parm3:    rdi:rdi   = long
  # parm4:    rsi       = int
  # parm5:    rcx       = int
  #           [sp+0x60]  (sp of caller)
  0x000002bd9215a820: mov     dword ptr [rsp+0ffffffffffff9000h],eax

  0x000002bd9215a827: push    rbp               ;...55

  0x000002bd9215a828: sub     rsp,50h           ;...48
                                                ;*synchronization entry
                                                ; - java.util.ArraysSupport::vectorizedMismatch@-1 (line 120)

  0x000002bd9215a82c: mov     r10,rdi           ;...4c

  0x000002bd9215a82f: vmovq   xmm2,r9           ;...c4

  0x000002bd9215a834: vmovq   xmm1,rdx          ;...c4

  0x000002bd9215a839: mov     r14d,ecx          ;...44

  0x000002bd9215a83c: vmovd   xmm0,esi          ;...c5

  0x000002bd9215a840: mov     r9d,3h            ;...41

  0x000002bd9215a846: sub     r9d,ecx           ;...44
                                                ;*isub {reexecute=0 rethrow=0 return_oop=0}
                                                ; - java.util.ArraysSupport::vectorizedMismatch@5 (line 120)

  0x000002bd9215a849: mov     edx,esi           ;...8b

  0x000002bd9215a84b: mov     ecx,r9d           ;...41

  0x000002bd9215a84e: sar     edx,cl            ;...d3
                                                ;*ishr {reexecute=0 rethrow=0 return_oop=0}
                                                ; - java.util.ArraysSupport::vectorizedMismatch@17 (line 122)

  0x000002bd9215a850: mov     eax,1h            ;...b8

  0x000002bd9215a855: xor     edi,edi           ;...33

  0x000002bd9215a857: test    edx,edx           ;...85

  0x000002bd9215a859: jle     2bd9215a97ah      ;...0f

The code for this benchmark is at github.


In comparison to users of some languages, Java programmers are lackadaisical about floating point errors. It’s a good job that historically Java hasn’t been considered suitable for the implementation of numerical algorithms. But all of a sudden there is a revolution of data science on the JVM, albeit mostly driven by the Scala community, with JVM implementations of structures like recurrent neural networks abounding. It matters less for machine learning than root finding, but how accurate can these implementations be without JVM level support for minimising the propagation floating point errors? With Math.fma this is improving, by allowing two common operations to be performed before rounding.

Math.fma fuses a multiplication and an addition into a single floating point operation to compute expressions like ab + c. This has two key benefits:

  1. There’s only one operation, and only one rounding error
  2. This is explicitly supported in AVX2 by the VFMADD* instructions

Newton’s Method

To investigate any superior suppression of floating point errors, I use a toy implementation of Newton’s method to compute the root of a quadratic equation, which any teenager could calculate analytically (the error is easy to quantify).

I compare these two implementations for 4x^2 - 12x + 9 (there is a repeated root at 1.5) to get an idea for the error (defined by |1.5 - x_n|) after a large number of iterations.

I implemented this using FMA:

public class NewtonsMethodFMA {

    private final double[] coefficients;

    public NewtonsMethodFMA(double[] coefficients) {
        this.coefficients = coefficients;

    public double evaluateF(double x) {
        double f = 0D;
        int power = coefficients.length - 1;
        for (int i = 0; i < coefficients.length; ++i) {
            f = Math.fma(coefficients[i], Math.pow(x, power--), f);
        return f;

    public double evaluateDF(double x) {
        double df = 0D;
        int power = coefficients.length - 2;
        for (int i = 0; i < coefficients.length - 1; ++i) {
            df = Math.fma((power + 1) * coefficients[i],  Math.pow(x, power--), df);
        return df;

    public double solve(double initialEstimate, int maxIterations) {
        double result = initialEstimate;
        for (int i = 0; i < maxIterations; ++i) {
            result -= evaluateF(result)/evaluateDF(result);
        return result;

And an implementation with normal operations:

public class NewtonsMethod {

    private final double[] coefficients;

    public NewtonsMethod(double[] coefficients) {
        this.coefficients = coefficients;

    public double evaluateF(double x) {
        double f = 0D;
        int power = coefficients.length - 1;
        for (int i = 0; i < coefficients.length; ++i) {
            f += coefficients[i] * Math.pow(x, power--);
        return f;

    public double evaluateDF(double x) {
        double df = 0D;
        int power = coefficients.length - 2;
        for (int i = 0; i < coefficients.length - 1; ++i) {
            df += (power + 1) * coefficients[i] * Math.pow(x, power--);
        return df;

    public double solve(double initialEstimate, int maxIterations) {
        double result = initialEstimate;
        for (int i = 0; i < maxIterations; ++i) {
            result -= evaluateF(result)/evaluateDF(result);
        return result;

When I run this code for 1000 iterations, the FMA version results in 1.5000000083575202, whereas the vanilla version results in 1.500000017233207. It’s completely unscientific, but seems plausible and confirms my prejudice so… In fact, it’s not that simple, and over a range of initial values, there is only a very small difference in FMA’s favour. There’s not even a performance improvement – clearly this method wasn’t added so you can start implementing numerical root finding algorithms – the key takeaway is that the results are slightly different because a different rounding strategy has been used.

Benchmark (maxIterations) Mode Cnt Score Error Units
NM_FMA 100 thrpt 10 93.805 ± 5.174 ops/ms
NM_FMA 1000 thrpt 10 9.420 ± 1.169 ops/ms
NM_FMA 10000 thrpt 10 0.962 ± 0.044 ops/ms
NM_HandWritten 100 thrpt 10 93.457 ± 5.048 ops/ms
NM_HandWritten 1000 thrpt 10 9.274 ± 0.483 ops/ms
NM_HandWritten 10000 thrpt 10 0.928 ± 0.041 ops/ms

Interpreting Compression Ratios as Signals

I often think about the book Theories of Everything by John D. Barrow. There is a beautiful passage on algorithmic compressibility:

The goal of science is to make sense of the diversity of nature. [Science] employs observation to gather information about the world and to test predictions about how the world will react to new circumstances, but in between these two procedures lies the heart of the scientific process. This is nothing more than the transformation of lists of observational data into abbreviated form by the recognition of patterns. The recognition of such a pattern allows the information content of the observed sequence of events to be replaced by a shorthand formula which possesses the same, or almost the same, information content…

We can extend this image of science in a manner that sharpens its focus. Suppose we are presented with any string of symbols. They do not have to be numbers but let us assume for the sake of illustration that they are. We say that the string is ‘random’ if there is no other representation of the string which is shorter than itself. But we say it is ‘non-random’ if there does exist an abbreviated representation.

When processing a data set, patterns can be found and exploited to reduce its size to close to that of its information content, in other words the data set can be abbreviated. This idea can be turned on its head to use a data set’s compression ratio as a measure of its information content. Using compression algorithms which seek to exploit correlations between attribute values, hitherto unseen relations, or mistakes, decrease compression ratio. Repeated measurement of this proxy information content in windows over a stream of, say, credit card transactions could be interpreted as a measure of the likelihood of fraud. In a more generic setting, changes in the compression ratio can indicate changes in data quality.

Compression Techniques

There are various approaches to compression used in database systems. These techniques can be classified in terms of what is actually compressed, agnosticism to the domain of the data, and invariance of properties of relations in compressed form.

At one extreme there are techniques, typically used for cold data, which compress entire chunks of data byte-wise, with no relational invariance. A good example of such a cold data compression approach is SNAPPY block level compression in HBase.  This approach has a key benefit; it can compress anything since it is agnostic to the format of the data. In order to exploit a property it must be recognised and acknowledged first, and ubiquity comes at the cost of being unable to perform relational operations in compressed form. There are options in various database system to perform this kind of compression at a row level. This can lead to lower compression ratios since at a row level the frequency histograms are unlikely to converge to the frequency histogram of the entire data set (convergence could not happen unless each row were equivalent up to a byte-wise permutation).

At the other extreme there are techniques like columnar dictionary encoding, which maps the domain of a column to integral indices into some array-like storage of the distinct attribute values. The distinct values are typically represented as a sorted array or as a hash set. Several properties are preserved in dictionarised data sets. Equality is always preserved, meaning that several operations like join and various filters can be performed without decompression (instead, the expressions themselves can be mapped into the dictionary domain). Depending on the storage technique used, ordering and associated relations may also be preserved after compression. The frequency histogram of the attribute values of the column is sometimes avoided because this would not result in a fixed-width encoding, but there are dictionary encodings to map the most frequent values to the smallest integers to decrease the number of bits required in encoded form. While the level of compression feasible is good, it is inferior to an inverse-frequency symbol based approach. Columnar dictionary encoding also cannot exploit correlations between attribute values. By way of example, in a financial data set, there is a high correlation between trade dates, trade types and settlement dates: with columnar dictionary encoding, the compressed form would take up space proportional to the distinct values of the three attributes combined, even though trades in FX spot settle T+2, except USDCAD which settles T+1 (the settlement date needn’t be stored, it could feasibly take up no space and just be computed instead).

Somewhere in between the two extremes there are algorithms which encode tuples of attribute values. One such technique starts by computing the Huffman encoding of the entire domain of each column. Lists of tuples are encoded by first replacing each attribute value of each tuple with the corresponding Huffman code, concatenating the Huffman codes into fixed width integers before a sort and delta encoding. This produces a compression ratio close to the entropy of the data set, and specifically targets correlations between columns. While several compression techniques have their use, it is at this point in the spectrum where I see there being useful information in the compression ratio over a window of data, because unusual things will take up more space. If your compression ratio ever changes, it is because something weird has happened and you should probably investigate it. If you are processing trades and the compression ratio decreases, is it because you recently started trading more USDCAD Spot?

Roaring TreeMap (English Translation)

A presentation of the Roaring TreeMap data structure translated into English from French Nouveaux modèles d’index bitmap compressés à 64 bits authored by Samy Chambi, Daniel Lemire and Robert Godin.

Roaring TreeMap

The RoaringTreeMap model combines a Java TreeMap with the Roaring bitmap structure to index a set of 64-bit integers represented by the positions of the set bits of a bitmap. TreeMap is an implementation of a Red-Black Tree structure, a self-balancing binary search tree, and is defined amongst other collections in the java.util package. The various tree operations (insertion, search, etc.) are implemented using the algorithms proposed in  (Cormen et al., 2001).

To represent a 64-bit integer, the model splits the integer into two parts. The first part constitutes the 32 most significant bits, and the second part represents the 32 least significant bits. A node of a RoaringTreeMap is composed of a key, which is a 32-bit integer, and a RoaringBitmap. RoaringTreeMap arranges groups of 64-bit integers containing the same 32 most significant bits into the same node. The key of the node stores the 32 most significant bits for the entire group, and the Roaring bitmap associated with the node contains the 32 least significant bits of each integer. RoaringTreeMap utilises a form of prefix compression on the 32 most significant bits of each integer in a group, which can save up to 32 \times 2^{32}-1 bits for such a group.

An insertion or search operation for a 64 bit integer in a RoaringTreeMap starts by performing a random access in the tree to find a node such that the node’s key is equal to the 32 most significant bits of the integer in question. Using a self-balancing binary search tree allows such an operation to be performed with logarithmic time complexity with respect to the total number of nodes in the RoaringTreeMap. If such a node is found, a second insertion or search operation will be performed on the RoaringBitmap associated with the node, which will also have a logarithmic time complexity with respect to the number of entries in the first level index of the Roaring bitmap and the number of entries contained within the accessed container, given that the container is sorted.

Another advantage of the RoaringTreeMap comes from the property of self-balancing binary search trees, which guarantees that the nodes of the tree can always be ordered by the values of their keys in a linear time complexity with respect the number of nodes in the tree. Combined with the ascending sort order of the 32-bit integers maintained within the Roaring bitmaps, this allows the RoaringTreeMap to iterate over the set of 64-bit integers in ascending order in time linearly proportional to the size of set. This is very effective for the computation of basic set operations between two RoaringTreeMaps, such as: intersection, union, symmetric difference, which can be performed in time linearly proportional to the number of integers contained in the tree. Without this property, such an operation would execute with quadratic time complexity with respect to the number of elements in the two sets.

Union of two RoaringTreeMaps

A union takes two RoaringTreeMaps as input and produces a new RoaringTreeMap as output. The algorithm starts with the allocation of a new empty dynamic array which will be filled with the nodes which form the union. Next, the nodes of the two trees are traversed in the ascending order of their keys. During an iteration, if the two current nodes have keys with different values, the container of the node with the smallest key is copied and inserted into the dynamic array, then the algorithm advances to the position of this node in the tree. Otherwise, if the two nodes compared during an iteration have keys of the same value, a logical OR is computed between the Roaring bitmaps associated with each node, and the result is returned as a new Roaring bitmap. A new node containing the Roaring bitmap obtained and a key of equal value to the two nodes will be inserted into the dynamic array. These operations continue until each node in either tree is consumed. At the end, the nodes inserted in the dynamic array will be sorted in the order of their keys, after which a recursive algorithm constructs the RoaringTreeMap from the dynamic array. The algorithm requires time of O(n1 + n2) to traverse the two input trees, where n1 and n2 represent the number of nodes in the two trees respectively. The construction of the dynamic array requires time of O(n1+n2) because it could require up to O(n1+n2) insertions, and each insertion requires constant amortised time. The construction of the final RoaringTreeMap is implemented with an efficient recursive algorithm which runs in a time of O(n1 + n2). So the total execution time for computing the union consisting of the two algorithms is O(n1 + n2), plus the time necessary to compute the logical OR of the Roaring Bitmaps (Chambi et al., 2014, 2015).

Intersection of two RoaringTreeMaps

An intersection takes two RoaringTreeMaps as input and produces a new RoaringTreeMap as output. The algorithm starts by allocating an empty dynamic array  into which the nodes that form the intersection will be written. After, the algorithm iterates over the nodes of the two trees in the ascending order of their keys. In the case where the keys of the nodes differ in value, the algorithm advances to the position of the node with the smallest key. Otherwise, in the case where the two keys have the same value, a logical AND operation will be performed between the Roaring bitmaps associated with each node, which results in a new Roaring bitmap. Then a new node containing the key and the resultant Roaring bitmap is inserted into the next slot in the dynamic array. These operations continue until each node in either tree is consumed. After the iteration is complete, the dynamic array will be sorted in the order of the nodes’ keys. Afterwards a recursive algorithm builds the RoaringTreeMap from the node array. The execution time of the second intersection algorithm depends on the time necessary to traverse the two trees, for populating the dynamic array, and for constructing the tree of the RoaringTreeMap. The total time to complete these operations is of the order of O(n1 + n2) in the worst case, where n1 and n2 represent the number of nodes in the first and second trees respectively. In the best case, the algorithm executes in a time of min(n1, n2), when a traversal of one of the trees is sufficient to arrive at the final result. This does not account for the possible time to compute the logical ANDs between the Roaring bitmaps (Chambi et al., 2014, 2015).

Can Java 9 Leverage AVX-512 Yet?

Advanced Vector Extensions (AVX) has gone through several iterations, doubling the size of SIMD registers available several times. The state of the art is currently AVX-512 which offers 512-bit registers. In parallel, support for auto-vectorisation of handwritten Java has been improving. With an early access JDK9 on an Intel(R) Core(TM) i7-6700HQ CPU I have observed usage of both 256-bit (vector addition) and 128-bit registers (sum of squares). I was very kindly granted access to a Knights Landing server where I ran the same code. I was hoping to see evidence of usage of 512-bit registers, though I didn’t see it directly I did see a strong indication that this occurs.

SIMD Registers

There is a hierarchy of register types from several iterations of SIMD instruction sets: 128 bits, named xmm, originally from Streaming SIMD Extensions (SSE), 256-bits, named ymm, originally from AVX, and 512-bits, named zmm, introduced with AVX-512. A 512-bit register is enormous; it can contain 16 ints and can be used as a single operand to various vector instructions allowing up to 16 ints to be operated on in parallel. Each of these registers can be used as if it is the predecessor by masking the upper bits.

On my laptop I have seen Java 9 use 128-bit xmm registers in a sum of squares calculation on doubles:

0x000001aada783fca: vmovsd  xmm0,qword ptr [rbp+r13*8+10h] ...
0x000001aada783fd1: vmulsd  xmm0,xmm0,xmm0 ...
0x000001aada783fd5: vmovq   xmm1,rbx ...
0x000001aada783fda: vaddsd  xmm0,xmm1,xmm0 ...

As an aside, the weird thing about this code is that only a 64-bit qword, as opposed to an xmmword, is being loaded into xmm0 here. The code generated for floats is very interesting in this respect because it seems to load a full 256-bit ymmword of the array and supplies it as an operand to vmulps.

  0x00000245b32aa0dc: vmovdqu ymm1,ymmword ptr [r8+r9*4+10h]                                           
  0x00000245b32aa0e3: vmulps  ymm1,ymm1,ymm1
  0x00000245b32aa0e7: vaddss  xmm0,xmm0,xmm1

I have also seen 256-bit ymm registers in use in the simpler float vector addition:

  0x000002759215a17a: vmovdqu ymm0,ymmword ptr [rcx+rax*4+50h]
  0x000002759215a180: vaddps  ymm0,ymm0,ymmword ptr [rbp+rax*4+50h]
  0x000002759215a186: vmovdqu ymmword ptr [rbx+rax*4+50h],ymm0

Where a 256-bit ymmword is being used to load chunks of the array into the register. The interesting thing about running the float vector addition case is that you can trace the optimisations being applied as the code runs; starting off with scalar instructions, graduating to AVX instructions on xmm registers, finally learning to utilise the ymm registers. This is quite an impressive feat of the optimiser.

What’s next? 512-bit zmm registers!

Knights Landing

Knights Landing has 32 512-bit zmm registers, I try to use them by running the code at github, and observing the assembly code emitted.

About the Box

The Knights Landing server has 256 processors, each with a modest clock speed, but support for AVX-512 core and several extensions (see the flags). It is designed for massively parallel workloads, and, for single threaded code, would not compare favourably with a commodity desktop. But if you can write code to keep 256 cores busy concurrently, it will likely profit from running on a Knights Landing.

processor       : 255
vendor_id       : GenuineIntel
cpu family      : 6
model           : 87
model name      : Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz
stepping        : 1
microcode       : 0x130
cpu MHz         : 1299.695
cache size      : 1024 KB
physical id     : 0
siblings        : 256
core id         : 73
cpu cores       : 64
apicid          : 295
initial apicid  : 295
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt
bogomips        : 2600.02
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

I am running on this server simply because it supports AVX-512, and am interested to see if Java can use it. The code is purposefully single threaded, and the point is not to confirm that one of 256 1.3Ghz processors is twice as slow as one of my eight 2.7GHz processors when measured on a single threaded workload, the interesting aspect is the instructions generated.

The operating system is CentOS Linux release 7.2.1511 (Core), the Java version used was Java HotSpot(TM) 64-Bit Server VM (build 9+177, mixed mode).


My observations were mixed: I didn’t see any 512-bit registers being used, some code expected to vectorise didn’t, but, critically, also saw evidence that the disassembler did not understand some of the assembly used. I hope, just because I like things to work, that this is hiding evidence of 512-bit register use. As you can see, there are a lot of new instructions in AVX-512 which may explain these black holes in the output.

I ran vector addition and sum of squares across a range of vector sizes and primitive types and saw little use of AVX instructions in my handwritten code, let alone evidence of 512-bit zmm register operands, yet extensive use of 256-bit registers for String intrinsics.

The first data point to look at is the relative timings (remember: absolute timings are meaningless because of Knights Landing design motivations, even more so because assembly was being printed during this run) which are interesting: some kind of optimisation is being applied to sum of squares only for ints.

Benchmark                          (size)   Mode  Cnt   Score    Error   Units
c.o.s.add.Addition.Add_Doubles     100000  thrpt   10   3.022 ±  0.262  ops/ms
c.o.s.add.Addition.Add_Doubles    1000000  thrpt   10   0.284 ±  0.006  ops/ms
c.o.s.add.Addition.Add_Floats      100000  thrpt   10   6.212 ±  0.521  ops/ms
c.o.s.add.Addition.Add_Floats     1000000  thrpt   10   0.572 ±  0.019  ops/ms
c.o.s.add.Addition.Add_Ints        100000  thrpt   10   6.383 ±  0.445  ops/ms
c.o.s.add.Addition.Add_Ints       1000000  thrpt   10   0.573 ±  0.019  ops/ms
c.o.s.add.Addition.Add_Longs       100000  thrpt   10   3.022 ±  0.241  ops/ms
c.o.s.add.Addition.Add_Longs      1000000  thrpt   10   0.281 ±  0.025  ops/ms   100000  thrpt   10   2.145 ±  0.004  ops/ms  1000000  thrpt   10   0.206 ±  0.001  ops/ms    100000  thrpt   10   2.150 ±  0.002  ops/ms   1000000  thrpt   10   0.212 ±  0.001  ops/ms      100000  thrpt   10  16.960 ±  0.043  ops/ms     1000000  thrpt   10   1.015 ±  0.019  ops/ms     100000  thrpt   10   6.379 ±  0.014  ops/ms    1000000  thrpt   10   0.429 ±  0.033  ops/ms

I looked at how SS_Ints was being executed and saw the usage of the vpaddd (addition of packed integers) instruction with xmm register operands, in between two instructions the disassembler (hsdis) seemingly could not read. Maybe these unprintable instructions are from AVX-512 or extensions? It’s impossible to say. The same mechanism, using vpaddq, was not used for longs, but is used on my laptop.

  0x00007fc31576edf1: (bad)                     ;...0e

  0x00007fc31576edf2: vpaddd %xmm4,%xmm1,%xmm1  ;...c5f1fecc

  0x00007fc31576edf6: (bad)                     ;...62

The case of vector addition is less clear; there are many uninterpreted instructions in the output, while it clearly vectorises on my laptop using the largest registers available. The only vector instruction present was vzeroupper, which zeroes the upper 128 bits of a 256-bit register, and is usually used for optimising use of SSE on more modern architectures. Incidentally, there is explicit advice from Intel against using this instruction on Knights Landing. I saw assembly for the scalar implementation of this code (this will always happen prior to optimisation), but there’s no evidence the code was vectorised. However, there were 110 (bad) instructions in the output for float array addition alone.

An interesting presentation, by one of the key Hotspot compiler engineers, outlines some of the limits of SIMD support in Java. It neither includes examples of 512-bit register usage nor states that this is unsupported. Possibly Project Panama‘s Long8 type will utilise 512-bit registers. I will recompile the disassembler and give JDK9 the benefit of the doubt for now.

Choosing the Right Radix: Measurement or Mathematics?

I recently wrote a post about radix sorting, and found that for large arrays of unsigned integers a handwritten implementation beats Arrays.sort. However, I paid no attention to the choice of radix and used a default of eight bits. It turns out this was a lucky choice: modifying my benchmark to parametrise the radix, I observed a maximum at one byte, regardless of the size of array.

Is this an algorithmic or technical phenomenon? Is this something that could have been predicted on the back of an envelope without running a benchmark?

Extended Benchmark Results

Size Radix Score Score Error (99.9%) Unit
100 2 135.559923 7.72397 ops/ms
100 4 262.854579 37.372678 ops/ms
100 8 345.038234 30.954927 ops/ms
100 16 7.717496 1.144967 ops/ms
1000 2 13.892142 1.522749 ops/ms
1000 4 27.712719 4.057162 ops/ms
1000 8 52.253497 4.761172 ops/ms
1000 16 7.656033 0.499627 ops/ms
10000 2 1.627354 0.186948 ops/ms
10000 4 3.620869 0.029128 ops/ms
10000 8 6.435789 0.610848 ops/ms
10000 16 3.703248 0.45177 ops/ms
100000 2 0.144575 0.014348 ops/ms
100000 4 0.281837 0.025707 ops/ms
100000 8 0.543274 0.031553 ops/ms
100000 16 0.533998 0.126949 ops/ms
1000000 2 0.011293 0.001429 ops/ms
1000000 4 0.021128 0.003137 ops/ms
1000000 8 0.037376 0.005783 ops/ms
1000000 16 0.031053 0.007987 ops/ms


To model the execution time of the algorithm, we can write t = f(r, n), where n \in \mathbb{N} is the length of the input array, and r \in [1, 32) is the size in bits of the radix. We can inspect if the model predicts non-monotonic execution time with a minimum (corresponding to maximal throughput), or if t increases indefinitely as a function of r. If we find a plausible model predicting a minimum, temporarily treating r as continuous, we can solve \frac{\partial f}{\partial r}|_{n=N, r \in [1,32)} = 0 to find the theoretically optimal radix. This pre-supposes we derive a non-monotonic model.

Constructing a Model

We need to write down an equation before we can do any calculus, which requires two dangerous assumptions.

  1. Each operation has the same cost, making the execution time proportional to the number of operation.
  2. The costs of operations do not vary as a function of n or r.

This means all we need to do is find a formula for the number of operations, and then vary n and r. The usual pitfall in this approach relates to the first assumption, in that memory accesses are modelled as uniform cost; memory access can vary widely in cost ranging from registers to RAM on another socket. We are about to fall foul of both assumptions constructing an intuitive model of the algorithm’s loop.

        while (shift < Integer.SIZE) {
            Arrays.fill(histogram, 0);
            for (int i = 0; i < data.length; ++i) {
                ++histogram[((data[i] & mask) >> shift) + 1];
            for (int i = 0; i < 1 << radix; ++i) {
                histogram[i + 1] += histogram[i];
            for (int i = 0; i < data.length; ++i) {
                copy[histogram[(data[i] & mask) >> shift]++] = data[i];
            for (int i = 0; i < data.length; ++i) {
                data[i] = copy[i];
            shift += radix;
            mask <<= radix;

The outer loop depends on the choice of radix while the inner loops depend on the size of the array and the choice of radix. There are five obvious aspects to capture:

  • The first inner loop takes time proportional to n
  • The third and fourth inner loops take time proportional to n
  • We can factor the per-element costs of loops 1, 3 and 4 into a constant a
  • The second inner loop takes time proportional to 2^r, modeled with by the term b2^r
  • The body of the loop executes 32/r times

This can be summarised as the formula:

f(r, n) = 32\frac{(3an + b2^r)}{r}

It was claimed the algorithm had linear complexity in n and it only has a linear term in n. Good. However, the exponential r term in the numerator dominates the linear term in the denominator, making the function monotonic in r. The model fails to predict the observed throughput maximising radix. There are clearly much more complicated mechanisms at play than can be captured counting operations.

Sorting Unsigned Integers Faster in Java

I discovered a curious resource for audio-visualising sort algorithms, which is exciting for two reasons. The first is that I finally feel like I understand Alexander Scriabin: he was not a composer. He discovered Tim Sort 80 years before Tim Peters and called it Black Mass. (If you aren’t familiar with the piece, fast-forward to 1:40 to hear the congruence.)

The second reason was that I noticed Radix Sort (LSD). While it was an affront to my senses, it used a mere 800 array accesses and no comparisons! I was unaware of this algorithm so delved deeper and implemented it for integers, and benchmarked my code against Arrays.sort.

Radix Sort

It is taken as given by many (myself included, or am I just projecting my thoughts on to others?) that O(n \log n) is the best you can do in a sort algorithm. But this is actually only true for sort algorithms which depend on comparison. If you can afford to restrict the data types your sort algorithm supports to types with a positional interpretation (java.util can’t because it needs to be ubiquitous and maintainable), you can get away with a linear time algorithm.

Radix sort, along with the closely related counting sort, does not use comparisons. Instead, the data is interpreted as a fixed length string of symbols. For each position, the cumulative histogram of symbols is computed to calculate sort indices. While the data needs to be scanned several times, the algorithm scales linearly and the overhead of the multiple scans is amortised for large arrays.

As you can see on Wikipedia, there are two kinds of radix sort: Least Significant Digit and Most Significant Digit. This dichotomy relates to the order the (representational) string of symbols is traversed in. I implemented and benchmarked the LSD version for integers.


The implementation interprets an integer as the concatenation of n bit string symbols of fixed size size 32/n. It performs n passes over the array, starting with the least significant bits, which it modifies in place. For each pass the data is scanned three times, in order to:

  1. Compute the cumulative histogram over the symbols in their natural sort order
  2. Copy the value with symbol k to the mth position in a buffer, where m is defined by the cumulative density of k.
  3. Copy the buffer back into the original array

The implementation, which won’t work unless the chunks are proper divisors of 32, is below. The bonus (or caveat) is that it automatically supports unsigned integers. The code could be modified slightly to work with signed integers at a performance cost.

import java.util.Arrays;

public class RadixSort {

    private final int radix;

    public RadixSort() {

    public RadixSort(int radix) {
        assert 32 % radix== 0;
        this.radix= radix;

    public void sort(int[] data) {
        int[] histogram = new int[(1 << radix) + 1];
        int shift = 0;
        int mask = (1 << radix) - 1;
        int[] copy = new int[data.length];
        while (shift < Integer.SIZE) {
            Arrays.fill(histogram, 0);
            for (int i = 0; i < data.length; ++i) {
                ++histogram[((data[i] & mask) >> shift) + 1];
            for (int i = 0; i < 1 << radix; ++i) {
                histogram[i + 1] += histogram[i];
            for (int i = 0; i < data.length; ++i) {
                copy[histogram[(data[i] & mask) >> shift]++] = data[i];
            for (int i = 0; i < data.length; ++i) {
                data[i] = copy[i];
            shift += radix;
            mask <<= radix;

The time complexity is obviously linear, a temporary buffer is allocated, but in comparison to Arrays.sort it looks fairly spartan. Instinctively, cache locality looks fairly poor because the second inner loop of the three jumps all over the place. Will this implementation beat Arrays.sort (for integers)?


The algorithm is measured using arrays of random positive integers, for which both algorithms are equivalent, from a range of sizes. This isn’t always the best idea (the Tim Sort algorithm comes into its own on nearly sorted data), so take the result below with a pinch of salt. Care must be taken to copy the array in the benchmark since both algorithms are in-place.

public void launchBenchmark(String... jvmArgs) throws Exception {
        Options opt = new OptionsBuilder()
                .include(this.getClass().getName() + ".*")

        new Runner(opt).run();

    public void Arrays_Sort(Data data, Blackhole bh) {
        int[] array = Arrays.copyOf(, data.size);

    public void Radix_Sort(Data data, Blackhole bh) {
        int[] array = Arrays.copyOf(, data.size);

    public static class Data {

        @Param({"100", "1000", "10000", "100000", "1000000"})
        int size;

        int[] data;
        RadixSort radixSort = new RadixSort();

        public void init() {
            data = createArray(size);

    public static int[] createArray(int size) {
        int[] array = new int[size];
        Random random = new Random(0);
        for (int i = 0; i < size; ++i) {
            array[i] = Math.abs(random.nextInt());
        return array;
Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
Arrays_Sort thrpt 1 10 1304.687189 147.380334 ops/ms 100
Arrays_Sort thrpt 1 10 78.518664 9.339994 ops/ms 1000
Arrays_Sort thrpt 1 10 1.700208 0.091836 ops/ms 10000
Arrays_Sort thrpt 1 10 0.133835 0.007146 ops/ms 100000
Arrays_Sort thrpt 1 10 0.010560 0.000409 ops/ms 1000000
Radix_Sort thrpt 1 10 404.807772 24.930898 ops/ms 100
Radix_Sort thrpt 1 10 51.787409 4.881181 ops/ms 1000
Radix_Sort thrpt 1 10 6.065590 0.576709 ops/ms 10000
Radix_Sort thrpt 1 10 0.620338 0.068776 ops/ms 100000
Radix_Sort thrpt 1 10 0.043098 0.004481 ops/ms 1000000
Arrays_Sort sample 1 3088586 0.000902 0.000018 ms/op 100
Arrays_Sort·p0.00 sample 1 1 0.000394 ms/op 100
Arrays_Sort·p0.50 sample 1 1 0.000790 ms/op 100
Arrays_Sort·p0.90 sample 1 1 0.000791 ms/op 100
Arrays_Sort·p0.95 sample 1 1 0.001186 ms/op 100
Arrays_Sort·p0.99 sample 1 1 0.001974 ms/op 100
Arrays_Sort·p0.999 sample 1 1 0.020128 ms/op 100
Arrays_Sort·p0.9999 sample 1 1 0.084096 ms/op 100
Arrays_Sort·p1.00 sample 1 1 4.096000 ms/op 100
Arrays_Sort sample 1 2127794 0.011876 0.000037 ms/op 1000
Arrays_Sort·p0.00 sample 1 1 0.007896 ms/op 1000
Arrays_Sort·p0.50 sample 1 1 0.009872 ms/op 1000
Arrays_Sort·p0.90 sample 1 1 0.015408 ms/op 1000
Arrays_Sort·p0.95 sample 1 1 0.024096 ms/op 1000
Arrays_Sort·p0.99 sample 1 1 0.033920 ms/op 1000
Arrays_Sort·p0.999 sample 1 1 0.061568 ms/op 1000
Arrays_Sort·p0.9999 sample 1 1 0.894976 ms/op 1000
Arrays_Sort·p1.00 sample 1 1 4.448256 ms/op 1000
Arrays_Sort sample 1 168991 0.591169 0.001671 ms/op 10000
Arrays_Sort·p0.00 sample 1 1 0.483840 ms/op 10000
Arrays_Sort·p0.50 sample 1 1 0.563200 ms/op 10000
Arrays_Sort·p0.90 sample 1 1 0.707584 ms/op 10000
Arrays_Sort·p0.95 sample 1 1 0.766976 ms/op 10000
Arrays_Sort·p0.99 sample 1 1 0.942080 ms/op 10000
Arrays_Sort·p0.999 sample 1 1 2.058273 ms/op 10000
Arrays_Sort·p0.9999 sample 1 1 7.526102 ms/op 10000
Arrays_Sort·p1.00 sample 1 1 46.333952 ms/op 10000
Arrays_Sort sample 1 13027 7.670135 0.021512 ms/op 100000
Arrays_Sort·p0.00 sample 1 1 6.356992 ms/op 100000
Arrays_Sort·p0.50 sample 1 1 7.634944 ms/op 100000
Arrays_Sort·p0.90 sample 1 1 8.454144 ms/op 100000
Arrays_Sort·p0.95 sample 1 1 8.742502 ms/op 100000
Arrays_Sort·p0.99 sample 1 1 9.666560 ms/op 100000
Arrays_Sort·p0.999 sample 1 1 12.916883 ms/op 100000
Arrays_Sort·p0.9999 sample 1 1 28.037900 ms/op 100000
Arrays_Sort·p1.00 sample 1 1 28.573696 ms/op 100000
Arrays_Sort sample 1 1042 96.278673 0.603645 ms/op 1000000
Arrays_Sort·p0.00 sample 1 1 86.114304 ms/op 1000000
Arrays_Sort·p0.50 sample 1 1 94.896128 ms/op 1000000
Arrays_Sort·p0.90 sample 1 1 104.293990 ms/op 1000000
Arrays_Sort·p0.95 sample 1 1 106.430464 ms/op 1000000
Arrays_Sort·p0.99 sample 1 1 111.223767 ms/op 1000000
Arrays_Sort·p0.999 sample 1 1 134.172770 ms/op 1000000
Arrays_Sort·p0.9999 sample 1 1 134.742016 ms/op 1000000
Arrays_Sort·p1.00 sample 1 1 134.742016 ms/op 1000000
Radix_Sort sample 1 2240042 0.002941 0.000033 ms/op 100
Radix_Sort·p0.00 sample 1 1 0.001578 ms/op 100
Radix_Sort·p0.50 sample 1 1 0.002368 ms/op 100
Radix_Sort·p0.90 sample 1 1 0.003556 ms/op 100
Radix_Sort·p0.95 sample 1 1 0.004344 ms/op 100
Radix_Sort·p0.99 sample 1 1 0.011056 ms/op 100
Radix_Sort·p0.999 sample 1 1 0.027232 ms/op 100
Radix_Sort·p0.9999 sample 1 1 0.731127 ms/op 100
Radix_Sort·p1.00 sample 1 1 5.660672 ms/op 100
Radix_Sort sample 1 2695825 0.018553 0.000038 ms/op 1000
Radix_Sort·p0.00 sample 1 1 0.013424 ms/op 1000
Radix_Sort·p0.50 sample 1 1 0.016576 ms/op 1000
Radix_Sort·p0.90 sample 1 1 0.025280 ms/op 1000
Radix_Sort·p0.95 sample 1 1 0.031200 ms/op 1000
Radix_Sort·p0.99 sample 1 1 0.050944 ms/op 1000
Radix_Sort·p0.999 sample 1 1 0.082944 ms/op 1000
Radix_Sort·p0.9999 sample 1 1 0.830295 ms/op 1000
Radix_Sort·p1.00 sample 1 1 6.660096 ms/op 1000
Radix_Sort sample 1 685589 0.145695 0.000234 ms/op 10000
Radix_Sort·p0.00 sample 1 1 0.112512 ms/op 10000
Radix_Sort·p0.50 sample 1 1 0.128000 ms/op 10000
Radix_Sort·p0.90 sample 1 1 0.196608 ms/op 10000
Radix_Sort·p0.95 sample 1 1 0.225792 ms/op 10000
Radix_Sort·p0.99 sample 1 1 0.309248 ms/op 10000
Radix_Sort·p0.999 sample 1 1 0.805888 ms/op 10000
Radix_Sort·p0.9999 sample 1 1 1.818141 ms/op 10000
Radix_Sort·p1.00 sample 1 1 14.401536 ms/op 10000
Radix_Sort sample 1 60843 1.641961 0.005783 ms/op 100000
Radix_Sort·p0.00 sample 1 1 1.251328 ms/op 100000
Radix_Sort·p0.50 sample 1 1 1.542144 ms/op 100000
Radix_Sort·p0.90 sample 1 1 2.002944 ms/op 100000
Radix_Sort·p0.95 sample 1 1 2.375680 ms/op 100000
Radix_Sort·p0.99 sample 1 1 3.447030 ms/op 100000
Radix_Sort·p0.999 sample 1 1 5.719294 ms/op 100000
Radix_Sort·p0.9999 sample 1 1 8.724165 ms/op 100000
Radix_Sort·p1.00 sample 1 1 13.074432 ms/op 100000
Radix_Sort sample 1 4846 20.640787 0.260926 ms/op 1000000
Radix_Sort·p0.00 sample 1 1 14.893056 ms/op 1000000
Radix_Sort·p0.50 sample 1 1 18.743296 ms/op 1000000
Radix_Sort·p0.90 sample 1 1 26.673152 ms/op 1000000
Radix_Sort·p0.95 sample 1 1 30.724915 ms/op 1000000
Radix_Sort·p0.99 sample 1 1 40.470446 ms/op 1000000
Radix_Sort·p0.999 sample 1 1 63.016600 ms/op 1000000
Radix_Sort·p0.9999 sample 1 1 136.052736 ms/op 1000000
Radix_Sort·p1.00 sample 1 1 136.052736 ms/op 1000000

The table tells an interesting story. Arrays.sort is vastly superior for small arrays (the arrays most people have), but for large arrays the custom implementation comes into its own. Interestingly, this is consistent with the computer science. If you need to sort large arrays of (unsigned) integers and care about performance, think about implementing radix sort.