# Explicit Intent and Even Faster Hash Codes

I wrote a post recently about how disappointed I was that the optimiser couldn’t outsmart some clever Java code for computing hash codes. Well, here’s a faster hash code along the same lines.

The hash code implemented in Arrays.hashCode is a polynomial hash, it applies to any data type with a positional interpretation. It takes the general form $\sum_{i=0}^{n}x_{i}31^{n - i}$ where $x_0 = 1$. In other words, it’s a dot product of the elements of the array and some powers of 31. Daniel Lemire’s implementation makes it explicit to the optimiser, in a way it won’t otherwise infer, that this operation is data parallel. If it’s really just a dot product it can be made even more obvious at the cost of a loss of flexibility.

Imagine you are processing fixed or limited length strings (VARCHAR(255) or an URL) or coordinates of a space of fixed dimension. Then you could pre-compute the coefficients in an array and write the hash code explicitly as a dot product. Java 9 uses AVX instructions for dot products, so it should be very fast.

public class FixedLengthHashCode {

private final int[] coefficients;

public FixedLengthHashCode(int maxLength) {
this.coefficients = new int[maxLength + 1];
coefficients[maxLength] = 1;
for (int i = maxLength - 1; i >= 0; --i) {
coefficients[i] = 31 * coefficients[i + 1];
}
}

public int hashCode(int[] value) {
int result = coefficients[0];
for (int i = 0; i < value.length && i < coefficients.length - 1; ++i) {
result += coefficients[i + 1] * value[i];
}
return result;
}
}


This is really explicit, unambiguously parallelisable, and the results are remarkable.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
HashCode.BuiltIn thrpt 1 10 10.323026 0.223614 ops/us 100
HashCode.BuiltIn thrpt 1 10 0.959246 0.038900 ops/us 1000
HashCode.BuiltIn thrpt 1 10 0.096005 0.001836 ops/us 10000
HashCode.FixedLength thrpt 1 10 20.186800 0.297590 ops/us 100
HashCode.FixedLength thrpt 1 10 2.314187 0.082867 ops/us 1000
HashCode.FixedLength thrpt 1 10 0.227090 0.005377 ops/us 10000
HashCode.Unrolled thrpt 1 10 13.250821 0.752609 ops/us 100
HashCode.Unrolled thrpt 1 10 1.503368 0.058200 ops/us 1000
HashCode.Unrolled thrpt 1 10 0.152179 0.003541 ops/us 10000

Modifying the algorithm slightly to support limited variable length arrays degrades performance slightly, but there are seemingly equivalent implementations which do much worse.

public class FixedLengthHashCode {

private final int[] coefficients;

public FixedLengthHashCode(int maxLength) {
this.coefficients = new int[maxLength + 1];
coefficients[0] = 1;
for (int i = 1; i <= maxLength; ++i) {
coefficients[i] = 31 * coefficients[i - 1];
}
}

public int hashCode(int[] value) {
final int max = value.length;
int result = coefficients[max];
for (int i = 0; i < value.length && i < coefficients.length - 1; ++i) {
result += coefficients[max - i - 1] * value[i];
}
return result;
}
}

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
FixedLength thrpt 1 10 19.172574 0.742637 ops/us 100
FixedLength thrpt 1 10 2.233006 0.115285 ops/us 1000
FixedLength thrpt 1 10 0.227451 0.012231 ops/us 10000

The benchmark code is at github.

# Still True in Java 9: Handwritten Hash Codes are Faster

Update 05/09/2017: I have recently discovered that this phenomenon was first described by Alexey Shipilev on an OpenJDK ticket in 2014. As Alexey notes, this is due to the strength reduction of a multiplication by 31 to a shift and a subtraction, which creates a data dependency preventing an unroll. You can see the assembly emitted, confirming this data dependency, in the comments of this post.

One of the things to keep in mind with Java is that the best performance advice for one version may not apply to the next. The JVM has an optimiser which works by detecting intent in byte-code; it does a better job when the programmer is skilled enough to make that intent clear. There have been times when the optimiser has done a bad job, either through bugs or feature gaps, and compensatory idioms emerge. The value of these idioms degrades over time as the optimiser improves; a stroke of ingenuity in one version can become ritualistic nonsense in the next. It’s important not to be that guy who fears adding strings together because it was costly a decade ago, but does it always get better, even with low hanging fruit?

I reproduced the results of an extremely astute optimisation presented in 2015 by Daniel Lemire. I was hoping to see that an improved optimiser in Java 9, having observed several cases of automatic vectorisation, would render this optimisation null.

### Hash Codes for Arrays

I encourage you to read the original blog post because it is informative, but for the sake of coherence I will summarise the key point here. Arrays.hashCode implements the following hash code:

    public static int hashCode(int[] a) {
if (a == null)
return 0;

int result = 1;
for (int element : a)
result = 31 * result + element;

return result;
}


This results in a good hash code, but a scalar internal representation of this code has a problem: a data dependency on the multiplication, which is slower than moving data into registers. A significant and reproducible speed up can be observed when unrolling the loop, which enforces a prefetch of 128 bits of the array into registers before doing the slower multiplications:

    public static int hashCode(int[] a) {
if (a == null)
return 0;

int result = 1;
int i = 0;
for (; i + 3 < a.length; i += 4) {
result = 31 * 31 * 31 * 31 * result
+ 31 * 31 * 31 * a[i]
+ 31 * 31 * a[i + 1]
+ 31 * a[i + 2]
+ a[i + 3];
}
for (; i < a.length; i++) {
result = 31 * result + a[i];
}
return result;
}


The improvement in performance from this simple change can be confirmed with Java 8 by running a simple benchmark.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
BuiltIn thrpt 1 10 9.537736 0.382617 ops/us 100
BuiltIn thrpt 1 10 0.804620 0.103037 ops/us 1000
BuiltIn thrpt 1 10 0.092297 0.003947 ops/us 10000
Unrolled thrpt 1 10 14.974398 0.628522 ops/us 100
Unrolled thrpt 1 10 1.514986 0.035759 ops/us 1000
Unrolled thrpt 1 10 0.137408 0.010200 ops/us 10000

The performance improvement is so obvious, and the change so easy to make, that one wonders why JVM vendors didn’t make the change themselves.

### Java 9: Universally Better Automatic Optimisations?

As the comments on the blog post suggest, this is a prime candidate for vectorisation. Auto-vectorisation is a thing in Java 9. Using intrinsics or code clean enough to express intent clearly, you can really expect to see good usage of SIMD in Java 9. I was really expecting the situation to reverse in Java 9; but it doesn’t.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: size
BuiltIn thrpt 1 10 9.822568 0.381087 ops/us 100
BuiltIn thrpt 1 10 0.949273 0.024021 ops/us 1000
BuiltIn thrpt 1 10 0.093171 0.004502 ops/us 10000
Unrolled thrpt 1 10 13.762617 0.440135 ops/us 100
Unrolled thrpt 1 10 1.501106 0.094897 ops/us 1000
Unrolled thrpt 1 10 0.139963 0.011487 ops/us 10000

This is a still a smart optimisation two years later, but it offends my sensibilities in the same way hints do in SQL – a layer of abstraction has been punctured. I will have to try again in Java 10.

# New Methods in Java 9: Math.fma and Arrays.mismatch

There are two noteworthy new methods in Java 9: Arrays.mismatch and Math.fma.

#### Arrays.mismatch

This method takes two primitive arrays, and returns the index of the first differing values. This effectively computes the longest common prefix of the two arrays. This is really quite useful, mostly for text processing but also for Bioinformatics (protein sequencing and so on, much more interesting than the sort of thing I work on). Having worked extensively with Apache HBase (where a vast majority of the API involves manipulating byte arrays) I can think of lots of less interesting use cases for this method.

Looking carefully, you can see that the method calls into the internal ArraysSupport utility class, which will try to perform a vectorised mismatch (an intrinsic candidate). Since this will use AVX instructions, this is very fast; much faster than a handwritten loop.

Let’s measure the boost versus a handwritten loop, testing across a range of common prefices and array lengths of byte[].

    @Benchmark
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public void Mismatch_Intrinsic(BytePrefixData data, Blackhole bh) {
bh.consume(Arrays.mismatch(data.data1, data.data2));
}

@Benchmark
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public void Mismatch_Handwritten(BytePrefixData data, Blackhole bh) {
byte[] data1 = data.data1;
byte[] data2 = data.data2;
int length = Math.min(data1.length, data2.length);
int mismatch = -1;
for (int i = 0; i < length; ++i) {
if (data1[i] != data2[i]) {
mismatch = i;
break;
}
}
bh.consume(mismatch);
}


The results speak for themselves. Irritatingly, there is some duplication in output because I haven’t figured out how to make JMH use a subset of the Cartesian product of its parameters.

Benchmark (prefix) (size) Mode Cnt Score Error Units
Mismatch_Handwritten 10 100 thrpt 10 22.360 ± 0.938 ops/us
Mismatch_Handwritten 10 1000 thrpt 10 2.459 ± 0.256 ops/us
Mismatch_Handwritten 10 10000 thrpt 10 0.255 ± 0.009 ops/us
Mismatch_Handwritten 100 100 thrpt 10 22.763 ± 0.869 ops/us
Mismatch_Handwritten 100 1000 thrpt 10 2.690 ± 0.044 ops/us
Mismatch_Handwritten 100 10000 thrpt 10 0.273 ± 0.008 ops/us
Mismatch_Handwritten 1000 100 thrpt 10 24.970 ± 0.713 ops/us
Mismatch_Handwritten 1000 1000 thrpt 10 2.791 ± 0.066 ops/us
Mismatch_Handwritten 1000 10000 thrpt 10 0.281 ± 0.007 ops/us
Mismatch_Intrinsic 10 100 thrpt 10 89.169 ± 2.759 ops/us
Mismatch_Intrinsic 10 1000 thrpt 10 26.995 ± 0.501 ops/us
Mismatch_Intrinsic 10 10000 thrpt 10 3.553 ± 0.065 ops/us
Mismatch_Intrinsic 100 100 thrpt 10 83.037 ± 5.590 ops/us
Mismatch_Intrinsic 100 1000 thrpt 10 26.249 ± 0.714 ops/us
Mismatch_Intrinsic 100 10000 thrpt 10 3.523 ± 0.122 ops/us
Mismatch_Intrinsic 1000 100 thrpt 10 87.921 ± 6.566 ops/us
Mismatch_Intrinsic 1000 1000 thrpt 10 25.812 ± 0.442 ops/us
Mismatch_Intrinsic 1000 10000 thrpt 10 4.177 ± 0.059 ops/us

Why is there such a big difference? Look at how the score decreases as a function of array length, even when the common prefix, and therefore the number of comparisons required, is small: clearly the performance of this algorithm depends on the efficiency of memory access. Arrays.mismatch optimises this, reading qwords of the array into SIMD registers. Working one long at a time, it is possible to compute the XOR in a single instruction to determine if it’s even necessary to look at each byte.

java/util/ArraysSupport.vectorizedMismatch(Ljava/lang/Object;JLjava/lang/Object;JII)I  [0x000002bd9215a820, 0x000002bd9215aa78]  600 bytes
Argument 0 is unknown.RIP: 0x2bd9215a820 Code size: 0x00000258
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} {0x000002bda79cbf68} 'vectorizedMismatch' '(Ljava/lang/Object;JLjava/lang/Object;JII)I' in 'java/util/ArraysSupport'
# parm0:    rdx:rdx   = 'java/lang/Object'
# parm1:    r8:r8     = long
# parm2:    r9:r9     = 'java/lang/Object'
# parm3:    rdi:rdi   = long
# parm4:    rsi       = int
# parm5:    rcx       = int
#           [sp+0x60]  (sp of caller)
0x000002bd9215a820: mov     dword ptr [rsp+0ffffffffffff9000h],eax
;...89
;...84
;...24
;...00
;...90
;...ff
;...ff

0x000002bd9215a827: push    rbp               ;...55

0x000002bd9215a828: sub     rsp,50h           ;...48
;...83
;...ec
;...50
;*synchronization entry
; - java.util.ArraysSupport::vectorizedMismatch@-1 (line 120)

0x000002bd9215a82c: mov     r10,rdi           ;...4c
;...8b
;...d7

0x000002bd9215a82f: vmovq   xmm2,r9           ;...c4
;...c1
;...f9
;...6e
;...d1

0x000002bd9215a834: vmovq   xmm1,rdx          ;...c4
;...e1
;...f9
;...6e
;...ca

0x000002bd9215a839: mov     r14d,ecx          ;...44
;...8b
;...f1

0x000002bd9215a83c: vmovd   xmm0,esi          ;...c5
;...f9
;...6e
;...c6

0x000002bd9215a840: mov     r9d,3h            ;...41
;...b9
;...03
;...00
;...00
;...00

0x000002bd9215a846: sub     r9d,ecx           ;...44
;...2b
;...c9
;*isub {reexecute=0 rethrow=0 return_oop=0}
; - java.util.ArraysSupport::vectorizedMismatch@5 (line 120)

0x000002bd9215a849: mov     edx,esi           ;...8b
;...d6

0x000002bd9215a84b: mov     ecx,r9d           ;...41
;...8b
;...c9

0x000002bd9215a84e: sar     edx,cl            ;...d3
;...fa
;*ishr {reexecute=0 rethrow=0 return_oop=0}
; - java.util.ArraysSupport::vectorizedMismatch@17 (line 122)

0x000002bd9215a850: mov     eax,1h            ;...b8
;...01
;...00
;...00
;...00

0x000002bd9215a855: xor     edi,edi           ;...33
;...ff

0x000002bd9215a857: test    edx,edx           ;...85
;...d2

0x000002bd9215a859: jle     2bd9215a97ah      ;...0f
;...8e
;...1b
;...01
;...00
;...00


The code for this benchmark is at github.

#### Math.fma

In comparison to users of some languages, Java programmers are lackadaisical about floating point errors. It’s a good job that historically Java hasn’t been considered suitable for the implementation of numerical algorithms. But all of a sudden there is a revolution of data science on the JVM, albeit mostly driven by the Scala community, with JVM implementations of structures like recurrent neural networks abounding. It matters less for machine learning than root finding, but how accurate can these implementations be without JVM level support for minimising the propagation floating point errors? With Math.fma this is improving, by allowing two common operations to be performed before rounding.

Math.fma fuses a multiplication and an addition into a single floating point operation to compute expressions like $ab + c$. This has two key benefits:

1. There’s only one operation, and only one rounding error
2. This is explicitly supported in AVX2 by the VFMADD* instructions

#### Newton’s Method

To investigate any superior suppression of floating point errors, I use a toy implementation of Newton’s method to compute the root of a quadratic equation, which any teenager could calculate analytically (the error is easy to quantify).

I compare these two implementations for $4x^2 - 12x + 9$ (there is a repeated root at 1.5) to get an idea for the error (defined by $|1.5 - x_n|$) after a large number of iterations.

I implemented this using FMA:

public class NewtonsMethodFMA {

private final double[] coefficients;

public NewtonsMethodFMA(double[] coefficients) {
this.coefficients = coefficients;
}

public double evaluateF(double x) {
double f = 0D;
int power = coefficients.length - 1;
for (int i = 0; i < coefficients.length; ++i) {
f = Math.fma(coefficients[i], Math.pow(x, power--), f);
}
return f;
}

public double evaluateDF(double x) {
double df = 0D;
int power = coefficients.length - 2;
for (int i = 0; i < coefficients.length - 1; ++i) {
df = Math.fma((power + 1) * coefficients[i],  Math.pow(x, power--), df);
}
return df;
}

public double solve(double initialEstimate, int maxIterations) {
double result = initialEstimate;
for (int i = 0; i < maxIterations; ++i) {
result -= evaluateF(result)/evaluateDF(result);
}
return result;
}
}


And an implementation with normal operations:


public class NewtonsMethod {

private final double[] coefficients;

public NewtonsMethod(double[] coefficients) {
this.coefficients = coefficients;
}

public double evaluateF(double x) {
double f = 0D;
int power = coefficients.length - 1;
for (int i = 0; i < coefficients.length; ++i) {
f += coefficients[i] * Math.pow(x, power--);
}
return f;
}

public double evaluateDF(double x) {
double df = 0D;
int power = coefficients.length - 2;
for (int i = 0; i < coefficients.length - 1; ++i) {
df += (power + 1) * coefficients[i] * Math.pow(x, power--);
}
return df;
}

public double solve(double initialEstimate, int maxIterations) {
double result = initialEstimate;
for (int i = 0; i < maxIterations; ++i) {
result -= evaluateF(result)/evaluateDF(result);
}
return result;
}
}


When I run this code for 1000 iterations, the FMA version results in 1.5000000083575202, whereas the vanilla version results in 1.500000017233207. It’s completely unscientific, but seems plausible and confirms my prejudice so… In fact, it’s not that simple, and over a range of initial values, there is only a very small difference in FMA’s favour. There’s not even a performance improvement – clearly this method wasn’t added so you can start implementing numerical root finding algorithms – the key takeaway is that the results are slightly different because a different rounding strategy has been used.

Benchmark (maxIterations) Mode Cnt Score Error Units
NM_FMA 100 thrpt 10 93.805 ± 5.174 ops/ms
NM_FMA 1000 thrpt 10 9.420 ± 1.169 ops/ms
NM_FMA 10000 thrpt 10 0.962 ± 0.044 ops/ms
NM_HandWritten 100 thrpt 10 93.457 ± 5.048 ops/ms
NM_HandWritten 1000 thrpt 10 9.274 ± 0.483 ops/ms
NM_HandWritten 10000 thrpt 10 0.928 ± 0.041 ops/ms

# Interpreting Compression Ratios as Signals

I often think about the book Theories of Everything by John D. Barrow. There is a beautiful passage on algorithmic compressibility:

The goal of science is to make sense of the diversity of nature. [Science] employs observation to gather information about the world and to test predictions about how the world will react to new circumstances, but in between these two procedures lies the heart of the scientific process. This is nothing more than the transformation of lists of observational data into abbreviated form by the recognition of patterns. The recognition of such a pattern allows the information content of the observed sequence of events to be replaced by a shorthand formula which possesses the same, or almost the same, information content…

We can extend this image of science in a manner that sharpens its focus. Suppose we are presented with any string of symbols. They do not have to be numbers but let us assume for the sake of illustration that they are. We say that the string is ‘random’ if there is no other representation of the string which is shorter than itself. But we say it is ‘non-random’ if there does exist an abbreviated representation.

When processing a data set, patterns can be found and exploited to reduce its size to close to that of its information content, in other words the data set can be abbreviated. This idea can be turned on its head to use a data set’s compression ratio as a measure of its information content. Using compression algorithms which seek to exploit correlations between attribute values, hitherto unseen relations, or mistakes, decrease compression ratio. Repeated measurement of this proxy information content in windows over a stream of, say, credit card transactions could be interpreted as a measure of the likelihood of fraud. In a more generic setting, changes in the compression ratio can indicate changes in data quality.

### Compression Techniques

There are various approaches to compression used in database systems. These techniques can be classified in terms of what is actually compressed, agnosticism to the domain of the data, and invariance of properties of relations in compressed form.

At one extreme there are techniques, typically used for cold data, which compress entire chunks of data byte-wise, with no relational invariance. A good example of such a cold data compression approach is SNAPPY block level compression in HBase.  This approach has a key benefit; it can compress anything since it is agnostic to the format of the data. In order to exploit a property it must be recognised and acknowledged first, and ubiquity comes at the cost of being unable to perform relational operations in compressed form. There are options in various database system to perform this kind of compression at a row level. This can lead to lower compression ratios since at a row level the frequency histograms are unlikely to converge to the frequency histogram of the entire data set (convergence could not happen unless each row were equivalent up to a byte-wise permutation).

At the other extreme there are techniques like columnar dictionary encoding, which maps the domain of a column to integral indices into some array-like storage of the distinct attribute values. The distinct values are typically represented as a sorted array or as a hash set. Several properties are preserved in dictionarised data sets. Equality is always preserved, meaning that several operations like join and various filters can be performed without decompression (instead, the expressions themselves can be mapped into the dictionary domain). Depending on the storage technique used, ordering and associated relations may also be preserved after compression. The frequency histogram of the attribute values of the column is sometimes avoided because this would not result in a fixed-width encoding, but there are dictionary encodings to map the most frequent values to the smallest integers to decrease the number of bits required in encoded form. While the level of compression feasible is good, it is inferior to an inverse-frequency symbol based approach. Columnar dictionary encoding also cannot exploit correlations between attribute values. By way of example, in a financial data set, there is a high correlation between trade dates, trade types and settlement dates: with columnar dictionary encoding, the compressed form would take up space proportional to the distinct values of the three attributes combined, even though trades in FX spot settle T+2, except USDCAD which settles T+1 (the settlement date needn’t be stored, it could feasibly take up no space and just be computed instead).

Somewhere in between the two extremes there are algorithms which encode tuples of attribute values. One such technique starts by computing the Huffman encoding of the entire domain of each column. Lists of tuples are encoded by first replacing each attribute value of each tuple with the corresponding Huffman code, concatenating the Huffman codes into fixed width integers before a sort and delta encoding. This produces a compression ratio close to the entropy of the data set, and specifically targets correlations between columns. While several compression techniques have their use, it is at this point in the spectrum where I see there being useful information in the compression ratio over a window of data, because unusual things will take up more space. If your compression ratio ever changes, it is because something weird has happened and you should probably investigate it. If you are processing trades and the compression ratio decreases, is it because you recently started trading more USDCAD Spot?

# Roaring TreeMap (English Translation)

A presentation of the Roaring TreeMap data structure translated into English from French Nouveaux modèles d’index bitmap compressés à 64 bits authored by Samy Chambi, Daniel Lemire and Robert Godin.

#### Roaring TreeMap

The RoaringTreeMap model combines a Java TreeMap with the Roaring bitmap structure to index a set of 64-bit integers represented by the positions of the set bits of a bitmap. TreeMap is an implementation of a Red-Black Tree structure, a self-balancing binary search tree, and is defined amongst other collections in the java.util package. The various tree operations (insertion, search, etc.) are implemented using the algorithms proposed in  (Cormen et al., 2001).

To represent a 64-bit integer, the model splits the integer into two parts. The first part constitutes the 32 most significant bits, and the second part represents the 32 least significant bits. A node of a RoaringTreeMap is composed of a key, which is a 32-bit integer, and a RoaringBitmap. RoaringTreeMap arranges groups of 64-bit integers containing the same 32 most significant bits into the same node. The key of the node stores the 32 most significant bits for the entire group, and the Roaring bitmap associated with the node contains the 32 least significant bits of each integer. RoaringTreeMap utilises a form of prefix compression on the 32 most significant bits of each integer in a group, which can save up to $32 \times 2^{32}-1$ bits for such a group.

An insertion or search operation for a 64 bit integer in a RoaringTreeMap starts by performing a random access in the tree to find a node such that the node’s key is equal to the 32 most significant bits of the integer in question. Using a self-balancing binary search tree allows such an operation to be performed with logarithmic time complexity with respect to the total number of nodes in the RoaringTreeMap. If such a node is found, a second insertion or search operation will be performed on the RoaringBitmap associated with the node, which will also have a logarithmic time complexity with respect to the number of entries in the first level index of the Roaring bitmap and the number of entries contained within the accessed container, given that the container is sorted.

Another advantage of the RoaringTreeMap comes from the property of self-balancing binary search trees, which guarantees that the nodes of the tree can always be ordered by the values of their keys in a linear time complexity with respect the number of nodes in the tree. Combined with the ascending sort order of the 32-bit integers maintained within the Roaring bitmaps, this allows the RoaringTreeMap to iterate over the set of 64-bit integers in ascending order in time linearly proportional to the size of set. This is very effective for the computation of basic set operations between two RoaringTreeMaps, such as: intersection, union, symmetric difference, which can be performed in time linearly proportional to the number of integers contained in the tree. Without this property, such an operation would execute with quadratic time complexity with respect to the number of elements in the two sets.

#### Union of two RoaringTreeMaps

A union takes two RoaringTreeMaps as input and produces a new RoaringTreeMap as output. The algorithm starts with the allocation of a new empty dynamic array which will be filled with the nodes which form the union. Next, the nodes of the two trees are traversed in the ascending order of their keys. During an iteration, if the two current nodes have keys with different values, the container of the node with the smallest key is copied and inserted into the dynamic array, then the algorithm advances to the position of this node in the tree. Otherwise, if the two nodes compared during an iteration have keys of the same value, a logical OR is computed between the Roaring bitmaps associated with each node, and the result is returned as a new Roaring bitmap. A new node containing the Roaring bitmap obtained and a key of equal value to the two nodes will be inserted into the dynamic array. These operations continue until each node in either tree is consumed. At the end, the nodes inserted in the dynamic array will be sorted in the order of their keys, after which a recursive algorithm constructs the RoaringTreeMap from the dynamic array. The algorithm requires time of $O(n1 + n2)$ to traverse the two input trees, where $n1$ and $n2$ represent the number of nodes in the two trees respectively. The construction of the dynamic array requires time of $O(n1+n2)$ because it could require up to $O(n1+n2)$ insertions, and each insertion requires constant amortised time. The construction of the final RoaringTreeMap is implemented with an efficient recursive algorithm which runs in a time of $O(n1 + n2)$. So the total execution time for computing the union consisting of the two algorithms is $O(n1 + n2)$, plus the time necessary to compute the logical OR of the Roaring Bitmaps (Chambi et al., 2014, 2015).

#### Intersection of two RoaringTreeMaps

An intersection takes two RoaringTreeMaps as input and produces a new RoaringTreeMap as output. The algorithm starts by allocating an empty dynamic array  into which the nodes that form the intersection will be written. After, the algorithm iterates over the nodes of the two trees in the ascending order of their keys. In the case where the keys of the nodes differ in value, the algorithm advances to the position of the node with the smallest key. Otherwise, in the case where the two keys have the same value, a logical AND operation will be performed between the Roaring bitmaps associated with each node, which results in a new Roaring bitmap. Then a new node containing the key and the resultant Roaring bitmap is inserted into the next slot in the dynamic array. These operations continue until each node in either tree is consumed. After the iteration is complete, the dynamic array will be sorted in the order of the nodes’ keys. Afterwards a recursive algorithm builds the RoaringTreeMap from the node array. The execution time of the second intersection algorithm depends on the time necessary to traverse the two trees, for populating the dynamic array, and for constructing the tree of the RoaringTreeMap. The total time to complete these operations is of the order of $O(n1 + n2)$ in the worst case, where $n1$ and $n2$ represent the number of nodes in the first and second trees respectively. In the best case, the algorithm executes in a time of $min(n1, n2)$, when a traversal of one of the trees is sufficient to arrive at the final result. This does not account for the possible time to compute the logical ANDs between the Roaring bitmaps (Chambi et al., 2014, 2015).

# Can Java 9 Leverage AVX-512 Yet?

Advanced Vector Extensions (AVX) has gone through several iterations, doubling the size of SIMD registers available several times. The state of the art is currently AVX-512 which offers 512-bit registers. In parallel, support for auto-vectorisation of handwritten Java has been improving. With an early access JDK9 on an Intel(R) Core(TM) i7-6700HQ CPU I have observed usage of both 256-bit (vector addition) and 128-bit registers (sum of squares). I was very kindly granted access to a Knights Landing server where I ran the same code. I was hoping to see evidence of usage of 512-bit registers, though I didn’t see it directly I did see a strong indication that this occurs.

### SIMD Registers

There is a hierarchy of register types from several iterations of SIMD instruction sets: 128 bits, named xmm, originally from Streaming SIMD Extensions (SSE), 256-bits, named ymm, originally from AVX, and 512-bits, named zmm, introduced with AVX-512. A 512-bit register is enormous; it can contain 16 ints and can be used as a single operand to various vector instructions allowing up to 16 ints to be operated on in parallel. Each of these registers can be used as if it is the predecessor by masking the upper bits.

On my laptop I have seen Java 9 use 128-bit xmm registers in a sum of squares calculation on doubles:

0x000001aada783fca: vmovsd  xmm0,qword ptr [rbp+r13*8+10h] ...


As an aside, the weird thing about this code is that only a 64-bit qword, as opposed to an xmmword, is being loaded into xmm0 here. The code generated for floats is very interesting in this respect because it seems to load a full 256-bit ymmword of the array and supplies it as an operand to vmulps.

  0x00000245b32aa0dc: vmovdqu ymm1,ymmword ptr [r8+r9*4+10h]
0x00000245b32aa0e3: vmulps  ymm1,ymm1,ymm1


I have also seen 256-bit ymm registers in use in the simpler float vector addition:

  0x000002759215a17a: vmovdqu ymm0,ymmword ptr [rcx+rax*4+50h]
0x000002759215a186: vmovdqu ymmword ptr [rbx+rax*4+50h],ymm0


Where a 256-bit ymmword is being used to load chunks of the array into the register. The interesting thing about running the float vector addition case is that you can trace the optimisations being applied as the code runs; starting off with scalar instructions, graduating to AVX instructions on xmm registers, finally learning to utilise the ymm registers. This is quite an impressive feat of the optimiser.

What’s next? 512-bit zmm registers!

### Knights Landing

Knights Landing has 32 512-bit zmm registers, I try to use them by running the code at github, and observing the assembly code emitted.

The Knights Landing server has 256 processors, each with a modest clock speed, but support for AVX-512 core and several extensions (see the flags). It is designed for massively parallel workloads, and, for single threaded code, would not compare favourably with a commodity desktop. But if you can write code to keep 256 cores busy concurrently, it will likely profit from running on a Knights Landing.

processor       : 255
vendor_id       : GenuineIntel
cpu family      : 6
model           : 87
model name      : Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz
stepping        : 1
microcode       : 0x130
cpu MHz         : 1299.695
cache size      : 1024 KB
physical id     : 0
siblings        : 256
core id         : 73
cpu cores       : 64
apicid          : 295
initial apicid  : 295
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt
bogomips        : 2600.02
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:


I am running on this server simply because it supports AVX-512, and am interested to see if Java can use it. The code is purposefully single threaded, and the point is not to confirm that one of 256 1.3Ghz processors is twice as slow as one of my eight 2.7GHz processors when measured on a single threaded workload, the interesting aspect is the instructions generated.

The operating system is CentOS Linux release 7.2.1511 (Core), the Java version used was Java HotSpot(TM) 64-Bit Server VM (build 9+177, mixed mode).

### Observations

My observations were mixed: I didn’t see any 512-bit registers being used, some code expected to vectorise didn’t, but, critically, also saw evidence that the hsdis-amd64.so disassembler did not understand some of the assembly used. I hope, just because I like things to work, that this is hiding evidence of 512-bit register use. As you can see, there are a lot of new instructions in AVX-512 which may explain these black holes in the output.

I ran vector addition and sum of squares across a range of vector sizes and primitive types and saw little use of AVX instructions in my handwritten code, let alone evidence of 512-bit zmm register operands, yet extensive use of 256-bit registers for String intrinsics.

The first data point to look at is the relative timings (remember: absolute timings are meaningless because of Knights Landing design motivations, even more so because assembly was being printed during this run) which are interesting: some kind of optimisation is being applied to sum of squares only for ints.

Benchmark                          (size)   Mode  Cnt   Score    Error   Units
c.o.s.ss.SumOfSquares.SS_Doubles   100000  thrpt   10   2.145 ±  0.004  ops/ms
c.o.s.ss.SumOfSquares.SS_Doubles  1000000  thrpt   10   0.206 ±  0.001  ops/ms
c.o.s.ss.SumOfSquares.SS_Floats    100000  thrpt   10   2.150 ±  0.002  ops/ms
c.o.s.ss.SumOfSquares.SS_Floats   1000000  thrpt   10   0.212 ±  0.001  ops/ms
c.o.s.ss.SumOfSquares.SS_Ints      100000  thrpt   10  16.960 ±  0.043  ops/ms
c.o.s.ss.SumOfSquares.SS_Ints     1000000  thrpt   10   1.015 ±  0.019  ops/ms
c.o.s.ss.SumOfSquares.SS_Longs     100000  thrpt   10   6.379 ±  0.014  ops/ms
c.o.s.ss.SumOfSquares.SS_Longs    1000000  thrpt   10   0.429 ±  0.033  ops/ms


I looked at how SS_Ints was being executed and saw the usage of the vpaddd (addition of packed integers) instruction with xmm register operands, in between two instructions the disassembler (hsdis) seemingly could not read. Maybe these unprintable instructions are from AVX-512 or extensions? It’s impossible to say. The same mechanism, using vpaddq, was not used for longs, but is used on my laptop.

  0x00007fc31576edf1: (bad)                     ;...0e



The case of vector addition is less clear; there are many uninterpreted instructions in the output, while it clearly vectorises on my laptop using the largest registers available. The only vector instruction present was vzeroupper, which zeroes the upper 128 bits of a 256-bit register, and is usually used for optimising use of SSE on more modern architectures. Incidentally, there is explicit advice from Intel against using this instruction on Knights Landing. I saw assembly for the scalar implementation of this code (this will always happen prior to optimisation), but there’s no evidence the code was vectorised. However, there were 110 (bad) instructions in the output for float array addition alone.

An interesting presentation, by one of the key Hotspot compiler engineers, outlines some of the limits of SIMD support in Java. It neither includes examples of 512-bit register usage nor states that this is unsupported. Possibly Project Panama‘s Long8 type will utilise 512-bit registers. I will recompile the disassembler and give JDK9 the benefit of the doubt for now.

# Choosing the Right Radix: Measurement or Mathematics?

I recently wrote a post about radix sorting, and found that for large arrays of unsigned integers a handwritten implementation beats Arrays.sort. However, I paid no attention to the choice of radix and used a default of eight bits. It turns out this was a lucky choice: modifying my benchmark to parametrise the radix, I observed a maximum at one byte, regardless of the size of array.

Is this an algorithmic or technical phenomenon? Is this something that could have been predicted on the back of an envelope without running a benchmark?

### Extended Benchmark Results

Size Radix Score Score Error (99.9%) Unit
100 2 135.559923 7.72397 ops/ms
100 4 262.854579 37.372678 ops/ms
100 8 345.038234 30.954927 ops/ms
100 16 7.717496 1.144967 ops/ms
1000 2 13.892142 1.522749 ops/ms
1000 4 27.712719 4.057162 ops/ms
1000 8 52.253497 4.761172 ops/ms
1000 16 7.656033 0.499627 ops/ms
10000 2 1.627354 0.186948 ops/ms
10000 4 3.620869 0.029128 ops/ms
10000 8 6.435789 0.610848 ops/ms
10000 16 3.703248 0.45177 ops/ms
100000 2 0.144575 0.014348 ops/ms
100000 4 0.281837 0.025707 ops/ms
100000 8 0.543274 0.031553 ops/ms
100000 16 0.533998 0.126949 ops/ms
1000000 2 0.011293 0.001429 ops/ms
1000000 4 0.021128 0.003137 ops/ms
1000000 8 0.037376 0.005783 ops/ms
1000000 16 0.031053 0.007987 ops/ms

### Modeling

To model the execution time of the algorithm, we can write $t = f(r, n)$, where $n \in \mathbb{N}$ is the length of the input array, and $r \in [1, 32)$ is the size in bits of the radix. We can inspect if the model predicts non-monotonic execution time with a minimum (corresponding to maximal throughput), or if $t$ increases indefinitely as a function of $r$. If we find a plausible model predicting a minimum, temporarily treating $r$ as continuous, we can solve $\frac{\partial f}{\partial r}|_{n=N, r \in [1,32)} = 0$ to find the theoretically optimal radix. This pre-supposes we derive a non-monotonic model.

### Constructing a Model

We need to write down an equation before we can do any calculus, which requires two dangerous assumptions.

1. Each operation has the same cost, making the execution time proportional to the number of operation.
2. The costs of operations do not vary as a function of $n$ or $r$.

This means all we need to do is find a formula for the number of operations, and then vary $n$ and $r$. The usual pitfall in this approach relates to the first assumption, in that memory accesses are modelled as uniform cost; memory access can vary widely in cost ranging from registers to RAM on another socket. We are about to fall foul of both assumptions constructing an intuitive model of the algorithm’s loop.

        while (shift < Integer.SIZE) {
Arrays.fill(histogram, 0);
for (int i = 0; i < data.length; ++i) {
++histogram[((data[i] & mask) >> shift) + 1];
}
for (int i = 0; i < 1 << radix; ++i) {
histogram[i + 1] += histogram[i];
}
for (int i = 0; i < data.length; ++i) {
copy[histogram[(data[i] & mask) >> shift]++] = data[i];
}
for (int i = 0; i < data.length; ++i) {
data[i] = copy[i];
}
}


The outer loop depends on the choice of radix while the inner loops depend on the size of the array and the choice of radix. There are five obvious aspects to capture:

• The first inner loop takes time proportional to $n$
• The third and fourth inner loops take time proportional to $n$
• We can factor the per-element costs of loops 1, 3 and 4 into a constant $a$
• The second inner loop takes time proportional to $2^r$, modeled with by the term $b2^r$
• The body of the loop executes $32/r$ times

This can be summarised as the formula:

$f(r, n) = 32\frac{(3an + b2^r)}{r}$

It was claimed the algorithm had linear complexity in $n$ and it only has a linear term in $n$. Good. However, the exponential $r$ term in the numerator dominates the linear term in the denominator, making the function monotonic in $r$. The model fails to predict the observed throughput maximising radix. There are clearly much more complicated mechanisms at play than can be captured counting operations.