A presentation of the Roaring TreeMap data structure translated into English from French Nouveaux modèles d’index bitmap compressés à 64 bits authored by Samy Chambi, Daniel Lemire and Robert Godin.
The RoaringTreeMap model combines a Java TreeMap with the Roaring bitmap structure to index a set of 64-bit integers represented by the positions of the set bits of a bitmap. TreeMap is an implementation of a Red-Black Tree structure, a self-balancing binary search tree, and is defined amongst other collections in the java.util package. The various tree operations (insertion, search, etc.) are implemented using the algorithms proposed in (Cormen et al., 2001).
To represent a 64-bit integer, the model splits the integer into two parts. The first part constitutes the 32 most significant bits, and the second part represents the 32 least significant bits. A node of a RoaringTreeMap is composed of a key, which is a 32-bit integer, and a RoaringBitmap. RoaringTreeMap arranges groups of 64-bit integers containing the same 32 most significant bits into the same node. The key of the node stores the 32 most significant bits for the entire group, and the Roaring bitmap associated with the node contains the 32 least significant bits of each integer. RoaringTreeMap utilises a form of prefix compression on the 32 most significant bits of each integer in a group, which can save up to bits for such a group.
An insertion or search operation for a 64 bit integer in a RoaringTreeMap starts by performing a random access in the tree to find a node such that the node’s key is equal to the 32 most significant bits of the integer in question. Using a self-balancing binary search tree allows such an operation to be performed with logarithmic time complexity with respect to the total number of nodes in the RoaringTreeMap. If such a node is found, a second insertion or search operation will be performed on the RoaringBitmap associated with the node, which will also have a logarithmic time complexity with respect to the number of entries in the first level index of the Roaring bitmap and the number of entries contained within the accessed container, given that the container is sorted.
Another advantage of the RoaringTreeMap comes from the property of self-balancing binary search trees, which guarantees that the nodes of the tree can always be ordered by the values of their keys in a linear time complexity with respect the number of nodes in the tree. Combined with the ascending sort order of the 32-bit integers maintained within the Roaring bitmaps, this allows the RoaringTreeMap to iterate over the set of 64-bit integers in ascending order in time linearly proportional to the size of set. This is very effective for the computation of basic set operations between two RoaringTreeMaps, such as: intersection, union, symmetric difference, which can be performed in time linearly proportional to the number of integers contained in the tree. Without this property, such an operation would execute with quadratic time complexity with respect to the number of elements in the two sets.
Union of two RoaringTreeMaps
A union takes two RoaringTreeMaps as input and produces a new RoaringTreeMap as output. The algorithm starts with the allocation of a new empty dynamic array which will be filled with the nodes which form the union. Next, the nodes of the two trees are traversed in the ascending order of their keys. During an iteration, if the two current nodes have keys with different values, the container of the node with the smallest key is copied and inserted into the dynamic array, then the algorithm advances to the position of this node in the tree. Otherwise, if the two nodes compared during an iteration have keys of the same value, a logical OR is computed between the Roaring bitmaps associated with each node, and the result is returned as a new Roaring bitmap. A new node containing the Roaring bitmap obtained and a key of equal value to the two nodes will be inserted into the dynamic array. These operations continue until each node in either tree is consumed. At the end, the nodes inserted in the dynamic array will be sorted in the order of their keys, after which a recursive algorithm constructs the RoaringTreeMap from the dynamic array. The algorithm requires time of to traverse the two input trees, where and represent the number of nodes in the two trees respectively. The construction of the dynamic array requires time of because it could require up to insertions, and each insertion requires constant amortised time. The construction of the final RoaringTreeMap is implemented with an efficient recursive algorithm which runs in a time of . So the total execution time for computing the union consisting of the two algorithms is , plus the time necessary to compute the logical OR of the Roaring Bitmaps (Chambi et al., 2014, 2015).
Intersection of two RoaringTreeMaps
An intersection takes two RoaringTreeMaps as input and produces a new RoaringTreeMap as output. The algorithm starts by allocating an empty dynamic array into which the nodes that form the intersection will be written. After, the algorithm iterates over the nodes of the two trees in the ascending order of their keys. In the case where the keys of the nodes differ in value, the algorithm advances to the position of the node with the smallest key. Otherwise, in the case where the two keys have the same value, a logical AND operation will be performed between the Roaring bitmaps associated with each node, which results in a new Roaring bitmap. Then a new node containing the key and the resultant Roaring bitmap is inserted into the next slot in the dynamic array. These operations continue until each node in either tree is consumed. After the iteration is complete, the dynamic array will be sorted in the order of the nodes’ keys. Afterwards a recursive algorithm builds the RoaringTreeMap from the node array. The execution time of the second intersection algorithm depends on the time necessary to traverse the two trees, for populating the dynamic array, and for constructing the tree of the RoaringTreeMap. The total time to complete these operations is of the order of in the worst case, where and represent the number of nodes in the first and second trees respectively. In the best case, the algorithm executes in a time of , when a traversal of one of the trees is sufficient to arrive at the final result. This does not account for the possible time to compute the logical ANDs between the Roaring bitmaps (Chambi et al., 2014, 2015).