How a Bitmap Index Works

Bitmap indices are used in various data technologies for efficient query processing. At a high level, a bitmap index can be thought of as a physical materialisation of a set of predicates over a data set; it is naturally columnar and particularly good for multidimensional boolean query processing. PostgreSQL materialises a bitmap index on the fly from query … Continue reading How a Bitmap Index Works
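To make the "materialised predicates" idea concrete, here is a minimal sketch using `java.util.BitSet`. It is not how PostgreSQL or any particular engine implements bitmap indices; the class `BitmapIndexSketch`, the predicate strings and the five sample rows are all made up for illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

class BitmapIndexSketch {
    // One bitmap per predicate: bit i is set when row i satisfies that predicate.
    private final Map<String, BitSet> bitmaps = new LinkedHashMap<>();

    void add(int row, String predicate) {
        bitmaps.computeIfAbsent(predicate, k -> new BitSet()).set(row);
    }

    // Returns a copy so callers can combine bitmaps without mutating the index.
    BitSet get(String predicate) {
        BitSet bits = bitmaps.get(predicate);
        return bits == null ? new BitSet() : (BitSet) bits.clone();
    }

    // Conjunction of two predicates is a bitwise AND of their bitmaps.
    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    // Disjunction of two predicates is a bitwise OR of their bitmaps.
    static BitSet or(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    // Hypothetical data set: each row is {country, colour}.
    static BitmapIndexSketch example() {
        String[][] rows = {
            {"UK", "red"}, {"FR", "red"}, {"UK", "blue"}, {"UK", "red"}, {"DE", "blue"}
        };
        BitmapIndexSketch index = new BitmapIndexSketch();
        for (int i = 0; i < rows.length; i++) {
            index.add(i, "country=" + rows[i][0]);
            index.add(i, "colour=" + rows[i][1]);
        }
        return index;
    }

    public static void main(String[] args) {
        BitmapIndexSketch index = example();
        // Rows where country is UK AND colour is red.
        System.out.println(and(index.get("country=UK"), index.get("colour=red")));
        // Rows where country is FR OR DE.
        System.out.println(or(index.get("country=FR"), index.get("country=DE")));
    }
}
```

With the sample rows above, the conjunction selects rows 0 and 3 and the disjunction selects rows 1 and 4. Each additional dimension in a query costs one more bitwise operation over the bitmaps, which is why this layout suits multidimensional boolean queries.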

Advanced AOP with Guice Type Listeners

There are cross-cutting concerns, or aspects, in any non-trivial program. These blocks of code tend to be repetitive, unrelated to business logic, and don't lend themselves to being factored out. If you have ever added the same statement at the start of several methods, you have encountered an aspect. For instance, audit, instrumentation, authentication and authorisation could all be … Continue reading Advanced AOP with Guice Type Listeners

Lifecycle Management with Guice Provision Listeners

Typically in a Java web application you will have services with resources which need lifecycle management - at the very least closing gracefully at shutdown. If you'd use a sledgehammer to crack a walnut, there's Spring, which will do this for you with init and destroy methods. I'll explain why I dislike Spring in another post. You … Continue reading Lifecycle Management with Guice Provision Listeners

Tuning Spark Back Pressure by Simulation

Spark back pressure, which can be enabled by setting spark.streaming.backpressure.enabled=true, dynamically resizes batches so as to avoid queue build-up. It is implemented using a Proportional Integral Derivative (PID) algorithm. This algorithm has some interesting properties, the most interesting of which is, in contrast with TCP-style probing algorithms, the lack of a guarantee of a stable fixed point. This … Continue reading Tuning Spark Back Pressure by Simulation
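For readers unfamiliar with PID control, the following is a generic textbook-style sketch of a discrete PID update applied to an ingestion rate. It is not Spark's rate estimator: the class `PidRateSketch`, its gains and the fixed processing capacity are assumptions chosen only to show the shape of the calculation.

```java
class PidRateSketch {
    private final double kp; // proportional gain
    private final double ki; // integral gain
    private final double kd; // derivative gain
    private double integral = 0.0;
    private double lastError = 0.0;

    PidRateSketch(double kp, double ki, double kd) {
        this.kp = kp;
        this.ki = ki;
        this.kd = kd;
    }

    // One control step: error is how far the current ingestion rate
    // exceeds the rate at which records are actually being processed.
    double nextRate(double currentRate, double processingRate, double dt) {
        double error = currentRate - processingRate;
        integral += error * dt;
        double derivative = (error - lastError) / dt;
        lastError = error;
        double adjusted = currentRate - kp * error - ki * integral - kd * derivative;
        return Math.max(0.0, adjusted); // a rate cannot be negative
    }

    // Runs the controller against a constant (hypothetical) processing capacity.
    static double[] simulate(double kp, double ki, double kd,
                             double startRate, double capacity, int steps) {
        PidRateSketch pid = new PidRateSketch(kp, ki, kd);
        double[] rates = new double[steps];
        double rate = startRate;
        for (int i = 0; i < steps; i++) {
            rate = pid.nextRate(rate, capacity, 1.0);
            rates[i] = rate;
        }
        return rates;
    }

    public static void main(String[] args) {
        // A gentle proportional gain approaches the capacity of 60 step by step.
        System.out.println(java.util.Arrays.toString(simulate(0.5, 0.0, 0.0, 100, 60, 5)));
        // An aggressive gain overshoots and swings between extremes instead of settling.
        System.out.println(java.util.Arrays.toString(simulate(2.5, 0.0, 0.0, 100, 60, 5)));
    }
}
```

Whether the rate settles depends entirely on the gains relative to the system being controlled, which is one way to read the point about the absence of a guaranteed stable fixed point.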

TCP Congestion Control and Spark Back Pressure

Bandwidth in the Internet is a rivalrous resource; consumption by any consumer grants utility and inhibits consumption by other consumers. Unless consumers sometimes relent in their pursuit of bandwidth, buffers on routers will spend most of their time discarding excess packets, making point-to-point connections unsustainable. Stopping the Internet from collapsing requires a distributed algorithm along the … Continue reading TCP Congestion Control and Spark Back Pressure

Concise Binary Object Representation

Concise Binary Object Representation (CBOR), defined by RFC 7049, is a binary, typed, self-describing serialisation format. In contrast with JSON, it is binary and distinguishes properly between different sizes of primitive type. In contrast with Avro and Protobuf, it is self-describing and can be used without a schema. It goes without saying for all binary … Continue reading Concise Binary Object Representation
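As a small illustration of how CBOR distinguishes sizes, the sketch below hand-encodes unsigned integers (major type 0) following the initial-byte rules in RFC 7049: values below 24 fit in the initial byte, and larger values are prefixed with 0x18, 0x19, 0x1a or 0x1b followed by 1, 2, 4 or 8 big-endian bytes. The class `CborUnsignedSketch` and its method names are hypothetical; a real application would use a CBOR library, and this sketch only handles values that fit in a non-negative Java `long`.

```java
import java.io.ByteArrayOutputStream;

class CborUnsignedSketch {
    // Encodes a non-negative long as a CBOR unsigned integer (major type 0),
    // choosing the shortest of the available widths.
    static byte[] encodeUnsigned(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("sketch only handles non-negative values");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value < 24) {
            out.write((int) value);          // value lives in the initial byte
        } else if (value <= 0xFFL) {
            out.write(0x18);                 // one-byte argument follows
            writeBigEndian(out, value, 1);
        } else if (value <= 0xFFFFL) {
            out.write(0x19);                 // two-byte argument follows
            writeBigEndian(out, value, 2);
        } else if (value <= 0xFFFFFFFFL) {
            out.write(0x1a);                 // four-byte argument follows
            writeBigEndian(out, value, 4);
        } else {
            out.write(0x1b);                 // eight-byte argument follows
            writeBigEndian(out, value, 8);
        }
        return out.toByteArray();
    }

    private static void writeBigEndian(ByteArrayOutputStream out, long value, int width) {
        for (int shift = 8 * (width - 1); shift >= 0; shift -= 8) {
            out.write((int) (value >>> shift) & 0xFF);
        }
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (long v : new long[] {10, 24, 1000, 1000000}) {
            System.out.println(v + " -> " + hex(encodeUnsigned(v)));
        }
    }
}
```

The number 10 takes a single byte while 1000000 takes five, whereas JSON would spend one byte per decimal digit and carry no width information at all.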

Perpetual Kerberos Login in Hadoop

Kerberos is the only real option for securing a Hadoop cluster. When deploying custom services into a cluster with Kerberos enabled, authentication can quickly become a cross-cutting concern. Kerberos Basics First, a brief introduction to basic Kerberos mechanisms. In each realm there is a Key Distribution Centre (KDC) which issues different types of tickets. A KDC has two … Continue reading Perpetual Kerberos Login in Hadoop