Bandwidth on the Internet is a rivalrous resource: consumption by one consumer grants utility to that consumer and inhibits consumption by others. Unless consumers sometimes relent in their pursuit of bandwidth, router buffers would spend most of their time discarding excess packets, making point-to-point connections unsustainable. Stopping the Internet from collapsing requires a distributed algorithm along the lines of the old adage “do as you would be done by”. Such algorithms have existed for decades, are well understood, and have some interesting properties that can be applied to throughput problems in application development.
In 1986, throughput on the campus network at UC Berkeley, consisting of three links, fell from 32 Kbps to 40 bps without explanation. Van Jacobson realised that this was caused by overloaded buffers at routers: an excessive fraction of packets was being dropped, preventing TCP connections from being established and sustained. To optimise aggregate bandwidth, utilisation must be kept at a level where packet loss at routers is minimised. One degenerate solution would be to grant all bandwidth to a single consumer, with obvious drawbacks; any allocation should be fair and stable. A centralised mediator, such as a government, is computationally infeasible, so the algorithm must be decentralised.
Van Jacobson devised an algorithm, implemented in TCP clients, which uses packet loss (manifested as corruption or timeouts) as a congestion indicator. Each client maintains a congestion window, which limits the number of unacknowledged packets in flight at any point in time. When congestion is detected, the client backs off. The algorithm splits the lifecycle of a TCP connection into two phases: Slow Start and Congestion Avoidance.
- Slow Start occurs when a connection is created or after a packet times out:
- Initialise the congestion window (cwnd) to one maximum segment size (MSS).
- For each acknowledged packet, increase cwnd by one MSS until the slow start threshold (ssthresh) is reached or a packet is lost. This is more or less exponential increase per round trip (and favours connections with short round trips).
- When a loss event occurs, set ssthresh to half the current cwnd and resume slow start.
- Congestion avoidance starts when the slow start threshold is reached.
- Every time a full window of cwnd packets is acknowledged, add one MSS to cwnd. This is linear (additive) increase.
- When a loss event occurs, set ssthresh to half the current cwnd, halve cwnd, and resume linear increase.
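The phases above can be sketched as a toy per-round-trip simulation. This is a sketch, not a faithful TCP implementation: the loss model is a made-up probability proportional to the window, and loss reacts Reno-style (halve cwnd) rather than distinguishing timeouts from fast retransmits.

```python
import random

MSS = 1  # measure cwnd in segments, so one MSS is one unit


def simulate(rounds=100, loss_p=0.0, ssthresh=32, seed=42):
    """Toy per-RTT trace of slow start + congestion avoidance.

    loss_p is a hypothetical per-segment loss probability; each round, the
    chance of a loss event grows with the window (bigger windows risk more).
    """
    random.seed(seed)
    cwnd = MSS
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        if random.random() < loss_p * cwnd:
            ssthresh = max(cwnd // 2, 1)  # halve the threshold on loss
            cwnd = max(cwnd // 2, 1)      # multiplicative decrease
        elif cwnd < ssthresh:
            cwnd *= 2                     # slow start: exponential per RTT
        else:
            cwnd += MSS                   # congestion avoidance: additive
    return history
```

With no losses, the trace shows the exponential ramp up to ssthresh followed by the linear climb; with `loss_p > 0` it produces the sawtooth described below.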
The algorithm aims to spend as much time as possible in congestion avoidance to maintain stability, and to find a fair value of the slow start threshold with respect to all consumers. The time series of cwnd over a TCP connection traces a sawtooth after the initial exponential increase.
This algorithm prevents congestion collapse and can be shown to be stable at an aggregate level. However, it has some issues. For instance, it favours short connections (those with short round trip times) during both slow start and congestion avoidance: given the choice between a short oversubscribed route and a long undersubscribed route, the algorithm will choose the short oversubscribed one. In particular, the linear increase during congestion avoidance is too slow, which means too much time is spent away from the optimal bandwidth.
Since Jacobson’s algorithm was devised, many variations on its theme have been proposed and some implemented, all aiming to maximise the time spent at a Pareto-optimal per-connection bandwidth. For instance, the BIC (Binary Increase Congestion control) algorithm replaces the linear congestion avoidance phase with a binary search. CUBIC, which replaces the linear phase with a cubic function (its inflection point set at the level of cwnd at the last congestion event), became the default algorithm in the Linux kernel from 2.6.19. From kernel 3.2 onwards, loss recovery was handled by Proportional Rate Reduction, devised at Google, which does not wait for the acknowledgement of outstanding packets before retransmitting.
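CUBIC's growth curve is simple enough to write down. A minimal sketch of the window function from the CUBIC paper, using the commonly cited constants (C = 0.4, multiplicative decrease factor beta = 0.3); the parameter names here are my own:

```python
def cubic_window(t, w_max, c=0.4, beta=0.3):
    """CUBIC window as a function of time since the last loss event.

    w_max is cwnd at the last congestion event; the cubic's inflection
    point sits at (k, w_max), so growth is fast just after a loss,
    flattens near the old optimum, then probes upwards again.
    """
    k = (w_max * beta / c) ** (1 / 3)  # time taken to climb back to w_max
    return c * (t - k) ** 3 + w_max
```

At t = 0 the window is (1 - beta) * w_max, i.e. the post-loss window, and at t = k it passes exactly through w_max, the inflection point.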
When Spark back pressure is enabled and a queue of micro-batches builds up, Spark automatically shrinks the batch size. The upstream system can either block or hold on to the unconsumed messages for a little longer. When the backlog has cleared, Spark starts increasing the batch size again. I haven’t yet figured out whether Spark is explicitly optimising for minimal queue length, minimal queued bytes, maximal throughput or minimal latency, but the act of varying the batch size is a search for an optimum.
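For reference, the feature is switched on through configuration. A sketch of the relevant flags, assuming a Kafka direct stream; the jar name and rate values are placeholders:

```shell
# Enable back pressure so Spark adapts the ingestion rate to the backlog.
spark-submit \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.backpressure.initialRate=1000 \
  --conf spark.streaming.kafka.maxRatePerPartition=5000 \
  my-streaming-job.jar  # placeholder
```

The max-rate setting caps the batch size even when back pressure would otherwise let it grow.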
Though the feature works well, decreases are often more aggressive than necessary and increases slower than feasible. Sometimes queues build up for intrinsic reasons: the job is too slow to service the batch size at the desired polling frequency. Sometimes the build-up is caused extrinsically, for instance by temporary cluster overload. If the back pressure is too aggressive and the cause is extrinsic, then unless the algorithm probes aggressively, recovery will be delayed. It would be interesting to vary the batch sizing algorithm along the lines of the variations on Jacobson’s algorithm, such as replacing linear increase with binary recovery/probing, and to assess optimality (choose one or more of: maximise throughput, minimise latency, minimise risk of OOM) and stability (for what proportion of time does the batch size stay at the optimal level?).
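To make the binary recovery/probing idea concrete, here is a hypothetical BIC-style batch-size controller; this is my own sketch, not Spark's actual algorithm. It binary-searches between the last size known to be sustainable and the last size that caused a backlog, so recovery after an extrinsic cut takes logarithmically many steps rather than linearly many:

```python
class BinaryProbe:
    """Hypothetical BIC-style batch-size controller (a sketch).

    Keeps [low, high] as the interval between the largest sustainable
    batch size seen and the smallest size known to cause a backlog, and
    repeatedly bisects it.
    """

    def __init__(self, size, max_size):
        self.size = size      # current batch size
        self.low = size       # largest size known to be sustainable
        self.high = max_size  # smallest size known to cause a backlog

    def update(self, backlog_growing):
        """Feed in one observation per batch; returns the next batch size."""
        if backlog_growing:
            self.high = self.size                       # too big: cut
            self.size = max((self.low + self.size) // 2, 1)
        else:
            self.low = self.size                        # sustainable: probe up
            self.size = (self.size + self.high) // 2
        return self.size
```

Starting at 100 records with a 1600-record cap, one healthy batch jumps the size to 850; a backlog on the next batch pulls it back to 475, narrowing the search interval each time.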