Worst-Case Distributed Systems Design | Peter Bailis

03 Feb 2015

Designing distributed systems that handle worst-case scenarios gracefully can—perhaps surprisingly—improve average-case behavior as well.

When designing CAP “AP” (or coordination-free) systems, we typically analyze their behavior under worst-case environmental conditions. We often make statements of the form “in the event of arbitrary loss of network connectivity between servers, every request received by a non-failing server in the system will result in a response.” These claims pertain to behavior under worst-case scenarios (“arbitrary loss of network connectivity”), which is indeed useful (e.g., DBAs don’t get paged when a network partition strikes). However, taken at face value, this language can lead to confusion (e.g., “in my experience, network failures don’t occur, so why bother?”). More importantly, this language tends to obscure a more serious benefit of this kind of distributed systems design.

When we design coordination-free systems that don’t have to communicate in the worst case, we’re designing systems that don’t have to communicate in the average case, either. If my servers don’t have to communicate to guarantee a response in the event of network partitions, they also don’t have to pay the cost of a round-trip within a datacenter or, even better, across geographically distant datacenters. I can add more servers and make use of them—without placing additional load on my existing cluster! I’m able to achieve effectively indefinite scale-out—simply because my servers never have to communicate on the fast-path.
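To make "coordination-free on the fast path" concrete, here is a minimal, hypothetical sketch of a last-writer-wins replica in Python: reads and writes complete against local state only, and replicas reconcile later via background anti-entropy. The names (LWWReplica, merge) and the tie-breaking rule are illustrative assumptions, not drawn from any particular system.

```python
import time

class LWWReplica:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.store = {}  # key -> (timestamp, replica_id, value)

    def write(self, key, value):
        # Served entirely from local state: no round trip to other replicas.
        self.store[key] = (time.time(), self.replica_id, value)

    def read(self, key):
        # Also purely local; may return stale data, but always returns.
        entry = self.store.get(key)
        return entry[2] if entry else None

    def merge(self, other_store):
        # Background anti-entropy: keep the entry with the larger
        # (timestamp, replica_id) pair. This runs off the request path,
        # so it adds no fast-path latency.
        for key, entry in other_store.items():
            if key not in self.store or entry[:2] > self.store[key][:2]:
                self.store[key] = entry

# Two replicas accept writes independently (even while partitioned),
# then reconcile once the network allows.
a, b = LWWReplica("a"), LWWReplica("b")
a.write("x", 1)
b.write("x", 2)
a.merge(b.store)
b.merge(a.store)
assert a.read("x") == b.read("x")  # converged without fast-path coordination
```

Because the request path touches only local state, adding replicas adds capacity without adding coordination load on the existing ones; the price is that reads may be stale until anti-entropy runs.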

This notion of worst-case-improves-average-case is particularly interesting because designing for the worst case doesn’t always work out so nicely. For example, when I bike to my lab, I put on a helmet to guard against collisions, knowing that my helmet will help in some but not all situations. But my helmet is no real match for a true worst case—say, a large meteor, or maybe just an eighteen-wheeler. To adequately defend myself against an eighteen-wheeler, I’d need more serious protection that’d undoubtedly encumber my bicycling. By handling the worst case, I lose in the average case. In fact, this pattern of worst-case-degrades-average-case is common, particularly in the real world: consider automotive design, building architecture, and processor design (e.g., for thermal, voltage, and process variations).1 Often, there’s a pragmatic trade-off between how much we’re willing to pay to handle extreme conditions and how much we’re willing to pay in terms of average-case performance.

So why do distributed systems exhibit this pattern? One possibility is that networks are unpredictable and, in the field, pretty terrible to work with. Despite exciting advances in networking research, we still don’t have reliable SLAs from our networks. A line of research in distributed computing has asked what we could do if we had better-behaved networks (e.g., with bounded delay)—but we don’t yet.2 Given the inability to (easily and practically) distinguish between message delays, link failures, and (both permanent and transient) server failures, we do well by assuming the worst. Essentially, the defining feature of our distributed systems—the network—encourages and rewards us for minimizing our reliance on it.
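As a toy illustration of that ambiguity, consider the hypothetical timeout-based request below (a sketch, not any particular RPC library): from the caller's side, a delayed reply, a dropped message, and a crashed server all look identical, so a robust design has to behave safely in every case.

```python
import queue, threading, time

def rpc_with_timeout(send, timeout=0.5):
    """Issue a request and wait at most `timeout` seconds for a reply."""
    reply = queue.Queue()
    threading.Thread(target=lambda: send(reply), daemon=True).start()
    try:
        return reply.get(timeout=timeout)
    except queue.Empty:
        # Slow network? Lost message? Dead server? The caller cannot tell.
        return "suspected failure"

def slow_but_alive(reply):      # message delay: the reply arrives too late
    time.sleep(2); reply.put("pong")

def crashed(reply):             # server failure: no reply ever comes
    pass

print(rpc_with_timeout(slow_but_alive))  # -> "suspected failure"
print(rpc_with_timeout(crashed))         # -> "suspected failure"
```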

Over time, I’ve gained a greater appreciation for the subtle power of this worst-case thinking. It’s often instrumental in determining the fundamental overheads of a given design rather than superficial (albeit important) differences in implementations or engineering quality.3 It’s a clean (and often elegant) way to reason about system behavior and is a useful tool for systems architects. We’d do well to pay more attention to this pattern while fixating less on failures. Although formulations such as Abadi’s PACELC are a step in the right direction, the connections between latency, availability, and scale-out performance deserve more attention.
