Worst-Case Distributed Systems Design | Peter Bailis
03 Feb 2015
Designing distributed systems that handle worst-case scenarios gracefully can—perhaps surprisingly—improve average-case behavior as well.
When designing CAP “AP” (or coordination-free) systems, we typically analyze their behavior under worst-case environmental conditions. We often make statements of the form “in the event of arbitrary loss of network connectivity between servers, every request received by a non-failing server in the system will result in a response.” These claims pertain to behavior under worst-case scenarios (“arbitrary loss of network connectivity”), which is indeed useful (e.g., DBAs don’t get paged when a network partition strikes). However, taken at face value, this language can lead to confusion (e.g., “in my experience, network failures don’t occur, so why bother?”). More importantly, this language tends to obscure a more serious benefit of this kind of distributed systems design.
When we design coordination-free systems that don’t have to communicate in the worst case, we’re designing systems that don’t have to communicate in the average case, either. If my servers don’t have to communicate to guarantee a response in the event of network partitions, they also don’t have to pay the cost of a round-trip within a datacenter or, even better, across geographically distant datacenters. I can add more servers and make use of them—without placing additional load on my existing cluster! I’m able to achieve effectively indefinite scale-out—simply because my servers never have to communicate on the fast-path.
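To make the latency argument concrete, here is a minimal sketch (mine, not from the post): a hypothetical `Replica` class, a simulated 50 ms cross-datacenter round trip, and made-up datacenter names, contrasting a write that waits on every replica with one that acknowledges locally and replicates off the fast path.

```python
import threading
import time

# Hypothetical replica holding a purely local key-value store.
class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}

    def apply(self, key, value):
        # Simulate a cross-datacenter round trip before the write lands.
        time.sleep(0.05)
        self.store[key] = value

def coordinated_write(replicas, key, value):
    """Block until every replica acknowledges: latency tracks the slowest replica."""
    for r in replicas:
        r.apply(key, value)
    return "ok"

def coordination_free_write(local, others, key, value):
    """Acknowledge after the local write; replicate asynchronously off the fast path."""
    local.store[key] = value  # no network round trip before responding
    for r in others:
        threading.Thread(target=r.apply, args=(key, value), daemon=True).start()
    return "ok"

local, *remote = [Replica(n) for n in ("us-east", "eu-west", "ap-south")]

start = time.time()
coordinated_write([local] + remote, "x", 1)
print(f"coordinated:       {time.time() - start:.3f}s")

start = time.time()
coordination_free_write(local, remote, "x", 2)
print(f"coordination-free: {time.time() - start:.3f}s")
time.sleep(0.1)  # let the background replication drain before exit
```

In this toy setup, the coordinated path's latency grows with the slowest replica (and with every replica you add), while the coordination-free path answers in local time no matter how large the cluster gets, which is the scale-out point above.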
This notion of worst-case-improves-average-case is particularly interesting because designing for the worst case doesn’t always work out so nicely. For example, when I bike to my lab, I put on a helmet to guard against collisions, knowing that my helmet will help in some but not all situations. But my helmet is no real match for a true worst case—say, a large meteor, or maybe just an eighteen-wheeler. To adequately defend myself against an eighteen-wheeler, I’d need more serious protection that’d undoubtedly encumber my bicycling. By handling the worst case, I lose in the average case. In fact, this pattern of worst-case-degrades-average-case is common, particularly in the real world: consider automotive design, building architecture, and processor design (e.g., for thermal, voltage, and process variations).1 Often, there’s a pragmatic trade-off between how much we’re willing to pay to handle extreme conditions and how much we’re willing to pay in terms of average-case performance.
So why do coordination-free distributed systems exhibit the opposite pattern? One possibility is that networks are unpredictable and, in the field, pretty terrible to work with. Despite exciting advances in networking research, we still don’t have reliable SLAs from our networks. A line of research in distributed computing has asked what we could do if we had better-behaved networks (e.g., with bounded delay), but we (still) don’t have them.2 Given the inability to (easily and practically) distinguish between message delays, link failures, and (both permanent and transient) server failures, we do well by assuming the worst. Essentially, the defining feature of our distributed systems—the network—encourages and rewards us for minimizing our reliance on it.
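As a toy illustration of that indistinguishability (a sketch of mine with a made-up host and port, not anything from the post): from the client's point of view, a crashed server, a severed link, and a merely slow response all collapse into the same timeout.

```python
import socket

# From the caller's side, a crashed server, a partitioned link, and a
# merely slow peer can all surface as the same timeout.
def probe(host: str, port: int, timeout_s: float = 1.0) -> str:
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "responded"
    except socket.timeout:
        # Slow network? Dropped link? Dead server? The timeout alone
        # cannot tell us, so we have to plan for the worst case.
        return "timed out (cause unknown)"
    except OSError as e:
        return f"unreachable: {e}"

print(probe("10.0.0.1", 9999))  # hypothetical endpoint for illustration
```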
Over time, I’ve gained a greater appreciation for the subtle power of this worst-case thinking. It’s often instrumental in identifying the fundamental overheads of a given design, as opposed to superficial (albeit important) differences in implementation or engineering quality.3 It’s a clean (and often elegant) way to reason about system behavior and is a useful tool for systems architects. We’d do well to pay more attention to this pattern and fixate less on failures. Although formulations such as Abadi’s PACELC are a step in the right direction, the connections between latency, availability, and scale-out performance deserve more attention.