Lightweight Fault-Tolerance
As distributed computing becomes commonplace, and many more applications are faced with the current costs of high availability, there is a fresh need for recovery-based techniques that combine high performance during failure-free executions with fast recovery. However, although the literature contains approximately 300 papers in this area, rollback recovery is seldom used in practice to build reliable distributed applications. The Lightweight Fault-Tolerance (LiFT) project focuses on changing this state of affairs with an approach that blends algorithmic work, systems building, and empirical analysis.
Highlights
- Causal Logging Protocols. We have developed the first formal specification of the consistency condition common to all rollback recovery protocols, and we have derived from it Causal Logging, a novel technique that eliminates the traditional performance tradeoffs between pessimistic and optimistic protocols. Causal logging protocols perform as well as optimistic protocols during failure-free executions, but, like pessimistic protocols, never roll back correct processes during crash recovery.
- The Egida Toolkit. Research in rollback recovery has long suffered from a flourishing of algorithmic results that are rarely supported by careful experimental evaluation of their practical significance. As a result, the performance of these algorithms in practice is not well understood, and little attention has been given to simplifying the difficult task of integrating rollback recovery protocols with applications. To address these issues, we have developed the Egida toolkit. Egida’s design addresses for the first time the fundamental problem of characterizing the set of functionalities that are at the core of all message-logging protocols. This characterization, based on a framework for handling non-determinism in a process execution, gives Egida the expressiveness to encompass the diversity of rollback recovery protocols. A protocol is specified using a simple, high-level language; the protocol’s implementation is synthesized from this specification by gluing together appropriate modules from a library. As a result, Egida allows the implementation of arbitrary rollback recovery protocols with minimal programming effort.
- Understanding the Cost of Recovery. Using Egida, we have performed the first study of the recovery performance of message-logging protocols. This study has revealed that no existing protocol can simultaneously guarantee low overhead during failure-free execution, fast recovery, and fault containment, leaving applications to face a complex tradeoff. To eliminate this tradeoff, we have developed a new class of protocols that never roll back correct processes, recover quickly (within 2% of the protocol with the best recovery time), and impose little overhead during failure-free execution (within 2% of the protocol with the best failure-free time).
- An analysis of Communication Induced Checkpointing
- Efficient support of file I/O. Traditional rollback recovery techniques treat the file system as part of the "outside world". As a result, processes may be forced to execute a blocking output commit protocol whenever they interact with the file system.
We have derived a new protocol that integrates records and efficiently replicates the information necessary to reproduce file I/O operations during recovery. Our simulation studies show that this approach eliminates all synchronous logging to stable storage, thereby reducing the cost of performing file I/O dramatically.
To confirm the simulation results, I am building a new file system based on these protocols.
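The causal logging idea from the first highlight can be illustrated with a minimal sketch. It is not the project's implementation: the class and function names (`Determinant`, `Process`, `recover_delivery_order`) are hypothetical, determinants are kept only in volatile memory, and the sketch piggybacks the entire determinant log on every message, whereas a practical protocol would bound what is sent. The point it shows is that any process whose state causally depends on a message delivery also holds the record of that delivery, so a crashed process can replay its deliveries without rolling back correct processes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Determinant:
    """Record of one nondeterministic event: the delivery of a message."""
    sender: str
    send_seq: int   # sender's send sequence number
    receiver: str
    recv_seq: int   # position in the receiver's delivery order


@dataclass
class Message:
    sender: str
    send_seq: int
    payload: str
    piggyback: frozenset  # determinants the sender's state depends on


@dataclass
class Process:
    name: str
    send_seq: int = 0
    recv_seq: int = 0
    # Volatile log: this process's own deliveries plus every determinant
    # piggybacked on the messages it has delivered.
    det_log: set = field(default_factory=set)

    def send(self, payload: str) -> Message:
        self.send_seq += 1
        # Assumption of this sketch: piggyback the whole log on each message.
        return Message(self.name, self.send_seq, payload, frozenset(self.det_log))

    def deliver(self, msg: Message) -> None:
        self.recv_seq += 1
        self.det_log |= msg.piggyback
        self.det_log.add(
            Determinant(msg.sender, msg.send_seq, self.name, self.recv_seq)
        )


def recover_delivery_order(crashed: str, survivors: list) -> list:
    """Gather the crashed process's determinants from survivors, in delivery order."""
    dets = {d for p in survivors for d in p.det_log if d.receiver == crashed}
    return sorted(dets, key=lambda d: d.recv_seq)


if __name__ == "__main__":
    p, q, r = Process("p"), Process("q"), Process("r")
    m1, m2 = p.send("a"), r.send("b")
    q.deliver(m2)          # q delivers r's message first...
    q.deliver(m1)          # ...then p's message
    p.deliver(q.send("c"))  # p now depends on q's delivery order
    # q crashes; p still holds the determinants of q's two deliveries.
    print(recover_delivery_order("q", [p, r]))
```

In this sketch, a delivery whose determinant was never piggybacked to another process is lost with the crash, but by construction no surviving process depends on it, so replaying only the recovered deliveries leaves the survivors consistent.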
Current Focus
- A Fault-Tolerant JVM
- Fault-Tolerance and Security. While the need for protecting applications from security attacks is universally recognized, little attention has been given to the problem of securing the software that applications rely upon for fault-tolerance. This problem is especially acute for rollback recovery protocols, in which a malicious party who alters the information used during recovery can affect the state to which a faulty process is restored and introduce a Trojan horse. In addition, denial-of-service attacks can force a process to fail. We plan to exploit Egida’s extensibility to design and evaluate new secure rollback recovery protocols.
- Support for self-tuning fault-tolerance protocols. Currently there are no simple guidelines that can help even experts choose the most efficient protocol for a given application in a given execution environment. We plan to leverage the simplicity with which different protocols can be implemented within Egida to develop the understanding necessary to articulate these guidelines. Our goal is to use these insights to allow Egida to monitor the execution environment and the application behavior and to select automatically the fault-tolerance protocol that best suits them.
## Publications