The Weakest Failure Detectors to Solve Quittable Consensus and Nonblocking Atomic Commit

The weakest failure detectors to solve certain fundamental problems in distributed computing

Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing - PODC '04, 2004

We determine the weakest failure detectors to solve several fundamental problems in distributed message-passing systems, for all environments, i.e., regardless of the number and timing of crashes. The problems that we consider are: implementing an atomic register, solving consensus, solving quittable consensus (a variant of consensus in which processes have the option to decide 'quit' if a failure occurs), and solving non-blocking atomic commit.

The weakest failure detector for solving consensus

1992

We determine what information about failures is necessary and sufficient to solve Consensus in asynchronous distributed systems subject to crash failures. In Chandra and Toueg [1996], it is shown that ◇W, a failure detector that provides surprisingly little information about which processes have crashed, is sufficient to solve Consensus in asynchronous systems with a majority of correct processes. In this paper, we prove that to solve Consensus, any failure detector has to provide at least as much information as ◇W. Thus, ◇W is indeed the weakest failure detector for solving Consensus in asynchronous systems with a majority of correct processes.

Implementing the weakest failure detector for solving the consensus problem

International Journal of Parallel, Emergent and Distributed Systems, 2013

The concept of unreliable failure detector was introduced by Chandra and Toueg as a mechanism that provides information about process failures. This mechanism has been used to solve several agreement problems, like Consensus. In this paper, algorithms that implement failure detectors in partially synchronous systems are presented. First, two simple algorithms of the weakest class to solve Consensus, namely the Eventually Strong class (◇S), are presented. While the first algorithm is wait-free, the second is f-resilient, where f is a known upper bound on the number of faulty processes. Both algorithms guarantee that, eventually, all the correct processes agree permanently on a common correct process, i.e., they also implement a failure detector of the class Omega (Ω). They are also shown to be optimal in terms of the number of communication links used forever. Additionally, a wait-free algorithm that implements a failure detector of the Eventually Perfect class (◇P) is presented. This algorithm is shown to be optimal in terms of the number of bidirectional links used forever. (Research partially supported by the Spanish Research Council, under grants TIN2005-09198-C02-01, TIN2007-67353-C02-02, and TIN2008-06735-C02-01, and the Comunidad de Madrid, under grant S-0505/TIC/0285. A preliminary version of this article was presented at SRDS'2000 [22].)

Simple CHT: A new derivation of the weakest failure detector for consensus

The paper proposes an alternative proof that Ω, an oracle that outputs a process identifier and guarantees that eventually the same correct process identifier is output at all correct processes, provides minimal information about failures for solving consensus in read-write shared-memory systems: every oracle that gives enough failure information to solve consensus can be used to implement Ω.
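The Ω property described above can be illustrated with a small check over a finite recorded trace. This is only a sketch under stated assumptions: the function name `omega_stabilization_time`, the trace layout (one list of leader outputs per process, indexed by step), and the idea of approximating "eventually" by "until the end of the trace" are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Set


def omega_stabilization_time(
    outputs: Dict[str, List[str]],
    correct: Set[str],
) -> Optional[int]:
    """Return the earliest step t such that, from t to the end of the trace,
    every correct process outputs the same correct process id.

    Returns None if the trace never stabilizes in that sense.
    Assumption: a finite trace stands in for the infinite run of the model.
    """
    if not correct:
        return None
    # Only compare steps that every correct process has recorded.
    horizon = min(len(outputs[p]) for p in correct)
    if horizon == 0:
        return None
    # The only possible stable leader is the one output at the last step.
    final = {outputs[p][horizon - 1] for p in correct}
    if len(final) != 1:
        return None
    leader = next(iter(final))
    if leader not in correct:
        return None
    # Walk backwards while all correct processes still agree on the leader.
    t = horizon
    while t > 0 and all(outputs[p][t - 1] == leader for p in correct):
        t -= 1
    return t


if __name__ == "__main__":
    trace = {
        "p1": ["p3", "p2", "p2", "p2"],
        "p2": ["p1", "p1", "p2", "p2"],
        "p3": ["p3", "p3"],  # p3 crashes after two steps
    }
    print(omega_stabilization_time(trace, correct={"p1", "p2"}))
```

A finite trace can only suggest, never prove, that an oracle implements Ω, since the property is about the infinite suffix of a run.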

Distributed Consensus, revisited

Acta Informatica, 2007

We provide a novel model to formalize a well-known algorithm, by Chandra and Toueg, that solves Consensus among asynchronous distributed processes in the presence of a particular class of failure detectors (◇S or, equivalently, Ω), under the hypothesis that only a minority of processes may crash. The model is defined as a global transition system that is unambiguously generated by local transition rules. The model is syntax-free in that it does not refer to any form of programming language or pseudocode. We use our model to formally prove that the algorithm is correct. (The original publication is available at www.springerlink.com.) Footnote 1: Actually, the algorithm may easily reach system configurations in which, at a certain point in time, every process is coordinator in its current round, while all processes are in pairwise different rounds, by having every participant simply always suspect the respective coordinator. Analogously, the algorithm may easily reach moments in which none of the processes is the coordinator of its round. Moreover, in such a moment, it is impossible to predict, from a chronological point of view, which process will next become coordinator.

Consensus based on failure detectors with a perpetual accuracy property

Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000

This paper is on the Consensus problem, in the context of asynchronous distributed systems made of n processes, of which at most f may crash. A family of failure detector classes satisfying a Perpetual Accuracy property is first defined. This family includes the failure detector class S (the class of Strong failure detectors defined by Chandra and Toueg) central to the definition of a class S_x, where x is the minimum number (x ≥ 1) of correct processes that can never be suspected to have crashed. Then, a protocol that solves the Consensus problem is given. This protocol works with any failure detector class S_x of this family. It is particularly simple and uses a Reliable Broadcast protocol as a skeleton. It requires n − x + 1 communication steps, and its communication bit complexity is (n − x + 1)(n − 1)|v| (where |v| is the maximal size of an initial value a process can propose).

Initial failures in distributed computations

International Journal of Parallel Programming, 1989

We investigate the possibility of solving problems in completely asynchronous message-passing systems where a number of processes may fail prior to execution. By using game-theoretical notions, necessary and sufficient conditions are provided for solving problems in such a model with and without a termination requirement. An upper bound on the message complexity for solving any problem in the model is given, as well as a simple design concept for constructing a solution to any solvable problem.

Consensus Based on Strong Failure Detectors: A Time and Message-Efficient Protocol

Lecture Notes in Computer Science, 2000

The class of strong failure detectors (denoted S) includes all failure detectors that suspect all crashed processes and that do not suspect some (a priori unknown) process that never crashes. So, a failure detector that belongs to S is intrinsically unreliable as it can arbitrarily suspect correct processes. Several S-based consensus protocols have been designed. Some of them systematically require n computation rounds (n being the number of processes), each round involving n² or n messages. Others allow early decision (i.e., the number of rounds depends on the maximal number of crashes when there are no erroneous suspicions) but require each round to involve n² messages. This paper presents an early-deciding S-based consensus protocol, each round of which involves 3(n − 1) messages. So, the proposed protocol is particularly time- and message-efficient. Moreover, it can easily be generalized to reduce the number of rounds at the price of an increase in the number of messages per round.
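The two defining properties of the class S stated above (every crashed process ends up suspected, and some process that never crashes is never suspected) can be sketched as a check over a finite trace of suspicion sets. The function name `looks_like_strong_detector` and the trace layout are hypothetical, and the finite trace is only an approximation of the "eventually" in the completeness property.

```python
from typing import Dict, List, Set


def looks_like_strong_detector(
    suspects: Dict[str, List[Set[str]]],
    correct: Set[str],
    crashed: Set[str],
) -> bool:
    """Check a finite trace against the two properties of the class S.

    suspects[p][t] is the set of processes that p suspects at step t.
    Assumption: the last recorded step stands in for "eventually".
    """
    # Accuracy: some correct process is never suspected at any recorded step.
    ever_suspected: Set[str] = set()
    for history in suspects.values():
        for suspected_now in history:
            ever_suspected |= suspected_now
    accuracy = any(p not in ever_suspected for p in correct)

    # Completeness (finite approximation): at the last recorded step,
    # every correct process suspects every crashed process.
    completeness = all(
        bool(suspects[p]) and crashed <= suspects[p][-1] for p in correct
    )
    return accuracy and completeness


if __name__ == "__main__":
    trace = {
        "a": [set(), {"b"}, {"b", "c"}, {"c"}],  # a wrongly suspects b for a while
        "b": [set(), {"c"}, {"c"}, {"c"}],
    }
    print(looks_like_strong_detector(trace, correct={"a", "b"}, crashed={"c"}))
```

In the example, process `a` wrongly suspects the correct process `b` for a while, which the class S permits as long as at least one correct process (here `a`) is never suspected.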

Implementing the Weakest Failure Detector for Solving Consensus

The concept of unreliable failure detector was introduced by Chandra and Toueg as a mechanism that provides information about process failures. This mechanism has been used to solve several agreement problems, like Consensus. In this paper, algorithms that implement failure detectors in partially synchronous systems are presented. First, two simple algorithms of the weakest class to solve Consensus, namely the Eventually Strong class (◇S), are presented.

The consensus problem in fault-tolerant computing

ACM Computing Surveys, 1993

The consensus problem is concerned with the agreement on a system status by the fault-free segment of a processor population in spite of the possible inadvertent or even malicious spread of disinformation by the faulty segment of that population. The resulting protocols are useful throughout fault-tolerant parallel and distributed systems and will impact the design of decision systems to come. This paper surveys research on the consensus problem, compares approaches, outlines applications, and suggests directions for future work.