Communication-efficient and crash-quiescent Omega with unknown membership

Implementing the weakest failure detector for solving the consensus problem

International Journal of Parallel, Emergent and Distributed Systems, 2013

The concept of unreliable failure detector was introduced by Chandra and Toueg as a mechanism that provides information about process failures. This mechanism has been used to solve several agreement problems, like Consensus. In this paper, algorithms that implement failure detectors in partially synchronous systems are presented. First, two simple algorithms of the weakest class to solve Consensus, namely the Eventually Strong class (◇S), are presented. While the first algorithm is wait-free, the second is f-resilient, where f is a known upper bound on the number of faulty processes. Both algorithms guarantee that, eventually, all the correct processes agree permanently on a common correct process, i.e., they also implement a failure detector of the class Omega (Ω). They are also shown to be optimal in terms of the number of communication links used forever. Additionally, a wait-free algorithm that implements a failure detector of the Eventually Perfect class (◇P) is presented. This algorithm is shown to be optimal in terms of the number of bidirectional links used forever.

Research partially supported by the Spanish Research Council, under grants TIN2005-09198-C02-01, TIN2007-67353-C02-02, and TIN2008-06735-C02-01, and the Comunidad de Madrid, under grant S-0505/TIC/0285. A preliminary version of this article was presented at SRDS'2000 [22].

Simple CHT: A new derivation of the weakest failure detector for consensus

The paper proposes an alternative proof that Ω, an oracle that outputs a process identifier and guarantees that eventually the same correct process identifier is output at all correct processes, provides minimal information about failures for solving consensus in read-write shared-memory systems: every oracle that gives enough failure information to solve consensus can be used to implement Ω.
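The Ω guarantee described above can be made concrete with a small trace checker. This is only an illustrative sketch, not anything taken from the paper: the function name `omega_stabilization_round`, the round-based trace format, and the process names are all assumptions. A finite trace can only be consistent with Ω; it cannot prove the eventual property.

```python
from typing import Dict, List, Optional, Set


def omega_stabilization_round(
    outputs: List[Dict[str, str]],
    correct: Set[str],
) -> Optional[int]:
    """Return the earliest round from which every correct process outputs the
    same correct leader until the end of the finite trace, or None.

    outputs[r][p] is the identifier output by the oracle at process p in
    round r (a hypothetical, discretized view of the oracle's history).
    """
    stable_from: Optional[int] = None
    leader: Optional[str] = None
    for r, snapshot in enumerate(outputs):
        values = {snapshot[p] for p in correct}
        agreed = next(iter(values)) if len(values) == 1 else None
        if agreed is not None and agreed in correct:
            # Agreement on a correct process: start or continue a stable run.
            if agreed != leader:
                leader, stable_from = agreed, r
        else:
            # Disagreement, or agreement on a faulty process: reset.
            leader, stable_from = None, None
    return stable_from


# Hypothetical trace: process "r" is faulty, "p" and "q" are correct.
trace = [
    {"p": "q", "q": "p", "r": "r"},
    {"p": "p", "q": "p", "r": "r"},
    {"p": "p", "q": "p", "r": "p"},
]
stable = omega_stabilization_round(trace, {"p", "q"})
```

Only the outputs at correct processes are inspected, since Ω places no constraint on what faulty processes see.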

Consensus based on failure detectors with a perpetual accuracy property

Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000

This paper is on the Consensus problem, in the context of asynchronous distributed systems made of n processes, of which at most f may crash. A family of failure detector classes satisfying a Perpetual Accuracy property is first defined. This family includes the failure detector class S (the class of Strong failure detectors defined by Chandra and Toueg) central to the definition of a class (S_x) where x is the minimum number (x ≥ 1) of correct processes that can never be suspected to have crashed. Then, a protocol that solves the Consensus problem is given. This protocol works with any failure detector class (S_x) of this family. It is particularly simple and uses a Reliable Broadcast protocol as a skeleton. It requires n − x + 1 communication steps, and its communication bit complexity is (n − x + 1)(n − 1)|v| (where |v| is the maximal size of an initial value a process can propose).
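As a worked example of the cost figures in this abstract, the sketch below evaluates them for concrete parameters. It assumes the two expressions read as n − x + 1 communication steps and (n − x + 1)(n − 1)|v| bits; the function name `consensus_cost` and the sample values are hypothetical.

```python
from typing import Tuple


def consensus_cost(n: int, x: int, value_bits: int) -> Tuple[int, int]:
    """Return (communication steps, communication bits) under the assumed
    reading: steps = n - x + 1 and bits = (n - x + 1) * (n - 1) * |v|."""
    if not 1 <= x <= n:
        raise ValueError("x must satisfy 1 <= x <= n")
    steps = n - x + 1
    bits = steps * (n - 1) * value_bits
    return steps, bits


# Hypothetical instance: 7 processes, 3 never-suspected correct processes,
# 32-bit proposed values.
steps, bits = consensus_cost(n=7, x=3, value_bits=32)
```

Under this reading, a larger x (more correct processes that are never suspected) lowers both the step count and the bit complexity.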

Leader Election in Arbitrarily Connected Networks with Process Crashes and Weak Channel Reliability

Networked Systems, 2021

A channel from a process p to a process q satisfies the ADD property if there are constants K and D, unknown to the processes, such that in any sequence of K consecutive messages sent by p to q, at least one of them is delivered to q at most D time units after it has been sent. This paper studies implementations of an eventual leader, namely, an Ω failure detector, in an arbitrarily connected network of eventual ADD channels, where processes may fail by crashing. It first presents an algorithm that assumes that processes initially know n, the total number of processes, sending messages of size O(log n). Then, it presents a second algorithm that does not assume the processes know n. Eventually the size of the messages sent by this algorithm is also O(log n). These are the first implementations of leader election in the ADD model. In this model, only eventually perfect failure detectors were considered, sending messages of size O(n log n).
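The ADD property stated above can be checked on a finite record of one channel's message delays. The following is a minimal sketch under assumptions: the function name `satisfies_add` and the representation of lost messages as `None` are illustrative choices, and in the model K and D are unknown to the processes, so a check like this is only available to an outside observer.

```python
from typing import List, Optional


def satisfies_add(delays: List[Optional[float]], K: int, D: float) -> bool:
    """Check the ADD property on a finite sequence of messages from p to q.

    delays[i] is the delivery delay of the i-th message, or None if it was
    never delivered. The property holds iff every window of K consecutive
    messages contains at least one delivered within D time units, i.e. iff
    there is no run of K consecutive late-or-lost messages.
    """
    if K <= 0:
        raise ValueError("K must be positive")
    late_run = 0  # length of the current run of late or lost messages
    for delay in delays:
        if delay is not None and delay <= D:
            late_run = 0
        else:
            late_run += 1
            if late_run >= K:
                return False
    return True


# Hypothetical delays: lost, late, timely, lost, lost, timely.
sample = [None, 5.0, 0.5, None, None, 0.2]
ok_k3 = satisfies_add(sample, K=3, D=1.0)
ok_k2 = satisfies_add(sample, K=2, D=1.0)
```

An eventual ADD channel, as used in the paper, would only need to satisfy this check on some suffix of the message sequence.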

Implementing the Omega failure detector in the crash-recovery failure model

Journal of Computer and System Sciences, 2009

Unreliable failure detectors are mechanisms that provide information about process failures and make it possible to solve several problems in asynchronous systems, e.g., Consensus. A particular failure detector, Omega, provides an eventual leader election functionality. This paper addresses the implementation of Omega in the crash-recovery failure model. We first propose an algorithm assuming that processes are reachable from the correct process that crashes and recovers a minimum number of times. Then, we propose two algorithms which assume only that processes are reachable from some correct process. In addition, one of the algorithms requires the membership to be known a priori, while the other two do not.

Implementing unreliable failure detectors with unknown membership

2006

Unreliable failure detectors [3] are useful devices to solve several fundamental problems in fault-tolerant distributed computing, like consensus or atomic broadcast. In their original work [3], Chandra and Toueg proposed 8 different classes of unreliable failure detectors, and showed that all of them can be used to solve consensus in a crash-prone asynchronous system with reliable links.

On the implementation of unreliable failure detectors in partially synchronous systems

IEEE Transactions on Computers, 2004

Unreliable failure detectors were proposed by Chandra and Toueg as mechanisms that provide information about process failures. Chandra and Toueg defined eight classes of failure detectors, depending on how accurate this information is, and presented an algorithm implementing a failure detector of one of these classes in a partially synchronous system. This algorithm is based on all-to-all communication and periodically exchanges a number of messages that is quadratic in the number of processes. In this paper, we study the implementability of different classes of failure detectors in several models of partial synchrony. We first show that no failure detector with perpetual accuracy (namely, P, Q, S, and W) can be implemented in these models in systems with even a single failure. We also show that, in these models of partial synchrony, a majority of correct processes is necessary to implement a failure detector of the class Θ proposed by Aguilera et al. Then, we present a family of distributed algorithms that implement the four classes of unreliable failure detectors with eventual accuracy (namely, ◇P, ◇Q, ◇S, and ◇W). Our algorithms are based on a logical ring arrangement of the processes, which defines the monitoring and failure information propagation pattern. The resulting algorithms periodically exchange at most a linear number of messages.
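To illustrate why a ring arrangement reduces periodic traffic compared with all-to-all monitoring, here is a rough sketch. It is not the paper's algorithm: the rule "monitor the nearest non-suspected successor" and the count of one message per process per period are simplifying assumptions, and all names are hypothetical.

```python
from typing import List, Optional, Set


def monitored_successor(ring: List[str], me: str, suspected: Set[str]) -> Optional[str]:
    """Return the nearest process after `me` on the logical ring that is not
    currently suspected, or None if every other process is suspected."""
    n = len(ring)
    i = ring.index(me)
    for step in range(1, n):
        candidate = ring[(i + step) % n]
        if candidate not in suspected:
            return candidate
    return None


def all_to_all_messages(n: int) -> int:
    """Messages per period when every process sends to every other process."""
    return n * (n - 1)


def ring_messages(n: int) -> int:
    """Messages per period assuming one monitoring message per process
    (an illustrative simplification of a linear-cost ring scheme)."""
    return n


ring = ["p1", "p2", "p3", "p4"]
target = monitored_successor(ring, "p1", {"p2"})
```

With 100 processes, the all-to-all pattern costs 9900 messages per period under this count, against 100 for the simplified ring.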

The weakest failure detector for solving consensus

1992

We determine what information about failures is necessary and sufficient to solve Consensus in asynchronous distributed systems subject to crash failures. In Chandra and Toueg [1996], it is shown that ◇W, a failure detector that provides surprisingly little information about which processes have crashed, is sufficient to solve Consensus in asynchronous systems with a majority of correct processes. In this paper, we prove that to solve Consensus, any failure detector has to provide at least as much information as ◇W. Thus, ◇W is indeed the weakest failure detector for solving Consensus in asynchronous systems with a majority of correct processes.

Initial failures in distributed computations

International Journal of Parallel Programming, 1989

We investigate the possibility of solving problems in completely asynchronous message passing systems where a number of processes may fail prior to execution. By using game-theoretical notions, necessary and sufficient conditions are provided for solving problems in such a model with and without a termination requirement. An upper bound on the message complexity for solving any problem in the model is given, as well as a simple design concept for constructing a solution to any solvable problem.

Failure Detectors in Homonymous Distributed Systems (with an Application to Consensus)

2011

This paper is on homonymous distributed systems where processes are prone to crash failures and have no initial knowledge of the system membership ("homonymous" means that several processes may have the same identifier). New classes of failure detectors suited to these systems are first defined. Among them, the classes HΩ and HΣ are introduced that are the homonymous counterparts of the classes Ω and Σ, respectively. (Recall that the pair (Ω, Σ) defines the weakest failure detector to solve consensus.) Then, the paper shows how HΩ and HΣ can be implemented in homonymous systems without membership knowledge (under different synchrony requirements). Finally, two algorithms are presented that use these failure detectors to solve consensus in homonymous asynchronous systems where there is no initial knowledge of the membership. One algorithm solves consensus with the pair (HΩ, HΣ), while the other uses only HΩ but needs a majority of correct processes.