Broadcast protocols for distributed systems
Related papers
An ordered and reliable broadcast protocol for distributed systems
Computer Communications, 1997
The purpose of a reliable broadcast protocol is to allow groups of nodes on unreliable broadcast networks to reliably broadcast messages. A reliable broadcast protocol must guarantee two properties: (1) all of the receivers in a group receive the broadcast messages, and (2) each of the receivers orders the messages in the same sequence. In the optimistic approach to reliable broadcast, a batch acknowledgement is employed for a sequence of broadcast messages, instead of the one or more acknowledgements per broadcast message used in the pessimistic approach. In this paper, based on the optimistic approach, we propose a counter-based reliable broadcast protocol. In this protocol, the unique token ownership is circulated among all the nodes in an order specified by a token-passing-list. The system state, which records related information about the messages broadcast by each node, is included in the token message. By appropriately updating the counter information recorded in the system state included in the token message, instead of using explicit acknowledgement messages, the proposed protocol needs fewer control messages to commit a broadcast message than other protocols, whether the rate of transmission errors is high or low. Moreover, we show how to handle the flow control problem and describe the token update technique.
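The idea of carrying acknowledgement counters on a circulating token can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the protocol from the paper: the names `Token`, `Node`, `hold_token` and `run` are hypothetical, the network is lossless, and flow control and retransmission are omitted. The token holder assigns global sequence numbers to its messages and writes its own in-order receive count into the token; a message counts as committed once every node's counter has passed it.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    next_seq: int = 0                             # next global sequence number
    received: dict = field(default_factory=dict)  # node name -> in-order messages seen

    def committed(self):
        # Messages with sequence number below this value are known to be
        # received by every node, so no explicit acknowledgement is needed.
        return min(self.received.values(), default=0)


@dataclass
class Node:
    name: str
    outbox: list = field(default_factory=list)    # messages waiting for the token
    pending: dict = field(default_factory=dict)   # seq -> message, not yet delivered
    delivered: list = field(default_factory=list)

    def receive(self, seq, msg):
        # Buffer the message, then deliver strictly in sequence-number order.
        self.pending[seq] = msg
        while len(self.delivered) in self.pending:
            nxt = len(self.delivered)
            self.delivered.append(self.pending.pop(nxt))


def hold_token(node, token, ring):
    # Only the token holder may broadcast, so sequence numbers are unique.
    for msg in node.outbox:
        seq = token.next_seq
        token.next_seq += 1
        for peer in ring:          # assumption: lossless broadcast
            peer.receive(seq, msg)
    node.outbox.clear()
    # Piggyback this node's receive counter on the token.
    token.received[node.name] = len(node.delivered)


def run(ring, rounds):
    token = Token(received={n.name: 0 for n in ring})
    for _ in range(rounds):
        for node in ring:          # fixed token-passing order
            hold_token(node, token, ring)
    return token
```

In this sketch a message is committed only after the token has revisited nodes whose counters were recorded before the message was sent, which is why the commit point lags the delivery point by up to one token rotation.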
Two New Protocols for Fault Tolerant Agreement
International Journal of Distributed and Parallel systems, 2011
The paper attempts to handle failures effectively, while reaching agreement, in a distributed transaction processing system. Standard protocols such as BFTDC [3], Zyzzyva [4] and PBFT [5] handle the problem to a large extent. However, the limitation of these protocols is that they incur increased message overhead as well as large latency. Moreover, nodes are evicted from the transaction system after being declared faulty. We propose a novel proactive agreement protocol which identifies tentative failures in the system. To improve failure resiliency with minimum execution overhead, we also propose an optimized reactive view change mechanism. Both mechanisms have been analyzed and compared. The dynamic analysis of the protocol shows that, in a faulty scenario, the proactive approach is computationally more efficient and has lower latency than the reactive one. Moreover, unlike PBFT and BFTDC, our agreement protocol runs in two phases, which leads to reduced message overhead and total execution time. The protocol handles fail-silent (i.e. crashed) nodes in the system.
A Unified Fault-Tolerance Protocol
Lecture Notes in Computer Science, 2004
Davies and Wakerly show that Byzantine fault tolerance can be achieved by a cascade of broadcasts and middle value select functions. We present an extension of the Davies and Wakerly protocol, the unified protocol, and its proof of correctness. We prove that it satisfies validity and agreement properties for communication of exact values. We then introduce bounded communication error into the model. Inexact communication is inherent for clock synchronization protocols. We prove that validity and agreement properties hold for inexact communication, and that exact communication is a special case. As a running example, we illustrate the unified protocol using the SPIDER family of fault-tolerant architectures. In particular we demonstrate that the SPIDER interactive consistency, distributed diagnosis, and clock synchronization protocols are instances of the unified protocol.
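A middle value select function of the kind mentioned above can be sketched in a few lines. This is an illustrative assumption about its shape, not the definition used in the paper: the hypothetical `middle_value_select` returns the median of the values a receiver collected (the upper middle when the count is even). When a majority of the sources are non-faulty, the selected value lies within the range of the good values, so exact agreement falls out as the special case where all good values are equal.

```python
def middle_value_select(values):
    # Sort the received values and pick the middle one
    # (the upper middle when the count is even).
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


# Exact communication: two good sources agree, one faulty source
# sends a different value to each receiver.
good_exact = [10.0, 10.0]
exact_1 = middle_value_select(good_exact + [-999.0])
exact_2 = middle_value_select(good_exact + [999.0])

# Inexact communication: good sources differ by a bounded error.
good_inexact = [10.0, 10.2]
inexact_1 = middle_value_select(good_inexact + [-999.0])
inexact_2 = middle_value_select(good_inexact + [999.0])
```

In the exact case both receivers select the same value; in the inexact case they may select different values, but both stay inside the interval spanned by the good sources.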
Reliable communication in the presence of failures
ACM Transactions on Computer Systems, 1985
The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency, while respecting application-specific delivery ordering constraints, and have varying cost and performance that depend on the degree of ordering desired. In particular, a protocol that enforces causal delivery orderings is introduced and shown to be a valuable alternative to conventional asynchronous communication protocols. The facility also ensures that the processes belonging to a fault-tolerant process group will observe consistent orderings of events affecting the group as a whole, including process failures, recoveries, migration, and dynamic changes to group properties like member rankings. A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher level algorithms made possible by our approach.
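Causal delivery ordering can be illustrated with the widely used vector-clock delivery rule. This sketch is not claimed to be the mechanism of the paper above; the class `CausalReceiver` and its methods are hypothetical names, and the receiver is assumed not to send messages itself. A message from process `s` carrying timestamp `stamp` is held back until it is the next message from `s` and every message it causally depends on has already been delivered.

```python
class CausalReceiver:
    def __init__(self, n):
        self.clock = [0] * n   # clock[i] = messages from process i delivered so far
        self.held = []         # messages received but not yet deliverable
        self.delivered = []

    def _deliverable(self, sender, stamp):
        # Next message from the sender, and no missing causal predecessors.
        return stamp[sender] == self.clock[sender] + 1 and all(
            stamp[k] <= self.clock[k] for k in range(len(stamp)) if k != sender
        )

    def receive(self, sender, stamp, payload):
        self.held.append((sender, stamp, payload))
        progress = True
        while progress:        # delivering one message may unblock others
            progress = False
            for item in list(self.held):
                s, st, p = item
                if self._deliverable(s, st):
                    self.held.remove(item)
                    self.clock[s] += 1
                    self.delivered.append(p)
                    progress = True


# Process 0 sends m1; process 1 delivers m1 and then sends m2.
# The receiver gets them out of order but delivers m1 before m2.
r = CausalReceiver(3)
r.receive(1, [1, 1, 0], "m2")
r.receive(0, [1, 0, 0], "m1")
```

Messages that are not causally related are delivered as soon as they arrive, which is where the concurrency advantage over a total order comes from.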
Multicoordinated Agreement Protocols for Higher Availabilty
2008 Seventh IEEE International Symposium on Network Computing and Applications, 2008
Adaptability and graceful degradation are important features in distributed systems. Yet, consensus and other agreement protocols, basic building blocks of reliable distributed systems, lack these features and must perform expensive reconfiguration even in the face of single failures. In this paper we describe a multicoordinated mode of execution for agreement protocols that has improved availability and tolerates failures in a graceful manner. We exemplify our approach by presenting a Generic Broadcast algorithm. Our protocol can adapt to environment changes by switching to different execution modes. Finally, we show how our algorithm can solve Generalized Consensus and its many instances (e.g., consensus, atomic broadcast, reliable broadcast).
The consensus problem in fault-tolerant computing
ACM Computing Surveys, 1993
The consensus problem is concerned with the agreement on a system status by the fault-free segment of a processor population in spite of the possible inadvertent or even malicious spread of disinformation by the faulty segment of that population. The resulting protocols are useful throughout fault-tolerant parallel and distributed systems and will impact the design of decision systems to come. This paper surveys research on the consensus problem, compares approaches, outlines applications, and suggests directions for future work.
Atomic Broadcast In Asynchronous Crash-Recovery Distributed Systems
ICDCS, 2000
Atomic Broadcast is a fundamental problem of distributed systems: it states that messages must be delivered in the same order to their destination processes. This paper describes a solution to this problem in asynchronous distributed systems in which processes can crash and recover.
Multicoordinated agreement protocols and the log service
Agreement problems are a common abstraction in distributed systems. They appear when the components of the system must concur on reconfigurations, changes of state, or in lines of action in general. Examples of agreement problems are Consensus, Atomic Commitment, and Atomic Broadcast. In this thesis we investigate these abstractions in the context of the environment in which they will run and the applications that they will serve; in general, we consider the asynchronous crash-recovery model. The goal is to devise protocols that explore the contextual information to deliver improved availability. The correctness of our protocols holds even when the extra assumptions do not. In the first part of this thesis we explore the following property: messages broadcast in small networks tend to be delivered in order and reliably. We make three contributions in this part. The first contribution is to turn known Consensus algorithms that harness this ordering property to reach agreement in the ...
2003 International Conference on Dependable Systems and Networks, 2003. Proceedings., 2003
Protocols that solve agreement problems are essential building blocks for fault-tolerant distributed systems. While many protocols have been published, little has been done to analyze their performance, especially the performance of their fault tolerance mechanisms. In this paper, we present a performance evaluation methodology that can be generalized to analyze many kinds of fault-tolerant algorithms. We use the methodology to compare two atomic broadcast algorithms with different fault tolerance mechanisms: unreliable failure detectors and group membership. We evaluated the steady state latency in (1) runs with neither crashes nor suspicions, (2) runs with crashes and (3) runs with no crashes in which correct processes are wrongly suspected to have crashed, as well as (4) the transient latency after a crash. We found that the two algorithms have the same performance in Scenario 1, and that the group membership based algorithm has an advantage in terms of performance and resiliency in Scenario 2, whereas the failure detector based algorithm offers better performance in the other scenarios. We discuss the implications of our results for the design of fault-tolerant distributed systems.
A Timely Distributed Consensus Solution in a Crash/Omission-Fault Environment
A timely protocol to solve the distributed consensus problem that tolerates process crashes and message omissions is described. The protocol is optimal in terms of the number of communication steps needed to achieve consensus. The model on which the protocol is based relies on a priority-based communication network, a kind of network commonly used in practice to support real-time systems.