Efficient failure detection and consensus at extreme-scale systems
Related papers
Scalable epidemic message passing interface fault tolerance
Bulletin of Electrical Engineering and Informatics, 2022
Resilience and fault tolerance are challenging tasks in the field of high performance computing (HPC) and extreme-scale systems. Components fail more often in such systems, which results in application aborts. Adopting fault-tolerance techniques makes it possible to detect failures consistently and to continue an application's execution even when failures exist. A prominent parallel programming specification, the message passing interface (MPI), is used to implement the failure detection and consensus algorithm in this paper. Although MPI does not facilitate fault-tolerant behavior, this work presents a fault-tolerant, matrix-based failure detection and consensus algorithm. The proposed algorithm uses gossiping. To detect failures, randomised pinging is applied during the execution of the algorithm using piggybacked gossip messages. To achieve consensus on the failures in the system, information about failed processes is sent to all alive processes using the same piggybacked gossip messages. The algorithm was implemented in the MPI framework and is completely fault tolerant. The results show that all MPI process failures were detected using randomised pinging and that global consensus was achieved on the failed MPI processes in the system.
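The detection and consensus steps described in this abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the paper's MPI implementation: the function name `simulate_gossip_detection` and its parameters are hypothetical, the per-process failure matrix is simplified to a set of known failures, all crashes happen before the run starts, and a missing ping reply is assumed to always mean a real failure (no false suspicions).

```python
import random


def simulate_gossip_detection(n, failed, seed=0, max_cycles=1000):
    """Simulate randomised pinging with piggybacked failure information.

    n       -- number of processes, assumed to be at least 2 (hypothetical)
    failed  -- set of process ids that crashed before the run starts
    Returns the number of gossip cycles until every alive process knows
    every failed process, or None if max_cycles is exhausted.
    """
    rng = random.Random(seed)
    alive = [p for p in range(n) if p not in failed]
    # known[p] is the set of failures process p has learned about so far.
    known = {p: set() for p in alive}

    for cycle in range(1, max_cycles + 1):
        for p in alive:
            # Ping one randomly chosen peer other than ourselves.
            target = rng.choice([q for q in range(n) if q != p])
            if target in failed:
                # No reply (assumption: this always means a crash),
                # so p detects the failure directly.
                known[p].add(target)
            else:
                # Reply received: both sides merge the failure lists
                # piggybacked on the ping and its reply.
                merged = known[p] | known[target]
                known[p] = set(merged)
                known[target] = set(merged)
        # Consensus: every alive process holds the full failed set.
        if all(known[p] == set(failed) for p in alive):
            return cycle
    return None
```

Calling `simulate_gossip_detection(16, {3, 7})` returns the cycle at which all 14 alive processes agree on the two failures. The cycle count depends on the random seed, so it is an illustration of convergence rather than a performance figure.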
Cluster Computing, 2004
Gossip protocols and services provide a means by which failures can be detected in large, distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. Extending the gossip protocol such that a system reaches consensus on detected faults can be performed via a flat structure, or it can be hierarchically distributed across cooperating layers of nodes. In this paper, the performance of gossip services employing flat and hierarchical schemes is analyzed on an experimental testbed in terms of consensus time, resource utilization and scalability. Performance associated with a hierarchically arranged gossip scheme is analyzed with varying group sizes and is shown to scale well. Resource utilization of the gossip-style failure detection and consensus service is measured in terms of network bandwidth utilization and CPU utilization. Analytical models are developed for resource utilization and performance projections are made...
A Failure Detection System for Large Scale Distributed Systems
… International Conference on …, 2010
Failure detection is a fundamental building block for ensuring fault tolerance in large scale distributed systems. In this paper we present an innovative solution to this problem. The approach is based on adaptive, decentralized failure detectors, capable of working asynchronously and independently of the application flow. The proposed failure detectors are based on clustering, the use of a gossip-based algorithm for detection at the local level, and the use of a hierarchical structure among clusters of detectors along which traffic is channeled. We present results showing that the system is able to scale to a large number of nodes while still considering the QoS requirements of both applications and resources. It also includes the fault tolerance and system orchestration mechanisms added in order to assess the reliability and availability of distributed systems in an autonomic manner.
Consensus Based on Strong Failure Detectors: A Time and Message-Efficient Protocol
Lecture Notes in Computer Science, 2000
The class of strong failure detectors (denoted S) includes all failure detectors that suspect all crashed processes and that do not suspect some (a priori unknown) process that never crashes. So, a failure detector that belongs to S is intrinsically unreliable as it can arbitrarily suspect correct processes. Several S-based consensus protocols have been designed. Some of them systematically require n computation rounds (n being the number of processes), each round involving n^2 or n messages. Others allow early decision (i.e., the number of rounds depends on the maximal number of crashes when there are no erroneous suspicions) but require each round to involve n^2 messages. This paper presents an early deciding S-based consensus protocol, each round of which involves 3(n - 1) messages. So, the proposed protocol is particularly time- and message-efficient. Moreover, it can easily be generalized to reduce the number of rounds at the price of an increase in the number of messages per round.
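To make the per-round message counts in this abstract concrete, the short sketch below evaluates the two expressions it quotes, n^2 and 3(n - 1), for a few system sizes. The function name `messages_per_round` and the dictionary keys are illustrative assumptions; only the two formulas come from the abstract.

```python
def messages_per_round(n):
    """Per-round message counts quoted in the abstract, for n processes."""
    return {
        "all_to_all": n * n,      # n^2 messages per round
        "proposed": 3 * (n - 1),  # 3(n - 1) messages per round
    }


for n in (4, 16, 64):
    counts = messages_per_round(n)
    print(n, counts["all_to_all"], counts["proposed"])
```

For n = 16 this gives 256 messages per round against 45, which shows why the linear bound matters as the number of processes grows.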
Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001, 2001
Gossip protocols and services provide a means by which failures can be detected in large, distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. Gossiping with consensus can take place throughout the system via a flat structure, or it can be hierarchically distributed across cooperating layers of nodes. In this paper, the performance of flat and layered protocols is analyzed on an experimental testbed in terms of consensus time and scalability. Performance associated with layered gossip is analyzed with varying group sizes and is shown to scale well in a heterogeneous environment.
Simulative performance analysis of gossip failure detection for scalable distributed systems
Cluster Computing, 1999
Three protocols for gossip-based failure detection services in large-scale heterogeneous clusters are analyzed and compared. The basic gossip protocol provides a means by which failures can be detected in large distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. The hierarchical protocol leverages the underlying network topology to achieve faster failure detection. In addition to studying the effectiveness and efficiency of these two agreement protocols, we propose a third protocol that extends the hierarchical approach by piggybacking gossip information on application-generated messages. The protocols are simulated and evaluated with a fault-injection model for scalable distributed systems comprised of clusters of workstations connected by high-performance networks, such as the CPlant machine at Sandia National Laboratories. The model supports permanent and transient node and link failures, with rates specified at simulation time, for processors functioning in a fail-silent fashion. Through high-fidelity, CAD-based modeling and simulation, we demonstrate the strengths and weaknesses of each approach in terms of agreement time, number of gossips, and overall scalability.
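The abstract does not spell out how the basic gossip protocol works, so the following is a sketch of one common heartbeat-counter design for gossip-style failure detection, not the protocol evaluated in the paper. The class `GossipNode`, the function `run`, the timeout `t_fail`, and the round-based clock are all assumptions made for illustration.

```python
import random


class GossipNode:
    """One node holding a heartbeat counter per known node (hypothetical)."""

    def __init__(self, node_id, n, t_fail):
        self.node_id = node_id
        self.t_fail = t_fail          # rounds of silence before suspicion
        self.heartbeat = [0] * n      # highest heartbeat seen per node
        self.last_update = [0] * n    # round in which that entry last grew

    def tick(self, now):
        # Advance our own heartbeat once per gossip round.
        self.heartbeat[self.node_id] += 1
        self.last_update[self.node_id] = now

    def merge(self, other_heartbeat, now):
        # Keep the larger counter for every node and note when it changed.
        for i, hb in enumerate(other_heartbeat):
            if hb > self.heartbeat[i]:
                self.heartbeat[i] = hb
                self.last_update[i] = now

    def suspects(self, now):
        # A node is suspected if its counter has not grown for t_fail rounds.
        return {i for i, t in enumerate(self.last_update)
                if now - t > self.t_fail}


def run(n=8, crashed=frozenset({2}), rounds=60, t_fail=20, seed=0):
    """Gossip for a fixed number of rounds; return each alive node's suspects."""
    rng = random.Random(seed)
    nodes = [GossipNode(i, n, t_fail) for i in range(n)]
    for now in range(1, rounds + 1):
        for node in nodes:
            if node.node_id in crashed:
                continue  # fail-silent: a crashed node sends nothing
            node.tick(now)
            peer_id = rng.choice([i for i in range(n) if i != node.node_id])
            if peer_id not in crashed:
                nodes[peer_id].merge(node.heartbeat, now)
    return {node.node_id: node.suspects(rounds)
            for node in nodes if node.node_id not in crashed}
```

With the defaults, every alive node ends up suspecting node 2, because that node's counter never advances. Choosing `t_fail` is the usual trade-off in this style of detector: a short timeout detects crashes sooner but raises the chance of suspecting a node whose gossip simply has not arrived yet.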
Distributed Systems Course Project: Consensus with Failure Detector
disi.unitn.it
This work covers the implementation of a consensus protocol using the PeerSim Simulator [5]. Since solving consensus in an unreliable asynchronous distributed system is impossible, even if there is at most one failure and the links are reliable, we need to introduce failure detectors in order to solve it. This report was created as a course project for the Distributed Systems course held at the University of Trento by prof. Alberto Montresor and his assistant Gianluca Ciccarelli. Our approach to the problem starts by introducing consensus and explaining its characteristics, and then examines the different kinds of unreliable failure detectors that we can build in order to solve the consensus problem. In the second part of the report we propose some implementations based on the solution proposed in . These possible solutions are implemented using the PeerSim Simulator framework in order to run some simulations. The final step is the analysis of the results obtained by running simulations of the protocol we have built.
Communication-efficient failure detection and consensus in omission environments
Information Processing Letters, 2011
Failure detectors have been shown to be a very useful mechanism to solve the consensus problem in the crash failure model, for which a number of communication-efficient algorithms have been proposed. In this paper we deal with the definition, implementation and use of communication-efficient failure detectors in the general omission failure model, where processes can fail by crashing and by