Experimental Analysis of a Gossip-Based Service for Scalable, Distributed Failure Detection and Consensus

Simulative performance analysis of gossip failure detection for scalable distributed systems

Cluster Computing, 1999

 Three protocols for gossip-based failure detection services in large-scale heterogeneous clusters are analyzed and compared. The basic gossip protocol provides a means by which failures can be detected in large distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. The hierarchical protocol leverages the underlying network topology to achieve faster failure detection. In addition to studying the effectiveness and efficiency of these two agreement protocols, we propose a third protocol that extends the hierarchical approach by piggybacking gossip information on application-generated messages. The protocols are simulated and evaluated with a fault-injection model for scalable distributed systems comprised of clusters of workstations connected by high-performance networks, such as the CPlant machine at Sandia National Laboratories. The model supports permanent and transient node and link failures, with rates specified at simulation time, for processors functioning in a fail-silent fashion. Through high-fidelity, CAD-based modeling and simulation, we demonstrate the strengths and weaknesses of each approach in terms of agreement time, number of gossips, and overall scalability.
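The basic (flat) gossip protocol described in this abstract can be pictured with a minimal sketch. It is not the authors' simulator: the function name `simulate_gossip`, the round-based timing, the heartbeat-counter merge rule, and the `timeout` parameter are all assumptions made for illustration. Each live node bumps its own heartbeat every round and pushes its whole table to one random peer; a peer whose counter has not advanced for more than `timeout` rounds is suspected.

```python
import random

def simulate_gossip(num_nodes, failed, rounds, timeout, seed=0):
    """Return, for each live node, the set of peers it suspects after `rounds`.

    Hypothetical sketch: `failed` nodes are fail-silent from round 0.
    """
    rng = random.Random(seed)
    live = [n for n in range(num_nodes) if n not in failed]
    # heartbeat[i][j]: highest heartbeat counter node i has seen for node j
    heartbeat = {i: [0] * num_nodes for i in live}
    # last_update[i][j]: round in which node i last saw j's counter increase
    last_update = {i: [0] * num_nodes for i in live}

    for rnd in range(1, rounds + 1):
        for i in live:
            # A live node advances its own heartbeat once per round.
            heartbeat[i][i] = rnd
            last_update[i][i] = rnd
            # Gossip the whole table to one randomly chosen other node.
            target = rng.choice([n for n in range(num_nodes) if n != i])
            if target in failed:
                continue  # a fail-silent node neither receives nor replies
            for j in range(num_nodes):
                if heartbeat[i][j] > heartbeat[target][j]:
                    heartbeat[target][j] = heartbeat[i][j]
                    last_update[target][j] = rnd

    # Suspect every peer whose counter has been stale for more than `timeout` rounds.
    return {
        i: {j for j in range(num_nodes) if rounds - last_update[i][j] > timeout}
        for i in live
    }

suspects = simulate_gossip(num_nodes=8, failed={3}, rounds=60, timeout=20)
```

In this sketch the failed node is always suspected by every live node, because its counter never advances. Whether a live node is ever falsely suspected depends on `timeout` relative to how quickly gossip spreads, which is the trade-off between detection time and accuracy that such protocols have to tune.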

Performance analysis of flat and layered gossip services for failure detection and consensus in scalable heterogeneous clusters

Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001, 2001

Gossip protocols and services provide a means by which failures can be detected in large, distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. Gossiping with consensus can take place throughout the system via a flat structure, or it can be hierarchically distributed across cooperating layers of nodes. In this paper, the performance of flat and layered protocols is analyzed on an experimental testbed in terms of consensus time and scalability. Performance associated with layered gossip is analyzed with varying group sizes and is shown to scale well in a heterogeneous environment.

Achieving scalable cluster system analysis and management with a gossip-based network service

Proceedings LCN 2001. 26th Annual IEEE Conference on Local Computer Networks

Clusters of workstations are increasingly used for applications requiring high levels of both performance and reliability. Certain fundamental services are highly desirable to achieve these twin goals of network-based cluster system analysis and management. Among these services is the ability to detect network and node failures and the capability to efficiently determine computer and network load levels. Furthermore, the ability to allow for the distribution of administrative directives is also integral to the goal of cluster management. This paper presents a scalable approach to providing these vital support capabilities for distributed computing integrated into a cluster management system. Previous approaches to cluster management have suffered from problems of scalability and the inability to properly support heterogeneous systems in a non-proprietary fashion. This cluster management system employs gossip techniques to address the problem of scalability in network-based system management. The results of two case studies show that the cluster management system is scalable and has little adverse impact on the performance of sequential and parallel applications running on the managed system.

Gossip-based service coordination for scalability and resilience

2008

Many interesting emerging applications involve the coordination of a large number of service instances, for instance, as targets for dissemination or sources in information gathering. These applications raise hard architectural, scalability, and resilience issues that are not suitably addressed by centralized or monolithic coordination solutions.

Gossip-based broadcast protocols

Gossip, or epidemic, protocols have emerged as a powerful strategy to implement highly scalable and resilient reliable broadcast primitives. For scalability reasons, each participant in a gossip protocol maintains only a partial view of the system, from which it selects peers to perform gossip exchanges. On the other hand, the natural redundancy of gossip protocols makes them less efficient than other approaches that rely on some sort of structured overlay network. The thesis addresses gossip protocols and the problem of building partial views to support their operation. For that purpose, the thesis presents and evaluates a new scalable membership protocol, called HyParView, that provides a number of properties, such as degree distribution, accuracy, and clustering coefficient, that are highly useful for the construction of efficient gossip protocols. The thesis also introduces two new gossip protocols, based on HyParView, that provide high reliability with small message redundancy. One is an eager push gossip protocol, while the other is a tree-based gossip broadcast protocol. Simulation results show that, in comparison with other existing protocols, HyParView-based gossip protocols not only provide better reliability but also support higher percentages of node failures and are able to recover faster from these failures.

Efficient failure detection and consensus at extreme-scale systems

International Journal of Electrical and Computer Engineering (IJECE), 2022

Distributed systems and extreme-scale systems have become ubiquitous in recent years and are found throughout academic organizations, business, home, and government sectors. Peer-to-peer (P2P) technology is a typical distributed system model that is gaining popularity for delivering computing resources and services. Distributed systems try to increase their availability in the event of frequent component failures, and keeping the system functioning in such scenarios is notoriously difficult. In order to identify component failures in the system and achieve global agreement (consensus) on the failed components, this paper implements an efficient failure detection and consensus algorithm based on fail-stop process failures. The proposed algorithm is tolerant of process failures occurring before and during its execution. It works with the epidemic gossip protocol, a randomized paradigm of computation and communication that is both fault-tolerant and scalable. A simulation of an extreme-scale information dissemination process shows that global agreement can be achieved. A P2P simulator, PeerSim, is used in the paper to implement and test the proposed algorithm. The results exhibit high scalability, and all process failures were detected. The status of all the processes is maintained in a Boolean matrix.
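The abstract does not spell out how the Boolean matrix is used, so the following is only one plausible reading, sketched under stated assumptions: entry `matrix[i][j]` is taken to mean "process i suspects that process j has failed", matrices received through gossip are combined with a logical OR, and agreement on a failure is declared once every process not itself suspected has marked it. The names `merge` and `agreed_failures` are hypothetical and do not come from the paper.

```python
def merge(local, received):
    """OR-merge a received suspicion matrix into the local one, in place."""
    n = len(local)
    for i in range(n):
        for j in range(n):
            local[i][j] = local[i][j] or received[i][j]

def agreed_failures(matrix):
    """Return the processes that every non-suspected process marks as failed."""
    n = len(matrix)
    # A process is suspected if at least one row marks it.
    suspected = {j for j in range(n) if any(matrix[i][j] for i in range(n))}
    alive = [i for i in range(n) if i not in suspected]
    return {j for j in suspected if all(matrix[i][j] for i in alive)}

def empty(n):
    return [[False] * n for _ in range(n)]

# Four processes; process 2 has failed and each survivor has noticed locally.
views = {p: empty(4) for p in (0, 1, 3)}
for p in views:
    views[p][p][2] = True

merge(views[0], views[1])
partial = agreed_failures(views[0])   # process 3's opinion is still missing
merge(views[0], views[3])
final = agreed_failures(views[0])     # all survivors now agree on process 2
```

Because the OR-merge is commutative and idempotent, the order in which gossip messages arrive and any duplicate deliveries do not affect the matrix a process ends up with, which is what makes this kind of state convenient to spread epidemically.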

A Failure Detection System for Large Scale Distributed Systems

… International Conference on …, 2010

Failure detection is a fundamental building block for ensuring fault tolerance in large-scale distributed systems. In this paper we present an innovative solution to this problem. The approach is based on adaptive, decentralized failure detectors capable of working asynchronously and independently of the application flow. The proposed failure detectors are based on clustering, the use of a gossip-based algorithm for detection at the local level, and the use of a hierarchical structure among clusters of detectors along which traffic is channeled. We present results showing that the system is able to scale to a large number of nodes while still considering the QoS requirements of both applications and resources. The system also includes fault tolerance and system orchestration mechanisms, added in order to assess the reliability and availability of distributed systems in an autonomic manner.

Scalable epidemic message passing interface fault tolerance

Bulletin of Electrical Engineering and Informatics, 2022

Resilience and fault tolerance are challenging tasks in the field of high performance computing (HPC) and extreme-scale systems. Components fail more often in such systems, resulting in application aborts. Adopting fault-tolerance techniques makes it possible to consistently detect failures and continue the application's execution even when failures exist. A prominent parallel programming specification, the message passing interface (MPI), is used to implement the failure detection and consensus algorithm in this paper. Although MPI does not facilitate fault-tolerant behavior, this work presents a fault-tolerant, matrix-based failure detection and consensus algorithm. The proposed algorithm uses gossiping. To detect failures, randomised pinging is applied during the execution of the algorithm by using piggybacked gossip messages. In order to achieve consensus on the failures in the system, information about failed processes is sent to all the alive processes using the same piggybacked gossip messages. The algorithm was implemented in the MPI framework and is completely fault tolerant. The results show that all the MPI process failures were detected using randomised pinging and that global consensus was achieved on the failed MPI processes in the system.

A probabilistic characterization of a fault-tolerant gossiping algorithm

Journal of Systems Science and Complexity, 2009

Gossiping is a popular technique for probabilistic reliable multicast (or broadcast). However, it is often difficult to understand the behavior of gossiping algorithms in an analytic fashion. Indeed, existing analyses of gossip algorithms are either based on simulation or based on ideas borrowed from epidemic models while inheriting some features that do not seem to be appropriate for the setting of gossiping. On one hand, in epidemic spreading, an infected node typically intends to spread the infection an unbounded number of times (or rounds); whereas in gossiping, an infected node (i.e., a node having received the message in question) may prefer to gossip the message a bounded number of times. On the other hand, the often assumed homogeneity in epidemic spreading models (especially that every node has equal contact to everyone else in the population) has been silently inherited in the gossiping literature, meaning that an expensive membership protocol is often needed for maintaining nodes' views. Motivated by these observations, the authors present a characterization of a popular class of fault-tolerant gossip schemes (known as "push-based gossiping") based on a novel probabilistic model, while taking the afore-mentioned factors into consideration.
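The contrast drawn here between unbounded epidemic spreading and bounded gossiping can be made concrete with a small simulation. This is an illustrative sketch, not the paper's probabilistic model: the function `push_gossip`, the `fanout` and `max_rounds` parameters, and the assumption that every node can reach every other node uniformly at random (the very homogeneity assumption the abstract questions) are introduced only for this example.

```python
import random

def push_gossip(n, fanout, max_rounds, seed=0):
    """Simulate push-based gossip from node 0 and return how many nodes got the message.

    Each node that holds the message pushes it to `fanout` random peers per
    round, for at most `max_rounds` rounds after first receiving it.
    """
    rng = random.Random(seed)
    remaining = {0: max_rounds}  # node -> rounds it may still gossip
    while any(r > 0 for r in remaining.values()):
        # Nodes infected during this round start gossiping in the next one.
        active = [node for node, r in remaining.items() if r > 0]
        for node in active:
            remaining[node] -= 1
            peers = [p for p in range(n) if p != node]
            for target in rng.sample(peers, min(fanout, len(peers))):
                if target not in remaining:
                    remaining[target] = max_rounds
    return len(remaining)

reached = push_gossip(n=1000, fanout=3, max_rounds=2)
```

With `max_rounds=0` the message never leaves the source, and with `fanout=n-1` a single round reaches every node. Between those extremes the number of nodes reached is a random quantity, and its distribution is the kind of object an analytic characterization of bounded push-based gossiping aims to describe.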