Implementing fault-tolerant services using the state machine approach: a tutorial

2005

Abstract. Nowadays, one of the major concerns about services provided over the Internet is their availability. Replication is a well-known way to increase the availability of a service. However, replication has associated costs: correct coordination among the replicas must be guaranteed. Moreover, because the Internet is such an unpredictable and insecure environment, this coordination should tolerate Byzantine faults and be immune to timing failures.

This work was first developed towards the end of the Malicious and Accidental Fault Tolerance (MAFTIA)

2015

State machine replication (SMR) is a generic technique for implementing fault-tolerant distributed services by replicating them in sets of servers. There have been several proposals for using SMR to tolerate arbitrary or Byzantine faults, including intrusions. However, most of these systems can tolerate at most f faulty servers out of a total of 3f + 1. We show that it is possible to implement a Byzantine state machine replication algorithm with only 2f + 1 replicas by extending the system with a simple trusted distributed component. Several performance metrics show that our algorithm, BFT-TO, fares well in comparison with others in the literature. Furthermore, BFT-TO is not vulnerable to some recently-presented performance attacks that affect alternative approaches.
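The replica counts quoted in this abstract can be made concrete with a small arithmetic sketch. It only restates the 3f + 1 and 2f + 1 bounds mentioned above; the function names are hypothetical and do not come from the paper.

```python
def replicas_classic_bft(f: int) -> int:
    """Replicas needed to tolerate f Byzantine faults with the usual 3f + 1 bound."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def replicas_with_trusted_component(f: int) -> int:
    """Replicas needed under the 2f + 1 bound claimed when a trusted component is added."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1


def max_faults_tolerated(n: int, with_trusted_component: bool = False) -> int:
    """Largest f that n replicas can tolerate under the chosen bound (hypothetical helper)."""
    divisor = 2 if with_trusted_component else 3
    return max((n - 1) // divisor, 0)


if __name__ == "__main__":
    for f in range(1, 4):
        print(f, replicas_classic_bft(f), replicas_with_trusted_component(f))
```

For a given f, the second bound saves f replicas, which is the saving the abstract attributes to the trusted component.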

Fault tolerance in distributed systems using fused state machines

Distributed Computing, 2014

Replication is a standard technique for fault tolerance in distributed systems modeled as deterministic finite state machines (DFSMs or machines). To correct f crash or ⌊f/2⌋ Byzantine faults among n different machines, replication requires nf additional backup machines. We present a solution called fusion that requires just f additional backup machines. First, we build a framework for fault tolerance in DFSMs based on the notion of Hamming distances. We introduce the concept of an (f, m)-fusion, which is a set of m backup machines that can correct f crash faults or ⌊f/2⌋ Byzantine faults among a given set of machines. Second, we present an algorithm to generate an (f, f)-fusion for a given set of machines. We ensure that our backups are efficient in terms of the size of their state and event sets. Third, we use locality sensitive hashing for the detection and correction of faults, which incurs almost the same overhead as that for replication. We detect Byzantine faults with time complexity O(nf) on average, while we correct crash and Byzantine faults with time complexity O(nρf) with high probability, where ρ is the average state reduction achieved by fusion. Finally, our evaluation of fusion on the widely used MCNC'91 benchmarks for DFSMs shows that the average state space...

*This research was supported in part by the NSF Grants CNS-0718990, CNS-0509024, CNS-1115808 and Cullen Trust for Higher Education Endowed Professorship.
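As a rough illustration of the backup counts stated in this abstract, the sketch below compares the nf backups attributed to replication with the f backups attributed to fusion, and computes the ⌊f/2⌋ Byzantine faults correctable in either case. The function names are hypothetical, and the sketch does not implement fusion itself.

```python
def replication_backups(n: int, f: int) -> int:
    """Backups needed by replication: f copies of each of the n machines."""
    if n < 0 or f < 0:
        raise ValueError("n and f must be non-negative")
    return n * f


def fusion_backups(f: int) -> int:
    """Backups needed by an (f, f)-fusion, as stated in the abstract."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return f


def byzantine_faults_correctable(f: int) -> int:
    """Byzantine faults correctable by a scheme that corrects f crash faults: floor(f / 2)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return f // 2


if __name__ == "__main__":
    n, f = 5, 2
    print("replication backups:", replication_backups(n, f))
    print("fusion backups:", fusion_backups(f))
    print("Byzantine faults correctable:", byzantine_faults_correctable(f))
```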

State of Art Survey for Fault Tolerance Feasibility in Distributed Systems

Asian Journal of Research in Computer Science, 2021

The use of technology has grown dramatically, and computer systems are now interconnected via various communication media. Data distribution has made distributed systems (DS) increasingly useful in daily activities, because distributed systems allow nodes to organize and share their resources across linked systems or devices, giving users access to geographically dispersed computing capacity. Because failures can occur at multiple points, distributed systems may suffer from a lack of service availability. Fault tolerance (FT) techniques are used in distributed systems to avoid such failures and to ensure replication, high redundancy, and high availability of distributed services. This paper surveys fault tolerance systems and their requirements, and describes distributed systems. It also discusses distributed system architecture and explains the fault tolerance techniques in use, in ad...

Study of Several Fault Tolerance Methodologies in Distributed Environment

2012

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system in which even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability or life-critical systems. This paper provides an overview of fault tolerance techniques in distributed systems, particularly replication and checkpointing. We also suggest fault tolerance by replicated checkpointing, in which both techniques are combined. This work offers new scholars and students a good-quality reference.

Efficient State Transfer for Recovery-Based Byzantine-Fault-Tolerant State Machine Replication

2009

Abstract. This paper presents an efficient state-transfer protocol for Byzantine-fault-tolerant state machine replication systems enhanced with recovery mechanisms. Usually the recovery of a stateful replica consumes a considerable amount of time, mostly due to state transfer. As a result, it is essential to reduce the state transfer time while ensuring that correct replicas never lose their state.

Workshop on Methods, Models and Tools for Fault Tolerance

2007

Faults are unavoidable in all large systems and therefore designing for fault tolerance is essential. We believe that the use of formal methods is essential for mastering the complexity inherent in systems with faults and mechanisms for tolerating those faults. Formal modelling and analysis help designers to identify faults and to understand the effect of faults on system behaviour. Modelling and analysis also help designers understand the contribution of fault-tolerance mechanisms to overall system dependability.

FT-SR: A Programming Language For Constructing Fault-Tolerant Distributed Systems

1992

This dissertation focuses on the area of improving programming language support for constructing fault-tolerant systems. Specifically, the design and implementation of FT-SR, a programming language developed for building a wide variety of fault-tolerant systems, is described. FT-SR is based on the concurrent programming language SR and is designed as a set of extensions to SR. A distinguishing feature of FT-SR is the flexibility it provides the programmer in structuring fault-tolerant software. It is flexible enough to be used for structuring systems according to any of the standard fault-tolerance structuring paradigms that have been developed for such systems, including the object/action model, the restartable action paradigm, and the state machine approach. This is especially important in systems building because different structuring paradigms are often appropriate for different parts of the system. This flexibility sets FT-SR apart from other fault-tolerant programming languages, which provide language support for the one paradigm that is best suited for the class of applications they choose to support. FT-SR, on the other hand, is suitable for programming a variety of systems and applications. FT-SR derives its flexibility from a programming model based on fail-stop atomic objects. These objects execute operations as atomic actions except when a failure or series of failures causes underlying implementation assumptions to be violated; in this case, notification is provided. This dissertation argues that fail-stop atomic objects are the fundamental building blocks for all fault-tolerant programs. FT-SR provides the programmer with simple fail-stop atomic objects, and mechanisms that allow these fail-stop atomic objects to be composed to form higher-level fail-stop atomic objects that can tolerate a greater number of faults. The mechanisms for composing fail-stop atomic objects are based on standard redundancy techniques.
This ability to combine the basic building blocks in a variety of ways allows programmers to structure their programs in a manner best suited to the application at hand.
