EngagedScholarship@CSU

Low Latency Fault Tolerance System

The Computer Journal, 2013

The Low Latency Fault Tolerance (LLFT) system provides fault tolerance for distributed applications within a local-area network, using a leader-follower replication strategy. LLFT provides application-transparent replication, with strong replica consistency, for applications that involve multiple interacting processes or threads. Its novel system model enables LLFT to maintain a single consistent infinite computation, despite faults and asynchronous communication.

Fault-Tolerant Intra-Group Communication

In distributed applications, a group of processes has to cooperate. Intra-group communication supports the atomic and causally ordered delivery of messages among the processes in the group. Each process in the group is replicated into a collection of multiple replicas called a cluster. In this paper, we discuss a fault-tolerant group communication protocol that supports the atomic and ordered delivery of messages among the clusters in the group in the presence of Byzantine faults of the replicas.
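As a rough illustration of how a receiver might mask Byzantine replicas inside a sending cluster, the sketch below accepts a message only when at least f + 1 distinct replicas report the same content. This is the generic majority-voting idea, not the protocol from the paper; the function name `vote`, the replica identifiers, and the fault bound `f` are assumptions made for the example.

```python
from collections import Counter

def vote(copies, f):
    """Return the message reported by at least f + 1 distinct replicas, else None.

    copies: dict mapping a (hypothetical) replica id to the message it sent.
    f:      assumed upper bound on the number of Byzantine replicas in the cluster.
    """
    counts = Counter(copies.values())
    if not counts:
        return None
    value, n = counts.most_common(1)[0]
    # With at most f faulty replicas, f + 1 matching copies must include
    # at least one copy from a correct replica.
    return value if n >= f + 1 else None
```

With a cluster of 2f + 1 replicas, the correct replicas alone can always supply the f + 1 matching copies, which is why that cluster size is the usual assumption for this kind of voting.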

Reliable communication in the presence of failures

ACM Transactions on Computer Systems, 1985

The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency, while respecting application-specific delivery ordering constraints, and have varying cost and performance that depend on the degree of ordering desired. In particular, a protocol that enforces causal delivery orderings is introduced and shown to be a valuable alternative to conventional asynchronous communication protocols. The facility also ensures that the processes belonging to a fault-tolerant process group will observe consistent orderings of events affecting the group as a whole, including process failures, recoveries, migration, and dynamic changes to group properties like member rankings. A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher level algorithms made possible by our approach.
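To make the notion of causal delivery ordering concrete, here is a minimal sketch using vector timestamps. It is the textbook vector-clock formulation, not necessarily the mechanism used by the protocols in this paper; the class name `CausalReceiver` and the message layout are assumptions for illustration.

```python
class CausalReceiver:
    """Buffers multicast messages and delivers them in causal order."""

    def __init__(self, n):
        self.clock = [0] * n      # messages delivered so far, per sender
        self.pending = []         # received but not yet deliverable
        self.delivered = []       # payloads in delivery order

    def _deliverable(self, sender, stamp):
        # The message must be the next one from its sender, and everything
        # the sender had delivered before sending must be delivered here too.
        if stamp[sender] != self.clock[sender] + 1:
            return False
        return all(stamp[k] <= self.clock[k]
                   for k in range(len(self.clock)) if k != sender)

    def receive(self, sender, stamp, payload):
        self.pending.append((sender, stamp, payload))
        progress = True
        while progress:           # one delivery may unblock others
            progress = False
            for msg in list(self.pending):
                s, st, p = msg
                if self._deliverable(s, st):
                    self.pending.remove(msg)
                    self.clock[s] += 1
                    self.delivered.append(p)
                    progress = True
```

If a reply stamped `[1, 1, 0]` arrives before the request stamped `[1, 0, 0]` that caused it, the reply is held back until the request has been delivered.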

Fault Tolerance Middleware for Cloud Computing

2010

The Low Latency Fault Tolerance (LLFT) middleware provides fault tolerance for distributed applications deployed within a cloud computing or data center environment, using the leader/follower replication approach. The LLFT middleware consists of a Low Latency Messaging Protocol, a Leader-Determined Membership Protocol, and a Virtual Determinizer Framework. The Messaging Protocol provides a reliable, totally ordered message delivery service by employing a direct group-to-group multicast where the ordering is determined by the primary replica in the group. The Membership Protocol provides a fast reconfiguration and recovery service when a replica becomes faulty and when a replica joins or leaves a group. The Virtual Determinizer Framework captures ordering information at the primary replica and enforces the same ordering at the backup replicas for major sources of nondeterminism. The LLFT middleware maintains strong replica consistency, offers application transparency, and achieves low end-to-end latency.
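The idea that the primary replica determines the message order can be sketched as follows. This is a simplified illustration under stated assumptions, not the LLFT Messaging Protocol itself: the classes `Primary` and `Backup` and the use of plain integer sequence numbers are hypothetical.

```python
class Primary:
    """Assigns a total order by stamping each message with a sequence number."""

    def __init__(self):
        self.next_seq = 0

    def order(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        return seq, payload


class Backup:
    """Delivers messages strictly in the order chosen by the primary."""

    def __init__(self):
        self.expected = 0     # next sequence number to deliver
        self.buffer = {}      # out-of-order messages, keyed by sequence number
        self.delivered = []

    def receive(self, seq, payload):
        if seq < self.expected:
            return            # duplicate of an already delivered message
        self.buffer[seq] = payload
        # Deliver every message that is now contiguous with what came before.
        while self.expected in self.buffer:
            self.delivered.append(self.buffer.pop(self.expected))
            self.expected += 1
```

Because every backup applies the same sequence numbers, all backups deliver the same messages in the same order, regardless of the order in which the network hands them over.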

Asynchronous active replication in three-tier distributed systems

Proceedings of the 2002 Pacific Rim International Symposium on Dependable Computing, 2002

The deployment of server replicas of a given service across an asynchronous distributed system (e.g., the Internet) is a real practical challenge. Indeed, this goal cannot be achieved by classical software replication techniques (e.g., passive and active replication), as these techniques usually rely on group communication toolkits that require server replicas to run over a partially synchronous distributed system. This paper proposes a three-tier architecture for software replication that encapsulates the need for partial synchrony in a specific software component of a mid-tier, freeing replicas (end-tier) and clients (client-tier) from the need for underlying partial synchrony assumptions. We then show how to specialize the mid-tier to manage active replication of server replicas.

A Three-tier Active Replication Protocol for Large Scale Distributed Systems

IEICE TRANSACTIONS on Information and Systems, 2003

The deployment of server replicas of a service across an asynchronous distributed system (e.g., the Internet) is a real practical challenge. Indeed, this goal cannot be achieved by classical software replication techniques (e.g., passive and active replication), as these techniques usually rely on group communication toolkits that require server replicas to run over a partially synchronous distributed system to solve the underlying agreement problem. This paper proposes a three-tier architecture for software replication that encapsulates the need for partial synchrony in a specific software component of a mid-tier, freeing replicas and clients from the need for underlying partial synchrony assumptions. We then show how to specialize the mid-tier to manage active replication of server replicas.

Key words: high availability, fault tolerance, software replication, three-tier architectures

** It is well known that in asynchronous distributed systems, due to the FLP impossibility result [17], it is not possible to implement these primitives while ensuring both safety conditions and the deterministic termination of the agreement protocols that they embed. In other words, these primitives can block the system due to asynchrony and despite replication.

*** Even the weakest unreliable failure detector that allows Consensus to be solved is not implementable in asynchronous distributed systems.

A hierarchical asynchronous replication protocol for large scale systems

Proceedings 1993 IEEE Workshop on Advances in Parallel and Distributed Systems, 1993

This paper presents a new asynchronous replication protocol that is especially suitable for wide area and mobile systems, and allows reads and writes to occur at any replica. Updates reach other replicas using a propagation scheme based on nodes organized into a logical hierarchy. The hierarchical structure enables the scheme to scale well for thousands of replicas, while ensuring reliable delivery. A new service interface is proposed that provides different levels of asynchrony, allowing strong consistency and weak consistency to be integrated into the same framework. Further, the scheme provides the ability to offer different levels of staleness, depending upon the needs of various applications, by querying from different levels of the hierarchy. Also, it allows a selection from a number of reconciliation techniques based on delivery order mechanisms. Restructuring operations are provided to build and reconfigure the hierarchy dynamically without disturbing normal operation. The scheme tolerates transmission failures, node failures and network partitions.
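The propagation of updates through a logical hierarchy can be pictured with the small sketch below. It is only an illustration under simplifying assumptions: direct method calls stand in for asynchronous messages, a single integer version per key stands in for the paper's reconciliation techniques, and the `Node` class is hypothetical.

```python
class Node:
    """A replica in a logical tree; a write at any node spreads along tree edges."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.store = {}               # key -> (version, value)
        if parent is not None:
            parent.children.append(self)

    def neighbours(self):
        up = [self.parent] if self.parent is not None else []
        return up + self.children

    def write(self, key, value, version, source=None):
        current = self.store.get(key)
        if current is not None and current[0] >= version:
            return                    # stale or duplicate update, ignore it
        self.store[key] = (version, value)
        # Forward to every tree neighbour except the one the update came from.
        for nb in self.neighbours():
            if nb is not source:
                nb.write(key, value, version, source=self)
```

In a real asynchronous setting each hop would take time, so nodes farther from the writer would hold older values for a while; the sketch ignores that delay and only shows the routing along the hierarchy.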

Customizable Fault Tolerance for Wide-Area Replication

2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007), 2007

Constructing logical machines out of collections of physical machines is a well-known technique for improving the robustness and fault tolerance of distributed systems. We present a new, scalable replication architecture, built upon logical machines specifically designed to perform well in wide-area systems spanning multiple sites. The physical machines in each site implement a logical machine by running a local state machine replication protocol, and a wide-area replication protocol runs among the logical machines. Implementing logical machines via the state machine approach affords free substitution of the fault tolerance method used in each site and in the wide-area replication protocol, allowing one to balance performance and fault tolerance based on perceived risk. We present a new Byzantine fault-tolerant protocol that establishes a reliable virtual communication link between logical machines. Our communication protocol is efficient (a necessity in wide-area environments), avoiding the need for redundant message sending during normal-case operation and allowing a logical machine to consume approximately the same wide-area bandwidth as a single physical machine. This dramatically improves the wide-area performance of our system compared to existing logical machine based approaches. We implemented a prototype system and compare its performance and fault tolerance to existing solutions.

Newtop: a fault-tolerant group communication protocol

1995

A general-purpose group communication protocol suite called Newtop is described. It is assumed that processes can simultaneously belong to many groups, that group size could be large, and that processes could be communicating over the Internet. An asynchronous communication environment is therefore assumed, where message transmission times cannot be accurately estimated and the underlying network may well get partitioned, preventing functioning processes from communicating with each other. Newtop can provide causality-preserving total order delivery to members of a group, ensuring that total order delivery is preserved for multi-group processes. Both symmetric and asymmetric order protocols are supported, permitting a process to use, say, the symmetric version in one group and the asymmetric version in another.
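One common way to obtain a symmetric (sequencer-free) total order is to stamp messages with logical clocks and deliver them in timestamp order once they are known to be stable. The sketch below illustrates that general idea only and is not Newtop's actual protocol; it assumes FIFO channels and a fixed membership, and the class `SymmetricMember` is hypothetical.

```python
class SymmetricMember:
    """Total order via logical timestamps, with ties broken by sender id."""

    def __init__(self, me, members):
        self.me = me
        self.clock = 0
        self.latest = {m: 0 for m in members}  # highest timestamp seen per member
        self.pending = []                      # (timestamp, sender, payload)
        self.delivered = []

    def send(self, payload):
        self.clock += 1
        msg = (self.clock, self.me, payload)
        self.receive(msg)                      # a sender also delivers its own message
        return msg

    def receive(self, msg):
        ts, sender, _ = msg
        self.clock = max(self.clock, ts)
        self.latest[sender] = max(self.latest[sender], ts)
        self.pending.append(msg)
        self._deliver_stable()

    def _deliver_stable(self):
        # With FIFO channels, no member can still send a timestamp at or
        # below the smallest "latest" value, so those messages are stable.
        stable = min(self.latest.values())
        self.pending.sort()
        while self.pending and self.pending[0][0] <= stable:
            self.delivered.append(self.pending.pop(0)[2])
```

A member that stays silent would stall delivery in this sketch, so a practical protocol needs some way to advance the stable point when a member has nothing to send. An asymmetric variant would instead route messages through a single ordering process.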

Fault-tolerant group communication protocols for asynchronous systems

1994

Contents
Illustrations
Chapter 1 - Introduction
1.1 Group Communication
1.1.1 Process Crashes and Membership Reconfiguration
1.1.2 Message Ordering
1.1.3 Message Delivery in Overlapping Process Groups
1.1.4 Existing Group Communication Protocols
1.2 Contributions of the Thesis
1.3 Thesis outline
Chapter 2 - Group Communication Protocols and Related Problems
2.1 Synchrony and Group Communication
2.2 The System Model
2.3 Overlapping Process Groups
2.4 Message Order Delivery
2.4.1 Event Ordering in Distributed Systems
2.4.2 Identical Order Delivery
2.4.3 Causal Order Delivery
2.4.4 Total Order Delivery
2.5 Fault-Tolerance
2.6 Related Work
2.6.1 Chang and Maxemchuk's protocol
2.6.2 V System and Amoeba
2.6.3 ISIS protocols
2.6.4 Psync protocol
2.6.5 Trans and Total protocols
2.6.6 Transis protocols
2.6.7 Garcia-Molina and Spauster's protocol..