Recovery in Parallel State-Machine Replication (original) (raw)

Fast recovery in parallel state machine replication

2016

A well-established technique used to design fault-tolerant systems is state machine replication. In part, this is explained by the simplicity of the approach and its strong consistency guarantees. The traditional state machine replication model builds on the sequential execution of requests to ensure consistency among the replicas. Sequentiality of execution, however, threatens the scalability of replicas. Recently, some proposals have suggested parallelizing the execution of replicas to achieve higher performance. Despite the success of parallel state machine replication in accomplishing high performance, the implication of such models on the recovery is mostly left unaddressed. Even for the traditional state machine replication approach, relatively few studies have considered the issues involved in recovering faulty replicas. The motivation of this thesis is clarifying the challenges and performance implications involved in checkpointing and recovery for parallel state machine rep...

Checkpointing in Parallel State-Machine Replication

Lecture Notes in Computer Science, 2014

State-machine replication is a popular approach to building fault-tolerant systems, which relies on the sequential execution of commands to guarantee strong consistency. Sequential execution, however, threatens performance. Recently, several proposals have suggested parallelizing the execution model of the replicas to enhance state-machine replication's performance. Despite their success in accomplishing high performance, the implications of these models on checkpointing and recovery is mostly left unaddressed. In this paper, we focus on the checkpointing problem in the context of Parallel State-Machine Replication. We propose two novel algorithms and assess them through simulation and a real implementation.

Rethinking State-Machine Replication for Parallelism

2014 IEEE 34th International Conference on Distributed Computing Systems, 2014

State-machine replication, a fundamental approach to designing fault-tolerant services, requires commands to be executed in the same order by all replicas. Moreover, command execution must be deterministic: each replica must produce the same output upon executing the same sequence of commands. These requirements usually result in single-threaded replicas, which hinders service performance. This paper introduces Parallel State-Machine Replication (P-SMR), a new approach to parallelism in state-machine replication. P-SMR scales better than previous proposals since no component plays a centralizing role in the execution of independent commands-those that can be executed concurrently, as defined by the service. The paper introduces P-SMR, describes a "commodified architecture" to implement it, and compares its performance to other proposals using a key-value store and a networked file system.

Boosting State Machine Replication with Concurrent Execution

2018 Eighth Latin-American Symposium on Dependable Computing (LADC), 2018

State machine replication is a fundamental technique to render services fault tolerant. One of the key assumptions of state machine replication is that replicas must execute operations deterministically. Deterministic execution often translates into sequential execution of requests at replicas. With the increasing demand for dependable services and widespread use of multi-core servers, several proposals for enabling concurrent execution in state machine replication have appeared in the literature. Invariably, these techniques exploit the fact that independent operations, those that do not share any common state or do not update shared state, can execute concurrently. Existing protocols differ in several important ways. In this paper, we survey this field of research and discuss the main aspects of the different protocols. Central aspects include conflict detection, representation and enforcing; tradeoffs involving existing architectures and level of allowed parallelism; workload-driven adaptation schemes; and implications of parallel state machine replication to recovery. Moreover, we discuss ongoing and future work directions for high-throughput state machine replication.

Analysis of Checkpointing Overhead in Parallel State Machine Replication

State machine replication (SMR) is a well-established technique to fault-tolerant systems. In part, this is explained by the simplicity of the approach and its strong consistency guarantees. Recently, several proposals have suggested par-allelizing the execution of state machine replicas to achieve high throughput. Concurrent execution of commands has many implications, including the recovery of replicas from failures. Conventional checkpointing techniques, for example , must be revisited in parallelized models. In this paper, we review parallel variations of state machine replication and discuss how checkpointing procedures apply to these models. Moreover, we evaluate the impact caused by checkpointing techniques on recovery through simulations.

Boosting concurrency in Parallel State Machine Replication

Proceedings of the 20th International Middleware Conference, 2019

State machine replication (SMR) is a well-known approach to implementing fault-tolerant services, providing high availability and strong consistency. To boost the performance of SMR, some proposals execute independent commands concurrently, while dependent commands execute sequentially in the total delivery order. The most general approach to handling command dependencies resorts to a directed acyclic graph (DAG), where nodes represent commands and edges represent dependencies. In this paper we show that due to the command arrival and multithreaded execution rates of SMR, a highly concurrent implementation of a DAG is needed. We show that a typical coarse-grained DAG implementation, where the whole graph is a critical section, results in a bottleneck in the replica. We propose two improvements to the coarse-grained DAG approach: ne-grained algorithms, using lock-coupling, and lock-free algorithms. Our ne-grain algorithms lock individual vertices in the DAG. The lock-free algorithms use nonblocking synchronization, with atomic operations, and lazy synchronization to postpone physical removal of nodes. All algorithms were integrated in a parallel SMR prototype. Experimental evaluation revealed that the ne-grained algorithms are also subject to a bottleneck. The lock-free implementation, however, sports linear speedup with the number of working threads, in some cases scaling up to 64 threads. CCS Concepts • Computer systems organization → Dependable and fault-tolerant systems and networks; • Computing methodologies → Distributed algorithms;

Efficient and Deterministic Scheduling for Parallel State Machine Replication

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017

Many services used in large scale web applications should be able to tolerate faults without impacting their performance. State machine replication is a well-known approach to implementing fault-tolerant services, providing high availability and strong consistency. To boost the performance of state machine replication, recent proposals have introduced parallel execution of commands. In parallel state machine replication, incoming commands may or may not depend on other commands that are waiting for execution. Although dependent commands must be processed in the same relative order at every replica to avoid inconsistencies, independent commands can be executed in parallel and benefit from multi-core architectures. Since many application workloads are mostly composed of independent commands, these parallel models promise high throughput without sacrificing strong consistency. The efficient execution of commands in such environments, however, requires effective scheduling strategies. Existing approaches rely on dependency tracking based on pairwise comparison between commands, which introduces scheduling contention. In this paper, we propose a new and highly efficient scheduler for parallel state machine replication. Our scheduler considers batches of commands, instead of commands individually. Moreover, each batch of commands is augmented with a compact data structure that encodes commands information needed to the dependency analysis. We show, by means of experimental evaluation, that our technique outperforms schedulers for parallel state machine replication by a fairly large margin.

On the Efficiency of Durable State Machine Replication

State Machine Replication (SMR) is a fundamental technique for ensuring the dependability of critical services in modern internet-scale infrastructures. SMR alone does not protect from full crashes, and thus in practice it is employed together with secondary storage to ensure the durability of the data managed by these services. In this work we show that the classical durability enforcing mechanisms -logging, checkpointing, state transfer -can have a high impact on the performance of SMRbased services even if SSDs are used instead of disks. To alleviate this impact, we propose three techniques that can be used in a transparent manner, i.e., without modifying the SMR programming model or requiring extra resources: parallel logging, sequential checkpointing, and collaborative state transfer. We show the benefits of these techniques experimentally by implementing them in an open-source replication library, and evaluating them in the context of a consistent key-value store and a coordination service.

Resilient state machine replication

2005

Abstract Nowadays, one of the major concerns about the services provided over the Internet is related to their availability. Replication is a well known way to increase the availability of a service. However, replication has some associated costs, namely it is necessary to guarantee a correct coordination among the replicas. Moreover, being the Internet such an unpredictable and insecure environment, coordination correctness should be tolerant to Byzantine faults and immune to timing failures.

Reconfiguring Parallel State Machine Replication

2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), 2017

State Machine Replication (SMR) is a well-known technique to implement fault-tolerant systems. In SMR, servers are replicated and client requests are deterministically executed in the same order by all replicas. To improve performance in multi-processor systems, some approaches have proposed to parallelize the execution of non-conflicting requests. Such approaches perform remarkably well in workloads dominated by non-conflicting requests. Conflicting requests introduce expensive synchronization and result in considerable performance loss. Current approaches to parallel SMR define the degree of parallelism statically. However, it is often difficult to predict the best degree of parallelism for a workload and workloads experience variations that change their best degree of parallelism. This paper proposes a protocol to reconfigure the degree of parallelism in parallel SMR on-the-fly. Experiments show the gains due to reconfiguration and shed some light on the behavior of parallel and reconfigurable SMR.