Checkpoint and rollback in asynchronous distributed systems

Concurrent Checkpointing and Recovery in Distributed Systems

2002

The main objective of this paper is to speed up the consistent state restoration of distributed systems. Process recovery uses vector time to address unusual message handling issues and overlapping failures. A single rollback of a non-failed process in response to a single failure has low message complexity. After a failure, the processes required to roll back do so concurrently, which substantially decreases recovery delay.
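The abstract above relies on vector time but does not spell it out. The following is a minimal sketch of standard vector clocks, not the paper's recovery algorithm; the class and function names (`VectorClock`, `happened_before`, `concurrent`) are illustrative assumptions.

```python
from typing import List


class VectorClock:
    """Vector time for a fixed set of n processes (illustrative sketch)."""

    def __init__(self, n: int, pid: int) -> None:
        self.pid = pid
        self.v: List[int] = [0] * n

    def tick(self) -> List[int]:
        """Local event or send: advance this process's own component."""
        self.v[self.pid] += 1
        return list(self.v)

    def receive(self, stamp: List[int]) -> List[int]:
        """Receive: take the component-wise maximum, then tick."""
        self.v = [max(a, b) for a, b in zip(self.v, stamp)]
        return self.tick()


def happened_before(a: List[int], b: List[int]) -> bool:
    """True if timestamp a causally precedes timestamp b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a: List[int], b: List[int]) -> bool:
    """True if neither timestamp causally precedes the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


# Two processes: p0 sends a message, p1 performs a local event and then receives it.
p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
send_stamp = p0.tick()
local_stamp = p1.tick()
recv_stamp = p1.receive(send_stamp)
```

Comparing timestamps in this way lets a recovering process tell whether a message was sent causally before or after a given checkpoint, which is the kind of question a vector-time-based recovery scheme has to answer.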

Significant checkpoint in distributed system

Lecture Notes in Computer Science, 1996

In distributed applications, a group of multiple objects cooperate to achieve some objectives. The objects may suffer from various kinds of faults. If some object o is faulty, o is rolled back to the checkpoint, and objects which have received messages from o are also required to be rolled back. In this paper, on the basis of the message semantics, we define influential messages, whose receivers are required to be rolled back from the application point of view if the senders are rolled back. By using the influential messages, a significant checkpoint is defined to denote a consistent global state of the system, even though it is inconsistent under the traditional definition. We present protocols for taking the significant checkpoint and for rolling back the objects.

Quantifying rollback propagation in distributed checkpointing

Journal of Parallel and Distributed Computing, 2004

This paper proposes a new classification of executions with checkpoints based on the amount of rollback during recovery. Specifically, an execution is k-rollback, if k indicates the maximal number of checkpoints that have to be rolled back. It is shown that coordinated checkpointing, SZPF, and ZPF are 1-rollback, while ZCF is (n - 1)-rollback, where n is the number of participants in an execution. A new class of executions, called d-bounded cycles (in short, d-BC), is introduced, and is shown to be ((n - 1) · d)-rollback (ZCF is a special case of d-BC for d = 1). Finally, a protocol is presented whose executions are d-bounded cycles. A nice property of this protocol is that it does not impose any control information overhead on application messages, yet sends only a few control messages of its own. Moreover, the protocol maintains information that enables very efficient discovery of a recent recovery line that existed shortly before the failure.

Concurrent checkpoint initiation and recovery algorithms on asynchronous ring network

Journal of Parallel and Distributed Computing, 2004

Checkpointing with rollback recovery is a well-known method for achieving fault-tolerance in distributed systems. In this work, we introduce algorithms for checkpointing and rollback recovery on asynchronous unidirectional and bi-directional ring networks. The proposed checkpointing algorithms can handle multiple concurrent initiations by different processes. While taking checkpoints, processes do not have to take into consideration any application message dependency. The synchronization is achieved by passing control messages among the processes. Application messages are acknowledged. Each process maintains a list of unacknowledged messages. Here we use a logical checkpoint, which is a standard checkpoint (i.e., snapshot of the process) plus a list of messages that have been sent by this process but are unacknowledged at the time of taking the checkpoint. The worst case message complexity of the proposed checkpointing algorithm is O(kn) when k initiators initiate concurrently. The time complexity is O(n). For the recovery algorithm, time and message complexities are both O(n).
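The logical checkpoint described above (a process snapshot plus the list of sent but unacknowledged messages) can be sketched as a small data structure. This is an illustration under stated assumptions, not the paper's ring algorithm: the `Process` and `LogicalCheckpoint` names, the integer message ids, and the dictionary-based state are all hypothetical.

```python
import copy
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class LogicalCheckpoint:
    state: Dict[str, Any]                 # snapshot of the process state
    unacked: List[Tuple[int, int, Any]]   # (msg_id, destination, payload)


class Process:
    def __init__(self, pid: int) -> None:
        self.pid = pid
        self.state: Dict[str, Any] = {}
        self.unacked: Dict[int, Tuple[int, Any]] = {}  # msg_id -> (dest, payload)
        self._next_id = 0  # hypothetical per-process message counter

    def send(self, dest: int, payload: Any) -> int:
        """Record an application message as unacknowledged and return its id."""
        msg_id = self._next_id
        self._next_id += 1
        self.unacked[msg_id] = (dest, payload)
        return msg_id

    def on_ack(self, msg_id: int) -> None:
        """Drop a message from the unacknowledged list once it is acknowledged."""
        self.unacked.pop(msg_id, None)

    def take_checkpoint(self) -> LogicalCheckpoint:
        """Snapshot the state together with the currently unacknowledged messages."""
        pending = [(m, d, copy.deepcopy(p)) for m, (d, p) in sorted(self.unacked.items())]
        return LogicalCheckpoint(copy.deepcopy(self.state), pending)

    def rollback(self, cp: LogicalCheckpoint) -> None:
        """Restore both the state and the unacknowledged list from a checkpoint."""
        self.state = copy.deepcopy(cp.state)
        self.unacked = {m: (d, copy.deepcopy(p)) for m, d, p in cp.unacked}


proc = Process(0)
proc.state["x"] = 1
first = proc.send(1, "m1")
second = proc.send(2, "m2")
proc.on_ack(first)
cp = proc.take_checkpoint()   # holds state {"x": 1} and the pending message "m2"
proc.state["x"] = 99
proc.on_ack(second)
proc.rollback(cp)
```

Keeping the unacknowledged messages inside the checkpoint means the saved record still knows which messages were pending when it was taken, without tracking application message dependencies.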

Self-stabilizing algorithm for checkpointing in a distributed system

Journal of Parallel and Distributed Computing, 2007

If the variables used for a checkpointing algorithm have data faults, the existing checkpointing and recovery algorithms may fail. In this paper, self-stabilizing data fault detecting and correcting, checkpointing, and recovery algorithms are proposed in a ring topology. The proposed data fault detection and correction algorithms can handle data faults, at most one per process but in any number of processes. The proposed checkpointing algorithm can deal with concurrent multiple initiations of checkpointing and data faults. A process can recover from a fault using the proposed recovery algorithm, in spite of multiple data faults present in the system. All the proposed algorithms converge in O(n) steps, where n is the number of processes. The algorithm can be extended to work for general topologies too.

Checkpoint Interval and System's Overall Quality for Message Logging-Based Rollback and Recovery in Distributed and Embedded Computing

2009 International Conference on Embedded Software and Systems, 2009

In a distributed environment, message-logging-based checkpointing and rollback recovery is a commonly used approach for providing distributed systems with fault tolerance and synchronized global states. Clearly, taking checkpoints more frequently reduces system recovery time in the presence of faults, and hence improves system availability; however, more frequent checkpointing may also increase the probability that a task misses its deadlines, or prolong its execution time in fault-free scenarios. Hence, in distributed and real-time computing, the system's overall quality must be measured by a set of aggregated criteria, such as availability, task execution time, and task deadline miss probability. In this paper, we take into account state synchronization costs in the checkpointing and rollback recovery scheme and quantitatively analyze the relationships between checkpoint intervals and these criteria. Based on the analytical results, we present an algorithm for finding an optimal checkpoint interval that maximizes the system's overall quality.

Design and analysis of an integrated checkpointing and recovery scheme for distributed applications

2000

An integrated checkpointing and recovery scheme which exploits the low latency and high coverage characteristics of a concurrent error detection scheme is presented. Message dependency, which is the main source of multi-step rollback in distributed systems, is minimized by using a new message validation technique derived from the notion of concurrent error detection. The concept of a new global state matrix is introduced to track error checking and message dependency in a distributed system and assist in the recovery. The analytical model, algorithms and data structures to support an easy implementation of the new scheme are presented. The completeness and correctness of the algorithms are proved. A number of scenarios and illustrations that give the details of the analytical model are presented. The benefits of the integrated checkpointing scheme are quantified by means of simulation using an object-oriented test framework.

Characterization of consistent global checkpoints in large-scale distributed systems

1995

Backward error recovery is one of the most widely used schemes to ensure fault tolerance in distributed systems. Upon the occurrence of a failure, it consists in restoring a distributed computation to an error-free global state from which it can be resumed to produce correct behaviour. Checkpointing is one of the techniques used to implement backward error recovery.
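Restoring an error-free global state requires that the chosen local checkpoints be mutually consistent. A minimal sketch of the usual criterion follows: a set of checkpoints is consistent if no message is recorded as received without also being recorded as sent (no orphan messages). The event-position model and the names `Message`, `orphans` and `is_consistent` are assumptions made for illustration, not the paper's characterization.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    sender: int     # id of the sending process
    send_pos: int   # position of the send event in the sender's local history
    receiver: int   # id of the receiving process
    recv_pos: int   # position of the receive event in the receiver's local history


def orphans(cut: Dict[int, int], messages: List[Message]) -> List[Message]:
    """Messages received at or before the receiver's checkpoint but sent after the sender's.

    cut maps each process id to the number of local events its checkpoint covers.
    """
    return [
        m for m in messages
        if m.recv_pos <= cut[m.receiver] and m.send_pos > cut[m.sender]
    ]


def is_consistent(cut: Dict[int, int], messages: List[Message]) -> bool:
    """A global checkpoint is consistent when it contains no orphan message."""
    return not orphans(cut, messages)


# Process 0 sends at its 3rd event; process 1 receives at its 2nd event.
history = [Message(sender=0, send_pos=3, receiver=1, recv_pos=2)]

bad_cut = {0: 2, 1: 2}    # receive is included, send is not: orphan message
good_cut = {0: 3, 1: 2}   # both send and receive are included
early_cut = {0: 2, 1: 1}  # neither event is included
```

With `bad_cut`, rolling back would leave process 1 remembering a message that process 0 has not yet sent, which is exactly the situation a consistent global checkpoint rules out.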

A Survey of Various Fault Tolerance Checkpointing Algorithms in Distributed System

A distributed system is a collection of independent entities that cooperate to solve a problem that cannot be solved individually. Checkpointing is a fault tolerance technique. A checkpoint is a saved state of a process taken during failure-free execution, enabling the process to restart from this checkpointed state upon a failure and so reduce the amount of lost work, instead of repeating the computation from the beginning. The process of restoring from a previous checkpointed state is known as rollback recovery. A checkpoint can be saved on either stable storage or volatile storage, depending on the failure scenarios to be tolerated. Checkpointing is a major challenge in mobile ad hoc networks. A mobile ad hoc network consists of a set of self-configuring mobile hosts (MH) capable of communicating with each other without the assistance of base stations, with some of the processes running on mobile hosts. The main issues of this environment are insufficient power and limited storage capacity. This paper surveys the algorithms which have been reported in the literature for checkpointing in distributed systems as well as mobile distributed systems. Keywords: Checkpointing, Distributed systems, Fault tolerance, Mobile computing system, Rollback recovery.