Multiple Distributed Checkpoints over Unreliable Channels

Significant checkpoint in distributed system

Lecture Notes in Computer Science, 1996

In distributed applications, a group of objects cooperates to achieve some objective. The objects may suffer from various kinds of faults. If some object o is faulty, o is rolled back to its checkpoint, and objects which have received messages from o must also be rolled back. In this paper, on the basis of message semantics, we define influential messages, whose receivers must be rolled back from the application point of view if the senders are rolled back. Using influential messages, a significant checkpoint is defined to denote a global state of the system that is consistent from the application point of view even though it may be inconsistent under the traditional definition. We present protocols for taking significant checkpoints and for rolling back the objects.
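The rollback rule described in this abstract can be illustrated with a minimal sketch. It assumes each message already carries an `influential` flag (the paper derives this from message semantics, which is not reproduced here) and that every listed message was sent after the sender's checkpoint. The function name `rollback_set` and the tuple layout are hypothetical, chosen only for illustration.

```python
from collections import deque


def rollback_set(faulty, messages, influential_only=True):
    """Return the set of objects that roll back when `faulty` rolls back.

    `messages` is a list of (sender, receiver, influential) triples.
    The `influential` flag is an assumed input, not computed here.
    """
    to_roll_back = {faulty}
    queue = deque([faulty])
    while queue:
        current = queue.popleft()
        for sender, receiver, influential in messages:
            if sender != current or receiver in to_roll_back:
                continue
            # Traditional rule: every receiver rolls back.
            # Semantic rule: only receivers of influential messages do.
            if influential or not influential_only:
                to_roll_back.add(receiver)
                queue.append(receiver)
    return to_roll_back


messages = [
    ("a", "b", True),   # influential: b depends on a's state
    ("a", "c", False),  # not influential: c need not follow a
    ("c", "d", True),
]
print(sorted(rollback_set("a", messages)))
print(sorted(rollback_set("a", messages, influential_only=False)))
```

With the influential-only rule, a fault in `a` rolls back `a` and `b`; with the traditional rule, all four objects roll back.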

Efficient checkpointing procedures for fault tolerant distributed systems

Microprocessing and Microprogramming, 1994

A classical approach for achieving fault tolerance in distributed systems is based on the incorporation of efficient and fault tolerant procedures for checkpointing and recovery in such systems. We propose two checkpointing procedures, which can be initiated by any process in the system or upon failure of one or more component processes. Our procedures return the most recent and consistent checkpoints for the processes initiating the procedure, and do not interfere with the progress of the distributed system application. Furthermore, our procedures guarantee that a consistent checkpoint will be obtained when they terminate. Examples illustrating the application of the procedures are also provided.

Efficient message logging for uncoordinated checkpointing protocols

Lecture Notes in Computer Science, 1996

A message is in-transit with respect to a global state if its sending is recorded in this global state, while its receipt is not. Checkpointing algorithms have to log such in-transit messages in order to restore the state of channels when a computation has to be resumed from a consistent global state after a failure has occurred. Coordinated checkpointing algorithms log exactly those in-transit messages on stable storage. Because of their lack of synchronization, uncoordinated checkpointing algorithms conservatively log more messages.
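The definition of an in-transit message can be made concrete with a small sketch. It assumes each process's local history is a numbered sequence of events and that a global state is given as a cut mapping each process to the number of events it has recorded. The `Message` type, the `classify` function, and the sample data are hypothetical illustrations, not taken from the paper. The sketch also reports orphan messages (received but not sent in the cut), whose presence makes a global state inconsistent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    msg_id: str
    sender: str
    send_index: int             # position of the send event at the sender
    receiver: str
    recv_index: Optional[int]   # position of the receive event, None if never received


def classify(messages, cut):
    """Return (in_transit, orphans) message ids for the given cut.

    `cut[p]` is the number of local events of process p recorded in the
    global state, so event i of p is recorded iff i < cut[p].
    """
    in_transit, orphans = [], []
    for m in messages:
        sent = m.send_index < cut[m.sender]
        received = m.recv_index is not None and m.recv_index < cut[m.receiver]
        if sent and not received:
            in_transit.append(m.msg_id)   # must be logged to restore the channel
        elif received and not sent:
            orphans.append(m.msg_id)      # makes the cut inconsistent
    return in_transit, orphans


messages = [
    Message("m1", "p", 0, "q", 0),
    Message("m2", "p", 1, "q", 2),
    Message("m3", "q", 1, "p", 3),
]
print(classify(messages, {"p": 2, "q": 2}))
print(classify(messages, {"p": 1, "q": 3}))
```

In the first cut, `m2` and `m3` are in transit and there are no orphans, so the cut is consistent. In the second, `m2` is an orphan, so that cut is not a consistent global state.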

Coordinated checkpointing without direct coordination

1998

Coordinated checkpointing is a well-known method to achieve fault tolerance in distributed systems. Long running parallel applications and high-availability applications are two potential users of checkpointing, although with different requirements. Parallel applications need low failure-free overheads, and high-availability applications require fast and bounded recoveries. In this paper we describe a new coordinated checkpoint protocol capable of satisfying both types of applications.

Concurrent Checkpointing and Recovery in Distributed Systems

2002

The main objective of this paper is to speed up the consistent state restoration of distributed systems. Process recovery uses vector time to address unusual message handling issues and overlapping failures. A single rollback of a non-failed process in response to a single failure has low message complexity. After a failure, the processes required to roll back do so concurrently, which substantially decreases recovery delay.
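The abstract does not say how vector time is applied during recovery, so the following is only a sketch of the standard vector clock rules it presumably builds on. The class and function names are hypothetical. The comparison functions show how a recovering process could tell whether one event causally precedes another, which is the kind of question rollback decisions depend on.

```python
class VectorClock:
    """Standard vector clock for process `pid` in a system of `n` processes."""

    def __init__(self, n, pid):
        self.pid = pid
        self.clock = [0] * n

    def tick(self):
        # Local event (including a send): advance own component.
        self.clock[self.pid] += 1
        return list(self.clock)

    def receive(self, stamp):
        # Merge the sender's timestamp component-wise, then advance own component.
        self.clock = [max(a, b) for a, b in zip(self.clock, stamp)]
        self.clock[self.pid] += 1
        return list(self.clock)


def happened_before(u, v):
    """True if timestamp u causally precedes timestamp v."""
    return u != v and all(a <= b for a, b in zip(u, v))


def concurrent(u, v):
    """True if neither timestamp causally precedes the other."""
    return u != v and not happened_before(u, v) and not happened_before(v, u)


p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
send_stamp = p0.tick()               # p0 sends a message
local_stamp = p1.tick()              # p1 performs an unrelated local event
recv_stamp = p1.receive(send_stamp)  # p1 receives p0's message
print(send_stamp, local_stamp, recv_stamp)
```

Here the send causally precedes the receive, while the send and the earlier local event on `p1` are concurrent.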

Characterization of consistent global checkpoints in large-scale distributed systems

1995

Backward error recovery is one of the most widely used schemes to ensure fault tolerance in distributed systems. Upon the occurrence of a failure, it restores a distributed computation to an error-free global state from which it can be resumed to produce correct behaviour. Checkpointing is one of the techniques used to implement backward error recovery.

Checkpoint and rollback in asynchronous distributed systems

Proceedings of INFOCOM '97, 1997

This paper proposes a novel algorithm for taking checkpoints and rolling back the processes for recovery in asynchronous distributed systems. The algorithm has the following properties: (1) Multiple processes can simultaneously initiate the checkpointing.

Consistency of Distributed System with Active Initiator Process Without Useless Checkpoints

International Journal of Computing, 2014

Checkpointing is one of the most attractive approaches for providing software fault tolerance in distributed message-passing systems. This paper aims to implement a distributed checkpointing technique that eliminates the drawbacks of the centralized approach, such as the “domino effect”, “useless checkpoints” (checkpoints that do not contribute to global consistency), and “hidden and zigzag” dependencies. The proposed checkpointing protocol has a checkpoint initiator, but coordination among the local checkpoints is done in a distributed fashion. The guarantee that no message is lost when a failure occurs is maintained in this work by exchanging information among the processes. There is no central checkpoint initiator, however; instead, the processes take turns acting as the initiator. Processes take local checkpoints only after being notified by the initiator. The processes synchronize their activities of the current checkpointing interval before finally co...

Object-based checkpoints in distributed systems

Proceedings Third International Workshop on Object-Oriented Real-Time Dependable Systems, 1997

In distributed applications, multiple objects cooperate to achieve some objective. The objects may suffer from various kinds of faults. If some object o is faulty, o is rolled back to its checkpoint, and objects which have received messages from o must also be rolled back. In this paper, we define influential messages, whose receivers must be rolled back if the senders are rolled back, in the object-based computation model. Using influential messages, object-based (O) checkpoints are defined to denote semantically consistent global states of the system that may be inconsistent under the traditional message-based definition. We show how much the number of checkpoints can be reduced by taking only the O-checkpoints.

A Scalable Communication-Induced Checkpointing Algorithm for Distributed Systems

IEICE Transactions on Information and Systems, 2013

Communication-induced checkpointing (CIC) has two main advantages: first, it allows processes in a distributed computation to take asynchronous checkpoints, and second, it avoids the domino effect. To achieve these, CIC algorithms piggyback information on the application messages and take forced local checkpoints when they recognize potentially dangerous patterns. The main disadvantages of CIC algorithms are the amount of overhead per message and the induced storage overhead. In this paper we present a communication-induced checkpointing algorithm called Scalable Fully-Informed (S-FI) that attacks the problem of message overhead. To this end, our algorithm modifies the Fully-Informed algorithm by integrating it with the immediate dependency principle. The S-FI algorithm was simulated, and the results show that the algorithm is scalable, since the message overhead grows sublinearly as the number of processes and/or the message density increases.
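The general CIC mechanism of piggybacking control information and taking forced checkpoints can be sketched with a simple index-based rule. This is not the S-FI or Fully-Informed algorithm, whose details are not given in the abstract. It assumes that each process piggybacks only its current checkpoint sequence number and takes a forced checkpoint before delivering any message that carries a larger number. All names are hypothetical.

```python
class Process:
    """Process using a simple index-based CIC rule (an illustrative assumption)."""

    def __init__(self, name):
        self.name = name
        self.sn = 0                            # current checkpoint sequence number
        self.checkpoints = [(0, "initial")]
        self.delivered = []

    def basic_checkpoint(self):
        # Checkpoint taken independently, on the process's own schedule.
        self.sn += 1
        self.checkpoints.append((self.sn, "basic"))

    def send(self, payload):
        # Piggyback the sequence number on the application message.
        return {"payload": payload, "sn": self.sn}

    def receive(self, message):
        # Forced checkpoint BEFORE delivery if the sender is ahead, so that
        # checkpoints with equal sequence numbers stay mutually consistent.
        if message["sn"] > self.sn:
            self.sn = message["sn"]
            self.checkpoints.append((self.sn, "forced"))
        self.delivered.append(message["payload"])


p, q = Process("p"), Process("q")
p.basic_checkpoint()
q.receive(p.send("x"))   # q is behind, so it takes a forced checkpoint first
p.receive(q.send("y"))   # p is not behind, so no forced checkpoint
print(q.checkpoints, p.checkpoints)
```

The per-message overhead here is a single integer. Algorithms that piggyback richer dependency information can avoid some forced checkpoints at the cost of larger messages, which is the trade-off the abstract refers to.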