Performance comparison of checkpoint and recovery protocols

Finding a suitable checkpoint and recovery protocol for a distributed application

Journal of Parallel and Distributed Computing, 2006

Checkpoint and recovery protocols are commonly used in distributed applications for providing fault tolerance. The performance of a checkpoint and recovery protocol is judged by the amount of computation it can save against the amount of overhead it incurs. This performance depends on different system and application characteristics, as well as protocol-specific parameters. Hence, no single checkpoint and recovery protocol works equally well for all applications, and given a distributed application and a system it will run on, it is important to choose a protocol that will give the best performance for that system and application. In this paper, we present a scheme to automatically identify a suitable checkpoint and recovery protocol for a given distributed application running on a given system. The scheme involves a novel technique for finding the similarity between the communication patterns of two distributed applications, which is also of independent interest. The similarity measure is based on a graph similarity problem. We present a heuristic for the graph similarity problem. Extensive experimental results are shown both for the graph similarity heuristic and the automatic identification scheme to show that an appropriate checkpoint and recovery protocol can be chosen automatically for a given application.

Design and analysis of an integrated checkpointing and recovery scheme for distributed applications

2000

An integrated checkpointing and recovery scheme which exploits the low latency and high coverage characteristics of a concurrent error detection scheme is presented. Message dependency, which is the main source of multi-step rollback in distributed systems, is minimized by using a new message validation technique derived from the notion of concurrent error detection. The concept of a new global state matrix is introduced to track error checking and message dependency in a distributed system and assist in the recovery. The analytical model, algorithms and data structures to support an easy implementation of the new scheme are presented. The completeness and correctness of the algorithms are proved. A number of scenarios and illustrations that give the details of the analytical model are presented. The benefits of the integrated checkpointing scheme are quantified by means of simulation using an object-oriented test framework.

Evaluating distributed checkpointing protocols

Proceedings of the 23rd International Conference on Distributed Computing Systems, 2003

This paper presents an objective measure, called overhead ratio, for evaluating distributed checkpointing protocols. This measure extends previous evaluation schemes by incorporating several additional parameters that are inherent in distributed environments. In particular, we take into account the rollback propagation of the protocol, which impacts the length of the recovery process, and therefore the expected program run-time in executions that involve failures and recoveries. The paper also analyzes several known protocols and compares their overhead ratio.

A Survey of Various Fault Tolerance Checkpointing Algorithms in Distributed System

A distributed system is a collection of independent entities that cooperate to solve a problem that cannot be individually solved. Checkpointing is a fault tolerance technique. A checkpoint is a saved state of a process taken during failure-free execution, enabling the process to restart from this checkpointed state upon a failure, which reduces the amount of lost work compared with repeating the computation from the beginning. The process of restoring from a previous checkpointed state is known as rollback recovery. A checkpoint can be saved on either stable storage or volatile storage, depending on the failure scenarios to be tolerated. Checkpointing is a major challenge in mobile ad hoc networks. A mobile ad hoc network consists of a set of self-configuring mobile hosts (MH) capable of communicating with each other without the assistance of base stations, with some processes running on the mobile hosts. The main issues in this environment are insufficient power and limited storage capacity. This paper surveys the algorithms which have been reported in the literature for checkpointing in distributed systems as well as mobile distributed systems. Keywords: Checkpointing, Distributed systems, Fault tolerance, Mobile computing system, Rollback recovery.

A fast restart mechanism for checkpoint/recovery protocols in networked environments

2008

Checkpoint/recovery has been studied extensively, and various optimization techniques have been presented for its improvement. Regardless of the considerable research efforts, little work has been done on improving its restart latency. The time spent on retrieving and loading the checkpoint image during a recovery is non-trivial, especially in networked environments. With the ever-increasing application memory footprint and system failure rate, it is becoming more of an issue. In this paper, we present a Fast REstart Mechanism called FREM. It allows fast restart of a failed process without requiring the availability of the entire checkpoint image. By dynamically tracking the process data accesses after each checkpoint, FREM masks restart latency by overlapping the computation of the resumed process with the retrieval of its checkpoint image. We have implemented FREM with the BLCR checkpointing tool in Linux systems. Our experiments with the SPEC benchmarks indicate that it can effectively reduce restart latency by 61.96% on average in networked environments.

Concurrent Checkpointing and Recovery in Distributed Systems

2002

The main objective of this paper is to speed up the consistent state restoration of distributed systems. Process recovery uses vector time to address unusual message handling issues and overlapping failures. A single rollback of a non-failed process in response to a single failure has low message complexity. After a failure, the processes required to roll back do so concurrently, which substantially decreases recovery delay.
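The abstract does not say how vector time is maintained. As a minimal sketch of the general idea, assuming the standard vector clock update rules rather than this paper's specific recovery algorithm, the following shows how vector timestamps let a process decide whether one event causally precedes another. The class and function names are illustrative only.

```python
from typing import List


class VectorClock:
    """Standard vector clock for a fixed set of n processes (illustrative)."""

    def __init__(self, n: int, pid: int) -> None:
        self.pid = pid
        self.clock: List[int] = [0] * n

    def tick(self) -> List[int]:
        # A local event (or a send) advances only this process's own entry.
        self.clock[self.pid] += 1
        return list(self.clock)

    def send(self) -> List[int]:
        # The returned timestamp is attached to the outgoing message.
        return self.tick()

    def receive(self, stamp: List[int]) -> List[int]:
        # Merge with the sender's timestamp, then count the receive event.
        self.clock = [max(a, b) for a, b in zip(self.clock, stamp)]
        return self.tick()


def happened_before(a: List[int], b: List[int]) -> bool:
    # a -> b iff a is componentwise <= b and the two differ somewhere.
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a: List[int], b: List[int]) -> bool:
    # Neither event causally precedes the other.
    return a != b and not happened_before(a, b) and not happened_before(b, a)
```

In a rollback setting, a comparison of this kind is one way to tell whether a saved state depends on a message sent after the failed process's restored state; how the paper itself uses vector time is not described in the abstract.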

Efficient checkpointing procedures for fault tolerant distributed systems

Microprocessing and Microprogramming, 1994

A classical approach for achieving fault tolerance in distributed systems is based on the incorporation of efficient and fault tolerant procedures for checkpointing and recovery in such systems. We propose two checkpointing procedures, which can be initiated by any process in the system or upon failure of one or more component processes. Our procedures return the most recent and consistent checkpoints for the processes initiating the procedure, and do not interfere with the progress of the distributed system application. Furthermore, our procedures guarantee that a consistent checkpoint will be obtained when they terminate. Examples illustrating the application of the procedures are also provided.

A Comparison between Different Checkpoint Schemes with Advantages and Disadvantages

IJCA Proceedings on National Seminar on Recent Advances in Wireless Networks and Communications, 2014

It is known that checkpointing and rollback recovery are widely used techniques that allow a distributed computation to progress in spite of a failure. There are two fundamental approaches to checkpointing and recovery. In the asynchronous approach, processes take their checkpoints independently. Taking checkpoints is therefore very simple, but the absence of a recent consistent global checkpoint may cause a rollback of the computation. The synchronous checkpointing approach assumes that a single process other than the application process invokes the checkpointing algorithm periodically to determine a consistent global checkpoint.
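To make the notion of a consistent global checkpoint concrete, here is a minimal sketch assuming the usual definition: a set of per-process checkpoints is consistent if it contains no orphan message, that is, no message whose receipt is recorded but whose send is not. The `Message` type, the event-index representation, and the function names are hypothetical and not taken from the paper.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    sender: str
    send_time: int   # local event index of the send at the sender
    receiver: str
    recv_time: int   # local event index of the receive at the receiver


def orphan_messages(cut: Dict[str, int], messages: List[Message]) -> List[Message]:
    """Return messages received inside the cut but sent outside it.

    cut[p] is the index of the last local event covered by p's checkpoint.
    """
    return [
        m for m in messages
        if m.recv_time <= cut[m.receiver] and m.send_time > cut[m.sender]
    ]


def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    # A global checkpoint is consistent when it has no orphan messages.
    return not orphan_messages(cut, messages)
```

With independently taken checkpoints, a check like this can fail for the most recent checkpoints, which forces processes to fall back to older ones. A coordinated protocol avoids that by construction.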

A Review on Evaluation of Multilevel Checkpointing System in Distributed Environment

2015

Nowadays there is a need for high-performance computer systems in distributed environments. As the system mean time before failure correspondingly drops, applications must checkpoint frequently to make progress. However, at scale, the cost of checkpointing becomes prohibitive. A solution to this problem is multilevel checkpointing, which employs multiple types of checkpoints in a single run. Lightweight checkpoints can handle the most common failure modes, while more expensive checkpoints can handle severe failures. The work also uses a multilevel checkpointing library, the Scalable Checkpoint/Restart (SCR) library [1], which writes lightweight checkpoints to node-local storage in addition to the parallel file system, and presents probabilistic Markov models of SCR's performance. The proposed work focuses on evaluation of multiple checkpointing in the distributed environment in the presence of multiple senders and multiple receivers.
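The two-level idea can be sketched as follows, under stated assumptions: a hypothetical schedule writes a lightweight checkpoint every `light_every` steps and an expensive one every `heavy_every` steps, and a severe failure is assumed to destroy all lightweight checkpoints. This is not the SCR library's API; the function names, the level labels, and the failure model are illustrative only.

```python
from typing import List, Optional, Tuple


def checkpoint_level(step: int, light_every: int, heavy_every: int) -> Optional[str]:
    """Decide which kind of checkpoint, if any, to take at this step."""
    if step > 0 and step % heavy_every == 0:
        return "heavy"   # e.g. written to the parallel file system
    if step > 0 and step % light_every == 0:
        return "light"   # e.g. written to node-local storage
    return None


def restart_step(history: List[Tuple[int, str]], severe: bool) -> int:
    """Pick the latest checkpoint that survives the failure.

    Assumption: a severe failure loses every lightweight checkpoint,
    so only heavy checkpoints remain usable. Returns 0 (restart from
    the beginning) when nothing usable exists.
    """
    usable = [step for step, level in history if level == "heavy" or not severe]
    return max(usable, default=0)
```

With `light_every=5` and `heavy_every=20`, a run of 25 steps can resume from step 25 after a common failure but only from step 20 after a severe one, which is the trade-off the abstract describes.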

Identification of Critical Factors in Checkpointing Based Multiple Fault Tolerance for Distributed System

Journal of Emerging Trends in Computing and …, 2010

The performance of checkpointing-based multiple fault tolerance is low. The main reason is the overhead associated with checkpointing. A checkpointing algorithm can be improved through a better storage strategy and better checkpoint scheduling, both of which reduce the overhead associated with checkpointing. Performance and efficiency are the most desirable features of checkpointing-based recovery. In this paper, the critical issues involved in fast and efficient checkpointing-based recovery are discussed. The impact of each issue on the performance of checkpointing-based recovery is also discussed, and relationships among the issues are explored. Finally, coordinated and uncoordinated checkpointing are compared with respect to these issues.

