Efficient checkpoint algorithm for distributed system (original) (raw)

A Survey on Task Checkpointing and Replication based Fault Tolerance in Grid Computing

A grid is a distributed computational and storage environment often composed of heterogeneous autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken in to account. Commonly utilized techniques for providing fault tolerance in distributed systems are periodic job Checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate Checkpointing intervals and replica numbers are chosen. This survey work provides several heuristics that dynamically adapt the above mentioned parameters based on information on grid status to provide high job throughput in the presence of failure while reducing the system overhead. This survey results on experiments are evaluated in a newly developed grid simulation environment SimGrid [2], which allows for easy modeling of dynamic system and job behavior. The workload and system parameters derived from logs that were collected from results have shown that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.

Improving Grid Computing Performance by Optimally Reducing Checkpointing Effect

ArXiv, 2020

Grid computing is a collection of computer resources that are gathered together from various areas to give computational resources such as storage, data or application services. This is to permit clients to access this huge measure of processing resources without the need to know where these might be found and what technology such as, hardware equipment and operating system was used. Dependability and performance are among the key difficulties faced in a grid computing environment. Various systems have been proposed in the literature to handle recouping from resource failure in Grid computing environment. One case of such system is checkpointing. Checkpointing is a system that endures faults when resources failed. Checkpointing method has the upside of lessening the work lost because of resource faults. However, checkpointing presents a huge runtime overhead. In this paper, we propose an improved checkpointing system to bring down runtime overhead. A replica is added to ensure the a...

Checkpointing Based Fault Tolerant Job Scheduling System for Computational Grid

A computational grid environment, due to its heterogeneous, autonomous and dynamic nature is prone to different kinds of faults which may lead to delay in completion of job or even execution of job from starting point. Checkpointing mechanism plays a vital role for making grid more reliable, cost effective and efficient. In this paper, we have proposed schemes based on system checkpointing and application checkpointing. Their performance comparison is done based on the empirical study. The ABSC scheme is suitable for the applications where computations are not intense. But for computationally intense applications where reliability is more important ABAC scheme is more suitable. But this scheme may produce slight overheads in fault free situations and very reliable in faulty situations.

A Survey of Various Fault Tolerance Checkpointing Algorithms in Distributed System

A distributed system is a collection of independent entities that cooperate to solve a problem that cannot be individually solved. Checkpoint is defined as a fault tolerant technique. It is a save state of a process during the failure-free execution, enabling it to restart from this checkpointed state upon a failure to reduce the amount of lost work instead of repeating the computation from beginning. The process of restoring form previous checkpointed state is known as rollback recovery. A checkpoint can be saved on either the stable storage or the volatile storage depending on the failure scenarios to be tolerated. Checkpointing is major challenge in mobile ad hoc network. The mobile ad hoc network architecture is one consisting of a set of self configure mobile hosts(MH) capable of communicating with each other without the assistance of base stations, some of processes running on mobile host. The main issues of this environment are insufficient power and limited storage capacity. This paper surveys the algorithms which have been reported in the literature for checkpointing in distributed systems as well as Mobile Distributed systems. Keywords – Checkpointing, Distributed systems, Fault tolerance, Mobile computing system, Rollback recovery.

Reliable and Efficient Distributed Checkpointing System for Grid Environments

Journal of Grid Computing, 2014

In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as their performance degradation is tolerable. However, unpredictable evictions of guest jobs lead to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such "failures". Today's FGCS systems often use expensive, high-performance dedicated checkpoint servers. However, in geographically distributed clusters, this may incur high checkpoint transfer latencies. In this paper we present a distributed checkpointing system called FALCON that uses available disk resources of the FGCS machines as shared checkpoint repositories. However, an unavailable storage host may lead to loss of checkpoint data. Therefore, we model the failures

The Median Resource Failure Checkpointing

In grid computing, the realization of an enviable fault tolerance ability is linked with the proper utilization of resources and scheduling of jobs. The literature offers two solutions to these two challenging tasks, viz. checkpointing and replication. A checkpointing strategy is being proposed that uses the median of failure intervals of the resources in deciding the checkpoint intervals for the given jobs. The strategy shows improved system throughput, job losses and job execution times while eliminating unnecessary checkpoints.

A hybrid fault tolerance technique in grid computing system

The Journal of Supercomputing, 2011

In order to achieve high level of reliability and availability, the grid infrastructure should be a foolproof fault tolerant. Fault tolerance plays a key role in order to assert availability and reliability of a grid system. Since the failure of resources affects job execution fatally, fault tolerance service is essential to satisfy QoS requirement in grid computing.

Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids

IEEE Transactions on Parallel and Distributed Systems, 2009

A grid is a distributed computational and storage environment often composed of heterogeneous autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly utilized techniques for providing fault tolerance in distributed systems are periodic job checkpointing and

Using replication and checkpointing for reliable task management in computational Grids

2010

In grid computing systems, providing fault-tolerance is required for both scientific computation and file-sharing to increase their reliability. In previous works, several mechanisms were proposed for grid or distributed computing systems. However, some of them used only space redundancy (hardware replication), and others used only time redundancy (checkpointing and rollback). For this reason, the existing mechanisms are inefficient in terms of their resource utilization on grid systems. In this paper, we present ART, which is an Adaptive, Reliable, and fault-Tolerant task management for grid computing environments. The main goal of ART is reducing the number of replications by using checkpointing and rollback scheme for each replication. In ART, the minimum number of replications is adaptively selected based on analysis of probability of successful execution within the given deadline and reliability requirement of each task. Our simulation results show that ART can significantly reduce the number of replications and improve scalability compared with existing mechanisms.