Using replication and checkpointing for reliable task management in computational Grids

A Survey on Task Checkpointing and Replication based Fault Tolerance in Grid Computing

A grid is a distributed computational and storage environment, often composed of heterogeneous, autonomously managed subsystems. As a result, varying resource availability is commonplace and often causes loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly used techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This survey presents several heuristics that dynamically adapt these parameters based on information about grid status, to provide high job throughput in the presence of failures while reducing system overhead. The surveyed experiments are evaluated in a newly developed grid simulation environment, SimGrid [2], which allows easy modeling of dynamic system and job behavior. With workload and system parameters derived from collected logs, the results show that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.

Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids

IEEE Transactions on Parallel and Distributed Systems, 2009

Improving Grid Computing Performance by Optimally Reducing Checkpointing Effect

ArXiv, 2020

Grid computing is a collection of computer resources gathered from various locations to provide computational resources such as storage, data, or application services. It lets clients access this large pool of processing resources without needing to know where the resources are located or what technology, such as hardware and operating system, is used. Dependability and performance are among the key difficulties faced in a grid computing environment. Various systems have been proposed in the literature to handle recovery from resource failure in a grid computing environment. One such system is checkpointing. Checkpointing is a technique that tolerates faults when resources fail. It has the advantage of reducing the work lost because of resource faults. However, checkpointing introduces a large runtime overhead. In this paper, we propose an improved checkpointing system to lower runtime overhead. A replica is added to ensure the a...

Fault Tolerance In Grid Computing: State of the Art and Open Issues

International Journal of Computer Science & Engineering Survey, 2011

Fault tolerance is an important property for large-scale computational grid systems, where geographically distributed nodes cooperate to execute a task. To achieve a high level of reliability and availability, the grid infrastructure should be foolproof and fault tolerant. Since the failure of resources affects job execution fatally, a fault tolerance service is essential to satisfy QoS requirements in grid computing. Commonly used techniques for providing fault tolerance are job checkpointing and replication. Both techniques reduce the amount of work lost due to changing system availability but can introduce significant runtime overhead. This overhead largely depends on the length of the checkpointing interval and the chosen number of replicas, respectively. In complex scientific workflows, where tasks execute in a well-defined order, reliability is another major challenge because of the unreliable nature of grid resources.
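The dependence of overhead on the checkpointing interval can be illustrated with a small sketch. It uses Young's classic first-order approximation for the interval, which is not taken from the abstract above, and assumes failures arrive with a known mean time between failures and that each checkpoint has a fixed cost. The function and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate checkpoint interval, sqrt(2 * C * M) (Young's approximation).

    Assumption: checkpoint cost C is much smaller than the MTBF M.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order overhead per unit of useful work.

    First term: time spent writing checkpoints.
    Second term: expected rework, about half an interval lost per failure.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    cost, mtbf = 50.0, 10_000.0  # illustrative values, in seconds
    best = young_interval(cost, mtbf)
    for interval in (best / 2, best, best * 2):
        print(f"interval={interval:7.1f}s overhead={overhead_fraction(interval, cost, mtbf):.3f}")
```

Under this simple model, intervals both shorter and longer than the approximate optimum increase overhead, which is the trade-off an adaptive scheme would try to track as failure rates change.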

Fault Tolerant Scheduling Strategy for Computational Grid Environment

Computational grids have the potential for solving large-scale scientific applications using heterogeneous and geographically distributed resources. In addition to the challenges of managing and scheduling these applications, reliability challenges arise because of the unreliable nature of grid infrastructure. Two major problems that are critical to the effective utilization of computational resources are efficient scheduling of jobs and providing fault tolerance in a reliable manner. This paper addresses these problems by combining the checkpoint replication based fault tolerance mechanism with the Minimum Total Time to Release (MTTR) job scheduling algorithm. TTR includes the service time of the job, the waiting time in the queue, and the transfer of input and output data to and from the resource. The MTTR algorithm minimizes the TTR by selecting a computational resource based on job requirements, job characteristics and hardware features of the resources. The fault tolerance mechanism used here s...
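The TTR components named in the abstract can be sketched as a simple selection rule. This is only an illustration under assumed inputs, not the paper's MTTR algorithm: the `Resource` fields, the units, and the way service time is estimated from job length and resource speed are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    speed_mips: float       # hypothetical: processing speed
    queue_wait_s: float     # hypothetical: current expected queue wait
    bandwidth_mb_s: float   # hypothetical: link bandwidth to the submitter


def total_time_to_release(job_mi: float, input_mb: float, output_mb: float, r: Resource) -> float:
    """TTR = service time + queue waiting time + input/output transfer time."""
    service = job_mi / r.speed_mips
    transfer = (input_mb + output_mb) / r.bandwidth_mb_s
    return service + r.queue_wait_s + transfer


def select_resource(job_mi: float, input_mb: float, output_mb: float,
                    resources: list[Resource]) -> Resource:
    """Pick the resource with the smallest estimated TTR."""
    return min(resources, key=lambda r: total_time_to_release(job_mi, input_mb, output_mb, r))


if __name__ == "__main__":
    pool = [
        Resource("fast-but-busy", speed_mips=1000.0, queue_wait_s=300.0, bandwidth_mb_s=10.0),
        Resource("slow-but-idle", speed_mips=500.0, queue_wait_s=0.0, bandwidth_mb_s=10.0),
    ]
    print(select_resource(100_000.0, 100.0, 100.0, pool).name)
```

With these illustrative numbers the idle resource wins even though it is slower, because queue waiting time counts toward TTR.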

Application-Level Fault-Tolerance Solutions for Grid Computing

2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), 2008

One of the key functionalities provided by Grid systems is the remote execution of applications over the Grid. We present a research proposal on fault-tolerance mechanisms for the execution of message-passing parallel applications on the Grid. An architecture called CPPC-G is proposed, consisting of a set of services built on top of the Globus Toolkit. The CPPC (Controller/Precompiler for Portable Checkpointing) framework is used to insert checkpointing instrumentation into the application code. CPPC-G services will be in charge of the submission and monitoring of the application's execution, management of checkpoint files generated by CPPC-enabled applications, and detection and automatic restart of failed executions.

Fault Tolerant Task Scheduling on Computational Grid Using Checkpointing Under Transient Faults

Arabian Journal for Science and Engineering, 2014

Application scheduling is crucial for a grid computing environment, and the failure of grid resources poses a great challenge to it. Most existing application scheduling algorithms deal with resource failures by employing reliability-aware scheduling without considering performance, and do not adequately provide fault tolerance. In this paper, we propose a fault tolerant task scheduling algorithm for independent and dependent (workflow) tasks that considers reliability as well as the performance of grid resources. We focus on Weibull distributed failures of grid resources rather than the commonly adopted assumption of a Poisson failure distribution. To handle such failures, rollback recovery via checkpoint/restart is used to improve system dependability and reliability. The optimal checkpointing frequency is used with the goal of minimizing the fault tolerance overhead (expected wasted time). Based on the minimal wasted time, a new factor known as the capacity decreasing factor is generated. It considers both the performance and failure characteristics of the resources. Finally, the scheduling decision is made using a genetic algorithm that applies the capacity decreasing factor to derive the new computing capacity of the resources in the presence of failures. The resulting schedule has both optimal performance (makespan) and reliability (i.e., the lowest tendency to fail). Further, precedence constraints of sub-tasks are also considered, where tasks are ordered according to the precedence relationship and fault tolerance overhead. The simulation results show that our proposed fault tolerant scheduling algorithm achieves better performance and execu-

Improving Fault Tolerance in Desktop Grids Based On Incremental Checkpointing

2006

Fault tolerance is an important issue for guaranteeing reliable execution of tasks in a computational desktop grid environment, where execution failures are frequently expected. Periodic checkpointing of running tasks is one of the common strategies for achieving acceptable fault tolerance. A problem usually arises: for some long-running tasks, the temporarily stored data in a checkpoint file might be too large to be reliably transmitted between nodes without consuming excessive network bandwidth. Data loss may also occur when transmitting such a large amount of data in a non-reliable communication environment (e.g. a desktop grid). In this paper, a modified application-level incremental checkpointing approach is proposed in which the size of transmitted checkpoint data can be reduced to about 3% of its original size with little overhead on computation time. The proposed approach also investigates a new mechanism for safely storing a checkpoint file that relies on the availability of the submitting node only. A simulator has been built using the .NET Framework 1.1 to test the validity of the proposed approach using an application code built on variable-dimension matrix multiplication. Experimental results show that the proposed approach improves fault tolerance while minimizing computational overhead.
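The general idea behind incremental checkpointing, sending only what changed since the previous checkpoint, can be sketched as follows. This is a generic block-hashing illustration, not the paper's application-level method, and it does not reproduce the reported 3% figure. The function names and the block size are assumptions.

```python
import hashlib


def block_digests(state: bytes, block_size: int) -> list[str]:
    """Hash each fixed-size block of the serialized task state."""
    return [hashlib.sha256(state[i:i + block_size]).hexdigest()
            for i in range(0, len(state), block_size)]


def incremental_checkpoint(prev_digests: list[str], state: bytes, block_size: int = 4096):
    """Return (delta, digests): only blocks whose hash changed, keyed by block index."""
    digests = block_digests(state, block_size)
    delta = {}
    for idx, digest in enumerate(digests):
        if idx >= len(prev_digests) or prev_digests[idx] != digest:
            delta[idx] = state[idx * block_size:(idx + 1) * block_size]
    return delta, digests


def apply_delta(base: bytes, delta: dict[int, bytes], block_size: int, new_len: int) -> bytes:
    """Rebuild the new state on the receiving node from the old state plus the delta."""
    buf = bytearray(base[:new_len].ljust(new_len, b"\0"))
    for idx, block in delta.items():
        start = idx * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)


if __name__ == "__main__":
    first = bytes(16384)                      # four 4 KiB blocks of zeros
    _, digests = incremental_checkpoint([], first)
    changed = bytearray(first)
    changed[5000] = 1                         # touches block 1 only
    delta, _ = incremental_checkpoint(digests, bytes(changed))
    print(sorted(delta))                      # indices of blocks to transmit
    assert apply_delta(first, delta, 4096, len(changed)) == bytes(changed)
```

How much data such a scheme saves depends entirely on how localized the task's state changes are between checkpoints.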

Related papers