Level of confidence evaluation and its usage for Roll-back Recovery with Checkpointing optimization

Analysis of checkpointing for real-time systems

Real-Time Systems, 2001

Predictable performance in the event of failures is of paramount importance in most safety-critical real-time systems. Various hardware and software fault-tolerance techniques are employed towards this goal, among which checkpointing is a relatively cost-effective scheme. Since checkpointing schemes depend on time redundancy, they could affect the correctness of the system by causing deadlines to be missed. This paper provides exact schedulability tests for fault-tolerant task sets that operate under a specified failure hypothesis and employ checkpointing to assist in fault recovery. The effects of checkpointing strategies on task response time are analysed and some insights for optimal checkpointing are provided. The emphasis here is on utilizing this analysis as an off-line design support tool.

Analysis of checkpointing for schedulability of real-time systems

… on Real-Time Computing Systems and …, 1997

Checkpointing is a relatively cost-effective method for achieving fault tolerance in real-time systems. Since checkpointing schemes depend on time redundancy, they could affect the correctness of the system by causing deadlines to be missed. This paper provides exact schedulability tests for fault-tolerant task sets that operate under a specified failure hypothesis and employ checkpointing to assist in fault recovery. The effects of checkpointing strategies on task response time are analysed and some insights for optimal checkpointing are provided. The emphasis here is on utilizing this analysis as an off-line design support tool. Predictable performance in the event of failures is of paramount importance in most safety-critical real-time systems. Among the various hardware and software techniques employed for achieving fault tolerance, checkpointing is a relatively cost-effective scheme. One needs to checkpoint only those variables whose values have changed since the last checkpoint operation. Bowen and Pradhan [1] give a detailed discussion of different types of processor-based as well as memory-based checkpointing schemes. These schemes allow checkpoints to be performed like atomic operations with negligible overhead, using a 'copyback' cache and the atomic-update feature of stable transaction memory (STM). However, since checkpointing schemes depend on time redundancy, they can affect the correctness of the system by causing deadlines to be missed.

A Checkpointing Technique for Rollback Error Recovery in Embedded Systems

2006 International Conference on Microelectronics, 2006

In this paper, a general checkpointing technique for rollback error recovery in embedded systems is proposed and evaluated. The technique is independent of the processor used and employs the most important feature of control flow error detection mechanisms to simplify checkpoint selection and to minimize the overall code overhead. In this way, the checkpoints are added to the program during the implementation of a control flow checking mechanism.

Checkpoint Interval and System's Overall Quality for Message Logging-Based Rollback and Recovery in Distributed and Embedded Computing

2009 International Conference on Embedded Software and Systems, 2009

In distributed environments, message-logging-based checkpointing and rollback recovery is a commonly used approach for providing distributed systems with fault tolerance and synchronized global states. Clearly, checkpointing more frequently reduces system recovery time in the presence of faults, and hence improves system availability; however, more frequent checkpointing may also increase the probability that a task misses its deadline, or prolong its execution time in fault-free scenarios. Hence, in distributed and real-time computing, the system's overall quality must be measured by a set of aggregated criteria, such as availability, task execution time, and task deadline miss probability. In this paper, we take into account state synchronization costs in the checkpointing and rollback recovery scheme and quantitatively analyze the relationships between checkpoint intervals and these criteria. Based on the analytical results, we present an algorithm for finding an optimal checkpoint interval that maximizes the system's overall quality.

An optimal checkpointing interval for real-time systems

Proc. Intl. Conf. Parallel and Distributed Processing …

The application of checkpointing as a fault-tolerance measure for real-time services (i.e., services that are restricted by deadlines) raises different problems than its application to non-real-time services. Therefore, different criteria are necessary for the assessment of checkpointing in such an environment. One such criterion for real-time services is responsiveness: the probability of correct execution before the deadline, even in the presence of faults.

On the Combination of Silent Error Detection and Checkpointing

2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing, 2013

In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. In contrast to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delay that follows a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time during which nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
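
The idea of choosing a checkpoint period that minimizes waste can be illustrated with a minimal sketch. It does not reproduce the silent-error models of the abstract above; it assumes the classic first-order approximation for fail-stop errors, in which waste is roughly C/T + T/(2·MTBF) and the minimizing period is sqrt(2·C·MTBF) (often attributed to Young). The function names, the checkpoint cost of 60 s, and the MTBF of one day are hypothetical values chosen only for illustration.

```python
import math


def waste(period, checkpoint_cost, mtbf):
    """First-order waste for fail-stop errors (an assumed model, not the paper's).

    checkpoint_cost / period : fraction of time spent writing checkpoints
    period / (2 * mtbf)      : expected fraction of time lost to re-execution
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


def first_order_optimal_period(checkpoint_cost, mtbf):
    """Period that minimizes the first-order waste above: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    C = 60.0            # hypothetical checkpoint cost, in seconds
    MTBF = 24 * 3600.0  # hypothetical mean time between failures, in seconds

    t_opt = first_order_optimal_period(C, MTBF)
    print(f"period = {t_opt:.1f} s, waste = {waste(t_opt, C, MTBF):.4f}")

    # Checkpointing twice as often or half as often should increase the waste.
    for factor in (0.5, 2.0):
        t = factor * t_opt
        print(f"period = {t:.1f} s, waste = {waste(t, C, MTBF):.4f}")
```

Under this assumed model the waste curve is convex in the period, so any period shorter or longer than the computed one yields more waste. Silent errors add detection latency or verification cost on top of this, which is why the abstract derives separate optimal periods for its two models.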

Performance analysis of different checkpointing and recovery schemes using stochastic model

Journal of Parallel and Distributed Computing, 2006

Several schemes for checkpointing and rollback recovery have been reported in the literature. In this paper, we analyze some of these schemes under a stochastic model. We have derived expressions for average cost of checkpointing, rollback recovery, message logging and piggybacking with application messages in synchronous as well as asynchronous checkpointing. For quasi-synchronous checkpointing we show that in a system with n processes, the upper bound and lower bound of selective message logging are O(n^2) and O(n), respectively.

A Checkpointing Model for Fault-Tolerant Real-Time Systems

IFAC Proceedings Volumes, 1996

A fault-tolerant real-time system must provide a critical level of service in a timely manner in the presence of one or more hardware or software faults. This paper argues that support from the language, environment, and compiler is required. An integrated approach to providing this support through a novel data classification is proposed. This can in principle provide static guarantees of timeliness in a checkpointing real-time system, and of recovery and continued computation for up to one node or link failure.

Performance optimization of checkpointing schemes with task duplication

IEEE Transactions on Computers, 1997

In checkpointing schemes with task duplication, checkpointing serves two purposes: detecting faults by comparing the processors' states at checkpoints, and reducing fault recovery time by supplying a safe point to roll back to. In this paper, we show that, by tuning the checkpointing schemes to a given architecture, a significant reduction in the execution time can be achieved. The main idea is to use two types of checkpoints: compare-checkpoints (comparing the states of the redundant processes to detect faults) and store-checkpoints (storing the states to reduce recovery time). With two types of checkpoints, we can use both the comparison and storage operations in an efficient way and improve the performance of checkpointing schemes. The results we obtained show that, in some cases, using compare- and store-checkpoints can reduce the overhead of DMR checkpointing schemes by as much as 30 percent.

