Fast checkpoint recovery algorithms for frequently consistent applications
Related papers
Fault Tolerance For Main-Memory Applications In The Cloud
2013
Advances in hardware have enabled many long-running applications to execute entirely in main memory. With the emergence of cloud computing, thousands of machines can be made available to deploy such applications at lower operational and maintenance costs. While these applications achieve substantially better performance, they face new challenges in fault tolerance, that is, in ensuring durability in the event of a crash. In addition, many of these applications, such as massively multiplayer online games, main-memory OLTP systems, main-memory search engines, and deterministic transaction processing systems, must sustain extremely high update rates, often hundreds of thousands of updates per second. They also demand extremely high throughput (e.g., scientific simulation) or low latency (e.g., massively multiplayer online games). To support these demanding requirements, these applications have increasingly turned to database techniques. In this dissertation, we propose an approach to provide fault tolerance for main-memory applications without introducing excessive overhead or latency spikes.
Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes
New Generation Computing, 2013
The execution times of large-scale parallel applications on today's multi/many-core systems are usually longer than the mean time between failures. Therefore, parallel applications must tolerate hardware failures so that a machine failure does not discard all completed computation. Checkpointing and rollback recovery is one of the most popular techniques for implementing fault-tolerant applications. However, checkpointing parallel applications is expensive in terms of computing time, network utilization, and storage resources. Thus, checkpoint-recovery techniques must minimize these costs in order to be useful for large-scale systems. In this paper, three different and complementary techniques to reduce the size of the checkpoints generated by application-level checkpointing are proposed and implemented. Detailed experimental results obtained on a multicore cluster show the effectiveness of the proposed methods in reducing checkpointing cost.
Techniques for efficient in-memory checkpointing
Checkpointing is a pivotal technique in system research, with applications ranging from crash recovery to replay debugging. In this paper, we evaluate a number of in-memory checkpointing techniques and compare their properties. We also present a new compiler-based checkpointing scheme which improves state-of-the-art performance and memory guarantees in the general case. Our solution relies on a shadow state to efficiently store incremental in-memory checkpoints, at the cost of a smaller user-addressable virtual address space. Contrary to common belief, our results show that in-memory checkpointing can be implemented efficiently with moderate impact on production systems.
Latency-Optimized Checkpoint Recovery Algorithms for Massively Multiplayer Online Games
Massively Multiplayer Online Games (MMOs) are long-lived, interactive virtual worlds in which tens of thousands of people play together. In order to provide highly immersive experiences, MMO servers must support extremely high update rates, often hundreds of thousands of updates per second. A major concern for MMOs is to provide durability for the virtual world while limiting the overhead and perceived latency spikes introduced in the game. Recent work has shown that existing checkpoint-recovery algorithms developed for main-memory DBMSs can be applied to MMO workloads, but no single algorithm outperforms all others over a wide range of update rates. In this paper we propose two novel checkpointing algorithms that trade additional space in main memory for significantly lower latency. Compared to previous work, our new algorithms require neither locking nor bulk copies of the game state. Our experimental evaluation shows that our new algorithms attain nearly constant latency and achieve more than an order-of-magnitude lower overhead than the best previous methods.
Distributed Checkpointing for Globally Consistent States of Databases
IEEE Transactions on Software Engineering, 1989
The goal of checkpointing in database management systems is to save database states on a separate secure device so that the database can be recovered when errors and failures occur. Recently, the possibility of having a checkpointing mechanism which does not interfere with the transaction processing has been studied [4], [7]. Users are allowed to submit transactions while the checkpointing is in progress, and the transactions are performed in the system concurrently with the checkpointing process. This property of noninterference is highly desirable to real-time applications, where restricting transaction activity during the checkpointing operation is in many cases not feasible. In this paper, a new algorithm for checkpointing in distributed database systems is proposed and its correctness is proved. The practicality of the algorithm is discussed by analyzing the extra workload and the robustness of it with respect to site failures.
A recovery algorithm for a high-performance memory-resident database system
ACM SIGMOD Record, 1987
With memory prices dropping and memory sizes increasing accordingly, a number of researchers are addressing the problem of designing high-performance database systems for managing memory-resident data. In this paper we address the recovery problem in the context of such a system. We argue that existing database recovery schemes fall short of meeting the requirements of such a system, and we present a new recovery mechanism which is designed to overcome their shortcomings. The proposed mechanism takes advantage of a few megabytes of reliable memory in order to organize recovery information on a per “object” basis. As a result, it is able to amortize the cost of checkpoints over a controllable number of updates, and it is also able to separate post-crash recovery into two phases—high-speed recovery of data which is needed immediately by transactions, and background recovery of the remaining portions of the database. A simple performance analysis is undertaken, and the results suggest ...
Checkpointing of control structures in main memory database systems
International Conference on Dependable Systems and Networks, 2004
This paper proposes an application-transparent, low-overhead checkpointing strategy for maintaining consistency of control structures in a commercial main memory database (MMDB) system, based on the ARMOR (Adaptive Reconfigurable Mobile Object of Reliability) infrastructure. Performance measurements and availability estimates show that the proposed checkpointing scheme significantly enhances database availability (an extra nine in improvement compared with major-recovery-based solutions) while incurring only a small performance overhead (less than 2% in a typical workload of real applications).
Lightweight Memory Checkpointing
2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2015
Memory checkpointing is a pivotal technique in systems reliability, with applications ranging from crash recovery to replay debugging. Unfortunately, many traditional memory checkpointing use cases require high-frequency checkpoints, something for which existing application-level solutions are not well suited. The problem is that they incur either substantial runtime performance overhead or poor memory usage guarantees. As a result, their application in practice is hampered. This paper presents Lightweight Memory Checkpointing (LMC), a new user-level memory checkpointing technique that combines low performance overhead with strong memory usage guarantees for high checkpointing frequencies. To this end, LMC relies on compiler-based instrumentation to shadow the entire memory address space of the running program and incrementally checkpoint modified memory bytes in an LMC-maintained shadow state. Our evaluation on popular server applications demonstrates the viability of our approach in practice, confirming that LMC imposes low performance overhead with strictly bounded memory usage at runtime.
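The shadow-state idea in this abstract can be illustrated with a small sketch. This is not LMC's implementation, which the abstract describes as compiler-based instrumentation over the whole address space. It is a toy Python model under stated assumptions: the class name `ShadowCheckpointedState` and its methods are hypothetical, the "memory" is a single `bytearray`, and writes go through an explicit `write` call rather than instrumented stores. It shows only the general technique of saving a byte's old value on its first modification after a checkpoint, so that a rollback touches modified bytes only.

```python
class ShadowCheckpointedState:
    """Toy byte buffer with incremental in-memory checkpoints (hypothetical API)."""

    def __init__(self, size):
        self.data = bytearray(size)
        # Shadow state: offset -> byte value as of the last checkpoint.
        self.shadow = {}

    def write(self, offset, value):
        # Save the old byte only on the first write since the last checkpoint.
        if offset not in self.shadow:
            self.shadow[offset] = self.data[offset]
        self.data[offset] = value

    def checkpoint(self):
        # The current contents become the new recovery point.
        self.shadow.clear()

    def rollback(self):
        # Restore every modified byte to its value at the last checkpoint.
        for offset, old in self.shadow.items():
            self.data[offset] = old
        self.shadow.clear()


state = ShadowCheckpointedState(8)
state.write(0, 10)
state.checkpoint()      # recovery point: byte 0 holds 10
state.write(0, 99)
state.write(3, 7)
state.rollback()        # bytes 0 and 3 return to their checkpointed values
```

In this sketch the shadow holds at most one entry per modified byte, so its size is bounded by the size of the buffer regardless of how many writes occur between checkpoints.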
Alleviating scalability issues of checkpointing protocols
IEEE International Conference on High Performance Computing, Data, and Analytics, 2012
Current fault tolerance protocols are not sufficiently scalable for the exascale era. The most widely used method, coordinated checkpointing, places enormous demands on the I/O subsystem and imposes frequent synchronizations. Uncoordinated protocols use message logging, which introduces message rate limitations or undesired memory and storage requirements to hold payload and event logs. In this paper we propose a combination of several techniques, namely coordinated checkpointing, optimistic message logging, and a protocol that glues them together. This combination eliminates some of the drawbacks of each individual approach and proves to be an alternative for many types of exascale applications. We evaluate performance and scaling characteristics of this combination using simulation and a partial implementation. While not a universal solution, the combined protocol is suitable for a large range of existing and future applications that use coordinated checkpointing and enhances their scalability.