Design for a Soft Error Resilient Dynamic Task-based Runtime

Title Design for a Soft Error Resilient Dynamic Task-based Runtime
Publication Type Conference Paper
Year of Publication 2015
Authors Cao, C., G. Bosilca, T. Herault, and J. Dongarra
Conference Name 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
Date Published 2015-05
Publisher IEEE
Conference Location Hyderabad, India
Abstract As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale, a generic, low-overhead resilience extension becomes a desirable feature of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime that together build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG; the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution; the third takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by these mechanisms.
