Imprecise Exceptions in Distributed Parallel Components

UnSync: A Soft Error Resilient Redundant Multicore Architecture

2011 International Conference on Parallel Processing, 2011

Shrinking device dimensions, increasing transistor densities, and smaller timing windows expose the vulnerability of processors to soft errors induced by charge-carrying particles. Since these factors are simply consequences of the inevitable advancement in processor technology, the industry has been forced to improve the reliability of general-purpose Chip Multiprocessors (CMPs). With the availability of increased hardware resources, redundancy-based techniques are the most promising methods to eradicate soft error failures in CMP systems. In this work, we propose a novel redundant CMP architecture (UnSync) that utilizes hardware-based detection mechanisms (most of which are readily available in the processor) to reduce overheads during error-free executions. In the presence of errors (which are infrequent), a recovery mechanism enabled by always-forward execution provides resilience in the system. We design a detailed RTL model of our UnSync architecture and perform hardware synthesis to compare its hardware (power/area) overheads with those of the Reunion technique, a state-of-the-art redundant multi-core architecture. We also perform cycle-accurate simulations over a wide range of SPEC2000 and MiBench benchmarks to evaluate the performance efficiency achieved over that of the Reunion architecture. Experimental results show that our UnSync architecture reduces power consumption by 34.5% and improves performance by up to 20% with 13.3% less area overhead, when compared to the Reunion architecture for the same level of reliability.

Adaptive execution assistance for multiplexed fault-tolerant chip multiprocessors

Relentless scaling of CMOS fabrication technology has made contemporary integrated circuits increasingly susceptible to transient faults, wearout-related permanent faults, intermittent faults, and process variations. Therefore, mechanisms to mitigate the effects of decreased reliability are expected to become essential components of future general-purpose microprocessors.

Parallelizing Software-Implemented Error Detection

2009

Because of economic pressure, more commodity hardware with insufficient error detection is being used in critical applications. Moreover, commodity hardware is expected to become less reliable because of continuously decreasing feature sizes. Thus, we expect that software-implemented approaches to dealing with unreliable hardware will be needed.

Efficient Mitigation of Data and Control Flow Errors in Microprocessors

IEEE Transactions on Nuclear Science, 2014

The use of microprocessor-based systems is gaining importance in application domains where safety is a must. For this reason, there is growing concern about the mitigation of SEU and SET effects. This paper presents a new hybrid technique aimed at protecting both the data and the control flow of embedded applications running on microprocessors. On the one hand, the approach is based on software redundancy techniques for correcting errors produced in the data. On the other hand, control-flow errors can be detected by reusing the on-chip debug interface that exists in most modern microprocessors. Experimental results show a significant increase in system reliability, of more than two orders of magnitude, in terms of mitigation of both SEUs and SETs. Furthermore, the overheads incurred by our technique are acceptable in low-cost systems.
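The abstract above does not say which software redundancy scheme it uses for data errors. As a rough illustration of the general idea only, the sketch below assumes triple redundancy with majority voting, a common way to correct a single corrupted result in software. The names `majority_vote` and `run_redundant`, and the bit flip used to simulate an SEU, are hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def majority_vote(results: Sequence[int]) -> int:
    """Return the value agreed on by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # All replicas disagree: the error is detected but cannot be corrected.
        raise RuntimeError("no majority among redundant results")
    return value


def run_redundant(compute: Callable[[int], int], x: int,
                  fault_in_replica: Optional[int] = None) -> int:
    """Run `compute` three times and vote on the result.

    `fault_in_replica` is a test hook (an assumption of this sketch) that
    flips one bit in the chosen replica's result to mimic a single-event upset.
    """
    results = []
    for replica in range(3):
        result = compute(x)
        if replica == fault_in_replica:
            result ^= 1 << 3  # simulated bit flip in this replica only
        results.append(result)
    return majority_vote(results)
```

With a single faulty replica the vote masks the error, so `run_redundant(lambda v: v * v, 7, fault_in_replica=1)` still yields the fault-free result. The cost is roughly three times the computation, which is why hybrid schemes hand other error classes, such as control-flow errors, to hardware support.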

Cost-efficient soft error protection for embedded microprocessors

2006

Device scaling trends dramatically increase the susceptibility of microprocessors to soft errors. Further, mounting demand for embedded microprocessors in a wide array of safety-critical applications, ranging from automobiles to pacemakers, compounds the importance of addressing the soft error problem. Historically, soft error tolerance techniques have been targeted mainly at high-end server markets, leading to solutions such as coarse-grained modular redundancy and redundant multithreading.

RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading

International Journal of Parallel Programming

In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance. In this paper we present RedThreads, an interface that provides application-level fault detection and correction based on RMT, but applies the thread-level redundancy adaptively. We describe the RedThreads syntax and semantics, and the supporting compiler infrastructure and runtime system. Our approach enables

Thread Relocation: A Runtime Architecture for Tolerating Hard Errors in Chip Multiprocessors

IEEE Transactions on Computers, 2010

As the semiconductor industry continues its relentless push for nano-CMOS technologies, device reliability and occurrence of hard errors have emerged as a dominant concern in multicores. Although regular memory structures are protected against hard errors using error correcting codes or spare rows and columns, many of the structures within the cores are left unprotected. Even if the location of hard errors is known a priori, disabling faulty cores results in a substantial performance loss. Several proposed techniques use microarchitectural redundancy to allow defective cores to continue operation. These techniques are attractive, but limited due to either added cost of additional redundancy that offers no benefits to an error-free core, or limited coverage, due to the natural redundancy offered by the microarchitecture. We propose to exploit the intercore redundancy in chip multiprocessors for hard-error tolerance. Our scheme combines hardware reconfiguration to ensure reduced functionality of cores, and a runtime layer of software (microvisor) to manage mapping of threads to cores. Microvisor observes the changing phase behavior of threads and initiates thread relocation to match the computational demands of threads to the capabilities of cores. Our results show that in the presence of degraded cores, microvisor mitigates performance losses by an average of two percent.

Error detection mechanisms for massively parallel multiprocessors

1993 Euromicro Workshop on Parallel and Distributed Processing

In this paper a survey of the most important methods for error detection in multiprocessor systems is presented. A detailed comparison between watchdog-processor and master-checker based fault tolerance is given. The fault coverage, hardware and run-time overhead are discussed, based on the experiences gained in the development of the MEMSY fault-tolerant multiprocessor system. The cumulative effects resulting from the simultaneous use of different hardware-near and high-level fault-tolerance mechanisms are shown.

1. Guest researcher from TU Budapest, Dept. Measurement and Instrumentation Engineering
2. The MEMSY project is supported by the DFG (Deutsche Forschungsgemeinschaft) as part of the "Sonderforschungsbereich" SFB 182.

... redundancy has to be employed as sparingly as possible. Accordingly, fault handling is more adequate and cost-effective than providing fault-masking hardware redundancy. Fault handling includes error detection, error location, fault