Fault-tolerant communication in embedded supercomputing

Application-level fault tolerance as a complement to system-level fault tolerance

The Journal of …, 2000

As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique which is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level fault tolerance and application-level fault tolerance works best. In many systems, application-level fault tolerance can be used to bridge the gap when system-level fault tolerance alone does not provide the required reliability. We exemplify this with the RTHT target tracking benchmark and the ABF beamforming benchmark.

Performability Analysis of Two Approaches to Fault Tolerance

1996

We present a quantitative comparison of two popular approaches for recovering from CPU errors: Quadruple Modular Redundancy and Backward Error Recovery. Both are used in existing fault-tolerant systems offering basically the same main features and, in particular, the same fault-tolerance services: transparent recovery for hardware faults. We show that the use of performability measures is richer than classical dependability analysis. Given that they take into account not only reliability aspects but also performance metrics, they allow a deeper insight into the behaviour of the considered systems. For instance, they allow the user to identify different mission lengths leading to better adaptation of each type of architecture.

Hardware-supported fault tolerance for multiprocessors

Proc. Architektur von …, 1997

For a computing system to be dependable, fault tolerance mechanisms have to be included. Massive parallelism in particular represents a new challenge for fault tolerance. In this paper we discuss basic hardware fault tolerance measures for massively parallel multiprocessors and solutions realized for and integrated into different multiprocessor architectures. We further present our validation technique for dependability, which is based on simulation-based fault injection.

Software Fault Tolerance in the Application Layer

1995

By software fault tolerance in the application layer, we mean a set of application-level software components to detect and recover from faults that are not handled in the hardware or operating system layers of a computer system. We consider those faults that cause an application process to crash or hang; they include application software faults as well as faults in the underlying hardware and operating system layers if they are undetected in those layers. We define four levels of software fault tolerance based on availability and data consistency of an application in the presence of such faults. We describe three reusable software components that provide up to the third level of software fault tolerance. Those components perform automatic detection and restart of failed processes, periodic checkpointing and recovery of critical volatile data, and replication and synchronization of persistent data in an application software system. These components have been ported to a number of UNIX platforms and can be used in any application with minimal programming effort. Some telecommunications products in AT&T have already been enhanced for fault-tolerance capability using these three components. Experience with those products to date indicates that these modules provide efficient and economical means to increase the level of fault tolerance in a software system. The performance overhead due to these components depends on the level and varies from 0.1% to 14% based on the amount of critical data being checkpointed and replicated.
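
As an illustration only, the first two ideas in this abstract (restart of a failed process, and periodic checkpointing and recovery of critical volatile data) can be sketched in a few lines of Python. This is not the components described in the paper: the function names (`save_checkpoint`, `load_checkpoint`, `run_task`, `supervise`), the JSON checkpoint file, and the `crash_at` fault simulation are all assumptions made for the example, and the "process" is an ordinary function call rather than a separate OS process.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: write a temp file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or a copy of default if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return dict(default)


def run_task(path, total_steps, checkpoint_every, crash_at=None):
    """Sum 0..total_steps-1, checkpointing every few steps.

    crash_at is a hypothetical knob that simulates a crash by raising at that step.
    """
    state = load_checkpoint(path, {"next_step": 0, "total": 0})
    for step in range(state["next_step"], total_steps):
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]


def supervise(path, total_steps, checkpoint_every, crash_at):
    """Watchdog-style loop: restart the task after a failure, resuming from the checkpoint."""
    restarts = 0
    while True:
        try:
            return run_task(path, total_steps, checkpoint_every, crash_at), restarts
        except RuntimeError:
            restarts += 1
            crash_at = None  # assume the simulated fault is transient
```

Calling `supervise(path, 10, 3, crash_at=7)` on a fresh checkpoint path returns `(45, 1)`: the task fails once at step 7, is restarted from the checkpoint taken after step 5, and redoes only step 6 onward. Work done since the last checkpoint is lost and repeated, which is the trade-off between checkpoint frequency and overhead that the abstract's overhead figures refer to.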

Operating System Fault Tolerance Support for Real-Time Embedded Applications

2009

Fault tolerance is a means of achieving high dependability for critical and high-availability systems. Despite the efforts to prevent and remove faults during the development of these systems, the application of fault tolerance is usually required because the hardware may fail during system operation and software faults are very hard to eliminate completely. One of the difficulties in implementing fault tolerance techniques is the lack of support from operating systems and middleware. In most fault-tolerant projects, the programmer has to develop a fault tolerance implementation for each application. This strong customization makes the fault-tolerant software costly and difficult to implement and maintain. In particular, for small-scale embedded systems, the introduction of fault tolerance techniques may also have an impact on their restricted resources, such as processing power and memory size.

A Literature Survey on Improving Fault Tolerance of Software Applications

Adding fault tolerance to any software application is becoming an issue of great significance, especially as these applications support critical parts of our everyday life in the modern "Information Society". By adding this, the burden of ad hoc fault tolerance programming is removed from the application developer, while at the same time average fault tolerance support taken at operating system level is avoided. Fault tolerance is achieved by applying a set of analysis and design techniques to create systems with dramatically improved dependability. As new technologies are developed and new applications arise, new fault-tolerance approaches are also needed. In the early days of fault-tolerant computing, it was possible to craft specific hardware and software solutions from the ground up, but now chips contain complex, highly integrated functions, and hardware and software must be crafted to meet a variety of standards to be economically viable. Thus a great deal of current research focuses on implementing fault tolerance using COTS (Commercial-Off-The-Shelf) technology. In this paper we present a survey of fault tolerance provided in a variety of ways.

Fault tolerant supercomputing: a software approach

Information Processing and Technology, 2001

Adding fault tolerance to embedded supercomputing applications is becoming an issue of great significance, especially as these applications support critical parts of our everyday life in the modern "Information Society". To this end, a software middleware framework is presented that features a collection of flexible and reusable fault tolerance modules acting at different levels and coping with common fault tolerance requirements. The burden of ad hoc fault tolerance programming is removed from the application developer, while at the same time average fault tolerance support taken at operating system level is avoided. A high-level description helps the developer specify the fault tolerance strategies of the application as a sort of second application layer; this separates functional from fault tolerance aspects of an application, shortening the development cycle and improving maintainability. Integration of this functionality in real embedded applications validates this approach.

Synergistic coordination between software and hardware fault tolerance techniques

Proceedings International Conference on Dependable Systems and Networks, 2001

This paper describes an approach for enabling the synergistic coordination between two fault tolerance protocols to simultaneously tolerate software and hardware faults in a distributed computing environment. Specifically, our approach is based on a message-driven confidence-driven (MDCD) protocol that we have devised for tolerating software design faults, and a time-based (TB) checkpointing protocol that was developed by Neves and Fuchs for tolerating hardware faults. By carrying out algorithm modifications that are conducive to synergistic coordination between volatile-storage and stable-storage checkpoint establishments, we are able to circumvent the potential interference between the MDCD and TB protocols, and to allow them to effectively complement each other to extend a system's fault tolerance capability. Moreover, the protocol-coordination approach preserves and enhances the features and advantages of the individual protocols that participate in the coordination, keeping the performance cost low.

Architectural support for designing fault-tolerant open distributed systems

Computer, 2000

A distributed system consists of autonomous computing modules that interact with each other using messages. Designing distributed systems is more difficult than designing centralized systems for several reasons. Physical separation and the use of heterogeneous computers complicate interprocessor communication, management of resources, synchronization of cooperating activities, and maintenance of consistency among multiple copies of information. The main advantages of distributed systems include increased fault-tolerance capabilities through the inherent redundancy of resources, improved performance by concurrently executing a single task on several computing modules, resource sharing, and the ability to adapt to a changing environment (extensibility).

Non-uniform fault tolerance

Proceedings of the 2nd …, 2006

As devices become more susceptible to transient faults that can affect program correctness, processor designers will increasingly compensate by adding hardware or software redundancy. Proposed redundancy techniques and those currently in use are generally applied uniformly to a structure despite non-uniformity in the way errors within the structure manifest themselves in programs. This uniform protection leads to inefficiency in terms of performance, power, and area. Using case studies involving the register file, this paper motivates an alternative Non-Uniform Fault Tolerance approach which improves reliability over uniform approaches by spending the redundancy budget on those areas most susceptible.