A performance evaluation of the software-implemented fault-tolerance computer
Related papers
ArXiv, 2017
Modern embedded technology is a driving factor in satellite miniaturization, contributing to a massive boom in satellite launches and a rapidly evolving new space industry. Miniaturized satellites however suffer from low reliability, as traditional hardware-based fault-tolerance (FT) concepts are ineffective for on-board computers (OBCs) utilizing modern systems-on-a-chip (SoC). Larger satellites therefore continue to rely on proven processors with large feature sizes. Software-based concepts have largely been ignored by the space industry as they were researched only in theory, and have not yet reached the level of maturity necessary for implementation. In related work, we presented the first integral, real-world solution to enable fault-tolerant general-purpose computing with modern multiprocessor-SoCs (MPSoCs) for spaceflight, thereby enabling their use in future high-priority space missions. The presented multi-stage approach consists of three FT stages, combining coarse-grained...
SIFT: Design and analysis of a fault-tolerant computer for aircraft control
Proceedings of The IEEE, 1978
SIFT (Software Implemented Fault Tolerance) is an ultrareliable computer for critical aircraft control applications that achieves fault tolerance by the replication of tasks among processing units. The main processing units are off-the-shelf minicomputers, with standard microcomputers serving as the interface to the I/O system. Fault isolation is achieved by using a specially designed redundant bus system to interconnect the processing units. Error detection and analysis and system reconfiguration are performed by software. Iterative tasks are redundantly executed, and the results of each iteration are voted upon before being used. Thus, any single failure in a processing unit or bus can be tolerated with triplication of tasks, and subsequent failures can be tolerated after reconfiguration. Independent execution by separate processors means that the processors need only be loosely synchronized, and a novel fault-tolerant synchronization method is described. The SIFT software is highly structured and is formally specified using the SRI-developed SPECIAL language. The correctness of SIFT is to be proved using a hierarchy of formal models. A Markov model is used both to analyze the reliability of the system and to serve as the formal requirement for the SIFT design. Axioms are given to characterize the high-level behavior of the system, from which a correctness statement has been proved. An engineering test version of SIFT is currently being built.
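The voting step described in this abstract can be illustrated with a minimal Python sketch. It is not SIFT's actual code: the function names `majority_vote` and `disagreeing_replicas` are hypothetical, and the sketch assumes each replica reports a single comparable result per iteration.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # A strict majority is required; a tie or a three-way split yields None.
    return value if count > len(results) // 2 else None


def disagreeing_replicas(results: Sequence[Hashable], voted: Hashable) -> List[int]:
    """Indices of replicas whose result differs from the voted value.

    In a reconfigurable system these would be candidates for removal
    (an assumption of this sketch, not a description of SIFT's policy).
    """
    return [i for i, r in enumerate(results) if r != voted]


# Three replicas of one iterative task; replica 2 has suffered a fault.
outputs = [42, 42, 7]
voted = majority_vote(outputs)
suspects = disagreeing_replicas(outputs, voted)
```

With triplicated tasks, a single faulty result is outvoted by the two matching ones, which is the single-failure tolerance the abstract refers to.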
Software-implemented hardware fault tolerance
2006
Transient errors in computer systems can cause abnormal behavior and degrade system reliability, data integrity and availability. This is especially true in a space environment where transient errors are a major cause of concern. Fault avoidance techniques such as radiation hardening and shielding have been the major approaches to obtaining the required reliability. Recently, unhardened Commercial Off-The-Shelf (COTS) components have been investigated for space applications because of their higher density, faster clock rate, lower power consumption and lower price. Since COTS components are not radiation hardened, and it is desirable to avoid shielding, Software-Implemented Hardware Fault Tolerance (SIHFT) has been proposed to increase the data integrity and availability of COTS systems. This dissertation presents three new SIHFT techniques for error detection: Control Flow Checking by Software Signatures (CFCSS), Error Detection by Duplicated Instructions (EDDI), and Error Detection by Diverse Data and Duplicated Instructions (ED⁴I). Previously studied software techniques are either inadequate or require assistance from special hardware, but CFCSS, EDDI and ED⁴I are pure software methods. In CFCSS, signatures are embedded into the program during compilation and compared with run-time signatures during execution. In EDDI, instructions are duplicated at compile-time, and scheduled by exploiting Instruction-Level Parallelism (ILP) to reduce performance overhead. CFCSS and EDDI detect transient errors but not permanent faults. However, in ED⁴I, a program is compiled to a new program with diverse data so that it can detect a permanent fault. Our fault injection experiment simulating bit flips in memory shows that, for the designs simulated, EDDI provides over 98% fault coverage without any extra hardware. Because of instruction duplication, code size overhead is approximately 100%, but by exploiting ILP, we reduce the performance overhead down to 61% on average. In a control-flow checking experiment simulating branching faults, CFCSS provides 97% fault coverage. In addition, when we duplicate programs or instructions, we can use ED⁴I to enhance data integrity in the system. Furthermore, for space experiments, we have implemented EDDI and CFCSS in sort and FFT programs running in the ARGOS satellite. During a 136-day period, our techniques detected 198 out of 203 errors, showing 98% error detection coverage. While traditional error detection and fault tolerance techniques require special dedicated hardware, our SIHFT techniques use time redundancy for error detection and significantly improve data integrity without requiring special hardware.
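The signature-checking idea behind CFCSS can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the dissertation's implementation: the block names, signature values and the helper `path_is_legal` are hypothetical, and every block is assumed to have exactly one predecessor, so the extra adjusting signature needed for branch fan-in blocks is omitted.

```python
# Hypothetical basic blocks, each with a unique compile-time signature.
SIGNATURES = {"init": 0b0001, "compute": 0b0010, "output": 0b0100}

# Assumed control-flow graph: each block has exactly one legal predecessor.
PREDECESSOR = {"compute": "init", "output": "compute"}

# Signature difference embedded in each block: d = s_pred XOR s_block.
DIFF = {blk: SIGNATURES[pred] ^ SIGNATURES[blk] for blk, pred in PREDECESSOR.items()}


def path_is_legal(path):
    """Replay a sequence of executed blocks, updating a run-time signature.

    Returns False as soon as the run-time signature disagrees with the
    embedded signature of the block just entered (an illegal jump).
    """
    g = SIGNATURES[path[0]]          # run-time signature register
    for block in path[1:]:
        g ^= DIFF[block]             # update on entry to the block
        if g != SIGNATURES[block]:   # compare with the embedded signature
            return False
    return True


legal = path_is_legal(["init", "compute", "output"])
# A faulty branch that skips "compute" leaves the wrong run-time signature.
illegal = path_is_legal(["init", "output"])
```

Because the update uses the signature of the expected predecessor, entering a block from any other block leaves the register with a mismatching value, which is how an erroneous branch is detected without additional hardware.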
A flight experiment for the evaluation of hardware and software fault tolerance techniques
2001
This paper describes the motivations and main features of a flight experiment devoted to gathering objective data about the efficiency of different techniques for fault detection and correction in digital architectures. The experiment is intended to be included in NASA's LWS/SET (Living With a Star/Space Environment Testbed) project, with a launch expected in 2004. Preliminary data from radiation ground testing performed with various heavy-ion beams are presented. These results provide evidence of the validity of the hardware and software mechanisms implemented to cope with the effects of transient faults provoked in integrated digital circuits by energetic particles present in the space environment.
A fault-tolerant computer system for India’s satellite launch vehicle programmes
Sadhana, 1987
The on-board computer (OBC) systems that are planned to be used in India's forthcoming launch vehicle programmes, viz, the Augmented Satellite Launch Vehicle (ASLV) and Polar Satellite Launch Vehicle (PSLV), exercise total control over the vehicle during its flight, carrying out complex real-time computations related to vehicle navigation, guidance, autopilot and the generation of mission-critical event commands. The success of the country's launch vehicle missions, therefore, depends to a very large extent on the reliable operation of the OBC. To enhance the reliability of such a computer system, fault-tolerant design techniques have been adopted, and the system, after thorough testing, is now ready to be flown on the ASLV. This paper highlights the design of such an OBC mainly from the points of view of the fault-tolerant methods incorporated. The relevance of fault tolerance to critical flight computers is first discussed. This is followed by a presentation of possible fault-tolerant configurations and the considerations that led to the choice of the present system. A brief description of the OBC system architecture and the methods of testing that ensure its reliable operation follow. The paper concludes with an assessment of the present system and possible future improvements.
Issues in the Implementation of a Fault-Tolerant Hardware Platform
IFAC Proceedings Volumes, 2003
This paper discusses the implementation of a fault-tolerant hardware platform. The discussion focuses on control applications with less severe integrity requirements (SIL1), where intelligent system reconfiguration is preferred to redundancy in the case of failures. To build a dependable control system, all aspects of the implementation must be considered and integrated into the development from the beginning. A network of simple monitoring modules should be integrated into the system to detect and react to faults as soon as possible. To streamline the implementation, not only the hardware but also the appropriate system software must be constructed to hide the particularities of low-level matters from the control application.