A generalized model for distributed comparison-based system-level diagnosis

A hierarchical adaptive distributed system-level diagnosis algorithm

IEEE Transactions on Computers, 1998

Consider a system composed of N nodes that can be faulty or fault-free. The purpose of distributed system-level diagnosis is to have each fault-free node determine the state of all nodes of the system. This paper presents a Hierarchical Adaptive Distributed System-level Diagnosis (Hi-ADSD) algorithm, which is a fully distributed algorithm that allows every fault-free node to achieve diagnosis in, at most, $(\log_2 N)^2$ testing rounds. Nodes are mapped into progressively larger logical clusters, so that tests are run in a hierarchical fashion. Each node executes its tests independently of the other nodes, i.e., tests are run asynchronously. All the information that nodes exchange is diagnostic information. The algorithm assumes no link faults and a fully connected network, and imposes no bounds on the number of faults. Both the worst-case diagnosis latency and correctness of the algorithm are formally proved. As an example application, the algorithm was implemented on a 37-node Ethernet LAN, integrated with a network management system based on SNMP (Simple Network Management Protocol). Experimental results of fault and repair diagnosis are presented. This implementation by itself is also a significant contribution, for, although fault management is a key functional area of network management systems, currently deployed applications often implement only rudimentary diagnosis mechanisms. Furthermore, experimental results are given through simulation of the algorithm for large systems of 64 nodes and 512 nodes. Index Terms: System-level diagnosis, adaptive diagnosis, distributed diagnosis, network management, fault management, SNMP.

An Isochronous Testing Strategy for Hierarchical Adaptive Distributed System-Level Diagnosis

Journal of Electronic Testing, 2001

Distributed system-level diagnosis allows the fault-free components of a fault-tolerant distributed system to determine which components of the system are faulty and which are fault-free. The time it takes for nodes running the algorithm to diagnose a new event is called the algorithm's latency. In this paper we present a new distributed system-level diagnosis algorithm that has a latency of $O(\log N)$ testing rounds for a system of N nodes. A previous hierarchical distributed system-level diagnosis algorithm, Hi-ADSD, has a latency of $O(\log^2 N)$ testing rounds. Nodes are grouped in progressively larger logical clusters for the purpose of testing. The algorithm employs an isochronous testing strategy that forces all fault-free nodes to execute tests on clusters of the same size in each testing round. This strategy is based on two main principles: a tested node must test its tester in the same round, and a node only accepts tests according to a lexical priority order. We present formal proofs that the algorithm's latency is at most $2\log N - 1$ testing rounds and that the testing strategy of the algorithm leads to the execution of isochronous tests. Simulation results are shown for systems of up to 64 nodes.
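To make the two latency bounds concrete, the short Python sketch below evaluates them for a few system sizes. It assumes that N is a power of two and that the logarithm in both bounds is base 2 (the abstract above does not state the base); the function names are illustrative and do not come from either paper.

```python
def log2_exact(n: int) -> int:
    """Return log2(n) for a power of two n (assumption: N is a power of two)."""
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two")
    return n.bit_length() - 1

def hi_adsd_bound(n: int) -> int:
    """Worst-case Hi-ADSD latency in testing rounds: (log2 N)^2."""
    return log2_exact(n) ** 2

def isochronous_bound(n: int) -> int:
    """Worst-case latency of the isochronous strategy: 2*log2(N) - 1 (base 2 assumed)."""
    return 2 * log2_exact(n) - 1

# Print N, the Hi-ADSD bound, and the isochronous bound side by side.
for n in (8, 64, 512):
    print(n, hi_adsd_bound(n), isochronous_bound(n))
```

Under these assumptions, a 64-node system has a bound of 36 rounds for Hi-ADSD and 11 rounds for the isochronous strategy.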

A distributed system-level diagnosis algorithm for arbitrary network topologies

IEEE Transactions on Computers, 1995

In this paper, a distributed algorithm is described for detecting and diagnosing faulty processors in an arbitrary network. Fault-free processors perform simple periodic tests on one another; when a fault is detected or a newly repaired processor joins the network, this new information is disseminated in parallel throughout the network. It is formally proven that the algorithm is correct, and it is also shown that the algorithm is optimal in terms of the time required for all of the fault-free processors in the network to learn of a new event. Simulation results are given for arbitrary network topologies.
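The parallel dissemination step can be pictured as a flood over the fault-free part of the network. The sketch below is not the paper's algorithm; it is a minimal breadth-first model, assuming that each hop takes one time step and that faulty nodes neither receive nor forward information. The example graph, the node names, and the function `dissemination_times` are hypothetical.

```python
from collections import deque

def dissemination_times(adjacency, source, faulty=frozenset()):
    """Return {node: hops} for every fault-free node reachable from source."""
    times = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            # Faulty nodes are skipped; already-informed nodes are not revisited.
            if neighbor in faulty or neighbor in times:
                continue
            times[neighbor] = times[node] + 1
            queue.append(neighbor)
    return times

# Hypothetical five-node topology given as an adjacency list.
network = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
print(dissemination_times(network, "a", faulty={"b"}))
```

In this model the time for the last fault-free node to learn of an event equals its hop distance from the detecting node in the fault-free subgraph, which is the lower bound any dissemination scheme has to respect.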

A Diagnosis Algorithm for Distributed Computing Systems with Dynamic Failure and Repair

IEEE Transactions on Computers, 1984

The problem of designing distributed fault-tolerant computing systems is considered. A model in which the network nodes are assumed to possess the ability to "test" certain other network facilities for the presence of failures is employed. Using this model, a distributed algorithm is presented which allows all the network nodes to correctly reach independent diagnoses of the condition (faulty or fault-free) of all the network nodes and internode communication facilities, provided the total number of failures does not exceed a given bound. The proposed algorithm allows for the reentry of repaired or replaced faulty facilities back into the network, and it also has provisions for adding new nodes to the system. Sufficient conditions are obtained for designing a distributed fault-tolerant system by employing the given algorithm. The algorithm has the interesting property that it lets as many as all of the nodes and internode communication facilities fail, but upon repair or replacement of faulty facilities, the system can converge to normal operation if no more than a certain number of facilities remain faulty.

Distributed diagnosis in dynamic fault environments

IEEE Transactions on Parallel and Distributed Systems, 2004

The problem of distributed diagnosis in the presence of dynamic failures and repairs is considered. To address this problem, the notion of bounded correctness is defined. Bounded correctness is made up of three properties: bounded diagnostic latency, which ensures that information about state changes of nodes in the system reaches working nodes with a bounded delay; bounded start-up time, which guarantees that working nodes determine valid states for every other node in the system within bounded time after their recovery; and accuracy, which ensures that no spurious events are recorded by working nodes. It is shown that, in order to achieve bounded correctness, the rate at which nodes fail and are repaired must be limited. This requirement is quantified by defining a minimum state holding time in the system. Algorithm HeartbeatComplete is presented and it is proven that this algorithm achieves bounded correctness in fully-connected systems while simultaneously minimizing diagnostic latency, start-up time, and state holding time. A diagnosis algorithm for arbitrary topologies, known as Algorithm ForwardHeartbeat, is also presented. ForwardHeartbeat is shown to produce significantly shorter latency and state holding time than prior algorithms, which focused primarily on minimizing the number of tests at the expense of latency.

Adaptive diagnosis in distributed systems

IEEE Transactions on Neural Networks, 2005

Real-time problem diagnosis in large distributed computer systems and networks is a challenging task that requires fast and accurate inferences from potentially huge data volumes. In this paper, we propose a cost-efficient, adaptive diagnostic technique called active probing. Probes are end-to-end test transactions that collect information about the performance of a distributed system. Active probing combines probabilistic reasoning techniques with an information-theoretic approach, and allows fast online inference about the current system state via active selection of only a small number of the most informative tests. We demonstrate empirically that the active probing scheme greatly reduces both the number of probes (from 60% to 75% in most of our real-life applications) and the time needed for localizing the problem when compared with non-adaptive (pre-planned) probing schemes. We also provide some theoretical results on the complexity of probe selection and on the effect of "noisy" probes on the accuracy of diagnosis. Finally, we discuss how to model the system's dynamics using Dynamic Bayesian networks and an efficient approximate approach called sequential multifault; empirical results demonstrate a clear advantage of such approaches over "static" techniques that do not handle system changes.
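The idea of selecting the most informative probe can be illustrated with a small greedy sketch. It is not the authors' implementation: it assumes exactly one faulty node, noise-free probes that fail if and only if the faulty node lies on their path, and a belief expressed as a probability for each candidate node. All names (`pick_probe`, `update`, the nodes `n1`..`n4`, the probes `p1`..`p3`) are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a {node: probability} belief."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def update(belief, probe_nodes, failed):
    """Condition the belief on a noise-free probe outcome (single-fault assumption)."""
    consistent = {n: p for n, p in belief.items() if (n in probe_nodes) == failed}
    mass = sum(consistent.values())
    if mass == 0:
        raise ValueError("observation is inconsistent with the current belief")
    return {n: p / mass for n, p in consistent.items()}

def expected_entropy_after(belief, probe_nodes):
    """Expected remaining entropy after observing the probe's outcome."""
    total = 0.0
    for failed in (True, False):
        mass = sum(p for n, p in belief.items() if (n in probe_nodes) == failed)
        if mass > 0:
            total += mass * entropy(update(belief, probe_nodes, failed))
    return total

def pick_probe(belief, probes):
    """Greedy choice: the probe that minimizes expected remaining entropy."""
    return min(probes, key=lambda name: expected_entropy_after(belief, probes[name]))

# Uniform prior over four candidate nodes; each probe traverses a set of nodes.
belief = {"n1": 0.25, "n2": 0.25, "n3": 0.25, "n4": 0.25}
probes = {"p1": {"n1"}, "p2": {"n1", "n2"}, "p3": {"n1", "n2", "n3"}}

first = pick_probe(belief, probes)
narrowed = update(belief, probes[first], failed=True)
second = pick_probe(narrowed, probes)
print(first, narrowed, second)
```

Minimizing expected remaining entropy is equivalent to maximizing expected information gain, so the first choice here is the probe that splits the candidates most evenly. Handling noisy probes or multiple simultaneous faults would need a richer probabilistic model than this sketch provides.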

Method for Unit Self-Diagnosis at System Level

International Journal of Intelligent Systems and Applications, 2019

This paper suggests an unconventional approach to system-level self-diagnosis. Traditionally, system-level self-diagnosis focuses on determining the state of the units which are tested by other system units. In contrast, the suggested approach utilizes the results of tests performed by a system unit to determine its own state. Such diagnosis is in many respects close to self-testing, since a unit evaluates its own state, which is inherent in self-testing. However, as distinct from self-testing, in the suggested approach a unit evaluates its state on the basis of tests that it performs not on itself, but on other system units. The paper considers different diagnosis models with various testing assignments and different fault assumptions, including permanent and intermittent faults and hybrid-fault situations. A diagnosis algorithm for identifying the unit's state has been developed, and the correctness of the algorithm has been verified by computer simulation experiments.

Automated monitor based diagnosis in distributed systems

ECE Technical …, 2005

In today's world where distributed systems form many of our critical infrastructures, dependability outages are becoming increasingly common. In many situations, it is necessary to not just detect a failure, but also to diagnose the failure, i.e., to identify the source of the failure. Diagnosis is challenging since high throughput applications with frequent interactions between the different components allow fast error propagation. It is desirable to consider applications as black-boxes for the diagnosis process. In this paper, we propose a Monitor architecture for diagnosing failures in large-scale network protocols. The Monitor only observes the message exchanges between the protocol entities (PEs) remotely and does not access internal protocol state. At runtime, it builds a causal graph between the PEs based on their communication and uses this together with a rule base of allowed state transition paths to diagnose the failure. The tests used for the diagnosis are based on the rule base and are assumed to have imperfect coverage. The hierarchical Monitor framework allows distributed diagnosis handling Byzantine failures at individual Monitors. The framework is implemented and applied to a reliable multicast protocol executing on our campus-wide network. Fault injection experiments are carried out to evaluate the accuracy and latency of the diagnosis.

Distributed Dynamic Failure Detection

Journal of Software, 2014

Failure monitoring and detection is a critical part of providing scalability, reliability, and high availability in current distributed environments. The heartbeat style of interaction is a widely used technique. It detects faults by monitoring the heartbeats of system resources continuously at very short intervals. However, this approach has its limitations, as it requires a period of time to detect the faulty node, delaying the impending recovery procedures. This paper presents a fault detection mechanism and service using a hybrid heartbeat mechanism and a dynamic estimated time of arrival (ETA) for each heartbeat message. This technique introduces the use of an index server for indexing the transactions and operates a dynamic hybrid heartbeat mechanism and a pinging procedure for fault detection. The evaluation shows that the hybrid heartbeat mechanism reduces the time taken to detect faults by approximately 30% compared to existing techniques and provides a basis for a customizable recovery action to take place.
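A dynamic ETA for heartbeats can be sketched as follows. This is not the mechanism from the paper (it has no index server and no pinging step); it is a minimal illustration that assumes the next heartbeat is expected one mean inter-arrival interval after the last one, plus a fixed safety margin. The class name, the `window` and `margin` parameters, and the timestamps are all hypothetical.

```python
from collections import deque

class EtaHeartbeatDetector:
    """Suspect a node once its next heartbeat is later than the estimated deadline."""

    def __init__(self, window=5, margin=0.5):
        self.intervals = deque(maxlen=window)  # most recent inter-arrival times
        self.last_arrival = None
        self.margin = margin                   # safety margin in seconds (assumed)

    def heartbeat(self, timestamp):
        """Record a heartbeat received at the given time."""
        if self.last_arrival is not None:
            self.intervals.append(timestamp - self.last_arrival)
        self.last_arrival = timestamp

    def eta(self):
        """Deadline for the next heartbeat, or None before two heartbeats are seen."""
        if not self.intervals:
            return None
        mean_interval = sum(self.intervals) / len(self.intervals)
        return self.last_arrival + mean_interval + self.margin

    def suspects(self, now):
        """True if the monitored node has missed its estimated deadline."""
        deadline = self.eta()
        return deadline is not None and now > deadline

detector = EtaHeartbeatDetector(window=3, margin=0.5)
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)
print(detector.eta(), detector.suspects(4.2), detector.suspects(4.6))
```

Because the deadline follows the observed arrival pattern, a node with regular heartbeats is suspected sooner than it would be with a conservative fixed timeout, while a node with slower heartbeats gets a proportionally later deadline.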

A Distributed Fault-Detection and Diagnosis System Using On-Line Parameter Estimation

IFAC Proceedings Volumes, 1991

This paper describes a model-based fault-detection and diagnosis system based on a distributed system identification approach. The diagnostic system consists of a two-level process including parallel hypothesis testing modules and a fault mode identification and estimation module. The proposed system is part of a distributed diagnostic system for use in an intelligent control system. The proposed approach utilizes a piecewise linear model to predict the system performance. The deviation between predicted and actual performance is used to identify the associated fault mode. Each hypothesis testing module is associated with a particular class of fault modes and can be viewed as a condition monitor in a distributed diagnostic system hierarchy. The results of the hypothesis modules are processed by the fault-detection and estimation module. Using the results of the on-line diagnosis, the intelligent control system will be able to accommodate the fault modes, reduce maintenance cost, and increase system availability.