Reliability analysis in distributed systems
Related papers
Estimation of the Reliability of Distributed Applications
2010
In this paper, reliability is presented as an important feature of mission-critical distributed applications. Certain aspects of distributed systems make the required level of reliability more difficult to achieve. An obvious benefit of distributed systems is that they serve the global business and social environment in which we live and work. Another benefit is that they can improve the
A survey on reliability in distributed systems
Journal of Computer and System Sciences, 2013
Software reliability in distributed systems has always been a major concern for all stakeholders, especially application vendors and their users. Various models have been produced to assess or predict the reliability of large-scale distributed applications, including e-government, e-commerce, multimedia services, and end-to-end automotive solutions, but reliability issues with these systems still exist. Ensuring a distributed system's reliability in turn requires examining the reliability of each individual component or factor involved in enterprise distributed applications before predicting or assessing the reliability of the whole system, and implementing transparent fault detection and fault recovery schemes to provide seamless interaction to end users. For this reason, we have analyzed in detail existing reliability methodologies from the viewpoint of examining the reliability of individual components, and explained why we still need a comprehensive reliability model for applications running in distributed systems. In this paper we give a detailed technical overview, in four parts, of research done in recent years on analyzing and predicting the reliability of large-scale distributed applications. We first describe some pragmatic requirements for highly reliable systems and highlight the significance and various issues of reliability in different computing environments such as Cloud Computing, Grid Computing, and Service Oriented Architecture. Then we elucidate certain possible factors and various challenges that are nontrivial for highly reliable distributed systems, including fault detection, recovery, and removal through testing or various replication techniques. Later we scrutinize various research models which synthesize significant solutions to tackle these factors and challenges in predicting as well as measuring the reliability of software applications in distributed systems.
At the end of this paper we discuss the limitations of existing models and, in the light of our analysis, propose future work for predicting and analyzing the reliability of distributed applications in real environments.
A Computer Program for Approximating System Reliability
IEEE Transactions on Reliability, 1970
A computer program, which provides bounds for system reliability, is described. The algorithms are based on the concepts of success paths and cut sets. A listing of the elements in the system, their predecessors, and the probability of successful operation of each element are the inputs. The outputs are the success paths, the cut sets, and a series of upper and lower reliability bounds; these bounds converge to the reliability which would be calculated if all the terms in the model were evaluated. The algorithm for determining the cuts from the success paths is based on Boolean logic and is relatively simple to understand. Two examples are described, one of which is very simple and the computation can be done by hand, and a second for which there are 55 success paths and 10 cuts and thus machine computation is desirable.
Reader Aids: Purpose: Helpful hints. Special math needed for explanations: Probability. Special math needed for results: None.
Shooman [3] provides further mathematical background material and an entire chapter on combinatorial reliability. In addition, he gives many references to previous work in this area of reliability computation.
II. REVIEW OF USEFUL RESULTS
The success probability of a system, typically called the system reliability, is defined as the probability of successful function of all of the elements in at least one tie set or as the probability that all cut sets are good. A tie set (success path) is a directed path from input to output as indicated in the simple system in Fig. 1(b). The tie sets are (2, 5), (1, 3, 5), (1, 4, 5). A cut set is a set of elements which literally cuts all
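The tie sets quoted in this excerpt are enough to sketch how cut sets and a series of alternating bounds can be obtained. The Python below is only an illustration under stated assumptions: elements fail independently, every element has a hypothetical reliability of 0.9, the cut sets are found by brute force rather than by the paper's Boolean procedure, and the bounds are truncated inclusion-exclusion sums, which is one standard way to get alternating bounds and may differ from what the 1970 program actually computes. All function names are made up for this sketch.

```python
from itertools import combinations
from math import prod


def minimal_cut_sets(tie_sets):
    """Brute-force the minimal sets of elements that intersect every tie set."""
    paths = [frozenset(t) for t in tie_sets]
    elements = sorted(frozenset().union(*paths))
    cuts = []
    for size in range(1, len(elements) + 1):
        for candidate in combinations(elements, size):
            cut = frozenset(candidate)
            if any(smaller <= cut for smaller in cuts):
                continue  # contains a smaller cut, so it is not minimal
            if all(cut & path for path in paths):
                cuts.append(cut)
    return cuts


def inclusion_exclusion_partial_sums(event_sets, prob):
    """Partial sums S1, S1-S2, S1-S2+S3, ... for P(at least one event occurs).

    Each event means "every element of the set is in the given state".
    Elements are assumed independent, so a joint event has probability
    equal to the product over the union of the sets involved.
    """
    sets = [frozenset(s) for s in event_sets]
    sums, running = [], 0.0
    for k in range(1, len(sets) + 1):
        term = sum(
            prod(prob[e] for e in frozenset().union(*combo))
            for combo in combinations(sets, k)
        )
        running += term if k % 2 else -term
        sums.append(running)
    return sums


tie_sets = [(2, 5), (1, 3, 5), (1, 4, 5)]   # taken from the excerpt
p_up = {e: 0.9 for e in range(1, 6)}        # hypothetical element reliabilities
p_down = {e: 1.0 - p for e, p in p_up.items()}

cut_sets = minimal_cut_sets(tie_sets)

# Reliability = P(some tie set fully working): upper, lower, upper, ...
tie_bounds = inclusion_exclusion_partial_sums(tie_sets, p_up)

# Reliability = 1 - P(some cut set fully failed): lower, upper, lower, ...
cut_bounds = [1.0 - q for q in inclusion_exclusion_partial_sums(cut_sets, p_down)]

print(cut_sets)
print(tie_bounds)
print(cut_bounds)
```

For this small system the minimal cut sets come out as {5}, {1, 2} and {2, 3, 4}, and both series end at the same value once every term is included. With highly reliable elements the cut-set series tightens much faster than the tie-set series, whose first term can exceed 1.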
A new recursive algorithm for the reliability evaluation of a distributed program
Microelectronics Reliability, 1992
The reliability of a distributed program is very sensitive to how files are distributed and to the communication system topology. Analyzing the reliability of a distributed program is usually much more complex than analyzing network reliability. This paper presents a distributed program reliability analysis algorithm that is developed based on the concept of the sharp operation. The algorithm provides a
Reliability analysis techniques explored through a communication network example
1996
This paper reviews general methods used to perform dependability analysis on a given system. A communication network example is used in relation to a client/server type of application to illustrate the reliability and availability modeling techniques. We review both non-state-space and state-space-based methods and discuss the benefits and limitations of each. The paper assumes a general understanding of probability theory.
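As a rough illustration of the two families of methods mentioned here, the sketch below combines a tiny state-space model (a two-state up/down Markov chain per component) with a non-state-space model (a series/parallel block diagram). The client/link/server layout, the rates, and the function names are all hypothetical, and components are assumed to fail and be repaired independently; this is not the example worked in the paper.

```python
from math import prod


def steady_state_availability(failure_rate, repair_rate):
    """Two-state Markov chain: balancing flow between the up and down
    states gives A = repair_rate / (failure_rate + repair_rate)."""
    return repair_rate / (failure_rate + repair_rate)


def series(*availabilities):
    """Every block must be up for the system to be up."""
    return prod(availabilities)


def parallel(*availabilities):
    """The system is up if at least one block is up."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical rates per hour, chosen only for illustration.
client = steady_state_availability(failure_rate=0.001, repair_rate=0.1)
link = steady_state_availability(failure_rate=0.0005, repair_rate=0.5)
server = steady_state_availability(failure_rate=0.002, repair_rate=0.2)

# Client and link in series with two redundant servers.
system = series(client, link, parallel(server, server))
print(f"system availability = {system:.6f}")
```

The block-diagram step is only valid because of the independence assumption; shared repair crews or correlated failures would need a larger state-space model.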
A mathematical approach for improving the reliability in computing system
Reliability is the extent to which a measurement procedure yields the same results on repeated trials. Without reliable measures, scientists cannot build or test theory, and therefore cannot develop productive and efficient procedures for improving the quality of life. Reliability analysis of distributed computing systems is a current necessity and an important area of research. There are a number of ways to improve the reliability of a distributed computing system; optimizing reliability through task allocation is one of them. Reliability is the assessment we make of how much measurement error we have experienced in processing our data. The problem presented in this paper is based on the execution unreliability of the tasks on the processors. The main objective is to minimize the overall processing unreliability by allocating the tasks optimally to the processors of the distributed computing system using the method of matrix partitioning. Several sets of input data are considered to test the complexity and efficiency of the algorithm. The algorithm is found to be suitable for an arbitrary number of processors with random program structures, and workable in all cases.
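The allocation objective can be made concrete with a small sketch. The code below is not the matrix partitioning method of the paper; it is a brute-force search, assuming one task per processor and an equal number of tasks and processors, over a hypothetical execution unreliability matrix. The name best_allocation and the numbers in U are invented for illustration.

```python
from itertools import permutations


def best_allocation(unreliability):
    """Try every one-to-one task-to-processor assignment and keep the one
    with the smallest total execution unreliability."""
    n = len(unreliability)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(unreliability[task][proc] for task, proc in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


# U[task][processor]: hypothetical execution unreliabilities.
U = [
    [0.01, 0.02, 0.09],
    [0.01, 0.08, 0.07],
    [0.05, 0.03, 0.04],
]

allocation, total = best_allocation(U)
for task, proc in enumerate(allocation):
    print(f"task {task} -> processor {proc}")
print(f"total unreliability = {total:.2f}")
```

In this matrix, giving each task its individually best processor in turn is not optimal, because tasks 0 and 1 both prefer processor 0. Exhaustive search grows factorially with the number of tasks, which is why structured methods are used for larger instances.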
A fast, general-purpose algorithm for reliability evaluation of distributed systems
Ninth Annual International Phoenix Conference on Computers and Communications. 1990 Conference Proceedings, 1990
In this paper a fast and general-purpose algorithm for reliability evaluation of distributed systems is developed. The algorithm is general in the sense that it is used to evaluate different network reliability measures such as terminal, K-terminal, all-terminal, and degraded system reliability. Furthermore, it is used to evaluate the measures above in networks which are directed or undirected as well as in networks with perfect or imperfect nodes. The options for numeric or symbolic and exact or approximate reliability computation are also available. The algorithm efficiently calculates reliability of distributed systems with size of practical interest, and has comparative or much better speed than other reported algorithms. Another significant contribution of this paper is the method for very efficient handling of the case of imperfect nodes, an issue which is particularly important for distributed systems with unreliable processing nodes.
for a specified set of nodes K ⊆ V, it is the probability that each pair of nodes of K is connected via an operative path. The two most common special cases of K-terminal reliability are: terminal reliability, where |K| = 2, and all-terminal reliability, where K = V. A well-known result is that K-terminal, terminal and
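The K-terminal definition can be checked on a toy network by enumerating every combination of working and failed links. The sketch below does exactly that; it is exponential in the number of links and is not the fast algorithm of the paper. It assumes an undirected network, perfect nodes, and independent link failures, and the five-link bridge network with link reliability 0.9 is a hypothetical example.

```python
from itertools import product


def k_terminal_reliability(edges, probs, terminals):
    """Probability that all terminals are mutually connected, computed by
    enumerating all up/down states of the edges (perfect nodes assumed)."""
    terminals = list(terminals)
    total = 0.0
    for state in product((True, False), repeat=len(edges)):
        prob = 1.0
        adjacency = {}
        for up, (u, v), p in zip(state, edges, probs):
            prob *= p if up else 1.0 - p
            if up:
                adjacency.setdefault(u, set()).add(v)
                adjacency.setdefault(v, set()).add(u)
        # Depth-first search from one terminal over the working edges.
        seen, stack = {terminals[0]}, [terminals[0]]
        while stack:
            for nxt in adjacency.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if all(t in seen for t in terminals):
            total += prob
    return total


# Hypothetical bridge network: s and t joined through a and b.
edges = [("s", "a"), ("s", "b"), ("a", "b"), ("a", "t"), ("b", "t")]
probs = [0.9] * len(edges)

two_terminal = k_terminal_reliability(edges, probs, ["s", "t"])            # |K| = 2
all_terminal = k_terminal_reliability(edges, probs, ["s", "a", "b", "t"])  # K = V
print(two_terminal, all_terminal)
```

For the bridge network with a common link reliability p, the two-terminal value agrees with the textbook polynomial 2p^2 + 2p^3 - 5p^4 + 2p^5, and the all-terminal value can never exceed it because it requires more nodes to be connected.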
Reliability modelling for some computer systems
Microelectronics Reliability, 1994
This paper investigates two mathematical models based on structural computer systems. There are two types of operating environment in computers, namely DOS and UNIX. The Central Processing Unit (CPU) is the brain of the computer, and it guides the monitor and dumb terminal (DT) according to the sequence of instructions given by the operator. A sensitive volume, due to micro-chips, exists in the computer. Electromagnetic interference with this sensitive volume changes the operating behaviour of the computer. These changes generate the partial and complete failure states. Several cost-related reliability measures of system effectiveness are studied using the regenerative point technique.
... software systems and the use of computers to control vital and complicated functions. Several researchers [2,3,4,7] have studied models related to computer systems, but they have analysed them for reliability and availability only. The main aim of the present study is to introduce and analyse the computer systems (DOS and UNIX) for more reliability measures. In the DOS computer system, there are two compartments, drive-C and drive-A. Here it is assumed that drive-C/drive-A may work with reduced efficiency due to a minor hardware problem. This state of the system is called the partially failed state; from this state the system may return to its original state, or it may reach the totally failed state due to a major hardware problem.
Cost-reliability driven analysis to evaluate performance of a distributed system
INTERNATIONAL SCIENTIFIC AND PRACTICAL CONFERENCE “TECHNOLOGY IN AGRICULTURE, ENERGY AND ECOLOGY” (TAEE2022)
A distributed system consists of self-standing computers, interconnected by a network and equipped with distributed system software. The hardware of a distributed system ranges in size from a few workstations connected by one local area network to very large numbers of computers linked by multiple wide area networks. The paper explores the potential of a distributed system that supports a trade-off between reliability and cost through a comprehensive search mechanism. The attributes considered, namely the reliability of execution and of communication and the cost of execution and of communication, are the principal components and are represented as matrices, explicitly CRM(,), ERM(,), CCM(,), and ECM(,). These matrices are rearranged and converted consistently with the combined tasks. For every combination of tasks, the reliability of the distributed system is evaluated together with the execution cost and the communication cost. The optimal response of the distributed computing system is obtained when the reliability aspect is evaluated for all combinations of tasks. A measure of extended performance is finally obtained.