Reliability Analysis in Parallel and Distributed Systems with Network Contention

Reliability-Based Optimal Task Allocation in Distributed Computing Systems

This paper addresses the problem of task allocation in distributed computing systems with the goal of maximizing system reliability. It first develops a mathematical model of reliability based on a cost function that represents the unreliability caused by executing tasks on the system processors and the unreliability caused by inter-processor communication costs, subject to constraints imposed by both the application and the system resources. It then presents an exact algorithm for this problem, derived from the well-known Branch-and-Bound technique. To reduce the computation needed to find an optimal allocation, the algorithm solves the dual problem, uses a best-first branching strategy to select the node to expand, and handles tasks at the tree levels in order of connectivity, with the most connected tasks first.
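A minimal sketch of such an unreliability cost function is given below. The abstract does not state the exact formula, so this assumes a commonly used form in which processors and links fail at constant rates and the reliability of an allocation is exp(-cost). All names (proc_rate, link_rate, unit_delay, and so on) are hypothetical and chosen for illustration only.

```python
import math

def unreliability_cost(assign, exec_time, proc_rate, comm, link_rate, unit_delay):
    """Cost of an allocation under an assumed constant-failure-rate model.

    assign[t]        processor that task t runs on
    exec_time[t][p]  execution time of task t on processor p
    proc_rate[p]     failure rate of processor p
    comm[(i, j)]     data volume exchanged between tasks i and j
    link_rate[p][q]  failure rate of the link between processors p and q
    unit_delay[p][q] transfer time per unit of data on that link
    """
    # Execution term: failure rate times the time each processor is busy.
    cost = sum(proc_rate[p] * exec_time[t][p] for t, p in enumerate(assign))
    # Communication term: only pairs placed on different processors use a link.
    for (i, j), volume in comm.items():
        p, q = assign[i], assign[j]
        if p != q:
            cost += link_rate[p][q] * volume * unit_delay[p][q]
    return cost

def allocation_reliability(*args):
    """Reliability as exp(-cost); maximizing it is minimizing the cost."""
    return math.exp(-unreliability_cost(*args))
```

Under this assumption, maximizing reliability is equivalent to minimizing the cost, which is the quantity a branch-and-bound search would bound at each node of the tree.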

Reliability-Guaranteed Task Assignment and Scheduling for Heterogeneous Multiprocessors Considering Timing Constraint

Journal of Signal Processing Systems, 2014

Heterogeneous multiprocessors have become the mainstream computing platforms nowadays and are increasingly employed for critical applications. Inherently, heterogeneous systems are more complex than homogeneous systems. The added complexity increases the potential of system failures. This paper addresses this problem by proposing a reliability-guaranteed task assignment and scheduling approach for heterogeneous multiprocessors considering timing constraint. We propose a two-phase approach to solve this problem. In the first phase, we determine assignments for heterogeneous multiprocessors such that both reliability requirement and timing constraint can be satisfied with the minimum total system cost. Efficient

Task allocation for maximizing reliability of distributed systems: a simulated annealing approach

Journal of Parallel and Distributed Computing, 2006

This paper addresses the problem of task allocation in heterogeneous distributed systems with the goal of maximizing system reliability. It first develops an allocation model for reliability based on a cost function that represents the unreliability caused by executing tasks on the system processors and the unreliability caused by interprocessor communication time, subject to constraints imposed by both the application and the system resources. It then presents a heuristic algorithm, derived from the well-known simulated annealing (SA) technique, to solve this problem quickly. The performance of the proposed algorithm is evaluated through experimental studies on a large number of randomly generated instances. The quality of its solutions is compared with that of solutions obtained using the branch-and-bound (BB) technique.
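The general shape of a simulated annealing search over task allocations can be sketched as follows. This is a generic textbook SA loop, not the paper's algorithm: the neighbourhood move (reassign one random task), the geometric cooling schedule, and every parameter name and default are assumptions made for illustration. The cost argument stands for any unreliability cost function over an allocation.

```python
import math
import random

def anneal(cost, n_tasks, n_procs, t0=1.0, alpha=0.95,
           steps_per_temp=50, t_min=1e-3, seed=0):
    """Generic simulated annealing over allocations (hypothetical parameters).

    cost(assign) returns the value to minimize for a list `assign`,
    where assign[t] is the processor chosen for task t.
    """
    rng = random.Random(seed)
    current = [rng.randrange(n_procs) for _ in range(n_tasks)]
    current_cost = cost(current)
    best, best_cost = current[:], current_cost
    temp = t0
    while temp > t_min:
        for _ in range(steps_per_temp):
            # Neighbour: move one randomly chosen task to a random processor.
            cand = current[:]
            cand[rng.randrange(n_tasks)] = rng.randrange(n_procs)
            cand_cost = cost(cand)
            delta = cand_cost - current_cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / temp) (Metropolis criterion).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current, current_cost = cand, cand_cost
                if current_cost < best_cost:
                    best, best_cost = current[:], current_cost
        temp *= alpha  # geometric cooling
    return best, best_cost
```

Application and resource constraints are not modelled here; one simple option, again an assumption, is to add a penalty to the cost for any allocation that violates them.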

Reliability Driven Task Scheduling for Heterogeneous Systems

In recent years, more and more heterogeneous processor cores are embedded into a single chip. To deploy such heterogeneous embedded systems in critical applications, e.g., aircraft control, battleship missile launches, nuclear plant safe operations, etc., an important research problem is how to maximize system reliability while satisfying the required time constraint. Therefore, a scheduling scheme is needed to exploit the heterogeneity of a system and satisfy both the reliability requirement and the given time constraint. In this paper, we study the heterogeneous reliability scheduling problem, i.e., given a heterogeneous system, a Directed Acyclic Graph (DAG) that models an application and a time constraint, find a schedule for the DAG so that the system reliability can be maximized and the time constraint can be met. To solve this problem, two heuristic algorithms, MCMS and PRMS, are proposed. The experimental results show that our algorithms can improve system reliability significantly. Among them, PRMS has the best performance and the improvement of reliability can be up to 30%.

Optimal task allocation for maximizing reliability in distributed real-time systems

2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS), 2013

Distributed systems have been developed as a platform for large-scale computation. Reliability is one of the prominent issues in such systems. Many recent studies have sought to improve reliability through proper task allocation in distributed systems, but they have considered only some system constraints, such as processing load, memory capacity, and communication rate. In this paper, we add a time constraint, in the form of task deadlines, to the above-mentioned constraints in order to model and analyze reliability in distributed real-time systems. To maximize reliability while satisfying the constraints, we propose a new offline task allocation algorithm. The algorithm is Systematic Memory-based Simulated Annealing (SMSA), which uses a monotonic cooling schedule and a limited memory of recently visited solutions to prevent cycling. In addition, an effective greedy heuristic algorithm intensifies SMSA. To evaluate the algorithm, SMSA is compared with a Genetic Algorithm (GA) and Simulated Annealing (SA). The results show that, in contrast to SA and GA, SMSA obtains satisfactory reliability in reasonable execution time. SMSA also meets all deadlines, as do SA and GA. Furthermore, SMSA results deviate little from the average reliability.

A heuristic task assignment algorithm to maximize reliability of a distributed system

IEEE Transactions on Reliability, 1993

Distributed systems potentially provide high reliability owing to the program and data-file redundancy they make possible. In many applications, high reliability is the major consideration in system design. Work by Kumar, Hariri, and Raghavendra shows that the distribution of programs and data-files can affect system reliability appreciably, and that redundancy in resources such as computers, programs, and data-files can improve the reliability of a distributed system. This paper first formulates a practical application for a reliability-oriented distributed task assignment problem, which is NP-hard. Then, to cope with this challenging problem, we propose a greedy algorithm, based on some heuristics, to find an approximate solution. The simulation shows that, in most cases tested, the algorithm finds suboptimal solutions efficiently; therefore, it is a desirable approach to solving these problems.

OVERLAPPED CLUSTERING APPROACH FOR MAXIMIZING THE SERVICE RELIABILITY OF HETEROGENEOUS DISTRIBUTED COMPUTING SYSTEMS

For a distributed computing system (DCS) in which server nodes can fail permanently with nonzero probability, the reliability of the system can be defined as the probability that the system successfully runs all the tasks assigned to it before all the nodes fail. In a heterogeneous distributed system, where the nodes have different characteristics, the reliability of the system depends heavily on the task allocation strategy. This paper therefore presents a rigorous framework for efficient task allocation in a heterogeneous distributed environment, with the goal of maximizing system reliability. The reliability of the system is characterized in the presence of communication uncertainties and topological changes due to node failures. Node failure has adverse effects on system reliability. Thus, one possible way to improve reliability is to keep the communication among tasks as local as possible. For this, an overlapped clustering approach is used. Further, we calculate the reliability of each node of the DCS to determine its actual capabilities. Our purpose here is to assign the more costly tasks to the more reliable nodes of the DCS. We then use load balancing policies to handle the effects of node failure and to maximize the service reliability of the DCS. A numerical example is presented to illustrate the importance of incorporating overlapping clusters and load balancing in the reliability study.
... collection of geographically dispersed heterogeneous computing resources fully connected to each other. There is no shared memory in these types of systems. Every system has its own local memory. The systems of the DCS communicate with each other via message passing over the network. These messages may take an arbitrary time to travel from source to destination. Unlike a parallel computing environment, the nodes of a distributed computing system offer heterogeneous computing capabilities. In addition, the communication network typically suffers from both low bandwidth and significant latency in information exchange. So, in order to exploit the processing capability of a DCS, a parallel application is divided into independently executable units, called tasks, which are executed concurrently on different nodes of the DCS. In the literature, such allocation of tasks to the nodes of a DCS is referred to as task assignment.

Reliability versus performance for critical applications

Journal of Parallel and Distributed Computing, 2009

Applications implemented on critical systems are subject to both safety critical and real-time constraints. Classically, applications are specified as precedence task graphs that must be scheduled onto a given target multiprocessor heterogeneous architecture. We propose a new method for optimizing simultaneously two objectives: the execution time and the reliability of the schedule. The problem is decomposed in two successive steps: a spatial allocation during which the reliability is maximized (randomized algorithm), and a scheduling during which the makespan is minimized (list scheduling algorithm). It allows us to produce several trade-off solutions among which the user can choose the solution that fits the application's requirements the best. Reliability is increased by replicating adequate tasks onto well chosen processors. Our fault model assumes that processors are fail-silent, that they are subject to transient failures, and that the occurrences of failures follow a constant parameter Poisson law. We assess and validate our method by running extensive simulations on both random graphs and actual application graphs. They show that it is competitive, in terms of makespan, compared to existing reference scheduling methods for heterogeneous processors (HEFT), while providing a better reliability.
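The effect of replication under a constant-parameter Poisson failure law can be illustrated with a short calculation. The sketch below assumes that a task running for a given duration on a processor with failure rate lambda succeeds with probability exp(-lambda * duration), and that replicas fail independently, so a replicated task fails only if every replica fails. The function names are hypothetical, and the independence assumption is ours rather than something the abstract states.

```python
import math

def task_reliability(fail_rate, duration):
    """Probability of no failure during `duration` under a Poisson law
    with constant rate `fail_rate`."""
    return math.exp(-fail_rate * duration)

def replicated_reliability(replicas):
    """Reliability of one task replicated on several processors.

    replicas is an iterable of (fail_rate, duration) pairs, one per copy.
    Assumes independent failures: the task is lost only if all copies fail.
    """
    p_all_fail = 1.0
    for fail_rate, duration in replicas:
        p_all_fail *= 1.0 - task_reliability(fail_rate, duration)
    return 1.0 - p_all_fail
```

Each added replica can only raise the reliability of the task, but it also occupies a processor for the replica's duration, which is the source of the trade-off between reliability and makespan.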

An Effective Reliability Efficient Algorithm for Enhancing the Overall Performance of Distributed Computing System

International Journal of Computer Applications, 2013

Distributed computing refers to the use of distributed systems to solve computational problems. A distributed computing system consists of multiple computers that communicate through a computer network. The computers in a distributed computing system can be physically close together and connected by a local network, or they can be geographically distant and connected by a wide area network. Distributed computing systems offer benefits such as scalability and redundancy. A task is any single module to be processed. If there are more tasks than processors, and each processor takes a particular time to process any given task, then each task must be allocated to a single processor in such a way that the tasks are completed with optimal reliability and no processor is overloaded. The number of processors and the number of tasks are static. The number of processors is denoted by n and the number of tasks by m. In general, in real-world problems the number of tasks is greater than the number of processors, i.e., m > n. The requirement is to complete all the tasks by allocating them so that the resulting reliability is optimal, thereby increasing the overall performance of the distributed computing system.

Communication contention in task scheduling

IEEE Transactions on Parallel and Distributed Systems, 2000

Task scheduling is an essential aspect of parallel programming. Most heuristics for this NP-hard problem are based on a simple system model that assumes fully connected processors and concurrent interprocessor communication. Hence, contention for communication resources is not considered in task scheduling, yet it has a strong influence on the execution time of a parallel program. This paper investigates the incorporation of contention awareness into task scheduling. A new system model for task scheduling is proposed, allowing us to capture both end-point and network contention. To achieve this, the communication network is reflected by a topology graph for the representation of arbitrary static and dynamic networks. The contention awareness is accomplished by scheduling the communications, represented by the edges in the task graph, onto the links of the topology graph. Edge scheduling is theoretically analyzed, including aspects like heterogeneity, routing, and causality. The proposed contention-aware scheduling preserves the theoretical basis of task scheduling. It is shown how classic list scheduling is easily extended to this more accurate system model. Experimental results show the significantly improved accuracy and efficiency of the produced schedules.
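The idea of scheduling a communication onto the links of a topology graph can be sketched as follows. This is a deliberately simplified illustration, not the paper's model: it assumes each link carries one transfer at a time, that a transfer occupies each link on its route for the same duration, and that links are traversed one after another (store-and-forward). The data layout and function names are hypothetical.

```python
import bisect

def earliest_slot(busy, ready, duration):
    """Earliest start >= ready at which a link is free for `duration`.

    busy is a list of non-overlapping (start, end) intervals sorted by start.
    """
    start = ready
    for b_start, b_end in busy:
        if start + duration <= b_start:
            break  # the gap before this interval is large enough
        start = max(start, b_end)
    return start

def schedule_edge(links, route, ready, duration):
    """Reserve a transfer on every link of `route`, in order.

    links maps a link id to its sorted list of busy intervals and is
    updated in place. Returns the time the data arrives at the destination.
    """
    t = ready
    for link in route:
        busy = links.setdefault(link, [])
        start = earliest_slot(busy, t, duration)
        bisect.insort(busy, (start, start + duration))
        t = start + duration  # the next hop cannot start before this one ends
    return t
```

In a list scheduler, the arrival time returned here would replace the fixed edge weight when computing the earliest start time of the destination task, so that two transfers sharing a link delay each other instead of overlapping.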