A Taxonomy of Distributed Debuggers Based on Execution Replay

On-the-fly replay: a practical paradigm and its implementation for distributed debugging

Parallel and Distributed Processing, International Symposium, 1994

This paper presents a practical paradigm, called on-the-fly replay. This paradigm consists of running a distributed program twice at the same time: an original computation is running in a regular fashion, which also includes steps of making non-deterministic choices; this execution is driving a twin execution, whose non-deterministic choices do not have to be evaluated (since they are taken from

Parallel Program Debugging based on Data-Replay

The nondeterministic nature of parallel programs is the major difficulty in debugging them. Order-replay, a technique that addresses this problem, is widely used because of its small overhead. It has, however, several serious drawbacks: all processes of the parallel program have to participate in replay even when some of them are clearly not involved in the bug, and the programmer cannot stop the process being debugged at an arbitrary point. We adopt another method for deterministic replay, Data-replay, which logs the contents of events rather than their order and makes it possible to run and stop each process independently. Data-replay also cooperates well with reverse execution mechanisms. We applied the Data-replay mechanism to MPI-based parallel programs. Our experiments with the NAS Parallel Benchmarks show that the mechanism works at a practical cost: logging communicated data incurs only 24% overhead, while it accelerates replayed execution by 38%, both on average.
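
The core idea of Data-replay, logging message contents so that one process can be re-run alone, can be illustrated with a minimal sketch. This is not the paper's MPI implementation; the class `DataReplayChannel`, the function `worker`, and the simulated sender are hypothetical names introduced only for illustration.

```python
from collections import deque


class DataReplayChannel:
    """Hypothetical receive wrapper: records message contents, or replays them."""

    def __init__(self, mode, source=None, log=None):
        if mode not in ("record", "replay"):
            raise ValueError("mode must be 'record' or 'replay'")
        self.mode = mode
        self.source = source              # callable returning the next live message
        self.log = list(log or [])        # message contents, in receive order
        self._pending = deque(self.log)   # messages still to be replayed

    def recv(self):
        if self.mode == "record":
            msg = self.source()           # take the message from the live sender
            self.log.append(msg)          # log its contents, not just its order
            return msg
        return self._pending.popleft()    # replay: no sender is needed


def worker(channel, n):
    """The process under debug: sums n received values."""
    total = 0
    for _ in range(n):
        total += channel.recv()
    return total


# Record run: messages come from a (here simulated) live sender.
live = iter([3, 1, 4])
recorder = DataReplayChannel("record", source=lambda: next(live))
recorded_total = worker(recorder, 3)

# Replay run: the sender does not exist; the log alone drives the worker.
replayer = DataReplayChannel("replay", log=recorder.log)
replayed_total = worker(replayer, 3)
```

Because the replayed process reads only from its own log, it can be run, stopped, and restarted without the other processes taking part, which is the property the abstract contrasts with Order-replay.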

Debugging distributed applications with replay capabilities

Proceedings of the conference on TRI-Ada '97 - TRI-Ada '97, 1997

This paper focuses on the latest developments made by the ENST research team to GLADE, the implementation of the Distributed Systems Annex for the GNAT Ada95 compiler; we have extended GLADE's communication subsystem and added recording facilities and replay capabilities. This makes debugging distributed applications much easier, because each partition can be replayed separately by simulating external events at consistent dates without losing the possible determinism of the original program. It also eases debugging in cases where it is not practical to re-run the whole program, or when it is impossible to get exactly the same behaviour from one of the parts (for example, when one or several parts of the application run on embedded targets and send messages depending on sensor inputs while the other parts run on fixed workstations).

Debugging of parallel and distributed programs

This chapter surveys the main issues involved in correctness debugging of parallel and distributed programs. Distributed debugging is an instance of the more general problem of observation of a distributed computation. This chapter briefly summarizes the theoretical foundations of the distributed debugging activity. Then a survey is presented of the main methodologies used for parallel and distributed debugging, including state and event based debugging, deterministic re-execution, systematic state exploration, and correctness predicate evaluation. Such approaches are complementary to one another, and the chapter discusses how they can be supported using distinct techniques for observation and control.

Cyclic Debugging Using Execution Replay

Lecture Notes in Computer Science, 2001

This paper presents a tool that enables programmers to use cyclic debugging techniques for debugging non-deterministic parallel programs. The solution consists of a combination of record/replay with automatic on-the-fly data race detection. This combination enables us to limit the record phase to the more efficient recording of the synchronization operations, and to check for data races during a replayed execution. As the record phase is highly efficient, there is no need to switch it off, thereby eliminating the possibility of Heisenbugs because tracing can be left on all the time.
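
Recording only synchronization operations can be sketched as follows. This is an illustrative toy, not the tool described in the abstract: the class `ReplayLock` and the function `run` are hypothetical, and the sketch assumes that the replayed run performs the same lock acquisitions as the recorded run.

```python
import threading


class ReplayLock:
    """Hypothetical lock that records its acquisition order, or enforces one."""

    def __init__(self, order=None):
        self._cond = threading.Condition()
        self._held = False
        self.trace = []                       # acquisition order seen in this run
        self._order = list(order) if order is not None else None
        self._next = 0                        # position in the order being replayed

    def acquire(self, tid):
        with self._cond:
            if self._order is None:
                # Record mode: ordinary mutual exclusion.
                while self._held:
                    self._cond.wait()
            else:
                # Replay mode: also wait until it is this thread's recorded turn.
                while self._held or self._order[self._next] != tid:
                    self._cond.wait()
                self._next += 1
            self._held = True
            self.trace.append(tid)

    def release(self):
        with self._cond:
            self._held = False
            self._cond.notify_all()           # let waiters re-check whose turn it is


def run(lock, n_threads=3, rounds=2):
    """Each thread appends its id to a shared list inside the critical section."""
    shared = []

    def body(tid):
        for _ in range(rounds):
            lock.acquire(tid)
            shared.append(tid)
            lock.release()

    threads = [threading.Thread(target=body, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


record_lock = ReplayLock()                        # record phase
first = run(record_lock)
replay_lock = ReplayLock(order=record_lock.trace) # replay phase
second = run(replay_lock)
```

The trace holds one small entry per acquisition, which is why the record phase stays cheap. Only accesses protected by the lock are reproduced this way, which is the gap that race detection during replay is meant to cover.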

Global Conditions in Debugging Distributed Programs

Journal of Parallel and Distributed Computing, 1992

This paper describes algorithms for a distributed program debugger based on a replay technique. Halting at breakpoints and selective tracing are its fundamental features. In distributed systems, a given breakpoint or trace condition does not uniquely define the global state at which to halt or trace, because of the asynchrony of processes and communications. This paper therefore proposes the "first" global state Inf(P) to be the best global state at which to halt or trace, for a given condition P. Two kinds of global conditions related to multiple processes, Conjunctive Predicates and Disjunctive Predicates, are considered. The authors present an algorithm that halts processes at Inf(P) for a given Conjunctive Predicate P. It is also shown that, for a Disjunctive Predicate P, it is impossible to halt at Inf(P), but possible to halt at some state which satisfies P. An algorithm is also provided for selective tracing when a Conjunctive or Disjunctive Predicate selection condition is given. © 1992 Academic Press, Inc.
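
To make the notion of a "first" global state for a Conjunctive Predicate concrete, the sketch below uses a well-known vector-clock formulation of the question. It is not the replay-based algorithm of the paper. The function `first_consistent_cut` and its input layout are assumptions: each process contributes the vector clocks of the local states in which its own conjunct holds, and component i of a clock counts the events of process i known at that state.

```python
def first_consistent_cut(candidates):
    """Return the earliest consistent choice of one candidate per process.

    candidates[i] lists, in local order, the vector clocks (tuples) of the
    states of process i in which its local conjunct holds. Returns a list of
    clocks, or None if no consistent combination exists.
    """
    n = len(candidates)
    idx = [0] * n                      # current candidate index per process
    while True:
        if any(idx[i] >= len(candidates[i]) for i in range(n)):
            return None                # some process has run out of candidates
        heads = [candidates[i][idx[i]] for i in range(n)]
        # Candidate i is stale if another head has already seen a later
        # event of process i: it lies in that head's causal past.
        stale = [
            i for i in range(n)
            if any(heads[j][i] > heads[i][i] for j in range(n) if j != i)
        ]
        if not stale:
            return heads               # no head is in another's causal past
        for i in stale:
            idx[i] += 1                # a stale candidate can never be used


# P0's conjunct holds after its 1st and 3rd events; P1's holds only after
# it has received a message sent by P0's 2nd event.
cut = first_consistent_cut([[(1, 0), (3, 0)], [(2, 1)]])
no_cut = first_consistent_cut([[(1, 0)], [(2, 1)]])
```

In the example, P0's first candidate is discarded because P1's candidate already depends on a later event of P0, so the earliest consistent combination pairs P0's second candidate with P1's only one.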

A Framework to Support Parallel and Distributed Debugging

High-Performance …, 1998

We discuss debugging prototypes that can easily support new functionalities, depending on the requirements of high-level computational models, and allowing a coherent integration with other tools in a software engineering environment. Concerning the first aspect, we propose a framework that identifies two distinct levels of functionalities that should be supported by a parallel and distributed debugger: a process and thread level, and a coordination level concerning sets of processes or threads. An incremental approach is used to effectively develop prototypes that support both functionalities. Concerning the second aspect, we discuss how the interfacing with other tools has influenced the design of a process-level debugging interface (PDBG) and a distributed monitoring and control layer (DAMS).