
Experiments with a transputer-based parallel graph reduction machine

Concurrency: Practice and Experience, 1991

This paper is concerned with the implementation of functional languages on a parallel architecture, using graph reduction as the model of computation. Parallelism in such systems is derived automatically by the compiler, but a major problem is the fine granularity, illustrated by divide-and-conquer problems at the leaves of the computational tree. The paper addresses this issue and proposes a method, based on static analysis combined with run-time tests, for removing excess parallelism. We report experiments on a prototype machine simulated on several connected INMOS transputers. Performance figures show the benefits of adopting the method, as well as the difficulty of automatically deriving the optimal partitioning, owing to differences among the problems.
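The run-time test described above can be sketched as a size threshold below which a divide-and-conquer reduction stops sparking parallel tasks and proceeds sequentially. The threshold value and the task-counting model here are illustrative assumptions, not the paper's actual mechanism:

```python
# Sketch of run-time granularity control for divide-and-conquer
# (illustrative; the threshold and task model are assumptions).

THRESHOLD = 8  # below this problem size, reduce sequentially

def psum(xs, spawned):
    """Sum a list divide-and-conquer style, counting how many
    subproblems would have been sparked as parallel tasks."""
    if len(xs) <= THRESHOLD:
        return sum(xs), spawned          # sequential leaf: no new task
    mid = len(xs) // 2
    left, spawned = psum(xs[:mid], spawned + 1)   # would be a spark
    right, spawned = psum(xs[mid:], spawned + 1)  # would be a spark
    return left + right, spawned

total, tasks = psum(list(range(64)), 0)
print(total, tasks)  # 2016 14 -> far fewer tasks than one per leaf
```

Raising the threshold trades parallel slack for coarser, cheaper tasks, which is exactly the tuning problem the experiments explore.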

Distributed implementation of programmed graph reduction

Lecture Notes in Computer Science, 1989

Programmed graph reduction has been shown to be an efficient implementation technique for lazy functional languages on sequential machines. Viewing programmed graph reduction as a generalization of conventional environment-based implementations, in which activation records are allocated in a graph instead of on a stack, makes it straightforward to use this technique for the execution of functional programs on a parallel machine with distributed memory. We describe in this paper the realization of programmed graph reduction in PAM, a parallel abstract machine with distributed memory. Results of our implementation of PAM on an Occam/transputer system are given.
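The generalization described here, allocating activation records in a heap graph rather than on a contiguous stack, can be sketched as follows; the record layout and names are illustrative assumptions, not PAM's actual representation:

```python
# Sketch: activation records as heap-allocated graph nodes rather than
# stack frames (illustrative of the idea, not PAM's actual layout).

import itertools

heap = {}                     # address -> activation record
fresh = itertools.count()

def push_frame(fn_name, args, parent):
    """Allocate an activation record in the heap graph; it points to
    its caller's record instead of sitting above it on a stack."""
    addr = next(fresh)
    heap[addr] = {"fn": fn_name, "args": args, "parent": parent}
    return addr

# Because records are ordinary graph nodes, a distributed machine can
# ship one to another processor; a contiguous stack could not.
main = push_frame("main", (), None)
fac3 = push_frame("fac", (3,), main)
fac2 = push_frame("fac", (2,), fac3)

# Walk the caller chain exactly as one would pop a stack:
chain = []
a = fac2
while a is not None:
    chain.append(heap[a]["fn"])
    a = heap[a]["parent"]
print(chain)  # ['fac', 'fac', 'main']
```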

Parallel graph rewriting on loosely coupled machine architectures

Lecture Notes in Computer Science, 1991

Graph rewriting models are well suited to serve as the basic computational model for functional languages and their implementation. Graphs are used to share computations, which is needed to make efficient implementations of functional languages on sequential hardware possible. When graphs are rewritten (reduced) on loosely coupled parallel machine architectures, subgraphs have to be copied from one processor to another, whereby sharing is lost. In this paper we introduce the notion of lazy copying. With lazy copying it is possible to duplicate a graph without duplicating work. Lazy copying can be combined with simple annotations which control the order of reduction. In principle, only interleaved execution of the individual reduction steps is possible. However, a condition is deduced under which parallel execution is allowed. When only certain combinations of lazy copying and annotations are used, it is guaranteed that this so-called non-interference condition is fulfilled. Abbreviations for these combinations are introduced. Complex process behaviours, such as process communication on a loosely coupled parallel machine architecture, can now be modelled. This also includes a special case: modelling multiprocessing on a single processor. Arbitrary process topologies can be created, and both synchronous and asynchronous process communication can be modelled. The implementation of the language Concurrent Clean, which is based on the proposed graph rewriting model, has shown that complicated parallel algorithms, going far beyond divide-and-conquer-like applications, can be expressed.
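The core of lazy copying, duplicating a graph without duplicating work, can be sketched as a copy operation that shares any node whose value has not yet been computed. This is a minimal illustrative model, not Concurrent Clean's implementation:

```python
# Sketch of lazy copying (illustrative model): a copy of a graph
# shares any node that is not yet in normal form, so the pending
# computation is reduced only once.

class Node:
    def __init__(self, compute):
        self.compute = compute
        self.value = None
        self.evaluated = False
        self.evals = 0            # instrumentation: count reductions

    def force(self):
        if not self.evaluated:
            self.value = self.compute()
            self.evaluated = True
            self.evals += 1
        return self.value

def lazy_copy(node):
    """Copy an evaluated node by value; keep a shared reference to an
    unevaluated one, deferring the copy until it is in normal form."""
    if node.evaluated:
        fresh = Node(lambda: None)
        fresh.value, fresh.evaluated = node.value, True
        return fresh
    return node               # share the pending work

expensive = Node(lambda: sum(range(1000)))
copy = lazy_copy(expensive)   # pending work is shared, not duplicated
print(copy.force(), expensive.evals)  # 499500 1 -> reduced once
```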

Efficient shared-memory support for parallel graph reduction

Future Generation Computer Systems, 1997

This paper presents the results of a simulation study of cache coherency issues in parallel implementations of functional programming languages. Parallel graph reduction uses a heap shared between processors for all synchronisation and communication. We show that a high degree of spatial locality is often present and that the rate of synchronisation is much greater than for imperative programs. We propose a modified coherency protocol with static cache line ownership and show that this allows locality to be exploited at least to the level of a conventional protocol, but without the unnecessary serialisation and network transactions the latter usually causes. The new protocol avoids false sharing, and makes it possible to reduce the number of messages exchanged, but relies on increasing the size of the cache lines exchanged to do so. It is therefore of most benefit with a high-bandwidth interconnection network with relatively high communication latencies or message-handling overheads.
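The message-count argument can be illustrated with a toy model of false sharing: two processors alternately write different words that happen to share one cache line. Under migrating ownership the line ping-pongs; under static ownership the non-owner forwards its writes to the fixed owner. The message costs below are assumptions for illustration, not the paper's measured protocol:

```python
# Toy coherence model (an illustration, not the paper's protocol).

def migrating_ownership(writes):
    """Count messages when line ownership follows the last writer:
    each change of writer costs an invalidate plus a line transfer."""
    owner, msgs = None, 0
    for proc, _word in writes:
        if owner != proc:
            msgs += 2             # invalidate + transfer the line
            owner = proc
    return msgs

def static_ownership(writes, owner=0):
    """Count messages when the line has one fixed owner: non-owner
    writes are forwarded as single messages; the line never moves."""
    msgs = 0
    for proc, _word in writes:
        if proc != owner:
            msgs += 1
    return msgs

trace = [(i % 2, i % 2) for i in range(8)]   # P0, P1, P0, P1, ...
print(migrating_ownership(trace), static_ownership(trace))  # 16 4
```

The model also shows the stated trade-off: the forwarded writes only pay off if the network handles fewer, possibly larger, messages cheaply.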

MaRs: a parallel graph reduction multiprocessor

ACM SIGARCH Computer Architecture News, 1988

We describe the MaRS machine: a parallel, distributed-control multiprocessor for graph reduction using a functional machine language. The object code language is based on an optimized set of combinators, and its functional character allows automatic parallelisation of the execution. A programming language, "MARS LISP", has also been developed. A prototype of MaRS is currently being designed in VLSI 1.5-micron CMOS technology with 2 levels of metal, by means of a CAD system. The machine uses three basic types of processors for Reduction, Memory and Communication, plus auxiliary I/O and Arithmetic Processors; communications do not constitute an operational bottleneck, as interprocessor messages are routed via an Omega switching network. Initially, a Host Computer will be used for startup, testing and direct memory access. The machine architecture and its functional organization are described, as well as the theoretical execution model. We conclude with a number of specialized hardware and software mechanisms that differentiate the MaRS machine from other similar ongoing projects.

A Distributed Virtual Machine for Parallel Graph Reduction

8th International Conference on Parallel and Distributed Computing Applications and Technologies (PDCAT '07), 2007

We present the architecture of nreduce, a distributed virtual machine which uses parallel graph reduction to run programs across a set of computers. It executes code written in a simple functional language which supports lazy evaluation and automatic parallelisation. The execution engine abstracts away details of parallelism and distribution, and uses JIT compilation to produce efficient code. This work is part of a broader project to provide a programming environment for developing distributed applications which hides low-level details from the application developer. The language we have designed plays the role of an intermediate form into which existing functional languages can be transformed. The runtime system demonstrates how distributed execution can be implemented directly within a virtual machine, instead of as a separate piece of middleware that coordinates the execution of external programs.

An instruction fetch unit for a graph reduction machine

ACM SIGARCH Computer Architecture News, 1986

The G-machine provides architectural support for the evaluation of functional programming languages by graph reduction. This paper describes an instruction fetch unit for such an architecture that provides a high throughput of instructions, low latency and adequate elasticity in the instruction pipeline. This performance is achieved by a hybrid instruction set and a decoupled RISC architecture. The hybrid instruction set consists of complex instructions that reflect the abstract architecture and simple instructions that reflect the hardware implementation. The instruction fetch unit performs translation from complex instructions to sequences of simple instructions which can be executed rapidly. A suitable mix of techniques, including caches, buffers and the translation scheme, provides the memory bandwidth required to feed a RISC execution unit. The simulation results identify the performance gains, maximum throughput and minimum latency achieved by the various techniques. Results achieved...
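The complex-to-simple translation step can be sketched as a table-driven expansion in the fetch unit. The micro-op names and expansions below are invented for illustration and are not the paper's actual instruction encodings:

```python
# Sketch of the hybrid decode idea: each complex G-machine-style
# instruction expands to a sequence of simple RISC-like micro-ops
# (all mnemonics here are illustrative assumptions).

EXPANSIONS = {
    "MKAP":  ["ALLOC 2", "STORE fun", "STORE arg", "PUSH node"],
    "EVAL":  ["SAVE context", "JUMP unwind"],
    "SLIDE": ["LOAD top", "ADJUST sp", "STORE top"],
}

def fetch_unit(program):
    """Translate a stream of complex instructions into the simple
    instructions a RISC execution unit consumes."""
    for instr in program:
        yield from EXPANSIONS.get(instr, [instr])  # simple ops pass through

micro = list(fetch_unit(["MKAP", "EVAL"]))
print(micro)
# ['ALLOC 2', 'STORE fun', 'STORE arg', 'PUSH node',
#  'SAVE context', 'JUMP unwind']
```

Decoupling this expansion from execution is what lets buffers and caches smooth out the variable-length expansions and keep the execution unit fed.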

Functional programming and parallel graph rewriting

1993

In a declarative programming language a computation is expressed in a static fashion, as a list of declarations. A program in such a language is regarded as a specification that happens to be executable as well. In this textbook we focus on a subclass of the declarative languages, the functional programming languages, sometimes called applicative languages. In these languages a program consists of a list of function definitions. The execution of a program consists of the evaluation of a function application, given the functions that have been defined.
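The textbook's view of a program, a list of function definitions executed by evaluating one application, can be sketched minimally as follows; the representation and names are illustrative, not the book's notation:

```python
# Minimal sketch: a functional program as a list of function
# definitions, run by evaluating a function application
# (representation is an illustrative assumption).

definitions = {
    # name -> (parameters, body as a function of the parameters)
    "square": (("x",), lambda x: x * x),
    "inc":    (("x",), lambda x: x + 1),
}

def apply_fn(name, *args):
    """Evaluate the application of a defined function to arguments."""
    params, body = definitions[name]
    assert len(args) == len(params)
    return body(*args)

print(apply_fn("square", apply_fn("inc", 6)))  # 49
```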

Parallel graph reduction with the (v, G)-machine

Proceedings of the Fourth International Conference on Functional Programming Languages and Computer Architecture (FPCA '89), 1989

We have implemented a parallel graph reducer on a commercially available shared-memory multiprocessor (a Sequent Symmetry™), which achieves real speedup compared to a fast compiled implementation of the conventional G-machine. Using 15 processors, this speedup ranges between 5 and 11, depending on the program. Underlying the implementation is an abstract machine called the (v, G)-machine. We describe the sequential and the parallel (v, G)-machine, and our implementation of them. We provide performance and speedup figures and graphs.

Divide-and-Conquer and parallel graph reduction

Parallel Computing, 1991

This paper is concerned with the design of a multiprocessor system supporting the parallel execution of functional programs. Parallelism in such systems is derived automatically by the compiler, but this parallelism is unlikely to match the physical constraints of the target machine. In this paper, these problems are identified for the class of divide-and-conquer algorithms, and a solution consisting of reducing the depth of the computational tree is proposed. A parallel graph reduction machine simulated on a network of transputers, developed for testing the proposed solutions, is described. Experiments have been conducted on some simple divide-and-conquer programs and the results are presented. Lastly, some proposals are made for an automatic system that would efficiently execute any problem belonging to this class, taking into account the nature of the problem as well as the physical characteristics of the implementation.
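Reducing the depth of the computational tree can be sketched as splitting in parallel only down to a bounded depth d, then running each subtree sequentially, which yields 2^d coarse-grained tasks. This is an illustration of the idea under assumed parameters, not the paper's system:

```python
# Sketch of depth bounding: a depth-d parallel divide phase produces
# 2**d sequential grains (parameters are illustrative assumptions).

def partition(xs, d):
    """Return the sequential grains produced by a depth-d parallel
    divide phase over the list xs."""
    if d == 0 or len(xs) <= 1:
        return [xs]
    mid = len(xs) // 2
    return partition(xs[:mid], d - 1) + partition(xs[mid:], d - 1)

grains = partition(list(range(32)), 2)
print(len(grains), [len(g) for g in grains])  # 4 [8, 8, 8, 8]
```

Choosing d to match the processor count is the kind of machine-dependent decision the proposed automatic system would have to make.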