The TRANSPOSE machine - a global implementation of a parallel graph reducer

This paper describes a new concept for the parallel implementation of functional languages on a network of processors. The implementation uses a special variant of annotated graph reduction [3]. Its main features are the following. We employ active waiting [6] to avoid complicated runtime data structures. We use a global address space and a random distribution of the graph nodes over the local memories of the processors in order to overcome the problems of load balancing and scheduling. The reduction is organized in cycles, during each of which all annotated redices are reduced. This notion of "cycles" enables us to restrict communication between the processors to the execution of a global permutation, defined by an array of messages M = [L LocalMessages × P processors]. This two-dimensional (2D) permutation is realized by a simple and fast algorithm that permutes all messages of M in 2L + 6L log(P) steps, for any sufficiently large L. This algorithm actually maps any 2D permutation to a double 2D-transpose operation [19]. Hence the implementation can be used for any network topology that supports the transpose operation (namely Shuffle Exchange [1]). The simple syntactic system and the mapping to one basic transpose operation make the implementation significantly simpler and easier to understand. The potential speedup of graph reduction programs is compared with the overhead of the implementation, giving deeper insight into parallel graph reduction.
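To make the link between per-cycle communication and a transpose concrete, the following minimal Python sketch simulates one communication step on P processors. It is an illustration under simplifying assumptions, not the algorithm of the paper: the names `transpose` and `run_cycle` are hypothetical, the buckets have variable length rather than the fixed L messages per processor of the array M, and a single transpose of destination-sorted buckets stands in for the double 2D-transpose that the paper uses to realize arbitrary permutations.

```python
def transpose(buckets):
    """One global transpose: cell [src][dst] moves to cell [dst][src]."""
    p = len(buckets)
    return [[buckets[src][dst] for src in range(p)] for dst in range(p)]


def run_cycle(outgoing, p):
    """Deliver all messages produced during one reduction cycle.

    outgoing[src] is the list of (dst, payload) pairs that processor
    `src` wants to send; the result is one inbox list per processor.
    """
    # Each processor sorts its outgoing messages into one bucket per destination.
    buckets = [[[] for _ in range(p)] for _ in range(p)]
    for src, messages in enumerate(outgoing):
        for dst, payload in messages:
            buckets[src][dst].append(payload)

    # The only communication in the cycle is this single global operation.
    received = transpose(buckets)

    # Each processor concatenates what arrived from every source.
    return [[m for cell in row for m in cell] for row in received]


if __name__ == "__main__":
    outgoing = [[(1, "a"), (2, "b")], [(0, "c")], [], [(1, "d")]]
    for proc, inbox in enumerate(run_cycle(outgoing, 4)):
        print(proc, inbox)
```

Because every processor only fills local buckets and then takes part in one collective operation, no per-message routing decisions are needed in this sketch, which is the kind of simplification the cycle-based organization is meant to allow.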