Communication complexity of PRAMs

Complexity issues in general purpose parallel computing

1991

In recent years, powerful theoretical techniques have been developed for supporting communication, synchronization and fault tolerance in general purpose parallel computing. The proposition of this thesis is that different techniques should be used to support different algorithms. The determining factor is granularity, or the extent to which an algorithm uses long blocks for communication between processors. We consider the Block PRAM model of Aggarwal, Chandra and Snir, a synchronous model of parallel computation in which the processors communicate by accessing a shared memory. In the Block PRAM model, there is a time cost for each access by a processor to a block of locations in the shared memory. This feature of the model encourages the use of long blocks for communication. In the thesis we present Block PRAM algorithms and lower bounds for specific problems on arrays, lists, expression trees, graphs, strings, binary trees and butterflies. These results introduce useful basic techniques for parallel computation in practice, and provide a classification of problems and algorithms according to their granularity. Also presented are optimal algorithms for universal hashing and skewing, which are techniques for supporting conflict-free memory access in general- and special-purpose parallel computations, respectively. We explore the Block PRAM model as a theoretical basis for the design of scalable general purpose parallel computers. Several simulation results are presented which show the Block PRAM model to be comparable to, and competitive with, other models that have been proposed for this role. Two major advantages of machines based on the Block PRAM model are that they are able to preserve the granularity properties of individual algorithms and can efficiently incorporate a significant degree of fault tolerance. The thesis also discusses methods for the design of algorithms that do not use synchronization. We apply these methods to define fast circuits for several fundamental Boolean functions.

Communication Complexity of PRAMs

2002

We propose a model, LPRAM, for parallel random access machines with local memory that captures both the communication and computational requirements in parallel computation. For this model, we present several interesting results, including the following: Two $n \times n$ matrices can be multiplied in $O(n^3/p)$ computation time and $O(n^2/p^{2/3})$ communication steps using $p$ processors (for $p = O(n^3/\log^{3/2} n)$). Furthermore, these bounds are optimal for arithmetic on semirings (using $+$, $\times$ only). It is shown that any algorithm that uses comparisons only and that sorts $n$ words requires $\Omega(n \log n/(p \log(n/p)))$ communication steps for $1 \le p \le n$. We also provide an algorithm that sorts $n$ words and uses $\Theta(n \log n/p)$ computation time and $\Theta(n \log n/(p \log(n/p)))$ communication steps. These bounds also apply for computing an $n$-point FFT graph. It is shown that computing any binary tree $t$ with $n$ nodes and height $h$ requires $\Omega(n/p + \log n + \sqrt{h})$ communication steps, and can always be computed in O(n/p + min(...
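To get a feel for how the matrix multiplication bounds scale, the sketch below evaluates the two expressions n^3/p (computation) and n^2/p^(2/3) (communication) with all constant factors dropped. The function name `lpram_matmul_bounds` is hypothetical, and the base-2 logarithm in the range check on p is an assumption, since the asymptotic statement does not fix a base.

```python
import math


def lpram_matmul_bounds(n, p):
    """Return (computation, communication) for n x n matrix multiplication
    on p processors, as bare asymptotic shapes with constants dropped.

    Assumption: the condition p = O(n^3 / log^(3/2) n) is checked with a
    base-2 logarithm and constant factor 1; this is illustrative only.
    """
    limit = n ** 3 / math.log2(n) ** 1.5
    if not 1 <= p <= limit:
        raise ValueError("p is outside the range where the bounds are stated")
    computation = n ** 3 / p              # shape of O(n^3 / p)
    communication = n ** 2 / p ** (2 / 3)  # shape of O(n^2 / p^(2/3))
    return computation, communication


if __name__ == "__main__":
    for p in (8, 64, 512):
        comp, comm = lpram_matmul_bounds(64, p)
        print(f"p={p:4d}  computation~{comp:10.1f}  communication~{comm:8.1f}")
```

Multiplying the processor count by 8 divides the computation term by 8 but the communication term only by 4, which is why communication becomes the dominant cost as p grows.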

On communication latency in PRAM computations

Proceedings of the first annual ACM symposium on Parallel algorithms and architectures, 1989

Multiprocessors typically have substantial amounts of hardware devoted to communicating between the processors. The reason is that communication delays can have a significant bearing on the performance of the machine. In shared memory machines such as the BBN Butterfly [RT86] or the IBM RP3 [Pf85], access to global memory takes tens of instruction cycles. Message-passing systems, on the other hand, have communication latency from hundreds to thousands of instruction cycles [Ka87]. The programmer attempts to minimize the effect of communication by judicious algorithm design. However, the success in doing so depends on the level of abstraction available. For instance, if the model of computation includes local as well as global memory (unlike a pure PRAM), then temporal locality of reference can be utilized to reduce communication [PU87, PY88, AC88]. There is an important aspect of communication complexity that is missing from the above picture, and from much of the research in the area. Typically, it takes a substantial period of time to get the first word from global memory, but after that, subsequent words can be obtained quite rapidly, essentially at the clock speed of the machine. This occurs, for example, in snoopy cache designs [CGBG88], local/global memory designs [GKLS83, RT86, Pf85], message passing systems [AS88], and fast processors with multibank memories [Ca88].
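The latency effect described in this abstract can be illustrated with a small cost calculation. The sketch below assumes a simplified model in which every block transfer pays a fixed startup latency plus one cycle per word; the function name `transfer_cost` and the example latency of 100 cycles are illustrative assumptions, not figures from the paper.

```python
def transfer_cost(words, block_len, latency):
    """Cycles needed to fetch `words` words from global memory in blocks
    of at most `block_len` words.

    Assumed cost model: each block costs `latency` cycles of startup plus
    one cycle per word in the block.
    """
    if words < 0 or block_len < 1 or latency < 0:
        raise ValueError("invalid parameters")
    full_blocks, remainder = divmod(words, block_len)
    cost = full_blocks * (latency + block_len)
    if remainder:
        # A final, shorter block still pays the full startup latency.
        cost += latency + remainder
    return cost


if __name__ == "__main__":
    for block_len in (1, 16, 1024):
        print(block_len, transfer_cost(1024, block_len, latency=100))
```

Under these assumptions, fetching 1024 words one at a time costs 103424 cycles, while fetching them as a single block costs 1124 cycles, so amortizing the startup latency over long blocks dominates everything else.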

On the Impact of Communication Complexity on the Design of Parallel Numerical Algorithms

IEEE Transactions on Computers, 2000

This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In this second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.

Communication-efficient parallel algorithms for distributed random-access machines

Algorithmica, 1988

This paper introduces a model for parallel computation, called the distributed random-access machine (DRAM), in which the communication requirements of parallel algorithms can be evaluated. A DRAM is an abstraction of a parallel computer in which memory accesses are implemented by routing messages through a communication network. A DRAM explicitly models the congestion of messages across cuts of the network. We introduce the notion of a conservative algorithm as one whose communication requirements at each step can be bounded by the congestion of pointers of the input data structure across cuts of a DRAM. We give a simple lemma that shows how to "shortcut" pointers in a data structure so that remote processors can communicate without causing undue congestion. We give O(lg n)-step, linear-processor, linear-space, conservative algorithms for a variety of problems on n-node trees, such as computing treewalk numberings, finding the separator of a tree, and evaluating all subexpressions in an expression tree. We give O(lg^2 n)-step, linear-processor, linear-space, conservative algorithms for problems on graphs of size n, including finding a minimum-cost spanning forest, computing biconnected components, and constructing an Eulerian cycle. Most of these algorithms use as a subroutine a generalization of the prefix computation to trees. We show that any such treefix computation can be performed in O(lg n) steps using a conservative variant of Miller and Reif's tree-contraction technique.
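As a rough illustration of what replacing a pointer by a shortcut means, the sketch below simulates standard pointer jumping (recursive doubling) on a linked list. This is not the conservative technique of the paper, which is designed to avoid the congestion that plain doubling can cause; it only shows how shortcuts let every node learn its distance to the tail in a logarithmic number of synchronous rounds. The function name and the list encoding are assumptions made for the example.

```python
def pointer_jumping_ranks(succ):
    """Compute each node's distance to the tail of a linked list.

    succ[i] is the index of node i's successor, or None for the tail.
    Every round replaces each pointer by its successor's pointer (a
    shortcut) and adds up the distances, simulating one synchronous
    parallel step. Returns (ranks, number_of_rounds).
    """
    n = len(succ)
    nxt = list(succ)
    rank = [0 if s is None else 1 for s in succ]
    rounds = 0
    while any(s is not None for s in nxt):
        # Read only the old arrays and write new ones, so the loop body
        # behaves as if all nodes acted at the same time.
        new_rank, new_nxt = list(rank), list(nxt)
        for i in range(n):
            j = nxt[i]
            if j is not None:
                new_rank[i] = rank[i] + rank[j]
                new_nxt[i] = nxt[j]
        rank, nxt = new_rank, new_nxt
        rounds += 1
    return rank, rounds


if __name__ == "__main__":
    # The list 0 -> 1 -> 2 -> 3 -> 4, with node 4 as the tail.
    print(pointer_jumping_ranks([1, 2, 3, 4, None]))
```

Replacing the addition with any associative operator turns the same loop into a prefix (here, suffix) computation over the list.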

A work-optimal algorithm on log^δ n processors for a P-complete problem

We present a parallel algorithm for the Lexicographically First Maximal Independent Set Problem on graphs with bounded degree 3 that is work-optimal on a shared memory machine with up to log^δ n processors, for any 0 < δ < 1. Since this problem is P-complete it follows (assuming NC ≠ P) that the algorithmics of coarse grained parallel machines and of fine grained parallel machines differ substantially.
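For readers unfamiliar with the problem, the lexicographically first maximal independent set is the one produced by the obvious sequential greedy scan over the vertices in order. The sketch below implements that sequential definition only; it is not the parallel algorithm of the paper, and the function name and adjacency-dictionary encoding are assumptions made for the example.

```python
def lexicographically_first_mis(adj):
    """Return the lexicographically first maximal independent set.

    adj maps each vertex to the set of its neighbours. Vertices are
    scanned in sorted order, and a vertex is taken exactly when none of
    its neighbours has been taken already.
    """
    chosen = set()
    for v in sorted(adj):
        if adj[v].isdisjoint(chosen):
            chosen.add(v)
    return sorted(chosen)


if __name__ == "__main__":
    # The path 1 - 2 - 3 - 4.
    path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
    print(lexicographically_first_mis(path))
```

Each decision depends on all earlier decisions, which is the sequential dependency that makes the problem hard to parallelize.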

Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques

2008

* Copyright 2007, Uzi Vishkin. These class notes reflect the theoretical part of the Parallel Algorithms course at UMD. The parallel programming part and its computer architecture context within the PRAM-On-Chip Explicit Multi-Threading (XMT) platform are provided through the XMT home page www.umiacs.umd.edu/users/vishkin/XMT and the class home page. Comments are welcome: please write to me using my last name at umd.edu