Efficient implementation of tree skeletons on distributed-memory parallel computers

Implementation of parallel tree skeletons on distributed systems

2002

Trees are useful data types, but developing efficient parallel programs that manipulate trees is known to be difficult because of their irregular and imbalanced structure. Parallel tree skeletons are designed to ease parallel programming: programmers build parallel programs by combining the skeletons. However, efficient implementations of these parallel tree skeletons on distributed systems are known to be hard to achieve. In this paper, we propose an implementation of parallel tree skeletons that runs efficiently on distributed systems. Our approach is as follows: first we partition the tree using the m-bridge technique, then we compute locally by composing functions, and finally we propagate the results over the tree. The results of several experiments show that our approach is promising even when the tree is imbalanced. Furthermore, we present the conditions for efficient implementation.
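To make the local-computation-then-propagation pattern concrete, here is a minimal sequential sketch of a binary-tree reduce skeleton of the kind such a library exposes; in a distributed setting each processor would apply it to the tree fragment it owns and the partial results would then be combined toward the root. The Node class and reduce_tree name are illustrative assumptions, not the paper's m-bridge implementation.

```python
# Minimal illustrative sketch (not the paper's m-bridge implementation):
# a binary-tree "reduce" skeleton that a distributed version would run
# on each locally held tree fragment before combining partial results.

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def reduce_tree(node, k, leaf=lambda v: v):
    """Fold a binary tree bottom-up: k(value, left_result, right_result)."""
    if node is None:
        return None
    if node.left is None and node.right is None:
        return leaf(node.value)
    return k(node.value,
             reduce_tree(node.left, k, leaf),
             reduce_tree(node.right, k, leaf))

# Example: total of all node values in a small (possibly unbalanced) tree.
t = Node(1, Node(2, Node(4)), Node(3))
total = reduce_tree(t, lambda v, l, r: v + (l or 0) + (r or 0))
print(total)  # 10
```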

Implementation of Parallel Tree Skeletons

2002

Trees are a useful data type, but they are not routinely included in parallel programming systems because their irregular structure makes them seem hard to compute with efficiently. We present a method for constructing implementations of skeletons, high-level homomorphic operations on trees, that execute in parallel. In particular, we consider the case where the size of the tree is much larger than the number of processors available, so that tree data must be partitioned. The approach uses the theory of categorical data types to derive implementation templates based on tree contraction. Many useful tree operations can be computed in time logarithmic in the size of their argument, on a wide range of parallel systems.
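As a rough illustration of the tree-contraction idea the templates are derived from, the sketch below repeatedly absorbs leaves into their parents to compute a sum over a tree given as a parent array. It shows only the rake step, not the full rake-and-compress schedule (or the categorical-data-type derivation) that yields the logarithmic-time guarantee; all names are assumptions.

```python
# Illustrative "rake"-style contraction for a tree sum (assumed parent-array
# representation; only the leaf-absorbing step is shown).

def contract_sum(parent, value):
    """parent[i] is i's parent (parent[root] == root); value[i] is i's weight."""
    value = list(value)
    alive = set(range(len(parent)))
    while len(alive) > 1:
        children = {i: 0 for i in alive}
        for i in alive:
            if parent[i] != i:
                children[parent[i]] += 1
        leaves = [i for i in alive if children[i] == 0]
        for leaf in leaves:                      # each leaf folds its value
            value[parent[leaf]] += value[leaf]   # into its parent and is removed
            alive.remove(leaf)
    (root,) = alive
    return value[root]

# Example: a 5-node tree rooted at node 0.
parent = [0, 0, 0, 1, 1]
print(contract_sum(parent, [1, 2, 3, 4, 5]))  # 15
```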

Efficient parallel algorithms for tree accumulations

Science of Computer Programming, 1994

Accumulations are higher-order operations on structured objects; they leave the shape of an object unchanged, but replace elements of that object with accumulated information about other elements. Upwards and downwards accumulations on trees are two such operations; they form the basis of many tree algorithms. We present two EREW PRAM algorithms for computing accumulations on trees taking O(log n) time on O(n / log n) processors, which is optimal.
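For reference, here are plain sequential versions of the two accumulations on binary trees; the paper's contribution is computing them on an EREW PRAM in O(log n) time with O(n / log n) processors, which this sketch does not attempt. The Node class and function names are assumptions for illustration.

```python
# Sequential reference versions of upwards and downwards accumulation.

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def upwards_acc(node, op, unit):
    """Annotate each node with op folded over its whole subtree."""
    if node is None:
        return unit
    below = op(upwards_acc(node.left, op, unit),
               upwards_acc(node.right, op, unit))
    node.acc_up = op(node.value, below)
    return node.acc_up

def downwards_acc(node, op, acc):
    """Annotate each node with op folded along its root-to-node path."""
    if node is None:
        return
    node.acc_down = op(acc, node.value)
    downwards_acc(node.left, op, node.acc_down)
    downwards_acc(node.right, op, node.acc_down)

t = Node(1, Node(2, Node(4)), Node(3))
upwards_acc(t, lambda a, b: a + b, 0)     # subtree sums: t.acc_up == 10
downwards_acc(t, lambda a, b: a + b, 0)   # path sums: t.left.left.acc_down == 7
print(t.acc_up, t.left.acc_up, t.left.left.acc_down)  # 10 6 7
```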

Processing M-trees with parallel resources

Proceedings Eighth International Workshop on Research Issues in Data Engineering. Continuous-Media Databases and Applications, 1998

The problem of the design and implementation of parallel metric tree indexes, called M-trees, is elaborated. Four different object declustering techniques are proposed and tested in order to obtain sufficient evidence for specifying the pros and cons of their application. In general, the obtained I/O speedup and scaleup levels are high. A way to deal with CPU parallelism is also proposed, and its speedup and scaleup are experimentally tested.

Parallel skeletons for manipulating general trees

Parallel Computing, 2006

Trees are important datatypes that are often used in representing structured data such as XML. Though trees are widely used in sequential programming, it is hard to write efficient parallel programs manipulating trees because of their irregular and ill-balanced structures. In this paper, we propose a solution based on the skeletal approach. We formalize a set of skeletons (abstracted computational patterns) for rose trees (general trees of arbitrary shapes) based on the theory of Constructive Algorithmics. Our skeletons for rose trees are extensions of those proposed for lists and binary trees. We show that we can implement the skeletons efficiently in parallel by combining the parallel binary-tree skeletons for which efficient parallel implementations are already known. As far as we are aware, we are the first to formalize and implement a set of simple but expressive parallel skeletons for rose trees.
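A sequential sketch of what two such rose-tree skeletons compute is given below (a shape-preserving map and a reduction); the RoseTree class and skeleton names are assumptions, not the paper's interface. A parallel implementation along the paper's lines would encode the rose tree as a binary tree (for example, left-child/right-sibling) and reuse binary-tree skeletons.

```python
# Illustrative sequential semantics of two rose-tree skeletons: map and reduce.

class RoseTree:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

def tree_map(f, t):
    """Apply f to every node, keeping the tree shape."""
    return RoseTree(f(t.value), [tree_map(f, c) for c in t.children])

def tree_reduce(t, node_op, merge_op, unit):
    """Combine children's results with merge_op, then fold in the node."""
    below = unit
    for c in t.children:
        below = merge_op(below, tree_reduce(c, node_op, merge_op, unit))
    return node_op(t.value, below)

t = RoseTree(1, [RoseTree(2), RoseTree(3, [RoseTree(4), RoseTree(5)])])
doubled = tree_map(lambda x: 2 * x, t)
print(tree_reduce(doubled, lambda v, b: v + b, lambda a, b: a + b, 0))  # 30
```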

Global Trees: A framework for linked data structures on distributed memory parallel systems

2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2008

This paper describes the Global Trees (GT) system that provides a multi-layered interface to a global address space view of distributed tree data structures, while providing scalable performance on distributed memory systems. The Global Trees system utilizes coarse-grained data movement to enhance locality and communication efficiency. We describe the design and implementation of GT, illustrate its use in the context of a gravitational simulation application, and provide experimental results that demonstrate the effectiveness of the approach. The key benefits of using this system include efficient shared-memory style programming of distributed trees, tree-specific optimizations for data access and computation, and the ability to customize many aspects of GT to optimize application performance.

Parallelization with Tree Skeletons

Lecture Notes in Computer Science, 2003

Trees are useful data structures, but designing efficient parallel programs over trees is known to be more difficult than doing so over lists. Although several important tree skeletons have been proposed to simplify parallel programming on trees, few studies have been reported on how to use them systematically in solving practical problems; it is neither clear how to make a good combination of skeletons to solve a given problem, nor obvious how to find suitable operators to use in a single skeleton. In this paper, we report our first attempt to resolve these problems, proposing two important transformations: the tree diffusion transformation and the tree context preservation transformation. The tree diffusion transformation allows one to use familiar recursive definitions to develop parallel programs, while the tree context preservation transformation shows how to derive the associative operators that are required when using tree skeletons. We illustrate our approach by deriving an efficient parallel program for solving a nontrivial problem called the party planning problem, the tree version of the famous maximum-weight-sum problem.
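As a point of reference for the kind of recursive definition such transformations start from, here is a sequential solution to the party planning problem (choose a maximum-weight set of nodes such that no chosen node is a child of another chosen node). The code is an illustrative assumption, not the derived skeleton program.

```python
# Sequential recursive definition of the party planning problem
# (maximum-weight independent set on a tree).

class Node:
    def __init__(self, weight, children=()):
        self.weight = weight
        self.children = list(children)

def party(node):
    """Return (best including this node, best excluding this node)."""
    include, exclude = node.weight, 0
    for c in node.children:
        c_in, c_ex = party(c)
        include += c_ex              # if we invite the node, its children stay home
        exclude += max(c_in, c_ex)   # otherwise each child chooses freely
    return include, exclude

boss = Node(10, [Node(6, [Node(5)]), Node(4)])
print(max(party(boss)))  # 15: invite the boss (10) and the grandchild (5)
```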

An Enhanced Distributed System to Improve the Time Complexity of Binary Indexed Trees

Zenodo (CERN European Organization for Nuclear Research), 2009

Distributed computing systems are usually considered the most suitable model for practical solutions of many parallel algorithms. In this paper an enhanced distributed system is presented to improve the time complexity of Binary Indexed Trees (BIT). The proposed system uses multiple uniform processors with identical architectures and a specially designed distributed memory system. The analysis of this system has shown that it reduces the time complexity of the read query to O(log log N) and the update query to constant complexity, while the naive solution has a time complexity of O(log N) for both queries. The system was implemented and simulated using the VHDL and Verilog hardware description languages, with Xilinx ISE 10.1 as the development environment and ModelSim 6.1c as the simulation tool. The simulation has shown that the overhead resulting from the wiring and communication between the system fragments can be largely neglected, which makes it practical to reach the maximum speedup offered by the proposed model.
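For context, the sketch below is the standard sequential Binary Indexed Tree (Fenwick tree) with O(log N) update and prefix-sum query, i.e., the baseline the proposed system is compared against; it is not the paper's distributed hardware design.

```python
# Standard sequential Binary Indexed Tree (Fenwick tree): the O(log N) baseline.

class BIT:
    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)          # 1-based internal array

    def update(self, i, delta):
        """Add delta to element i (1-based). O(log N)."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)                  # jump to the next responsible node

    def query(self, i):
        """Prefix sum of elements 1..i. O(log N)."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)                  # strip the lowest set bit
        return s

bit = BIT(8)
for idx, v in enumerate([3, 1, 4, 1, 5, 9, 2, 6], start=1):
    bit.update(idx, v)
print(bit.query(4), bit.query(8))  # 9 31
```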

Toward a universal mapping algorithm for accessing trees in parallel memory systems

1998

We study the problem of mapping the N nodes of a complete t-ary tree onto M memory modules so that they can be accessed in parallel by templates, i.e., distinct sets of nodes. Typical templates for accessing trees are subtrees, root-to-leaf paths, or levels, which will be referred to as elementary templates. In this paper, we first propose a new mapping algorithm for accessing both paths and subtrees of size M with an optimal number of conflicts (i.e., only one conflict) when the number of memory modules is limited to M. We also propose another mapping algorithm for a composite template V (V as in versatile), whose size is not fixed and whose instances are composed of any combination of c instances of elementary templates. The number of conflicts for accessing an S-node instance of template V is O((S/√M) log M + c), and the memory load is 1 + o(1), where load is defined as the ratio between the maximum and minimum number of data items mapped onto each memory module.
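To make the cost model concrete, the sketch below maps the nodes of a complete binary tree to M modules with a naive modular rule and counts conflicts for a root-to-leaf path template (conflicts here meaning accesses that serialize on the same module). This is purely illustrative of the measures in the abstract; the paper's mapping algorithms are designed precisely to do much better than this naive rule.

```python
# Illustrative only: naive modular mapping of tree nodes to M memory modules,
# plus a conflict counter for an access template.

from collections import Counter

def naive_module(node_index, M):
    """Map node i (level-order numbering, root = 1) to a memory module."""
    return node_index % M

def conflicts(template_nodes, M, mapping=naive_module):
    """Accesses to the same module serialize: conflicts = max bucket size - 1."""
    buckets = Counter(mapping(i, M) for i in template_nodes)
    return max(buckets.values()) - 1

def root_to_leaf_path(leaf_index):
    """All ancestors of a leaf in a complete binary tree (root = 1)."""
    path, i = [], leaf_index
    while i >= 1:
        path.append(i)
        i //= 2
    return path

M = 4
path = root_to_leaf_path(21)
print(path, conflicts(path, M))   # [21, 10, 5, 2, 1] 2
```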

Parallel tree building on a range of shared address space multiprocessors: algorithms and application performance

2002

Irregular, particle-based applications that use trees, for example hierarchical N-body applications, are important consumers of multiprocessor cycles, and are argued to benefit greatly in programming ease from a coherent shared address space programming model. As more and more supercomputing platforms that can support different programming models become available to users, from tightly-coupled hardware-coherent machines to clusters of workstations or SMPs, it is important, if the shared address space model is to truly deliver on its ease-of-programming advantages to application users, that it not only perform and scale well in the tightly-coupled case but also port well in performance across the range of platforms (as the message passing model can). For tree-based N-body applications, this is currently not true: while the actual computation of interactions ports well, the parallel tree building phase can become a severe bottleneck on coherent shared address space platforms, in particular on platforms with less aggressive, commodity-oriented communication architectures (even though it takes less than 3 percent of the time in most sequential executions). We therefore investigate the performance of five parallel tree building methods in the context of a complete galaxy simulation on four very different platforms that support this programming model: an SGI Origin2000 (an aggressive hardware cache-coherent machine with physically distributed memory), an SGI Challenge bus-based shared memory multiprocessor, an Intel Paragon running a shared virtual memory protocol in software at page granularity, and a Wisconsin Typhoon-zero in which the granularity of coherence can be varied using hardware support but the protocol runs in software (in the last case using both a page-based and a fine-grained protocol). We find that the algorithms used successfully and distributed widely so far for the first two platforms cause overall application performance to be very poor on the latter two commodity-oriented platforms. An alternative algorithm briefly considered earlier for hardware-coherent systems but then ignored in that context helps to some extent but not enough; nor does an algorithm that incrementally updates the tree every time step rather than rebuilding it. The best algorithm by far is a new one we propose that uses a separate spatial partitioning of the domain for the tree building phase (different from the partitioning used in the major force calculation and other phases) and eliminates locking, at a significant cost in locality and load balance. By changing the tree building algorithm, we achieve improvements in overall application performance of more than factors of 4-40 on commodity-based systems, even on only 16 processors. This allows commodity shared memory platforms to perform well for hierarchical N-body applications for the first time, and more importantly achieves performance portability, since it also performs very well on hardware-coherent systems.
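For readers unfamiliar with the phase in question, the sketch below builds a small quadtree over 2-D particles sequentially, which is the kind of structure the tree building phase produces before force calculation; it does not attempt any of the five parallel methods or the new spatially partitioned algorithm evaluated in the paper.

```python
# Minimal sequential quadtree build over 2-D particles (illustrative of the
# "tree building" phase only; no parallelism or locking strategy shown).

class QuadNode:
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half    # square cell: center, half-width
        self.particle = None                          # leaf payload
        self.children = None                          # four children once split

    def insert(self, p):
        if self.children is None:
            if self.particle is None:                 # empty leaf: store here
                self.particle = p
                return
            old, self.particle = self.particle, None  # occupied leaf: split
            self._split()
            self._child_for(old).insert(old)
        self._child_for(p).insert(p)

    def _split(self):
        h = self.half / 2
        self.children = [QuadNode(self.cx + dx * h, self.cy + dy * h, h)
                         for dx in (-1, 1) for dy in (-1, 1)]

    def _child_for(self, p):
        x, y = p
        i = (0 if x < self.cx else 2) + (0 if y < self.cy else 1)
        return self.children[i]

root = QuadNode(0.5, 0.5, 0.5)                        # unit square
for p in [(0.1, 0.2), (0.8, 0.7), (0.15, 0.22), (0.6, 0.1)]:
    root.insert(p)
# root now holds a small quadtree ready for a force-calculation phase.
```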