A parallel tree code
Related papers
GOTPM: a parallel hybrid particle-mesh treecode
New Astronomy, 2004
We describe a parallel, cosmological N-body code based on a hybrid scheme using the particle-mesh (PM) and Barnes-Hut (BH) oct-tree algorithms. We call the algorithm GOTPM, for Grid-of-Oct-Trees-Particle-Mesh. The code is parallelized using the Message Passing Interface (MPI) library and is optimized to run on Beowulf clusters as well as symmetric multiprocessors. The gravitational potential is determined on a mesh using a standard PM method, with particle forces determined through interpolation. The softened PM force is corrected for short-range interactions using a grid of localized BH trees throughout the entire simulation volume, in a completely analogous way to P³M methods. This method makes no assumptions about the local density for the short-range force corrections and so is consistent with the results of the P³M method in the limit that the tree-code opening-angle parameter θ → 0. The PM method is parallelized using one-dimensional slice domain decomposition. Particles are distributed in slices of equal width to allow mass assignment onto mesh points. The Fourier transforms in the PM method are done in parallel using the MPI implementation of the FFTW package. Parallelization of the tree force corrections is again achieved using one-dimensional slices, but the width of each slice is allowed to vary according to the amount of computational work required by the particles within each slice, to achieve load balance. The tree force corrections dominate the computational load, so imbalances in the PM density-assignment step do not significantly impact the overall load balance and performance. The code performance scales well to 128 processors and is significantly better than that of competing methods. We present preliminary results from simulations run on different platforms containing up to N = 1G particles to verify the code.
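The variable-width slice decomposition used for the tree corrections can be sketched compactly: choose slice boundaries so that the cumulative per-particle work is split evenly. The Python fragment below is a minimal illustration under our own assumptions (the function name and the per-particle cost array `work` are hypothetical, not GOTPM code):

```python
import numpy as np

def balanced_slice_edges(x, work, n_slices):
    """Pick slice boundaries along x so each slice carries ~equal work.

    `work` is a per-particle cost estimate, e.g. the interaction count
    measured in the previous step (an assumption for illustration).
    """
    order = np.argsort(x)
    cum = np.cumsum(work[order])
    targets = np.linspace(0.0, cum[-1], n_slices + 1)[1:-1]
    idx = np.searchsorted(cum, targets)
    inner = x[order][idx]
    return np.concatenate(([x.min()], inner, [x.max()]))

# Example: a clustered region automatically receives narrower slices.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.2, 0.02, 5000), rng.uniform(0, 1, 1000)])
print(balanced_slice_edges(x, np.ones_like(x), 4))
```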
Computer Physics Communications, 1997
N-body algorithms for long-range unscreened interactions like gravity belong to a class of highly irregular problems whose optimal solution is a challenging task for present-day massively parallel computers. In this paper we describe a strategy for optimal memory and work distribution which we have applied to our parallel implementation of the Barnes & Hut (1986) recursive tree scheme on a Cray T3D using the CRAFT programming environment. We have performed a series of tests to find an optimal data distribution in the T3D memory, and to identify a dynamic load balancing strategy that yields good performance when running large simulations (more than 10 million particles). The results of these tests show that the step duration depends on two main factors: data locality and T3D network contention. By increasing data locality we are able to minimize the step duration: the closest bodies (which interact directly) should be located in the same PE's local memory (contiguous block subdivision, high granularity), whereas the tree properties should have a fine-grain distribution. In very large simulations, an unbalanced load arises due to network contention. To remedy this we have devised an automatic work redistribution mechanism which provides good dynamic load balance at the price of insignificant overhead.
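The contrast the authors draw between contiguous block subdivision (for bodies) and fine-grain distribution (for tree properties) amounts to two different index-to-PE mappings. A toy sketch of the two mappings, not the CRAFT directives themselves:

```python
import numpy as np

def block_distribution(n_items, n_pes):
    """Contiguous blocks: item i -> PE floor(i * n_pes / n_items).
    Keeps nearby (directly interacting) bodies in the same PE's memory."""
    return (np.arange(n_items) * n_pes) // n_items

def cyclic_distribution(n_items, n_pes):
    """Fine grain: item i -> PE (i mod n_pes).
    Spreads tree properties across PEs to reduce access contention."""
    return np.arange(n_items) % n_pes

print(block_distribution(10, 4))   # [0 0 0 1 1 2 2 2 3 3]
print(cyclic_distribution(10, 4))  # [0 1 2 3 0 1 2 3 0 1]
```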
A Modified Parallel Tree Code for N-Body Simulation of the Large-Scale Structure of the Universe
Journal of Computational Physics, 2000
N-body codes for performing simulations of the origin and evolution of the large-scale structure of the universe have improved significantly over the past decade, in terms of both the resolution achieved and the reduction of CPU time. However, state-of-the-art N-body codes hardly allow one to deal with particle numbers larger than a few ×10⁷, even on the largest parallel systems. In order to allow simulations at larger resolution, we have first reconsidered the grouping strategy described in J. Barnes (1990, J. Comput. Phys. 87, 161) (hereafter B90) and applied it, with some modifications, to our WDSH-PT (Work and Data SHaring-Parallel Tree) code (U. Becciani et al., 1996, Comput. Phys. Comm. 99, 1). In the first part of this paper we give a short description of the code, which adopts the algorithm of J. E. Barnes and P. Hut (1986, Nature 324, 446), and in particular of the memory and work distribution strategy used to describe the data distribution on a CC-NUMA machine like the CRAY-T3E system. In very large simulations (typically N ≥ 10⁷), an uneven load easily arises due to network contention and the formation of clusters of galaxies. To remedy this, we have devised an automatic work redistribution mechanism which provides a good dynamic load balance without adding significant overhead. In the second part of the paper we describe the modification to the Barnes grouping strategy we have devised to improve the performance of the WDSH-PT code. We use the property that nearby particles have similar interaction lists. This idea was checked in B90, where an interaction list is built which applies everywhere within a cell C_group containing a small number of particles N_crit, and is reused for each particle p ∈ C_group in turn. We assume each particle p to have the same interaction list. We consider that the force F_p on a particle p can be decomposed into two terms, F_p = F_far + F_near. The first term, F_far, is the same for each particle in the cell and is generated by the interaction between a hypothetical particle placed at the center of mass of C_group and the farther cells contained in the interaction list. F_near is different for each particle p and is generated by the interaction between p and the elements near C_group. Thus it has been possible to reduce the CPU time.
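The F_p = F_far + F_near decomposition can be made concrete in a few lines. The sketch below uses hypothetical names and a softened direct-sum kernel, with the far cells stood in for by point masses; it is not the WDSH-PT implementation. F_far is evaluated once at the group's center of mass, F_near per particle:

```python
import numpy as np

G = 1.0  # gravitational constant in code units (assumption)

def direct_accel(targets, sources, masses, eps=1e-3):
    """Softened direct-sum acceleration of each target due to all sources."""
    d = sources[None, :, :] - targets[:, None, :]          # (n, m, 3)
    r2 = (d ** 2).sum(-1) + eps ** 2                       # (n, m)
    return G * (masses[None, :, None] * d / r2[..., None] ** 1.5).sum(1)

def group_forces(group_pos, group_mass, far_pos, far_mass, near_pos, near_mass):
    """One far-field evaluation per group, one near-field sum per particle."""
    com = np.average(group_pos, axis=0, weights=group_mass)
    f_far = direct_accel(com[None, :], far_pos, far_mass)  # shared by the group
    f_near = direct_accel(group_pos, near_pos, near_mass)  # per particle
    return f_far + f_near                                  # broadcasts over the group

# Example: 3 grouped particles, 2 distant cells, 1 nearby particle.
rng = np.random.default_rng(0)
print(group_forces(rng.random((3, 3)), np.ones(3),
                   rng.random((2, 3)) + 5.0, np.ones(2),
                   rng.random((1, 3)), np.ones(1)))
```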
A work- and data-sharing parallel tree N-body code
Computer Physics Communications, 1996
We describe a new parallel N-body code for astrophysical simulations of systems of point masses interacting via the gravitational interaction. The code is based on a work- and data-sharing scheme, and is implemented within Cray Research Corporation's CRAFT programming environment. Different data distribution schemes have been adopted for the body and tree structures. Tests performed for two different types of initial distributions show that the performance scales almost ideally as a function of the size of the system and of the number of processors. We discuss the factors affecting the absolute speedup and how it can be increased with a better tree data distribution scheme.
Fast Parallel Tree Codes for Gravitational and Fluid Dynamical N-Body Problems
International Journal of High Performance Computing Applications, 1994
We discuss two physical systems from separate disciplines that make use of the same algorithmic and mathematical structures to reduce the number of operations necessary to complete a realistic simulation. In the gravitational N-body problem, the acceleration of an object is given by the familiar Newtonian laws of motion and gravitation. The computational load is reduced by treating groups of bodies as single multipole sources rather than as individual bodies. In the simulation of incompressible flows, the flow may be modeled by the dynamics of a set of N interacting vortices. Vortices are vector objects in three dimensions, but their interactions are mathematically similar to those of gravitating masses. The multipole approximation can be used to greatly reduce the time needed to compute the interactions between vortices. Both types of simulations were carried out on the Intel Touchstone Delta, a parallel MIMD computer with 512 processors. Timings are reported for systems of up to 10 million bodies and demonstrate that the implementation scales well on massively parallel systems. The majority of the code is common between the two applications, which differ only in certain "physics" modules. In particular, the code for parallel tree construction and traversal is shared.
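The shared tree traversal with swappable "physics" modules can be sketched as a Barnes-Hut walk parameterized by an interaction kernel. The dict-based node layout and kernel signature below are our own illustrative assumptions, not the paper's code:

```python
import numpy as np

def gravity_kernel(dr, strength, eps=1e-3):
    """Newtonian acceleration from a monopole source of total mass `strength`."""
    r2 = (dr ** 2).sum() + eps ** 2
    return strength * dr / r2 ** 1.5

def traverse(node, target, kernel, theta=0.5):
    """Barnes-Hut walk: accept a cell as a single multipole source when
    size/distance < theta, otherwise open it and recurse. Only `kernel`
    would differ between the gravitational and vortex applications."""
    dr = node["com"] - target
    dist = np.linalg.norm(dr)
    if node["leaf"] or (dist > 0 and node["size"] / dist < theta):
        return kernel(dr, node["strength"])
    return sum(traverse(c, target, kernel, theta) for c in node["children"])

# Example: a root cell with two leaves, evaluated at a distant point.
leaf = lambda p, m: {"leaf": True, "com": np.array(p), "strength": m,
                     "size": 0.0, "children": []}
root = {"leaf": False, "com": np.array([0.5, 0.5, 0.5]), "strength": 2.0,
        "size": 1.0, "children": [leaf([0.1] * 3, 1.0), leaf([0.9] * 3, 1.0)]}
print(traverse(root, np.array([5.0, 5.0, 5.0]), gravity_kernel))
```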
Cornerstone: Octree Construction Algorithms for Scalable Particle Simulations
Proceedings of the Platform for Advanced Scientific Computing Conference
This paper presents an octree construction method, called Cornerstone, that facilitates global domain decomposition and interactions between particles in mesh-free numerical simulations. Our method is based on algorithms developed for 3D computer graphics, which we extend to distributed high performance computing (HPC) systems. Cornerstone yields global and locally essential octrees and is able to operate on all levels of tree hierarchies in parallel. The resulting octrees are suitable for supporting the computation of various kinds of short and long range interactions in N-body methods, such as Barnes-Hut and the Fast Multipole Method (FMM). While we provide a CPU implementation, Cornerstone may run entirely on GPUs. This results in significantly faster tree construction compared to execution on CPUs and serves as a powerful building block for the design of simulation codes that move beyond an offloading approach, where only numerically intensive tasks are dispatched to GPUs. With data residing exclusively in GPU memory, Cornerstone eliminates data movements between CPUs and GPUs. As an example, we employ Cornerstone to generate locally essential octrees for a Barnes-Hut treecode running on almost the full LUMI-G system with up to 8 trillion particles.
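A typical graphics-derived building block for such octrees is the Morton (Z-order) key: interleaving the bits of the integer particle coordinates so that a plain sort groups particles by octree leaf. The sketch below shows the idea only; it is not the Cornerstone implementation:

```python
import numpy as np

def morton_key_3d(ix, iy, iz, bits=10):
    """Interleave the bits of three integer coordinates into one key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b + 2)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b)
    return key

def morton_keys(pos, bits=10):
    """Map positions in [0,1)^3 onto a 2^bits grid and compute their keys."""
    grid = (np.clip(pos, 0.0, 1.0 - 1e-9) * (1 << bits)).astype(np.int64)
    return np.array([morton_key_3d(x, y, z, bits) for x, y, z in grid])

rng = np.random.default_rng(1)
pos = rng.random((8, 3))
order = np.argsort(morton_keys(pos))  # spatially nearby particles end up adjacent
print(pos[order])
```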
Are you ready to FLY in the universe? A multi-platform N-body tree code for parallel supercomputers
Computer Physics Communications, 2001
In the last few years, cosmological simulations of structure and galaxy formation have assumed a fundamental role in the study of the origin, formation and evolution of the universe. These studies have improved enormously with the use of supercomputers and parallel systems, allowing more accurate simulations than traditional serial systems. The code we describe, called FLY, is a newly written tree N-body code for the evolution of three-dimensional self-gravitating collisionless systems.
A data-parallel implementation of O(N) hierarchical N-body methods
Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing '96, 1996
The O(N) hierarchical N-body algorithms and Massively Parallel Processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We present a data-parallel implementation of Anderson's method and demonstrate both the efficiency and the scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10-25%, and the overall efficiency is about 35%. The evaluation of the potential field of a system of 100 million particles takes 3 minutes and 15 minutes on a 256-node CM-5E, giving an expected four and seven digits of accuracy, respectively. The speed of the code scales linearly with the number of processors and the number of particles.
A Performance Comparison of Tree Data Structures for N-Body Simulation
Journal of Computational Physics, 2002
We present a performance comparison of tree data structures for N-body simulation. The tree data structures examined are the balanced binary tree and the Barnes-Hut (BH) tree. Previous work has compared the performance of BH trees with that of nearest-neighbor trees and the fast multipole method, but the relative merits of BH and binary trees have not been compared systematically. In carrying out this work, a very general computational tool which permits controlled comparison of different tree algorithms was developed. The test problems of interest involve both long-range physics (e.g., gravity) and short-range physics (e.g., smoothed particle hydrodynamics). Our findings show that the Barnes-Hut tree outperforms the binary tree in both cases. However, we present a modified binary tree which is competitive with the Barnes-Hut tree for long-range physics and superior for short-range physics. Thus, if the local search time is a significant portion of the computational effort, a binary tree could offer performance advantages. This result is of particular interest since short-range searches are common in many areas of computational physics, as well as areas outside the scope of N-body simulation such as computational geometry. The possible reasons for this are outlined and suggestions for future algorithm evaluations are given.
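The short-range searches discussed above can be sketched as a pruned tree walk that skips any cell whose bounding box lies outside the search sphere. The dict-based node layout here is a hypothetical stand-in for the paper's tree structures:

```python
import numpy as np

def range_search(node, center, h, out):
    """Append the ids of all particles within radius h of `center`."""
    # Closest point of the cell's bounding box to the search center.
    nearest = np.maximum(node["lo"], np.minimum(center, node["hi"]))
    if ((nearest - center) ** 2).sum() > h * h:
        return  # the cell cannot intersect the search sphere: prune it
    if node["leaf"]:
        for i, p in zip(node["ids"], node["points"]):
            if ((p - center) ** 2).sum() <= h * h:
                out.append(i)
    else:
        for child in node["children"]:
            range_search(child, center, h, out)

# Example: a single-leaf tree over the unit cube.
pts = np.array([[0.1, 0.1, 0.1], [0.4, 0.4, 0.4], [0.9, 0.9, 0.9]])
root = {"leaf": True, "lo": np.zeros(3), "hi": np.ones(3),
        "ids": [0, 1, 2], "points": pts}
found = []
range_search(root, np.array([0.15, 0.15, 0.15]), 0.2, found)
print(found)  # -> [0]
```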