Enabling rapid development of parallel tree search applications

A Framework for Analyzing Massive Astrophysical Datasets on a Distributed Grid

Virtual observatories will give astronomers easy access to an unprecedented amount of data. Extracting scientific knowledge from these data will increasingly demand both efficient algorithms and the power of parallel computers. Such machines will range in size from small Beowulf clusters to large, massively parallel platforms (MPPs) to collections of MPPs distributed across a Grid, such as the NSF TeraGrid facility. Nearly all efficient analyses of large astronomical datasets use trees as their fundamental data structure. Writing efficient tree-based techniques, a task that is time-consuming even on single-processor computers, is exceedingly cumbersome on parallel or grid-distributed resources. We have developed a framework, Ntropy, that provides a flexible, extensible, and easy-to-use way of developing tree-based data analysis algorithms for both serial and parallel platforms. Our experience has shown that not only does our framework save development time, it also delivers a...

Enabling Knowledge Discovery in a Virtual Universe

2007

Virtual observatories will give astronomers easy access to an unprecedented amount of data. Extracting scientific knowledge from these data will increasingly demand both efficient algorithms and the power of parallel computers. Such machines will range in size from small Beowulf clusters to large, massively parallel platforms (MPPs) to collections of MPPs distributed across a Grid, such as the NSF TeraGrid facility. Nearly all efficient analyses of large astronomical datasets use trees as their fundamental data structure. Writing efficient tree-based techniques, a task that is time-consuming even on single-processor computers, is exceedingly cumbersome on parallel or grid-distributed resources. We have developed a library, Ntropy, that provides a flexible, extensible, and easy-to-use way of developing tree-based data analysis algorithms for both serial and parallel platforms. Our experience has shown that not only does our library save development time, it also delivers an increase in serial performance. Furthermore, Ntropy makes it easy for an astronomer with little or no parallel programming experience to quickly scale their application to a distributed multiprocessor environment. By minimizing development time for efficient and scalable data analysis, we enable wide-scale knowledge discovery on massive datasets.
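To make the role of trees in such analyses concrete, the following is a minimal serial sketch of a kd-tree that counts the points within a given radius of a query position, the kind of neighbor count that underlies many correlation-style statistics. It is an illustration only and is not Ntropy's API; the names Node, build, and count_within are hypothetical.

```python
import math


class Node:
    """One kd-tree node: a splitting point plus left and right subtrees."""

    def __init__(self, point, axis, left, right):
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right


def build(points, depth=0):
    """Build a kd-tree by splitting on the median along alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(
        ordered[mid],
        axis,
        build(ordered[:mid], depth + 1),
        build(ordered[mid + 1:], depth + 1),
    )


def count_within(node, center, radius):
    """Count stored points within `radius` of `center`.

    A subtree is skipped when the search sphere cannot reach its side
    of the splitting plane, which is what makes the tree faster than
    checking every point.
    """
    if node is None:
        return 0
    count = 1 if math.dist(node.point, center) <= radius else 0
    diff = center[node.axis] - node.point[node.axis]
    if diff - radius <= 0:  # sphere reaches the left (lower) side
        count += count_within(node.left, center, radius)
    if diff + radius >= 0:  # sphere reaches the right (upper) side
        count += count_within(node.right, center, radius)
    return count


if __name__ == "__main__":
    sample = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (2, 2)]
    tree = build(sample)
    print(count_within(tree, (0, 0), 1.0))
```

A parallel version would additionally have to distribute the tree and the queries across processors, which is the part the abstract describes as cumbersome to write by hand.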

Parallel astronomical data processing with Python: Recipes for multicore machines

Astronomy and Computing, 2013

High performance computing has been used in various fields of astrophysical research, but most of it is implemented on massively parallel systems (supercomputers) or graphical processing unit clusters. With the advent of multicore processors in the last decade, many serial software codes have been re-implemented in parallel mode to utilize the full potential of these processors. In this paper, we propose parallel processing recipes for multicore machines for astronomical data processing. The target audience is astronomers who use Python as their preferred scripting language and who may be using PyRAF/IRAF for data processing. Three problems of varied complexity were benchmarked on three different types of multicore processors to demonstrate the benefits, in terms of execution time, of parallelizing data processing tasks. The native multiprocessing module available in Python makes it a relatively trivial task to implement the parallel code. We have also compared three multiprocessing approaches: Pool/Map, Process/Queue, and Parallel Python. Our test codes are freely available and can be downloaded from our website.
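The Pool/Map approach mentioned above can be sketched with Python's standard multiprocessing module. This is not the authors' benchmark code; the functions subtract_background and process_frames and the toy "frames" (plain lists of pixel values) are assumptions made for illustration.

```python
from multiprocessing import Pool


def subtract_background(frame):
    """Subtract the median pixel value from one frame (a list of numbers).

    For even-length frames this uses the upper of the two middle values.
    """
    ordered = sorted(frame)
    median = ordered[len(ordered) // 2]
    return [value - median for value in frame]


def process_frames(frames, workers=2):
    """Apply subtract_background to every frame.

    With workers > 1 the frames are spread over a process pool;
    with workers <= 1 the same work runs serially for comparison.
    """
    if workers <= 1:
        return [subtract_background(frame) for frame in frames]
    with Pool(processes=workers) as pool:
        return pool.map(subtract_background, frames)


if __name__ == "__main__":
    frames = [[1, 2, 3], [10, 10, 40], [5, 7, 9]]
    print(process_frames(frames, workers=2))
```

Pool.map returns results in input order, so the serial and parallel paths are interchangeable. The worker function is defined at module level so that it can be pickled and sent to the worker processes.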

OpenCluster: A Flexible Distributed Computing Framework for Astronomical Data Processing

Publications of the Astronomical Society of the Pacific, 2016

The volume of data generated by modern astronomical telescopes is extremely large and rapidly growing. However, current high-performance data processing architectures/frameworks are not well suited for astronomers because of their limitations and programming difficulties. In this paper, we therefore present OpenCluster, an open-source distributed computing framework to support rapidly developing high-performance processing pipelines of astronomical big data. We first detail the OpenCluster design principles and implementations and present the APIs facilitated by the framework. We then demonstrate a case in which OpenCluster is used to resolve complex data processing problems for developing a pipeline for the Mingantu Ultrawide Spectral Radioheliograph. Finally, we present our OpenCluster performance evaluation. Overall, OpenCluster provides not only high fault tolerance and simple programming interfaces, but also a flexible means of scaling up the number of interacting entities. OpenCluster thereby provides an easily integrated distributed computing framework for quickly developing a high-performance data processing system of astronomical telescopes and for significantly reducing software development expenses.

Feeding an astrophysical database via distributed computing resources: The case of BaSTI

Astronomy and Computing, 2015

Stellar evolution model databases, spanning a wide range of masses and initial chemical compositions, are nowadays a major tool to study Galactic and extragalactic stellar populations. The Bag of Stellar Tracks and Isochrones (BaSTI) database is a VO-compliant theoretical astrophysical catalogue that collects fundamental data sets involving star formation and evolution. The creation of this database implies a large number of stellar evolutionary computations that are extremely demanding in terms of computing power. Here we discuss the efforts devoted to creating and updating the database using Distributed Computing Infrastructures and a Science Gateway, and its future developments within the framework of the Italian Virtual Observatory project.

Are you ready to FLY in the universe? A multi-platform N-body tree code for parallel supercomputers

Computer Physics Communications, 2001

In the last few years, cosmological simulations of structure and galaxy formation have assumed a fundamental role in the study of the origin, formation and evolution of the universe. These studies improved enormously with the use of supercomputers and parallel systems, which allow more accurate simulations than traditional serial systems. The code we describe, called FLY, is a newly written code (using the tree N-body method) for the evolution of three-dimensional self-gravitating collisionless systems.

A work- and data-sharing parallel tree N-body code

Computer Physics Communications, 1996

We describe a new parallel N-body code for astrophysical simulations of systems of point masses interacting via the gravitational interaction. The code is based on a work- and data-sharing scheme, and is implemented within the Cray Research Corporation's CRAFT programming environment. Different data distribution schemes have been adopted for the bodies' and the tree's structures. Tests performed for two different types of initial distributions show that the performance scales almost ideally as a function of the size of the system and of the number of processors. We discuss the factors affecting the absolute speedup and how it can be increased with a better data distribution scheme for the tree.

Harnessing grid resources to enable the dynamic analysis of large astronomy datasets

Supercomputing Conference, 2006

Grid computing has emerged as an important new field focusing on large-scale resource sharing and high-performance orientation. The astronomy community has an abundance of imaging datasets at its disposal, which are essentially the "crown jewels" of the community. However, these astronomy datasets are generally terabytes in size and contain hundreds of millions of objects separated into millions of files, factors that make many analyses impractical to perform on small computers. The key question we answer in this paper is: "How can we leverage Grid resources to make the analysis of large astronomy datasets a reality for the astronomy community?" Our answer is "AstroPortal," a gateway to grid resources tailored for the astronomy community. To address this question, we have developed a Web Services-based system, AstroPortal, that uses grid computing to federate large computing and storage resources for dynamic analysis of large datasets. Building on the Globus Toolkit 4, we have built an AstroPortal prototype and implemented a first analysis, "stacking," that sums multiple regions of the sky, a function that can help both identify variable sources and detect faint objects. We have deployed AstroPortal on the TeraGrid distributed infrastructure and applied the stacking function to the Sloan Digital Sky Survey (SDSS), DR4, which comprises about 300 million objects dispersed over 1.3 million files, a total of 3 terabytes of compressed data, with promising results. AstroPortal gives the astronomy community a new tool to advance their research and to open new doors to opportunities never before possible on such a large scale. Furthermore, we have identified that data locality in distributed computing applications is important for the efficient use of the underlying resources. We outline a storage hierarchy that could be used to make more efficient use of the available resources, which could potentially offer orders-of-magnitude speedups in the analysis of large datasets.
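The "stacking" operation described above, summing multiple regions of the sky, can be sketched in a few lines. This is a toy, single-machine illustration under simplifying assumptions (one image held in memory as a list of rows, integer pixel positions, square cutouts); it is not AstroPortal's implementation, and the names cutout and stack are hypothetical.

```python
def cutout(image, row, col, half):
    """Return the (2*half + 1)-pixel square region centred on (row, col)."""
    if (row - half < 0 or col - half < 0
            or row + half >= len(image) or col + half >= len(image[0])):
        raise ValueError("cutout extends past the image edge")
    return [r[col - half: col + half + 1]
            for r in image[row - half: row + half + 1]]


def stack(image, centers, half=1):
    """Sum equally sized cutouts around each (row, col) centre, pixel by pixel."""
    size = 2 * half + 1
    total = [[0] * size for _ in range(size)]
    for row, col in centers:
        region = cutout(image, row, col, half)
        for i in range(size):
            for j in range(size):
                total[i][j] += region[i][j]
    return total


if __name__ == "__main__":
    # Toy 5x5 "image" whose pixel value encodes its position.
    image = [[r * 5 + c for c in range(5)] for r in range(5)]
    print(stack(image, [(1, 1), (3, 3)], half=1))
```

Because each cutout is read independently and the sum is associative, the work divides naturally by file or by object, which is why data locality matters for the distributed case the abstract describes.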

AstroPortal: a science gateway for large-scale astronomy data analysis

TeraGrid Conference, 2006

The creation of large digital sky surveys presents the astronomy community with tremendous scientific opportunities. However, these astronomy datasets are generally terabytes in size and contain hundreds of millions of objects separated into millions of files—factors that make many analyses impractical to perform on small computers. To address this problem, we have developed a Web Services-based system, AstroPortal, that uses grid computing to federate large computing and storage resources for dynamic analysis of ...

Gasoline: a flexible, parallel implementation of TreeSPH

New Astronomy, 2004

The key features of the Gasoline code for parallel hydrodynamics with self-gravity are described. Gasoline is an extension of the efficient Pkdgrav parallel N-body code using smoothed particle hydrodynamics. Accuracy measurements, performance analysis and tests of the code are presented. Recent successful Gasoline applications are summarized. These cover a diverse set of areas in astrophysics including galaxy clusters, galaxy formation and gas-giant planets. Future directions for gasdynamical simulations in astrophysics and code development strategies for tackling cutting edge problems are discussed.