X10 as a parallel language for scientific computation: Practice and experience
Related papers
PGAS-FMM: Implementing a distributed fast multipole method using the X10 programming language
Concurrency and Computation: Practice and Experience, 2013
The fast multipole method (FMM) is a complex, multi-stage algorithm over a distributed tree data structure, with multiple levels of parallelism and inherent data locality. X10 is a modern partitioned global address space language with support for asynchronous activities. The parallel tasks comprising FMM may be expressed in X10 using a scalable pattern of activities. This paper demonstrates the use of X10 to implement FMM for simulation of electrostatic interactions between ions in a cyclotron resonance mass spectrometer. X10's task-parallel model is used to express parallelism using a pattern of activities mapping directly onto the tree. X10's work-stealing runtime handles load balancing of fine-grained parallel activities, avoiding the need for explicit work sharing. The use of global references and active messages to create and synchronize parallel activities over a distributed tree structure is also demonstrated. In contrast to previous simulations of ion trajectories in cyclotron resonance mass spectrometers, our code enables both simulation of realistic particle numbers and guaranteed error bounds. Single-node performance is comparable with the fastest published FMM implementations, and critical expansion operators are faster for high-accuracy calculations. A comparison of parallel and sequential codes shows that the overhead of activity management and work stealing in this application is low. Scalability is evaluated on up to 8k cores of a Blue Gene/Q system and up to 512 cores of a Nehalem/InfiniBand cluster.
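The pattern of activities mapped onto the tree that this abstract describes (X10's async/finish idiom) can be sketched, loosely, in Python: each node spawns one task per child and waits for all of them before combining results, as in FMM's upward pass. The `Node` class, tree shape, and reduction below are illustrative stand-ins, not code from the paper.

```python
import threading

class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)
        self.subtotal = 0.0

def upward_pass(node):
    """Combine child results into the parent, as in FMM's upward pass.

    Each child is processed in its own task (an "async" in X10 terms) and
    the parent waits for all of them (the "finish" barrier) before combining.
    """
    threads = [threading.Thread(target=upward_pass, args=(c,))
               for c in node.children]
    for t in threads:
        t.start()
    for t in threads:   # "finish": wait for all spawned child activities
        t.join()
    node.subtotal = node.value + sum(c.subtotal for c in node.children)
    return node.subtotal

# A tiny two-level tree: root with two interior nodes, each holding two leaves.
leaves = [Node(v) for v in (1.0, 2.0, 3.0, 4.0)]
tree = Node(0.0, [Node(10.0, leaves[:2]), Node(20.0, leaves[2:])])
total = upward_pass(tree)   # 0 + 10 + 20 + 1 + 2 + 3 + 4 = 40
```

In X10 the same shape is written with `async` for each child and an enclosing `finish`; the work-stealing runtime then balances the resulting fine-grained tasks.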
The BLAZE language: a parallel language for scientific programming
Microprocessors and Microsystems, 1987
Programming multiprocessor parallel architectures is a complex task. This paper describes a Pascal-like scientific programming language, Blaze, designed to simplify this task. Blaze contains array arithmetic, "forall" loops, and APL-style accumulation operators, which allow natural expression of fine-grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse-grained parallelism using machine-specific program restructuring. Thus Blaze should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow.
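As a rough illustration of the expression style the abstract attributes to Blaze (independent element-wise "forall" iterations plus an APL-style accumulation), here is a minimal Python sketch; the arrays and operations are invented for the example:

```python
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

# "forall i: c[i] = a[i] * b[i]" -- every iteration is independent of the
# others, so a compiler or runtime is free to execute them in parallel.
c = [x * y for x, y in zip(a, b)]

# APL-style "+/" accumulation (a reduction) over the result.
dot = sum(c)   # 10 + 40 + 90 + 160 = 300
```

The point of such operators is that the absence of loop-carried dependences is visible in the notation itself, rather than something a compiler must prove.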
Scalable Execution of Legacy Scientific Codes
Lecture Notes in Computer Science, 2006
This paper presents Weaves, a language-neutral framework for scalable execution of legacy parallel scientific codes. Weaves supports scalable threads of control and multiple namespaces with selective sharing of state within a single address space. We use two examples to illustrate different aspects of the framework and to stress the diversity of its application domains. Collaborating partial differential equation (PDE) solvers, the more expressive example, exemplify developmental aspects, while the freely available Sweep3D is used for performance results. We outline the framework in the context of shared memory systems, where its benefits are apparent. We also contrast Weaves against existing programming paradigms, present use cases, and outline its implementation. Preliminary performance tests show significant scalability over process-based implementations of Sweep3D.
Practical parallelization of scientific applications with OpenMP, OpenACC and MPI
2021
This work aims at distilling a systematic methodology to modernize existing sequential scientific codes with little redesign effort, turning an old codebase into modern, i.e., parallel and robust, code. We propose a semi-automatic methodology to parallelize scientific applications designed with a purely sequential programming mindset, possibly using global variables, aliasing, random number generators, and stateful functions. We demonstrate that the same methodology works for parallelization in the shared memory model (via OpenMP), the message passing model (via MPI), and the General Purpose Computing on GPU model (via OpenACC). The method is demonstrated by parallelizing four real-world sequential codes in the domain of physics and material science. The methodology itself was distilled in collaboration with MSc students of the Parallel Computing course at the University of Torino, who applied it for the first time to the projects they presented for the final exam...
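The core transformation such a methodology targets, removing hidden global state so a sequential kernel becomes safely parallelizable under any of the three models, can be sketched in Python. The function names are invented for the example, and a thread pool stands in for OpenMP/MPI/OpenACC workers:

```python
from concurrent.futures import ThreadPoolExecutor

# Sequential-mindset original: the result flows through a global
# accumulator, so concurrent calls would race on _total.
_total = 0.0
def accumulate(x):
    global _total
    _total += x * x
    return _total

# Step 1: extract a pure kernel with no global or hidden state.
def kernel(x):
    return x * x

# Step 2: map the pure kernel over the data in parallel, reduce once at
# the end. The same pure-kernel shape fits an OpenMP parallel-for, an
# MPI scatter/reduce, or an OpenACC offloaded loop.
def parallel_sum_of_squares(xs):
    with ThreadPoolExecutor(max_workers=4) as pool:
        return sum(pool.map(kernel, xs))
```

This is only a sketch of the refactoring pattern, not the paper's tooling; the paper also addresses harder cases such as aliasing and stateful random number generators.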
ACM/IEEE SC 2005 Conference (SC'05), 2005
The development of complex scientific applications for high-end systems is a challenging task. Addressing the complexity of the involved software and algorithms is becoming increasingly difficult and requires appropriate software engineering approaches to address interoperability, maintenance, and software composition challenges. At the same time, the requirements for performance and scalability to thousand-processor configurations magnify the difficulties facing the scientific programmer, due to the variable levels of parallelism available in different algorithms or functional modules of the application. This paper demonstrates how the Common Component Architecture (CCA) and Global Arrays (GA) can be used in the context of computational chemistry to express and manage multi-level parallelism through the use of processor groups. For example, the numerical Hessian calculation in the NWChem computational chemistry package, using three levels of parallelism, outperformed the original single-level-parallelism version of the code by 90% when running on 256 processors.
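The processor-group idea can be sketched in a few lines: partition a flat pool of workers into groups so that independent coarse-grained tasks run concurrently, each group parallelizing its own work internally. The partitioning below is an illustration of the concept, not the GA processor-group API:

```python
def split_into_groups(ranks, ngroups):
    """Round-robin partition of a flat worker pool into processor groups."""
    return [ranks[g::ngroups] for g in range(ngroups)]

# 16 workers split into 4 groups: each group takes one independent
# coarse task (the first level of parallelism) and divides that task's
# work over its own 4 ranks (the next level).
groups = split_into_groups(list(range(16)), 4)
```

Nesting the same split inside each group yields a third level, the structure the abstract's three-level Hessian calculation exploits.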
Parallelism in computational chemistry
Theor Chem Acc, 1993
An account is given of experience gained in implementing computational chemistry application software, including quantum chemistry and macromolecular refinement codes, on distributed memory parallel processors. In quantum chemistry we consider the coarse-grained implementation of Gaussian integral and derivative integral evaluation, the direct-SCF computation of an uncorrelated wavefunction, the 4-index transformation of two-electron integrals and the direct-CI calculation of correlated wavefunctions. In the refinement of macromolecular conformations, we describe domain decomposition techniques used in implementing general purpose molecular mechanics, molecular dynamics and free energy perturbation calculations. Attention is focused on performance figures obtained on the Intel iPSC/2 and iPSC/860 hypercubes, which are compared with those obtained on a Cray Y-MP/464 and Convex C-220 minisupercomputer. From these data we deduce the cost effectiveness of parallel processors in the field of computational chemistry.
Refactoring a language for parallel computational chemistry
Proceedings of the 2nd Workshop on Refactoring Tools - WRT '08, 2008
We describe a project to provide refactoring support for the SIAL programming language. SIAL is a domain-specific parallel programming language designed to express quantum chemistry computations. It incorporates language support for the loop parallelism and distributed array parallel design patterns. In contrast to refactorings typically undertaken for object-oriented programs, which aim to improve code structure, SIAL refactorings are usually done to improve performance or to allow larger problems to be solved.
High-performance computing in chemistry: NWChem
1996
The impact of high-performance computing in computational chemistry is considered in the light of increasing demands for both the number and complexity of chemical systems amenable to theoretical treatment. Using self-consistent field Density Functional Theory (DFT) as a prototypical application, we describe the development, implementation and performance of the NWChem computational chemistry package, targeting both present and future generations of massively parallel processors (MPP). The emphasis throughout this development is on scalability and the distribution, as opposed to the replication, of key data structures. To facilitate such capabilities, we describe a shared non-uniform memory access model which simplifies parallel programming while at the same time providing for portability across both distributed- and shared-memory machines. The impact of these developments is illustrated through a performance analysis of the DFT module of NWChem on a variety of MPP systems.
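The shared non-uniform memory access model the abstract refers to (realized in NWChem by the Global Arrays library) can be caricatured in a few lines of Python: data is physically distributed in per-process chunks, yet any process can read or write any element one-sidedly, with remote accesses simply costing more. The toy class below illustrates the model only and is not the GA API:

```python
class ToyGlobalArray:
    """Distributed storage with one-sided get/put access to any element."""

    def __init__(self, n, nprocs):
        self.block = n // nprocs    # elements owned by each process
        self.chunks = {p: [0.0] * self.block for p in range(nprocs)}

    def owner(self, i):
        """Which process physically holds element i (the NUMA part:
        access is uniform in the programming model, not in cost)."""
        return i // self.block

    def put(self, i, value):        # one-sided write, local or remote
        self.chunks[self.owner(i)][i % self.block] = value

    def get(self, i):               # one-sided read, local or remote
        return self.chunks[self.owner(i)][i % self.block]

ga = ToyGlobalArray(n=8, nprocs=4)
ga.put(5, 3.5)   # element 5 lives on process 2, but any caller may write it
```

Distributing (rather than replicating) the array in this way is what lets key data structures grow with the machine, the scalability emphasis the abstract describes.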
Scaling Molecular Dynamics to 3000 Processors with Projections: A Performance Analysis Case Study
2003
Some of the most challenging applications to parallelize scalably are the ones that present a relatively small amount of computation per iteration. Multiple interacting performance challenges must be identified and solved to attain high parallel efficiency in such cases. We present a case study involving NAMD, a parallel molecular dynamics application, and efforts to scale it to run on 3000 processors with Tera-FLOPS level performance. NAMD is implemented in Charm++, and the performance analysis was carried out using “projections”, the performance visualization/analysis tool associated with Charm++. We will showcase a series of optimizations facilitated by projections. The resultant performance of NAMD led to a Gordon Bell award at SC2002.
High Performance Computing for Computational Science - VECPAR 2006
Springer eBooks, 2007
Software reusability has proven to be an effective practice to speed up the development of complex high-performance scientific and engineering applications. We promote the reuse of high-quality software and general-purpose libraries through the Advanced CompuTational Software (ACTS) Collection. ACTS tools have continued to provide solutions to many of today's computational problems. In addition, ACTS tools have been successfully ported to a variety of computer platforms, thereby greatly facilitating the porting of applications that rely on ACTS functionalities. In this contribution we discuss a high-level user interface that enables faster code prototyping and user familiarization with ACTS tools. The high-level user interfaces have been built using Python. Here we focus on Python-based interfaces to ScaLAPACK, the PyScaLAPACK component of PyACTS. We briefly introduce their use, functionalities, and benefits. We illustrate a few simple examples of their use, as well as their utilization inside large scientific applications. We also comment on existing Python interfaces to other ACTS tools. We present some comparative performance results of PyACTS-based versus direct LAPACK and ScaLAPACK code implementations.
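The interface-simplification idea behind such Python wrappers can be sketched with a toy example: a low-level solver call carries bookkeeping arguments in the style of LAPACK/ScaLAPACK routines, while the high-level wrapper fills them in so the user passes only the matrix and right-hand side. Both function names and the 2x2 Gaussian elimination are hypothetical stand-ins, not PyScaLAPACK code:

```python
def scalapack_style_solve(n, nrhs, a, lda, ipiv, b, ldb):
    """Toy stand-in for a low-level gesv-style call: many bookkeeping
    arguments (sizes, leading dimensions, pivot workspace)."""
    # 2x2 Gaussian elimination, just enough for the demo.
    m = a[1][0] / a[0][0]
    a[1][1] -= m * a[0][1]
    b[1] -= m * b[0]
    x1 = b[1] / a[1][1]
    x0 = (b[0] - a[0][1] * x1) / a[0][0]
    return [x0, x1]

def solve(A, b):
    """High-level, PyACTS-style interface: the wrapper derives the
    bookkeeping arguments, so the caller supplies only A and b."""
    n = len(A)
    return scalapack_style_solve(n, 1, [row[:] for row in A], n,
                                 [0] * n, b[:], n)

x = solve([[3.0, 1.0], [1.0, 2.0]], [9.0, 8.0])   # -> [2.0, 3.0]
```

The real PyScaLAPACK additionally hides process-grid setup and array distribution, which is where most of the prototyping savings come from.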