Performance Analysis of an Embarrassingly Parallel Application in Atmospheric Modeling

Implementation and performance issues of a massively parallel atmospheric model

Parallel Computing, 1995

We present implementation and performance issues of a data parallel version of the National Center for Atmospheric Research (NCAR) Community Climate Model (CCM2). We describe automatic conversion tools used to aid in converting a production code written for a traditional vector architecture to data parallel code suitable for the Thinking Machines Corporation CM-5. Also, we describe the 3-D transposition method used to parallelize the spherical harmonic transforms in CCM2. This method employs dynamic data mapping techniques to improve data locality and parallel efficiency of these computations. We present performance data for the 3-D transposition method on the CM-5 for machine sizes up to 512 processors. We conclude that the parallel performance of the 3-D transposition method is adversely affected on the CM-5 by short vector lengths and array padding. We also find that the CM-5 spherical harmonic transforms spend about 70% of their execution time in communication. We detail a transposition-based data parallel implementation of the semi-Lagrangian Transport (SLT) algorithm used in CCM2. We analyze two approaches to parallelizing the SLT, called the departure point and arrival point based methods. We develop a performance model for choosing between these methods. We present SLT performance data which shows that the localized horizontal interpolation in the SLT takes 70% of the time, while the data remapping itself requires only approximately 16%. We discuss the importance of scalable I/O to CCM2, and present the I/O rates measured on the CM-5. We compare the performance of the data parallel version of CCM2 on a 32-processor CM-5 with the optimized vector code running on a single-processor Cray Y-MP. We show that the CM-5 code is 75% faster. We also give the overall performance of CCM2 running at higher resolutions on different numbers of CM-5 processors. We conclude by discussing the significance of these results and their implications for data parallel climate models.
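The transposition idea above can be sketched in a few lines: before a spectral transform, the field is remapped so that the transform dimension lies entirely in local, contiguous memory. This is a minimal NumPy illustration, not the paper's CM Fortran implementation; the field shape and axis names are invented.

```python
import numpy as np

def transpose_for_transform(field, transform_axis):
    """Remap so the transform axis becomes the last (local, contiguous) one."""
    return np.moveaxis(field, transform_axis, -1)

# toy field: (levels, latitudes, longitudes) -- shapes are illustrative only
f = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# transform along latitude (axis 1): remap so latitude becomes contiguous
g = transpose_for_transform(f, 1)
print(g.shape)  # (2, 4, 3)
```

On a distributed machine the remap is a global all-to-all exchange rather than a local `moveaxis`, which is why the paper finds communication dominating the transform time.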

A Distributed Memory Implementation of the Regional Atmospheric Model PROMES

2009

This paper describes the parallelization of the code PROMES, a regional atmospheric model developed by some of the authors. The parallel code, called PROMESPAR, was developed for a distributed-memory platform (a cluster of PCs) using Message Passing Interface (MPI) communication subroutines.

A Comparison of Implicitly Parallel Multithreaded and Data-Parallel Implementations of an Ocean Model

Journal of Parallel and Distributed Computing, 1998

Two parallel implementations of a state-of-the-art ocean model are described and analyzed: one is written in the implicitly parallel language Id for the Monsoon multithreaded dataflow architecture, and the other in data parallel CM Fortran for the CM-5. The multithreaded programming model is inherently more expressive than the data parallel model. One goal of this study is to understand what, if any, are the performance penalties of multithreaded execution when implementing a program that is well suited for data parallel execution. To avoid technology and machine configuration issues, the two implementations are compared in terms of overhead cycles per required floating point operation. When simulating flows in complex geometries typical of ocean basins, the data parallel model only remains efficient if redundant computations are performed over land. The generality of the Id programming model allows one to easily and transparently implement a parallel code that computes only in the ocean. When simulating ocean basins with complex and irregular geometry the normalised performance on Monsoon is comparable with that of the CM-5. For more regular geometries, that map well to the computational domain, the data-parallel approach proves to be a better match. We conclude by examining the extent to which clusters of mainstream symmetric multiprocessor (SMP) systems offer a scientific computing environment which can capitalize on and combine the strengths of the two paradigms.
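The land/ocean trade-off described above can be pictured with a toy masked update: a data parallel style computes over the full rectangular grid and masks out land afterwards, doing redundant work over land points, while an "ocean-only" style visits just the wet points. The grid, mask, and update below are invented for illustration.

```python
import numpy as np

grid = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
ocean = np.array([[True, False, True],
                  [True, True, False]])  # False marks land points

# data-parallel style: compute everywhere, then mask (redundant work over land)
dp = np.where(ocean, grid * 2.0, grid)

# ocean-only style: touch only the ocean points
oo = grid.copy()
oo[ocean] *= 2.0

print(np.array_equal(dp, oo))  # True
```

Both styles produce the same field; the difference is that the first performs the multiply on land points whose results are then discarded.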

Role of Parallel Computing in Numerical Weather Forecasting Models

Parallel computing plays a crucial role in state-of-the-art numerical weather and ocean forecasting models like WRF, POM, ROMS and RCAOM. The present study is an attempt to explore and examine the computational time required for the highly complex numerical simulations of weather and ocean models with multi-core processors and variable RAM/processor speeds. The simulations, carried out using machines of different computational capability/configuration, viz. quad-core and Xeon machines, have been investigated with different synthetic experiments to evaluate the role of parallel computing in the operational forecasting system. The saturation rates with different numbers of processors are also calculated before carrying out forecasting studies. Serial and parallel computations have been carried out with the WRF (Weather Research and Forecasting) model for simulating the track of a natural hazard, viz. the Thane cyclone. The simulations reveal that in the initial stage the computational time decrease...
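The saturation behaviour noted above is commonly modelled with Amdahl's law: speedup is bounded by the serial fraction of the code, so adding processors eventually yields diminishing returns. The 5% serial fraction below is an assumed value for illustration, not a measurement from WRF.

```python
def speedup(serial_fraction, n_procs):
    """Amdahl's law: speedup of a code with a fixed serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

# with 5% serial work, speedup saturates well below the processor count
for p in (1, 4, 16, 64, 256):
    print(p, round(speedup(0.05, p), 2))
```

Even at 256 processors the speedup stays under 20, which is the kind of saturation the study measures before committing to an operational configuration.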

Performance of parallel computers for spectral atmospheric models

1995

Massively parallel processing (MPP) computer systems use high-speed interconnection networks to link hundreds or thousands of RISC microprocessors. With each microprocessor having a peak performance of 100 Mflops/sec or more, there is at least the possibility of achieving very high performance. However, the question of exactly how to achieve this performance remains unanswered. MPP systems and vector multiprocessors require very different coding styles. Different MPP systems have widely varying ...

An interactive parallel programming environment applied in atmospheric science

OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information), 1996

This article introduces an interactive parallel programming environment (IPPE) that simplifies the generation and execution of parallel programs. One of the tasks of the environment is to generate message-passing parallel programs for homogeneous and heterogeneous computing platforms. The parallel programs are represented by using visual objects. This is accomplished with the help of a graphical programming editor that is implemented in Java and enables portability to a wide variety of computer platforms. In contrast to other graphical programming systems, reusable parts of the programs can be stored in a program library to support rapid prototyping. In addition, runtime performance data on different computing platforms is collected in a database. A selection process determines dynamically the software and the hardware platform to be used to solve the problem in minimal wall-clock time. The environment is currently being tested on a Grand Challenge problem, the NASA four-dimensional data assimilation system.
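The dynamic selection step can be pictured as a lookup over recorded runtimes: choose the (software, hardware) pair with the smallest measured wall-clock time. The table below is invented for illustration; IPPE's actual database schema and selection logic are not described in this abstract.

```python
# hypothetical runtime database: (software, hardware) -> wall-clock seconds
runtimes = {
    ("mpi", "cluster"): 42.0,
    ("mpi", "smp"): 55.0,
    ("pvm", "cluster"): 61.0,
}

# pick the platform combination with the smallest recorded time
best = min(runtimes, key=runtimes.get)
print(best)  # ('mpi', 'cluster')
```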

Parallel Implementation of a Large-Scale 3-D Air Pollution Model

Lecture Notes in Computer Science, 2001

Air pollution models can efficiently be used in different environmental studies. The atmosphere is the most dynamic component of the environment, where the pollutants can be transported over very long distances. Therefore the models must be defined on a large space domain. Moreover, all relevant physical and chemical processes must be adequately described. This leads to huge computational tasks. That is why it is difficult to handle numerically such models even on the most powerful up-to-date supercomputers. The particular model used in this study is the Danish Eulerian Model. The numerical methods used in the advection-diffusion part of this model consist of finite elements (for discretizing the spatial derivatives) followed by predictor-corrector schemes with several different correctors (in the numerical treatment of the resulting systems of ordinary differential equations). Implicit methods for the solution of stiff systems of ordinary differential equations are used in the chemistry part. This implies the use of Newton-like iterative methods. A special sparse matrix technique is applied in order to increase the efficiency. The model is constantly updated with new faster and more accurate numerical methods. The three-dimensional version of the Danish Eulerian Model is presented in this work. The model is defined on a space domain of 4800 km × 4800 km that covers the whole of Europe together with parts of Asia, Africa and the Atlantic Ocean. A chemical scheme with 35 species is used in this version. Two parallel implementations are discussed: the first for shared-memory parallel computers, the second a newly developed version for distributed-memory computers. Standard tools are used to achieve parallelism: OpenMP for shared memory computers and MPI for distributed memory computers. Results from many experiments, which were carried out on a SUN SMP cluster and on a CRAY T3E at the Edinburgh Parallel Computing Centre (EPCC), are presented and analyzed.
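The predictor-corrector time stepping mentioned for the advection-diffusion part can be illustrated with a Heun-type scheme: an explicit Euler predictor followed by repeated trapezoidal corrector sweeps. This is a generic sketch of the technique, not the Danish Eulerian Model's actual schemes; the test problem is an invented scalar decay equation.

```python
import math

def pc_step(f, t, y, h, n_correctors=2):
    """One predictor-corrector step: Euler predictor, trapezoidal correctors."""
    y_new = y + h * f(t, y)                           # predictor (explicit Euler)
    for _ in range(n_correctors):                     # corrector sweeps
        y_new = y + 0.5 * h * (f(t, y) + f(t + h, y_new))
    return y_new

# toy test problem y' = -y, exact solution exp(-t)
y, t, h = 1.0, 0.0, 0.1
for _ in range(10):
    y = pc_step(lambda t_, y_: -y_, t, y, h)
    t += h
print(abs(y - math.exp(-1.0)) < 1e-3)  # True
```

Each corrector sweep is one fixed-point iteration toward the implicit trapezoidal solution, which is why several different correctors can trade accuracy against cost per step.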

Optimization of atmospheric transport models on HPC platforms

Computers & Geosciences, 2016

The performance and scalability of atmospheric transport models on high performance computing environments is often far from optimal for multiple reasons including, for example, sequential input and output, synchronous communications, work unbalance, memory access latency or lack of task overlapping. We investigate how different software optimizations and porting to non general-purpose hardware architectures improve code scalability and execution times considering, as an example, the FALL3D volcanic ash transport model. To this purpose, we implement the FALL3D model equations in the WARIS framework, a software designed from scratch to solve in a parallel and efficient way different geoscience problems on a wide variety of architectures. In addition, we consider further improvements in WARIS such as hybrid MPI-OMP parallelization, spatial blocking, auto-tuning and thread affinity. Considering all these aspects together, the FALL3D execution times for a realistic test case running on general-purpose cluster architectures (Intel Sandy Bridge) decrease by a factor between 7 and 40 depending on the grid resolution. Finally, we port the application to Intel Xeon Phi (MIC) and NVIDIA GPUs (CUDA) accelerator-based architectures and compare performance, cost and power consumption on all the architectures. Implications on time-constrained operational model configurations are discussed.
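Spatial blocking, one of the WARIS optimizations listed above, processes the grid in small tiles so each tile stays cache-resident while it is updated. The sketch below shows only the loop structure; the tile size and the update itself are placeholders, not FALL3D's actual kernel.

```python
def blocked_scale(grid, tile=2, factor=2.0):
    """Apply a pointwise update tile by tile (cache/spatial blocking)."""
    ny, nx = len(grid), len(grid[0])
    out = [[0.0] * nx for _ in range(ny)]
    for j0 in range(0, ny, tile):                 # loop over tile rows
        for i0 in range(0, nx, tile):             # loop over tile columns
            for j in range(j0, min(j0 + tile, ny)):
                for i in range(i0, min(i0 + tile, nx)):
                    out[j][i] = factor * grid[j][i]
    return out

g = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(blocked_scale(g)[2][2])  # 18.0
```

For a pointwise update the blocked and unblocked loops give identical results; the benefit appears for stencil kernels, where tiling keeps neighbouring points in cache between reuses.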

Environmental modeling on massively parallel computers

Environmental Modelling & Software, 2000

In a previous work we studied the concurrent implementation of a numerical model, CONDIFP, developed for the analysis of depth-averaged convection-diffusion problems. Initial experiments were conducted on the Intel Touchstone Delta System, using up to 512 processors and different problem sizes. As for other computation-intensive applications, the results demonstrated an asymptotic trend to unity efficiency when the computational load dominates the communication load. This paper reports further numerical experiments, in both one and two space dimensions and with various choices of initial and boundary conditions, carried out on the Intel Paragon XP/S Model L38 with the aim of illustrating the versatility and reliability of the parallel solver.
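The asymptotic trend to unity efficiency can be captured by the standard simplification E = T_comp / (T_comp + T_comm): as computation dominates communication, E approaches 1. This model and the ratios below are illustrative, not measurements from the paper.

```python
def efficiency(t_comp, t_comm):
    """Parallel efficiency under a simple compute-vs-communicate model."""
    return t_comp / (t_comp + t_comm)

# as the computation-to-communication ratio grows, efficiency tends to 1
for ratio in (1, 10, 100, 1000):
    print(ratio, round(efficiency(ratio, 1.0), 3))
```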