Improving Atmospheric Model Performance on a Multi-Core Cluster System

Atmospheric models; hybrid OpenMP/MPI implementation; multicore cluster evaluation

International Journal of Information Technology, Communications and Convergence, 2012

Atmospheric models usually demand high processing power and generate large amounts of data. As the degree of parallelism grows, I/O operations may become the major factor impacting their performance. This work shows that a hybrid MPI/OpenMP implementation can improve the performance of the atmospheric Ocean-Land-Atmosphere Model (OLAM) on a multicore cluster environment. We show that the hybrid MPI/OpenMP version of OLAM decreases the number of output files, resulting in better performance for I/O operations. We have evaluated OLAM on the parallel file system PVFS and shown that storing the files on PVFS results in lower performance than using the local disks of the cluster nodes, as a consequence of file creation and network concurrency. We have also shown that further parallel optimisations should be included in the hybrid version in order to improve the parallel execution time of OLAM.
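
The abstract's central point is that a hybrid decomposition reduces the file count because threads within a rank share one output stream. The following is a minimal sketch of that pattern, not OLAM's actual code; the file name, sub-domain size, and placeholder update are all hypothetical.

```cpp
// Minimal hybrid MPI/OpenMP sketch (not OLAM's code): each MPI rank owns one
// output file, while OpenMP threads update sub-domains in shared memory.
// With R ranks and T threads per rank, R*T cores produce only R files,
// versus R*T files for a pure-MPI run with one file per process.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int cells_per_rank = 1 << 20;          // hypothetical sub-domain size
    std::vector<double> field(cells_per_rank, 0.0);

    // Threads update disjoint slices of the rank-local sub-domain.
    #pragma omp parallel for
    for (int i = 0; i < cells_per_rank; ++i)
        field[i] = rank + 0.001 * i;             // placeholder "physics"

    // One output file per rank, written once the threaded region is done.
    char name[64];
    std::snprintf(name, sizeof(name), "model_out_rank%04d.bin", rank);
    if (FILE *f = std::fopen(name, "wb")) {
        std::fwrite(field.data(), sizeof(double), field.size(), f);
        std::fclose(f);
    }

    MPI_Finalize();
    return 0;
}
```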

NPS-NRL-Rice-UIUC Collaboration on Navy Atmosphere-Ocean Coupled Models on Many-Core Computer Architectures Annual Report

Explicit-in-time CG: a continuous Galerkin discretization of the compressible Euler mini-app with explicit time integration. Explicit-in-time DG: a discontinuous Galerkin discretization of the compressible Euler mini-app with explicit time integration. Vertically Semi-Implicit CG: a continuous Galerkin discretization of the compressible Euler mini-app with vertically implicit semi-implicit time integration. Vertically Semi-Implicit DG: a discontinuous Galerkin discretization of the compressible Euler mini-app with vertically implicit semi-implicit time integration. Once the performance of a mini-app is accepted, it will be considered for adoption into NUMA. Extending the kernels in NUMA is being led by Giraldo and his postdoctoral researcher Abdi. We will also make these mini-apps available to the community to be imported into other codes if desired. Wilcox is working closely with Warburton and his team to lead the effort to develop the mini-apps, including hand-rolled computational kernels optimized for GPU accelerators. These kernels are "hand-written" in OCCA, a library Warburton's group is developing that allows a single kernel to be compiled using many different threading frameworks, such as CUDA, OpenCL, OpenMP, and Pthreads. We are initially developing hand-written kernels to provide a performance target for the Loo.py-generated kernels. Parallel communication between computational nodes will use the MPI standard to enable the mini-apps to run on large-scale clusters. Using these community standards for parallel programming will allow our mini-apps to be portable to many platforms; however, the performance may not be portable across devices. For performance portability, we, led by Klöckner, are using Loo.py to generate OCCA kernels which can be automatically tuned for current many-core devices as well as future ones.
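
To make the OCCA idea concrete, here is a small host-side sketch in the spirit of OCCA's C++ API, where changing the device "mode" string switches the threading backend without touching the kernel source. It is illustrative only: the kernel file "addVectors.okl" is hypothetical, and exact call signatures may differ between OCCA versions; this is not one of the report's mini-app kernels.

```cpp
// Sketch of selecting a threading backend at run time with OCCA's host API
// (illustrative; signatures may vary across OCCA versions). The same OKL
// kernel source could be built for OpenMP, CUDA, or OpenCL by changing "mode".
#include <occa.hpp>
#include <vector>

int main() {
    const int N = 1024;
    std::vector<float> a(N, 1.0f), b(N, 2.0f), ab(N, 0.0f);

    occa::device device("{mode: 'OpenMP'}");     // e.g. 'CUDA', 'OpenCL', 'Serial'

    occa::memory o_a  = device.malloc(N * sizeof(float), a.data());
    occa::memory o_b  = device.malloc(N * sizeof(float), b.data());
    occa::memory o_ab = device.malloc(N * sizeof(float));

    // "addVectors.okl" is a hypothetical file containing an @kernel definition.
    occa::kernel addVectors = device.buildKernel("addVectors.okl", "addVectors");

    addVectors(N, o_a, o_b, o_ab);               // launch on the chosen backend
    o_ab.copyTo(ab.data());                      // copy the result back to host
    return 0;
}
```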

Implementation and performance issues of a massively parallel atmospheric model

Parallel Computing, 1995

We present implementation and performance issues of a data parallel version of the National Center for Atmospheric Research (NCAR) Community Climate Model (CCM2). We describe automatic conversion tools used to aid in converting a production code written for a traditional vector architecture to data parallel code suitable for the Thinking Machines Corporation CM-5. Also, we describe the 3-D transposition method used to parallelize the spherical harmonic transforms in CCM2. This method employs dynamic data mapping techniques to improve data locality and parallel efficiency of these computations. We present performance data for the 3-D transposition method on the CM-5 for machine sizes of up to 512 processors. We conclude that the parallel performance of the 3-D transposition method is adversely affected on the CM-5 by short vector lengths and array padding. We also find that the CM-5 spherical harmonic transforms spend about 70% of their execution time in communication. We detail a transposition-based data parallel implementation of the Semi-Lagrangian Transport (SLT) algorithm used in CCM2. We analyze two approaches to parallelizing the SLT, called the departure point and arrival point based methods. We develop a performance model for choosing between these methods. We present SLT performance data which shows that the localized horizontal interpolation in the SLT takes 70% of the time, while the data remapping itself only requires approximately 16%. We discuss the importance of scalable I/O to CCM2, and present the I/O rates measured on the CM-5. We compare the performance of the data parallel version of CCM2 on a 32-processor CM-5 with the optimized vector code running on a single processor Cray Y-MP. We show that the CM-5 code is 75% faster. We also give the overall performance of CCM2 running at higher resolutions on different numbers of CM-5 processors. We conclude by discussing the significance of these results and their implications for data parallel climate models.
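
The transposition idea behind parallel spectral transforms can be sketched with a single collective exchange: each process trades its block of one dimension for a complete copy of another, so the following transform needs no further communication. The sketch below is schematic (modern MPI rather than the paper's CM-5 data-parallel code), and the grid sizes are hypothetical; packing the send buffer into contiguous per-destination blocks is omitted for brevity.

```cpp
// Schematic transposition step: each process starts with all latitudes for a
// slab of longitudes and ends with all longitudes for a slab of latitudes.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Hypothetical grid: nlat x nlon points, both divisible by nprocs.
    const int nlat = 64, nlon = 128;
    const int my_lons = nlon / nprocs;           // longitudes owned before
    const int my_lats = nlat / nprocs;           // latitudes owned after
    const int block = my_lats * my_lons;         // block exchanged per pair

    // In a real code, "before" would be packed so that the data destined for
    // each process is contiguous; that packing step is omitted here.
    std::vector<double> before(nlat * my_lons, static_cast<double>(rank));
    std::vector<double> after(my_lats * nlon);

    // One all-to-all exchange of equally sized blocks performs the transposition.
    MPI_Alltoall(before.data(), block, MPI_DOUBLE,
                 after.data(),  block, MPI_DOUBLE, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```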

Performance Analysis of an Embarrassingly Parallel Application in Atmospheric Modeling

Research Journal of Applied Sciences, Engineering and Technology, 2015

This study presents a comparative study of various parallel programming models for a compute-intensive application pertaining to atmospheric modeling. Atmospheric modeling deals with predicting the behavior of the atmosphere through mathematical equations governing atmospheric fluid flows. The mathematical equations are nonlinear partial differential equations which are difficult to solve analytically. Thus, the fundamental governing equations of atmospheric motion are discretized into algebraic forms that are solved using numerical methods to obtain flow-field values at discrete points in time and/or space. Solving these equations often requires huge computational resources, which are normally available with high-speed supercomputers. The shallow water equations provide a useful framework for the analysis of the dynamics of large-scale atmospheric flow and for the analysis of various numerical methods that might be applied to the solution of these equations. In this study, a finite volume approach has been used for discretizing these equations, which leads to a number of algebraic equations equal to the number of time instants at which the flow-field values are to be evaluated. It is apparent that the application is embarrassingly parallel and that its parallelization will suppress communication overhead. A high-performance compute cluster has been employed for solving the equations involved in atmospheric modeling. Use of the OpenMP and MPI APIs has paved the way to study the behavior of the shared-memory programming model and the message-passing programming model in the context of such a highly compute-intensive application. It is observed that no additional benefit is gained by creating more software threads than the available hardware threads, as the execution resources must be shared among them.
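
The embarrassingly parallel character of a finite-volume update comes from each cell being advanced independently once the interface fluxes are known. Below is a minimal 1-D sketch for the shallow-water height equation with upwind fluxes and an OpenMP loop over cells; it is an illustration under simplifying assumptions (constant advection speed, boundaries left untouched), not the study's actual scheme or grid.

```cpp
// Minimal 1-D finite-volume step for the shallow-water height equation
// dh/dt + d(h*u)/dx = 0, with a constant advection speed u and upwind fluxes.
#include <omp.h>
#include <vector>

void fv_step(std::vector<double> &h, double u, double dx, double dt) {
    const int n = static_cast<int>(h.size());
    std::vector<double> flux(n + 1, 0.0);        // boundary fluxes left at zero

    // Upwind flux at each interior interface (u > 0 assumed for simplicity).
    #pragma omp parallel for
    for (int i = 1; i < n; ++i)
        flux[i] = u * h[i - 1];

    // Each cell is updated independently: embarrassingly parallel over cells.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        h[i] -= dt / dx * (flux[i + 1] - flux[i]);
}
```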

Atlas: A library for numerical weather prediction and climate modelling

Computer Physics Communications, 2017

The algorithms underlying numerical weather prediction (NWP) and climate models that have been developed in the past few decades face an increasing challenge caused by the paradigm shift imposed by hardware vendors towards more energy-efficient devices. In order to provide a sustainable path to exascale High Performance Computing (HPC), applications become increasingly restricted by energy consumption. As a result, the emerging diverse and complex hardware solutions have a large impact on the programming models traditionally used in NWP software, triggering a rethink of design choices for future massively parallel software frameworks. In this paper, we present Atlas, a new software library that is currently being developed at the European Centre for Medium-Range Weather Forecasts (ECMWF), with the scope of handling data structures required for NWP applications in a flexible and massively parallel way. Atlas provides a versatile framework for the future development of efficient NWP and climate applications on emerging HPC architectures. The applications range from full Earth system models to specific tools required for post-processing weather forecast products. The Atlas library thus constitutes a step towards affordable exascale high-performance simulations by providing the necessary abstractions that facilitate its application in heterogeneous HPC environments and promote the co-design of NWP algorithms with the underlying hardware.

Improving Performance on Atmospheric Models through a Hybrid OpenMP/MPI Implementation

2011 IEEE Ninth International Symposium on Parallel and Distributed Processing with Applications, 2011

This work shows how a hybrid MPI/OpenMP implementation can improve the performance of the Ocean-Land-Atmosphere Model (OLAM) on a multi-core cluster environment; OLAM is typical of HPC applications with a many-small-files workload. Previous experiments have shown that the scalability of this application on clusters is limited by the performance of the output operations. We show that the hybrid MPI/OpenMP version of OLAM decreases the number of output files, resulting in better performance for I/O operations. We also observe that the MPI version of OLAM performs better for unbalanced workloads and that further parallel optimizations should be included in the hybrid version in order to improve the parallel execution time of OLAM.
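
A practical prerequisite for any such hybrid code is requesting an appropriate thread-support level from the MPI library before spawning OpenMP threads. The sketch below shows the generic pattern with MPI_Init_thread and MPI_THREAD_FUNNELED; it is a generic initialization example, not OLAM's source.

```cpp
// Generic initialization for a hybrid MPI/OpenMP code: request
// MPI_THREAD_FUNNELED so OpenMP threads may exist as long as only the master
// thread makes MPI calls, and fall back to a single thread otherwise.
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char **argv) {
    int provided = MPI_THREAD_SINGLE;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED) {
        // The MPI library cannot guarantee the thread safety level we need.
        omp_set_num_threads(1);
    }

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel
    {
        #pragma omp master
        std::printf("rank %d running %d threads\n", rank, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```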

Supercomputer Technologies as a Tool for High-resolution Atmospheric Modelling towards the Climatological Timescales

Supercomputing frontiers and innovations, 2018

Estimation of recent and future climate change is one of the most important challenges in the modern Earth sciences. Numerical climate models are an essential tool in this field of research. However, modelling results are highly sensitive to the spatial resolution of the model. Most climate change studies utilize global atmospheric models with a grid cell size of tens of kilometres or more. High-resolution mesoscale models are much more detailed, but require significantly more computational resources. Applications of such high-resolution models in climate studies are usually limited to regional simulations and relatively short timespans. In this paper we present our experience with long-term regional climate studies based on mesoscale modelling. Using examples from urban climate studies and extreme wind assessments, we demonstrate the principal advantage of long-term high-resolution simulations, which were carried out on modern supercomputers.

Performance Portability of the Aeras Atmosphere Model to Next Generation Architectures using Kokkos

The subject of this report is the performance portability of the Aeras global atmosphere dynamical core (implemented within the Albany multi-physics code) to new and emerging architecture machines using the Kokkos library and programming model. We describe the process of refactoring the finite element assembly process for the 3D hydrostatic model in Aeras and highlight common issues associated with development on GPU architectures. After giving detailed build and execute instructions for Aeras with MPI, OpenMP and CUDA on the Shannon cluster at Sandia National Laboratories and the Titan supercomputer at Oak Ridge National Laboratory, we evaluate the performance of the code on a canonical test case known as the baroclinic instability problem. We show a speedup of up to 4 times on 8 OpenMP threads, but we were unable to achieve a speedup on the GPU due to memory constraints. We conclude by providing methods for improving the performance of the code for future optimization.
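
The refactoring described above boils down to expressing the element loop of the finite element assembly as a portable parallel dispatch. The sketch below shows only the basic Kokkos pattern (a View plus a parallel_for over elements); it is not the Aeras/Albany assembly code, and the problem sizes and placeholder element work are assumptions for illustration.

```cpp
// Basic Kokkos pattern behind a portable finite-element assembly loop.
// The same parallel_for runs on OpenMP threads or a GPU depending on the
// backend Kokkos was built with.
#include <Kokkos_Core.hpp>

int main(int argc, char *argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int num_elems = 10000, nodes_per_elem = 4;   // hypothetical sizes
        Kokkos::View<double**> residual("residual", num_elems, nodes_per_elem);

        // One parallel work item per element; each fills its own residual row.
        Kokkos::parallel_for("assemble", num_elems, KOKKOS_LAMBDA(const int e) {
            for (int n = 0; n < nodes_per_elem; ++n)
                residual(e, n) = 1.0 * e + 0.1 * n;        // placeholder element work
        });
        Kokkos::fence();  // ensure device work finishes before views are destroyed
    }
    Kokkos::finalize();
    return 0;
}
```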

Role of Parallel Computing in Numerical Weather Forecasting Models

Parallel computing plays a crucial role in state-of-the-art numerical weather and ocean forecasting models like WRF, POM, ROMS and RCAOM. The present study is an attempt to explore and examine the computational time required for the highly complex numerical simulations of weather and ocean models with multi-core processors and variable RAM/processor speeds. The simulations, carried out using machines of different computational capability/configuration, viz. quad-core and Xeon machines, have been investigated with different synthetic experiments to evaluate the role of parallel computing in the operational forecasting system. The saturation rates with different numbers of processors are also calculated before carrying out forecasting studies. Serial and parallel computations have been carried out with the WRF (Weather Research and Forecasting) model for simulating the track of a natural hazard, viz. the Thane cyclone. The simulations reveal that in the initial stage the computational time decrease...
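
The saturation behaviour mentioned above, where adding processors yields diminishing returns, can be illustrated with Amdahl's law. The sketch below computes the predicted speedup for a few processor counts; the serial fraction is an assumed value for illustration, not a measurement from the WRF experiments described in the study.

```cpp
// Amdahl's-law sketch of why run time saturates as processors are added.
// With serial fraction s, speedup on p processors is 1 / (s + (1 - s) / p),
// which levels off at 1/s no matter how many processors are used.
#include <cstdio>

int main() {
    const double s = 0.05;                       // assumed serial fraction
    for (int p : {1, 2, 4, 8, 16, 32, 64}) {
        double speedup = 1.0 / (s + (1.0 - s) / p);
        std::printf("p = %2d  speedup = %5.2f\n", p, speedup);
    }
    return 0;
}
```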