Hybrid Programming Model for Implicit PDE Simulations on Multicore Architectures

Understanding the parallel scalability of an implicit unstructured mesh CFD code

High Performance Computing, 2000

In this paper, we identify the scalability bottlenecks of an unstructured grid CFD code (PETSc-FUN3D) by studying the impact of several algorithmic and architectural parameters and by examining different programming models. We discuss the basic performance characteristics of this PDE code with the help of simple performance models developed in our earlier work, presenting primarily experimental results. In addition to achieving good per-processor performance (which has been addressed in our cited work and without which scalability claims are suspect), we strive to improve the implementation and convergence scalability of PETSc-FUN3D on thousands of processors.

Multiple threads and parallel challenges for large simulations to accelerate a general Navier–Stokes CFD code on massively parallel systems

2012

Computational fluid dynamics is an increasingly important application domain for computational scientists. In this paper, we propose and analyze optimizations necessary to run CFD simulations consisting of multibillion-cell mesh models on large processor systems. Our investigation leverages the general industrial Navier-Stokes CFD application, Code_Saturne, developed by Electricité de France for incompressible and nearly compressible flows. In this paper, we outline the main bottlenecks and challenges for massively parallel systems and emerging processor features such as many-core, transactional memory, and thread level speculation. We also present an approach based on an octree search algorithm to facilitate the joining of mesh parts and to build complex larger unstructured meshes of several billion grid cells. We describe two parallel strategies of an algebraic multigrid solver and we detail how to introduce new levels of parallelism based on compiler directives with OpenMP, transactional memory and thread level speculation, for finite volume cell-centered formulation and face-based loops. A renumbering scheme for mesh faces is proposed to enhance thread-level parallelism. Implementations capable of simulating multibillions of cells or particles are beginning to emerge within the research community; nevertheless, one of the bigger challenges is to reach this capability with general industrial Navier-Stokes CFD software.
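
The face-renumbering idea in the abstract above can be sketched as a coloring problem: group the faces of a cell-centered finite-volume mesh so that no two faces in the same group touch the same cell, so each group's face loop can be run by many threads without write conflicts on the per-cell arrays. A minimal sketch; the mesh, names, and greedy scheme are illustrative assumptions, not Code_Saturne's actual implementation:

```python
def color_faces(faces):
    """Greedy face coloring; faces is a list of (cell_a, cell_b) pairs.

    Each face gets the smallest color not already used by another
    face sharing one of its two adjacent cells."""
    used = {}      # cell index -> set of colors already touching that cell
    colors = []
    for a, b in faces:
        c = 0
        while c in used.get(a, set()) or c in used.get(b, set()):
            c += 1
        colors.append(c)
        used.setdefault(a, set()).add(c)
        used.setdefault(b, set()).add(c)
    return colors

faces = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 2), (1, 3), (2, 5)]
colors = color_faces(faces)

# within one color no two faces share a cell, so a face loop over each
# color (e.g. an OpenMP "parallel for" per color in C) is conflict-free
for i in range(len(faces)):
    for j in range(i + 1, len(faces)):
        if colors[i] == colors[j]:
            assert set(faces[i]).isdisjoint(set(faces[j]))
```

In a real code each color's faces would be stored contiguously after renumbering, so the threaded loop body needs no locks or atomics.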

Toward a GPU-aware comparison of explicit and implicit CFD simulations on structured meshes

A Computational Fluid Dynamics (CFD) code for steady simulations solves a set of non-linear partial differential equations using an iterative time stepping process, which can follow an explicit or an implicit scheme. On the CPU, the difference between the two time stepping methods with respect to stability and performance has been well covered in the literature. However, it has not been extended to consider modern high-performance computing systems such as Graphics Processing Units (GPUs). In this work, we first present an implementation of the two time-stepping methods on the GPU, highlighting the different challenges in the programming approach. Then we introduce a classification of basic CFD operations, based on the degree of parallelism they expose, and study the potential of GPU acceleration for every class. The classification provides local speedups of basic operations, which are finally used to compare the performance of both methods on the GPU. The target of this work is to enable an informed decision on the most efficient combination of hardware and method when facing a new application. Our findings show that the choice between explicit and implicit time integration relies mainly on the convergence of explicit solvers and the efficiency of preconditioners on the GPU.
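
The stability trade-off underlying that comparison can be illustrated on the 1D heat equation: explicit forward Euler is stable only for r = dt/dx^2 <= 1/2, while implicit backward Euler, which requires a linear solve per step (here a Thomas tridiagonal solve), is unconditionally stable. This is a textbook sketch under Dirichlet boundaries, not the paper's GPU code:

```python
def explicit_step(u, r):
    # forward Euler: u_i <- u_i + r*(u_{i+1} - 2 u_i + u_{i-1}); stable only for r <= 1/2
    n = len(u)
    return [u[i] if i in (0, n - 1)
            else u[i] + r * (u[i + 1] - 2 * u[i] + u[i - 1])
            for i in range(n)]

def implicit_step(u, r):
    # backward Euler: solve (I - r*L) u_new = u with the Thomas algorithm
    n = len(u)
    a = [-r] * n; b = [1 + 2 * r] * n; c = [-r] * n
    b[0] = b[-1] = 1.0; c[0] = 0.0; a[-1] = 0.0   # Dirichlet boundary rows
    cp = [0.0] * n; dp = [0.0] * n                # forward elimination
    cp[0] = c[0] / b[0]
    dp[0] = u[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (u[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n                                 # back substitution
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

At r = 1 the explicit update amplifies high-frequency modes and diverges, while the implicit one damps them; the price of the implicit scheme is the per-step solve, which is exactly the part whose GPU efficiency (solver plus preconditioner) the abstract identifies as decisive.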

Experiences Using Hybrid MPI/OpenMP in the Real World: Parallelization of a 3D CFD Solver for Multi-Core Node Clusters

Scientific Programming, 2010

Today most systems in high-performance computing (HPC) feature a hierarchical hardware design: shared-memory nodes with several multi-core CPUs are connected via a network infrastructure. When parallelizing an application for these architectures it seems natural to employ a hierarchical programming model such as combining MPI and OpenMP. Nevertheless, there is the general lore that pure MPI outperforms the hybrid MPI/OpenMP approach. In this paper, we describe the hybrid MPI/OpenMP parallelization of IR3D (Incompressible Realistic 3-D) code, a full-scale real-world application, which simulates the environmental effects on the evolution of vortices trailing behind control surfaces of underwater vehicles. We discuss performance, scalability and limitations of the pure MPI version of the code on a variety of hardware platforms and show how the hybrid approach can help to overcome certain limitations.

Efficiency of Large-Scale CFD Simulations on Modern Supercomputers Using Thousands of CPUs and Hybrid MPI+OpenMP Parallelization

This work represents an experience in using the hybrid parallel model to perform large-scale DNS. Advantages of the hybrid approach compared to the MPI-only approach are presented and discussed. The use of OpenMP in addition to MPI is demonstrated for modelling of compressible and incompressible flows using both structured and unstructured meshes. A parallel Poisson solver for incompressible flows with one periodic direction extended with the hybrid parallelization is presented. A two-level domain decomposition approach is considered for improving parallel algorithms for compressible flows. An alternative strategy with partial data replication is presented as well. Several DNS examples with mesh sizes varying from 10^6 to 10^8 control volumes are given to demonstrate efficient usage of the upgraded algorithms. Performance tests and simulations have been carried out on several parallel systems including MareNostrum, MVS-100000 and Lomonosov supercomputers.
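
The two-level decomposition described above has a simple structure: the global mesh is first split into per-process subdomains (the MPI level), and each subdomain is split again among threads (the OpenMP level). A toy model of that structure on a reduction; the ranks and threads are simulated sequentially here, and names and counts are illustrative, not the paper's solver:

```python
def partition(n, parts):
    """Split range(n) into `parts` contiguous, near-equal chunks."""
    base, rem = divmod(n, parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < rem else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

def hybrid_dot(x, y, nprocs=4, nthreads=2):
    """Dot product computed through a two-level MPI-like/OpenMP-like split."""
    total = 0.0
    for rank_cells in partition(len(x), nprocs):      # MPI level: one subdomain per rank
        lo, hi = rank_cells.start, rank_cells.stop
        local = 0.0
        for thr in partition(hi - lo, nthreads):      # OpenMP level: threads split the subdomain
            local += sum(x[lo + i] * y[lo + i] for i in thr)
        total += local                                # stands in for an MPI_Allreduce
    return total
```

Whatever the split, the result must match the serial reduction; in the real hybrid code the outer chunks become distributed-memory subdomains with halo exchanges, and the inner chunks become threaded loop ranges.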

High-performance parallel implicit CFD

Parallel Computing, 2001

Fluid dynamical simulations based on finite discretizations on (quasi-)static grids scale well in parallel, but execute at a disappointing percentage of per-processor peak floating point operation rates without special attention to layout and access ordering of data. We document both claims from our experience with an unstructured grid CFD code that is typical of the state of the practice at NASA. These basic performance characteristics of PDE-based codes can be understood with surprisingly simple models, for which we quote earlier work, presenting primarily experimental results. The performance models and experimental results motivate algorithmic and software practices that lead to improvements in both parallel scalability and per node performance. This snapshot of ongoing work updates our 1999 Bell Prize-winning simulation on ASCI computers.
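
The access-ordering point above can be made concrete with edge-based unstructured-grid kernels: visiting edges in an arbitrary order makes the indirect loads u[i], u[j] jump around memory, while sorting edges by node index keeps successive loads close together and improves cache reuse. The "jump" metric below is a crude locality proxy of my own, not a timing model from the paper:

```python
import random

def avg_jump(edges):
    """Mean distance in first-node index between consecutively visited edges."""
    jumps = [abs(edges[k][0] - edges[k - 1][0]) for k in range(1, len(edges))]
    return sum(jumps) / len(jumps)

random.seed(0)
nodes = 1000
# edges in arbitrary (e.g. mesh-generator) order
edges = [(random.randrange(nodes), random.randrange(nodes)) for _ in range(5000)]
reordered = sorted(edges)   # visit edges in ascending node order

# reordering shrinks the average stride of the indirect accesses
assert avg_jump(reordered) < avg_jump(edges)
```

In practice such reordering is combined with interlacing of field components and cache blocking, which is the kind of data-layout work the abstract credits for closing the gap to peak per-processor performance.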

A Scalable Strategy for the Parallelization of Multiphysics Unstructured Mesh-Iterative Codes on Distributed-Memory Systems

International Journal of High Performance Computing Applications, 2000

Realizing scalable performance on high performance computing systems is not straightforward for single-phenomenon codes (such as computational fluid dynamics [CFD]). This task is magnified considerably when the target software involves the interactions of a range of phenomena that have distinctive solution procedures involving different discretization methods. The problems of addressing the key issues of retaining data integrity and the ordering of the calculation procedures are significant. A strategy for parallelizing this multiphysics family of codes is described for software exploiting finite-volume discretization methods on unstructured meshes using iterative solution procedures. A mesh partitioning-based SPMD approach is used. However, since different variables use distinct discretization schemes, this means that distinct partitions are required; techniques for addressing this issue are described using the mesh-partitioning tool, JOSTLE. In this contribution, the strategy is tested for a variety of test cases under a wide range of conditions (e.g., problem size, number of processors, asynchronous/synchronous communications, etc.) using a variety of strategies for mapping the mesh partition onto the processor topology.

Superlinear speedup in OpenMP parallelization of a local PDE solver

This paper analyses the application of OpenMP parallelization on shared-memory systems, such as the increasingly available multicore systems. The parallelization of the local meshless numerical method is considered. The presented solution procedure is suitable for solving systems of coupled partial differential equations. The superlinear speedup is demonstrated on a solution of the fluid mechanics problem. Local core caches are identified as the source of superlinearity, and a set of experiments is performed for the analysis of a cache-induced superlinear speedup. For the experiments, a simple algorithm simulating the workload of the local meshless numerical method is used to assess the method's complexity.
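
The cache argument behind that superlinearity can be captured in a toy cost model: per-item cost is high while a core's working set overflows its cache and low once it fits, so splitting the problem across p cores shrinks each working set and, past the cache threshold, the speedup exceeds p. The sizes and cost ratio below are illustrative assumptions, not measurements from the paper:

```python
def run_time(n_items, cache_items, fast=1.0, slow=4.0):
    """Modeled time: `slow` per item when the working set overflows the cache."""
    return n_items * (fast if n_items <= cache_items else slow)

def speedup(n_items, cache_items, p):
    serial = run_time(n_items, cache_items)
    # cores run concurrently, so parallel time is one core's slice
    parallel = run_time(n_items // p, cache_items)
    return serial / parallel

# 8 cores: each 10_000-item slice fits a 16_000-item cache, the serial set does not
assert speedup(80_000, 16_000, 8) > 8
# 2 cores: each slice still overflows the cache -> ordinary linear speedup
assert speedup(80_000, 16_000, 2) == 2
```

The model reproduces the qualitative effect the paper measures: superlinear speedup appears only once the per-core working set crosses under the cache size, and reverts to linear otherwise.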