Optimised Hybrid Parallelisation of a CFD Code on Many Core Architectures

HICFD: Highly Efficient Implementation of CFD Codes for HPC Many-Core Architectures

Competence in High Performance Computing 2010, 2011

The objective of the German BMBF research project Highly Efficient Implementation of CFD Codes for HPC Many-Core Architectures (HICFD) is to develop new methods and tools for analyzing and optimizing the performance of parallel computational fluid dynamics (CFD) codes on high-performance computer systems with many-core processors. The project's work packages investigate how the performance of parallel CFD codes written in C can be increased by optimally exploiting all levels of parallelism. At the highest level, MPI is used; at the level of the many-core architecture, highly scalable hybrid OpenMP/MPI methods are implemented; and at the level of the processor cores, the parallel SIMD units provided by modern CPUs are exploited.
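To make the three levels concrete, here is a minimal C sketch (not HICFD code) combining them: MPI across processes, OpenMP threads within a process, and SIMD lanes within a core via `omp simd`. The update loop and array names are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Level 1: MPI across processes; FUNNELED suffices when only
       the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    static double u[N], r[N];
    double local = 0.0, global = 0.0;

    /* Level 2: OpenMP threads within the rank.
       Level 3: SIMD lanes within each core (omp simd). */
    #pragma omp parallel for simd reduction(+:local)
    for (int i = 0; i < N; i++) {
        u[i] += 0.5 * r[i];        /* e.g. an explicit update step */
        local += r[i] * r[i];      /* rank-local residual norm */
    }

    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("residual^2 = %g on %d ranks\n", global, nranks);

    MPI_Finalize();
    return 0;
}
```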

Highly scalable computational algorithms on emerging parallel machine multicore architectures: development and implementation in CFD context

International Journal for Numerical Methods in Fluids, 2013

In this paper, the first in a series, the authors develop and implement new computational algorithms for improving the scalability of CFD simulations on emerging architectures such as multicore high-performance computing (HPC) platforms. The developments fall into three categories: (i) improved partitioning for multicore platforms, (ii) improved and optimized communication for HPC, and (iii) enhanced scalability through computer-science-based methods. In the first category, the multilevel partitioning strategy was modified to reduce the number of out-of-core communications, which yielded noticeable speedup even for small cases. In the second category, the authors devised a next-generation communication procedure optimized for the architecture and the partitioning, which also produced noticeable speedups. In the third category, improvements in memory management yielded a further speedup of nearly 10%. Together, the three algorithmic implementations gave ideal, and at times superlinear, scalability up to 3000 processors. Overall, the scalability results are very promising and indicate that the approach has great potential for more complicated multidisciplinary problems such as fluid-structure interaction and aeroelastic simulations.
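The paper's next-generation communication procedure is not reproduced here, but a standard ingredient of such optimizations is overlapping halo exchange with interior computation. The following C sketch, assuming a simple 1-D decomposition with one ghost cell per side and neighbor ranks `left`/`right`, illustrates the pattern; all names are hypothetical.

```c
#include <mpi.h>

/* One update step for a 1-D domain decomposition with a single ghost
   cell per side: post non-blocking receives/sends for the halos,
   advance the interior cells while messages are in flight, then
   finish the two boundary cells once the halos have arrived.
   `left`/`right` may be MPI_PROC_NULL at the ends of the domain. */
void update_with_overlap(double *q, double *qn, int n, /* owned cells q[1..n] */
                         int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

    MPI_Irecv(&q[0],     1, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(&q[n + 1], 1, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(&q[1],     1, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(&q[n],     1, MPI_DOUBLE, right, 0, comm, &req[3]);

    /* interior cells touch no ghost data */
    for (int i = 2; i <= n - 1; i++)
        qn[i] = q[i] + 0.1 * (q[i - 1] - 2.0 * q[i] + q[i + 1]);

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    /* boundary cells use the freshly received ghosts q[0] and q[n+1] */
    qn[1] = q[1] + 0.1 * (q[0] - 2.0 * q[1] + q[2]);
    qn[n] = q[n] + 0.1 * (q[n - 1] - 2.0 * q[n] + q[n + 1]);
}
```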

High performance parallel computing of flows in complex geometries

Comptes Rendus Mécanique, 2011

Keywords: computer science, algorithmics; parallel computing; computational fluid dynamics.

Efficient numerical tools that take advantage of the ever-increasing power of high-performance computers are becoming key elements in the fields of energy supply and transportation, not only from a purely scientific point of view but also at the design stage in industry. Indeed, the flow phenomena that occur in or around industrial applications such as gas turbines or aircraft are still not mastered: most Computational Fluid Dynamics (CFD) predictions produced today focus on reduced or simplified versions of the real systems and are usually solved under a steady-state assumption. This article shows how recent developments in CFD codes and parallel computer architectures can help overcome this barrier. With this new environment, new scientific and technological challenges can be addressed, provided that thousands of computing cores are used efficiently in parallel. Strategies of modern flow solvers are discussed with particular emphasis on mesh partitioning, load balancing, and communication. These concepts are used in two CFD codes developed by CERFACS: a multi-block structured code dedicated to aircraft and turbomachinery, and an unstructured code for gas turbine flow predictions. Leading-edge computations obtained with these high-end massively parallel CFD codes are illustrated and discussed in the context of aircraft, turbomachinery, and gas turbine applications. Finally, future developments of CFD and high-end computers are proposed to provide leading-edge tools and end applications with strong industrial implications at the design stage of the next generation of aircraft and gas turbines.
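As a minimal illustration of the load-balancing side of mesh partitioning, the C sketch below splits weighted cells into contiguous blocks of nearly equal total weight. Production solvers of the kind described here rely instead on graph partitioners (e.g. METIS or Scotch) that also minimize the faces cut between parts; the function and names below are hypothetical.

```c
/* Load-balancing sketch: split `ncells` weighted cells into `nparts`
   contiguous blocks of nearly equal total weight.  This balances the
   work per part but, unlike a graph partitioner, ignores the amount
   of communication induced by the cut. */
void block_partition(const double *weight, int ncells, int nparts, int *owner)
{
    double total = 0.0;
    for (int i = 0; i < ncells; i++) total += weight[i];

    double target = total / nparts, acc = 0.0;
    int part = 0;
    for (int i = 0; i < ncells; i++) {
        owner[i] = part;
        acc += weight[i];
        /* advance to the next part once its share of the weight is filled */
        if (acc >= target * (part + 1) && part < nparts - 1)
            part++;
    }
}
```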

Parallelization Strategies for Computational Fluid Dynamics Software: State of the Art Review

Computational fluid dynamics (CFD) is one of the fastest-growing fields of fluid mechanics and is used to analyze fluid flow situations. The analysis is based on simulations carried out on computing machines. For complex configurations, the number of grid points is so large that the computational time required to obtain results becomes very high. Parallel computing is adopted to reduce this computational time by utilizing the available computing resources. Parallel computing tools such as OpenMP, MPI, CUDA, combinations of these, and a few others are used to parallelize CFD software. This article provides a comprehensive state-of-the-art review of important CFD areas and parallelization strategies for the related software. Issues related to the computational time complexity and parallelization of CFD software are highlighted. The benefits and drawbacks of using various parallel computing tools for the parallelization of CFD software are summarized. Open areas of CFD where parallelization has not been widely attempted are identified, and parallel computing tools that can be useful for parallelizing CFD software are spotlighted. A few suggestions for future work in the parallel computing of CFD software are also provided.

Evaluation of CFD Computing Performance on Multi-Core Processors for Flow Simulations

Journal of Advanced Research in Applied Sciences and Engineering Technology

Previous parallel computing implementations for Computational Fluid Dynamics (CFD) focused extensively on Complex Instruction Set Computer (CISC) processors. Parallel programming was incorporated into the previous generation of the Raspberry Pi, a Reduced Instruction Set Computer (RISC), but yielded poor computing performance due to the processing-power limits of the time. This research focuses on utilizing two Raspberry Pi 3 B+ boards, with increased processing capability compared with the previous generation, to tackle fluid flow problems using numerical analysis and CFD. Parallel computing elements such as Secure Shell (SSH) and the Message Passing Interface (MPI) protocol were implemented for Advanced RISC Machine (ARM) processors. The parallel network was then validated by a processor call attempt and a core execution test. Parallelization of the processors enables the study of fluid flow and CFD problems, such as validation of the NACA 0012 airfoil and an additiona...
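The paper's exact validation procedure is not shown, but a cluster check of this kind is commonly done with an MPI program in which every rank reports its identity and host, as in the C sketch below, launched over SSH with something like `mpirun -np 8 --hostfile hosts ./check` (a hostfile listing both boards; the names are illustrative).

```c
#include <mpi.h>
#include <stdio.h>

/* Cluster sanity check: every rank reports its id and host name,
   confirming that all cores on both boards answer MPI calls. */
int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("rank %d of %d alive on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```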

Efficiency of Large-Scale CFD Simulations on Modern Supercomputers Using Thousands of CPUs and Hybrid MPI+OpenMP Parallelization

This work presents experience in using the hybrid parallel model to perform large-scale DNS. Advantages of the hybrid approach compared with the MPI-only approach are presented and discussed. The use of OpenMP in addition to MPI is demonstrated for modelling compressible and incompressible flows using both structured and unstructured meshes. A parallel Poisson solver for incompressible flows with one periodic direction, extended with the hybrid parallelization, is presented. A two-level domain decomposition approach is considered for improving parallel algorithms for compressible flows, and an alternative strategy with partial data replication is presented as well. Several DNS examples with mesh sizes varying from 10^6 to 10^8 control volumes demonstrate efficient usage of the upgraded algorithms. Performance tests and simulations have been carried out on several parallel systems, including the MareNostrum, MVS-100000 and Lomonosov supercomputers.
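A minimal sketch of the hybrid pattern, assuming an MPI slab decomposition with halo exchange handled elsewhere: one Jacobi sweep of a Poisson smoother whose loop nest is shared among OpenMP threads inside each rank. This is a model of the approach, not the authors' solver; all names are assumed.

```c
/* One hybrid Jacobi sweep on a rank-local slab of a Poisson problem
   -laplace(u) = f on a uniform grid with spacing h.  MPI splits the
   domain into slabs (halo exchange not shown); OpenMP threads share
   the loop nest inside each slab. */
double jacobi_sweep(const double *u, double *un, const double *f,
                    int nx, int ny, double h2)   /* h2 = h*h */
{
    double resid = 0.0;
    #pragma omp parallel for collapse(2) reduction(+:resid)
    for (int j = 1; j < ny - 1; j++) {
        for (int i = 1; i < nx - 1; i++) {
            int k = j * nx + i;
            un[k] = 0.25 * (u[k - 1] + u[k + 1]
                          + u[k - nx] + u[k + nx] + h2 * f[k]);
            double d = un[k] - u[k];
            resid += d * d;   /* rank-local; combine with MPI_Allreduce */
        }
    }
    return resid;
}
```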

Multiple threads and parallel challenges for large simulations to accelerate a general Navier–Stokes CFD code on massively parallel systems

2012

Computational fluid dynamics is an increasingly important application domain for computational scientists. In this paper, we propose and analyze the optimizations necessary to run CFD simulations with multibillion-cell mesh models on large processor systems. Our investigation leverages the general industrial Navier-Stokes CFD application Code_Saturne, developed by Electricité de France for incompressible and nearly compressible flows. We outline the main bottlenecks and challenges for massively parallel systems and for emerging processor features such as many-core, transactional memory, and thread-level speculation. We also present an approach based on an octree search algorithm to facilitate the joining of mesh parts and to build larger, more complex unstructured meshes of several billion grid cells. We describe two parallel strategies for an algebraic multigrid solver, and we detail how to introduce new levels of parallelism based on compiler directives with OpenMP, transactional memory, and thread-level speculation for the finite-volume cell-centered formulation and its face-based loops. A renumbering scheme for mesh faces is proposed to enhance thread-level parallelism. Implementations capable of simulating multiple billions of cells or particles are beginning to emerge within the research community; nevertheless, one of the bigger challenges is to reach this capability with general industrial Navier-Stokes CFD software.
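The paper's actual renumbering scheme is not reproduced here, but the idea it serves can be sketched as a greedy face grouping: give each face the smallest group not yet used by either adjacent cell, so that a threaded loop over one group never updates the same cell twice. The C code below is an assumed illustration (with `c1 = -1` marking a boundary face), not Code_Saturne's algorithm.

```c
#include <stdint.h>
#include <stdlib.h>

/* Greedy face-grouping sketch: faces within one group share no cell,
   so each group can be swept with a parallel loop that accumulates
   face fluxes into cells without atomics or locks. */
int colour_faces(const int (*face_cell)[2], int nfaces, int ncells,
                 int *group)                    /* out: group id per face */
{
    uint64_t *used = calloc(ncells, sizeof *used); /* group bitmask per cell */
    int ngroups = 0;

    for (int f = 0; f < nfaces; f++) {
        int c0 = face_cell[f][0];
        int c1 = face_cell[f][1];               /* -1 on boundary faces */
        uint64_t busy = used[c0] | (c1 >= 0 ? used[c1] : 0);

        int g = 0;
        while (busy & (1ull << g)) g++;         /* first free group; < 64 assumed */
        group[f] = g;

        used[c0] |= 1ull << g;
        if (c1 >= 0) used[c1] |= 1ull << g;
        if (g + 1 > ngroups) ngroups = g + 1;
    }
    free(used);
    return ngroups;
}
```

Each group can then be traversed with its own `#pragma omp parallel for` over the faces it contains, which is the thread-safety property the renumbering is meant to provide.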

Towards the Implementation of Wind Turbine Simulations on Many-Core Systems

53rd AIAA Aerospace Sciences Meeting, 2015

We are concerned with alternative approaches to accelerating the matrix construction step, a computationally intensive portion of the Finite Element Method (FEM) framework. Our target application is part of a wind turbine simulation toolchain modeled using the Navier-Stokes equations for incompressible flow and discretized with the discontinuous Galerkin (DG) finite element method and implicit time-stepping. The Poisson pressure correction equation and the structural part of the fluid-structure interaction are formulated and solved numerically in the continuous Galerkin framework. The construction of the required matrix is performed by exploiting multiple Graphics Processing Units of a cluster using the CUDA programming model. The performance results indicate that our approach scales well as more nodes of the cluster, as well as more GPUs within each node, are exploited.
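Since the code samples in this collection are in C, the core of the GPU assembly strategy is illustrated below with its CPU analogue: each element scatters its local dense matrix into global storage with atomic adds, the role that atomicAdd plays in a CUDA kernel assigning one element per thread. The names and the dense global-matrix storage are simplifications, not the paper's implementation.

```c
#include <stddef.h>

/* Scatter-assembly sketch: each element adds its nen*nen local matrix
   into the global matrix.  Atomic updates resolve the races between
   elements that share degrees of freedom. */
void assemble(const double *Ke,       /* nelem blocks of nen*nen entries */
              const int *conn,        /* nelem * nen global dof indices  */
              int nelem, int nen,
              double *A, int ndof)    /* dense global matrix, ndof*ndof  */
{
    #pragma omp parallel for
    for (int e = 0; e < nelem; e++) {
        const double *ke = Ke + (size_t)e * nen * nen;
        const int *c = conn + (size_t)e * nen;
        for (int a = 0; a < nen; a++)
            for (int b = 0; b < nen; b++) {
                double v = ke[a * nen + b];
                #pragma omp atomic
                A[(size_t)c[a] * ndof + c[b]] += v;
            }
    }
}
```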

Parallelization of a three-dimensional flow solver for Euler rotorcraft aerodynamics predictions

AIAA Journal, 1996

An approach for parallelizing the three-dimensional Euler/Navier-Stokes rotorcraft computational fluid dynamics flow solver Transonic Unsteady Rotor Navier-Stokes (TURNS) is introduced. Parallelization is performed using a domain decomposition technique developed for distributed-memory parallel architectures. Communication between the subdomains on each processor is performed via message passing in the form of Message Passing Interface (MPI) subroutine calls. The most difficult portion of the TURNS algorithm to implement efficiently in parallel is the implicit time step using the lower-upper symmetric Gauss-Seidel (LU-SGS) algorithm. Two modifications of LU-SGS are proposed to improve parallel performance. First, a previously introduced Jacobi-like method called data-parallel lower-upper relaxation (DP-LUR) is used. Second, a new hybrid method is introduced that combines the Jacobi sweeping approach of DP-LUR for interprocessor communication with the symmetric Gauss-Seidel sweeps of LU-SGS for on-processor computations. The parallelized TURNS code with the modified implicit operator is implemented on two distributed-memory multiprocessors, the IBM SP2 and the Thinking Machines CM-5, and used to compute the three-dimensional quasisteady and unsteady flowfield of a helicopter rotor in forward flight. The code exhibits good parallel speedups with a low percentage of communication. The proposed hybrid algorithm requires less CPU time than DP-LUR while maintaining comparable parallel speedups and communication costs. Execution rates on the IBM SP2 are impressive: on 114 processors, the solution time of both quasisteady and unsteady calculations is reduced by a factor of about 12 relative to a single processor of the Cray C-90.
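The key property of DP-LUR is that it replaces the inherently sequential forward and backward Gauss-Seidel sweeps with a fixed number of Jacobi-like sweeps, each of which updates every cell independently and can therefore run in parallel. The C sketch below applies the idea to a scalar 1-D model problem; the names are assumed, and the actual TURNS operator is a block system over a 3-D mesh.

```c
#include <stdlib.h>
#include <string.h>

/* DP-LUR-style relaxation for (D + L + U) dq = R on a scalar 1-D model:
   a fixed number of Jacobi-like sweeps, each reading only the previous
   iterate, so every cell update is order-independent and parallel. */
void dp_lur(const double *Dinv,  /* inverted diagonal, per cell    */
            const double *L,     /* lower off-diagonal coefficient */
            const double *U,     /* upper off-diagonal coefficient */
            const double *R,     /* residual right-hand side       */
            double *dq, int n, int nsweeps)
{
    double *old = malloc(n * sizeof *old);

    for (int i = 0; i < n; i++) dq[i] = Dinv[i] * R[i];   /* sweep 0 */

    for (int s = 1; s < nsweeps; s++) {
        memcpy(old, dq, n * sizeof *old);
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double off = 0.0;
            if (i > 0)     off += L[i] * old[i - 1];
            if (i < n - 1) off += U[i] * old[i + 1];
            dq[i] = Dinv[i] * (R[i] - off);
        }
    }
    free(old);
}
```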