Execution of a parallel edge-based Navier–Stokes solver on commodity graphics processor units
Related papers
Acceleration of iterative Navier-Stokes solvers on graphics processing units
International Journal of Computational Fluid Dynamics, 2013
We implemented the pressure-implicit with splitting of operators (PISO) and semi-implicit method for pressure-linked equations (SIMPLE) solvers of the Navier-Stokes equations on Fermi-class graphics processing units (GPUs) using the CUDA technology. We also introduced a new format of sparse matrices optimized for performing elementary CFD operations, like gradient or divergence discretization, on GPUs. We verified the validity of the implementation on several standard, steady and unsteady problems. Computational efficiency of the GPU implementation was examined by comparing its double precision run times with those of essentially the same algorithms implemented in OpenFOAM. The results show that a GPU (Tesla C2070) can outperform a server-class 6-core, 12-thread CPU (Intel Xeon X5670) by a factor of 4.2.
Computational Fluid Dynamics Simulations Using Many Graphics Processors
Computing in Science & Engineering, 2012
Unsteady computational fluid dynamics simulations of turbulence are performed using up to 64 graphics processors. The results from two GPU clusters and a CPU cluster are compared. A second-order staggered-mesh spatial discretization is coupled with a low-storage three-step Runge-Kutta time advancement and pressure projection at each substep. The pressure Poisson equation dominates the solution time and is solved with the preconditioned Conjugate Gradient method. The CFD algorithm is optimized to use the fast shared memory on the GPUs and to overlap communication with computation. Detailed timings reveal that the internal calculations now occur so efficiently that the operations related to communication are the scaling bottleneck at all but the very largest problem sizes that can fit on the hardware.
2. Implementation
2.1 Hardware
We will primarily discuss results computed on the Lincoln supercomputer housed at NCSA. This machine has 96 Tesla S1070s (384 GPUs total). Each GPU has 4 GB of memory and a theoretical bandwidth of 102 GB/s to that memory. Each GPU has a 4x PCI-e Gen2 connection (2 GB/s) to its CPU host. In addition, we will also perform tests on Orion, an in-house GPU machine containing 4 NVIDIA 295 cards (8 GPUs). On Orion, each GPU has 0.9 GB of memory and a theoretical bandwidth of 112 GB/s. The connection between the GPUs on Orion uses an 8x PCI-e Gen2 connection (4 GB/s), and for simplicity communication still uses the MPI protocol. We also replaced the first and second GPUs with GTX 480 and Tesla C2070 cards in order to run some cases with these newer GPUs. These GPU results will be compared to the CPU cores on Lincoln (quad-core Intel 64 Harpertown).
The low-level CFD algorithm structure is dictated by two key features of the GPU hardware. First, GPU memory reads and writes are an order of magnitude faster when memory is accessed linearly; random reads and writes are comparatively slow. Second, each multiprocessor on the GPU has some very fast on-chip memory (shared memory), which serves essentially as an addressable, program-supervised cache. CFD, like almost all three-dimensional PDE solution applications, requires considerable random memory access (even when using structured meshes). Roughly 90% of these slow random memory accesses can be eliminated by: (1) linearly reading large chunks of data into the shared-memory space, which is fast for all accesses, (2) operating on the data in shared memory, and then (3) writing the processed data back to the main GPU memory (global memory) linearly. This optimization is the key to the 45x speedup of the GPU over a CPU.
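The three-step pattern described above (linear read into a fast buffer, compute, linear write back) can be illustrated with a minimal CPU-side sketch. This is not the authors' GPU code: it is a plain-Python emulation in which a small list stands in for shared memory, and the function names, the 1D second-difference stencil, and the tile size are all assumptions chosen for illustration.

```python
def laplacian_direct(u):
    """Reference 1D second difference; the two end values are left at zero."""
    n = len(u)
    out = [0.0] * n
    for i in range(1, n - 1):
        out[i] = u[i - 1] - 2.0 * u[i] + u[i + 1]
    return out


def laplacian_tiled(u, tile=8):
    """Same stencil, staged through a small per-tile buffer.

    The buffer plays the role of GPU shared memory in this emulation.
    """
    n = len(u)
    out = [0.0] * n
    for start in range(1, n - 1, tile):
        stop = min(start + tile, n - 1)
        # (1) One contiguous read: the tile plus a one-cell halo on each side.
        shared = u[start - 1:stop + 1]
        # (2) Every stencil access now hits the small local buffer.
        result = [shared[j - 1] - 2.0 * shared[j] + shared[j + 1]
                  for j in range(1, len(shared) - 1)]
        # (3) One contiguous write back to the large array.
        out[start:stop] = result
    return out
```

On a real GPU the tile would be loaded cooperatively by the threads of a block and the halo handling is more involved in 3D; the sketch only shows why each value is fetched from slow memory once per tile rather than once per stencil access.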
Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms
1995
We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time-accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies (the IBM SP and the Cray T3D). We investigate the impact of various networks, connecting the cluster of workstations, on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
32nd Aerospace Sciences Meeting and Exhibit
A breakthrough in computer performance is possible, as has been demonstrated in recent years [1], by using massively parallel machines [2]. Massively parallel computers use a very large number of processors operating simultaneously in either SIMD (Single Instruction Multiple Data) mode, MIMD (Multiple Instruction Multiple Data) mode, or a combination of the two. The availability of massively parallel machines in the market created a need for software capable of taking advantage of the new technology. Several problems arise when an efficient implementation on a massively parallel machine is sought. The most time-consuming part of a massively parallel computation is the interprocessor communication rather than floating point operations [3]. Effort must therefore be directed toward more efficient communication. An efficient treatment of boundary conditions is mandatory in massively parallel applications. Typically only a small portion of the 4
Strategies for parallelizing a Navier-Stokes code on the Intel Touchstone machines
International Journal for Numerical Methods in Fluids, 1992
The purpose of this paper is to predict the efficiency of the Navier-Stokes code NSS*, which will run on an MIMD architecture parallel machine. Computations are performed using a three-dimensional overlapping structured multiblock grid. Each processor works with some of these blocks, and exchanges data across the boundaries of the blocks. The efficiency of such a code depends on the number of grid points per processor, the amount of computation per grid point, and the amount of communication per boundary point. In this paper, we estimate these quantities for NSS*, and present measurements of communication times for two parallel machines, the Intel Touchstone Delta machine and an Intel iPSC/860 machine, consisting of 520 and 64 Intel i860 processors respectively. The peak performance of the Delta machine is 32 Gflops. Second, it is shown how, starting from a 7-block grid of about 5 000 000 points for the Hermes space plane, a mesh of 512 equally sized blocks is constructed, retaining the original topology. This example demonstrates that multiblock grids provide sufficient control over both the number and size of blocks. Therefore, it will be possible to simulate realistic configurations on massively parallel systems with a specified number of processors, while achieving good quality load balancing.
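The dependence of efficiency on points per processor, work per point, and communication per boundary point can be made concrete with a rough model. The sketch below is not the estimate used for NSS*: it assumes one cubic block of n³ points per processor, six exchanged faces, no overlap of communication with computation, and hypothetical rate parameters supplied by the caller.

```python
def block_efficiency(n, flops_per_point, bytes_per_boundary_point,
                     flop_rate, bandwidth):
    """Estimated parallel efficiency for one cubic block of n**3 grid points.

    flop_rate is in flop/s and bandwidth in bytes/s; both are assumed,
    machine-dependent inputs, not measured values from the paper.
    """
    points = n ** 3
    boundary_points = 6 * n ** 2          # six faces of the cube
    t_comp = points * flops_per_point / flop_rate
    t_comm = boundary_points * bytes_per_boundary_point / bandwidth
    # Fraction of wall time spent on useful computation.
    return t_comp / (t_comp + t_comm)


# Illustrative inputs only: efficiency rises as blocks get larger,
# because computation grows like n**3 and communication like n**2.
for n in (10, 20, 40):
    print(n, round(block_efficiency(n, 1000, 40, 1e7, 1e6), 3))
```

Under these assumptions the ratio of communication to computation scales as 1/n, which is why equally sized, sufficiently large blocks matter for load balance and efficiency.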
A GPU cluster optimized multigrid scheme for computing unsteady incompressible fluid flow
arXiv (Cornell University), 2013
A multigrid scheme is proposed for the pressure equation of the incompressible unsteady fluid flow equations, allowing efficient implementation on clusters of modern CPUs, many integrated core devices (MICs), and graphics processing units (GPUs). It is shown that the total number of synchronization events can be significantly reduced when a deep, 2h grid hierarchy is replaced with a two-level scheme using 16h-32h restriction, matching the width of the SIMD engine of modern CPUs and GPUs. In addition, optimal memory transfer is ensured, since no strided memory access is required. We report increased arithmetic intensity of the smoothing steps compared to the conventional additive correction multigrid (ACM); however, it is counterbalanced in runtime by the reduced number of expensive restriction steps. A systematic construction methodology for the coarse grid stencil is also presented that helps moderate the excess arithmetic intensity associated with the aggressive coarsening. Our higher-order interpolated stencil improves the convergence rate by minimizing spurious interference between the coarse and the fine scale solutions. The method is demonstrated on solving the pressure equation for 2D incompressible fluid flow: the benchmark setups cover shear-driven laminar flow in a cavity and direct numerical simulation (DNS) of a turbulent jet. We have compared our scheme to the ACM in terms of the arithmetic intensity of the iterations and the number of synchronization calls required. Strong scaling is also plotted for our scheme when using a hybrid OpenCL/MPI based parallelization.
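To see why aggressive coarsening cuts the number of restriction steps, one can simply count how many times a grid must be coarsened to reach a given coarse resolution. The helper below is a hypothetical illustration, not part of the paper's scheme: it counts restrictions per direction only, and the grid sizes in the example are assumed values.

```python
def restriction_steps(n_fine, factor, n_coarsest):
    """Count restrictions needed to coarsen n_fine points per direction
    down to at most n_coarsest points, dividing by `factor` each time."""
    steps = 0
    n = n_fine
    while n > n_coarsest:
        n = -(-n // factor)   # ceiling division
        steps += 1
    return steps


# A 1024-point direction coarsened to 32 points:
deep = restriction_steps(1024, 2, 32)        # 2h hierarchy
two_level = restriction_steps(1024, 32, 32)  # single 32h restriction
print(deep, two_level)  # prints "5 1"
```

Each restriction in a real solver typically implies at least one synchronization across the device or cluster, so fewer levels means fewer such events per cycle; the price, as the abstract notes, is more arithmetic in the smoothing steps.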
An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations
The Journal of Supercomputing, 1997
We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time-accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared-memory multiprocessor (the CRAY Y-MP), and distributed-memory multiprocessors with different topologies (the IBM SP and the CRAY T3D). We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message-passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Computational Fluid Dynamics and GPUs
2015
Computational Fluid Dynamics (CFD) simulations aim to reconstruct fluid motion and behaviour as accurately as possible, in order to better understand natural phenomena under specified conditions. Ideally, computational models would cover different scales and geometric configurations, but classic CFD solvers most often require long computational times to satisfy the convergence criteria. With the advent of heterogeneous compute platforms (including Graphics Processing Units, GPUs), CFD algorithms can now be implemented to give results in near real-time. The current paper briefly reviews and demonstrates, in a general way, two methods able to harness the power of GPUs to speed up numerical simulations of fluid flows for industrial applications. These are the Highly Simplified Marker and Cell Method (HSMAC) and the Lattice-Boltzmann Method (LBM), implemented on GPUs using OpenCL. This paper describes general capabilities for compute and graphics, and method p...
Asynchronous Navier-Stokes Solver on 3D Unstructured Grids for the Exascale Era
2019
This project has developed multiple fluid dynamics solvers for complex 3D flows using fully asynchronous distributed-memory task-parallel algorithms on top of the Charm++ [1] runtime system. The algorithms solve the Euler or Navier-Stokes equations of compressible flows using unstructured tetrahedron meshes with optional solution-adaptive mesh and polynomial-order refinement. We have demonstrated excellent strong scaling up to 50K compute cores and the benefits of Charm++'s automatic load balancing.
2 Accomplishments at a glance
• Implemented the first unstructured-mesh partial differential equations (PDE) solver on the Charm++ runtime system with automatic load balancing.
• Demonstrated, for the first time, that excellent parallel performance can be achieved with Charm++ for a PDE solver on unstructured grids, useful for complex 3D engineering problem geometries.
• Implemented both node-centered and cell-centered finite element algorithms for the simulation of compressible high-speed flows. All algorithms are in 3D, fully asynchronous, task-based, distributed-memory-parallel, and exhibit excellent strong scaling up to 50K CPU cores, the most tested.
• Developed and implemented a new adaptive DG algorithm that automatically adjusts the order of the approximation polynomial based on local error estimators and exercised it for single-material verification cases on 3D unstructured meshes with Charm++'s automatic load balancing capabilities.
• Developed, implemented, and verified a new DG method for compressible multi-material flows.
• Implemented adaptive mesh refinement in 3D using an asynchronous, distributed-memory, task-parallel algorithm.
• Developed the code in a production fashion, with extensive unit and regression test suites and a high degree of code reuse using the C++17 standard. Also exercised mandatory code reviews and test code coverage analysis, using LANL-internal and public-facing continuous integration servers.
• Implemented various code capabilities that enable large-scale fluid dynamics, e.g., file/rank N-to-M parallel I/O, checkpoint/restart, and compile-time-configurable zero-runtime-overhead memory layout for large-data arrays to enable performance portability across different architectures.
• Released the code as open source, see https://quinoacomputing.org.
Finite difference simulations of the Navier-Stokes equations using parallel distributed computing
Proceedings. 15th Symposium on Computer Architecture and High Performance Computing, 2003
This paper discusses the implementation of a numerical algorithm for simulating incompressible fluid flows based on the finite difference method and designed for parallel computing platforms with distributed memory, particularly clusters of workstations. The solution algorithm for the Navier-Stokes equations utilizes an explicit scheme for pressure and an implicit scheme for velocities, i.e., the velocity field at a new time step can be computed once the corresponding pressure is known. The parallel implementation is based on domain decomposition, where the original calculation domain is decomposed into several blocks, each of which is assigned to a separate processing node. All nodes then execute computations in parallel, each node on its associated sub-domain. The parallel computations include initialization, coefficient generation, linear solution on the sub-domain, and inter-node communication. The exchange of information across the sub-domains, or processors, is achieved using the Message Passing Interface standard, MPI. The use of MPI ensures portability across different computing platforms ranging from massively parallel machines to clusters of workstations. The execution time and speed-up are evaluated by comparing the performance of different numbers of processors. The results indicate that the parallel code can significantly improve prediction capability and efficiency for large-scale simulations.
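The domain-decomposition idea (each node updates its own block and needs one layer of neighbour values) can be sketched without MPI. The example below is an assumption-laden stand-in rather than the paper's algorithm: it uses a 1D explicit diffusion step instead of the explicit-pressure/implicit-velocity scheme, the function names are hypothetical, and reading ghost values from the global list takes the place of MPI send/receive between nodes.

```python
def diffuse_global(u, alpha):
    """One explicit diffusion step on the whole 1D domain (end values fixed)."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def diffuse_decomposed(u, alpha, nblocks):
    """Same step, computed block by block with one ghost cell per side."""
    n = len(u)
    bounds = [(b * n // nblocks, (b + 1) * n // nblocks)
              for b in range(nblocks)]
    result = []
    for lo, hi in bounds:
        # Ghost cells: in an MPI code these values would arrive in
        # messages from the left and right neighbour nodes.
        left_ghost = u[lo - 1] if lo > 0 else None
        right_ghost = u[hi] if hi < n else None
        local = u[lo:hi]
        padded = [left_ghost] + local + [right_ghost]
        new_local = local[:]
        for j in range(1, len(padded) - 1):
            if padded[j - 1] is None or padded[j + 1] is None:
                continue  # physical boundary point: value held fixed
            new_local[j - 1] = padded[j] + alpha * (
                padded[j - 1] - 2.0 * padded[j] + padded[j + 1])
        result.extend(new_local)
    return result
```

Because every block uses only its own points plus the two ghost values, the per-block loops are independent and could run on separate nodes; the decomposed result matches the global update point for point.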