Improving the performance of large-scale unstructured PDE applications

Automatic partitioning of unstructured meshes for the parallel solution of problems in computational mechanics

International Journal for Numerical Methods in Engineering, 1993

Most of the recently proposed computational methods for solving partial differential equations on multiprocessor architectures stem from the 'divide and conquer' paradigm and involve some form of domain decomposition. For those methods which also require grids of points or patches of elements, it is often necessary to explicitly partition the underlying mesh, especially when working with local memory parallel processors. In this paper, a family of cost-effective algorithms for the automatic partitioning of arbitrary two- and three-dimensional finite element and finite difference meshes is presented and discussed in view of a domain decomposed solution procedure and parallel processing. The influence of the algorithmic aspects of a solution method (implicit/explicit computations), and the architectural specifics of a multiprocessor (SIMD/MIMD, startup/transmission time), on the design of a mesh partitioning algorithm is discussed. The impact of the partitioning strategy on load balancing, operation count, operator conditioning, rate of convergence and processor mapping is also addressed. Finally, the proposed mesh decomposition algorithms are demonstrated with realistic examples of finite element, finite volume, and finite difference meshes associated with the parallel solution of solid and fluid mechanics problems on the iPSC/2 and iPSC/860 multiprocessors.
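The two partition-quality measures mentioned above, load balance and the amount of coupling between subdomains, can be illustrated with a small sketch. It uses recursive coordinate bisection, a generic textbook partitioner chosen here only for brevity; it is not one of the algorithms proposed in the paper. The function names `recursive_coordinate_bisection` and `edge_cut` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def recursive_coordinate_bisection(points: Sequence[Point], levels: int) -> List[int]:
    """Assign each point to one of 2**levels parts.

    Each step splits the current set at the median of its longest axis.
    This is a generic illustration, not the paper's partitioner.
    """
    part = [0] * len(points)

    def split(indices: List[int], level: int, label: int) -> None:
        if level == 0 or len(indices) <= 1:
            # Shift the label so that early leaves never collide with deeper ones.
            for i in indices:
                part[i] = label << level
            return
        dim = len(points[indices[0]])
        extents = [
            max(points[i][d] for i in indices) - min(points[i][d] for i in indices)
            for d in range(dim)
        ]
        axis = extents.index(max(extents))  # longest axis of the bounding box
        ordered = sorted(indices, key=lambda i: points[i][axis])
        half = len(ordered) // 2
        split(ordered[:half], level - 1, 2 * label)
        split(ordered[half:], level - 1, 2 * label + 1)

    split(list(range(len(points))), levels, 0)
    return part


def edge_cut(edges: Sequence[Tuple[int, int]], part: Sequence[int]) -> int:
    """Count mesh edges whose endpoints lie in different parts."""
    return sum(1 for a, b in edges if part[a] != part[b])


# Usage: a 4 x 4 structured grid of nodes split into 4 parts.
n = 4
grid_points = [(float(i), float(j)) for i in range(n) for j in range(n)]
grid_edges = []
for i in range(n):
    for j in range(n):
        node = i * n + j
        if i + 1 < n:
            grid_edges.append((node, node + n))
        if j + 1 < n:
            grid_edges.append((node, node + 1))

grid_part = recursive_coordinate_bisection(grid_points, levels=2)
print(sorted(grid_part.count(k) for k in set(grid_part)))  # part sizes: [4, 4, 4, 4]
print(edge_cut(grid_edges, grid_part))                     # cut edges: 8
```

Equal part sizes stand in for load balance, and the edge cut is a rough proxy for communication volume. The abstract also lists operator conditioning and processor mapping as concerns, which this sketch does not attempt to model.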

A Scalable Strategy for the Parallelization of Multiphysics Unstructured Mesh-Iterative Codes on Distributed-Memory Systems

International Journal of High Performance Computing Applications, 2000

Realizing scalable performance on high performance computing systems is not straightforward for single-phenomenon codes (such as computational fluid dynamics [CFD]). This task is magnified considerably when the target software involves the interactions of a range of phenomena that have distinctive solution procedures involving different discretization methods. The problems of addressing the key issues of retaining data integrity and the ordering of the calculation procedures are significant. A strategy for parallelizing this multiphysics family of codes is described for software exploiting finite-volume discretization methods on unstructured meshes using iterative solution procedures. A mesh partitioning-based SPMD approach is used. However, since different variables use distinct discretization schemes, this means that distinct partitions are required; techniques for addressing this issue are described using the mesh-partitioning tool, JOSTLE. In this contribution, the strategy is tested for a variety of test cases under a wide range of conditions (e.g., problem size, number of processors, asynchronous/synchronous communications, etc.) using a variety of strategies for mapping the mesh partition onto the processor topology.

Domain decomposer: A software tool for mapping PDE computations to parallel architectures

1990

Domain decomposition methods have proved to be an efficient approach for parallel processing of partial differential equations (PDEs) on parallel architectures. Their built-in coarse-grain parallelism makes them suitable for MIMD computing, as a methodology to ensure that the algebraic data are generated and distributed across processors so that the processor workload is balanced and the synchronization/communication cost is kept to a minimum. These requirements can introduce serious computational costs, since optimum workload balance and minimum synchronization/communication cost often involve the solution of NP-hard problems. In this paper we outline a software infrastructure consisting of "fast" heuristics for determining an "optimal" mapping of PDE data suitable for domain decomposition methods. Furthermore, we describe a software system which assists the user in visualizing and manipulating such mappings in the environment of the parallel-ELLPACK system. University of Thessaloniki, Polytechnic School, Thessaloniki, Greece. This research was supported in part by AFOSR 88-0234, ARO grant DAAG29-83-K-0026, NSF grant CCF-8619817 and ESPRIT project GENESIS.

Mesh Partitioning Algorithms for the Parallel Solution of Partial Differential Equations

Most of the recently proposed computational methods for solving partial differential equations on multiprocessor architectures stem from the "divide and conquer" paradigm and involve some form of domain decomposition. For those methods which also require grids of points or patches of elements, it is often necessary to explicitly partition the underlying mesh, especially when working with local memory parallel processors. In this paper, a family of cost-effective algorithms for the automatic partitioning of arbitrary two- and three-dimensional finite element and finite difference meshes is presented and discussed in view of a domain-decomposed solution procedure and parallel processing.

Computational Efficiency of Parallel Unstructured Finite Element Simulations

High Performance Computing on Vector Systems, 2006

In this paper we address various efficiency aspects of finite element (FE) simulations on vector computers. Especially for the numerical simulation of large-scale Computational Fluid Dynamics (CFD) and Fluid-Structure Interaction (FSI) problems, efficiency and robustness of the algorithms are two key requirements. In the first part of this paper, a straightforward concept is described to increase the performance of the integration of finite elements in arbitrary, unstructured meshes by allowing for vectorization. In addition, the effect of different programming languages and different array management techniques on the performance is investigated. Besides the element calculation, the solution of the linear system of equations takes a considerable part of the computation time. Using the jagged diagonal format (JAD) for the sparse matrix, the average vector length can be increased. Block-oriented computation schemes lead to considerably less indirect addressing while packaging more instructions. Thus, the overall performance of the iterative solver can be improved. The last part discusses the input and output facility of parallel scientific software. Next to efficiency, the crucial requirements for the I/O subsystem in a parallel setting are scalability, flexibility and long-term reliability.
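To make the remark about vector length concrete, here is a minimal pure-Python sketch of the jagged diagonal format and its matrix-vector product. It assumes the common textbook layout (rows sorted by decreasing nonzero count, with the k-th nonzero of every row stored contiguously); the paper's actual data structures are not given in the abstract, and the names `to_jad` and `jad_matvec` are hypothetical.

```python
def to_jad(dense):
    """Convert a dense square matrix (list of lists) to jagged diagonal storage.

    Returns (perm, ptr, cols, values):
      perm   - row indices sorted by decreasing number of nonzeros
      ptr    - start offset of each jagged diagonal in cols/values
      cols   - column index of every stored entry
      values - the stored nonzero entries
    """
    n = len(dense)
    rows = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
    perm = sorted(range(n), key=lambda i: -len(rows[i]))
    max_len = len(rows[perm[0]]) if n else 0
    values, cols, ptr = [], [], [0]
    for k in range(max_len):
        for i in perm:
            if len(rows[i]) <= k:
                break  # rows are sorted by length, so all later rows are shorter
            j, v = rows[i][k]
            cols.append(j)
            values.append(v)
        ptr.append(len(values))
    return perm, ptr, cols, values


def jad_matvec(perm, ptr, cols, values, x):
    """Compute y = A @ x for a matrix stored in jagged diagonal form."""
    y = [0.0] * len(perm)
    for k in range(len(ptr) - 1):
        start, end = ptr[k], ptr[k + 1]
        # On a vector machine this inner loop is the long, vectorizable one:
        # its length is the number of rows having more than k nonzeros.
        for r in range(end - start):
            y[perm[r]] += values[start + r] * x[cols[start + r]]
    return y


# Usage on a small 3 x 3 example.
A = [[4, 0, 1],
     [0, 3, 0],
     [2, 0, 5]]
perm, ptr, cols, values = to_jad(A)
print(jad_matvec(perm, ptr, cols, values, [1, 2, 3]))  # [7.0, 6.0, 17.0]
```

The inner loop runs over rows rather than over the few nonzeros of a single row, which is why the average vector length grows compared with a row-wise compressed format. The gather through `cols` is still indirect addressing, the cost that the block-oriented schemes mentioned above aim to reduce.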

Parallel domain discretization algorithm for RBF-FD and other meshless numerical methods for solving PDEs

Computers & Structures, 2022

In this paper, we present a novel parallel dimension-independent node positioning algorithm that is capable of generating nodes with variable density, suitable for meshless numerical analysis. A very efficient sequential algorithm based on Poisson disc sampling is parallelized for use on shared-memory computers, such as modern workstations with multi-core processors. The parallel algorithm uses a global spatial indexing method with its data divided into two levels, which allows for an efficient multi-threaded implementation. The addition of bootstrapping enables the algorithm to use any number of parallel threads while remaining as general as its sequential variant. We demonstrate the algorithm performance on six complex 2- and 3-dimensional domains, which are either of non-rectangular shape or have varying nodal spacing or both. We perform a run-time analysis of the algorithm to demonstrate its ability to reach high speedups regardless of the domain and to show how well it scales on the experimental hardware with 16 processor cores. We also analyse the algorithm in terms of the effects of domain shape, quality of point placement, and various parallelization overheads.
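For readers unfamiliar with Poisson disc sampling, the following is a minimal sequential sketch in the style of Bridson's well-known algorithm, restricted to a 2D rectangle with constant spacing `r`. It is an assumption-laden illustration only: the paper's algorithm is dimension-independent, supports variable density, and is parallel, none of which is reproduced here. The function name and parameters are hypothetical.

```python
import math
import random


def poisson_disc_2d(width, height, r, k=30, seed=0):
    """Return points in [0, width) x [0, height) that are pairwise at least r apart.

    A background grid with cell size r / sqrt(2) holds at most one point per
    cell, so each candidate only needs to be checked against nearby cells.
    """
    rng = random.Random(seed)
    cell = r / math.sqrt(2)
    grid = {}  # (ix, iy) -> accepted point lying in that cell

    def cell_of(p):
        return (int(p[0] / cell), int(p[1] / cell))

    def fits(p):
        if not (0 <= p[0] < width and 0 <= p[1] < height):
            return False
        cx, cy = cell_of(p)
        # Any point closer than r lies within two cells in each direction.
        for ix in range(cx - 2, cx + 3):
            for iy in range(cy - 2, cy + 3):
                q = grid.get((ix, iy))
                if q is not None and math.hypot(p[0] - q[0], p[1] - q[1]) < r:
                    return False
        return True

    first = (rng.random() * width, rng.random() * height)
    grid[cell_of(first)] = first
    points, active = [first], [first]
    while active:
        idx = rng.randrange(len(active))
        base = active[idx]
        for _ in range(k):
            # Try a candidate in the annulus between r and 2r around the base point.
            angle = rng.uniform(0.0, 2.0 * math.pi)
            radius = rng.uniform(r, 2.0 * r)
            cand = (base[0] + radius * math.cos(angle),
                    base[1] + radius * math.sin(angle))
            if fits(cand):
                grid[cell_of(cand)] = cand
                points.append(cand)
                active.append(cand)
                break
        else:
            # No candidate fit after k tries: this point is saturated.
            active.pop(idx)
    return points


nodes = poisson_disc_2d(1.0, 1.0, r=0.1)
print(len(nodes))  # number of generated nodes (depends on the seed)
```

The background grid plays the role of the spatial index in this sketch. The paper's two-level global index is what lets several threads work concurrently, which a single dictionary like this does not address.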

A DOMAIN-DECOMPOSITION BASED PARALLEL PROCEDURE FOR THE COMBINED FINITE-DISCRETE ELEMENT METHOD IN 2D

Although the Combined Finite-Discrete Element Method (FDEM) has proven itself in dealing with problems of complex shapes, fracture and fragmentation, its CPU requirements become severe for industrial-scale problems; in other words, there is a compelling need for a parallel-processing framework to address large-scale and grand-challenge problems. One of the more recent development efforts in the context of FDEM was directed at implementing the parallelization techniques needed for this method. In this paper an FDEM parallelization framework has been developed. Static domain decomposition and message-passing inter-processor communication have been implemented in the FDEM code. The performance of the FDEM code in three typical problems is presented. For a discrete particle problem, a speed-up of over 900 times has been obtained on 1000 processors. It has also been shown that the performance, especially efficiency of the parallelized software, still depends on the p...

The efficient parallel solution of PDEs

Computers & Mathematics with Applications, 1996

The report presents some results in solving finite element equations via a parallel version of the preconditioned CG method (ParPCG). We use a nonoverlapping domain decomposition and construct preconditioners based on Additive and Multiplicative Schwarz Methods (ASM/MSM). The components used in the preconditioner are multigrid methods, hierarchical bases, new extension techniques, and modified BPS- and BPX-preconditioners for handling the unknowns at the coupling nodes on the boundaries between subdomains. The scale-up efficiency of the algorithm (i.e., an increasing number of processors accompanied by an increasing problem size) is larger than 95% when the number of processors is doubled. Even the speed-up efficiency (i.e., an increasing number of processors at constant problem size), which is of little practical relevance, reaches 80% when the number of processors is doubled.
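The two efficiency figures can be written down explicitly. The abstract does not state its definitions, so the following uses the standard ones as an assumption, with T(N, p) denoting the wall-clock time for problem size N on p processors:

```latex
% Assumed (standard) definitions; the abstract does not give its own.
E_{\text{scale}}(p \to 2p) = \frac{T(N, p)}{T(2N, 2p)},
\qquad
E_{\text{speed}}(p \to 2p) = \frac{T(N, p)}{2\,T(N, 2p)}
```

Under these definitions, a scale-up efficiency above 95% means that doubling both the problem size and the processor count increases the run time by less than about 5.3%, and a speed-up efficiency of 80% means that doubling the processors at fixed problem size yields a speedup of 2 x 0.8 = 1.6.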

Domain decomposition on parallel computers

IMPACT of Computing in Science and Engineering, 1989

Yale University, Department of Computer Science. We consider the application of domain decomposition techniques to the solution of sparse linear systems arising from implicit PDE discretizations on parallel computers. Representatives of two popular MIMD architectures, message passing (the Intel iPSC/2-SX) and shared memory (the Encore Multimax 320), are employed. We run the same numerical experiments on each, namely stripwise and boxwise decompositions of the unit square, using up to 64 subdomains and containing up to 64K degrees of freedom. We produce a tight-fitting complexity model for the former and discuss the difficulty of doing so for the latter. We also evaluate which of three types of domain decomposition preconditioners that have appeared in the literature of self-adjoint elliptic problems are most efficient in different regions of machine-problem parameter space. Some form of global sharing of information in the preconditioner is required for efficient overall parallel implementation in the region of most practical interest (large problem sizes and large numbers of processors); otherwise, an increasing iteration count inveighs against the gains of concurrency. Our results on a per iteration basis also hold for sparse discrete systems arising from other types of partial differential equations, but in the absence of a theory for the dependence of the convergence rate upon the granularity of the decomposition, the overall results are only suggestive for more general systems.
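The difference between stripwise and boxwise decompositions can be illustrated by counting the grid points an interior subdomain shares with its neighbours. The sketch below assumes a uniform n x n grid on the unit square, with n = 256 chosen so that n * n matches the 64K degrees of freedom mentioned above. That grid shape and the function name `interface_points` are assumptions made for illustration, not details taken from the paper.

```python
import math


def interface_points(n, p, shape):
    """Interface length (in grid points) of one interior subdomain.

    n     - grid points per side of the square grid (assumed uniform)
    p     - number of subdomains
    shape - "strip" for p horizontal strips, "box" for a sqrt(p) x sqrt(p) layout
    """
    if shape == "strip":
        if n % p:
            raise ValueError("n must be divisible by p for equal strips")
        return 2 * n  # two full-width edges shared with the neighbours
    if shape == "box":
        q = math.isqrt(p)
        if q * q != p or n % q:
            raise ValueError("p must be a perfect square whose root divides n")
        return 4 * (n // q)  # four edges of length n / sqrt(p)
    raise ValueError("shape must be 'strip' or 'box'")


print(interface_points(256, 64, "strip"))  # 512
print(interface_points(256, 64, "box"))    # 128
```

In this simple count the strip interface stays at 2n regardless of p, while the box interface shrinks like 4n / sqrt(p). Message count works the other way, since a box has more neighbours than a strip, so the better choice depends on the machine's startup and transmission costs.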

Parallel Library for Unstructured Mesh Problems

The growing class of applications which solve partial differential equations (PDEs) on unstructured adaptive meshes is considered. The solution to the resulting sparse, non-symmetric and in most cases ill-conditioned systems is often obtained using iterative methods. The programming complexity of such applications on parallel architectures is well known. The development of a Parallel Library for Unstructured Mesh Problems (PLUMP), which supports the transparent use of parallel machines for such applications, is addressed. PLUMP exploits the common denominators in such problems, provides key kernels such as the matrix-vector product and preconditioners for a wide range of iterative solvers, and supports the parallelization of this class of applications in a clean and concise manner. The PLUMP library is implemented in C and FORTRAN77 using the Message-Passing Interface (MPI) and is available free under copyright for research purposes.