Salvatore Filippone - Profile on Academia.edu

Papers by Salvatore Filippone

Parallel Computing, Mar 1, 2018

Coarrays have been part of the Fortran standard since Fortran 2008 and provide a syntactic extension of Fortran to support parallel programming, often called Coarray Fortran (CAF). Although MPI is the de facto standard for parallel programs running on distributed memory systems and little scientific software is written in CAF, many scientific applications could benefit from the use of CAF. We present the migration from MPI to CAF of the libraries PSBLAS and MLD2P4 for the solution of large systems of equations using iterative methods and preconditioners. In this paper we describe some investigations for implementing the necessary communication steps in PSBLAS and MLD2P4 and provide performance results obtained on linear systems arising from discretization of 2D and 3D PDEs.

Parallel Computing, Apr 1, 2016

Accelerators such as NVIDIA GPUs and Intel MICs are currently provided as co-processor devices, usable only through a CPU host. For Intel MICs it is planned that this constraint will be lifted in the near future: CPU and accelerator(s) will then form a single, many-core processor capable of a peak performance of several Teraflops with high energy efficiency. In order to exploit the available computational power, the user will be compelled to write more "hardware-aware" code, in contrast to the common philosophy of hiding hardware details as much as possible. The simple two-sided communication approach often used in message-passing applications introduces synchronization costs that may limit performance on next-generation machines. PGAS languages, like coarray Fortran and UPC, propose a one-sided approach where a process directly accesses the remote memory of another process without interrupting its execution. In this paper, we propose a CUDA-aware coarray implementation, capable of merging the expressive syntax of coarrays with the computational power of GPUs. We propose a new keyword for the Fortran language, which allows the user to map some hardware features of many-core machines with a high-level syntax. Our hybrid coarray implementation is based on OpenCoarrays, the coarray transport layer currently adopted by the GNU Fortran compiler.

Toward test-driven development of scientific applications with coarray Fortran

Vectorized ILU preconditioners for general sparsity patterns

Trilinos Tutorial (Slides)

Many scientists who implement computational science and engineering software have adopted the object-oriented (OO) Fortran paradigm. One of the challenges faced by OO Fortran developers is the inability to obtain high-level software design descriptions of existing applications. Knowledge of the overall software design is not only valuable in the absence of documentation, it can also assist developers with different tasks during the software development process, especially maintenance and refactoring. The software engineering community commonly uses reverse engineering techniques to deal with this challenge. A number of reverse engineering-based tools have been proposed, but few of them can be applied to OO Fortran applications. In this paper, we propose a software tool to extract unified modeling language (UML) class diagrams from Fortran code. The UML class diagram facilitates the developers' ability to examine the entities and their relationships in the software system. The extracted diagrams enhance software maintenance and evolution. The experiments carried out to evaluate the proposed tool show its accuracy and a few of its limitations.

Domain decomposition based on spatial locality is a classical data-parallel problem whose solution may improve by orders of magnitude when implemented on a GPU. Among the data structures involved in domain decomposition, uniform grids are widely used to speed up simulations in a number of fields, including computational physics and graphics. In this work, we present two commonly used approaches to generate uniform grids on GPUs and propose a new single-pass method that has several advantages over the previous ones. We also present some performance results of our CUDA implementation of a broad-phase collision detection algorithm for particle simulation, comparing the different methods. In some tests our method achieves a speedup of 2 compared to the fastest known method supporting a fixed maximum number of elements per cell, and a speedup of 7 compared with the fastest method without such a constraint.
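To make the idea of a uniform grid concrete, here is a minimal sequential sketch that bins 2D points into cells with a counting sort (count, prefix sum, scatter). It is a generic textbook construction written in Python for readability; it is not the single-pass GPU method proposed in the paper, and the function name `build_uniform_grid` and its parameters are hypothetical.

```python
def build_uniform_grid(points, cell_size, nx, ny):
    """Bin 2D points into an nx-by-ny uniform grid with a counting sort.

    Returns (starts, order): the points of cell c are
    order[starts[c]:starts[c + 1]], given as indices into `points`.
    """
    def cell_of(p):
        # Clamp to the grid so points on or past the border stay in range.
        cx = min(max(int(p[0] // cell_size), 0), nx - 1)
        cy = min(max(int(p[1] // cell_size), 0), ny - 1)
        return cy * nx + cx

    cell_ids = [cell_of(p) for p in points]

    # Pass 1: count how many points fall in each cell.
    counts = [0] * (nx * ny)
    for c in cell_ids:
        counts[c] += 1

    # Pass 2: exclusive prefix sum gives the start offset of each cell.
    starts = [0] * (nx * ny + 1)
    for i, n in enumerate(counts):
        starts[i + 1] = starts[i] + n

    # Pass 3: scatter point indices into their cell's slot.
    cursor = starts[:-1].copy()
    order = [0] * len(points)
    for idx, c in enumerate(cell_ids):
        order[cursor[c]] = idx
        cursor[c] += 1
    return starts, order


points = [(0.5, 0.5), (1.5, 0.2), (0.1, 0.9), (1.9, 1.9)]
starts, order = build_uniform_grid(points, cell_size=1.0, nx=2, ny=2)
```

A broad-phase collision test then only needs to compare each point against the points stored in its own cell and the neighbouring cells, rather than against every other point.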

Approximate Inverse Preconditioners for Krylov Methods on Heterogeneous Parallel Computers

Parallel Computing, 2013

The popularity of GPGPUs in high performance platforms for scientific computing in recent times has renewed interest in approximate inverse preconditioners for Krylov methods. We have recently introduced some new algorithmic variants [6] of popular approximate inverse methods. We now report on the behaviour of these variations in high performance multilevel preconditioning frameworks, and we present the software framework that enables

Why diffusion‐based preconditioning of Richards equation works: Spectral analysis and computational experiments at very large scale

Numerical Linear Algebra with Applications

We consider here a cell-centered finite difference approximation of the Richards equation in three dimensions, averaging for interface values the hydraulic conductivity, a highly nonlinear function, by arithmetic, upstream and harmonic means. The nonlinearities in the equation can lead to changes in soil conductivity over several orders of magnitude and discretizations with respect to space variables often produce stiff systems of differential equations. A fully implicit time discretization is provided by the backward Euler one-step formula; the resulting nonlinear algebraic system is solved by an inexact Newton Armijo-Goldstein algorithm, requiring the solution of a sequence of linear systems involving Jacobian matrices. We prove some new results concerning the distribution of the Jacobians' eigenvalues and the explicit expression of their entries. Moreover, we explore some connections between the saturation of the soil and the ill-conditioning of the Jacobians. The information on eige...

Computers & Mathematics with Applications

In this paper, we discuss the convergence of an Algebraic MultiGrid (AMG) method for general symmetric positive-definite matrices. The method relies on an aggregation algorithm, named coarsening based on compatible weighted matching, which exploits the interplay between the principle of compatible relaxation and the maximum product matching in undirected weighted graphs. The results are based on a general convergence analysis theory applied to the class of AMG methods employing unsmoothed aggregation and identifying a quality measure for the coarsening; similar quality measures were originally introduced and applied to other methods as tools to obtain good quality aggregates leading to optimal convergence for M-matrices. The analysis, as well as the coarsening

In this paper we describe some work aimed at upgrading the Alya code with up-to-date parallel linear solvers capable of achieving reliability, efficiency and scalability in the computation of the pressure field at each time step of the numerical procedure for solving a LES formulation of the incompressible Navier-Stokes equations. We developed a software module in Alya's kernel to interface the libraries included in the current version of PSCToolkit, a framework for the iterative solution of sparse linear systems on parallel distributed-memory computers by Krylov methods coupled to Algebraic MultiGrid preconditioners. The Toolkit has undergone some extensions within the EoCoE-II project with the primary goal of facing the exascale challenge. Results on a realistic benchmark for airflow simulations in wind farm applications show that the PSCToolkit solvers significantly outperform the original versions of the Conjugate Gradient method available in the Alya kernel in terms of sc...

AMG Preconditioners based on Parallel Hybrid Coarsening and Multi-objective Graph Matching

2023 31st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)

arXiv (Cornell University), Dec 9, 2021

We consider here a cell-centered finite difference approximation of the Richards equation in three dimensions, averaging for interface values the hydraulic conductivity, a highly nonlinear function, by arithmetic, upstream and harmonic means. The nonlinearities in the equation can lead to changes in soil conductivity over several orders of magnitude and discretizations with respect to space variables often produce stiff systems of differential equations. A fully implicit time discretization is provided by the backward Euler one-step formula; the resulting nonlinear algebraic system is solved by an inexact Newton Armijo-Goldstein algorithm, requiring the solution of a sequence of linear systems involving Jacobian matrices. We prove some new results concerning the distribution of the Jacobians' eigenvalues and the explicit expression of their entries. Moreover, we explore some connections between the saturation of the soil and the ill-conditioning of the Jacobians. The information on eigenvalues justifies the effectiveness of some preconditioner approaches which are widely used in the solution of the Richards equation. We also propose a new software framework to experiment with scalable and robust preconditioners suitable for efficient parallel simulations at very large scales. Performance results on a literature test case show that our framework is very promising in the advance towards realistic simulations at extreme scale.
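As a toy illustration of the time-stepping scheme named in the abstract above, the sketch below performs one backward Euler step for a scalar ODE u' = f(u), solving the implicit equation with Newton's method and an Armijo-type backtracking line search on the residual. It is a deliberate simplification under stated assumptions: the paper treats a large 3D system and solves each Newton linear system inexactly with preconditioned Krylov methods, whereas here the "Jacobian" is a single number inverted exactly. The function name `backward_euler_step` and the constants 1e-4 and 0.5 are illustrative choices, not taken from the paper.

```python
def backward_euler_step(f, dfdu, u_old, dt, tol=1e-12, max_iter=50):
    """Solve g(v) = v - u_old - dt * f(v) = 0 for the new value v."""
    v = u_old  # initial guess: previous time level
    for _ in range(max_iter):
        g = v - u_old - dt * f(v)
        if abs(g) < tol:
            break
        # Newton direction: g'(v) = 1 - dt * f'(v) is the scalar Jacobian.
        step = -g / (1.0 - dt * dfdu(v))
        # Backtracking: halve the step until the residual decreases enough.
        lam = 1.0
        while lam > 1e-8:
            trial = v + lam * step
            g_trial = trial - u_old - dt * f(trial)
            if abs(g_trial) <= (1.0 - 1e-4 * lam) * abs(g):
                break
            lam *= 0.5
        v = v + lam * step
    return v


# Linear test problem u' = -10 u: backward Euler gives u_old / (1 + 10 dt).
u_lin = backward_euler_step(lambda u: -10.0 * u, lambda u: -10.0, 1.0, 0.1)

# Nonlinear test problem u' = -u**3.
u_cubic = backward_euler_step(lambda u: -u**3, lambda u: -3.0 * u**2, 1.0, 0.1)
```

In the system case the division by the scalar Jacobian becomes a sparse linear solve, which is where the preconditioners discussed in the abstract come into play.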

arXiv (Cornell University), Oct 29, 2022

In this paper, we describe an upgrade of the Alya code with up-to-date parallel linear solvers capable of achieving reliability, efficiency and scalability in the computation of the pressure field at each time step of the numerical procedure for solving a Large Eddy Simulation formulation of the incompressible Navier-Stokes equations. We developed a software module in Alya's kernel to interface the libraries included in the current version of PSCToolkit, a framework for the iterative solution of sparse linear systems on parallel distributed-memory computers by Krylov methods coupled to Algebraic MultiGrid preconditioners. The Toolkit has undergone various extensions within the EoCoE-II project with the primary goal of facing the exascale challenge. Results on a realistic benchmark for airflow simulations in wind farm applications show that the PSCToolkit solvers significantly outperform the original versions of the Conjugate Gradient method available in Alya's kernel in terms of scalability and parallel efficiency and represent a very promising software layer to move the Alya code towards exascale.

arXiv (Cornell University), Jan 27, 2020

Parallel Sparse Computation Toolkit

Software impacts, Mar 1, 2023

The multiplication of a sparse matrix by a dense vector is a centerpiece of scientific computing applications: it is the essential kernel for the solution of sparse linear systems and sparse eigenvalue problems by iterative methods. The efficient implementation of the sparse matrix-vector multiplication is therefore crucial and has been the subject of an immense amount of research, with interest renewed with every major new trend in high performance computing architectures. The introduction of General Purpose Graphics Processing Units (GPGPUs) is no exception, and many articles have been devoted to this problem. In this report we propose three novel matrix formats: ELL-G and HLL, which derive from ELL, and HDIA, for matrices having mostly a diagonal sparsity pattern. We compare the performance of the proposed formats to that of state-of-the-art formats (i.e., HYB and ELL-RT) with experiments run on different GPU platforms and test matrices coming from various application domains. * This Technical Report has been issued as a Research Report for early dissemination of its contents. No part of its text nor any illustration can be reproduced without written permission of the Authors.
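For readers unfamiliar with the baseline ELL (ELLPACK) storage that ELL-G and HLL are said to derive from, the following sketch shows the plain format and its matrix-vector product in sequential Python. It does not reproduce ELL-G, HLL or HDIA, whose details are not given in the abstract; the helper names `rows_to_ell` and `ell_spmv` and the choice of padding column 0 are assumptions made for this illustration.

```python
def rows_to_ell(rows, pad_col=0):
    """Convert per-row lists of (column, value) pairs to ELL arrays.

    Every row is padded to the length of the longest row, so the result
    is two dense rectangular arrays: column indices and values.
    """
    width = max((len(r) for r in rows), default=0)
    cols = [[c for c, _ in r] + [pad_col] * (width - len(r)) for r in rows]
    vals = [[v for _, v in r] + [0.0] * (width - len(r)) for r in rows]
    return cols, vals


def ell_spmv(cols, vals, x):
    """Compute y = A @ x for a matrix A stored in ELL format."""
    y = []
    for row_cols, row_vals in zip(cols, vals):
        acc = 0.0
        # Padding entries carry value 0.0, so they add nothing to the sum.
        for c, v in zip(row_cols, row_vals):
            acc += v * x[c]
        y.append(acc)
    return y


# The 3x3 matrix [[4, 0, 1], [0, 2, 0], [3, 0, 5]] as per-row (col, value) pairs.
rows = [[(0, 4.0), (2, 1.0)], [(1, 2.0)], [(0, 3.0), (2, 5.0)]]
cols, vals = rows_to_ell(rows)
y = ell_spmv(cols, vals, [1.0, 2.0, 3.0])
```

The regular rectangular layout is what makes ELL attractive on GPUs, since each row maps naturally to a thread; its cost is the padding, which grows when row lengths vary widely.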

The solution of large and sparse linear systems is one of the main computational kernels in CFD applications and is often a very time-consuming task, thus requiring the use of effective algorithms on high-performance computers. Preconditioned Krylov solvers are the methods of choice for these systems, but the availability of "good" preconditioners is crucial to achieve efficiency and robustness. In this paper we discuss some issues concerning the design and the implementation of scalable algebraic multilevel preconditioners, which have been shown to enhance the performance of Krylov solvers in parallel settings. In this context, we outline the main objectives and the related design choices of MLD2P4, a package of multilevel preconditioners based on Schwarz methods and on the smoothed aggregation technique, which has been developed to provide scalable and easy-to-use preconditioners in the Parallel Sparse BLAS computing framework. Results concerning the application of various MLD2P4 preconditioners within a large eddy simulation of a turbulent channel flow are discussed.

Sparse matrix computations are ubiquitous in scientific computing; General-Purpose computing on Graphics Processing Units (GPGPU) is fast becoming a key component of high performance computing systems. It is therefore natural that a substantial amount of effort has been devoted to implementing sparse matrix computations on GPUs. In this paper, we discuss our work in this field, starting with the data structures we have employed to implement common operations, together with the software architecture we have devised to allow interoperability with existing software packages. To test the effectiveness of our approach we have run experiments with it on two platforms; the experimental results show that our data structures allow us to achieve very good performance results, significantly better than what can be obtained with the most recent version of the CUSPARSE library.

Neural Computing and Applications, 2022

The importance of robust flight delay prediction has recently increased in the air transportation industry. This industry seeks alternative methods and technologies for more robust flight delay prediction because of its significance for all stakeholders. The most affected are airlines, which suffer monetary and passenger loyalty losses. Several studies have attempted to analyse and solve flight delay prediction problems using machine learning methods. This research proposes a novel alternative method, namely social ski driver conditional autoregressive-based (SSDCA-based) deep learning. Our proposed method combines the Social Ski Driver algorithm with Conditional Autoregressive Value at Risk by Regression Quantiles. We consider the most relevant instances from the training dataset, which are the delayed flights. We applied data transformation to stabilise the data variance using Yeo-Johnson. We then perform the training and testing of our data using deep recurrent neural network...