Software Libraries for Linear Algebra Computations on High Performance Computers
Related papers
Scalability Issues Affecting the Design of a Dense Linear Algebra Library
Journal of Parallel and Distributed Computing, 1994
This paper discusses the scalability of the Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely-used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution, which is used in all three factorization algorithms, is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of the algorithms' performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor, and that the measured timings are consistent with the performance model. The contribution of this paper goes beyond reporting our experience: our implementations are available in the public domain.
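As a rough sketch of the block cyclic distribution described above, the Python fragment below maps global block indices onto a Pr x Pc process grid; the block counts and grid shape in the example are illustrative choices, not values taken from the paper.

```python
# Minimal sketch of a 2D block cyclic distribution (illustrative sizes).

def owner_of_block(ib, jb, Pr, Pc):
    """Process-grid coordinates owning global block (ib, jb)."""
    return (ib % Pr, jb % Pc)

def local_blocks(p_row, p_col, n_blocks_r, n_blocks_c, Pr, Pc):
    """All global block indices stored by process (p_row, p_col)."""
    return [(ib, jb)
            for ib in range(p_row, n_blocks_r, Pr)
            for jb in range(p_col, n_blocks_c, Pc)]

# Example: an 8x8 grid of blocks cycled over a 2x3 process grid.
print(owner_of_block(5, 4, 2, 3))        # -> (1, 1)
print(local_blocks(0, 0, 8, 8, 2, 3))    # blocks held by process (0, 0)
```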
2008 IEEE International Symposium on Parallel and Distributed Processing, 2008
The scalable parallel implementation of dense linear algebra libraries on SMP and/or multicore architectures is analyzed. Using the LU factorization as a case study, it is shown that an algorithm-by-blocks exposes a higher degree of parallelism than traditional implementations based on multithreaded BLAS. The implementation of this algorithm using the SuperMatrix runtime system is discussed and the scalability of the solution is demonstrated on two different platforms with 16 processors.
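To make concrete why an algorithm-by-blocks exposes more parallelism, here is a minimal Python sketch of a tiled right-looking LU (without pivoting, on a diagonally dominant matrix so that this is safe); within each step k, the panel/row solves and the trailing updates on distinct tiles are mutually independent tasks of exactly the kind a runtime such as SuperMatrix can schedule out of order. Tile sizes and helper names are illustrative.

```python
import numpy as np

def lu_nopiv(T):
    """Unblocked LU without pivoting on one tile, in place (L\\U combined)."""
    n = T.shape[0]
    for k in range(n - 1):
        T[k+1:, k] /= T[k, k]
        T[k+1:, k+1:] -= np.outer(T[k+1:, k], T[k, k+1:])

def tiled_lu(T):
    """Right-looking LU on a 2D list of square tiles; each line marked
    'task' is an independent unit a runtime could schedule out of order."""
    nt = len(T)
    for k in range(nt):
        lu_nopiv(T[k][k])                              # task: factor diagonal tile
        L = np.tril(T[k][k], -1) + np.eye(len(T[k][k]))
        U = np.triu(T[k][k])
        for i in range(k + 1, nt):
            T[i][k] = T[i][k] @ np.linalg.inv(U)       # task: panel solve
        for j in range(k + 1, nt):
            T[k][j] = np.linalg.inv(L) @ T[k][j]       # task: row solve
        for i in range(k + 1, nt):
            for j in range(k + 1, nt):
                T[i][j] -= T[i][k] @ T[k][j]           # task: trailing update

nb, nt = 3, 3
A = np.random.rand(nb * nt, nb * nt) + nb * nt * np.eye(nb * nt)
T = [[A[i*nb:(i+1)*nb, j*nb:(j+1)*nb].copy() for j in range(nt)]
     for i in range(nt)]
tiled_lu(T)
```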
A linear algebra package for a local memory multiprocessor: Problems, proposals and solutions
Parallel Computing, 1988
This paper is concerned with principal considerations for developing a linear algebra package for the SUPRENUM computer. The design goals, as well as the mapping strategy of the parallelization methodology, are described briefly. Finally, a basic factorization scheme is introduced which can be readily tailored to the LU, Cholesky and QR factorizations, provided that the corresponding matrices are distributed according to the column-oriented wrap mapping.
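A minimal sketch of the column-oriented wrap mapping the scheme relies on: column j of the matrix is simply assigned to process j mod p. Function names are illustrative.

```python
# Column-oriented wrap (cyclic) mapping: column j lives on process j mod p.

def owner_of_column(j, p):
    return j % p

def my_columns(rank, n, p):
    """Global column indices stored by a given process."""
    return list(range(rank, n, p))

# Example: 10 columns wrapped onto 4 processes.
for rank in range(4):
    print(rank, my_columns(rank, 10, 4))
```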
ScaLAPACK: a portable linear algebra library for distributed memory computers
Computer Physics Communications, 1996
This paper outlines the content and performance of ScaLAPACK, a collection of mathematical software for linear algebra computations on distributed memory computers. The importance of developing standards for computational and message passing interfaces is discussed. We present the different components and building blocks of ScaLAPACK, and outline the difficulties inherent in producing correct codes for networks of heterogeneous processors. We define a theoretical model of parallel computers dedicated to linear algebra applications: the Distributed Linear Algebra Machine (DLAM). This model provides a convenient framework for developing parallel algorithms and investigating their scalability, performance and programmability. Extensive performance results on various platforms are presented and analyzed with the help of the DLAM. Finally, this paper briefly describes future directions for the ScaLAPACK library and concludes by suggesting alternative approaches to mathematical libraries, explaining how ScaLAPACK could be integrated into efficient and user-friendly distributed systems.
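As an illustration of the kind of analysis a machine model like the DLAM enables, the toy Python model below expresses factorization time as an arithmetic term plus latency and bandwidth terms; every constant and term shape here is an assumption for demonstration, not the paper's actual model.

```python
# Toy scalability model: time = flop term + latency term + bandwidth term.
# All constants and term shapes are illustrative assumptions.

def lu_time(n, p, nb=64, gamma=1e-10, alpha=1e-6, beta=1e-9):
    flops = (2.0 * n**3 / 3.0) / p * gamma    # parallelized arithmetic
    msgs  = (n / nb) * alpha                  # O(n/nb) communication steps
    words = (n**2 / p**0.5) * beta            # panel/row broadcast volume
    return flops + msgs + words

def efficiency(n, p):
    serial = 2.0 * n**3 / 3.0 * 1e-10         # arithmetic-only serial time
    return serial / (p * lu_time(n, p))

print(round(efficiency(10_000, 64), 2))       # model efficiency on 64 nodes
```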
IEEE Access, 2019
We propose two novel techniques for overcoming the load imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first technique promotes worker sharing (WS) between the two tasks, allowing the threads of the task that completes first to be reallocated for use by the costlier task. The second technique allows a fast task to alert the slower task of completion, enforcing the early termination (ET) of the second task and a smooth transition of the factorization procedure into the next iteration. The two mechanisms are instantiated via a new malleable thread-level implementation of the basic linear algebra subprograms, and their benefits are illustrated via an implementation of the LU factorization with partial pivoting enhanced with look-ahead. Concretely, our experimental results on an Intel Xeon system with 12 cores show the benefits of combining WS+ET, reporting competitive performance in comparison with a task-parallel runtime-based solution.
Index terms: Solution of linear systems, multi-threading, workload balancing, thread malleability, basic linear algebra subprograms (BLAS), linear algebra package (LAPACK).
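The early termination idea can be sketched with a simple completion flag: the fast branch signals when it finishes, and the slow branch polls the flag between block updates so the factorization can move on to the next iteration. The Python sketch below is illustrative only; the paper's mechanism lives inside a malleable BLAS, not at this level.

```python
import threading, time

done = threading.Event()

def panel_task():
    time.sleep(0.01)       # stand-in for factorizing the next panel (fast branch)
    done.set()             # alert the slower branch on completion

def update_task(blocks):
    for b in blocks:
        if done.is_set():  # ET: abandon the remaining blocks; they are
            return b       # folded into the next iteration's update
        time.sleep(0.001)  # stand-in for updating one block (slow branch)
    return None

t = threading.Thread(target=panel_task)
t.start()
first_deferred = update_task(range(100))
t.join()
print("deferred from block", first_deferred)
```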
An object oriented design for high performance linear algebra on distributed memory architectures
1993
We describe the design of ScaLAPACK++, an object oriented C++ library for implementing linear algebra computations on distributed memory multicomputers. This package, when complete, will support distributed matrix operations for symmetric, positive-definite, and non-symmetric cases. In ScaLAPACK++ we have employed object oriented design methods to enhance scalability, portability, flexibility, and ease-of-use. We illustrate some of these points by describing the implementation of basic algorithms and comment on tradeoffs between elegance, generality, and performance.
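To illustrate the flavor of such an object oriented design (in Python here, rather than the paper's C++), the sketch below hides a cyclic column-block layout behind global indices so that user code addresses the matrix as a single logical object. The class and method names are made up for this example.

```python
import numpy as np

class DistMatrix:
    """Illustrative wrapper: global n x n matrix, column blocks of width nb
    wrapped cyclically over p processes; this process holds rank's blocks."""

    def __init__(self, n, nb, p, rank):
        self.n, self.nb, self.p, self.rank = n, nb, p, rank
        mine = [j for j in range((n + nb - 1) // nb) if j % p == rank]
        self.local = {j: np.zeros((n, min(nb, n - j * nb))) for j in mine}

    def owner(self, j):
        """Rank owning global column j."""
        return (j // self.nb) % self.p

    def __setitem__(self, idx, v):
        i, j = idx
        jb = j // self.nb
        if self.owner(j) == self.rank:        # only the owner stores it
            self.local[jb][i, j - jb * self.nb] = v

m = DistMatrix(n=6, nb=2, p=2, rank=0)
m[0, 1] = 3.0    # column 1 lives in block 0, which rank 0 owns
```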
Methods for Implementing Linear Algebra Algorithms on High Performance Architectures
In this paper we consider the data distribution and data movement issues related to the solution of the basic linear algebra problems on high performance systems. The algorithms we discuss in detail are the Gauss and Gauss-Jordan methods for solving a system of linear equations, Cholesky's algorithm for LL^T factorization, and the QR factorization algorithm using Householder transformations. It is shown that all these algorithms can be executed efficiently, with partial pivoting, on a parallel system with simple and regular links. A detailed implementation of the algorithms is described on a systolic-type architecture using a simple parallel language. Both the theoretical analysis and the simulation results show speedups on moderately large problems close to optimal.

1 Introduction

In many scientific and practical computations the linear algebra algorithms are the most time-consuming tasks. For example, the simulation of multiphase fluid flows in porous media and other comp...
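For reference, a compact serial version of QR via Householder transformations, as mentioned in the abstract, is sketched below; the paper's systolic mapping and pivoting details are beyond this fragment.

```python
import numpy as np

def householder_qr(A):
    """Reference (serial) QR via Householder reflectors; A is m x n, m >= n."""
    R = A.astype(float).copy()
    m, n = R.shape
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])   # avoid cancellation
        v /= np.linalg.norm(v)
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])  # apply H = I - 2 v v^T
        Q[:, k:]  -= 2.0 * np.outer(Q[:, k:] @ v, v)   # accumulate Q <- Q H
    return Q, R

A = np.random.rand(5, 3)
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(5)))  # True True
```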
Programming matrix algorithms-by-blocks for thread-level parallelism
ACM Transactions on Mathematical Software, 2009
With the emergence of thread-level parallelism as the primary means for continued performance improvement, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of contiguous blocks, facilitating algorithms-by-blocks. Operand descriptions are registered for a particular operation a priori by the library implementor. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out-of-order and executed in parallel. But not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and a closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence regarding the development of routines with a core functionality demonstrates how the methodology supports high productivity while experimental results suggest that high performance is abundantly achievable.
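The dependency analysis described above can be sketched in a few lines: each task declares the blocks it reads and writes, the runtime links it to the last writer of every operand, and any task whose dependencies are satisfied may execute, in whatever order. The Task/Runtime names and the tracking of only flow and output dependencies are simplifications for this Python sketch, not the SuperMatrix API.

```python
class Task:
    def __init__(self, name, reads, writes):
        self.name, self.reads, self.writes = name, reads, writes
        self.deps = set()

class Runtime:
    def __init__(self):
        self.tasks, self.last_writer = [], {}

    def submit(self, task):
        for b in task.reads | task.writes:    # flow + output dependencies
            if b in self.last_writer:
                task.deps.add(self.last_writer[b])
        for b in task.writes:
            self.last_writer[b] = task
        self.tasks.append(task)

    def run(self):
        done, pending = set(), list(self.tasks)
        while pending:                        # any ready task may fire
            for t in [t for t in pending if t.deps <= done]:
                print("exec", t.name)         # a thread pool would go here
                done.add(t)
                pending.remove(t)

rt = Runtime()
rt.submit(Task("GETRF(0,0)", reads=set(), writes={(0, 0)}))
rt.submit(Task("TRSM(1,0)", reads={(0, 0)}, writes={(1, 0)}))
rt.submit(Task("GEMM(1,1)", reads={(1, 0)}, writes={(1, 1)}))
rt.run()
```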