Towards Usable and Lean Parallel Linear Algebra Libraries
Related papers
PLAPACK: Parallel Linear Algebra Libraries Design Overview
Over the last twenty years, dense linear algebra libraries have gone through three generations of public domain general purpose packages. In the seventies, the first generation of packages, EISPACK and LINPACK, implemented a broad spectrum of algorithms for solving dense linear eigenproblems and dense linear systems. In the late eighties, the second-generation package, LAPACK, was developed. This package attains high performance in a portable fashion while also improving upon the functionality and robustness of LINPACK and EISPACK. Finally, since the early nineties, an effort to port LAPACK to distributed memory networks of computers has been underway as part of the ScaLAPACK project.
Transforming linear algebra libraries: From abstraction to parallelism
2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010
We have built a body of evidence which shows that, given a mathematical specification of a dense linear algebra operation to be implemented, it is possible to mechanically derive families of algorithms and subsequently to mechanically translate these algorithms into high-performing code. In this paper, we add to this evidence by showing that the algorithms can be statically analyzed and translated into directed acyclic graphs (DAGs) of the coarse-grained operations to be performed. DAGs naturally express parallelism, which we illustrate by representing them in the G graphical programming language used by LabVIEW. The LabVIEW compiler and runtime execution system then exploit the parallelism in the resulting code. Respectable speedup on a sixteen-core architecture is reported.
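The paper's examples use LabVIEW's G language; as a minimal language-neutral sketch of the same idea (Python here, with a simple last-writer dependency rule, not the authors' toolchain), the code below builds the coarse-grained task DAG for a tiled Cholesky factorization and groups tasks into fronts whose prerequisites are all satisfied, so every task within a front could run in parallel.

```python
# Hypothetical sketch (not the paper's LabVIEW/G code): derive the task DAG
# of a tiled Cholesky factorization, then list the parallel "waves".
from collections import defaultdict

def tiled_cholesky_dag(nt):
    """Task DAG for a right-looking tiled Cholesky on an nt x nt tile grid.

    Each task reads/writes whole tiles; a task depends on the most recent
    writer of every tile it touches, which yields the coarse-grained DAG."""
    last_writer = {}          # tile -> task that last wrote it
    deps = defaultdict(set)   # task -> set of prerequisite tasks

    def touch(task, reads, writes):
        for tile in reads + writes:
            if tile in last_writer:
                deps[task].add(last_writer[tile])
        for tile in writes:
            last_writer[tile] = task
        deps.setdefault(task, set())

    for k in range(nt):
        touch(("POTRF", k), [], [(k, k)])
        for i in range(k + 1, nt):
            touch(("TRSM", k, i), [(k, k)], [(i, k)])
        for i in range(k + 1, nt):
            for j in range(k + 1, i + 1):
                # SYRK when i == j, GEMM otherwise; both update tile (i, j).
                touch(("UPDATE", k, i, j), [(i, k), (j, k)], [(i, j)])
    return deps

def waves(deps):
    """Group tasks into fronts whose prerequisites are already done:
    all tasks within one wave could execute concurrently."""
    done, out = set(), []
    while len(done) < len(deps):
        ready = [t for t in deps if t not in done and deps[t] <= done]
        out.append(ready)
        done.update(ready)
    return out

for step, front in enumerate(waves(tiled_cholesky_dag(4))):
    print(f"wave {step}: {front}")
```

Running this on a 4 x 4 tile grid shows the widening and narrowing fronts typical of factorization DAGs: a single POTRF, then a wave of independent TRSMs, then a larger wave of trailing updates, and so on.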
Scalability Issues Affecting the Design of a Dense Linear Algebra Library
Journal of Parallel and Distributed Computing, 1994
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library, which extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block-cyclic data distribution, which is used in all three factorization algorithms, is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate performance models of the algorithms are presented to indicate which factors in the design of an algorithm have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor, and that the measured timings are consistent with the performance model. The contribution of this paper goes beyond reporting our experience: our implementations are available in the public domain.
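The block-cyclic distribution at the heart of these routines can be stated in a few lines. The sketch below (illustrative Python, not ScaLAPACK's own indexing routines) maps a global matrix block to its owner on a P x Q process grid and enumerates the blocks a given process stores.

```python
# Minimal sketch of the 2-D block-cyclic distribution used by ScaLAPACK
# (illustrative Python, not the library's own indexing code).

def owner(ib, jb, P, Q):
    """Process-grid coordinates owning global block (ib, jb) on a P x Q grid:
    block rows are dealt cyclically over the P process rows, block columns
    cyclically over the Q process columns."""
    return (ib % P, jb % Q)

def blocks_of(p, q, m_blocks, n_blocks, P, Q):
    """All global blocks stored by process (p, q)."""
    return [(ib, jb)
            for ib in range(p, m_blocks, P)
            for jb in range(q, n_blocks, Q)]

# Example: a 6 x 6 grid of blocks over a 2 x 3 process grid.
print(owner(4, 5, 2, 3))          # -> (0, 2)
print(blocks_of(0, 0, 6, 6, 2, 3))  # blocks held by process (0, 0)
```

The cyclic dealing is what balances load across the factorization: as the trailing matrix shrinks, every process still owns a share of it.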
plapackJava: Towards an Efficient Java Interface for High Performance Parallel Linear Algebra
Information Processing Letters, 2000
Java is gaining acceptance as a language for high performance computing, as it is platform-independent and safe. A parallel linear algebra package is fundamental for developing parallel numerical applications. In this paper, we present plapackJava, a Java interface to PLAPACK, a parallel linear algebra library. This interface is simple to use and object-oriented, with good support for initialization of distributed objects. The experiments we have performed indicate that plapackJava does not introduce a significant overhead with respect to PLAPACK.
Software Libraries for Linear Algebra Computations on High Performance Computers
SIAM Review, 1995
This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together, the distributed BLAS and the BLACS can be used to construct higher-level algorithms, and hide many details of the parallelism from the application developer. The block-cyclic data distribution is described and adopted as a good way of distributing block-partitioned matrices. Block-partitioned versions of the Cholesky and LU factorizations are presented, and optimization issues associated with the implementation of the LU factorization algorithm on distributed memory concurrent computers are discussed, together with its performance on the Intel Delta system. Finally, approaches to the design of library interfaces are reviewed.
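To make the block-partitioned style concrete, here is a simplified right-looking LU sketch in Python (a teaching sketch without the partial pivoting a production factorization requires; block size and test matrix are arbitrary choices). Nearly all of the flops land in the trailing-matrix update, a Level 3 BLAS matrix-matrix multiply, which is exactly what amortizes data movement.

```python
import numpy as np

def blocked_lu(A, nb=64):
    """Right-looking blocked LU, in place, *without* pivoting (sketch only;
    LAPACK/ScaLAPACK use partial pivoting). L is unit lower triangular."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Factor the diagonal block with an unblocked LU (Level 2 work).
        for j in range(k, e):
            A[j + 1:e, j] /= A[j, j]
            A[j + 1:e, j + 1:e] -= np.outer(A[j + 1:e, j], A[j, j + 1:e])
        if e < n:
            # Triangular solves for the off-diagonal panels (Level 3: TRSM).
            L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
            A[e:, k:e] = np.linalg.solve(np.triu(A[k:e, k:e]).T,
                                         A[e:, k:e].T).T
            # Rank-nb update of the trailing matrix (Level 3: GEMM),
            # where the bulk of the flops -- and the scalability -- lives.
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A  # L and U packed in one array

rng = np.random.default_rng(0)
M = rng.standard_normal((300, 300)) + 300 * np.eye(300)  # safe w/o pivoting
LU = blocked_lu(M)
L, U = np.tril(LU, -1) + np.eye(300), np.triu(LU)
print(np.allclose(L @ U, M))  # True
```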
Rapid Development of High-Performance Linear Algebra Libraries
Lecture Notes in Computer Science, 2006
We present a systematic methodology for deriving and implementing linear algebra libraries. It is quite common that an application requires a library of routines for the computation of linear algebra operations that are not (exactly) supported by commonly used libraries like LAPACK. In this situation, the application developer has the option of casting the operation into one supported by an existing library, often at the expense of performance, or implementing a custom library, often requiring considerable effort. Our recently discovered methodology, based on the formal derivation of algorithms, allows such a user to quickly derive proven-correct algorithms. Furthermore, it provides an API that allows the derived algorithms to be quickly translated into high-performance implementations.
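For a flavor of the derived, partitioned style (a loose Python rendering, not the authors' FLAME API), here is a lower-triangular solve written so that the loop invariant, "the top part of x already holds the solution," is visibly maintained as the partition boundary moves down the matrix.

```python
import numpy as np

def trsv_lower(L, b):
    """Solve L x = b for lower-triangular L, in the repartitioned style of
    formally derived algorithms: before each iteration x[:i] is final (the
    loop invariant), and moving the boundary one element preserves it.
    A sketch, not the FLAME API itself."""
    x = b.astype(float).copy()
    n = L.shape[0]
    for i in range(n):  # boundary between computed and untouched parts
        # Peel off row i of L and expose the next unknown.
        x[i] = (x[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

rng = np.random.default_rng(1)
L = np.tril(rng.standard_normal((5, 5))) + 5 * np.eye(5)
b = np.arange(5.0)
print(np.allclose(L @ trsv_lower(L, b), b))  # True
```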
Designing polylibraries to speed up linear algebra computations
International Journal of High Performance Computing and Networking, 2004
In this paper we analyse the design of polylibraries, where programs call routines from different libraries according to the characteristics of the problem and of the system used to solve it. An architecture for this type of library is proposed. Our aim is to develop a methodology that can be used in the design of parallel libraries. To evaluate the viability of the proposed method, the typical hierarchy of linear algebra libraries has been considered. Experiments have been performed on different systems and with linear algebra routines from different levels of the hierarchy. The results confirm the design of polylibraries as a good technique for speeding up computations.
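A toy dispatcher makes the polylibrary idea concrete; the backend names, the size threshold, and the selection rule below are illustrative assumptions, not the architecture proposed in the paper.

```python
# Toy polylibrary dispatcher (illustrative; routine names, backends, and the
# threshold are hypothetical, not taken from the paper).
import numpy as np

def matmul_reference(a, b):   # stand-in for a simple fallback library
    return np.einsum("ij,jk->ik", a, b)

def matmul_tuned(a, b):       # stand-in for a vendor-tuned library
    return a @ b

def polymatmul(a, b, system_has_tuned_blas=True, small=64):
    """Pick a backend from problem size and system characteristics --
    the per-routine selection criterion a polylibrary would encode."""
    if a.shape[0] <= small or not system_has_tuned_blas:
        return matmul_reference(a, b)
    return matmul_tuned(a, b)

x = np.ones((128, 128))
print(polymatmul(x, x)[0, 0])  # 128.0, served by the "tuned" backend
```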
ScaLAPACK: a portable linear algebra library for distributed memory computers
Computer Physics Communications, 1996
This paper outlines the content and performance of ScaLAPACK, a collection of mathematical software for linear algebra computations on distributed memory computers. The importance of developing standards for computational and message passing interfaces is discussed. We present the different components and building blocks of ScaLAPACK, and indicate the difficulties inherent in producing correct code for networks of heterogeneous processors. We define a theoretical model of parallel computers dedicated to linear algebra applications: the Distributed Linear Algebra Machine (DLAM). This model provides a convenient framework for developing parallel algorithms and investigating their scalability, performance, and programmability. Extensive performance results on various platforms are presented and analyzed with the help of the DLAM. Finally, this paper briefly describes future directions for the ScaLAPACK library and concludes by suggesting alternative approaches to mathematical libraries, explaining how ScaLAPACK could be integrated into efficient and user-friendly distributed systems.