Rapid Development of High-Performance Linear Algebra Libraries

LAPACK: A portable linear algebra library for high-performance computers

Concurrency: Practice and Experience, 1991

The goal of the LAPACK project is to design and implement a portable linear algebra library for efficient use on a variety of high-performance computers. The library is based on the widely used LINPACK and EISPACK packages for solving linear equations, eigenvalue problems, and linear least-squares problems, but extends their functionality in a number of ways. The major methodology for making the algorithms run faster is to restructure them to perform block matrix operations (e.g. matrix-matrix multiplication) in their inner loops. These block operations may be optimized to exploit the memory hierarchy of a specific architecture. In particular, we discuss algorithms and benchmarks for the singular value decomposition. [...] are developing a transportable linear algebra library in Fortran 77. The library is intended to provide a uniform set of subroutines to solve the most common linear algebra problems and to run efficiently on a wide range of high-performance computers.
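The idea of restructuring an algorithm so that its inner loop is a small matrix-matrix product can be illustrated with a minimal sketch. The function name `blocked_matmul` and the `block` parameter are hypothetical and chosen for illustration only; this is plain Python on lists, not LAPACK or BLAS code, and it shows the loop structure rather than real performance.

```python
def blocked_matmul(A, B, block=2):
    """Multiply A (m x k) by B (k x n), visiting the matrices tile by tile.

    `block` is an assumed tile size; a tuned library would pick it so that
    the tiles being combined fit in a fast level of the memory hierarchy.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):          # tile row of C
        for j0 in range(0, n, block):      # tile column of C
            for p0 in range(0, k, block):  # tile along the shared dimension
                # Inner kernel: a small matrix-matrix product on one tile.
                for i in range(i0, min(i0 + block, m)):
                    for p in range(p0, min(p0 + block, k)):
                        a = A[i][p]
                        for j in range(j0, min(j0 + block, n)):
                            C[i][j] += a * B[p][j]
    return C
```

The result is the same as an unblocked triple loop; only the order in which entries are touched changes, which is what lets an optimized kernel reuse data held in cache.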

Transforming linear algebra libraries: From abstraction to parallelism

2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010

We have built a body of evidence which shows that, given a mathematical specification of a dense linear algebra operation to be implemented, it is possible to mechanically derive families of algorithms and subsequently to mechanically translate these algorithms into high-performing code. In this paper, we add to this evidence by showing that the algorithms can be statically analyzed and translated into directed acyclic graphs (DAGs) of coarse-grained operations that are to be performed. DAGs naturally express parallelism, which we illustrate by representing the DAGs with the G graphical programming language used by LabVIEW. The LabVIEW compiler and runtime execution system then exploit parallelism from the resulting code. Respectable speedup on a sixteen-core architecture is reported.
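How a DAG of coarse-grained operations exposes parallelism can be sketched with Python's standard `graphlib` module. The task graph below is a hand-written assumption, not output of the system described in the abstract: it encodes the first iteration of a right-looking blocked Cholesky factorization on a 3 x 3 grid of blocks, and the task names and the helper `parallel_levels` are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (assumed, hand-written).
DEPS = {
    "chol(A00)": set(),
    "trsm(A10)": {"chol(A00)"},
    "trsm(A20)": {"chol(A00)"},
    "syrk(A11)": {"trsm(A10)"},
    "gemm(A21)": {"trsm(A10)", "trsm(A20)"},
    "syrk(A22)": {"trsm(A20)"},
}

def parallel_levels(deps):
    """Group tasks into waves; tasks within one wave have no mutual dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    levels = []
    while sorter.is_active():
        ready = sorter.get_ready()   # every task whose predecessors are done
        levels.append(sorted(ready))
        sorter.done(*ready)          # mark the whole wave as finished
    return levels
```

Calling `parallel_levels(DEPS)` yields three waves: the single factorization of the diagonal block, then the two triangular solves, then the three trailing updates. A runtime could execute each wave's tasks concurrently.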

PLAPACK: Parallel Linear Algebra Libraries Design Overview

Over the last twenty years, dense linear algebra libraries have gone through three generations of public domain general purpose packages. In the seventies, the first generation of packages were EISPACK and LINPACK, which implemented a broad spectrum of algorithms for solving dense linear eigenproblems and dense linear systems. In the late eighties, the second generation package called LAPACK was developed. This package attains high performance in a portable fashion while also improving upon the functionality and robustness of LINPACK and EISPACK. Finally, since the early nineties, an effort to port LAPACK to distributed memory networks of computers has been underway as part of the ScaLAPACK project.

Software Libraries for Linear Algebra Computations on High Performance Computers

SIAM review, 1995

This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct higher-level algorithms, and hide many details of the parallelism from the application developer. The block-cyclic data distribution is described, and adopted as a good way of distributing block-partitioned matrices. Block-partitioned versions of the Cholesky and LU factorizations are presented, and optimization issues associated with the implementation of the LU factorization algorithm on distributed memory concurrent computers are discussed, together with its performance on the Intel Delta system. Finally, approaches to the design of library interfaces are reviewed.
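The block-cyclic data distribution mentioned above can be made concrete with a small index-mapping sketch. The function names and the zero-based conventions are assumptions for illustration (the first block is taken to live on process (0, 0)); these helpers are not part of ScaLAPACK's interface.

```python
def block_cyclic_owner(i, j, mb, nb, P, Q):
    """Return the (row, col) coordinates of the process owning global entry (i, j).

    The matrix is cut into mb x nb blocks, and blocks are dealt out
    cyclically over a P x Q process grid (all indices zero-based).
    """
    return ((i // mb) % P, (j // nb) % Q)

def local_index(g, b, nprocs):
    """Map a global row (or column) index to its index in the owner's local storage.

    g is the global index, b the block size along that dimension,
    and nprocs the number of processes along that dimension.
    """
    local_block = (g // b) // nprocs   # how many of the owner's blocks precede it
    return local_block * b + g % b     # offset within that local block
```

For example, with 2 x 2 blocks on a 2 x 3 grid, process row 0 holds global rows 0, 1, 4, 5, ... so global row 5 is stored as local row 3 on its owner. Dealing blocks out cyclically keeps every process busy as a factorization works its way down the matrix.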

Scalapack: A linear algebra library for message-passing computers

1997

This article outlines the content and performance of some of the ScaLAPACK software. ScaLAPACK is a collection of mathematical software for linear algebra computations on distributed-memory computers. The importance of developing standards for computational and message-passing interfaces is discussed. We present the different components and building blocks of ScaLAPACK and provide initial performance results for selected PBLAS routines and a subset of ScaLAPACK driver routines.

Design and evaluation of a linear algebra package for Java

Proceedings of the ACM 2000 conference on Java Grande - JAVA '00, 2000

This paper describes the design of a high-performance linear algebra library for Java. Linear algebra libraries such as ESSL and LAPACK are important tools of computational science and engineering, and have been available to C and Fortran programmers for quite a while. If Java is to become a serious language for the development of large scale numerical applications, it must provide equivalent functionality. From the many possible alternatives to accomplish this goal, we took the approach of designing a linear algebra library entirely in Java. This approach leads to good portability and maintainability of the code. It is also a good test of how far we can push Java performance. We adopted an object-oriented design in which the linear algebra operations are implemented as strategy design patterns. The higher level algorithms, optimized for the memory hierarchies of present-day machines, are described in a type independent manner. Type specific methods capture the lower level optimizations for operations on matrices of single-precision, double-precision, or complex numbers. We evaluate the performance of our linear algebra package on three different machines. Our preliminary results show that our Java library achieves up to 85% of the performance of the highly optimized ESSL.
1. INTRODUCTION
Scientists and engineers developing numerical applications in established languages such as Fortran and C have a vast collection of standard libraries available in their toolbox. Of particular importance are libraries for numerical linear algebra, such as LAPACK [1] and ESSL [17], libraries for discrete Fourier transform (FFT), and elementary functions libraries. If Java is to become a serious language for large scale numerical computing, it must provide a similar set of tools. Many efforts are under way to provide Java with such libraries [4, 5, 6, 7]. This paper describes and evaluates our particular design of a linear algebra package for Java. The goal of this package is to provide a portable and high performance Java analog to BLAS [11] and LAPACK. We chose to develop this package entirely in Java, with no ...

Towards Usable and Lean Parallel Linear Algebra Libraries

1996

In this paper, we introduce a new parallel library effort, as part of the PLAPACK project, that attempts to address discrepancies between the needs of applications and parallel libraries. A number of contributions are made, including a new approach to matrix distribution, new insights into layering parallel linear algebra libraries, and the application of "object based" programming techniques which have recently become popular for (parallel) scientific libraries. We present an overview of a prototype library, the SL_Library, which incorporates these ideas. Preliminary performance data shows this more application-centric approach to libraries does not necessarily adversely impact performance, compared to more traditional approaches.

Automatic Generation of Tiled and Parallel Linear Algebra Routines

Exploiting parallelism in modern hardware is necessary to achieve high performance in linear algebra routines. Unfortunately, modern architectures are complex, so many optimization choices must be considered to find the combination that delivers the best performance. Exploring optimizations by hand is costly and time consuming. Auto-tuning systems offer a method for quickly generating and evaluating optimization choices. In this paper we describe a data-parallel extension to our auto-tuning system, Build to Order BLAS. We introduce an abstraction for partitioning matrices and vectors and we introduce an algorithm for partitioning linear algebra operations. We generate code for shared-memory machines using Pthreads. Results from the prototype show that our auto-tuning approach is competitive with existing state-of-the-art parallel libraries. We achieve speedups of up to 2.7 times over MKL and up to 6 times over our best-optimized serial code on an Intel Core2Quad.
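Partitioning a linear algebra operation across threads can be sketched as follows. This is only an illustration of the general idea of row partitioning for a matrix-vector product; it uses Python's `concurrent.futures` as a stand-in for Pthreads, it is not the partitioning algorithm or generated code of Build to Order BLAS, and the names `row_partitions` and `parallel_matvec` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def row_partitions(n_rows, n_parts):
    """Split range(n_rows) into n_parts contiguous, nearly equal (start, stop) slices."""
    base, extra = divmod(n_rows, n_parts)
    bounds, start = [], 0
    for p in range(n_parts):
        size = base + (1 if p < extra else 0)  # first `extra` parts get one more row
        bounds.append((start, start + size))
        start += size
    return bounds

def parallel_matvec(A, x, n_threads=2):
    """Compute y = A x, with each thread handling one contiguous slice of rows."""
    y = [0.0] * len(A)

    def work(bounds):
        lo, hi = bounds
        for i in range(lo, hi):
            # Each thread writes a disjoint slice of y, so no locking is needed.
            y[i] = sum(a * b for a, b in zip(A[i], x))

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(work, row_partitions(len(A), n_threads)))
    return y
```

In CPython the global interpreter lock prevents this pure-Python version from showing real speedup; the sketch is meant to show how the output is divided into independent pieces, which is the decision an auto-tuner would explore alongside partition sizes and thread counts.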

P1635R0 : A Design for an Inter-Operable and Customizable Linear Algebra Library

2019

In the past few years, there has been an effort to add a linear algebra library to the standard library. The motivations for the effort can be found in [2]. There are already quite a few papers in flight concerning the design, with the primary one being [1]. The primary idea behind the design in that paper is to provide a collection of standard linear algebra types and functions with facilities for customization for special purposes. There is another approach which one can pursue, wherein we ask the types implemented by the standard library and the user alike to satisfy certain requirements and then write generic code for those types. This is the idea which is primarily employed by the standard template library. In this paper, we take the second approach to the design of a linear algebra library. Now, while the two approaches do look different in principle, the practical considerations involved and the resulting design from the two approaches might end up being very similar. For examp...