Transforming linear algebra libraries: From abstraction to parallelism (original) (raw)

Automatic Generation of Tiled and Parallel Linear Algebra Routines

Exploiting parallelism in modern hardware is necessary to achieve high performance in linear algebra routines. Unfortunately, modern architectures are complex so many optimization choices must be considered to find the combination that delivers the best performance. Exploring optimizations by hand is costly and time consuming. Auto-tuning systems offer a method for quickly generating and evaluating optimization choices. In this paper we describe a dataparallel extension to our auto-tuning system, Build to Order BLAS. We introduce an abstraction for partitioning matrices and vectors and we introduce an algorithm to partitioning linear algebra operations. We generate code for shared-memory machine using Pthreads. Results from the prototype show that our auto-tuning approach is competitive with existing state-of-the-art parallel libraries. We achieve speedups of up to 2.7 times faster than MKL and speedups up to 6 times faster than our best-optimized serial code on an Intel Core2Quad.

Rapid Development of High-Performance Linear Algebra Libraries

Lecture Notes in Computer Science, 2006

We present a systematic methodology for deriving and implementing linear algebra libraries. It is quite common that an application requires a library of routines for the computation of linear algebra operations that are not (exactly) supported by commonly used libraries like LAPACK. In this situation, the application developer has the option of casting the operation into one supported by an existing library, often at the expense of performance, or implementing a custom library, often requiring considerable effort. Our recent discovery of a methodology based on formal derivation of algorithm allows such a user to quickly derive proven correct algorithms. Furthermore it provides an API that allows the so-derived algorithms to be quickly translated into high-performance implementations.

Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA

2011

We present a method for developing dense linear algebra algorithms that seamlessly scales to thousands of cores. It can be done with our project called DPLASMA (Distributed PLASMA) that uses a novel generic distributed Direct Acyclic Graph Engine (DAGuE). The engine has been designed for high performance computing and thus it enables scaling of tile algorithms, originating in PLASMA, on large distributed memory systems. The underlying DAGuE framework has many appealing features when considering distributed-memory platforms with heterogeneous multicore nodes: DAG representation that is independent of the problem-size, automatic extraction of the communication from the dependencies, overlapping of communication and computation, task prioritization, and architecture-aware scheduling and management of tasks. The originality of this engine lies in its capacity to translate a sequential code with nested-loops into a concise and synthetic format which can then be interpreted and executed in a distributed environment. We present three common dense linear algebra algorithms from PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures), namely: Cholesky, LU, and QR factorizations, to investigate their data driven expression and execution in a distributed system. We demonstrate through experimental results on the Cray XT5 Kraken system that our DAG-based approach has the potential to achieve sizable fraction of peak performance which is characteristic of the state-of-the-art distributed numerical software on current and emerging architectures.

Methods for Implementing Linear Algebra Algorithms on High Performance Architectures

In this paper we consider the data distribution and data movement issues related to the solution of the basic linear algebra problems on high performance systems. The algorithms we discuss in details are the Gauss and Gausss-Jordan methods for solving a system of linear equations, the Cholesky's algorithm for LL T factorization, and QR-factorization algorithm using Householder transformations. It is shown that all those algorithm can be executed efficiently, with partial pivoting, on a parallel system with simple and regular links. A detailed implementation of the algorithms is described on a systolic-type architecture using a simple parallel language. Both the theoretical analysis and the simulation results show speedups on moderately large problems close to the optimal. 1 Introduction In many scientific and practical computations the linear algebra algorithms are the most time consuming tasks. For example, the simulation of multiphase fluid flows in porous media and other comp...

GLAME@ lab: An M-script API for Linear Algebra Operations on Graphics Processors

2008

We propose two high-level application programming interfaces (APIs) to use a graphics processing unit (GPU) as a coprocessor for dense linear algebra operations. Combined with an extension of the FLAME API and an implementation on top of NVIDIA CUBLAS, the result is an efficient and user-friendly tool to design, implement, and execute dense linear algebra operations on the current generation of NVIDIA graphics processors, of wide-appeal to scientists and engineers. As an application of the developed APIs, we implement and evaluate the performance of three different variants of the Cholesky factorization.

Towards Usable and Lean Parallel Linear Algebra Libraries

1996

In this paper, we introduce a new parallel library effort, as part of the PLAPACK project, that attempts to address discrepencies between the needs of applications and parallel libraries. A number of contributions are made, including a new approach to matrix distribution, new insights into layering parallel linear algebra libraries, and the application of ``object based'''' programming techniques which have recently become popular for (parallel) scientific libraries. We present an overview of a prototype library, the SL_Library, which incorporates these ideas. Preliminary performance data shows this more application-centric approach to libraries does not necessarily adversely impact performance, compared to more traditional approaches.

A Lightweight Run-Time Support for Fast Dense Linear Algebra on Multi-Core

Software Engineering / 811: Parallel and Distributed Computing and Networks / 816: Artificial Intelligence and Applications, 2014

The work proposes ffMDF, a lightweight dynamic run-time support able to achieve high performance in the execution of dense linear algebra kernels on shared-cache multi-core. ffMDF implements a dynamic macro-dataflow interpreter processing DAG graphs generated on-the-fly out of standard numeric kernel code. The experimental results demonstrate that the performance obtained using ffMDF on both fine-grain and coarse-grain problems is comparable with or even better than that achieved by de-facto standard solutions (notably PLASMA library), which use separate run-time supports specifically optimised for different computational grains on modern multi-core.

Software Libraries for Linear Algebra Computations on High Performance Computers

SIAM review, 1995

This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between di erent levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct higher-level algorithms, and hide many details of the parallelism from the application developer. The block-cyclic data distribution is described, and adopted as a good way of distributing blockpartitioned matrices. Block-partitioned versions of the Cholesky and LU factorizations are presented, and optimization issues associated with the implementation of the LU factorization algorithm on distributed memory concurrent computers are discussed, together with its performance on the Intel Delta system. Finally, approaches to the design of library interfaces are reviewed.

Making programming synonymous with programming for linear algebra libraries

2008

We have invested heavily in hardware development but software tools and methods to use the hardware continue to fall behind. The Sky is Falling. Panic is rampant. We respectfully disagree for the domain of linear algebra, which is a typical example used to motivate high performance. Over the last ten years, we have been developing software tools and methods targeting this domain specifically to stay ahead of architectural development. In this paper, we give an overview of these methods and software tools, developed as part of the FLAME project. We show how when applied to a new architecture (GPUs), they provide an out-of-the-box solution that attains high performance almost effortlessly.