Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters

Optimizing dense linear algebra algorithms on heterogeneous machines

2006

This paper addresses the execution of inherently sequential linear algebra algorithms, namely LU factorization, tridiagonal reduction, and the symmetric QR factorization algorithm used for eigenvector computation, which are significant building blocks for applications in our target image processing and analysis domain. These algorithms are harder to optimize because the computational load of the data matrix columns increases with their index, requiring fine-tuned load assignment and distribution. We present an efficient methodology for determining the optimal number of processors to use in a computation, as well as a new static load distribution strategy that achieves better results than other algorithms developed for the same purpose.
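
As an illustration of the kind of static distribution such algorithms call for, the sketch below (mine, not the paper's strategy) splits the columns of an n-column matrix into contiguous ranges so that, assuming the cost of column j grows linearly with its index, each processor's share of the total work is proportional to its relative speed.

```python
# Illustrative sketch only (not the paper's strategy): assign contiguous
# column ranges whose total cost matches each processor's relative speed,
# assuming the cost of column j grows linearly with j.

def distribute_columns(n, speeds):
    """Return one (start, end) column range per processor, end exclusive."""
    total_speed = sum(speeds)
    total_work = n * (n + 1) / 2            # sum of per-column costs 1..n
    ranges, start, done, target = [], 0, 0.0, 0.0
    for s in speeds[:-1]:
        target += total_work * s / total_speed
        end = start
        while end < n and done < target:
            done += end + 1                 # cost of column `end`
            end += 1
        ranges.append((start, end))
        start = end
    ranges.append((start, n))               # last processor takes the rest
    return ranges

# Three processors, the last one four times faster than the first.
print(distribute_columns(1000, [1.0, 2.0, 4.0]))
```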

Automatic Optimisation of Parallel Linear Algebra Routines in Systems with Variable Load

2003

In this work, the architecture of an automatically tuned linear algebra library proposed in previous work is extended to adapt it to platforms where both the CPU load and the network traffic vary. During installation on a system, the linear algebra routines are tuned automatically to the system conditions: the hardware characteristics and the basic libraries used by the linear algebra routines. At run time, the parameters that define the system characteristics are adjusted to the actual load of the platform. The design methodology is analysed with a block LU factorisation. Variants of a sequential and a parallel version of this routine on a logical rectangular mesh of processors are considered. The behaviour of the algorithm is studied with message passing, using MPI on a cluster of PCs. The experiments show that the configurable parameters of the linear algebra routines can be adjusted at run time despite the variability of the environment.
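
The abstract does not give the cost model, but the run-time adjustment idea can be pictured with a toy block-LU model in which installation-time cost coefficients are rescaled by the observed CPU and network load before a block size is chosen; every function name and constant below is a placeholder of mine, not the paper's model.

```python
# Toy illustration of run-time parameter adjustment; the cost model and all
# constants are placeholders, not those used in the paper.

def modeled_time(n, b, gamma3, gamma2, alpha, beta, cpu_load, net_load):
    """Rough block-LU time: BLAS-3 updates + BLAS-2 panel work + messaging."""
    update  = (2.0 * n**3 / 3.0) * gamma3 * cpu_load   # trailing-matrix updates
    panel   = (n**2 * b / 2.0) * gamma2 * cpu_load     # slower panel factorizations
    latency = (n / b) * alpha * net_load               # one broadcast per panel
    volume  = n**2 * beta * net_load                   # total panel data moved
    return update + panel + latency + volume

def pick_block_size(n, installed, cpu_load, net_load,
                    candidates=(16, 32, 64, 128, 256)):
    """Re-evaluate the installation-time model under the current load."""
    return min(candidates,
               key=lambda b: modeled_time(n, b, cpu_load=cpu_load,
                                          net_load=net_load, **installed))

# Installation-time coefficients (per flop / per message / per element),
# rescaled at run time by the observed CPU and network load factors.
installed = dict(gamma3=2e-10, gamma2=2e-9, alpha=1e-3, beta=1e-8)
print(pick_block_size(4096, installed, cpu_load=1.5, net_load=2.0))
```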

Data parallel scheduling of operations in linear algebra on heterogeneous clusters

Proceedings of the 5th WSEAS …, 2005

The aim of combined data and task parallel scheduling for dense linear algebra kernels is to minimize the processing time of an application composed of several linear algebra kernels. The scheduling strategy presented here combines the task parallelism used when scheduling independent tasks with the data parallelism used within linear algebra kernels. This problem has been studied for scheduling independent tasks on homogeneous machines. Here, a methodology for heterogeneous clusters is proposed, and it is shown that significant improvements can be achieved with this strategy.
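
A generic way to combine the two forms of parallelism for independent kernels (a sketch of the general idea, not the paper's methodology) is to treat each kernel as a moldable task with a runtime model t(k, p) and repeatedly hand one more processor to the kernel that currently dominates the makespan:

```python
import heapq

# Generic sketch (not the paper's methodology): allocate processors to
# independent, moldable linear algebra kernels so as to shrink the makespan.

def allocate(kernels, runtime, total_procs):
    alloc = {k: 1 for k in kernels}                   # every kernel gets one proc
    heap = [(-runtime(k, 1), k) for k in kernels]     # max-heap on current runtime
    heapq.heapify(heap)
    for _ in range(total_procs - len(kernels)):
        _, k = heapq.heappop(heap)                    # kernel on the critical path
        alloc[k] += 1
        heapq.heappush(heap, (-runtime(k, alloc[k]), k))
    makespan = max(runtime(k, p) for k, p in alloc.items())
    return alloc, makespan

# Toy model: perfect speedup of each kernel's work plus a per-processor overhead.
work = {"gemm": 100.0, "lu": 60.0, "qr": 40.0}
t = lambda k, p: work[k] / p + 0.5 * p
print(allocate(list(work), t, total_procs=12))
```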

Heterogeneous Distribution of Computations Solving Linear Algebra Problems on Networks of Heterogeneous Computers

Journal of Parallel and Distributed Computing, 2001

This paper presents and analyzes two different strategies of heterogeneous distribution of computations for solving dense linear algebra problems on heterogeneous networks of computers. The first strategy is based on heterogeneous distribution of processes over processors and homogeneous block cyclic distribution of data over the processes. The second is based on homogeneous distribution of processes over processors and heterogeneous block cyclic distribution of data over the processes. Both strategies were implemented in the mpC language, a dedicated parallel extension of ANSI C for efficient and portable programming of heterogeneous networks of computers. The first strategy was implemented using calls to ScaLAPACK; the second strategy was implemented with calls to LAPACK and BLAS. Cholesky factorization on a heterogeneous network of workstations is used to demonstrate that the heterogeneous distributions have an advantage over the traditional homogeneous distribution.
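
The first strategy can be illustrated with a short sketch (hostnames and speeds below are made-up example values): a number of processes proportional to each computer's relative speed is started on that computer, and an ordinary homogeneous block cyclic distribution is then used over the resulting set of processes, so faster machines hold proportionally more of the matrix.

```python
# Illustrative sketch of the first strategy: processes per host proportional
# to relative speed; the data distribution over processes stays homogeneous.

def processes_per_host(speeds, total_processes):
    total_speed = sum(speeds.values())
    raw = {h: total_processes * s / total_speed for h, s in speeds.items()}
    counts = {h: int(round(r)) for h, r in raw.items()}
    # Fix rounding drift so the counts sum exactly to total_processes.
    while sum(counts.values()) > total_processes:
        h = max(counts, key=lambda h: counts[h] - raw[h])
        counts[h] -= 1
    while sum(counts.values()) < total_processes:
        h = min(counts, key=lambda h: counts[h] - raw[h])
        counts[h] += 1
    return counts

print(processes_per_host({"fast": 4.0, "medium": 2.0, "slow": 1.0},
                         total_processes=14))
```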

mpC + ScaLAPACK = Efficient Solving Linear Algebra Problems on Heterogeneous Networks

Lecture Notes in Computer Science, 1999

The paper presents experience of using mpC to accelerate ScaLAPACK applications on heterogeneous networks of computers. mpC is a language specially designed for parallel programming on heterogeneous networks. It has facilities for distributing the participating processes over processors in accordance with the performance of the latter. An mpC application carrying out Cholesky factorization on a heterogeneous network of workstations is used to demonstrate that the heterogeneous process distribution has a significant advantage over the traditional homogeneous distribution. The application is implemented using calls to ScaLAPACK routines by means of the mpC-ScaLAPACK interface.

Automatic Generation of Tiled and Parallel Linear Algebra Routines

Exploiting parallelism in modern hardware is necessary to achieve high performance in linear algebra routines. Unfortunately, modern architectures are complex, so many optimization choices must be considered to find the combination that delivers the best performance. Exploring optimizations by hand is costly and time consuming. Auto-tuning systems offer a method for quickly generating and evaluating optimization choices. In this paper we describe a data-parallel extension to our auto-tuning system, Build to Order BLAS. We introduce an abstraction for partitioning matrices and vectors, and an algorithm for partitioning linear algebra operations. We generate code for shared-memory machines using Pthreads. Results from the prototype show that our auto-tuning approach is competitive with existing state-of-the-art parallel libraries. We achieve speedups of up to 2.7 times over MKL and up to 6 times over our best-optimized serial code on an Intel Core 2 Quad.
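
The partitioning abstraction can be pictured with a small sketch (Python for illustration only; the actual system generates C with Pthreads): a fused operation such as y = A x + z is split by rows and each partition is evaluated on its own thread.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Sketch of row-wise partitioning of a fused operation y = A x + z;
# not the Build to Order BLAS code generator itself.

def fused_kernel(A_part, x, z_part):
    # The fused loop a code generator would emit for one partition.
    return A_part @ x + z_part

def parallel_gemv_add(A, x, z, num_threads=4):
    bounds = np.linspace(0, A.shape[0], num_threads + 1, dtype=int)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        parts = pool.map(fused_kernel,
                         [A[bounds[i]:bounds[i + 1]] for i in range(num_threads)],
                         [x] * num_threads,
                         [z[bounds[i]:bounds[i + 1]] for i in range(num_threads)])
    return np.concatenate(list(parts))

A, x, z = np.random.rand(1000, 1000), np.random.rand(1000), np.random.rand(1000)
assert np.allclose(parallel_gemv_add(A, x, z), A @ x + z)
```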

Scalability Issues Affecting the Design of a Dense Linear Algebra Library

Journal of Parallel and Distributed Computing, 1994

This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed-memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library, which extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution used in all three factorization algorithms is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of the algorithms' performance are presented to indicate which factors in the design of the algorithms have an impact on scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor, and that the measured timings are consistent with the performance model. The contribution of this paper goes beyond reporting our experience: our implementations are available in the public domain.
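
For reference, the 2D block cyclic distribution mentioned above maps a global matrix index to an owning process and a local index as in the following sketch (0-based indices, zero process-grid offsets assumed).

```python
# Sketch of the 2D block cyclic data distribution used by ScaLAPACK routines:
# the mapping is applied independently to the row and column dimensions.

def block_cyclic(global_idx, block_size, num_procs):
    block = global_idx // block_size            # which block the index falls in
    proc = block % num_procs                    # blocks are dealt out cyclically
    local_block = block // num_procs            # position among this proc's blocks
    local_idx = local_block * block_size + global_idx % block_size
    return proc, local_idx

def owner(i, j, mb, nb, P, Q):
    """Owning process (p, q) and local indices of global element (i, j)."""
    (p, li), (q, lj) = block_cyclic(i, mb, P), block_cyclic(j, nb, Q)
    return (p, q), (li, lj)

# Element (37, 1025) with 8x8 blocks on a 4x4 process grid.
print(owner(37, 1025, mb=8, nb=8, P=4, Q=4))
```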