Optimizing dense linear algebra algorithms on heterogeneous machines

Linear algebra algorithms in a heterogeneous cluster of personal computers

… Workshop, 2000 (HCW …, 2000)

Cluster computing is presently a major research area, mostly for high-performance computing. The work presented here applies cluster computing on a small scale, where a virtual machine is composed of a small number of off-the-shelf personal computers connected by a low-cost network. A methodology to determine the optimal number of processors to use in a computation is presented, together with the speedup results obtained for matrix-matrix multiplication and for the symmetric QR algorithm for eigenvector computation, both significant building blocks for applications in the target image processing and analysis domain. The load-balancing strategy is also addressed.
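The abstract's central question, how many processors to use, can be illustrated with a toy cost model (a hypothetical sketch; the paper's actual methodology and cost model are not reproduced here, and the function names are invented for illustration): on a low-cost network, communication overhead grows with the number of processors, so the predicted run time has a minimum at some finite p.

```python
# Toy model: T(p) = W/p + c*p, i.e. the work W is shared by p CPUs
# while communication cost grows linearly with p. This is only an
# illustration of why an optimum exists on a slow network, not the
# model used in the paper.

def predicted_time(work, comm_cost, p):
    """Predicted run time on p processors under the toy model."""
    return work / p + comm_cost * p

def optimal_processors(work, comm_cost, max_p):
    """Exhaustively pick the p in 1..max_p minimizing predicted time."""
    return min(range(1, max_p + 1),
               key=lambda p: predicted_time(work, comm_cost, p))
```

Under this model the optimum sits near sqrt(W/c), so with W = 100 and c = 1 the best choice out of 16 machines is 10, and adding the remaining 6 would only slow the computation down.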

Automatic Optimisation of Parallel Linear Algebra Routines in Systems with Variable Load

2003

In this work, the architecture of an automatically tuned linear algebra library proposed in previous works is extended to adapt it to platforms where both the CPU load and the network traffic vary. During installation on a system, the linear algebra routines are tuned automatically to the system conditions: the hardware characteristics and the basic libraries used by the routines. At run time, the parameters that define the system characteristics are adjusted to the actual load of the platform. The design methodology is analysed with a block LU factorisation. Variants of a sequential and a parallel version of this routine on a logical rectangular mesh of processors are considered. The behaviour of the algorithm is studied with message passing, using MPI on a cluster of PCs. The experiments show that the configurable parameters of the linear algebra routines can be adjusted at run time despite the variability of the environment.
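The block LU factorisation used as the case study has exactly the kind of configurable parameter the paper's tuner would adjust: the block size. A minimal numpy sketch of a right-looking blocked LU (without pivoting, and without the paper's tuning machinery; purely illustrative) shows where that parameter enters:

```python
import numpy as np

def block_lu(A, b=2):
    """Right-looking block LU without pivoting (illustrative only;
    assumes nonzero pivots). b is the configurable block size an
    auto-tuner would choose for the target platform."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # Panel factorization: unblocked LU on columns k..e-1.
        for j in range(k, e):
            A[j+1:n, j] /= A[j, j]
            A[j+1:n, j+1:e] -= np.outer(A[j+1:n, j], A[j, j+1:e])
        # Triangular solve: overwrite A[k:e, e:n] with L11^{-1} * A12.
        for j in range(k, e):
            A[j+1:e, e:n] -= np.outer(A[j+1:e, j], A[j, e:n])
        # Trailing-matrix update: the BLAS-3 bulk of the work.
        A[e:n, e:n] -= A[e:n, k:e] @ A[k:e, e:n]
    return A
```

The trailing update is a matrix multiply whose efficiency depends strongly on b, which is why the block size is worth tuning per system.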

Data parallel scheduling of operations in linear algebra on heterogeneous clusters

Proceedings of the 5th WSEAS …, 2005

The aim of data- and task-parallel scheduling for dense linear algebra kernels is to minimize the processing time of an application composed of several linear algebra kernels. The scheduling strategy presented here combines the task parallelism used when scheduling independent tasks with the data parallelism used within linear algebra kernels. This problem has been studied for scheduling independent tasks on homogeneous machines; here a methodology for heterogeneous clusters is proposed, and it is shown that significant improvements can be achieved with this strategy.
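The combination of task and data parallelism can be sketched with a simple allocator (a hypothetical illustration, not the paper's algorithm): each independent kernel gets at least one processor, and the remaining processors are handed, one at a time, to whichever kernel currently dominates the makespan.

```python
def allocate(works, total_procs):
    """Greedy processor allocation across independent kernels.
    Toy model: kernel i on p processors takes works[i] / p time
    (perfect data parallelism, homogeneous processors). Assumes
    total_procs >= len(works)."""
    procs = [1] * len(works)
    for _ in range(total_procs - len(works)):
        # Give one more processor to the slowest kernel.
        i = max(range(len(works)), key=lambda j: works[j] / procs[j])
        procs[i] += 1
    return procs
```

For three kernels with work 90, 30 and 30 on five processors, the heavy kernel receives three processors and the makespan drops to 30, versus 90 if every kernel ran on a single processor.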

Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters

2005

This paper presents a self-optimization methodology for parallel linear algebra routines on heterogeneous systems. For each routine, a series of decisions is taken automatically in order to obtain an execution time close to the optimum (without rewriting the routine's code). These decisions include: the number of processes to generate, the heterogeneous distribution of these processes over the network of processors, the logical topology of the generated processes, ... To reduce the search space of such decisions, different heuristics have been used. The experiments were performed with a parallel LU factorization routine similar to the ScaLAPACK one, and good results have been obtained on different heterogeneous platforms.
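One of the decisions listed, distributing a fixed number of processes over processors of unequal speed, can be sketched with proportional assignment plus largest-remainder rounding (an illustrative heuristic under assumed relative-speed inputs, not necessarily the paper's):

```python
def distribute(n_procs, speeds):
    """Assign n_procs processes to nodes proportionally to their
    relative speeds, using largest-remainder rounding so the total
    comes out exact. Illustrative sketch only."""
    total = sum(speeds)
    shares = [n_procs * s / total for s in speeds]
    counts = [int(x) for x in shares]
    # Hand the leftover processes to the largest fractional parts.
    leftover = n_procs - sum(counts)
    order = sorted(range(len(speeds)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```

A node twice as fast as another then hosts twice as many processes, which is the basic load-balancing idea behind heterogeneous process distribution.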

Methods for Implementing Linear Algebra Algorithms on High Performance Architectures

In this paper we consider the data distribution and data movement issues related to the solution of basic linear algebra problems on high-performance systems. The algorithms we discuss in detail are the Gauss and Gauss-Jordan methods for solving a system of linear equations, Cholesky's algorithm for LL^T factorization, and the QR factorization algorithm using Householder transformations. It is shown that all these algorithms can be executed efficiently, with partial pivoting, on a parallel system with simple and regular links. A detailed implementation of the algorithms on a systolic-type architecture is described using a simple parallel language. Both the theoretical analysis and the simulation results show speedups close to optimal on moderately large problems. 1 Introduction: In many scientific and practical computations, linear algebra algorithms are the most time-consuming tasks. For example, the simulation of multiphase fluid flows in porous media and other comp...
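Of the algorithms listed, Householder QR is the least familiar to sketch; a minimal unblocked numpy version (illustrative only, with no pivoting and none of the systolic data movement the paper actually studies, and assuming numerically well-behaved columns) shows the reflector structure the parallel implementations must distribute:

```python
import numpy as np

def householder_qr(A):
    """Unblocked Householder QR (illustrative sketch only)."""
    A = A.astype(float).copy()
    m, n = A.shape
    Q = np.eye(m)
    for k in range(n):
        x = A[k:, k]
        # Build the reflector v so that H x = -sign(x0) * ||x|| * e1.
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        v /= np.linalg.norm(v)
        # Apply H = I - 2 v v^T to the trailing columns, accumulate Q.
        A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:])
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, np.triu(A)
```

Each reflector touches every trailing column, which is exactly the regular, local communication pattern that makes the algorithm a good fit for the simple-link architectures the paper targets.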

Performance study of matrix computations using multi-core programming tools

2012

Basic matrix computations such as vector and matrix addition, dot product, outer product, matrix transpose, matrix-vector multiplication and matrix multiplication are challenging computational kernels arising in scientific computing. In this paper, we parallelize these basic matrix computations using multi-core and parallel programming tools. Specifically, these tools are Pthreads, OpenMP, Intel Cilk++, Intel TBB, Intel ArBB, SMPSs, SWARM and FastFlow.
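All of the tools compared rely on the same basic decomposition: split the output rows of a kernel among worker threads. A Python sketch of that pattern for matrix multiplication (illustrative only; the paper's implementations are in the listed C/C++ tools, and in CPython the GIL limits actual speedup):

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_rows(A, B, lo, hi):
    """Compute rows lo..hi-1 of A @ B (pure-Python lists)."""
    cols, inner = len(B[0]), len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)] for i in range(lo, hi)]

def parallel_matmul(A, B, workers=4):
    """Split the row range among a thread pool, in the style of an
    OpenMP parallel-for over rows. Shows the decomposition pattern
    only, not a performant implementation."""
    n = len(A)
    step = (n + workers - 1) // workers
    ranges = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: matmul_rows(A, B, *r), ranges)
    return [row for part in parts for row in part]
```

Because each output row depends only on A's row and all of B, the row blocks are independent and need no synchronization beyond the final join, which is why every tool in the comparison handles these kernels well.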

A Parallel Implementation of an Image Processing Algorithm

This paper describes the parallel implementation of an algorithm used for time-series classification of remotely sensed image data. Such classifications form the basis of many operational programs aimed at monitoring land use and land-cover change, and monitoring programs typically comprise data on the order of hundreds of megabytes to terabytes. The use of time-series models for classification has led to large increases in the accuracy of land-cover change estimates, at the cost of increased computation, on the order of a CPU-year for larger monitoring programs. In this paper we explore the use of clusters of computers to reduce computation times. We describe our port using the MPI standard, and give timing performance results on a homogeneous and a heterogeneous cluster, examples of dedicated and opportunistic clusters respectively. Results for both clusters show good CPU efficiency, leading to useful reductions in processing time. We describe issues encountered during the port and subtle algorithm design issues associated with the two different clusters.