Efficient computation of the matrix square root in heterogeneous platforms

Performance Evaluation of Matrix Products on Multicore Architectures

International Journal of Advances in Engineering and Management, 2022

Scientific applications tend to deal with large volumes of data, as is the case with simulations of natural phenomena, which demand high computational power. Using multi-core computers for processing contributes to performance improvement, and performing optimizations specific to the target architecture can improve performance further. Therefore, the objective of this work is to evaluate the impact of optimization techniques on application performance, as well as to test the performance of these techniques on multi-core and many-core architectures. For that, a matrix multiplication algorithm was chosen for the application of the Loop Interchange and Loop Tiling techniques. Furthermore, this algorithm was parallelized with OpenMP and CUDA to exploit the different processing cores of the computational architectures used. The results show that algorithms optimized for a target architecture gain performance: the gain reaches 11 times with sequential cache-oriented optimizations and 100 times with parallel OpenMP execution on Intel Xeon E5-2650 processors, and can be leveraged up to 1720 times on the NVIDIA TITAN Xp GPU.
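As a concrete illustration of the transformations this abstract evaluates (a minimal sketch, not the authors' code), the following contrasts the naive triple loop with a loop-interchanged, tiled version parallelized with OpenMP; the matrix order N and tile size TS are illustrative assumptions, not values from the paper.

// Minimal sketch of loop interchange (ijk -> ikj) plus loop tiling,
// parallelized on the host with OpenMP.
#include <omp.h>

constexpr int N  = 1024;   // matrix order (assumption)
constexpr int TS = 64;     // tile size tuned to the cache (assumption)

// Baseline i-j-k order: the innermost access B[k*N + j] strides through memory.
void matmul_naive(const float* A, const float* B, float* C) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}

// Interchanged (i-k-j) and tiled version: B is read with unit stride and each
// tile is reused while it is still in cache. C must be zeroed beforehand.
// OpenMP splits the outer tile rows across cores; each thread owns whole rows
// of C, so the accumulation is race-free.
void matmul_tiled_omp(const float* A, const float* B, float* C) {
    #pragma omp parallel for schedule(static)
    for (int ii = 0; ii < N; ii += TS)
        for (int kk = 0; kk < N; kk += TS)
            for (int i = ii; i < ii + TS; ++i)
                for (int k = kk; k < kk + TS; ++k) {
                    const float a = A[i * N + k];
                    for (int j = 0; j < N; ++j)
                        C[i * N + j] += a * B[k * N + j];
                }
}

The interchanged order keeps the innermost accesses unit-stride and the tiling reuses cached blocks of A and B, which is the kind of cache-oriented effect the reported sequential gains measure.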

A comparison of three commodity-level parallel architectures: Multi-core CPU, Cell BE and GPU

2010

The CPU has traditionally been the computational work horse in scientific computing, but we have seen a tremendous increase in the use of accelerators, such as Graphics Processing Units (GPUs), in the last decade. These architectures are used because they consume less power and offer higher performance than equivalent CPU solutions. They are typically also far less expensive, as more CPUs, and even clusters, are required to match their performance. Even though these accelerators are powerful in terms of floating point operations per second, they are considerably more primitive in terms of capabilities. For example, they cannot even open a file on disk without the use of the CPU. Thus, most applications can benefit from using accelerators to perform heavy computation, whilst running complex tasks on the CPU. This use of different compute resources is often referred to as heterogeneous computing, and we explore the use of heterogeneous architectures for scientific computing in this thesis. Through six papers, we present qualitative and quantitative comparisons of different heterogeneous architectures, the use of GPUs to accelerate linear algebra operations in MATLAB, and efficient shallow water simulation on GPUs. Our results show that the use of heterogeneous architectures can give large performance gains.
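A minimal CUDA sketch of the heterogeneous pattern described above (illustrative only, not taken from the thesis): the GPU runs a heavy numerical kernel asynchronously on a stream while the CPU performs work the accelerator cannot do on its own, such as file I/O, before the two are synchronized.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void heavy_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];   // stand-in for the heavy computation
}

int main() {
    const int n = 1 << 20;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    heavy_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);  // GPU: heavy compute

    // CPU: host-side task overlapping with the kernel (here, just opening a file).
    FILE* f = fopen("results.txt", "w");
    if (f) fclose(f);

    cudaStreamSynchronize(stream);   // join before using GPU results
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}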

Obtención de Altas Prestaciones en Computación de Carácter General sobre

2015

The increase in performance of the last generations of graphics processors (GPUs) has made this class of hardware a coprocessing platform of remarkable success in certain types of operations. In this paper we evaluate the performance of linear algebra and image processing routines, both on classical and unified GPU architectures and on traditional processors (CPUs). From this study, we gain insights on the properties that make an algorithm likely to deliver high performance on a GPU. Keywords: graphics processors (GPUs), general purpose computing on GPU, linear algebra, image processing, high performance.

Análisis de performance para el arreglo de sufijos sobre plataformas multi-core

Performance analysis helps to understand how a particular invocation of an algorithm executes. Using the information provided by specific tools such as the profiler Perf or the Performance Application Programming Interface (PAPI), the performance analysis process provides a bridging relationship between the algorithm execution and processor events, according to the metrics defined by the developer. It is also useful for finding performance limitations that depend exclusively on the code. Furthermore, changing an algorithm in order to optimize the code requires more than an understanding of the measured performance; it requires understanding the problem being solved. In this work we evaluate the performance achieved by a suffix array over a 32-core platform. Suffix arrays are efficient data structures for solving complex queries in a number of applications related to text databases, for instance biological databases. We perform experiments to evaluate hardware features directly aimed at parallelizing computation. Moreover, based on the results obtained with the performance evaluation tools, we propose an optimization technique to improve the use of the cache memory. In particular, we aim to reduce the number of cache replacements performed each time a new query is processed.
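The kind of counter-based measurement the abstract refers to can be sketched with PAPI's low-level API as follows; process_queries() is a hypothetical stand-in for the suffix-array query batch, the chosen events are illustrative, and error checking is kept minimal.

#include <papi.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for a batch of suffix-array queries: a strided walk
// over a large array, enough to generate measurable cache traffic.
static void process_queries() {
    const int n = 1 << 24;
    static std::vector<int> data(n, 1);
    volatile long sum = 0;
    for (int i = 0; i < n; i += 16) sum += data[i];
}

int main() {
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return EXIT_FAILURE;
    }

    int eventset = PAPI_NULL;
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_L2_TCM);   // L2 total cache misses
    PAPI_add_event(eventset, PAPI_TOT_CYC);  // total cycles

    long long counters[2] = {0, 0};
    PAPI_start(eventset);
    process_queries();                       // region of interest
    PAPI_stop(eventset, counters);

    printf("L2 misses: %lld, cycles: %lld\n", counters[0], counters[1]);
    return EXIT_SUCCESS;
}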

Computer architecture and high performance computing

Concurrency and Computation: Practice and Experience

The scope of the current special issue is broad and representative of the multidisciplinary nature of high performance and distributed computing, covering a wide range of subjects such as architecture issues, compiler optimization, analysis of HPC applications, job scheduling, and energy efficiency.

The title of the first paper is "An efficient virtual system clock for the wireless Raspberry Pi computer platform," by Diego L. C. Dutra, Edilson C. Corrêa, and Claudio L. Amorim [1]. In this paper, the authors present the design and experimental evaluation of an implementation of the RVEC virtual system clock in the Linux kernel for the EE (Energy-Efficient) Wireless Raspberry Pi (RasPi) platform. In the RasPi platform, the use of DVFS (Dynamic Voltage and Frequency Scaling) for reducing energy consumption hinders the direct use of the cycle count of the ARM11 processor core for building an efficient system clock. A distinct feature of RVEC is that it overcomes this obstacle, so that it can use the cycle count circuit for precise and accurate time measurements concurrently with the use of DVFS by the operating system on the ARM11 processor core.

In the second contribution, entitled "Portability with efficiency of the advection of BRAMS between multi-core and many-core architectures," the authors, Manoel Baptista Silva Junior, Jairo Panetta, and Stephan Stephany [2], show the feasibility of writing a single portable code embedding both programming interfaces (OpenMP and OpenACC) that delivers acceptable efficiency when executed on nodes with multi-core or many-core architectures. The code chosen as a case study is the advection of scalars, a part of the dynamics of the regional atmospheric model Brazilian Regional Atmospheric Modeling System (BRAMS). The dynamics of this model is hard to parallelize due to data dependencies between adjacent grid points. Single-node executions of the advection of scalars for different grid sizes using OpenMP or OpenACC yielded similar speed-ups, showing the feasibility of the proposed approach.

In the third contribution, entitled "SMT-based context-bounded model checking for CUDA programs," the authors (Phillipe Pereira, Higo Albuquerque, Isabela da Silva, Hendrio Marques, Felipe Monteiro, Ricardo Ferreira, and Lucas Cordeiro) [3] present the ESBMC-GPU tool, an extension to the Efficient SMT-Based Context-Bounded Model Checker (ESBMC) aimed at verifying Graphics Processing Unit (GPU) programs written for the Compute Unified Device Architecture (CUDA) platform. ESBMC-GPU uses an operational model, that is, an abstract representation of the standard CUDA libraries that conservatively approximates their semantics, in order to verify CUDA-based programs. It then explicitly explores the possible interleavings (up to the given context bound), while treating each interleaving itself symbolically. Additionally, ESBMC-GPU employs monotonic partial order reduction and two-thread analysis to prune the state-space exploration.

The fourth contribution, entitled "Contextual Spaces Re-Ranking: accelerating the Resort Ranked Lists step on heterogeneous systems," by
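A hedged sketch of the single-source idea described for the second contribution (not the BRAMS code): the same loop carries an OpenACC directive for many-core nodes and an OpenMP directive for multi-core nodes, and the preprocessor selects whichever interface the compiler enables. The routine name and the toy stencil are illustrative stand-ins.

// Single source, two programming interfaces: build with -acc for OpenACC
// or with -fopenmp for OpenMP; the unused directive is simply ignored.
void advect_scalar(const float* q, float* q_out, int n) {
#if defined(_OPENACC)
    #pragma acc parallel loop copyin(q[0:n]) copyout(q_out[0:n])
#else
    #pragma omp parallel for schedule(static)
#endif
    for (int i = 1; i < n - 1; ++i) {
        // toy upwind-style update standing in for the real advection stencil
        q_out[i] = q[i] - 0.5f * (q[i] - q[i - 1]);
    }
}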

Parallel Acceleration on Manycore Systems and Its Performance Analysis: OpenCL Case Study

2013

OpenCL (Open Computing Language) is a heterogeneous programming framework for developing applications that execute across a range of device types made by different vendors [11]; it maps efficiently to heterogeneous and homogeneous, single- or multiple-device systems consisting of CPUs, GPUs and other types of devices. OpenCL provides many benefits in the field of high-performance computing, and one of its most important aspects is portability. This paper presents a comparison of the performance of OpenCL executing a matrix multiplication over a manycore CPU and a GPU, together with a performance analysis. The analysis is carried out to understand manycore CPU and GPU performance characteristics. Such an analysis approach can be further extended to include more system parameters and refined to fit the actual execution time of parallelized applications. The simulation uses Ubuntu 12.04 on a desktop with an Intel i7 960 processor and an Nvidia GeForce GTX 460 graphics card.

Implementing Multithreaded Programs using CUDA for GPGPU to Solve Matrix Multiplication

JOURNAL OF XI'AN UNIVERSITY OF ARCHITECTURE & TECHNOLOGY, 2020

A significant issue in linear algebra is the performance variation across datasets of similar size in matrix-vector multiplication. Hence, a new storage design based on a two-dimensional blocking technique, called "blocked row-column" (BRC), can successfully cope with a variety of these challenges. The central aim of the present paper is to design and implement a multithreaded programming algorithm using CUDA for GPGPU and to analyze the performance of the CUDA program. Moreover, the paper also compares its performance with that of the OpenMP program from previous work. The algorithm is designed using the CUDA libraries in order to perform matrix multiplication on the GPU. Using the CUDA libraries and some of their functions, we optimize performance by using the maximum GPU block size for computing the matrix multiplication. The results show that the 2D-array CUDA design uses more memory than the 1D-array-based design. Therefore, the CUDA 1D array (a 2D array flattened to 1D) is the best format for parallel matrix multiplication considering the critical factors of high speed and minimal memory consumption. The paper concludes that the GPU version of matrix multiplication performs better than the parallel OpenMP version.
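An illustrative CUDA kernel (not the paper's code) for the flattened 1D layout the abstract favours: the logical 2D matrices are stored row-major in single arrays and indexed as row * n + col, and the launch helper uses a large 2D thread block in the spirit of the maximum-block-size tuning mentioned above.

#include <cuda_runtime.h>

__global__ void matmul1d(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];   // flattened 1D indexing
        C[row * n + col] = sum;
    }
}

// Launch with a large 2D block (32 x 32 = 1024 threads, the per-block limit
// on most NVIDIA GPUs); the grid covers the whole n x n output.
void launch_matmul1d(const float* dA, const float* dB, float* dC, int n) {
    dim3 block(32, 32);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    matmul1d<<<grid, block>>>(dA, dB, dC, n);
}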

Performance study of matrix computations using multi-core programming tools

2012

Basic matrix computations such as vector and matrix addition, dot product, outer product, matrix transpose, matrix-vector and matrix multiplication are very challenging computational kernels arising in scientific computing. In this paper, we parallelize these basic matrix computations using multi-core and parallel programming tools. Specifically, these tools are Pthreads, OpenMP, Intel Cilk++, Intel TBB, Intel ArBB, SMPSs, SWARM and FastFlow.
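As a hedged sketch of how two of these kernels look in one of the compared tools (OpenMP), the following shows a reduction-based dot product and a row-parallel matrix-vector product; the other tools express the same data-parallel patterns with their own constructs.

#include <omp.h>
#include <vector>

// Dot product: a reduction over element-wise products.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < (long)x.size(); ++i)
        sum += x[i] * y[i];
    return sum;
}

// Matrix-vector product y = A x, with A stored row-major as rows x cols.
void matvec(const std::vector<double>& A, const std::vector<double>& x,
            std::vector<double>& y, int rows, int cols) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < rows; ++i) {
        double acc = 0.0;
        for (int j = 0; j < cols; ++j)
            acc += A[(size_t)i * cols + j] * x[j];
        y[i] = acc;
    }
}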

Performance Optimization of Highly Computational Tasks Using CUDA

International Journal of Engineering Research and Technology (IJERT), 2013

https://www.ijert.org/performance-optimization-of-highly-computational-tasks-using-cuda
https://www.ijert.org/research/performance-optimization-of-highly-computational-tasks-using-cuda-IJERTV2IS2577.pdf

The paper analyses the features and generalized optimization methods for establishing strategies to improve software performance when using the Compute Unified Device Architecture (CUDA) implemented in the latest-generation GPUs. The performance of progressively optimizing a matrix multiplication and a prime-number search over very large data in CUDA is evaluated. A particular interest was to investigate how well CUDA improves the speed of computing compared to a Central Processing Unit (CPU).
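A hedged sketch of the kind of progressive CUDA optimization such studies apply to matrix multiplication: the naive one-output-per-thread kernel is replaced by a shared-memory tiled kernel that stages TILE x TILE blocks of the operands on-chip and reuses them. TILE and the launch shape are illustrative, not values from the paper.

#include <cuda_runtime.h>

#define TILE 16

// Shared-memory tiled matrix multiplication: each block computes a TILE x TILE
// patch of C, loading matching tiles of A and B into on-chip memory first.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();                      // tile fully loaded

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                      // done with this tile
    }
    if (row < n && col < n)
        C[row * n + col] = sum;
}

Launch with dim3 block(TILE, TILE) and a grid covering the n x n output; the shared-memory staging cuts global-memory traffic roughly by a factor of TILE compared with the naive kernel.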