Matrix Multiplication Algorithm Research Papers
Shared-memory processors show negative performance impulses (drawbacks) in certain regions when executing the basic matrix multiplication algorithm. In this paper we continue our analysis of the GPU memory hierarchy and the corresponding cache memory organization. We give a theoretical analysis of why a negative performance impulse appears for specific problem sizes. The main reason is the cache storage organization: the negative performance peak is caused by the mapping of matrix elements onto a single cache set instead of the whole cache. The experimental results confirm our theoretical analysis. We also propose a method to avoid the situations in which performance drawbacks appear.
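The set-mapping effect described in this abstract can be illustrated with a small sketch. The cache geometry (64-byte lines, 64 sets), the 8-byte element size, the row-major layout, and the function name `sets_touched_by_column` are all assumptions chosen for illustration; none of them come from the paper.

```python
def sets_touched_by_column(n, elem_size=8, line_size=64, num_sets=64):
    """Return the distinct cache sets hit when walking column 0 of an
    n x n row-major matrix. All parameters are illustrative assumptions."""
    sets = set()
    for row in range(n):
        addr = row * n * elem_size                 # byte offset of element (row, 0)
        sets.add((addr // line_size) % num_sets)   # set index of that cache line
    return sets


for n in (511, 512, 513):
    print(n, len(sets_touched_by_column(n)))
```

Under these assumed parameters, a column walk for n = 512 lands in a single set because the row stride is an exact multiple of the set count times the line size, while n = 511 and n = 513 spread over all 64 sets. Padding the row length is one common way to move away from such a size, though the abstract does not say which method the authors propose.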
This report presents a study of the performance of a square matrix multiplication algorithm. To test the algorithm we use a 431 node of the SeARCH Cluster. Throughout this work we explore three different implementations of the algorithm, with matrices of different sizes specifically selected to evaluate the performance of our algorithm. The internal CPU organization and the evaluation of bottlenecks are the main focus of this work. For our workgroup, the loop index order was defined as k-j-i. Modern CPU architectures implement vector computing features, which consist of using "large" processor registers to process multiple data elements at once in a clock cycle. This capability, commonly known as SIMD (Single Instruction, Multiple Data), is also explored as an optimization technique for our implementation. As the main tool in the experimental component of this work we use the Performance Application Programming Interface (PAPI), a C library for performance analysis. This library allows us to access the internal CPU counters of the 431 node, analyse different metrics, and draw conclusions about algorithm performance for different data sets.
- by João Miguel and +1
- Parallel Computing, Matrix Multiplication Algorithm
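The k-j-i loop order mentioned in the abstract above can be sketched as follows. This is only an illustration of the loop nesting in Python; the report's own implementations are not shown in the abstract, and the function name `matmul_kji` is hypothetical.

```python
def matmul_kji(a, b):
    """Multiply two square matrices (lists of lists) using the k-j-i loop order."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(n):
            b_kj = b[k][j]                 # constant across the innermost loop
            for i in range(n):
                # the inner loop walks down column k of a and column j of c
                c[i][j] += a[i][k] * b_kj
    return c


print(matmul_kji([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

With row-major storage, the innermost loop of this ordering strides through memory column-wise in both `a` and `c`, which is one reason loop order matters for cache behaviour.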
An enhanced technique for color image encryption based on random matrix key encoding is proposed. To encrypt the color image, it is separated into Red, Green, and Blue (R, G, B) channels. Each channel is encrypted using a technique called double random matrix key encoding, and three new coded image matrices are then constructed. To obtain a reconstructed image identical to the original on the receiving side, simple extraction and decryption operations can be applied. The results show that the proposed technique is powerful for color image encryption and decryption; MATLAB simulations were used to obtain the results. The proposed technique has high security features because each color component is treated separately using its own randomly generated double random matrix key, which makes hacking the three keys very difficult.
A spanning tree of a connected graph is a subgraph that is a tree and connects all the vertices together. A single graph can have many different spanning trees. We can also assign a weight to each edge, which is a number representing how unfavorable it is, and use this to assign a weight to a spanning tree by computing the sum of the weights of the edges in that spanning tree. A minimum spanning tree (MST), or minimum weight spanning tree, is then a spanning tree with weight less than or equal to the weight of every other spanning tree. More generally, any undirected graph (not necessarily connected) has a minimum spanning forest, which is a union of minimum spanning trees for its components. Our objective is to find the minimum cost (weight) spanning tree using an algorithm based on the weight matrix of the weighted graph.
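The abstract does not say which weight-matrix algorithm the authors use. As a hedged illustration of the general idea, the sketch below applies Prim's algorithm (a stand-in, not necessarily the paper's method) directly to a symmetric weight matrix; the function name `prim_mst` and the use of `math.inf` for a missing edge are assumptions.

```python
import math


def prim_mst(w):
    """Return (total_weight, edges) of a minimum spanning tree of a connected
    graph given by a symmetric weight matrix w; math.inf marks a missing edge."""
    n = len(w)
    in_tree = [False] * n
    best = [math.inf] * n     # cheapest known edge linking each vertex to the tree
    parent = [-1] * n
    best[0] = 0               # start growing the tree from vertex 0
    total, edges = 0, []
    for _ in range(n):
        # pick the cheapest vertex that is not yet in the tree
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        total += best[u]
        if parent[u] != -1:
            edges.append((parent[u], u))
        # relax the edges leaving u using row u of the weight matrix
        for v in range(n):
            if not in_tree[v] and w[u][v] < best[v]:
                best[v] = w[u][v]
                parent[v] = u
    return total, edges


INF = math.inf
weights = [[INF, 2, INF, 6],
           [2, INF, 3, 8],
           [INF, 3, INF, 5],
           [6, 8, 5, INF]]
print(prim_mst(weights))
```

For this four-vertex example the tree uses edges (0, 1), (1, 2), and (2, 3), with total weight 10.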
Matrix multiplication algorithms are among the most popular solutions used to demonstrate the functionality of parallel algorithms, and they have become an important tool for solving hard problems in a reasonable time. Solutions to this problem focus mainly on the pattern used to distribute the matrix data, so that the basic operations can be executed as independently as possible. Specifically, this document presents two solutions to this problem: the 1-D algorithm and the 2-D algorithm. For each solution, its characteristics and the results obtained in the tests are specified. This paper shows that, given enough parallel support in the environment in which the algorithms run, efficient and satisfactory results can be obtained.
The new barrier mode in Apache Spark allows embedding distributed deep learning training as a Spark stage to simplify the distributed training workflow. In Spark, a task in a stage doesn't depend on any other tasks in the same stage, and hence it can be scheduled independently. However, several algorithms require more sophisticated inter-task communications, similar to the MPI paradigm. By combining distributed message passing (using asynchronous network IO), OpenJDK's new auto-vectorization and Spark's barrier execution mode, we can add non-map/reduce based algorithms, such as Cannon's distributed matrix multiplication to Spark. We document an efficient distributed matrix multiplication using Cannon's algorithm, which improves significantly on the performance of the existing MLlib implementation. Used within a barrier task, the algorithm described herein results in an up to 24% performance increase on a 10,000x10,000 square matrix with a significantly lower memory footprint. Applications of efficient matrix multiplication include, among others, accelerating the training and implementation of deep convolutional neural network based workloads, and thus such efficient algorithms can play a groundbreaking role in faster, more efficient execution of even the most complicated machine learning tasks.
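The block-shifting pattern of Cannon's algorithm can be sketched in a single process. The code below is only a sequential simulation of the q x q process grid in plain Python; it uses no Spark, barrier-mode, or message-passing API, and the helper names (`split`, `add_into`, `cannon`) are hypothetical. It assumes the matrix size is divisible by q.

```python
def matmul(x, y):
    """Plain dense product of two small blocks (lists of lists)."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def add_into(c, d):
    """Accumulate block d into block c in place."""
    for i, row in enumerate(d):
        for j, val in enumerate(row):
            c[i][j] += val


def split(m, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    s = len(m) // q
    return [[[row[bj * s:(bj + 1) * s] for row in m[bi * s:(bi + 1) * s]]
             for bj in range(q)] for bi in range(q)]


def cannon(a, b, q):
    """Simulate Cannon's algorithm on a q x q grid of simulated processes."""
    n = len(a)
    s = n // q
    A, B = split(a, q), split(b, q)
    # initial skew: row i of A shifts left by i, column j of B shifts up by j
    A = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[[[0] * s for _ in range(s)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        # each simulated process multiplies the pair of blocks it currently holds
        for i in range(q):
            for j in range(q):
                add_into(C[i][j], matmul(A[i][j], B[i][j]))
        # shift A blocks one step left and B blocks one step up (with wraparound)
        A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    # reassemble the block grid into a single n x n matrix
    return [[C[i // s][j // s][i % s][j % s] for j in range(n)] for i in range(n)]


a = [[i * 4 + j for j in range(4)] for i in range(4)]
b = [[(i + j) % 3 for j in range(4)] for i in range(4)]
print(cannon(a, b, 2) == matmul(a, b))
```

In a distributed setting each block would live on a separate task and the two shift steps would become neighbour-to-neighbour messages, which is the kind of inter-task communication the abstract says barrier mode makes possible.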
Normally, program execution spends most of its time in loops. Automated test data generation devotes special attention to loops for better coverage. Automated test data generation for programs having loops with a variable number of iterations and variable-length arrays is a challenging problem, because the number of paths may increase exponentially with the array size for some programming constructs, such as merge sort. We propose a method that finds heuristics for different types of programming constructs with loops and arrays. Linear search, bubble sort, merge sort, and matrix multiplication programs are included in an attempt to highlight the difference in execution between a single loop, a variable-length array, and nested loops with one- and two-dimensional arrays. We have used two parameters/heuristics to predict the minimum number of iterations required for generating automated test data: the longest path level (k_L) and the saturation level (k_S). Our procedure includes the instrumentation of source code at the elementary level, followed by the application of random inputs until all feasible paths, or all paths having the longest length, are collected. Duplicate paths are avoided by using a filter. Our test data are the random numbers that cover each feasible path.
Abstract. Computer systems are used in many areas of knowledge, and their performance is fundamental to greater precision in the results. For complex problems, applying optimizations becomes necessary. The objective of this work is to analyze the performance of matrix multiplication algorithms and to speed up this process through different methods. The computational gain was 60 times greater after these optimizations.
In this paper we introduce an efficient algorithm for the multiplication of trigintaduonions. The direct multiplication of two trigintaduonions requires 1024 real multiplications and 992 real additions. We show how to compute a trigintaduonion product with 498 real multiplications and 943 real additions. In synthesizing the algorithm, we use the fact that trigintaduonion multiplication may be represented by a vector-matrix product. Such a representation makes it possible to discover repeating elements in the matrix structure and to use specific properties of their mutual placement to decrease the number of real multiplications needed to compute the product of two trigintaduonions.
Numerical solutions to viscous incompressible flow between two parallel plates have been developed over the last half century. Multi-level techniques have been successfully developed and applied to the same problem, achieving significant speed-ups. Due to the complexity of the Navier-Stokes equation, obtaining analytical solutions for viscous flow problems is sometimes difficult. An exact solution of the Navier-Stokes equation is possible for some classical cases of steady, laminar, viscous, and incompressible flows. Numerical research in this area will help researchers develop techniques that yield accurate and faster solutions. The present work estimates the velocity distribution in Couette flow between two parallel plates using Newtonian fluids as the lubricant. The partial differential equations that result from the Navier-Stokes equation after the boundary conditions are substituted are solved by the Finite Difference Method. An algorithm named "Tri Diagonal Matrix Algorithm" is generated for the same. Using MATLAB, numerical analysis is performed to obtain the velocity distribution between the two parallel plates.
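A tridiagonal matrix algorithm of the kind named in this abstract can be sketched as below. The paper works in MATLAB and its exact equations are not given here, so this Python version rests on assumptions: steady plane Couette flow with no pressure gradient (so the discretized equation is u[i-1] - 2u[i] + u[i+1] = 0), a stationary lower plate, and an upper plate moving at speed U = 1. The function name `thomas` is hypothetical.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system with the Thomas algorithm.
    a: sub-diagonal (a[0] unused), b: main diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side."""
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # forward sweep: eliminate the sub-diagonal
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # back substitution
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


# Assumed test case: 4 interior grid points, lower plate at rest, upper plate at U = 1.
N, U = 4, 1.0
sub = [0.0] + [1.0] * (N - 1)
main = [-2.0] * N
sup = [1.0] * (N - 1) + [0.0]
rhs = [0.0] * (N - 1) + [-U]      # the moving-plate boundary value moves to the right-hand side
velocity = thomas(sub, main, sup, rhs)
print(velocity)
```

Under these assumptions the computed velocities lie on the straight line between the two plate speeds, which matches the known linear profile of plane Couette flow and serves as a quick sanity check on the solver.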