Matrix Multiplication Algorithm Research Papers

Matrix multiplication is one of the most popular problems used to demonstrate the capabilities of parallel algorithms, and it has become an important tool for solving hard problems in a reasonable time. Solutions to this problem focus mainly on the pattern used to distribute the matrix data, so that the basic operations can be executed as independently as possible. Specifically, this document presents two solutions to the problem: the 1-D algorithm and the 2-D algorithm. For each solution, its characteristics and the results obtained in testing are described. This paper shows that, given sufficient parallel support in the execution environment, the algorithms yield efficient and satisfactory results.
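As a rough illustration of the 1-D scheme described above (a sketch under simple assumptions, not the paper's actual implementation): A is split row-wise among workers, each of which holds a full copy of B, so every output row can be computed independently.

```python
from multiprocessing.dummy import Pool  # thread pool; a process pool works the same way


def matmul_row(args):
    """Compute one row of C = A x B; rows are independent, so they parallelise."""
    row, B = args
    cols = len(B[0])
    return [sum(row[k] * B[k][j] for k in range(len(B))) for j in range(cols)]


def matmul_1d(A, B, workers=4):
    """1-D decomposition: each task receives one whole row of A plus all of B."""
    with Pool(workers) as pool:
        return pool.map(matmul_row, [(row, B) for row in A])


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_1d(A, B))  # [[19, 22], [43, 50]]
```

A 2-D decomposition would instead split both matrices into blocks on a process grid, reducing the amount of B each worker must hold.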

An enhanced technique for color image encryption based on random matrix key encoding is proposed. To encrypt the color image, it is first separated into Red, Green, and Blue (R, G, B) channels. Each channel is encrypted using a technique called double random matrix key encoding, producing three new coded image matrices. On the receiving side, simple extraction and decryption operations reconstruct an image identical to the original. The results, obtained using MATLAB simulations, show that the proposed technique is effective for color image encryption and decryption. The technique has strong security properties because each color component is treated separately with its own randomly generated double random matrix key, which makes recovering all three keys very difficult.
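The abstract does not specify the encoding itself, so the following is only a hedged stand-in for the general idea: two random key matrices per channel, combined with the pixels by XOR (which is self-inverse, so decryption repeats the same operation). The function names and the XOR-based scheme are illustrative assumptions, not the authors' method.

```python
import random


def make_key(h, w, seed):
    # hypothetical key generator: one random byte matrix per key
    rng = random.Random(seed)
    return [[rng.randrange(256) for _ in range(w)] for _ in range(h)]


def apply_keys(channel, k1, k2):
    # XOR the channel with both key matrices; the same call also decrypts
    return [[p ^ a ^ b for p, a, b in zip(row, r1, r2)]
            for row, r1, r2 in zip(channel, k1, k2)]


def encrypt_rgb(image, seeds):
    # each colour channel gets its own pair of keys, as the abstract describes
    return [apply_keys(ch, make_key(len(ch), len(ch[0]), s1),
                           make_key(len(ch), len(ch[0]), s2))
            for ch, (s1, s2) in zip(image, seeds)]


img = [[[10, 20]], [[30, 40]], [[50, 60]]]   # tiny 1x2 R, G, B channels
seeds = [(1, 2), (3, 4), (5, 6)]
enc = encrypt_rgb(img, seeds)
dec = encrypt_rgb(enc, seeds)                # same operation restores the image
print(dec == img)  # True
```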

Abstract. Computational systems are used in many areas of knowledge, and their performance is fundamental to obtaining more precise results. For complex problems, applying optimizations becomes necessary. The goal of this work is to analyze the performance of matrix multiplication algorithms and to speed up this process using different methods. The computational gain after these optimizations was a 60-fold speedup.
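The abstract does not say which optimizations produced the 60-fold gain; a common first step for matrix multiplication is reordering the loops for better memory locality, sketched below under that assumption.

```python
def matmul_ijk(A, B, n):
    """Naive i-j-k order: the inner loop walks a column of B,
    which has poor locality in row-major layouts such as C's."""
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_ikj(A, B, n):
    """i-k-j order: the inner loop walks rows of B and C, so a compiled
    version streams through memory sequentially (a common easy optimization;
    the effect is small in pure Python but large in C or Fortran)."""
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a_ik = A[i][k]
            for j in range(n):
                C[i][j] += a_ik * B[k][j]
    return C
```

Both orders compute the same product; only the memory access pattern differs.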

A spanning tree of a connected graph is a subgraph that is a tree and connects all the vertices together. A single graph can have many different spanning trees. We can also assign a weight to each edge, a number representing how unfavorable it is, and use this to assign a weight to a spanning tree by computing the sum of the weights of its edges. A minimum spanning tree (MST), or minimum-weight spanning tree, is then a spanning tree whose weight is less than or equal to the weight of every other spanning tree. More generally, any undirected graph (not necessarily connected) has a minimum spanning forest, which is a union of minimum spanning trees of its connected components. Our objective is to find a minimum-cost (minimum-weight) spanning tree using an algorithm based on the weight matrix of the weighted graph.
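A weight-matrix-driven MST algorithm of the kind described can be sketched with Prim's algorithm, one standard choice that works directly on the weight matrix (the paper's specific algorithm may differ):

```python
import math


def prim_mst(W):
    """Prim's algorithm driven by the weight matrix W.
    W[i][j] is the edge weight, or math.inf when no edge exists."""
    n = len(W)
    in_tree = [False] * n
    dist = [math.inf] * n   # cheapest edge connecting each vertex to the tree
    dist[0] = 0
    total = 0
    for _ in range(n):
        # pick the cheapest vertex not yet in the tree
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: dist[v])
        in_tree[u] = True
        total += dist[u]
        for v in range(n):
            if not in_tree[v] and W[u][v] < dist[v]:
                dist[v] = W[u][v]
    return total


INF = math.inf
W = [[INF, 2, 3, INF],
     [2, INF, 1, 4],
     [3, 1, INF, 5],
     [INF, 4, 5, INF]]
print(prim_mst(W))  # 7  (edges 0-1, 1-2, 1-3)
```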

Normally, program execution spends most of its time in loops, so automated test data generation devotes special attention to loops for better coverage. Automated test data generation for programs with variable-iteration loops and variable-length arrays is a challenging problem, because the number of paths may increase exponentially with array size for some programming constructs, such as merge sort. We propose a method that finds heuristics for different types of programming constructs with loops and arrays. Linear search, bubble sort, merge sort, and matrix multiplication programs are included to highlight the differences in execution between a single loop, a variable-length array, and nested loops with one- and two-dimensional arrays. We use two parameters/heuristics to predict the minimum number of iterations required for generating automated test data: the longest path level (k_L) and the saturation level (k_S). Our work proceeds by instrumenting the source code at the elementary level, then applying random inputs until all feasible paths, or all paths containing the longest paths, are collected; duplicate paths are avoided by using a filter. Our test data consists of the random inputs that cover each feasible path.
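A minimal sketch of the random-input, duplicate-path-filtering idea, using linear search as the instrumented program. The instrumentation style and helper names are illustrative assumptions, not the paper's method.

```python
import random


def linear_search(arr, key, path):
    """Linear search instrumented at the elementary level:
    every branch outcome is appended to the path signature."""
    for i, x in enumerate(arr):
        if x == key:
            path.append(('hit', i))
            return i
        path.append(('miss', i))
    path.append(('not_found',))
    return -1


def generate_tests(max_len=3, rounds=2000, seed=0):
    """Feed random inputs; keep one input per distinct path (the duplicate filter)."""
    rng = random.Random(seed)
    seen = {}
    for _ in range(rounds):
        arr = [rng.randrange(4) for _ in range(rng.randrange(max_len + 1))]
        key = rng.randrange(4)
        path = []
        linear_search(arr, key, path)
        seen.setdefault(tuple(path), (arr, key))  # duplicate paths are discarded
    return seen


tests = generate_tests()
print(len(tests))  # number of distinct feasible paths found
```

The saturation level k_S would correspond to the point where additional random rounds stop producing new paths.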

This report is the result of a study of the performance of a square matrix multiplication algorithm. To test the algorithm we use node 431 of the SeARCH cluster. Throughout this work we explore three different implementations of the algorithm, with matrices of different sizes specifically selected to evaluate their performance impact. The internal CPU organization and the evaluation of bottlenecks are the main focus of this work. In the algorithm, the loop index order was defined as k-j-i for our workgroup. Modern CPU architectures provide vector computing features: the capability of using "large" processor registers to process multiple data elements at once in a single clock cycle. This capability, commonly known as SIMD (Single Instruction, Multiple Data), is also explored as a performance optimization technique for our algorithm implementation. As the main tool in the experimental component of this work we use a C library for performance analysis called the Performance Application Programming Interface (PAPI). This library allows us to access the internal CPU counters of node 431, analyse the different metrics, and draw conclusions about the algorithm's performance for different data sets.
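The k-j-i loop order mentioned above can be written down directly. With k outermost, each outer iteration adds the rank-1 update A[:,k] * B[k,:] into C; in a column-major layout the innermost i loop is unit-stride, which is the access pattern SIMD registers exploit. This is a plain sketch, not the report's code:

```python
def matmul_kji(A, B, n):
    """Matrix multiplication in k-j-i loop order. For each fixed k the two
    inner loops accumulate a rank-1 update of C; the innermost loop over i
    walks a column of C and of A, which is unit-stride in column-major
    storage (e.g. Fortran) and hence amenable to auto-vectorization."""
    C = [[0] * n for _ in range(n)]
    for k in range(n):
        for j in range(n):
            b_kj = B[k][j]
            for i in range(n):
                C[i][j] += A[i][k] * b_kj
    return C


print(matmul_kji([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2))  # [[19, 22], [43, 50]]
```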

The main focus of this paper is on implementing high-level functional algorithms in reconfigurable hardware. The approach adopts the transformational programming paradigm for deriving massively parallel algorithms from functional specifications. It extends previous work by systematically generating efficient circuits and mapping them into reconfigurable hardware. The massive parallelisation of the algorithm works by carefully composing "off the shelf" highly parallel implementations of each of the basic building blocks involved in the algorithm. These basic building blocks are a small collection of well-known higher-order functions such as map, fold, and zipwith. Using function decomposition and data refinement techniques, these powerful functions are refined into highly parallel implementations described in Hoare's CSP. The CSP descriptions correspond closely to Handel-C program fragments. Handel-C is a programming language based on C and extended with parallelism and communication primitives taken from CSP. In the final stage the circuit description is generated by compiling the Handel-C programs and mapping them onto the target reconfigurable hardware, such as the Celoxica RC-1000 FPGA system. The approach is illustrated by a case study involving the generation of several versions of the matrix multiplication algorithm.
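The building-block style can be sketched in ordinary code: matrix multiplication expressed entirely through map, fold, and zipwith, mirroring the kind of functional specification the circuits are derived from (an illustration, not the paper's Handel-C output):

```python
from functools import reduce


def zip_with(f, xs, ys):
    # zipwith: element-wise combination of two lists
    return [f(x, y) for x, y in zip(xs, ys)]


def fold(f, z, xs):
    # fold: collapse a list with a binary operator and an initial value
    return reduce(f, xs, z)


def dot(xs, ys):
    # inner product = fold (+) 0 . zipwith (*)
    return fold(lambda a, b: a + b, 0, zip_with(lambda x, y: x * y, xs, ys))


def matmul(A, B):
    # map a dot product over every (row of A, column of B) pair
    Bt = list(map(list, zip(*B)))
    return [[dot(row, col) for col in Bt] for row in A]


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Because each dot product is independent and each zipwith/fold has a known parallel circuit, the whole computation decomposes into the highly parallel building blocks the paper composes.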

Software profiling was conducted to determine which sections of the program demand the most computation in monocular SLAM inverse depth estimation, and matrix multiplication was identified as one of the most time-consuming processes. The processing becomes more demanding as the number of features inserted into the image increases. For that reason, this paper proposes a parallel matrix multiplier design that accelerates execution time. The design is implemented in Field Programmable Gate Array (FPGA) technology, which allows parallel designs to be realized. It restructures the classical matrix multiplication algorithm into an architecture that enables data to be processed concurrently.
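A software analogue of the idea (only an analogue; the paper's design is a hardware architecture): in the classical algorithm every output row is independent, so each can be handed to its own processing unit, much as an FPGA can dedicate a multiply-accumulate pipeline per row.

```python
from concurrent.futures import ThreadPoolExecutor


def dot(u, v):
    # one multiply-accumulate chain, the basic unit the hardware replicates
    return sum(x * y for x, y in zip(u, v))


def parallel_matmul(A, B):
    """Each output row of the classical algorithm is an independent task,
    so all rows can be computed concurrently."""
    Bt = list(zip(*B))  # columns of B
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda row: [dot(row, col) for col in Bt], A))


print(parallel_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```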

Shared-memory processors show negative performance dips (drawbacks) in certain regions when executing the basic matrix multiplication algorithm. In this paper we continue our analysis of the GPU memory hierarchy and the corresponding cache memory organization. We give a theoretical analysis of why a negative performance dip appears for specific problem sizes. The main reason is the cache storage organization: the negative performance peak is caused by matrix elements mapping onto a single cache set instead of using the whole cache. The experimental results obtained confirm our theoretical analysis. We also propose a method to avoid the situations where these performance drawbacks appear.
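The set-mapping effect can be reproduced with a few lines of arithmetic. Assuming an illustrative cache with 64-byte lines, 64 sets, and 8-byte matrix elements (the exact geometry in the paper may differ), walking down a column of a 512x512 matrix uses a power-of-two stride that maps every access to the same set:

```python
def cache_set(address, line_bytes=64, num_sets=64):
    # set index = (address // line size) mod number of sets
    return (address // line_bytes) % num_sets


def column_sets(n, elem_bytes=8, line_bytes=64, num_sets=64):
    """Distinct cache sets touched when walking one column of an n x n
    row-major matrix: consecutive accesses are n * elem_bytes apart."""
    stride = n * elem_bytes
    return len({cache_set(i * stride, line_bytes, num_sets) for i in range(n)})


print(column_sets(512))  # 1  -> every access conflicts in a single set
print(column_sets(516))  # 64 -> accesses spread over all the sets
```

Padding the matrix dimension away from the pathological size is one common way to avoid the dip.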

The new barrier mode in Apache Spark allows embedding distributed deep learning training as a Spark stage, simplifying the distributed training workflow. In Spark, a task in a stage doesn't depend on any other task in the same stage and can therefore be scheduled independently. However, several algorithms require more sophisticated inter-task communication, similar to the MPI paradigm. By combining distributed message passing (using asynchronous network IO), OpenJDK's new auto-vectorization, and Spark's barrier execution mode, we can add non-map/reduce-based algorithms, such as Cannon's distributed matrix multiplication, to Spark. We document an efficient distributed matrix multiplication using Cannon's algorithm, which improves significantly on the performance of the existing MLlib implementation. Used within a barrier task, the algorithm described herein yields up to a 24% performance increase on a 10,000x10,000 square matrix with a significantly lower memory footprint. Applications of efficient matrix multiplication include, among others, accelerating the training and deployment of deep convolutional neural network based workloads, and thus such efficient algorithms can play a groundbreaking role in faster, more efficient execution of even the most complicated machine learning tasks.
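Cannon's algorithm itself can be sketched independently of Spark. The following sequentially simulates the block shifts on a p x p grid (the helper names are illustrative; the paper's barrier-mode implementation runs the shifts as real inter-task messages):

```python
def cannon_matmul(A, B, p):
    """Sequential simulation of Cannon's algorithm on a p x p grid.
    A and B are p x p grids of equally sized square blocks (plain lists)."""
    def block_mul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]

    def block_add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    # initial alignment: shift row i of A left by i, column j of B up by j
    A = [[A[i][(j + i) % p] for j in range(p)] for i in range(p)]
    B = [[B[(i + j) % p][j] for j in range(p)] for i in range(p)]
    m = len(A[0][0])  # block dimension
    C = [[[[0] * m for _ in range(m)] for _ in range(p)] for _ in range(p)]
    for _ in range(p):
        # every grid cell multiplies its current pair of blocks
        for i in range(p):
            for j in range(p):
                C[i][j] = block_add(C[i][j], block_mul(A[i][j], B[i][j]))
        # shift A blocks left by one and B blocks up by one, then repeat
        A = [[A[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        B = [[B[(i + 1) % p][j] for j in range(p)] for i in range(p)]
    return C
```

Each cell only ever holds one block of A and one of B at a time, which is where the lower memory footprint comes from.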
