Design and Performance Analysis of Fixed-point Jacobi SVD Algorithm on Reconfigurable System

A systolic VLSI architecture for complex SVD

[Proceedings] 1992 IEEE International Symposium on Circuits and Systems, 1992

This thesis presents a systolic algorithm for the SVD of arbitrary complex matrices, based on the cyclic Jacobi method with "parallel ordering". As a basic step in the algorithm, a two-step, two-sided unitary transformation scheme is employed to diagonalize a complex 2×2 matrix. The transformations are tailored to the use of CORDIC (COordinate Rotation DIgital Computer) algorithms for high-speed arithmetic. The complex SVD array is modeled on the Brent-Luk-Van Loan array for real SVD. An array with O(n²) processors is required to compute the SVD of an n×n matrix in O(n log n) time. An architecture for the complex 2×2 processor, with an area complexity twice that of a real 2×2 processor, is shown to have the best area/time tradeoff for VLSI implementation. Despite the involved nature of computations on complex data, the computation time for the complex SVD array is less than three times that for a real SVD array with a similar CORDIC-based implementation.
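The two-sided 2×2 diagonalization at the heart of such arrays can be sketched for the real case (the complex scheme adds unitary phase corrections before this step). The angle-sum/angle-difference formulation below is the one that maps onto CORDIC operations; this is an illustrative sketch, not the thesis's implementation, and all names are ours:

```python
import math

def jacobi_svd_2x2(a, b, c, d):
    """Return rotation angles (theta_l, theta_r) such that
    R(theta_l)^T @ [[a, b], [c, d]] @ R(theta_r) is diagonal,
    where R(t) = [[cos t, -sin t], [sin t, cos t]].

    Illustrative real-valued sketch: in hardware, each atan2 becomes one
    CORDIC vectoring operation and the two resulting half-angle rotations
    become CORDIC rotation operations."""
    theta_sum = math.atan2(b + c, a - d)   # theta_l + theta_r
    theta_diff = math.atan2(b - c, a + d)  # theta_r - theta_l
    theta_r = 0.5 * (theta_sum + theta_diff)
    theta_l = 0.5 * (theta_sum - theta_diff)
    return theta_l, theta_r
```

Applying the two rotations zeroes both off-diagonal entries in one step; a systolic array repeats this kernel across all 2×2 subproblems of a sweep.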

Highly Parallel Hardware-oriented Algorithm for Jacobi SVD of Hermitian Quaternion Valued Matrix

In this study, a new, highly parallel algorithm for the two-sided Jacobi 8-D transformation is proposed. It is oriented toward the VLSI implementation of a special processor array built on an 8-D CORDIC algorithm for the singular value decomposition of quaternion-valued matrices. Accuracy analysis and simulation results are included. Such an array can be used to speed up the Jacobi method for computing the SVD of a quaternion matrix in signal and image processing.

Application of on-line arithmetic algorithms to the SVD computation: preliminary results

[1991] Proceedings 10th IEEE Symposium on Computer Arithmetic, 1991

A scheme for the singular value decomposition (SVD) problem, based on on-line arithmetic, is discussed. The design, using radix-2 floating-point on-line operations and implemented in LSI HCMOS gate-array technology, is compared with a comparable conventional-arithmetic implementation. The preliminary results indicate that the proposed on-line approach achieves a speedup of 2.4-3.2 over the conventional solutions, with 1.3-5.5 times more gates and more than 6 times fewer interconnections.

Reconfigurable FPGA-Based Unit for Singular Value Decomposition of Large m x n Matrices

2011 International Conference on Reconfigurable Computing and FPGAs, 2011

Singular value decomposition (SVD) factorizes real or complex matrices, providing quantitative information in fewer dimensions along which the data points exhibit the most variation. SVD is now used in numerous applications and, because of its importance, different approaches to hardware SVD computation have been proposed; however, their applicability has been limited by the inherent complexity of the SVD calculation, which until now has restricted analysis to matrices of at most 8 × 8, subject to constraints such as symmetry. This paper presents a generic and novel FPGA-based hardware architecture for SVD computation on large m × n matrices using the Hestenes approach with one-sided Jacobi rotations. Four study cases (2 × 2, 8 × 7, 16 × 32, and 32 × 127 matrices) validate the performance of the FPGA-based computation unit, which reaches a maximum error of 3.3718% in the SVD estimation of a large matrix.
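The Hestenes one-sided scheme orthogonalizes the matrix columns in place, which is what makes it attractive for streaming hardware: there is no explicit 2×2 subproblem extraction. The floating-point sketch below is illustrative only (the paper's unit works in fixed point, and all names here are ours):

```python
import math
import numpy as np

def hestenes_svd(A, max_sweeps=30, tol=1e-12):
    """One-sided Jacobi (Hestenes) SVD sketch for an m x n matrix, m >= n.
    Columns of U are orthogonalized pairwise by plane rotations; singular
    values are the final column norms. Illustrative floating-point version."""
    U = A.astype(float).copy()
    n = U.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = U[:, p] @ U[:, p]
                beta = U[:, q] @ U[:, q]
                gamma = U[:, p] @ U[:, q]
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns p and q already orthogonal
                converged = False
                # rotation angle that orthogonalizes the column pair
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                Up = c * U[:, p] - s * U[:, q]
                U[:, q] = s * U[:, p] + c * U[:, q]
                U[:, p] = Up
                Vp = c * V[:, p] - s * V[:, q]
                V[:, q] = s * V[:, p] + c * V[:, q]
                V[:, p] = Vp
        if converged:
            break
    sigma = np.linalg.norm(U, axis=0)
    U = U / np.where(sigma > 0, sigma, 1.0)  # normalize nonzero columns
    return U, sigma, V
```

After convergence, A = U diag(sigma) Vᵀ, since each rotation is applied to both U and the accumulated V.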

Redundant and on-line CORDIC: application to matrix triangularization and SVD

IEEE Transactions on Computers, 1990

Several modifications to the CORDIC method of computing angles and performing rotations are presented: 1) the use of redundant (carry-free) addition instead of a conventional (carry-propagate) one; 2) a representation of angles in a decomposed form to reduce area and communication bandwidth; 3) the use of on-line addition (left-to-right, digit-serial addition) to replace shifters by delays; and 4) the use of on-line multiplication, square root, and division to compute scaling factors and perform the scaling operations. The modifications presented improve the speed and the area of CORDIC implementations. The proposed scheme makes efficient use of floating-point representations. We discuss the application of the modified CORDIC method to matrix triangularization by Givens rotations and to the computation of the singular value decomposition (SVD).
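For reference, the conventional (carry-propagate, non-redundant) CORDIC vectoring mode that these modifications accelerate can be sketched as follows. This is a floating-point illustration; real implementations use fixed-point shift-and-add datapaths, and the function name is ours:

```python
import math

def cordic_vectoring(x, y, iters=32):
    """Conventional CORDIC in vectoring mode: drive y to 0 with
    shift-and-add micro-rotations, returning (magnitude, angle).
    Baseline sketch of the scheme the paper speeds up by replacing
    carry-propagate adders with redundant/on-line ones and shifters
    with delays."""
    angle = 0.0
    for i in range(iters):
        d = -1.0 if y > 0.0 else 1.0            # rotate toward the x-axis
        x, y = x - d * y / (1 << i), y + d * x / (1 << i)
        angle -= d * math.atan2(1.0, 1 << i)    # accumulate the elementary angle
    # constant gain K = prod sqrt(1 + 2^-2i); hardware pre-computes it
    K = math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(iters))
    return x / K, angle
```

Because every micro-rotation uses only a shift and an add, the critical path is dominated by carry propagation in the adders, which is exactly what the redundant-arithmetic variants remove.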

A block JRS algorithm for highly parallel computation of SVDs

2007

This paper presents a new algorithm for computing the singular value decomposition (SVD) on multilevel memory hierarchy architectures. This algorithm is based on one-sided JRS iteration, which enables the computation of all Jacobi rotations of a sweep in parallel. One key point of the proposed block JRS algorithm is reusing data loaded into cache memory by performing computations on matrix blocks (b rows) instead of on strips of vectors as in JRS iteration algorithms. Another key point is that, on a reasonably large number of processors, the number of sweeps is less than that of the one-sided JRS iteration algorithm and closer to that of the cyclic Jacobi method, even though not all rotations in a block are independent. A relaxation technique makes it possible to calculate and apply all independent rotations per block at the same time. On blocks of size b×n, block JRS performs O(b²n) floating-point operations on O(bn) elements, reusing the data loaded into cache memory by a factor of b. Moreover, on P parallel processors, only (2P-1) block-computation steps are needed per sweep.
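The (2P-1) parallel steps per sweep come from a round-robin (tournament) ordering of the 2P blocks: each step pairs every block with a distinct partner, so all pairwise block rotations within a step are independent. A sketch of that ordering (illustrative; the function name is ours, not the paper's):

```python
def round_robin_pairs(num_blocks):
    """Tournament ordering of an even number of blocks: yields
    num_blocks - 1 rounds, each pairing every block exactly once, so all
    block rotations within a round can run in parallel. With 2P blocks
    this gives the (2P - 1) steps per sweep quoted above."""
    assert num_blocks % 2 == 0
    ids = list(range(num_blocks))
    rounds = []
    for _ in range(num_blocks - 1):
        rounds.append([(ids[i], ids[num_blocks - 1 - i])
                       for i in range(num_blocks // 2)])
        ids = [ids[0]] + [ids[-1]] + ids[1:-1]  # rotate all but the first id
    return rounds
```

Across a full sweep, every pair of blocks meets exactly once, which is the property the block sweep needs to touch all off-diagonal block pairs.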

Parallel singular value decomposition of complex matrices using multidimensional CORDIC algorithms

IEEE Transactions on Signal Processing, 1996

The singular value decomposition (SVD) of complex matrices is computed in a highly parallel fashion on a square array of processors using Kogbetliantz's analog of Jacobi's eigenvalue decomposition method. To gain further speed, new algorithms for the basic SVD operations are proposed and their implementation as specialized processors is presented. The algorithms are 3-D and 4-D extensions of the CORDIC algorithm for plane rotations. When these extensions are used in concert with an additive decomposition of 2 × 2 complex matrices, which enhances parallelism, and with low-resolution rotations early on in the SVD process, which reduce the operation count, a fivefold speedup can be achieved over the fastest alternative approach.

I. INTRODUCTION

Concurrently with the rapid increase in computing power over the last two decades, signal processing algorithms based on the computation-intensive singular value decomposition (SVD) have become increasingly popular. An international workshop on SVD and Signal Processing has convened regularly since 1987. Since SVD algorithms consist essentially of the repeated computation of orthogonal transformations, when very high throughputs are required the design of the arithmetic units used as building blocks for the parallel computation of the orthogonal transformations becomes critical, as it may provide an order-of-magnitude speedup. The coordinate rotation digital computer (CORDIC) [34], [35] provides a good model for such arithmetic units, as it enables the efficient implementation of plane rotations using simple hardware components, mainly adders and shifters. Moreover, if needed, its computations may be accelerated to some extent with redundant arithmetic techniques [15], [29], [32].
Thus, much research has been directed toward integrating CORDIC arithmetic algorithms and parallel matrix algorithms to obtain specialized parallel architectures for basic problems such as QR decomposition [2], [22], [23] and eigenvalue and singular value decomposition [4], [5], [6], [8], ...

Adaptive Scalable SVD Unit for Fast Processing of Large LSE Problems

Singular Value Decomposition (SVD) is a key linear algebraic operation in many scientific and engineering applications. In particular, many computational intelligence systems rely on machine learning methods involving high-dimensionality datasets that must be processed quickly for real-time adaptability. In this paper we describe a practical FPGA (Field Programmable Gate Array) implementation of an SVD processor for accelerating the solution of large LSE problems. The design approach has been comprehensive, from algorithmic refinement through numerical analysis to customization for an efficient hardware realization. The processing scheme rests on an adaptive vector-rotation evaluator for error regularization that enhances convergence speed with no penalty on solution accuracy. The proposed architecture, which follows a data-transfer scheme, is scalable and based on the interconnection of simple rotation units, which allows a trade-off between occupied area and process...

Implementation in FPGAs of Jacobi Method to Solve the Eigenvalue and Eigenvector Problem

2006

This work presents a modular FPGA-based architecture to solve the eigenvalue problem using the Jacobi method, which computes the eigenvalues and eigenvectors concurrently. The main contributions are a low execution time compared with other sequential algorithms and minimal consumption of internal FPGA resources, mainly due to the use of the CORDIC algorithm. Two CORDIC modules have been designed to perform the trigonometric operations involved. A parallel CORDIC architecture is proposed as the best option to compute the eigenvalues with this method. Both CORDIC modules can work in rotation and vectoring mode. The whole system has been described in VHDL, with attention to optimizing the design. * Project SILPAR (Ministerio de Ciencia y Tecnología, ref. DPI2003-05067) and the "Cátedra de control electrónico en transportes" funded by LOGYTEL and RENFE.
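The cyclic Jacobi iteration such an architecture implements can be sketched in floating point as below; the two CORDIC modes correspond to the angle computation (vectoring) and the sine/cosine row-column updates (rotation). This is an illustrative sketch, not the paper's fixed-point design, and the names are ours:

```python
import numpy as np

def jacobi_eig(A, max_sweeps=20, tol=1e-12):
    """Cyclic Jacobi for a symmetric matrix: eigenvalues end up on the
    diagonal, eigenvectors accumulate in V. Each 2x2 step maps onto one
    CORDIC vectoring operation (the arctan2) plus CORDIC rotations
    (the row/column updates)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    V = np.eye(n)
    for _ in range(max_sweeps):
        # off-diagonal Frobenius norm as the convergence measure
        if np.sqrt(np.sum(A**2) - np.sum(np.diag(A)**2)) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p, q] == 0.0:
                    continue
                # angle that annihilates A[p, q] (CORDIC vectoring in hardware)
                theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                R = np.eye(n)
                R[p, p] = R[q, q] = c
                R[p, q], R[q, p] = s, -s
                A = R.T @ A @ R   # two-sided plane rotation
                V = V @ R         # accumulate eigenvectors
    return np.diag(A).copy(), V
```

Because each rotation is applied from both sides, symmetry is preserved throughout, and the eigenvectors are read directly from the accumulated V.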

On parallelism of the I-SVD algorithm with a multi-core processor

The I-SVD algorithm is a singular value decomposition algorithm consisting of the modified discrete Lotka-Volterra (mdLVs) scheme and the discrete Lotka-Volterra (dLV) twisted factorization. By assigning pieces of the computation to individual cores of a multi-core processor, the I-SVD algorithm is partially parallelized. The basic idea is the use of splitting and deflation in the mdLVs scheme: splitting divides a bidiagonal matrix into two smaller matrices, while deflation yields one of the singular values, after which the corresponding singular vector becomes computable by the dLV twisted factorization. Numerical experiments on a multi-core processor show that the algorithm runs about 5 times faster with 8 cores.