Parallel SVD Algorithm for a Three-Diagonal Matrix on a Video Card Using the Nvidia CUDA Architecture
Related papers
Parallel Three-Dimensional LAD Model on Cartesian Grids of nested structure
Keldysh Institute Preprints, 2016
This preprint addresses a generalization of the technology of nested-type Locally Adaptive (LAD) Cartesian grids to the three-dimensional case and describes the corresponding software library. The library is written in C++ using object-oriented programming principles and is adapted for parallel computing on multi-core shared-memory processors with the OpenMP application interface. The library takes into account the specifics of multi-threaded Cartesian-grid calculations of 3D problems. This can significantly reduce memory usage by avoiding the storage of grid information. Grid data such as node coordinates, normals, areas, and volumes are not stored but are evaluated as needed. The order of cell traversal is represented by a special list, which simplifies parallel implementation by means of OpenMP directives. This preprint contains a description of the tree-like data structure on the LAD grid, the basic principles of the discrete model, and the main functions of the library. Preliminary results of testing the library and an estimate of the efficiency of parallel calculations are given for the problem of the evolution of an explosion in a closed space.
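The idea of evaluating grid geometry on demand instead of storing it can be illustrated with a minimal sketch. It assumes an octree-style refinement of a unit cube in which every cell splits into eight equal children; the `Cell` class and the helper names are hypothetical and are not taken from the library described in the preprint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """A cell identified only by refinement level and integer indices (assumed layout)."""
    level: int
    i: int
    j: int
    k: int


def cell_size(cell, domain=1.0):
    # Edge length halves with every refinement level.
    return domain / (2 ** cell.level)


def cell_center(cell, domain=1.0):
    # Coordinates are derived from the indices, so nothing is stored per cell.
    h = cell_size(cell, domain)
    return ((cell.i + 0.5) * h, (cell.j + 0.5) * h, (cell.k + 0.5) * h)


def cell_volume(cell, domain=1.0):
    return cell_size(cell, domain) ** 3


def children(cell):
    # Eight children on the next level (octree-style assumption).
    return [
        Cell(cell.level + 1, 2 * cell.i + di, 2 * cell.j + dj, 2 * cell.k + dk)
        for di in (0, 1)
        for dj in (0, 1)
        for dk in (0, 1)
    ]
```

In this sketch a cell costs four integers of storage, and its center, size, and volume are recomputed whenever a solver needs them.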
Multiple-precision matrix-vector multiplication on graphics processing units
Program Systems: Theory and Applications, 2020
We consider a parallel implementation of matrix-vector multiplication (GEMV, Level 2 BLAS) for graphics processing units (GPUs) using multiple-precision arithmetic based on the residue number system. In our GEMV implementation, element-wise operations on multiple-precision vectors and matrices are split into parts, each of which is executed by a separate CUDA kernel. This avoids branching in the execution logic and allows fuller utilization of GPU resources. An efficient data structure for storing multiple-precision arrays ensures that accesses by parallel threads to GPU global memory are coalesced into transactions. A rounding error analysis is carried out for the proposed GEMV implementation, and accuracy bounds are obtained. Experimental results are presented that show the high efficiency of the developed implementation compared with existing multiple-precision software packages for GPUs.
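A residue number system lets additions and multiplications proceed independently per modulus, with no carries between digits, which is what makes the arithmetic easy to split across parallel threads. The following is a minimal CPU sketch for non-negative integers only, not the paper's multiple-precision floating-point GPU implementation; the moduli and all function names are assumptions chosen for illustration.

```python
from math import prod

# Hypothetical pairwise-coprime moduli; results are exact below their product.
MODULI = (251, 253, 255, 256)


def to_rns(x):
    return tuple(x % m for m in MODULI)


def rns_add(a, b):
    # Each residue is updated independently of the others.
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def rns_mul(a, b):
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def from_rns(residues):
    # Chinese remainder theorem reconstruction.
    M = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)
    return total % M


def rns_gemv(A, x):
    # y = A x, with every multiply-add carried out on residues.
    xr = [to_rns(v) for v in x]
    y = []
    for row in A:
        acc = to_rns(0)
        for a, xv in zip(row, xr):
            acc = rns_add(acc, rns_mul(to_rns(a), xv))
        y.append(from_rns(acc))
    return y
```

The result is exact only while every entry of the product stays below the product of the moduli; a real implementation would also have to handle signs, exponents, and rounding.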
Application of CUDA Parallel Programming Technology in Random Optimization Methods
2017
Objective: to use parallel computing to speed up, as much as possible, the process of finding the extremum in multidimensional optimization problems, and thereby to demonstrate the advantages of this technology over sequential computing. The study examined random search algorithms, multidimensional optimization functions, and the basic principles of developing algorithms for GPUs. As a result, an application implementing multidimensional optimization methods was developed. The application allows new functions to be added.
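Random search suits parallel hardware because every candidate point is drawn and evaluated independently of the others. The following is a minimal sequential sketch of pure random search, given only as an illustration; the function names, the `sphere` test function, and the parameters are assumptions and do not come from the paper.

```python
import random


def random_search(f, bounds, n_samples=10_000, seed=0):
    """Minimize f over a box by sampling points uniformly at random."""
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(n_samples):
        # Each sample is independent, so on a GPU one thread could handle one sample.
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val


def sphere(x):
    # Simple test function with its minimum at the origin.
    return sum(v * v for v in x)
```

A call such as `random_search(sphere, [(-1.0, 1.0)] * 2)` returns the best sampled point and its value. A parallel version would split the samples among threads and finish with a reduction over the per-thread minima.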
Keldysh Institute Preprints
Modern graphics accelerators (GPUs) can significantly speed up the execution of numerical tasks. However, porting programs to graphics accelerators is not an easy task, sometimes requiring their almost complete rewriting. CUDA graphics accelerators, thanks to technology developed by NVIDIA, allow you to have a single source code for both conventional processors (CPUs) and CUDA. However, in this single source code, you need to somehow tell the compiler which parts of this code to parallelize on shared memory. The use of the functional programming library developed by the authors allows you to hide the use of one or another parallelization mechanism on shared memory within the library and make the user source code completely independent of the computing device used (CPU or CUDA). This article shows how this can be done.
2019
A discord is a refinement of the concept of an anomalous subsequence of a time series. Discord discovery is applied in a wide range of subject domains related to time series: medicine, economics, climate modeling, etc. In this paper, we propose a novel parallel algorithm for discord discovery on Intel Xeon Phi Knights Landing (KNL) many-core systems for the case when the input data fit in main memory. The algorithm exploits the ability to calculate Euclidean distances between subsequences of the time series independently. Computations are parallelized with OpenMP. The algorithm consists of two stages, namely precomputation and discovery. At the precomputation stage, we construct auxiliary matrix data structures that ensure efficient vectorization of computations on Intel Xeon Phi KNL. At the discovery stage, the algorithm finds the discord based upon these structures. Experimental evaluation confirms the high scalability of the developed algorithm.
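The notion of a discord can be made concrete with a brute-force sketch: the discord is the subsequence whose distance to its nearest non-overlapping neighbor is largest. This is a plain quadratic reference version on raw (not normalized) values, not the paper's two-stage vectorized algorithm; the function name and the overlap rule are assumptions for illustration.

```python
import math


def find_discord(series, m):
    """Return (start index, distance) of the length-m subsequence that is
    farthest from its nearest non-overlapping neighbor."""
    n = len(series) - m + 1
    subs = [series[i:i + m] for i in range(n)]
    best_idx, best_dist = -1, -1.0
    for i in range(n):
        nearest = math.inf
        for j in range(n):
            # Skip overlapping subsequences, which would be trivial matches.
            if abs(i - j) >= m:
                nearest = min(nearest, math.dist(subs[i], subs[j]))
        if nearest != math.inf and nearest > best_dist:
            best_idx, best_dist = i, nearest
    return best_idx, best_dist
```

Every distance in the inner loop is independent of the others, which is the property that a parallel implementation can exploit.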
2020 IEEE ANDESCON, 2020
The present work implements and adapts a fast shape recognition algorithm on the Xilinx VC707 VIRTEX-7 FPGA platform for HD color images. The hardware design was developed to correctly implement the algorithm using a hardware description language (HDL) and intellectual property (IP) blocks. The necessary configurations, specific to the selected platform, were developed to obtain a video output for images in RGB 4:4:4 format with a color depth of 36 bits per pixel through an HDMI interface, in order to display the image resulting from the application of the algorithm. The algorithm was successfully tested on images in 720p HD resolution in black-and-white, grayscale, and color formats, although it is able to process images up to a maximum resolution of 1080p-60 full HD. The system performance was optimal in all tests, without requiring modification of the original algorithm despite variations in resolution or image format.
Eastern European Scientific Journal, 2017
Annotation: A hybrid adder architecture is presented that precomputes the pseudo-carry signals by exploiting the symmetry of the 3X multiple and generates the final carry with a Ling prefix network, which reduces the complexity of the carry path. In tests, the proposed adder shows a significant improvement in the power-delay product (~25-55%) compared to state-of-the-art 64-bit adders. Conclusion: the adder also exhibits an appreciable decrease in the number of logic gates in the critical path, leading to a reduction in static power consumption. This is considered an important design aspect for processors in deep-submicron (VLSI) technology. The mathematical expressions for the pseudo and final carry signals are formulated for the proposed hybrid adder. The expressions for the final carry signals are similar to Ling's equation, and therefore Ling's prefix network has been adopted for final carry generation. The adder has only two levels of prefix computation, independent of the input bit width, which results in a regular structure without increasing fan-out problems. This reduction leads to an appreciable decrease in static power consumption, which enhances the performance of processors fabricated in deep submicron
Vector Modulation Scheme using Three Phase Modulator
2019 International Conference on Numerical Simulation of Optoelectronic Devices (NUSOD), 2019
Publishing house "Sreda". Content is licensed under the Creative Commons Attribution 4.0 license (CC-BY 4.0). Асеева Татьяна Александровна, Doctor of Agricultural Sciences, Director; Трифунтова Ирина Борисовна, Research Fellow; Far Eastern Agricultural Research Institute (ФГБНУ «Дальневосточный научно-исследовательский институт сельского хозяйства»), Vostochnoye village, Khabarovsk Krai
3D Machine Vision Systems on Microcontrollers and Microprocessors (Part 3) / Глава 3. Системы трехмерного машинного зрения на микроконтроллерах и микропроцессорах, 2021
The monograph presents generalized results of research into ways of adapting machine vision methods to microprocessor systems. As an example, it considers the main operating algorithms of LiDAR-based optical 3D scanners designed to collect geometric metrics in the form of a three-dimensional object, with automated matching to a texture map. In addition, the monograph addresses the ethical and legal aspects of using automated autonomous machine vision systems.