Vladimir Knyazkov | Penza State University

Papers by Vladimir Knyazkov

Parallel multiple-precision arithmetic based on residue number system

Program Systems: Theory and Applications (Программные системы: теория и приложения), 2016

Look-Up Table Method of Forward Binary to Residue Number System Conversion for Moduli {2^f - 1}

Fast Multiplication in Residue Number Systems Using Interval Logarithmic Characteristic

Современные наукоемкие технологии (Modern High Technologies), 2018

Many problems in computational mathematics, mathematical physics, economics, biochemistry, and cryptography require high precision, up to 512 to 1024 bits and beyond. Multiplication is one of the most frequently performed arithmetic operations in such computations. Today, the main tool for working with long numbers is software libraries for positional multiple-precision arithmetic, whose principal drawback is a sharp drop in performance caused by carries propagating between the binary digits of a long positional number. In practice, even the fastest multiplication methods slow down considerably on very long numbers represented in positional number systems, so increasing speed is the primary goal when designing multiplication methods. This paper considers a method for multiplying two long floating-point numbers in a hybrid modular interval-logarithmic representation format. The mantissa is represented in the residue number system (RNS), which replaces sequential arithmetic on long integers in positional number systems with parallel arithmetic on sets of short integers, substantially accelerating addition and multiplication. As metadata, the format carries an interval-logarithmic characteristic needed for fast comparison and scaling operations. On average, the developed method is approximately 3.4 times faster than a method based on positional multiple-precision arithmetic. Keywords: residue number system, multiplication, logarithmic number system, interval arithmetic, floating-point format.
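The residue-wise parallelism this format relies on can be sketched in a few lines of Python. The moduli set and all names below are illustrative assumptions; the paper's actual moduli and its interval-logarithmic metadata are not reproduced:

```python
# Sketch of RNS-based multiple-precision multiplication.
# The moduli form a pairwise-coprime set, since
# gcd(2^a - 1, 2^b - 1) = 2^gcd(a, b) - 1 and gcd of 31, 29, 27 pairs is 1.
MODULI = [2**31 - 1, 2**29 - 1, 2**27 - 1]

def to_rns(x):
    return [x % m for m in MODULI]

def rns_mul(a, b):
    # Residue channels are independent: no carries propagate between
    # them, so each channel can be handled by a separate thread.
    return [(ra * rb) % m for ra, rb, m in zip(a, b, MODULI)]

def from_rns(res):
    # Chinese Remainder Theorem reconstruction to a positional integer.
    M = 1
    for m in MODULI:
        M *= m
    x = 0
    for r, m in zip(res, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m): modular inverse
    return x % M

x, y = 123456789, 987654321
assert from_rns(rns_mul(to_rns(x), to_rns(y))) == x * y
```

The product is exact as long as x * y stays below the product of the moduli (about 2^87 here), which is the dynamic-range constraint any RNS design must respect.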

Design and implementation of multiple-precision BLAS Level 1 functions for graphics processing units

Journal of Parallel and Distributed Computing, Jun 1, 2020

In the present study, a new multiple-precision arithmetic environment for MATLAB is developed based on exflib, a fast multiple-precision arithmetic library for C, C++, and Fortran. We discuss the design and implementation of an interface between MATLAB and exflib that requires no modifications to exflib. Although the proposed design incurs some execution overhead, it is more accurate and faster than Variable Precision Arithmetic (VPA), the official multiple-precision arithmetic environment in MATLAB. We also demonstrate the efficiency of the proposed environment for reliable computation that overcomes the difficulties caused by rounding errors.

Presentation: Study of efficiency of multiple-precision integer computations in residue number system on GPU (in Russian)

Extended abstract: Fast Power-of-Two RNS Scaling Algorithm for Large Dynamic Ranges (in Russian)

Presentation: Supporting extended-range arithmetic on CPU and GPU (in Russian)

Presentation: High-precision floating-point arithmetic algorithms based on residue number system (in Russian)

Presentation: High-Precision Computations on CPU and GPU Using Residue Number System (in Russian)

A parallel multiple-precision arithmetic library for high performance systems (in Russian)

Fast Power-of-Two RNS Scaling Algorithm for Large Dynamic Ranges

2017 IVth International Conference on Engineering and Telecommunication (EnT), 2017

This paper presents a new efficient algorithm for scaling by a power of two in the residue number system (RNS). It targets arbitrary moduli sets with large dynamic ranges. To determine the remainder when dividing the number to be scaled by the scaling factor, the algorithm uses an interval estimation of the RNS representation. The proposed algorithm requires only machine-precision integer and floating-point operations and parallelizes well. It is implemented both for CPUs and, using the CUDA C language, for GPUs.
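For reference, the operation being accelerated can be defined in a short Python sketch. This naive version reconstructs the number via the Chinese Remainder Theorem, which is exactly the costly step the paper's interval-estimation algorithm avoids; the moduli set and names are illustrative assumptions:

```python
# Naive definition of power-of-two RNS scaling: reconstruct, shift,
# re-encode. The paper's contribution is doing this WITHOUT the full
# CRT reconstruction, using an interval estimation instead.
MODULI = [2**31 - 1, 2**29 - 1, 2**27 - 1]  # illustrative, pairwise coprime

def from_rns(res):
    M = 1
    for m in MODULI:
        M *= m
    x = 0
    for r, m in zip(res, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)
    return x % M

def scale_pow2(res, k):
    # floor(X / 2**k), re-encoded into residues
    y = from_rns(res) >> k
    return [y % m for m in MODULI]

x = 10**15
assert from_rns(scale_pow2([x % m for m in MODULI], 10)) == x >> 10
```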

RNS-Based Data Representation for Handling Multiple-Precision Integers on Parallel Architectures

2016 International Conference on Engineering and Telecommunication (EnT), 2016

In most computer programs and general-purpose computing environments, the precision of any calculation is limited by the word size of the computer. For some applications, such as cryptography, this precision is not sufficient, and multiple-precision numbers must be used. In most software, operations on such numbers are implemented by third-party libraries that provide data types and subroutines for storing numbers with the requested precision and performing computations. In this paper, we consider an approach to representing large integers based on the residue number system (RNS). Due to the non-positional nature of RNS, operations on multiple-precision numbers can be split into several reduced-precision operations executed in parallel. This achieves high performance and effective use of the resources of modern parallel computing architectures such as graphics processing units.

Data-Parallel High-Precision Multiplication on Graphics Processing Units

Communications in Computer and Information Science, 2019

In this article, we consider parallel algorithms for high-precision floating-point multiplication on graphics processing units (GPUs). Our underlying high-precision format is based on the residue number system (RNS). In RNS, a number is represented as a tuple of residues obtained by dividing the number by each modulus in a given set. The residues are mutually independent, which eliminates carry propagation delays and introduces parallelism into arithmetic operations. First, we consider a basic algorithm for multiplying high-precision floating-point numbers. Next, we provide three parallel GPU implementations of this algorithm in the context of componentwise vector multiplication. Experiments indicate that our implementations are several times faster than existing high-precision libraries.

Tool System for Scalable Parallel Programs of Numerical Calculations Design

Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2018

Many high-dimensional applied problems require high-precision numerical computations within the framework of matrix algebra over large numbers. Modern multiprocessor computing systems typically have a 32-bit or 64-bit architecture, which complicates operations on large numbers and reduces the efficiency of their use. As a result of work on this problem, a new computation method is proposed that uses a number system with an inherently parallel structure, the residue number system, as the basis for numerical computations. A tool system is being developed on the basis of the proposed method. Keywords: software system, parallel scalable computations, residue number system, large numbers, MPI, OpenMP, parallel programming technologies, parallel programs.

Computing the Sparse Matrix-Vector Product in High-Precision Arithmetic for GPU Architectures

Communications in Computer and Information Science, 2021

The multiplication of a sparse matrix by a vector (SpMV) is the main and most expensive component of iterative methods for sparse linear systems and eigenvalue problems. As rounding errors often lead to poor convergence of iterative methods, in this article we implement and evaluate SpMV using high-precision arithmetic on graphics processing units (GPUs). We present two implementations that use the compressed sparse row (CSR) format. The first is a scalar high-precision CSR kernel that assigns one thread per matrix row. The second consists of two steps. In the first step, the matrix and vector are multiplied element by element; the high efficiency of this step is achieved by using a residue number system, which allows all digits of a high-precision number to be computed in parallel by multiple threads. The second step is a segmented reduction of the intermediate results. Experimental evaluation demonstrates that, at the same precision, our implementati…
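The scalar CSR kernel of the first implementation can be sketched as follows. This is a hypothetical sequential Python version with ordinary doubles standing in for high-precision values; on the GPU, each iteration of the outer loop would be handled by its own thread:

```python
# Scalar CSR sparse matrix-vector product, one row at a time.
# row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i inside
# the parallel arrays col_idx (column indices) and vals (values).
def spmv_csr(row_ptr, col_idx, vals, x):
    y = []
    for i in range(len(row_ptr) - 1):  # one GPU thread per row
        s = 0.0
        for j in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[j] * x[col_idx[j]]
        y.append(s)
    return y

# 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
assert spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]) == [3.0, 3.0]
```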

Parallel Computation of Normalized Legendre Polynomials Using Graphics Processors

To carry out some calculations in physics and Earth sciences, for example, to determine spherical harmonics in geodesy or angular momentum in quantum mechanics, it is necessary to compute normalized Legendre polynomials. We consider the solution of this problem on modern graphics processing units, whose massively parallel architectures allow calculations to be performed for many arguments, orders, and degrees of polynomials simultaneously. For higher polynomial degrees, the computations involve a considerable spread of numerical values and lead to overflow and/or underflow problems. To avoid such problems, support for extended-range arithmetic has been implemented.
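As a minimal illustration, the classical three-term recurrence for unnormalized Legendre polynomials is sketched below in plain double precision. The normalized, high-degree case treated in the paper is where such recurrences over- or underflow binary64 and extended-range arithmetic becomes necessary; this sketch does not reproduce that machinery:

```python
# Bonnet's recurrence for Legendre polynomials:
#   (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x)
def legendre(n, x):
    p_prev, p = 1.0, x  # P_0(x) = 1, P_1(x) = x
    if n == 0:
        return p_prev
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p

assert legendre(2, 1.0) == 1.0  # P_n(1) = 1 for every n
```

On a GPU, many (n, x) pairs can be evaluated independently, which is the parallelism the paper exploits.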

Efficient GPU Implementation of Multiple-Precision Addition based on Residue Arithmetic

International Journal of Advanced Computer Science and Applications, 2020

In this work, the residue number system (RNS) is applied to the efficient addition of multiple-precision integers on graphics processing units (GPUs) that support the Compute Unified Device Architecture (CUDA) platform. The RNS allows calculations on the digits of a multiple-precision number to be performed in an element-wise fashion, without communication overhead between them, which is especially useful for massively parallel architectures such as GPUs. The paper discusses two multiple-precision integer addition algorithms. The first relies on if-else statements to test the signs of the operands. The second uses radix-complement RNS arithmetic to handle negative numbers. While the first algorithm is more straightforward, the second avoids branch divergence among threads that concurrently compute different elements of a multiple-precision array. As a result, the second algorithm shows significantly better performance. Both algorithms running on an NVIDIA RTX 2080 Ti GPU are faster than the multi-core GNU MP implementation running on an Intel Xeon 4100 processor.
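The branch-elimination idea behind the second algorithm can be illustrated with toy moduli: storing a signed value as x mod M makes addition a single residue-wise code path regardless of operand signs. This is only a sketch under assumed moduli, not the paper's implementation:

```python
# Radix-complement signed arithmetic in RNS with toy moduli. A negative
# x is stored as x mod M, so addition uses the same residue-wise code
# for any combination of signs: no if-else on operand signs, hence no
# branch divergence when many threads run this in lockstep.
MODULI = [7, 11, 13]   # illustrative pairwise-coprime moduli
M = 7 * 11 * 13        # dynamic range M = 1001; values in [-M/2, M/2)

def to_rns(x):
    return [x % m for m in MODULI]  # Python's % already folds negatives

def rns_add(a, b):
    return [(ra + rb) % m for ra, rb, m in zip(a, b, MODULI)]

def from_rns(res):
    x = 0
    for r, m in zip(res, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)
    x %= M
    return x - M if x >= M // 2 else x  # decode complement to signed

assert from_rns(rns_add(to_rns(-300), to_rns(100))) == -200
```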

Multiple-precision matrix-vector multiplication on graphics processing units

Program Systems: Theory and Applications, 2020

We consider a parallel implementation of matrix-vector multiplication (GEMV, Level 2 BLAS) for graphics processing units (GPUs) using multiple-precision arithmetic based on the residue number system. In our GEMV implementation, componentwise operations on multiple-precision vectors and matrices are split into parts, each executed by a separate CUDA kernel. This eliminates branching in the execution logic and allows fuller utilization of GPU resources. An efficient data structure for storing multiple-precision arrays ensures that accesses by parallel threads to the GPU's global memory are coalesced into transactions. For the proposed GEMV implementation, a rounding error analysis is performed and accuracy estimates are obtained. Experimental results are presented that show the high efficiency of the developed implementation compared with existing multiple-precision software packages for GPUs.

Multiple-Precision BLAS Library for Graphics Processing Units

The binary32 and binary64 floating-point formats provide good performance on current hardware but introduce a rounding error in almost every arithmetic operation. Consequently, the accumulation of rounding errors in large computations can cause accuracy issues. One way to prevent these issues is to use multiple-precision floating-point arithmetic. This preprint, submitted to Russian Supercomputing Days 2020, presents a new library of basic linear algebra operations with multiple precision for graphics processing units. The library is written in CUDA C/C++ and uses the residue number system to represent the multiple-precision significands of floating-point numbers. The supported data types, memory layout, and main features of the library are considered, and experimental results showing its performance are presented.

The Multiplication Method with Scaling the Result for High-Precision Residue Positional Interval Logarithmic Computations

Engineering Technologies and Systems, 2019

Introduction. The solution of simulation problems critical to rounding errors, including problems of computational mathematics, mathematical physics, optimal control, biochemistry, quantum mechanics, mathematical programming, and cryptography, requires accuracy of 100 to 1,000 decimal digits and more. The main drawback of high-precision software libraries is a significant decrease in speed, unacceptable for practical problems, particularly when performing multiplication. One way to increase the performance of computations on very long numbers is to use the residue number system. In this work, we discuss a new fast multiplication method with scaling of the result, using an original hybrid residue-positional interval-logarithmic floating-point number representation. Materials and Methods. The new way of organizing numerical information is a residue-positional interval-logarithmic number representation in which the mantissa is presented in the residue number system, and in…
