Squaring Algorithms with Delayed Carry Method and Efficient Parallelization
Related papers
Parallelization of Integer Squaring Algorithms with Delayed Carry
The growing volume of information that must be protected places specific demands on information security systems. The main goal of this paper is to speed up public-key cryptographic transformations by speeding up integer squaring. The authors use the delayed carry mechanism and the effective parallelization approaches for the Comba multiplication algorithm that they proposed previously. The idea is to accumulate carries while summing, column by column, the products of the corresponding machine words. As a result, the products in each column can be added independently of the other columns. However, independent accumulation of products and carries requires correcting the intermediate results to account for the accumulated carries. Because accumulation in the columns is independent, the accumulation of products can be parallelized, which allowed formulating several approaches...
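The column-wise accumulation described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 32-bit word size and the names `column_sum` and `square_delayed_carry` are assumptions, and Python's unbounded integers stand in for the wide column accumulators that a real implementation would have to size explicitly.

```python
W = 32                      # assumed machine word size in bits
MASK = (1 << W) - 1

def to_limbs(n, k):
    """Split a non-negative integer into k little-endian W-bit words."""
    return [(n >> (W * i)) & MASK for i in range(k)]

def from_limbs(limbs):
    """Recombine little-endian W-bit words into one integer."""
    return sum(limb << (W * i) for i, limb in enumerate(limbs))

def column_sum(a, c):
    """Accumulate column c of a*a without touching any other column."""
    k = len(a)
    i = max(0, c - k + 1)
    j = c - i
    acc = 0
    while i < j:
        acc += a[i] * a[j]      # cross product, counted once here
        i += 1
        j -= 1
    acc *= 2                    # every cross product occurs twice in a square
    if i == j:
        acc += a[i] * a[i]      # diagonal term occurs once
    return acc

def square_delayed_carry(a):
    """Square a limb array; carries are resolved only after accumulation."""
    # The columns are independent, so this step could be split across workers.
    cols = [column_sum(a, c) for c in range(2 * len(a))]
    out, carry = [], 0
    for c in cols:              # correction pass: assimilate the carries
        c += carry
        out.append(c & MASK)
        carry = c >> W
    return out
```

For example, `from_limbs(square_delayed_carry(to_limbs(x, 4)))` equals `x * x` for any `x` below 2^128.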
Efficient Big Integer Multiplication and Squaring Algorithms for Cryptographic Applications
Public-key cryptosystems are broadly employed to provide security for digital information. Improving the efficiency of public-key cryptosystems by speeding up calculation and using fewer resources is among the main goals of cryptography research. In this paper, we introduce new symbols extracted from the binary representation of integers, called Big-ones. We present modified versions of the classical multiplication and squaring algorithms based on the Big-ones to improve the efficiency of big integer multiplication and squaring in number-theory-based cryptosystems. Compared to the adopted classical and Karatsuba multiplication algorithms for squaring, the proposed squaring algorithm is 2 to 3.7 and 7.9 to 2.5 times faster for squaring 32-bit and 8-Kbit numbers, respectively. The proposed multiplication algorithm is also 2.3 to 3.9 and 7 to 2.4 times faster for multiplying 32-bit and 8-Kbit numbers, respectively. Number-theory-based cryptosystems that operate on 1-Kbit to 4-Kbit integers benefit directly from the proposed method, since multiplication and squaring are the main operations in most of these systems.
Techniques for Performance Improvement of Integer Multiplication in Cryptographic Applications
Mathematical Problems in Engineering, 2014
The performance of arithmetic operations in number fields is an active research topic, as evidenced by the many publications in this area. In this work, we offer some techniques to increase the performance of software implementations of the finite field multiplication algorithm on both 32-bit and 64-bit platforms. The developed technique, called the “delayed carry mechanism,” removes the need to handle the carry out of the most significant bit at each iteration of the sum accumulation loop. This mechanism reduces the total number of additions and makes effective use of modern parallelization technologies.
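A minimal sketch of the delayed carry idea for multiplication, assuming 32-bit words: products are added into per-column accumulators with no carry handling inside the loop, and a single pass at the end assimilates the carries. The function name `mul_delayed_carry` is hypothetical, and Python's unbounded integers hide the accumulator width that a C implementation would need to budget for.

```python
W = 32                      # assumed machine word size in bits
MASK = (1 << W) - 1

def to_limbs(n, k):
    """Split a non-negative integer into k little-endian W-bit words."""
    return [(n >> (W * i)) & MASK for i in range(k)]

def from_limbs(limbs):
    """Recombine little-endian W-bit words into one integer."""
    return sum(limb << (W * i) for i, limb in enumerate(limbs))

def mul_delayed_carry(a, b):
    """Multiply two limb arrays, deferring all carry propagation."""
    cols = [0] * (len(a) + len(b))       # one wide accumulator per column
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            cols[i + j] += ai * bj       # no carry is considered here
    out, carry = [], 0
    for c in cols:                       # carries are assimilated once, at the end
        c += carry
        out.append(c & MASK)
        carry = c >> W
    return out
```

Because the inner loop never reads a carry produced by another iteration, its iterations have no ordering constraint, which is what makes the accumulation step a candidate for parallel execution.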
Fast multiplication techniques for public key cryptography
2008 IEEE Symposium on Computers and Communications, 2008
We describe two novel techniques for multiplying polynomials which help with accelerating popular public key cryptographic algorithms like RSA and key exchange techniques like Elliptic Curve Diffie Hellman. The first technique is based on an algorithm for generating one-iteration Karatsuba-like formulae using graphs. The novelty of our approach lies in the correlation between graph properties (i.e. vertices, edges and sub-graphs) and the Karatsuba-like terms of big number multiplication routines. The second technique is an improvement over the one-iteration extension to Karatsuba proposed by Weimerskirch and Paar [2] that yields better performance when the input polynomials have an odd number of coefficients. We present experimental data that show that our techniques boost the performance of public key and key exchange algorithms substantially.
Multi-precision Squaring for Public-Key Cryptography on Embedded Microprocessors
Lecture Notes in Computer Science, 2013
In this paper, we revisit the "Lazy Doubling" (LD) method for multi-precision squaring, which reduces the number of addition operations by deferring the doubling process so that it can be performed on accumulated results. The original LD method has to employ carry-catcher registers to store carry values, which reduces the number of general-purpose registers available for optimizing the implementation. Furthermore, the LD method adopts the idea of hybrid multiplication to separate the partial products into several product blocks, which prevents the doubling process from being conducted on fully accumulated intermediate results. To overcome these deficiencies of the LD method and improve the performance of multi-precision squaring, we propose a novel and flexible method named "Sliding Block Doubling" (SBD). The SBD method delays the doubling process until the very end of the partial-product computation and then doubles the result by simply shifting it one bit to the left. To further reduce the overhead of doubling, we also optimize the execution process for updating carry values and adopt the product-scanning method for efficient computation of the partial products. Our experimental results on an AVR ATmega128 processor show that the SBD method outperforms state-of-the-art implementations by between 3.5% and 4.4% for operands ranging from 128 bits to 192 bits.
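The core idea of deferring the doubling can be sketched as below. This shows only the general principle (accumulate each cross product once, shift the whole sum left by one bit, then add the squared words); it is not the SBD method's block scheduling or register allocation. The 8-bit word size mirrors the AVR target mentioned above, and the function names are assumptions.

```python
W = 8                       # assumed word size, matching an 8-bit AVR
MASK = (1 << W) - 1

def to_limbs(n, k):
    """Split a non-negative integer into k little-endian W-bit words."""
    return [(n >> (W * i)) & MASK for i in range(k)]

def from_limbs(limbs):
    """Recombine little-endian W-bit words into one integer."""
    return sum(limb << (W * i) for i, limb in enumerate(limbs))

def propagate(cols):
    """Reduce wide column sums to W-bit words by propagating carries."""
    out, carry = [], 0
    for c in cols:
        c += carry
        out.append(c & MASK)
        carry = c >> W
    return out

def square_deferred_doubling(a):
    k = len(a)
    cross = [0] * (2 * k)
    for i in range(k):
        for j in range(i + 1, k):
            cross[i + j] += a[i] * a[j]   # each cross product once, undoubled
    cross = propagate(cross)
    doubled, top = [], 0
    for limb in cross:                    # doubling = one-bit left shift
        doubled.append(((limb << 1) | top) & MASK)
        top = limb >> (W - 1)
    for i in range(k):
        doubled[2 * i] += a[i] * a[i]     # add the diagonal squares last
    return propagate(doubled)
```

The shift never loses a bit, because the undoubled sum of cross products is always less than half of the full square and therefore leaves the top bit of the result clear.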
Speeding up the Multiplication Algorithm for Large Integers
2020
Multiplication is one of the basic operations that influence the performance of many computer applications such as cryptography. The main challenge of the multiplication operation is the cost of the operation as compared to other basic operations such as addition and subtraction, especially when the size of the numbers is large. In this work, we investigate the use of the window strategy for multiplying a sequence of large integers to design an efficient sequential algorithm in order to reduce the number of bit-multiplication operations involved in multiplying a sequence of large integers. In our implementation, several parameters are considered and measured for their effect on the proposed algorithm and the best-known sequential algorithm in the literature. These parameters are the size of the sequence of integers, the size of the integers, the size of the window, and the distribution of the data. The experimental results prove the effectiveness of the proposed algorithm are compar...
Efficient Modular Multiplication Algorithms for Public Key Cryptography
Modular exponentiation is an important operation for cryptographic transformations in public key cryptosystems such as the Rivest, Shamir and Adleman, the Diffie and Hellman, and the ElGamal schemes. Computing x^a mod n and x^a y^b mod n for very large x, y and n is fundamental to the efficiency of almost all public key cryptosystems and digital signature schemes. To achieve a high level of security, the word length in the modular exponentiations should be significantly large. The performance of public key cryptography is primarily determined by the implementation efficiency of modular multiplication and exponentiation. As the words are usually large, and in order to optimize the time taken by these operations, it is essential to minimize the number of modular multiplications. In this paper we present efficient algorithms for computing x^a mod n and x^a y^b mod n. We propose four algorithms to evaluate modular exponentiation. Bit forwarding (BFW) algorithms are proposed to compute x^a mod n, and two algorithms, Substitute and reward (SRW) and Store and forward (SFW), are proposed to compute x^a y^b mod n. All the proposed algorithms are efficient in terms of time and at the same time demand only minimal additional space to store the pre-computed values. These algorithms are suitable for devices with low computational power and limited storage.
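For orientation, the two quantities can be computed with well-known textbook methods, sketched below. These are baselines only and are not the BFW, SRW, or SFW algorithms, whose details the abstract does not give: `modexp` is right-to-left binary square-and-multiply, and `double_modexp` is simultaneous exponentiation (often called Shamir's trick), which shares one chain of squarings between both exponents and precomputes the single value x*y mod n. Both function names are hypothetical.

```python
def modexp(x, a, n):
    """x^a mod n by right-to-left binary square-and-multiply."""
    result = 1 % n
    x %= n
    while a > 0:
        if a & 1:
            result = (result * x) % n
        x = (x * x) % n
        a >>= 1
    return result

def double_modexp(x, a, y, b, n):
    """x^a * y^b mod n using one shared chain of squarings."""
    table = {
        (1, 0): x % n,
        (0, 1): y % n,
        (1, 1): (x * y) % n,       # the only precomputed product
    }
    result = 1 % n
    top = max(a.bit_length(), b.bit_length())
    for bit in range(top - 1, -1, -1):      # scan both exponents left to right
        result = (result * result) % n
        key = ((a >> bit) & 1, (b >> bit) & 1)
        if key != (0, 0):
            result = (result * table[key]) % n
    return result
```

Computing the two powers separately costs about twice as many squarings as the shared chain, which is why reducing the number of modular multiplications matters for the double-exponent case.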
Enhancing the Time Complexity of Mathematically Intensive Algorithms; the Case of Cryptography
International journal of computer applications, 2023
This article compares the performance of the RSA encryption algorithm on two different hardware architectures, namely a CPU and a CUDA GPU. The RSA encryption algorithm is widely used for secure data storage and transmission. The algorithm requires complex mathematical processes that can be computationally demanding and can take significant time to execute, particularly for larger key sizes. A parallelization technique is proposed in this article, which leverages the capabilities of GPUs to speed up the RSA algorithm. The research is done by experiment, using different key sizes to measure the performance of RSA on both platforms, CPU and GPU. The proposed approach involves the parallelization of the most computationally intensive parts of the RSA algorithm, including modular exponentiation and multiplication. The GPU implementation of the RSA algorithm is done using CUDA, a programming model developed by NVIDIA for parallel computing on GPUs. The experimental results show the effectiveness of using GPUs to accelerate the RSA algorithm, resulting in faster and more efficient cryptographic solutions. This has significant implications for real-world applications, especially those that are mathematically intensive and demand secure and effective data transmission, like e-commerce, banking, and other financial services.
Approaches for the Parallelization of Software Implementation of Integer Multiplication
This paper considers several approaches to increasing the performance of software implementations of the integer multiplication algorithm on 32-bit and 64-bit platforms via parallelization. The main idea of the parallelization is the delayed carry mechanism that the authors proposed earlier. Delayed carry removes the dependency between iterations of the loop that accumulates sums of products, which allows loop iterations to execute in parallel in separate threads. Once the accumulation threads complete, the final result must be corrected by assimilating the carries. The first approach optimizes parallelization for two execution threads; the second evolves the first and targets three or more execution threads. The proposed approaches increase the total computational complexity of the algorithm relative to a single execution thread, but decrease total execution time o...
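One way to picture the thread-level split is sketched below, assuming 32-bit words. Each worker accumulates column sums for its own subset of rows into a private array, and the carries are assimilated once after all workers finish. The row-interleaved partition and the names `partial_columns` and `mul_parallel` are assumptions for illustration, not the paper's two-thread or multi-thread schemes. In CPython the global interpreter lock prevents a real speedup here, so the code demonstrates the structure only.

```python
from concurrent.futures import ThreadPoolExecutor

W = 32                      # assumed machine word size in bits
MASK = (1 << W) - 1

def to_limbs(n, k):
    """Split a non-negative integer into k little-endian W-bit words."""
    return [(n >> (W * i)) & MASK for i in range(k)]

def from_limbs(limbs):
    """Recombine little-endian W-bit words into one integer."""
    return sum(limb << (W * i) for i, limb in enumerate(limbs))

def partial_columns(a, b, rows):
    """Accumulate column sums for the given rows of a, with no carries."""
    cols = [0] * (len(a) + len(b))
    for i in rows:
        for j, bj in enumerate(b):
            cols[i + j] += a[i] * bj
    return cols

def mul_parallel(a, b, workers=2):
    # Interleave rows across workers: worker t takes rows t, t+workers, ...
    chunks = [range(t, len(a), workers) for t in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda rows: partial_columns(a, b, rows), chunks))
    out, carry = [], 0
    for column in zip(*parts):           # merge partial sums, then assimilate
        c = sum(column) + carry
        out.append(c & MASK)
        carry = c >> W
    return out
```

The private per-worker arrays and the final merge are extra work compared with a single-threaded loop, which matches the abstract's remark that total computational complexity rises even as the work is spread over several threads.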