Amir Kaivani | University of Saskatchewan
Papers by Amir Kaivani
IET Computers and Digital Techniques, 2007
With the growing popularity of decimal computer arithmetic in scientific, commercial, financial and Internet-based applications, hardware realisation of decimal arithmetic algorithms is gaining more importance. Hardware decimal arithmetic units now serve as an integral part of some recently commercialised general-purpose processors, where complex decimal arithmetic operations, such as multiplication, have been realised by rather slow iterative hardware algorithms. However, with the rapid advances in very large scale integration (VLSI) technology, semi- and fully parallel hardware decimal multiplication units are expected to evolve soon. The dominant representation for decimal digits is the binary-coded decimal (BCD) encoding. The BCD-digit multiplier can serve as the key building block of a decimal multiplier, irrespective of the degree of parallelism. A BCD-digit multiplier produces a two-BCD-digit product from two input BCD digits. We provide a novel design for the latter, showing some advantages in BCD multiplier implementations.
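As a behavioural illustration of the input/output contract described above (two BCD digits in, a two-BCD-digit product out), a minimal Python sketch follows. It is not the design proposed in the paper; the function names `bcd_digit_multiply` and `pack_bcd` are hypothetical and model only what the block computes, not how the hardware computes it.

```python
def bcd_digit_multiply(a: int, b: int) -> tuple[int, int]:
    """Multiply two BCD digits and return (tens, units), each a BCD digit.

    Behavioural model only: it states the function a BCD-digit multiplier
    computes, not any particular gate-level realisation.
    """
    if not (0 <= a <= 9 and 0 <= b <= 9):
        raise ValueError("inputs must be valid BCD digits (0..9)")
    product = a * b               # at most 9 * 9 = 81, so two digits suffice
    return divmod(product, 10)    # (tens digit, units digit)


def pack_bcd(tens: int, units: int) -> int:
    """Pack two BCD digits into one byte, tens digit in the high nibble."""
    return (tens << 4) | units


if __name__ == "__main__":
    tens, units = bcd_digit_multiply(9, 9)
    print(tens, units, hex(pack_bcd(tens, units)))
```

Because the largest product is 81, the tens digit never exceeds 8, which is why two BCD digits are always enough for the result.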
The authors study previous major contributions to digit-recurrence decimal division hardware and focus on techniques for improving the performance of quotient digit selection (QDS), the most complex part. In particular, Design D1 uses the digit set [-5, 5] for quotient digits. Another design (D2) uses mixed binary/decimal carry-save manipulation of the few most significant digits of partial remainders.
With the growing popularity of decimal computer arithmetic in scientific, commercial, financial and Internet-based applications, hardware realisation of decimal arithmetic algorithms is gaining more importance. Hardware decimal arithmetic units now serve as an integral part of some recently commercialised general-purpose processors, where complex decimal arithmetic operations, such as multiplication, have been realised by rather slow iterative hardware algorithms.
The Computer Journal, Jan 1, 2011
Hardware implementation of decimal floating-point arithmetic is a topic of great interest among researchers in computer arithmetic and in the digital processor industry. Software packages for decimal arithmetic are being challenged by decimal hardware units, and this trend seems to extend to hardware implementation of elementary functions. The CORDIC (Coordinate Rotation Digital Computer) algorithm, due to its simplicity, is one of the most efficient methods for computing elementary functions. In this work, we develop a decimal CORDIC scheme with almost half the number of equally long cycles compared with the best previous design. This is achieved via retiming of the conventional CORDIC architecture and selection of the microrotation factors by rounding. However, the proposed design does not lead to a predetermined constant scaling factor. The solution that we use is to iteratively compute the logarithm of the scaling factor, followed by a decimal exponentiation. The same CORDIC hardware is reused for performing the latter. The proposed CORDIC method requires 2n + 3 cycles for n-digit decimal operands vs. 4n cycles for the previous methods. Evaluations with 16-digit operands based on logical effort analysis conclude that the proposed architecture shows an 82% speed advantage, at the cost of 60% more area and 2.5 KB more ROM.
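The cycle counts quoted above can be checked with a few lines of arithmetic. The sketch below only evaluates the two formulas from the abstract (2n + 3 and 4n); the function names are hypothetical, and the ratio is a cycle-count ratio, which equals a speed ratio only under the abstract's statement that the cycles are equally long.

```python
def proposed_cycles(n: int) -> int:
    """Cycle count of the proposed scheme for n-digit operands (2n + 3)."""
    return 2 * n + 3


def previous_cycles(n: int) -> int:
    """Cycle count quoted for the previous methods (4n)."""
    return 4 * n


def cycle_ratio(n: int) -> float:
    """Previous cycle count divided by proposed cycle count."""
    return previous_cycles(n) / proposed_cycles(n)


if __name__ == "__main__":
    n = 16
    print(proposed_cycles(n), previous_cycles(n), round(cycle_ratio(n), 2))
```

For n = 16 this gives 35 cycles against 64, a ratio of roughly 1.83, which appears consistent with the reported 82% speed advantage.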
Integration, the VLSI Journal, Jan 1, 2010
Decimal computer arithmetic is experiencing a revived popularity, and there is a quest for high-performance decimal hardware units. Successful experiences in binary computer arithmetic may find grounds in decimal arithmetic. For example, the traditional fully redundant (i.e., the result and both of the operands are represented in a redundant format) and semi-redundant (i.e., the result and only one of the operands are redundant) binary addition schemes have influenced the design and implementation of similar decimal arithmetic units. However, special comparison and correction steps are required when decimal arithmetic algorithms are implemented on binary hardware. To circumvent these difficulties, alternative encodings of decimal digits and a variety of decimal arithmetic algorithms have been examined by many researchers over decades. In this paper we offer a new redundant decimal digit set [-8, 9] and a fully redundant addition/subtraction scheme. The proposed digit set, faithfully encoded as a mix of posibits, negabits, and unibits, is shown to obviate the need for any compare-to-9 operations and leads to minimal-penalty subtraction using the addition circuitry. Moreover, conversion from the standard BCD encoding to the proposed stored-unibit encoding is possible with the latency of one logic level. However, the reverse conversion, like any other redundant-to-nonredundant conversion, involves carry propagation.
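The last point, that redundant-to-nonredundant conversion involves carry propagation, can be illustrated at the digit level. The sketch below treats a number as a list of digits drawn from [-8, 9], least significant first, and converts it to conventional digits 0..9 by rippling a borrow. It does not model the posibit/negabit/unibit encoding of the paper; the function names are hypothetical, and only non-negative values are handled.

```python
def redundant_value(digits: list[int]) -> int:
    """Integer value of a digit list (least significant digit first)."""
    return sum(d * 10**i for i, d in enumerate(digits))


def to_nonredundant(digits: list[int]) -> list[int]:
    """Convert digits in [-8, 9] to conventional digits in [0, 9].

    A borrow of -1 ripples from the least significant position upward,
    which is the carry propagation referred to above. Raises ValueError
    for negative totals, which this sketch does not handle.
    """
    out = []
    borrow = 0
    for d in digits:
        if not -8 <= d <= 9:
            raise ValueError("digit outside the set [-8, 9]")
        t = d + borrow            # t lies in [-9, 9]
        if t < 0:
            out.append(t + 10)    # borrow ten from the next position
            borrow = -1
        else:
            out.append(t)
            borrow = 0
    if borrow:
        raise ValueError("negative values are not handled in this sketch")
    return out


if __name__ == "__main__":
    x = [-8, 0, 1]                # 1*100 + 0*10 - 8 = 92
    print(redundant_value(x), to_nonredundant(x))
```

In the example, the negative least significant digit forces a borrow through the middle position into the top digit, so the result of each position depends on all less significant positions.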
2007 IEEE/ACS International Conference on …, Jan 1, 2007
Data shifting is required in many key computer operations, from address decoding to computer arithmetic. Full barrel shifters are often on the critical path, which has led most research to be directed toward speed optimizations. With the advent of quantum computing and reversible logic, the design and implementation of devices in this logic has received more attention. This paper proposes a reversible implementation of a barrel shifter and presents an evaluation of its quantum cost.
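For readers unfamiliar with barrel shifters, the following Python sketch models a conventional logarithmic barrel rotator: stage k rotates by 2^k positions when bit k of the shift amount is set. It is an assumption-laden behavioural model, not the reversible circuit of the paper, and it says nothing about quantum cost. The function name `barrel_rotate_left` is hypothetical.

```python
def barrel_rotate_left(bits: list, amount: int) -> list:
    """Rotate a list left by `amount` positions using log2-style stages.

    Stage k rotates by 2**k positions if bit k of the amount is set,
    mirroring the layered multiplexer structure of a barrel shifter.
    """
    n = len(bits)
    if n == 0:
        return []
    amount %= n
    stage = 0
    while (1 << stage) < n:
        if (amount >> stage) & 1:
            k = (1 << stage) % n
            bits = bits[k:] + bits[:k]   # rotate left by 2**stage
        stage += 1
    return bits


if __name__ == "__main__":
    word = list(range(8))
    rotated = barrel_rotate_left(word, 3)
    print(rotated)
    # A rotation is a permutation, so rotating by n - amount undoes it.
    print(barrel_rotate_left(rotated, len(word) - 3))
```

A rotation loses no information, so it can always be undone by a second rotation; a plain shift that discards bits does not have this property, which is one reason reversible designs need extra care.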
IEEE Transactions on Computers, Jan 1, 2009
Hardware support for decimal computer arithmetic is regaining popularity. One reason is the recent growth of decimal computations in commercial, scientific, financial, and Internet-based computer applications. Newly commercialized decimal arithmetic hardware units use radix-10 sequential multipliers that are rather slow for multiplication-intensive applications. Therefore, future processors in this domain are likely to host fast parallel decimal multiplication circuits. The corresponding hardware algorithms are normally composed of three steps: partial product generation (PPG), partial product reduction (PPR), and final carry-propagating addition. The state of the art is represented by two recent full solutions with alternative designs for all three of these steps. In addition, PPR by itself has been the focus of other recent studies. In this paper, we examine both of the full solutions and the impact of a PPR-only design on the appropriate one. In order to improve the speed of parallel decimal multiplication, we present a new PPG method and fine-tune the PPR method of one of the full solutions and the final addition scheme of the other, thus assembling a new full solution. Logical effort analysis and 0.13 μm synthesis show at least a 13 percent speed advantage, but at a cost of at most 36 percent additional area consumption.
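The three steps named above (PPG, PPR, final carry-propagating addition) can be sketched behaviourally in Python. This is a minimal sketch under simplifying assumptions, not the PPG or PPR method of the paper: partial products are formed one per multiplier digit, reduction uses a simple sequential digit-wise decimal carry-save step rather than a parallel reduction tree, and all function names are hypothetical.

```python
def to_digits(value: int, width: int) -> list[int]:
    """Decimal digits of value, least significant first, padded to width."""
    return [(value // 10**i) % 10 for i in range(width)]


def from_digits(digits: list[int]) -> int:
    """Inverse of to_digits; also accepts carry digits larger than 9."""
    return sum(d * 10**i for i, d in enumerate(digits))


def partial_products(x: int, y: int, n: int) -> list[list[int]]:
    """PPG: one weighted row per multiplier digit, each 2n digits wide."""
    return [to_digits(x * d * 10**i, 2 * n)
            for i, d in enumerate(to_digits(y, n))]


def carry_save_add(a, b, c):
    """Digit-wise 3:2 step: a + b + c == sum + carry, with no carry ripple."""
    width = len(a)
    s, carry = [0] * width, [0] * width
    for i in range(width):
        t = a[i] + b[i] + c[i]
        s[i] = t % 10
        if i + 1 < width:
            carry[i + 1] = t // 10   # carry is stored, not propagated
        else:
            assert t < 10            # a 2n-digit product cannot overflow
    return s, carry


def reduce_rows(rows):
    """PPR: fold all rows into a (sum, carry) pair."""
    width = len(rows[0])
    s, carry = [0] * width, [0] * width
    for row in rows:
        s, carry = carry_save_add(s, carry, row)
    return s, carry


def multiply(x: int, y: int, n: int) -> int:
    """Multiply two n-digit decimal numbers via PPG, PPR, final addition."""
    s, carry = reduce_rows(partial_products(x, y, n))
    return from_digits(s) + from_digits(carry)   # carry-propagating add


if __name__ == "__main__":
    print(multiply(1234, 5678, 4), 1234 * 5678)
```

Only the last step resolves carries across digit positions; the reduction step keeps them in a separate row, which is what allows hardware to perform it without long carry chains.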