An Operand-Optimized Asynchronous IEEE 754 Double-Precision Floating-Point Adder
Related papers
Operand-Optimized Asynchronous Floating-Point Arithmetic Circuits
2012
Fast floating-point computations are critical in a wide range of applications. Today, the performance of these applications is limited by power constraints. Traditional power reduction schemes, which relied primarily on technology and voltage scaling, are no longer sufficient. In this thesis, we propose two novel asynchronous pipeline templates and multiple operand-dependent optimization techniques to significantly reduce overall power consumption while preserving average throughput. Our pipeline templates reduce power consumption by minimizing the handshake circuitry and employing a single-track handshake protocol. Noise and timing robustness constraints of our pipelined circuits are quantified across all process corners. A completion detection scheme based on wide NOR gates is presented, which results in significant latency and energy savings, especially as the number of output tokens increases. Furthermore, this thesis presents novel operand-dependent optimization techniques to improve the energy efficiency of IEEE-754 compliant floating-point adder and floating-point multiplier designs. Some of these optimizations are highly challenging, if at all possible, in a synchronous design because they increase the worst-case critical path but on average have negligible impact on performance. To our knowledge, this is the first detailed design of a high-performance asynchronous floating-point adder and floating-point multiplier.
Reduced latency IEEE floating-point standard adder architectures
Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)
The design and implementation of a double precision floating-point IEEE-754 standard adder is described which uses "flagged prefix addition" to merge rounding with the significand addition. The floating-point adder is implemented in 0.5 µm CMOS, measures 1.8 mm², has a 3-cycle latency and implements all rounding modes. A modified version of this floating-point adder can perform accumulation in 2 cycles with a small amount of extra hardware for use in a parallel processor node. This is achieved by feeding back the previous un-normalised but correctly rounded result together with the normalisation distance. A 2-cycle latency floating-point adder architecture with potentially the same cycle time that also employs flagged prefix addition is described. It also incorporates a fast prediction scheme for the true subtraction of significands with an exponent difference of 1, with one less adder.
2PSA: An Optimized and Flexible Power-Precision Scalable Adder
2020 33rd Symposium on Integrated Circuits and Systems Design (SBCCI), 2020
Adders are the core of all arithmetic circuits, and proposals for efficient adders, from distinct perspectives, have been a constant in recent decades, with a myriad of solutions focusing on a wide variety of applications. The emergence of approximate computing encouraged the development of a new generation of dedicated imprecise adders intended to reduce delay, area, power and/or energy, but none of the proposed solutions is able to support run-time definition of distinct power-precision operation points. This article presents the Power-Precision Scalable Adder (2PSA), a dynamically configurable power-precision imprecise adder in which the number of power-precision operation points can be configured at design time and each supported power-precision operation point can be changed at run time. The experimental results showed that 2PSA is a fully flexible and efficient imprecise adder, supporting a high variety of power-SNR pairs, as well as a wide range of applications. Considering 8-bit adders, the power-SNR pairs vary from 2%-55dB to 60%-13.75dB, where eight operation points are allowed. Considering 64-bit adders, the power-SNR pairs range from 2%-325dB to 60%-16.65dB and 64 operation points are allowed. 2PSA also reached significant power (from 18% to 73%) and area (from 54% to 73%) savings when compared with non-optimized solutions supporting the same operation points.
A low power approach to floating point adder design
Proceedings International Conference on Computer Design VLSI in Computers and Processors
Floating point adders are area and power intensive, but essential in high performance systems. The Software-Controlled Architectures and Low Energy (SCALE) project requires a low-power single-precision IEEE floating-point adder cluster. Two adder architectures, one containing a single longer computational path, and one containing two shorter parallel computational paths were implemented using minimal area modules. Inputs to the parallel computational paths were registered, and only enabled when that computational path was valid, reducing switching activity. Energy measurements were made of the dual path adder with and without inhibit control and the single path adder, to determine the most energy efficient design.
An efficient floating point adder for low-power devices
With an increasing demand for power-hungry, data-intensive computing, design methodologies with low power consumption are increasingly gaining prominence in the industry. Most systems operate on both critical and non-critical data. An attempt to generate a precise result leads to excessive power consumption and a slower system. For non-critical data, approximate computing circuits significantly reduce the circuit complexity and hence power consumption. In this paper, a novel approximate single precision floating point adder is proposed with an approximate mantissa adder. The mantissa adder is designed with three 8-bit full adder blocks.
A dual precision IEEE floating-point multiplier
Integration, the VLSI Journal, 2000
A new algorithm for computing IEEE-compliant rounding is presented, called injection-based rounding. Injection-based rounding is simple and facilitates using the same rounding circuitry for different precisions. We demonstrate the usefulness of injection-based rounding in a design of an IEEE floating-point multiplier capable of performing either a double-precision multiplication or a single-precision multiplication. The multiplier is designed to minimize hardware cost by using only a half-sized multiplication array and by sharing the rounding circuitry for both precisions. The latency of the multiplier is two clock cycles in single precision and three clock cycles in double precision, where each pipeline stage contains roughly 15 logic levels.
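The abstract does not spell out the algorithm, so the following is only a minimal integer-level sketch, assuming the usual formulation of injection-based rounding: each rounding mode is reduced to truncation by first adding a mode-dependent constant (the "injection"), with a small fix-up for ties in round-to-nearest-even. The function name `inject_round`, the mode labels, and the parameter `k` (number of low bits to discard) are hypothetical and chosen for illustration; the value is treated as an unsigned magnitude.

```python
def inject_round(x: int, k: int, mode: str) -> int:
    """Drop the k low bits of the non-negative integer x (k >= 1).

    Assumed modes (labels are illustrative, not from the paper):
      "RZ"  - round toward zero (truncate)
      "RNE" - round to nearest, ties to even
      "RI"  - round away from zero (toward infinity for a magnitude)
    """
    if x < 0 or k < 1:
        raise ValueError("expects x >= 0 and k >= 1")

    low_mask = (1 << k) - 1
    if mode == "RZ":
        injection = 0                 # plain truncation
    elif mode == "RNE":
        injection = 1 << (k - 1)      # half a unit in the last kept place
    elif mode == "RI":
        injection = low_mask          # just under one unit in the last kept place
    else:
        raise ValueError(f"unknown mode: {mode}")

    # After the injection, every mode is handled by the same truncation step.
    injected = x + injection
    result = injected >> k

    # Tie fix-up: the discarded bits of the injected sum are all zero only
    # when x was exactly halfway, so force the kept LSB to 0 (even).
    if mode == "RNE" and (injected & low_mask) == 0:
        result &= ~1
    return result


if __name__ == "__main__":
    for mode in ("RZ", "RNE", "RI"):
        print(mode, [inject_round(v, 2, mode) for v in (8, 9, 10, 11, 14)])
```

Because only the injection constant depends on the mode (and, in a multi-precision design, on where the rounding position sits), the adder and truncation logic can be shared, which is consistent with the abstract's point about reusing the rounding circuitry for both precisions.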
IJERT-Design And Verification Of High Speed And Efficient Asynchronous Floating Point Multiplier
International Journal of Engineering Research and Technology (IJERT), 2013
https://www.ijert.org/design-and-verification-of-high-speed-and-efficient-asynchronous-floating-point-multiplier https://www.ijert.org/research/design-and-verification-of-high-speed-and-efficient-asynchronous-floating-point-multiplier-IJERTV2IS70701.pdf
We present the details of our energy-efficient asynchronous floating-point multiplier (FPM). We discuss design trade-offs of various multiplier implementations. A higher radix array multiplier design with an operand-dependent carry-propagation adder and a low handshake overhead pipeline design is presented, which yields significant energy savings while preserving the average throughput. Our FPM also includes a hardware implementation of denormal and underflow cases. When compared against a custom synchronous FPM design, our asynchronous FPM consumes 3X less energy per operation while operating at 2.3X higher throughput. To our knowledge, this is the first detailed design of a high-performance asynchronous IEEE-754 compliant double-precision floating-point multiplier.
Efficient dual-precision floating-point fused-multiply-add architecture
Microprocessors and Microsystems, 2018
The fused-multiply-add (FMA) instruction has been a common instruction in RISC processors since 1990. A 3-stage, 8-level pipelined, dual-precision FMA is proposed here that can perform either one double-precision operation (SISD) or two single-precision operations in parallel (SIMD). The 53-bit mantissa multiplier (MM) is optimally segmented by the Karatsuba-Ofman (KO) algorithm such that both modes can be performed. The 6-stage pipelined MM uses only 6 of 10 multipliers and 13 of 33 adder/subtractors in SIMD. Thus the hardware area of the proposed MM is reduced by 23.82% and throughput is maintained at 923M samples/s. The arithmetic operational units in the data path are shared among the modes by having four data rearrangement units (DRU), which rearrange the data systematically at the input, the outputs of the MM and the final output. Though these DRUs bring some hardware overhead, the resulting architecture is modular and uniform for both modes of computation. The proposed FMA has been implemented using the TSMC 1P6M CMOS 130 nm library and takes 48% less overall area and consumes 49% less power at 308.7 MHz compared to previous results. The area-delay product (ADP), 0.48 × 10⁻¹⁵, shows that the area optimization by the proposed KO-based MM can also keep the computation time at 3.24 ns.
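As a rough illustration of the Karatsuba-Ofman idea the abstract relies on, here is a single-level sketch that multiplies two 53-bit mantissas using three narrower products instead of four. The split point of 27 bits and the function name `karatsuba_once` are assumptions made for this example; the paper's actual segmentation (10 multipliers and 33 adder/subtractors) is more elaborate and is not reproduced here.

```python
import random


def karatsuba_once(a: int, b: int, split: int = 27) -> int:
    """One Karatsuba-Ofman step on non-negative integers a and b.

    Each operand is cut into a high and a low part at bit position `split`
    (27 is an illustrative choice for 53-bit mantissas, not the paper's).
    """
    mask = (1 << split) - 1
    a_hi, a_lo = a >> split, a & mask
    b_hi, b_lo = b >> split, b & mask

    z2 = a_hi * b_hi                              # high x high
    z0 = a_lo * b_lo                              # low x low
    # The cross term a_hi*b_lo + a_lo*b_hi, obtained from one extra product.
    z1 = (a_hi + a_lo) * (b_hi + b_lo) - z2 - z0

    return (z2 << (2 * split)) + (z1 << split) + z0


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        # 53-bit significands with the leading (hidden) bit set.
        a = (1 << 52) | rng.getrandbits(52)
        b = (1 << 52) | rng.getrandbits(52)
        assert karatsuba_once(a, b) == a * b
    print("one-level Karatsuba-Ofman matches a * b")
```

The saving comes from replacing the two cross products with one product of sums plus a few additions and subtractions, which is the multiplier-for-adder trade that the abstract's multiplier and adder/subtractor counts reflect.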
IJERT-Design and Simulation of Double Precision Floating-Point Adder
International Journal of Engineering Research and Technology (IJERT), 2015
https://www.ijert.org/design-and-simulation-of-double-precision-floating-point-adder https://www.ijert.org/research/design-and-simulation-of-double-precision-floating-point-adder-IJERTV4IS100560.pdf
Floating-point numbers are a very important part of computer processing. Addition poses a great challenge due to its processing time. This paper presents a technique for addition of IEEE 754 double precision floating-point numbers within two clock cycles. The results also show improvements in power utilization, chip area and hardware optimization. The proposed adder is implemented on the 6slx45tfgg484-3 (Spartan family) and 5vfx70tff1136-1 (Virtex family) Xilinx FPGA devices.
An efficient multiple precision floating-point Multiply-Add Fused unit
Microelectronics Journal, 2016
Multiply-Add Fused (MAF) units play a key role in the processor's performance for a variety of applications. The objective of this paper is to present a multi-functional, multiple precision floating-point Multiply-Add Fused (MAF) unit. The proposed MAF is reconfigurable and able to execute a quadruple precision MAF instruction, or two double precision instructions, or four single precision instructions in parallel. The MAF architecture features a dual-path organization reducing the latency of the floating-point add (FADD) instruction and utilizes the minimum number of operating components to keep the area low. The proposed MAF design was implemented on a 65nm silicon process achieving a maximum operating frequency of 293.5 MHz at 381 mW power.