An efficient multiple precision floating-point Multiply-Add Fused unit (original) (raw)
Related papers
Efficient dual-precision floating-point fused-multiply-add architecture
Microprocessors and Microsystems, 2018
The fused-multiply-add (FMA) instruction is a common instruction in RISC processors since 1990. A 3-stage, 8level pipelined, dual-precision FMA is proposed here that can perform operations either at one double precision (SISD) or at two single precision in parallel (SIMD). The 53-bit mantissa-multiplier (MM) is optimally segmented by Karatsuba-Offman (KO) algorithm such that both modes can be performed. The 6-stage pipelined MM uses only 6 of 10 multipliers and 13 of 33 adder/subtractors in SIMD. Thus hardware area of the proposed MM is reduced by 23.82% and throughput is maintained to be 923M samples/s. The arithmetic operational units in the data path are shared among the modes by having four data rearrangement units (DRU) which rearranges the data systematically at the input, the outputs of MM and the final output. Though these DRUs bring some hardware overhead, the resulting architecture is modular and uniform for both modes of computation. The proposed FMA has been implemented using TSMC 1P6M CMOS 130 nm library and takes 48% less overall area and consumes 49% less power at 308.7 MHz compared to previous results. The area-delay-product (ADP), 0.48 × 10 −15 shows that the area optimization by proposed KO based MM can also keep the computation time as 3.24 ns.
An efficient dual-mode floating-point Multiply-Add Fused Unit
2010 17th IEEE International Conference on Electronics, Circuits and Systems, 2010
Multiply-Add Fused (MAF) units play a key role in the processor's performance for a variety of applications. Aiming at improving the MAF functionality this paper presents a dualmode MAF architecture, which is able to perform either one double-precision or two single-precision operations in parallel. The design attains low latency by following a dual-path approach and by combining final addition with rounding. The organization performs a MAF instruction in three cycles, while single floatingpoint addition in two cycles. The design has been validated and implemented with TSMC 0.13um.
A 270ps 20mW 108-bit End-around Carry Adder for Multiply-Add Fused Floating Point Unit
Journal of Signal Processing Systems, 2010
A power and area efficient 108-bit end-around carry adder is implemented using IBM 65nm SOI technology. The adder is used for a multiply-add fused (MAF) floating point unit. Careful balance of the adder structure and structure-aware layout techniques enabled this adder to have a latency of 270ps at power consumption of 20mW with 1V supply.
A decimal floating-point fused-multiply-add unit
2010 53rd IEEE International Midwest Symposium on Circuits and Systems, 2010
This paper presents the first hardware implementation of a fully parallel decimal floating-point fusedmultiply-add unit performing the operation ± (A × B) ± C on decimal floating-point operands. The proposed design is fully compliant with the IEEE 754-2008 standard and supports the two standard formats decimal64 and decimal128. Furthermore, the proposed design may be controlled to perform the multiplication or the addition/subtraction as standalone operations. Our decimal floating-point FMA may be pipelined so that a complete resultant decimal floating-point is available each clock cycle. I.
International Journal of Intelligent Engineering and Systems, 2017
Reconfigurable architectures have provided a low cost, fast turnaround platform for the development and deployment of designs in communication and signal processing applications. The floating point operations are used in most of the signal processing applications that require high precision and good accuracy. In this paper, an effective implementation of Fused Floating-point Add-Subtract (FFAS) unit with a modification in dual path design is presented. To enhance the performance of FFAS unit for reconfigurable architectures, a dual path unit with a modification in close path design is proposed. The proposed design is targeted on a Xilinx Virtex-6 device and implemented on ML605 Evaluation board for single, double and double extended precision. When compared to discrete floating point adder design, the FFAS unit reduces area requirement and power dissipation as the later shares common logic. A Dual Path FFAS (DPFFAS) unit has reduced latency when compared with FFAS unit. The latency is further reduced with the proposed modified DPFFAS when compared with DPFFAS for reconfigurable architectures.
Floating-Point Single-Precision Fused Multiplier-adder Unit on FPGA
The fused multiply-add operation improves many calculations and therefore is already available in some generalpurpose processors, like the Itanium. The optimization of units dedicated to execute the multiply-add operation is therefore crucial to achieve optimal performance when running the overlying applications. In this paper, we present a single-precision floating-point fused multiply-add optimized unit implemented in FPGA and prepared to integrate a data flow processor for high-performance computing. The unit presents a numerical accuracy according to the IEEE 754-2008 standard and a performance and resource usage comparable with a state-of-the-art non-fused single-precision unit. The fused multiplier-adder was implemented targeting a Virtex-7 speed-grade-1 device and occupies 754 LUTs, 4 DSPs and achieves a maximum frequency of 361 MHz with 18 pipeline stages. A lighter low latency design of the same unit was also implemented in the same device presenting a resource usage of 845 LUTs, 2 DSPs and achieving a maximum frequency of 285 MHz.
An Operand-Optimized Asynchronous IEEE 754 Double-Precision Floating-Point Adder
2010 IEEE Symposium on Asynchronous Circuits and Systems, 2010
We present the design and implementation of an asynchronous high-performance IEEE 754 compliant doubleprecision floating-point adder (FPA). We provide a detailed breakdown of the power consumption of the FPA datapath, and use it to motivate a number of different data-dependent optimizations for energy-efficiency. Our baseline asynchronous FPA has a throughput of 2.15 GHz while consuming 69.3 pJ per operation in a 65nm bulk process. For the same set of nonzero operands, our optimizations improve the FPA's energy-efficiency to 30.2 pJ per operation while preserving average throughput, a 56.7% reduction in energy relative to the baseline design. To our knowledge, this is the first detailed design of a high-performance asynchronous double-precision floating-point adder.
A mixed-precision fused multiply and add
2011 Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), 2011
The standard floating-point fused multiply and add (FMA) computes R=AB+C with a single rounding. This article investigates a variant of this operator where the addend C and the result R are of a larger format, for instance binary64 (double precision), while the multiplier inputs A and B are of a smaller format, for instance binary32 (single precision). With minor modifications, this operator is also able to perform the standard FMA in the smaller format, and the standard addition in the larger format. For sum-of-product applications, the proposed mixed-precision FMA provides the accumulation accuracy of the larger format, at a cost that is close to that of a classical FMA in the smaller format. Besides, it is fully compatible with existing arithmetic and language standards. The architectural cost of this operator is analysed in detail. An implementation of a mixed binary32/binary64 operator fully supporting subnormal numbers, binary64 addition and binary32 FMA is demonstrated and evaluated: its area overhead is one third over the classical binary32 FMA. Similarly, in high-end processors, a mixed binary64/binary128 FMA could provide an adequate solution to the binary128 requirements of very large scale computing applications.
Multi-functional floating-point MAF designs with dot product support
Microelectronics Journal, 2008
This paper presents multi-functional double-precision and quadruple-precision floating-point multiply-add fused (FPMAF) designs. The double-precision FPMAF design can execute adouble-precision floating-point multiply-add, or two single-precision floating-point multiplications, or a single-precision floating-point dot product. The quadruple-precision FPMAF can perform similar operations with quadruple, double and single precision operands. These architectures can perform a dot-product operation two times or more faster than a basic FPMAF design. The presented multi-functional designs are compared with basic double-precision and quadruple-precision FPMAF designs by ASIC syntheses. The syntheses results show that the proposed double-precision implementation has 8%more area than a standard double-precision FPMAF implementation, and the proposed quadruple-precision design has 12.5% more area than a standard quadruple-precision FPMAF. Both of the proposed designs have one more pipeline stage compared to the basic designs.