Robert Montoye - Academia.edu

Papers by Robert Montoye

Design of the IBM RISC System/6000 floating-point execution unit

IEEE Computer Society Press eBooks, Mar 1, 1995

Practical Strategies for Power-Efficient Computing Technologies

Proceedings of the IEEE, Feb 1, 2010

Wide limited switch dynamic logic circuit implementations

Design-performance trade-offs in CMOS-domino logic

IEEE Journal of Solid-state Circuits, Apr 1, 1986

The authors present a study of the charge-sharing problem and its effect on the performance of CMOS-domino logic. Several solutions to the charge-sharing problem are examined, and the results are verified by simulation. Thus, the charge-sharing problem in CMOS-domino logic was identified and alternate approaches were evaluated.

A fully-integrated switched-capacitor 2:1 voltage converter with regulation capability and 90% efficiency at 2.3 A/mm²

1 Mb 0.41 µm² 2T-2R Cell Nonvolatile TCAM With Two-Bit Encoding and Clocked Self-Referenced Sensing

IEEE Journal of Solid-state Circuits, Apr 1, 2014

Custom is from Venus and synthesis from Mars

... William H Joyner (Chair) Semiconductor Research Corporation Research Triangle Park, NC, USA Shekhar Borkar Intel Corporation Hillsboro, OR, USA Ty Garibay Texas Instruments Inc. Dallas, TX, USA Jonathan Lotz Advanced Micro Devices Inc. Fort Collins, CO, USA ...

Design of the IBM RISC System/6000 floating-point execution unit

IBM journal of research and development, 1990

Victoria

There is increasing interest in the use of accelerators in computer systems. Accelerators are processor-attached hardware units that can perform certain functions faster than the conventional general-purpose processor. In this paper, we describe the VICTORIA PowerPC architecture, which is based on the iVMX accelerator technology. The iVMX accelerator extends the existing VMX architecture with indirect register addressing. That approach greatly extends the architected space of registers and opens the door for highly optimized vector algorithms that can sustain very high processing rates. The large space of registers is directly controlled by the executing code and offers sufficiently large storage to hold sizeable intermediate results. This helps reduce the negative effects of limited memory bandwidth and high memory latency. The iVMX accelerator is an example of an in-line accelerator; that is, the instructions that drive the accelerator are part of the same stream that drives the main processor. Compared to off-line accelerators, which execute their own instruction stream, in-line accelerators present a much more convenient programming model.

A Duty-Cycle Correction Circuit for High-Frequency Clocks

We present a circuit to control the duty cycle of high-frequency clocks with very fine resolution. The proposed duty-cycle detection and correction circuits are digital and do not require external references or matching devices. The circuits are designed to compensate for duty-cycle uncertainties in a floating point unit implemented using limited switch dynamic logic (LSDL) (Belloumini, 2005). The results show that the circuit can correct the duty cycle of an 8-GHz clock with ±0.8% accuracy for an input range of 25% to 75%.

Leading-zero anticipator (LZA) in the IBM RISC System/6000 floating-point execution unit

IBM journal of research and development, 1990

The four degrees of 3D

Reversing early limitations on Moore's Law, interconnectors have replaced transistors as the main determinants of chip performance. This "tyranny of interconnectors" will only escalate in the future, and thus the nanoelectronics that follow silicon must be interconnect-centric. This new technology will likely use "transistors" that approach, if not surpass, the 0.1 ps latency of 10 nm generation silicon transistors. Consequently, if we optimistically assume that the interconnects of this post-Moore's Law nanotechnology will be superconductive, their latency will exceed that of the transistors for interconnect lengths greater than 30 μm, while long, on-chip interconnect lengths will be 1,000 times greater at 30 mm. Consequently, mainstream electronics will have an interconnect era beyond Moore's Law.
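
The 30 μm and 30 mm figures in this abstract are consistent with a simple time-of-flight estimate. As a rough reading only, assuming that a signal on an ideal superconductive interconnect travels at about the vacuum speed of light $c \approx 3\times10^{8}$ m/s (an assumption made here for illustration, not stated in the abstract; real lines are slower), the wire length whose delay equals the 0.1 ps transistor latency, and the delay of a 30 mm wire, work out as:

```latex
\begin{align*}
  L_{\text{crit}} &= c \, t_{\text{transistor}}
                   = (3\times10^{8}\ \mathrm{m/s})\,(0.1\times10^{-12}\ \mathrm{s})
                   = 3\times10^{-5}\ \mathrm{m} = 30\ \mu\mathrm{m}, \\
  t_{30\,\mathrm{mm}} &= \frac{30\times10^{-3}\ \mathrm{m}}{3\times10^{8}\ \mathrm{m/s}}
                   = 1\times10^{-10}\ \mathrm{s} = 100\ \mathrm{ps}
                   = 1000 \times t_{\text{transistor}}.
\end{align*}
```

Under that assumption, a long on-chip wire costs about a thousand transistor delays even in the best case, which is the sense in which the wire rather than the device sets the latency.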

Processor architecture for software implementation of multi-sector G-RAKE receivers for HSUPA wireless infrastructure

The high speed uplink packet access (HSUPA) wireless standard requires extremely high-performance signal processing in the baseband receiver, the most challenging part being the chip-rate rake receiver. In this paper we describe architectural enhancements to IBM's PowerEN processor that enable it to support the computational requirements of the rake receiver in a fully programmable and scalable fashion. A key feature of these enhancements is a bank-based, very large register file with embedded single instruction multiple data (SIMD) support. This processor-in-regfile (PIR) strategy is implemented as local computation elements (LCEs) attached to each bank. This overcomes the limitation on the number of register file ports and at the same time enables a high degree of parallelism. We show that these enhancements enable the integration of multi-sector HSUPA G-RAKE receivers on a single processor.

Ratioed CMOS: a low power high speed design choice in SOI technologies

Ratioed CMOS gates implemented in a partially-depleted (PD) SOI CMOS technology are usually considered to be high power but end up being both faster and lower power than other circuit implementations, mainly due to the reduced junction capacitance in SOI devices as well as floating-body effects. As an example, a high performance multiplier shifter is 3 to 4 times faster

Demonstration of CAM and TCAM Using Phase Change Devices

We demonstrate novel designs for Content Addressable Memory (CAM) and Ternary CAM (TCAM) using Phase Change Memory (PCM) technology, which can potentially improve density and power consumption by >5X as compared with conventional SRAM based implementations. Using Monte-Carlo simulations, we also predict the desired characteristics of PCM devices for realizing large, high performance CAM/TCAM chips.

Practical Strategies for Power-Efficient Computing

Automatically generated area, power and delay optimized ALUs

This paper will describe a CAD program which automatically produces an optimized ALU from a family of carry-look-ahead ALU designs, and produces the mask data from a layout rule independent description. A 34b ALU has been automatically synthesized using the program and simulated in 1.4μm NMOS with a limiting delay in nominal technology of about 16ns.

A 270ps 20mW 108-bit End-around Carry Adder for Multiply-Add Fused Floating Point Unit

Journal of Signal Processing Systems, Jan 10, 2009

Area-Time Efficient Addition in Charge Based Technology

Design Automation Conference, Jun 29, 1981

Using the model developed by Mead and Conway for charge based technology, a methodology has been developed for the production of area-time efficient adders which embeds the buffering required to drive the large loads caused by the carry-lookahead tree. This methodology can be used to produce an O(log N) time and O(N log N) area layout. Additionally, an algorithm was written to produce minimal silicon area layouts for a given time bound. This algorithm involves optimization at both the cellular level and the layout level in an iterative fashion to allow the relevant technological parameters to play a role in the cellular design phase. Results of the algorithm, including examples and an area-time curve for a 48 bit adder using typical 5 micron NMOS [MeCo80], are displayed.
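
The logarithmic time bound of a carry-lookahead tree can be illustrated in software. The sketch below is not the paper's layout methodology: it is a generic Kogge-Stone style parallel-prefix adder, chosen here only as a familiar example of a tree that resolves all carries in about log2(N) combining levels. The function name `prefix_add` and its parameters are hypothetical, and buffering, area, and technology parameters are not modeled.

```python
from typing import Tuple


def prefix_add(a: int, b: int, width: int = 48) -> Tuple[int, int]:
    """Add two width-bit integers using a parallel-prefix carry tree.

    Returns (sum modulo 2**width, number of prefix levels used).
    Illustrative sketch only; names and structure are assumptions.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    g = a & b        # generate: this bit position creates a carry by itself
    p = a ^ b        # propagate: this bit position passes an incoming carry on
    half_sum = p     # bitwise sum before any carries are applied
    levels = 0
    shift = 1
    while shift < width:
        # Merge each group with the group ending `shift` bits below it,
        # doubling the span covered by every (g, p) pair at each level.
        g = (g | (p & (g << shift))) & mask
        p = (p & (p << shift)) & mask
        shift <<= 1
        levels += 1
    # After the loop, bit i of g is the carry out of bits 0..i,
    # so the carry into bit i is bit i-1 of g.
    carries = (g << 1) & mask
    return half_sum ^ carries, levels
```

For the 48-bit width mentioned in the abstract, the loop runs for shifts 1, 2, 4, 8, 16 and 32, so `prefix_add(3, 5, 48)` returns the sum 8 after 6 levels, compared with up to 48 sequential steps for a ripple-carry chain.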

Optimization and Testing of NMOS Arithmetic Structures (VLSI, MOS)
