yong hei - Academia.edu
Papers by yong hei
2015 IEEE 11th International Conference on ASIC (ASICON), 2015
With continued CMOS technology scaling, transistors exhibit higher degrees of variation and mismatch, resulting in a larger offset voltage. A large offset voltage enlarges the required bitline swing, increasing dynamic power consumption during a read operation and degrading both the sensing accuracy and the operation speed. Thus, the offset voltage, which arises mainly from transistor threshold voltage mismatch, is the most critical metric for static random access memory sense amplifiers (SAs). Here we propose an offset-cancelling technique with digitized multiple body biasing. In this scheme, SA transistor threshold voltage mismatch is compensated by adjusting the body bias voltage digitally and repeatedly. Simulation results in 130-nm CMOS technology show that the proposed calibration technique can reduce the standard deviation of the offset voltage by a factor of more than four compared with a conventional SA, with about 6.5% area and 1.6% power overhead on a 6-kbit prototype chip.
IEICE Electronics Express, 2017
In this brief, we propose a novel method that realizes a conflict-free strategy in memory-based FFT. The hardware complexity is reduced, since only a few extra registers are needed and the control logic is identical in all stages. In addition, we present a modified signal flow graph to fit the proposed conflict-free strategy. The modified signal flow graph derives from the mixed-radix signal flow graph and has the constant-geometry property. Furthermore, continuous flow is adopted to increase the throughput. Thus, the proposed FFT processor has better performance than previous memory-based FFT processors. Simulation results show that, for the proposed 8- to 2048-point FFT processor, the maximum frequency is 400 MHz in a 65-nm CMOS technology, and the area is 0.45 mm² under the same conditions.
Journal of Electronics & Information Technology, 2010
Proceedings of the 47th International Conference on Parallel Processing, 2018
We have implemented an asynchronous mesh network. This paper describes our innovative design using a Click controller. Compared to designs that use other asynchronous circuit families with C-elements and four-phase bundled data, our two-phase Click-based Bounded Bundled Data design is faster, but introduces phase skews when handling concurrent traffic at a single node. Instead of eliminating the phase skews, we use them as computation slots. Our network uses a novel asynchronous arbiter with a queue that can accept data from the four cardinal directions as well as from a local source, five directions in all. We have implemented our network design in 1 × 1, 2 × 2 and 4 × 4 sizes; larger networks could be implemented easily because of the isomorphism and modularity of the routing nodes. Our experiments show that an initial data item passes through a node in 157 ns and 81 ns for the non-delay-branch and delay-branch designs, respectively. Following items take about 65% as long. But for a networ...
IEICE Electronics Express
Self-timed systems divide nicely into two kinds of components: communication links that transport and store data, and computation joints that apply logic to data. We treat these two types of self-timed components as equally important. Putting communication on a par with computation acknowledges the increasing cost of data transport and storage in terms of energy, time, and area. Our clean separation of data transport and storage from logic simplifies the design and test of self-timed systems. The separation also helps one to grasp how self-timed systems work. We offer this paper in the hope that better understanding of self-timed systems will engage the minds of compiler, formal verification, and test experts.
This paper presents a robust 12T subthreshold SRAM cell fabricated in 55-nm CMOS technology. The back-to-back latch structure of the proposed cell is built from two half-Schmitt-based inverters, so the hold noise margin is significantly enhanced compared with the conventional 6T cell. The proposed cell also exhibits improved read noise margin and read speed compared with the previous Schmitt-trigger-based SRAM cell, owing to the use of pseudo read nodes. In addition, multiple-threshold CMOS transistors are used to improve write ability. Furthermore, the proposed cell eliminates the half-select problem in the write operation, enabling stable operation in the subthreshold region.
IEICE Electronics Express
IEICE Electronics Express, 2017
IEEE Transactions on Circuits and Systems II: Express Briefs
This brief presents a fast and energy-efficient level shifter with a wide conversion range. To achieve both energy-efficient and high-speed voltage level conversion, the proposed circuit employs a novel architecture combined with the multi-threshold CMOS technique. A mixed-threshold current mirror circuit is proposed to solve the reduced-swing issue in the prior art. Moreover, auxiliary bias circuits are inserted to guarantee that the low-threshold pull-down networks are strongly cut off while in the leaking state. As a result, the power consumption is greatly reduced. Measurement results based on the SMIC 55-nm MTCMOS process demonstrate that the proposed level shifter provides robust voltage conversion from 0.12 V to 1.2 V. At the target voltage of 0.3 V, the proposed level shifter shows a propagation delay of 17.86 ns, a static power of 73.95 pW, and an energy per transition of 26.59 fJ for an input frequency of 1 MHz.
IEICE Electronics Express
This paper proposes an energy-efficient and glitch-free digital phase modulator for outphasing transmitters. The proposed modulator uses a path-shared tapped delay line (TDL) and a dynamic pseudo clock-gating control technique. These approaches lead to 64% lower power consumption compared with conventional digitally controlled delay lines (DCDLs). Moreover, the proposed modulator achieves circular rotational phase modulation, resulting in a system EVM of −36.93 dB and an ACLR of −50.96 dBc without extra shaping circuits or analog filters. The prototype modulator was fabricated in a 130-nm CMOS process with an active area of 0.134 mm². Operating at 40 MHz with a 1.2 V power supply, the proposed modulator consumes a total power of 450 µW. In addition, the chip achieves an 80 ps coarse resolution with 4.7 ps RMS error and a minimum phase resolution of 0.96 ps.
IEICE Electronics Express
An improved phase digitization mechanism is designed to overcome the limited lock-in range of a low-power all-digital phase-locked loop (ADPLL) with phase prediction and an edge snapshot circuit. The proposed mechanism, which includes a dual-mode multiplexer-based time-to-digital converter (TDC) and an accompanying algorithm, is verified in a modelled and simulated ADPLL. Results show that the ADPLL is able to lock in 7.8 µs, i.e., 187 cycles of a 24 MHz reference clock. The ADPLL also recovers well from sudden disturbances; for instance, it recovers in 8 µs from a 0.38% disturbance.
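The cycle count quoted in this abstract follows directly from the lock time and the reference frequency. A minimal sketch of that arithmetic is given below; the function name `lock_cycles` is a hypothetical helper introduced here for illustration and does not come from the paper.

```python
def lock_cycles(lock_time_s: float, f_ref_hz: float) -> float:
    """Number of reference-clock cycles that elapse during the lock time."""
    return lock_time_s * f_ref_hz


# Figures taken from the abstract: 7.8 us lock time, 24 MHz reference clock.
cycles = lock_cycles(7.8e-6, 24e6)
print(int(cycles))  # whole reference cycles elapsed before lock
```

The product is about 187.2, which matches the 187 cycles stated in the abstract once truncated to whole cycles.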
IEICE Electronics Express
This paper presents an ultra-low-leakage, energy-efficient level shifter that can convert an extremely low input voltage to the supply voltage level. To reduce leakage power dissipation, the proposed structure uses the super-cutoff mechanism and the MTCMOS technique. At the same time, a positive feedback circuit is inserted to avoid loss of performance. Post-layout simulation results in a 55-nm MTCMOS process demonstrate that, for voltage level conversion from 0.3 V to 1.2 V, the proposed level shifter exhibits a propagation delay of 70.77 ns and an energy per transition of 89.55 fJ for an input frequency of 1 MHz. Meanwhile, the static power of the proposed level shifter is as low as 27.82 pW. The proposed level shifter occupies only 7.79 µm², which demonstrates prominent area efficiency.
IEICE Electronics Express
In this paper, we propose an improved design method for the critical path replica (CPR) in wide-voltage designs. The timing accuracy of the CPR over a wide operating voltage range is improved by applying load matching and transistor-level static timing analysis (TSTA). We applied the proposed method to 100 critical paths of the ISCAS'95 benchmark circuits. Simulation experiments in SMIC 55 nm show that a CPR designed by the proposed method can operate between 0.3 V and 1.2 V with only 0.25% delay error (DE).
IEICE Electronics Express
Variation imposes a guard-band requirement on integrated circuit designs, which degrades performance and energy efficiency. As the voltage scales down, circuits become more sensitive to variation and the guard-band margin becomes unacceptable. In this paper, we propose a novel error detection and correction technique to eliminate the margin for variation. The error detection latch adds an overhead of only 6 transistors compared with a conventional latch. The detection and correction scheme relaxes the timing constraint on the error signal by one clock cycle by extending another latch stage next to the critical stage. The proposed technique reduces the energy per cycle by 51% with 7.6% area overhead compared with the conventional margin technique. A comparison with other state-of-the-art works shows that the proposed technique is quite competitive.
IEICE Electronics Express
This paper presents a practical, low-overhead, one-cycle-correction, better-than-worst-case design method for ultra-low-voltage digital circuits. It eliminates the excessive design margin for PVT variation imposed by the traditional worst-case design method. The proposed method is completely compatible with EDA tools, and it requires considerably less design effort than other variation-tolerant techniques. We have implemented the proposed technique on a 16-bit × 16-bit pipelined multiplier in the SMIC 55-nm CMOS process. The experimental results show that the proposed technique achieves about 59% better energy efficiency than operating with the worst-case timing margin.
IEICE Electronics Express
Voltage scaling is an effective technique for ultra-low-power applications. However, PVT variation severely degrades the robustness of traditional synchronous pipelines when the voltage scales into the subthreshold region. In this paper, we propose a register-based bundled-data asynchronous pipeline, called Snake, that can operate robustly in the subthreshold region. By looping the matched delay line, Snake halves the design overhead compared with other asynchronous pipelines. We also propose a practical asynchronous design methodology that is compatible with commercial EDA tools and needs only a few modifications to the synchronous design flow. Monte Carlo SPICE simulation shows that a pipelined multiplier applying the proposed techniques operates stably at 0.2 V, achieving a minimum power of 1.3 nW at 0.2 V and a minimum energy of 1.07 pJ per cycle at 0.3 V. It provides a 6.7-fold advantage over the synchronous baseline design with 22% area overhead. A comparison with other state-of-the-art works shows that the proposed techniques are quite competitive.
IEICE Electronics Express
In this paper, a 16 × 16 low-power, low-area asynchronous iterative multiplier is proposed. The multiplier consumes 2 bits at a time with an iterative structure. To filter out useless switching activity, we employ a finishing detector that dynamically detects the end of the computation and stops the iteration ahead of schedule. With the finishing detector, the proposed multiplier also provides a much faster average speed than the synchronous approach. Post-layout simulation results show that the asynchronous multiplier offers up to 74% power reduction compared with the synchronous design. Thanks to its iterative architecture, the proposed design also exhibits a prominent area reduction compared with other, non-iterative multipliers.
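The idea of consuming 2 multiplier bits per iteration and stopping early can be sketched behaviourally as follows. This is only an illustration under stated assumptions: operands are treated as unsigned, the "finishing detector" is modelled as a check that the remaining multiplier bits are all zero, and the name `iterative_multiply` is hypothetical. The abstract does not describe the actual detector circuit or the number encoding used.

```python
def iterative_multiply(a: int, b: int, width: int = 16) -> tuple[int, int]:
    """Unsigned shift-and-add multiply that consumes 2 bits of b per iteration.

    Returns (product, iterations). The loop exits as soon as no multiplier
    bits remain, which stands in for the finishing detector (an assumption).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    shift = 0
    iterations = 0
    while b != 0:                      # assumed finishing condition
        digit = b & 0b11               # next 2-bit group of the multiplier
        product += (a * digit) << shift
        b >>= 2
        shift += 2
        iterations += 1
    return product, iterations


print(iterative_multiply(1234, 5))        # small multiplier: finishes early
print(iterative_multiply(65535, 65535))   # worst case: width / 2 iterations
```

In this model a small multiplier such as 5 finishes in 2 iterations rather than the worst-case 8, which is the kind of average-case saving the abstract attributes to early termination.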
IEEE Transactions on Circuits and Systems II: Express Briefs
This brief presents a modified radix-4 fast Fourier transform (FFT) signal flow graph whose input and output are both in natural order. Unlike the conventional radix-4 signal flow graph, it neither buffers the result of the last stage nor executes a bit-reverse operation to generate the result, but generates the result directly in the last stage. Thus, the number of iterations is reduced by one. To realize a memory-based FFT processor using the modified radix-4 FFT signal flow graph, a conflict-free strategy and a corresponding memory-addressing scheme are proposed. Finally, the hardware implementation of the proposed FFT processor is presented. With this method, an FFT processor of any size that conforms to the radix-4 algorithm can be implemented. Compared with previous memory-based FFT processors, the proposed FFT processor has a shorter processing time with similar or lower resource consumption.
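For context, the reordering step that a conventional in-place radix-4 FFT needs, and that the abstract says the modified flow graph avoids, is a base-4 digit reversal of the output indices. The sketch below shows only that standard permutation, not the paper's modified signal flow graph; the function name `digit_reverse_base4` is a hypothetical helper.

```python
def digit_reverse_base4(index: int, num_digits: int) -> int:
    """Reverse the order of the base-4 digits of index (num_digits digits)."""
    reversed_index = 0
    for _ in range(num_digits):
        reversed_index = (reversed_index << 2) | (index & 0b11)
        index >>= 2
    return reversed_index


# A 16-point transform has two base-4 digits per index.
order = [digit_reverse_base4(i, 2) for i in range(16)]
print(order)
```

For 16 points the permutation is `[0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]`. Applying it twice returns the natural order, which is why a conventional design needs one extra pass or buffer to undo it.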
IEICE Electronics Express
This paper presents a novel PMOS read-port 8T SRAM cell in which the read circuit is built from two cascaded PMOS transistors, so the leakage power is significantly reduced compared with the conventional 8T cell. It also exhibits high area efficiency because each cell has an equal number of NMOS and PMOS transistors. Furthermore, the performance of the proposed cell can be enhanced by employing a half-Schmitt inverter. Measurements indicate that the proposed cell outperforms the conventional 8T cell in terms of leakage suppression and area saving, making it a superior choice for ultra-low-power applications.