On the Impact of Area I/O on Partitioning: A New Perspective

The use of area I/O, or a look at future architectures

1999

Today, designers seek ever smaller systems with ever more functionality, but increasingly they find interconnection technology to be a showstopper. To overcome this bottleneck we propose a chip-package codesign approach: a close cooperation between chip and package designers that exploits the synergy between the two domains. Our approach distributes the on-chip pads over the entire IC area, placing each pad near its associated core area. This technique results in smaller ICs with more and faster I/Os that are much easier to package. In this paper, a case study of a Pentium-class system shows why other approaches such as wire bonding, re-routing, and chip size packages (CSP) have shortcomings. Finally, we present an outlook on new system architectures enabled by area I/O: a processor system with its first-level cache on separate ICs instead of integrated on the CPU itself.
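
To make the scaling argument concrete, here is a back-of-the-envelope sketch (our own illustration, not taken from the paper): peripheral wire-bond pads scale with the die perimeter, while area-I/O (flip-chip) bumps scale with the die area, which is why area I/O yields more I/Os on a smaller die. The pitch and die-size values are illustrative assumptions.

```python
# Illustrative comparison of peripheral vs. area I/O pad capacity.
# Pitch values are assumed, not the paper's numbers.

def peripheral_pad_count(die_side_mm: float, pad_pitch_mm: float) -> int:
    """Pads along the four edges of a square die (single row of bond pads)."""
    return 4 * int(die_side_mm / pad_pitch_mm)

def area_pad_count(die_side_mm: float, bump_pitch_mm: float) -> int:
    """Solder bumps in a full-area grid over the die face."""
    per_side = int(die_side_mm / bump_pitch_mm)
    return per_side * per_side

if __name__ == "__main__":
    side = 12.0                              # assumed Pentium-class die edge, mm
    print(peripheral_pad_count(side, 0.10))  # ~480 pads at 100 um pitch
    print(area_pad_count(side, 0.25))        # ~2304 bumps at a coarser 250 um pitch
```

Even with a much coarser bump pitch, the full-area grid supports several times more connections than the perimeter row, and the shorter bump paths are also faster.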

A New Architectural Framework for the Speed Gap between Processor and Main Memory

The Interactions between CPU and Main Memory Architecture, 2020

ABSTRACT: In modern general-purpose computer architectures, the rate of improvement in microprocessor speed exceeds the rate of improvement in main memory speed. The speed gap between processor and main memory has long been an issue, and it is now the primary challenge in improving overall computer system performance. In such architectures, the bandwidth of the bus interface between processor and main memory has become a major concern because of its limited data transfer capacity and its high access time. Although computer architects have devoted great effort to dramatically increasing processor speed, overall system performance has not improved at the same rate. This paper focuses on incorporating new components that combine several functionalities, such as accepting more than one request in parallel through additional paths in a collision-free manner. A new memory management technique for improving memory speed is then examined. Finally, a new design solution is presented in which the data transfer rate between processor and main memory is increased by integrating separate buses.

KEYWORDS: Architecture, processor, main memory, speed gap, parallel, collision, memory management technique
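
As a rough illustration of the separate-bus idea, the toy model below (our own sketch, with assumed timing parameters, not the paper's design) shows how distributing requests over independent buses shortens the time to serve a batch of memory requests.

```python
# Toy model: effective service time for N memory requests when they can be
# spread over k independent buses instead of serialized on one shared bus.
# `access_ns` and `transfer_ns` are assumed parameters for illustration.
import math

def total_time_ns(n_requests: int, buses: int,
                  access_ns: float = 50.0, transfer_ns: float = 10.0) -> float:
    """Requests are distributed round-robin; buses work in parallel and
    collision-free, so the makespan is set by the most loaded bus."""
    per_request = access_ns + transfer_ns
    worst_bus_load = math.ceil(n_requests / buses)
    return worst_bus_load * per_request

if __name__ == "__main__":
    print(total_time_ns(64, buses=1))  # 3840.0 ns on a single shared bus
    print(total_time_ns(64, buses=4))  # 960.0 ns with four parallel buses
```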

Efficient Cache Partitioning Technique for Chip Multiprocessors

Chip multiprocessors (CMPs) are widely adopted and commercially available as the building blocks for future computer systems. A CMP contains multiple cores, enabling multiple applications (or threads) to execute concurrently on a single chip. As the number of cores on a chip increases, so does the pressure on the memory system to sustain the memory requirements of all the concurrently executing applications (or threads). An important question in CMP design is how to use the limited on-chip L2 cache to achieve the best possible system throughput for a wide range of applications. The keys to obtaining high performance from multicore architectures are to provide fast data accesses (reduced latency) for on-chip computation resources and to manage the L2 cache, the largest on-chip cache level, efficiently so that off-chip accesses are reduced. We propose an efficient cache partitioning (ECP) technique in which the amount of L2 cache space that can be shared among the cores is controlled dynamically. The ECP technique continuously estimates the effect of increasing or decreasing the shared partition size on overall performance. We show that our partitioning technique performs better than traditional techniques such as LRU partitioning and half-and-half partitioning under an efficient replacement policy.
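
A minimal sketch of the dynamic sizing idea follows; the hill-climbing policy and the function names are our assumptions for illustration, and the paper's actual ECP estimator may differ. Each epoch, the controller evaluates misses at the current shared size and at neighboring sizes, then moves the boundary toward fewer total misses.

```python
# Toy dynamic shared-partition sizing for an L2 cache split into "ways".
# The policy and names below are assumptions, not the paper's mechanism.

def choose_shared_ways(measure_misses, current_ways: int,
                       min_ways: int = 1, max_ways: int = 15) -> int:
    """measure_misses(ways) -> total misses observed with `ways` ways shared.
    Returns the next shared-partition size (in cache ways)."""
    candidates = [w for w in (current_ways - 1, current_ways, current_ways + 1)
                  if min_ways <= w <= max_ways]
    return min(candidates, key=measure_misses)

if __name__ == "__main__":
    # Toy miss curve: misses fall until 10 shared ways, then rise again.
    toy_misses = lambda w: (w - 10) ** 2 + 100
    ways = 4
    for _ in range(8):
        ways = choose_shared_ways(toy_misses, ways)
    print(ways)  # converges to 10, the miss-minimizing shared size
```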

A Quantitative Prediction Model for Hardware/Software Partitioning

2007

Heterogeneous system development requires hardware/software partitioning to be performed early in the development process, which in turn requires early predictions of hardware resource usage and delay. In this thesis, a quantitative model is presented that can make such early predictions to support the partitioning process. The model is based on software complexity metrics, which capture important aspects of functions such as control intensity, data intensity, and code size. To remedy the interdependence of the software metrics, a principal component analysis was performed. The hardware characteristics were determined by automatically generating VHDL from C using the DWARV C-to-VHDL compiler. Using the results of the principal component analysis, the quantitative model was fitted with linear regression. The error of the model differs per hardware characteristic; for flip-flops, the mean prediction error is 69%. In conclusion, our quantitative model can make fast and sufficiently accurate area predictions to support hardware/software partitioning. In the future, the model can be extended by introducing additional software metrics, using more advanced modeling techniques, and using a larger collection of functions and algorithms.
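
The modeling pipeline described above can be sketched as follows, assuming scikit-learn-style tooling; the metric names and the synthetic data are illustrative and are not the thesis's DWARV-derived dataset.

```python
# Sketch: PCA over interdependent software metrics, then linear regression
# to predict a hardware characteristic (e.g., flip-flop count).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Rows: C functions. Columns: software complexity metrics
# (e.g., cyclomatic complexity, operand count, code size, ...). Synthetic here.
X = rng.normal(size=(40, 6))
# Target: a hardware characteristic such as flip-flop count,
# synthesized with noise purely for demonstration.
y = X @ np.array([120.0, 80.0, 15.0, 5.0, 0.0, 0.0]) + rng.normal(0, 30, 40)

# Decorrelate the interdependent metrics with PCA, then fit a linear model
# on the principal components, mirroring the approach in the abstract.
model = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
model.fit(X, y)

pred = model.predict(X)
print("mean absolute error:", round(float(np.mean(np.abs(pred - y))), 1))
```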

Microarchitectural Wire Management for Performance and Power in Partitioned Architectures

11th International Symposium on High-Performance Computer Architecture, 2005

Future high-performance billion-transistor processors are likely to employ partitioned architectures to achieve high clock speeds, high parallelism, low design complexity, and low power. In such architectures, inter-partition communication over global wires has a significant impact on overall processor performance and power consumption. VLSI techniques allow a variety of wire implementations, but these wire properties have never before been exposed to the microarchitecture. This paper advocates global wire management at the microarchitecture level and proposes a heterogeneous interconnect comprising wires with varying latency, bandwidth, and energy characteristics. We propose and evaluate microarchitectural techniques that can exploit such a heterogeneous interconnect to improve performance and reduce energy consumption. These techniques include a novel cache pipeline design, the identification of narrow bit-width operands, the classification of non-critical data, and the detection of interconnect load imbalance. For a dynamically scheduled partitioned architecture, our results demonstrate that the proposed innovations result in up to 11% reductions in overall processor ED² (energy-delay-squared product), compared to a baseline processor that employs a homogeneous interconnect.
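
A toy wire-selection policy in the spirit of the techniques above is sketched below; the classification rules and wire parameters are our own illustrative assumptions, not the paper's exact microarchitecture.

```python
# Selecting a wire class per transfer on a heterogeneous interconnect.
# All numbers below are assumed for illustration.
from dataclasses import dataclass

@dataclass
class WireClass:
    name: str
    latency_cycles: int
    width_bits: int
    energy_pj_per_bit: float

# Fast/narrow wires trade bandwidth for latency; slow/wide wires trade
# latency for bandwidth and energy efficiency.
L_WIRES = WireClass("low-latency", latency_cycles=2, width_bits=16, energy_pj_per_bit=1.8)
B_WIRES = WireClass("high-bandwidth", latency_cycles=6, width_bits=64, energy_pj_per_bit=1.0)
P_WIRES = WireClass("low-power", latency_cycles=10, width_bits=64, energy_pj_per_bit=0.4)

def pick_wires(operand_bits: int, critical: bool) -> WireClass:
    """Send critical narrow operands on fast wires, non-critical data on
    energy-efficient wires, and everything else on bandwidth wires."""
    if critical and operand_bits <= L_WIRES.width_bits:
        return L_WIRES
    if not critical:
        return P_WIRES
    return B_WIRES

if __name__ == "__main__":
    print(pick_wires(16, critical=True).name)   # low-latency
    print(pick_wires(64, critical=False).name)  # low-power
    print(pick_wires(64, critical=True).name)   # high-bandwidth
```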

An area model for on-chip memories and its application

IEEE Journal of Solid-State Circuits, 1991

Utility can be defined as quality per unit of cost. The utility of a particular function in a microprocessor can be defined as its contribution to the overall processor performance per unit of implementation cost. In the case of on-chip data memory (e.g., registers, caches) the performance contribution can be reduced to its effectiveness in reducing memory traffic or in reducing the average time to fetch operands. An important cost measure for on-chip memory is occupied area. On-chip memory performance, however, is expressed much more easily as a function of size (the storage capacity) than as a function of area.
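
The utility notion can be made concrete with a small sketch. The miss-rate curve and area model below are illustrative placeholders of our own, not the paper's calibrated area model; the point is that performance is naturally a function of size while cost is a function of area.

```python
# Toy utility calculation for on-chip memory: performance contribution
# per unit of occupied area. All models below are assumed for illustration.

def miss_rate(size_kb: float) -> float:
    """Toy power-law miss curve: bigger caches miss less,
    with diminishing returns."""
    return 0.10 * (size_kb / 4.0) ** -0.5

def area_mm2(size_kb: float) -> float:
    """Toy area model: fixed overhead plus cost per kilobyte."""
    return 0.5 + 0.3 * size_kb

def utility(size_kb: float) -> float:
    """Performance contribution per unit area: here, the fraction of
    references satisfied on-chip, divided by occupied area."""
    return (1.0 - miss_rate(size_kb)) / area_mm2(size_kb)

if __name__ == "__main__":
    for kb in (4, 8, 16, 32, 64):
        print(kb, "KB ->", round(utility(kb), 3))  # utility falls as area grows
```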

Optimizing the Block I/O Subsystem for Fast Storage Devices

ACM Transactions on Computer Systems, 2014

Fast storage devices are an emerging solution to satisfy data-intensive applications. They provide high transaction rates for DBMS, low response times for Web servers, instant on-demand paging for applications with large memory footprints, and many similar advantages for performance-hungry applications. In spite of the benefits promised by fast hardware, modern operating systems are not yet structured to take advantage of the hardware’s full potential. The software overhead caused by an OS, negligible in the past, adversely impacts application performance, lessening the advantage of using such hardware. Our analysis demonstrates that the overheads from the traditional storage-stack design are significant and cannot easily be overcome without modifying the hardware interface and adding new capabilities to the operating system. In this article, we propose six optimizations that enable an OS to fully exploit the performance characteristics of fast storage devices. With the support of n...
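
The software-overhead argument can be illustrated with a quick, hedged measurement sketch (our own probe, not the article's instrumentation; the file path and sizes are assumptions). Because the file is written immediately before reading, the reads are likely served from the page cache, so the measured latency approximates the cost of the software stack alone.

```python
# Measure the per-call software cost of small reads through the OS stack.
import os, random, time

PATH = "/tmp/io_probe.bin"   # hypothetical test file
SIZE = 64 * 1024 * 1024
BLOCK = 4096

with open(PATH, "wb") as f:
    f.write(os.urandom(SIZE))

fd = os.open(PATH, os.O_RDONLY)
offsets = [random.randrange(0, SIZE // BLOCK) * BLOCK for _ in range(10_000)]

start = time.perf_counter()
for off in offsets:
    os.pread(fd, BLOCK, off)   # full syscall path, data likely page-cached
elapsed = time.perf_counter() - start
os.close(fd)

print("avg latency per 4 KiB read: %.1f us" % (elapsed / len(offsets) * 1e6))
```

On a slow disk this fixed per-call cost is noise; on a fast device it becomes a visible fraction of total latency, which is the motivation the abstract describes.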