Network on chip Research Papers

Building large computing systems requires modelling them first. Modern hardware systems are so complex that software models at the desired level of detail may be too slow. Thus, abstract hardware modelling can be appropriate. This paper presents an example software/hardware model built using the Bluespec SystemVerilog (BSV) design flow to give rapid simulation of a hardware system. The chosen example is a hardware model of the on-chip router and the on-chip and off-chip networks of SpiNNaker, used to understand the behaviour of traffic in the system. A model of a 5×5 SpiNNaker topology has been designed for a Virtex-5 FPGA using BSV, and a Graphical User Interface (GUI) was developed in LabVIEW for graphical representation of the results.
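To make the 5×5 topology concrete, the sketch below lists the neighbours of one chip. It is not taken from the paper: it assumes the six-link arrangement usually described for SpiNNaker (East, North-East, North, West, South-West, South) with wrap-around at the edges, and the names `LINKS` and `neighbours` are hypothetical.

```python
# Hypothetical sketch: neighbours of a chip in a 5x5 SpiNNaker-style torus.
# Assumption (not from the abstract): six links per chip with wrap-around.

WIDTH, HEIGHT = 5, 5

# (dx, dy) offsets for the six assumed link directions.
LINKS = {
    "E": (1, 0),
    "NE": (1, 1),
    "N": (0, 1),
    "W": (-1, 0),
    "SW": (-1, -1),
    "S": (0, -1),
}


def neighbours(x, y, width=WIDTH, height=HEIGHT):
    """Return {direction: (x, y)} for the six neighbours of chip (x, y)."""
    return {
        name: ((x + dx) % width, (y + dy) % height)
        for name, (dx, dy) in LINKS.items()
    }


if __name__ == "__main__":
    # The corner chip (0, 0) wraps around to the opposite edges.
    for direction, coord in neighbours(0, 0).items():
        print(direction, coord)
```

A traffic model could walk these neighbour tables hop by hop. The actual BSV model may represent links quite differently.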

Current heterogeneous chip-multiprocessors (CMPs) integrate a GPU architecture on a die. However, the heterogeneity of this architecture inevitably exerts different pressures on shared resource management due to differing characteristics of CPU and GPU cores. We consider how to efficiently share on-chip resources between cores within the heterogeneous system, in particular the on-chip network. Heterogeneous architectures use an on-chip interconnection network to access shared resources such as last-level cache tiles and memory controllers, and this type of on-chip network will have a significant impact on performance. In this article, we propose a feedback-directed virtual channel partitioning (VCP) mechanism for on-chip routers to effectively share network bandwidth between CPU and GPU cores in a heterogeneous architecture. VCP dedicates a few virtual channels to CPU and GPU applications with separate injection queues. The proposed mechanism balances on-chip network bandwidth for a...
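The idea of dedicating some virtual channels to each traffic class can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function name `eligible_vcs`, the index layout, and the choice to leave the remaining channels shared between both classes. The feedback rule that adjusts the partition is not visible in the truncated abstract and is not modelled.

```python
# Hypothetical sketch of virtual channel partitioning at one router port.
# Assumed layout: [CPU-only VCs][GPU-only VCs][VCs shared by both classes].


def eligible_vcs(traffic_class, total_vcs, cpu_dedicated, gpu_dedicated):
    """Return the VC indices that a packet of the given class may occupy."""
    if cpu_dedicated + gpu_dedicated > total_vcs:
        raise ValueError("dedicated VCs exceed the VCs available")
    cpu_only = list(range(0, cpu_dedicated))
    gpu_only = list(range(cpu_dedicated, cpu_dedicated + gpu_dedicated))
    shared = list(range(cpu_dedicated + gpu_dedicated, total_vcs))
    if traffic_class == "cpu":
        return cpu_only + shared
    if traffic_class == "gpu":
        return gpu_only + shared
    raise ValueError("unknown traffic class: " + repr(traffic_class))


if __name__ == "__main__":
    print(eligible_vcs("cpu", total_vcs=6, cpu_dedicated=1, gpu_dedicated=2))
    print(eligible_vcs("gpu", total_vcs=6, cpu_dedicated=1, gpu_dedicated=2))
```

A feedback-directed scheme would periodically change `cpu_dedicated` and `gpu_dedicated` based on observed performance. How the paper does this is not stated in the text above.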

We propose an efficient design flow for the automatic synthesis of Network-on-Chip (NOC) topologies. The specification of the problem is given as a netlist of IP cores and their communication requirements. Each IP is characterized by its area. A communication constraint is denoted by its source and destination IP and a minimum bandwidth requirement. Together with the specification, the user provides the percentage of the chip area to allocate to the communication network. Then, given the clock ...

The router plays an important role in communication among different processing cores in on-chip networks. Technology scaling has enabled designers to integrate multiple processing components on a single chip, but it has also become a source of faults. A generic router consists of buffers and pipeline stages. A single fault may degrade performance or even stop the whole chip from working. Therefore, it is necessary to provide permanent-fault tolerance for all the components of the router. In this paper, we propose a mechanism that can tolerate permanent faults that occur in the router. We exploit the fault-tolerant techniques of resource sharing and pairing between components for the input port unit and routing computation (RC) unit, resource borrowing for the virtual channel allocator (VA), and multiple paths for the switch allocator (SA) and crossbar (XB). The experimental results and analysis show that the proposed mechanism en...

A method and system for an infrastructure for performance-based chip-to-chip stacking are provided in the illustrative embodiments. A critical path monitor circuit (infrastructure) is configured to launch a signal from a launch point in a first layer, the first layer being a first circuit. The infrastructure is further configured to create an electrical path to a capture point. The signal is launched from the launch point in the first layer. A performance characteristic of the electrical path is measured, resulting in a measurement, wherein the measurement is indicative of a performance of the first layer when stacked with a second layer in a 3D stack without actually stacking the first and the second layers in the 3D stack, the second layer being a second circuit.

Vertical integration (3D ICs) has demonstrated the potential to reduce inter-block wire latency through flexible block placement and routing. However, there is untapped potential for 3D ICs to reduce intra-block wire latency through architectural designs that can leverage multiple silicon layers in innovative ways. Furthermore, it is particularly challenging to simultaneously explore the physical design space and microarchitectural space for vertical integration. The physical design space typically has no information on the microarchitectural impact of latency optimization, and the microarchitectural space has no information on the physical design impact of different architectural alternatives. We make the following contributions in this paper: (1) the introduction of port partitioning, a new approach to constructing multi-layer blocks, (2) the extension of a microarchitectural exploration tool to include the ability to model multi-layer blocks and to consider these blocks as alternative implementations of single layer architectural blocks on the fly, within a single floorplanning run, and (3) the evaluation of vertical integration on a design driver using this framework. For this design driver, we see an average 36% improvement in performance (measured in BIPS) over a single layer architecture, and a 29% improvement in performance over a multi-layer architecture with single layer blocks. The on-chip temperature is kept below 40 °C.

Three-dimensional integrated circuits (3D ICs) offer performance advantages thanks to the increased bandwidth and reduced wire-length enabled by through-silicon-via structures (TSVs). Traditionally, TSVs have been considered to improve the thermal conductivity in the vertical direction. However, the lateral thermal blockage effect becomes increasingly important for TSV farms (clusters of TSVs used for signal bus connections between layers). TSV farms can cause different thermal effects on different layers due to the unequal x, y, z thermal conductivities. This can manifest as improved vertical heat flow accompanied by lateral heat blockage in thinned pass-through layers. In this paper, we propose a thermal-aware via farm placement technique for 3D ICs to minimize lateral heat blockages caused by dense signal bus TSV structures. By incorporating the thermal conductivity profile of via farm blocks in the design flow and enabling placement/aspect ratio optimization, the corresponding hotspots can be minimized within the wire-length and area constraints.

In most 3D work to date, people have looked at two situations: 1) a case in which power density is not a problem, and the parts of a processor and/or entire processors can be stacked atop each other, and 2) a case in which power density is limited, and storage is stacked atop processors. In this paper, we consider the case in which power density is a limitation, yet we stack processors atop processors. We will also discuss some of the physical limitations today that render many of the good ideas presented in other work impractical, and what would be required in the technology to make them feasible. In the high-performance regime, circuits are not designed to be "power efficient;" they're designed to be fast. In power-efficient design, the speed and power of a processor should be nearly proportional. In the high-performance regime, the frequency is (increasingly) sublinear in power. Thus, when the power density is constrained, as it is in high-performance machines, there may be opportunities to selectively exploit parallelism in workloads by running processor-on-processor systems at the same power, yet at much greater than half speed.
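The "much greater than half speed" claim can be checked with a little arithmetic. The sketch below assumes a simple power law in which power grows as frequency raised to some exponent. The default exponent of 3 is an illustrative assumption (the classic dynamic-power approximation when voltage scales with frequency), not a figure from the paper, and `relative_frequency` is a hypothetical helper.

```python
# Hypothetical sketch: how fast can a core run on a fraction of its power?
# Assumption: power is proportional to frequency ** exponent (exponent > 1
# means frequency is sublinear in power).


def relative_frequency(power_fraction, exponent=3.0):
    """Frequency relative to baseline when power is scaled by power_fraction."""
    if power_fraction <= 0 or exponent <= 0:
        raise ValueError("power_fraction and exponent must be positive")
    return power_fraction ** (1.0 / exponent)


if __name__ == "__main__":
    # Two stacked processors sharing one power budget get half the power each.
    per_core = relative_frequency(0.5)
    print("per-core speed:", round(per_core, 3))
    print("combined speed if perfectly parallel:", round(2 * per_core, 3))
```

Under this assumed cubic law, each of two stacked cores at half power runs at roughly 79% of the original frequency rather than 50%. With an exponent of 1 (perfectly power-proportional design) the benefit disappears.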

This paper presents the silicon-proven design of a novel on-chip network to support guaranteed traffic permutation along with a self-contained adaptive system for detecting and bypassing permanent errors in multiprocessor system-on-chip applications. The proposed network employs a pipelined circuit-switching approach combined with a dynamic path-setup scheme under a multistage network topology. The dynamic path-setup scheme enables runtime path arrangement for arbitrary traffic permutations. The circuit-switching approach offers a guarantee of permuted data and its compact overhead enables the benefit of stacking multiple networks. The proposed system reroutes data on erroneous links to a set of spare wires without interrupting the data flow. To detect permanent errors at runtime, a novel in-line test (ILT) method using spare wires and a test pattern generator is proposed. In addition, an improved syndrome storing-based detection (SSD) method is presented and compared to the ILT method.

Packet-switched networks-on-chip (NOC) have been advocated as the solution to the challenge of organizing efficient and reliable communication structures among the components of a system-on-chip (SOC). A critical issue in designing a NOC is to determine its topology given the set of point-to-point communication requirements among these components. We present a novel approach to on-chip communication synthesis that is based on the iterative combination of two efficient computational steps: (1) an application of the k-Median algorithm to coarsely determine the global communication structure (which may turn out not to be a network after all), and (2) a variation of the shortest-path algorithm in order to finely tune the data flows on the communication channels. The application of our method to case studies taken from the literature shows that we can automatically synthesize optimal NOC topologies for multi-core on-chip processors and it offers new insights on why NOC are not necessari...
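As a rough illustration of the first step, the sketch below solves a tiny k-median instance by brute force: it picks k of the given core positions as hubs so that the total Manhattan distance from every core to its nearest hub is minimal. This is not the authors' algorithm, which is described as efficient; exhaustive search is used here only because it is short and easy to verify. The names `manhattan` and `k_median` and the example coordinates are hypothetical.

```python
# Hypothetical sketch: exhaustive k-median over a handful of core positions.
from itertools import combinations


def manhattan(a, b):
    """Manhattan (wire-length style) distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def k_median(points, k):
    """Return (medians, cost): the k points minimising total distance
    from every point to its nearest median. Exponential in k, so it is
    only suitable for very small inputs."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    best_cost, best_medians = None, None
    for medians in combinations(points, k):
        cost = sum(min(manhattan(p, m) for m in medians) for p in points)
        if best_cost is None or cost < best_cost:
            best_cost, best_medians = cost, medians
    return list(best_medians), best_cost


if __name__ == "__main__":
    # Two obvious clusters of cores on a floorplan grid.
    cores = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    print(k_median(cores, 2))
```

In a synthesis flow the chosen medians would be candidate router or hub locations, and a shortest-path step would then route each communication requirement over the resulting structure.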

In an era of computation, speed is a major criterion. With the advent of chip multiprocessor (CMP) systems, an innovative strategy is urgently needed to overcome the inefficiency of present memory systems and architectures. Accordingly, frequent on-chip memory accesses make it increasingly challenging to deliver high memory access performance at low power and latency. A Scratch Pad Memory (SPM) can be built from SRAM, MRAM and Z-RAM to form a heterogeneous SPM architecture. In this paper, we focus on improving latency and reducing power consumption. We use an Adaptive Genetic Algorithm for Data Allocation (AGADA) to allocate data to the memory units that form the architecture, and we report test results.

Communication between two servers is highly important today. Although performance can be achieved, speed is at risk between any two host servers. Achieving data transfer speeds of up to 1GB is challenging. PCI is a multi-drop bus-based technology that was originally intended for compute applications, with the expectation that the host processor would control the entire system. In the PCI architecture, Non-Transparent Bridges are used to expand the number of slots possible for the PCI bus. An application and a client driver are implemented to drive the Non-Transparent Bridge, achieving speeds of up to 5GB/sec. Index Terms: Peripheral Component Interface, NTB (Non-Transparent Bridge), Memory window, Doorbells