Stamatis Kavvadias | TEI of Crete
Papers by Stamatis Kavvadias
The physical constraints of transistor integration have made chip multiprocessors (CMPs) a necessity, and increasing the number of cores (CPUs) remains, so far, the best way to exploit additional transistors. Already, the feasible number of cores per chip is growing beyond our ability to utilize them for general purposes. Although many important application domains can easily benefit from more cores, scaling single-application performance with multiprocessing in general presents a tough milestone for computer science. The use of per-core on-chip memories, managed in software with RDMA and adopted in the IBM Cell processor, has challenged the mainstream approach of using coherent caches for the on-chip memory hierarchy of CMPs. The two architectures have largely different implications for software and divide researchers over the most suitable approach to multicore exploitation. We demonstrate the combination of the two approaches, with cache integration of a network interface...
2010 IEEE International Conference On Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS), 2010
One of the main challenges in the multi-core area is the communication and synchronization of the cores and the design of an efficient interconnection network that scales to many cores. In this paper we present an efficient implementation of a scalable system targeting multicore platforms. Each cluster node consists of 4 processors that support both explicit and implicit communication. Each processor's cache is augmented with a scratchpad and merged with the network interface (NI) for reduced communication latency. All nodes are connected through a novel layer-2 switch that can support up to 20 nodes. The proposed system is designed and implemented using multiple FPGA boards, and the performance evaluation reports the aggregate throughput of the system (with 16 processors) and the communication latency between the cluster nodes.
Computer Architecture Dept., Polytechnic University of Catalonia (UPC), Barcelona, July 2008
Abstract. Programming models with explicit communication between parallel tasks allow the runtime system to schedule task execution and data transfers ahead of time. Explicit communication is not limited to message passing and streaming applications: recent proposals in parallel programming allow such explicit communication in other task-based scenarios too. Scheduling of data transfers enables the overlap of computation and communication, latency hiding, and locality optimization, using programmable data ...
Previous research has demonstrated several approaches to hardware phase detection. In this work, we propose a new framework, based on the Xilinx Virtex-6 platform, for task-optimized coarse-grained reconfiguration that adapts to the applications' behavior. We use a MicroBlaze as the general-purpose processor and ρ-VEX VLIW architectures as the reconfigurable cores. A run-time component called the supervisor dynamically monitors the system behavior and triggers reconfiguration; we also propose a Profiler component that automatically obtains the phase information. The collected data can be used to guide dynamic reconfiguration on the FPGA.
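As a rough illustration of the supervisor loop described above (not the framework's actual API), the following C sketch samples a synthetic profiling metric, flags a phase change when the metric moves beyond a threshold, and calls a placeholder reconfiguration hook; the metric source, threshold, and hook names are all assumptions.

```c
/* Hypothetical supervisor loop: sample a profiling metric, detect a phase
 * change by thresholding its variation, and trigger reconfiguration.
 * All names and the synthetic metric are placeholders for illustration. */
#include <stdio.h>
#include <stdlib.h>

#define PHASE_THRESHOLD 20                       /* assumed sensitivity           */

static int read_profiler_metric(int t)           /* stand-in for the Profiler     */
{
    return t < 5 ? 10 : 50;                      /* synthetic phase change at t=5 */
}

static void trigger_reconfiguration(int t)       /* stand-in for FPGA reconfig    */
{
    printf("phase change at sample %d: reconfiguring\n", t);
}

int main(void)
{
    int last = read_profiler_metric(0);
    for (int t = 1; t < 10; t++) {               /* supervisor monitoring loop    */
        int m = read_profiler_metric(t);
        if (abs(m - last) > PHASE_THRESHOLD)
            trigger_reconfiguration(t);
        last = m;
    }
    return 0;
}
```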
2016 11th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), 2016
2011 Design, Automation & Test in Europe, 2011
We present a runtime system that uses the explicit on-chip communication mechanisms of the SARC multi-core architecture to implement the OpenMP programming model efficiently and enable the exploitation of fine-grain parallelism in OpenMP programs. We explore the implementation design space of OpenMP directives and runtime intrinsics using a family of hardware primitives (remote stores, remote DMAs, hardware counters, and hardware event queues with automatic responses) to support static and dynamic scheduling and data transfers in local memories. Using an FPGA prototype with four cores, we achieve OpenMP task creation latencies of 30-35 processor clock cycles, initiation of parallel contexts in 50 cycles, and synchronization primitives in 65-210 cycles.
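The task-dispatch side of such a runtime can be pictured with a small software model. In the C sketch below, an in-memory ring buffer stands in for a hardware event queue into which the parent core "remote-stores" task descriptors and from which a worker dequeues them; the names (task_desc, enqueue_task, dequeue_task) are illustrative and do not correspond to the actual SARC runtime interface.

```c
/* Minimal software model of queue-based task dispatch: a ring buffer with
 * atomic head/tail indices replaces the hardware event queue and remote
 * stores.  All identifiers are illustrative, not the SARC runtime API. */
#include <stdatomic.h>
#include <stdio.h>

#define RING_SLOTS 8

typedef struct { void (*fn)(int); int arg; } task_desc;    /* task descriptor   */

static task_desc ring[RING_SLOTS];                         /* per-worker queue  */
static atomic_uint head, tail;                             /* producer/consumer */

static int enqueue_task(task_desc t)                       /* "remote store"    */
{
    unsigned h = atomic_load(&head), tl = atomic_load(&tail);
    if (h - tl == RING_SLOTS) return 0;                    /* queue full        */
    ring[h % RING_SLOTS] = t;
    atomic_store(&head, h + 1);
    return 1;
}

static int dequeue_task(task_desc *t)                      /* worker side       */
{
    unsigned h = atomic_load(&head), tl = atomic_load(&tail);
    if (h == tl) return 0;                                 /* queue empty       */
    *t = ring[tl % RING_SLOTS];
    atomic_store(&tail, tl + 1);
    return 1;
}

static void work(int i) { printf("task %d\n", i); }

int main(void)
{
    for (int i = 0; i < 4; i++)
        enqueue_task((task_desc){ work, i });              /* parent creates tasks */
    task_desc t;
    while (dequeue_task(&t))                               /* worker drains queue  */
        t.fn(t.arg);
    return 0;
}
```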
Chapman & Hall/CRC Computational Science, 2010
2010 International Conference on Reconfigurable Computing and FPGAs, 2010
Per-core local (scratchpad) memories allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. A multicore FPGA platform with cache-integrated network interfaces (NIs), appropriate for scalable multicores, is presented that combines the best of both worlds, the flexibility of caches (implicit communication) and the efficiency of scratchpad memories (explicit communication): on-chip SRAM is configurably shared among caching, scratchpad, and virtualized NI functions. The proposed system has been implemented on a four-core FPGA. Special hardware primitives (counters, queues), well suited to network processing applications, are used for the communication and synchronization of the cores. The paper presents the performance evaluation of the proposed system in the domain of network processing. Two representative benchmarks are used: one for header processing and one for payload processing. The system is evaluated in terms of performance, and the communication overhead is measured. Furthermore, two approaches to inter-processor communication are evaluated and compared: a common queue and distributed queues.
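The difference between the two queueing schemes compared above can be summarized in a toy dispatch model: either all packets funnel through one common queue that any core pops from, or a dispatcher spreads them round-robin into per-core queues. The C sketch below uses plain arrays in place of the hardware queue primitives; all names are illustrative.

```c
/* Toy model of the two dispatch schemes: one shared queue versus per-core
 * (distributed) queues fed round-robin.  Plain arrays stand in for the
 * hardware queues; packet ids stand in for real packets. */
#include <stdio.h>

#define CORES 4
#define PKTS  8

int main(void)
{
    int shared[PKTS], shared_n = 0;            /* common queue: any core pops next    */
    int percore[CORES][PKTS], percore_n[CORES] = {0};

    for (int pkt = 0; pkt < PKTS; pkt++) {
        shared[shared_n++] = pkt;              /* scheme 1: one queue, self-balancing */
        int c = pkt % CORES;                   /* scheme 2: static round-robin spread */
        percore[c][percore_n[c]++] = pkt;
    }

    for (int c = 0; c < CORES; c++)
        printf("core %d gets %d packets in the distributed scheme\n",
               c, percore_n[c]);
    printf("%d packets share a single queueing point in the common scheme\n",
           shared_n);
    return 0;
}
```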
2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, 2007
Parallel computing systems are becoming widespread and growing in sophistication. Besides simulation, rapid system prototyping is becoming important in designing and evaluating their architecture. We present an efficient FPGA-based platform that we developed and use for research and experimentation on high-speed interprocessor communication, network interfaces, and interconnects. Our platform supports advanced communication capabilities such as Remote DMA, Remote Queues, zero-copy data delivery, and flexible notification mechanisms, as well as link bundling for increased performance. We report on the platform architecture, its design cost, complexity, and performance (latency and throughput). We also report our experience implementing benchmarking kernels and a user-level benchmark application, showing how software can take advantage of the provided features but also exposing the weaknesses of the system.
Guest Editors' Introduction: Multicore: The View from Europe, Mateo Valero and Nacho Navarro ... ArchExplorer for Automatic Design Space Exploration, Veerle Desmet, Sylvain Girbal, Alex Ramirez, Augusto Vega, and Olivier Temam ... The SARC Architecture, Alex Ramirez, Felipe Cabarcas, Ben Juurlink, Mauricio Alvarez Mesa, Friman Sanchez, Arnaldo Azevedo, Cor Meenderinck, Cătălin Ciobanu, Sebastian Isaza, and Georgi Gaydadjiev ... Explicit Communication and Synchronization in SARC, Manolis G.H. Katevenis, Vassilis Papaefstathiou, Stamatis ...
IEEE Micro, Sep 1, 2010
A new network interface optimized for SARC supports synchronization and explicit communication and provides a robust mechanism for event responses. Full-system simulation of the authors' design achieved a 10- to 40-percent speed increase over traditional cache architectures on 64 cores, a two- to four-fold decrease in on-chip network traffic, and a three- to five-fold decrease in lock and barrier latency.
We present the hardware design and implementation of a local memory system for individual processors inside future chip multi-processors (CMP). Our memory system supports both implicit communication via caches, and explicit communication via directly accessible local ("scratchpad") memories and remote DMA (RDMA). We provide run-time configurability of the SRAM blocks that lie near each processor, so that portions of them operate as 2nd-level (local) cache, while the rest operate as scratchpad. We also strive to merge the communication subsystems required by the cache and scratchpad into one integrated Network Interface (NI) and Cache Controller (CC), in order to economize on circuits. The processor interacts with the NI at user level through virtualized command areas in scratchpad; the NI uses a similar access mechanism to provide efficient support for two hardware synchronization primitives: counters and queues. We describe the NI design, the hardware cost, and the ...
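A command issued through such a virtualized command area might resemble the hypothetical descriptor below: software writes the source, destination, size, and the address of an acknowledgement counter, and later waits for the counter to drain. The field names, layout, and the software stand-in for the NI are assumptions for illustration, not the actual hardware interface.

```c
/* Hypothetical RDMA-copy command descriptor with counter-based completion.
 * A software function plays the role of the NI; layout and names are
 * illustrative assumptions, not the real command-area format. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

typedef struct {
    uint64_t src;        /* source address (local scratchpad)       */
    uint64_t dst;        /* destination address (remote scratchpad) */
    uint32_t size;       /* transfer size in bytes                  */
    uint64_t ack_cnt;    /* address of completion counter to notify */
} rdma_cmd;

static void ni_execute(const rdma_cmd *c)       /* software stand-in for the NI  */
{
    memcpy((void *)(uintptr_t)c->dst, (const void *)(uintptr_t)c->src, c->size);
    (*(volatile int *)(uintptr_t)c->ack_cnt)--; /* signal completion             */
}

int main(void)
{
    char src[64] = "payload", dst[64] = {0};
    volatile int pending = 1;                   /* counter armed with 1 transfer  */

    rdma_cmd cmd = { (uintptr_t)src, (uintptr_t)dst, sizeof src, (uintptr_t)&pending };
    ni_execute(&cmd);                           /* "write" the command descriptor */

    while (pending != 0) ;                      /* wait until the counter drains  */
    printf("received: %s\n", dst);
    return 0;
}
```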
Recent advances in silicon technology allow today's systems to host a few processor cores on the same chip. In the upcoming manycore era, parallel systems will depend on multi-core chips to allow their performance to scale. Scalability can only be achieved with synergistic use of the available cores, and thus efficient communication between them is increasingly important. This interprocessor communication takes place in the processors' Network Interfaces (NIs) and thus requires low-cost, high-performance NI architectures. Our current research focus is on future on-chip NIs that are tightly coupled to the processors and the memory hierarchy. This paper introduces the on-chip environment for these NIs and discusses the scalability issues. We propose the integration of the NI inside the cache controller, with a simple interface that allows a few store/load instructions to send/receive messages at L1 cache rates.
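The store/load message path proposed above can be modeled in software as follows: the sender writes a few words plus a flag into a window that would map to the receiver's scratchpad, and the receiver polls the flag with ordinary loads. The shared array below stands in for the memory-mapped window; all names are illustrative.

```c
/* Software model of a store/load message path: a small shared window plays
 * the role of the receiver's memory-mapped scratchpad; the last word is the
 * arrival flag.  Names and layout are illustrative only. */
#include <stdio.h>

#define MSG_WORDS 3

volatile int window[MSG_WORDS + 1];            /* last word doubles as the flag  */

static void send_msg(const int *words)
{
    for (int i = 0; i < MSG_WORDS; i++)
        window[i] = words[i];                  /* a few ordinary stores           */
    window[MSG_WORDS] = 1;                     /* final store raises the flag     */
}

static void recv_msg(int *words)
{
    while (window[MSG_WORDS] == 0) ;           /* poll the flag with loads        */
    for (int i = 0; i < MSG_WORDS; i++)
        words[i] = window[i];
}

int main(void)
{
    int out[MSG_WORDS] = { 7, 8, 9 }, in[MSG_WORDS];
    send_msg(out);
    recv_msg(in);
    printf("%d %d %d\n", in[0], in[1], in[2]);
    return 0;
}
```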
Proceedings of the 7th ACM international conference on Computing frontiers - CF '10, 2010
Per-core local (scratchpad) memories allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces (NIs), appropriate for scalable multicores, that combine the best of both worlds, the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized NI functions. This paper presents our architecture, which provides local and remote scratchpad access to either individual words or multiword blocks through RDMA copy. Furthermore, we introduce event responses as a mechanism for software-configurable synchronization primitives. We present three event response mechanisms that expose NI functionality to software, for multiword transfer initiation, memory barriers for explicitly selected accesses of arbitrary size, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and evaluated the on-chip communication performance on the prototype as well as on a CMP simulator with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks, and efficient exploitation of the hardware bandwidth.
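Of the three event-response mechanisms, the selective memory barrier is perhaps the easiest to picture: software arms a counter with the number of explicitly selected transfers, each completion decrements it, and the waiter spins until it reaches zero. The C sketch below models this with an atomic integer standing in for the hardware counter; the function names and completion hook are assumptions, not the prototype's interface.

```c
/* Model of counter-based completion notification: arm a counter with the
 * number of explicitly selected transfers, decrement on each completion,
 * and spin until zero.  The atomic int stands in for the hardware counter. */
#include <stdatomic.h>
#include <stdio.h>

static atomic_int outstanding;                  /* hardware counter stand-in */

static void arm_counter(int n)  { atomic_store(&outstanding, n); }
static void transfer_done(void) { atomic_fetch_sub(&outstanding, 1); }
static void wait_selected(void) { while (atomic_load(&outstanding) > 0) ; }

int main(void)
{
    arm_counter(3);                             /* fence exactly these 3 transfers */
    for (int i = 0; i < 3; i++)
        transfer_done();                        /* acks would arrive from the NoC  */
    wait_selected();                            /* selective barrier completes     */
    puts("all selected transfers complete");
    return 0;
}
```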
International Journal of Parallel Programming, 2011
Per-core scratchpad memories (or local stores) allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces, appropriate for scalable multicores, that combine the best of both worlds, the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized network interface (NI) functions. This paper presents our architecture, which provides local and remote scratchpad access to either individual words or multiword blocks through RDMA copy. Furthermore, we introduce event responses as a technique that enables software-configurable communication and synchronization primitives. We present three event response mechanisms that expose NI functionality to software, for multiword transfer initiation, completion notifications for software-selected sets of arbitrary-size transfers, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and measure the logic overhead over a cache-only design for basic NI functionality to be less than 20%. We also evaluate the on-chip communication performance on the prototype, as well as on a CMP simulator with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks, and efficient exploitation of the hardware bandwidth.
Proceedings 12th International Workshop on Rapid System Prototyping. RSP 2001, 2001
2009 International Symposium on Systems, Architectures, Modeling, and Simulation, 2009
We report on the hardware implementation of a local memory system for individual processors inside future chip multiprocessors (CMP). It is intended to support both implicit communication, via caches, and explicit communication, via directly accessible local ("scratchpad") memories and remote DMA (RDMA). We provide run-time configurability of the SRAM blocks near each processor, so that part of them operates as 2nd-level (local) cache, while the rest operates as scratchpad. We also strive to merge the communication subsystems required by the cache and scratchpad into one integrated Network Interface (NI) and Cache Controller (CC), in order to economize on circuits. The processor communicates with the NI at user level, through virtualized command areas in scratchpad; through a similar mechanism, the NI also provides efficient support for synchronization, using two hardware primitives: counters and queues. We describe the block diagram, the hardware cost, and the latencies of our FPGA-based prototype implementation, which integrates four MicroBlaze processors, each with 64 KBytes of local SRAM, a crossbar NoC, and a DRAM controller on a Xilinx Virtex-5 FPGA. One-way, end-to-end, user-level communication completes within about 30 clock cycles for short transfer sizes.
Abstract. An active, reconfigurable network node, named PLATO, has been designed and implemented. Version V.X1.0 was developed as a PCI board around a Xilinx Virtex XCV-1000 FPGA, with 256 MBytes of SDRAM, 512 KBytes of SRAM, and a UTOPIA Level 2 based interface to 4 bidirectional 155 Mbps ATM links. This paper presents the implementation and testing of the PLATO system, as well as a priority enforcement scheme for transmission of TCP/IP packets over ATM networks. Keywords: Active network, ...
2nd Industrial Workshop of the European Network of Excellence on High-Performance Embedded Architecture and Compilation (HiPEAC), Oct 17, 2006
Parallel and multinode computing systems are becoming widespread and growing in sophistication. Besides simulation, rapid prototyping is becoming important in designing and evaluating their architecture. We present an FPGA-based system that we developed and use for prototyping and measuring high-speed processor-network interfaces and interconnects; it is an experimental tool for research projects in architecture. We configure FPGA boards as network interfaces (NIs) and as switches. NIs plug into the PCI-X bus of commercial PCs, ...