Abhishek Das - Profile on Academia.edu

Papers by Abhishek Das

Research paper thumbnail of Merrimac: Supercomputing with Streams

Merrimac uses stream architecture and advanced interconnection networks to give an order of magnitude more performance per unit cost than cluster-based scientific computers built from the same technology. Organizing the computation into streams and exploiting the resulting locality using a register hierarchy enables a stream architecture to reduce the memory bandwidth required by representative applications by an order of magnitude or more. Hence a processing node with a fixed bandwidth (expensive) can support an order of magnitude more arithmetic units (inexpensive). This in turn allows a given level of performance to be achieved with fewer nodes (a 1-PFLOPS machine, for example, with just 8,192 nodes), resulting in greater reliability and simpler system management. We sketch the design of Merrimac, a streaming scientific computer that can be scaled from a $20K, 2-TFLOPS workstation to a $20M, 2-PFLOPS supercomputer, and present the results of some initial application experiments on this architecture.
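
The abstract's scaling claim can be sanity-checked with simple arithmetic. A minimal sketch follows; only the 1-PFLOPS/8,192-node figure and the order-of-magnitude bandwidth reduction come from the abstract, while the per-node bandwidth and bytes-per-FLOP values are illustrative assumptions.

```python
# Back-of-the-envelope model of the Merrimac cost argument.
# Only the 1 PFLOPS / 8,192-node point and the ~10x bandwidth
# reduction are from the abstract; all other numbers are assumed.

TARGET_FLOPS = 1e15           # 1 PFLOPS system target (from the abstract)
NODES = 8_192                 # node count quoted in the abstract

per_node = TARGET_FLOPS / NODES
print(f"required per-node performance: {per_node / 1e9:.0f} GFLOPS")  # ~122

# With a fixed (expensive) memory system, the achievable arithmetic
# rate is bandwidth-limited; streaming's ~10x lower bandwidth demand
# lets the same node feed ~10x more (inexpensive) arithmetic units.
NODE_BANDWIDTH = 40e9         # assumed 40 GB/s per node
BYTES_PER_FLOP_CACHED = 1.0   # assumed demand for cache-based code
STREAM_REDUCTION = 10         # order-of-magnitude claim from the abstract

cache_bound = NODE_BANDWIDTH / BYTES_PER_FLOP_CACHED
stream_bound = cache_bound * STREAM_REDUCTION
print(f"bandwidth-limited rate, cache-style: {cache_bound / 1e9:.0f} GFLOPS")
print(f"bandwidth-limited rate, streaming:   {stream_bound / 1e9:.0f} GFLOPS")
```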

Research paper thumbnail of Stream Scheduling: A Framework to Manage Bulk Operations in a Memory Hierarchy

Recently, streaming architectures such as Imagine, Merrimac, and Cell were demonstrated to achieve significantly higher performance and efficiency than traditional architectures by introducing explicitly managed on-chip storage into the memory hierarchy. This software-managed memory serves as a staging area for bulk amounts of data, making all functional-unit references short and predictable while data is asynchronously transferred from external memory. The decoupling of computation from memory accesses allows the software to statically optimize the execution pipeline, transferring the onus of latency tolerance from hardware to software. The stream programming model captures this by making computation and communication explicit in a two-level storage hierarchy. This paradigm of structuring algorithms and explicitly managing data so that they are serviced by levels of the memory hierarchy as close to the processors as possible, however, applies to modern systems of all scales. The levels of the memory hierarchy can include on-die storage such as caches, local DRAM, or even remote memory accessed over a high-speed interconnect. Sequoia, a recently proposed programming language, extends the stream programming model to describe array blocking and communication for machines that can be abstracted as a tree of distinct memory modules.
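
The staging idea is easiest to see as double buffering: while a kernel computes on one staged block, the next block is transferred asynchronously. Below is a minimal Python sketch of that overlap; the block size, the kernel, and the thread pool standing in for a DMA engine are all illustrative.

```python
# Minimal sketch of explicitly managed staging: bulk-load block i+1
# asynchronously while computing on block i, so all "functional unit"
# references hit the small local buffer. The executor stands in for a
# DMA engine; names and sizes are assumptions.
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4_096  # elements staged per bulk transfer (assumed size)

def bulk_load(data, start):
    """Stand-in for an async DMA from external memory into local store."""
    return data[start:start + BLOCK]

def kernel(block):
    """Compute kernel that touches only the staged block."""
    return sum(x * x for x in block)

def stream_sum_squares(data):
    total = 0
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(bulk_load, data, 0)
        for start in range(0, len(data), BLOCK):
            local = pending.result()          # wait for the staged block
            nxt = start + BLOCK
            if nxt < len(data):               # overlap the next transfer
                pending = dma.submit(bulk_load, data, nxt)
            total += kernel(local)            # compute on local store
    return total

data = list(range(20_000))
assert stream_sum_squares(data) == sum(x * x for x in data)
```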

Research paper thumbnail of Compiling for stream processing

This paper describes a compiler for stream programs that efficiently schedules computational kernels and stream memory operations, and allocates on-chip storage. Our compiler uses information about the program structure and estimates of kernel and memory-operation execution times to overlap kernel execution with memory transfers, maximizing performance, and to optimize the use of scarce on-chip memory, significantly reducing external memory bandwidth. Our compiler applies optimizations such as strip-mining, loop unrolling, and software pipelining at the level of kernels and stream memory operations. We evaluate the performance of our compiler on a suite of media and scientific benchmarks. Our results show that compiler management of on-chip storage reduces external memory bandwidth by 35% to 93% and reduces execution time by 23% to 72% compared to cache-like LRU management of the same storage. We show that strip-mining stream applications enables producer-consumer locality to be captured in on-chip storage, reducing external bandwidth by 50% to 80%. We also evaluate the sensitivity of performance to the scheduling methods used and to critical resources. Overall, our compiler is able to overlap memory operations and manage local storage so that 78% to 96% of program execution time is spent running computational kernels.
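
The producer-consumer result has a simple mechanical explanation: without strip-mining, the intermediate stream between two kernels makes a round trip to external memory; strip-mined, each strip of the intermediate stays on chip. A hedged sketch, with an assumed on-chip capacity and toy kernels:

```python
# Why strip-mining captures producer-consumer locality.
# Unstripped: the producer writes the whole intermediate stream to
# external memory and the consumer reads it back (4 bulk transfers of
# the stream in total). Stripped: each strip of the intermediate stays
# in on-chip storage, so only input and output cross the chip boundary.

ONCHIP_CAPACITY = 1_024          # elements of on-chip storage (assumed)

def producer(x): return x + 1    # first kernel (illustrative)
def consumer(y): return y * 2    # second kernel, consumes producer output

def run_stripmined(inp):
    out = []
    external_traffic = 0
    for i in range(0, len(inp), ONCHIP_CAPACITY):
        strip = inp[i:i + ONCHIP_CAPACITY]     # bulk-load input strip
        external_traffic += len(strip)
        tmp = [producer(x) for x in strip]     # intermediate stays on chip
        res = [consumer(y) for y in tmp]
        external_traffic += len(res)           # bulk-store output strip
        out.extend(res)
    return out, external_traffic

inp = list(range(10_000))
out, traffic = run_stripmined(inp)
assert traffic == 2 * len(inp)   # intermediate never left the chip
print(f"external transfers: {traffic} elements "
      f"(vs {4 * len(inp)} unstripped)")
```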

Research paper thumbnail of Stream Scheduling: A Framework to Manage Bulk Operations in Memory Hierarchies

With the emergence of streaming and multi-core architectures, there is an increasing demand to map parallel algorithms efficiently across all architectures. This paper describes a platform-independent optimization framework called Stream Scheduling that orchestrates parallel execution of bulk computations and data transfers and allocates storage at multiple levels of a memory hierarchy. By adjusting block sizes and applying software pipelining to bulk operations, it ensures the computation-to-communication ratio is maximized at each level. We evaluate our framework on a diverse set of Sequoia applications, targeting systems with different memory hierarchies: a Cell blade, a distributed-memory cluster, and the Cell blade attached to a disk.
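
The block-size adjustment can be sketched as a small model: with double buffering, two blocks of each live array must fit in a level's capacity, and larger blocks amortize the fixed per-transfer latency, so the largest fitting block maximizes the computation-to-communication ratio. All numbers below are illustrative assumptions, not measurements from the paper.

```python
# Sketch of the per-level block-size choice under double buffering.

def best_block_size(capacity_bytes, live_arrays, elem_bytes,
                    xfer_latency_s, bandwidth_bps,
                    flops_per_elem, flops_per_s):
    # Two buffers per live array (compute on one, transfer the other).
    max_elems = capacity_bytes // (2 * live_arrays * elem_bytes)

    def ratio(n):
        t_comm = xfer_latency_s + n * elem_bytes / bandwidth_bps
        t_comp = n * flops_per_elem / flops_per_s
        return t_comp / t_comm

    # The ratio a*n / (L + b*n) grows with n, so the largest block
    # that fits in the level wins.
    return max_elems, ratio(max_elems)

n, r = best_block_size(capacity_bytes=256 * 1024, live_arrays=3,
                       elem_bytes=8, xfer_latency_s=2e-6,
                       bandwidth_bps=25e9, flops_per_elem=50,
                       flops_per_s=200e9)
print(f"block = {n} elems, compute/comm ratio ~ {r:.2f}")
```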

Research paper thumbnail of Stream Processors: Programmability and Efficiency

ACM Queue, 2004

Many signal processing applications require both efficiency and programmability. Baseband signal processing in 3G cellular base stations, for example, requires hundreds of GOPS (giga, or billions of, operations per second) with a power budget of a few watts: an efficiency of about 100 GOPS/W (GOPS per watt), or 10 pJ/op (picojoules per operation). At the same time, programmability is needed to follow evolving standards, to support multiple air interfaces, and to dynamically provision processing resources over different air interfaces. Digital television, surveillance video processing, automated optical inspection, and mobile cameras, camcorders, and 3G cellular handsets have similar needs.
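
The two efficiency figures are the same quantity stated in different units, as a quick check shows (the 300-GOPS workload in the last line is an assumed example of "hundreds of GOPS"):

```python
# Unit check: 100 GOPS/W is the same statement as 10 pJ per operation.
gops_per_watt = 100
joules_per_op = 1 / (gops_per_watt * 1e9)     # W / (ops/s) = J/op
print(f"{joules_per_op * 1e12:.0f} pJ/op")    # -> 10 pJ/op

# Power implied by "hundreds of GOPS in a few watts":
ops_per_s = 300e9                             # assumed 300 GOPS workload
print(f"{ops_per_s * joules_per_op:.1f} W")   # -> 3.0 W
```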

Research paper thumbnail of Evaluating voltage islands in CMPs under process variations

Parameter variations are a major factor causing power-performance asymmetry in chip multiprocessors. In this paper, we analyze the effects of within-die (WID) process variations on chip multicore processors and then apply a variable voltage-island scheme to minimize power dissipation. Our idea is based on the observation that, due to process variations, the critical paths in each core are likely to have different latencies, resulting in core-to-core (C2C) variations. As a result, each core can operate correctly under a different supply voltage level, achieving an optimal power consumption level. In particular, we analyze voltage islands at different granularities, ranging from a single core to a group of cores. We show that dynamic power consumption can be reduced by up to 36.2% when each core can set its individual supply voltage level. In addition, for most manufacturing technologies, significant power savings can be achieved with only a few voltage islands on the whole chip: a single customized voltage setting can reduce power consumption by up to 31.5%. Since the nominal operating frequency remains unchanged after the modifications, our scheme incurs no performance overhead.
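
A minimal simulation of the idea, assuming a made-up distribution of per-core minimum voltages: each island must run at the worst minimum voltage among its member cores, and dynamic power at fixed frequency scales with V². The paper's actual variation model and savings figures are not reproduced here.

```python
# Toy model of voltage islands under core-to-core variation.
import random

random.seed(1)
V_NOM = 1.2
CORES = 16

# Per-core minimum voltage that still meets the nominal frequency
# (assumed uniform spread standing in for measured C2C variation).
v_min = [round(random.uniform(0.95, 1.2), 3) for _ in range(CORES)]

def dynamic_power(voltages):
    # P_dyn is proportional to f * V^2; f is fixed, so normalize to nominal.
    return sum(v * v for v in voltages) / (CORES * V_NOM * V_NOM)

def island_power(cores_per_island):
    ordered = sorted(v_min)          # idealized: group similar cores
    volts = []
    for i in range(0, CORES, cores_per_island):
        island = ordered[i:i + cores_per_island]
        volts += [max(island)] * len(island)   # island runs at worst member
    return dynamic_power(volts)

for k in (1, 4, 16):
    p = island_power(k)
    print(f"{CORES // k:2d} islands: {100 * (1 - p):.1f}% dynamic power saved")
```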

Research paper thumbnail of Evaluating the Imagine Stream Architecture

ACM Sigarch Computer Architecture News, 2004

This paper describes an experimental evaluation of the prototype Imagine stream processor. Imagine is a stream processor that employs a two-level register hierarchy with 9.7 Kbytes of local register file capacity and 128 Kbytes of stream register file (SRF) capacity to capture producer-consumer locality in stream applications. Parallelism is exploited using an array of 48 floating-point arithmetic units organized as eight SIMD clusters with a 6-wide VLIW per cluster. We evaluate the performance of each aspect of the Imagine architecture using a set of synthetic micro-benchmarks, key media processing kernels, and full applications. These micro-benchmarks show that the prototype hardware can attain 7.96 GFLOPS or 25.4 GOPS of arithmetic performance, 12.7 Gbytes/s of SRF bandwidth, 1.58 Gbytes/s of memory system bandwidth, and accept up to 2 million stream processor instructions per second from a host processor.

Research paper thumbnail of Microarchitectures for Managing Chip Revenues under Process Variations

IEEE Computer Architecture Letters, 2007

As transistor feature sizes continue to shrink into the sub-90nm range and beyond, the effects of process variations on critical-path delay and chip yields have amplified. A common concept to remedy the effects of variation is speed-binning, by which chips from a single batch are rated by a discrete range of frequencies and sold at different prices. In this paper, we discuss strategies to modify the number of chips in different bins and hence enhance the profits obtained from them. In particular, we propose a scheme that introduces a small Substitute Cache associated with each cache way to replicate the data elements that will be stored in the high-latency lines. Assuming a fixed pricing model, this method increases revenue by as much as 13.8% without any impact on the performance of the chips.
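
A hedged Monte Carlo sketch of the binning effect: a chip's rated frequency is set by its slowest cache line, so replicating the few slowest lines in a substitute cache lets the chip bin at the speed of its worst uncovered line. The delay distribution, line count, coverage, and bins below are all illustrative assumptions.

```python
# How covering the slowest cache lines shifts the speed-bin histogram.
import random

random.seed(7)
LINES = 2_048          # cache lines per chip (assumed)
COVERED = 8            # lines replicated in the substitute cache (assumed)
BINS_GHZ = [2.0, 2.2, 2.4, 2.6]   # assumed frequency bin ratings

def chip_bin(covered):
    # Per-line access delay in ns under variation (assumed Gaussian).
    delays = sorted((random.gauss(0.38, 0.025) for _ in range(LINES)),
                    reverse=True)
    critical = delays[covered]     # worst *uncovered* line sets the clock
    freq_ghz = 1.0 / critical
    return max((b for b in BINS_GHZ if b <= freq_ghz), default=None)

def histogram(covered, chips=1_000):
    hist = {b: 0 for b in BINS_GHZ}
    for _ in range(chips):
        b = chip_bin(covered)
        if b is not None:
            hist[b] += 1
    return hist

print("baseline bins:        ", histogram(0))
print("with substitute cache:", histogram(COVERED))
```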

Research paper thumbnail of An FPGA-Based Network Intrusion Detection Architecture

IEEE Transactions on Information Forensics and Security, 2008

Network intrusion detection systems (NIDSs) monitor network traffic for suspicious activity and alert the system or network administrator. With the onset of gigabit networks, current-generation networking components for NIDSs will soon be insufficient for numerous reasons, most notably because the existing methods cannot support high-performance demands. Field-programmable gate arrays (FPGAs) are an attractive medium to handle both high throughput and adaptability to the dynamic nature of intrusion detection. In this work, we design an FPGA-based architecture for anomaly detection in network transmissions. We first develop a feature extraction module (FEM) which aims to summarize network information to be used at a later stage. Our FPGA implementation shows that we can achieve significant performance improvements compared to existing software and application-specific integrated-circuit implementations. Then, we go one step further and demonstrate the use of principal component analysis as an outlier detection method for NIDSs. The results show that our architecture correctly classifies attacks with detection rates exceeding 99% and false-alarm rates as low as 1.95%. Moreover, using extensive pipelining and hardware parallelism, our architectures for FEM and outlier analysis achieve 21.25- and 23.76-Gb/s core throughput, respectively, for realistic workloads.
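
The outlier-detection stage can be illustrated in a few lines of NumPy: fit a principal subspace on normal traffic features and flag records whose residual distance from that subspace is large. The synthetic features and the 1% threshold are assumptions; the paper's FEM would supply real per-connection features.

```python
# PCA as an outlier detector: large residual from the normal-traffic
# principal subspace => anomaly.
import numpy as np

rng = np.random.default_rng(0)

# Normal traffic: 4-D features that really live near a 2-D subspace
# (assumed stand-in for FEM output such as packet rate, mean size, ...).
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 4))
normal = latent @ mixing + rng.normal(scale=0.05, size=(500, 4))
attack = 3 * rng.normal(size=(10, 4))      # not confined to the subspace

mu = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mu, full_matrices=False)
basis = vt[:2]                             # top-2 principal directions

def residual(x):
    d = x - mu
    return np.linalg.norm(d - (d @ basis.T) @ basis, axis=-1)

threshold = np.percentile(residual(normal), 99)   # ~1% false-alarm budget
print("attacks flagged:", int((residual(attack) > threshold).sum()), "/ 10")
```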

Research paper thumbnail of Evaluating the effects of cache redundancy on profit

Previous works in computer architecture have mostly neglected revenue and/or profit, key factors driving any design decision. In this paper, we evaluate architectural techniques to optimize for revenue/profit. The continual trend of technology scaling and subwavelength lithography has caused transistor feature sizes to shrink into the nanoscale range. As a result, the effects of process variations on critical-path delay and chip yields have amplified. A common concept to remedy the effects of variations is speed-binning, by which chips from a single batch are rated by a discrete range of frequencies and sold at different prices. An efficient binning distribution thus decides the profitability of the chip manufacturer. We propose and evaluate a cache-redundancy scheme called substitute cache, which allows chip manufacturers to modify the number of chips in different bins. In particular, this technique introduces a small fully associative array associated with each cache way to replicate the data elements that will be stored in the high-latency lines, and hence can be effectively used to boost the overall chip yield and shift the chip binning distribution towards higher frequencies. We also develop models based on linear regression and neural networks to accurately estimate chip prices from their architectural configurations. Using these estimation models, we find that our substitute cache scheme can potentially increase the revenue for a batch of chips by as much as 13.1%.
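
The price-estimation component can be sketched as an ordinary least-squares fit from configuration to price. The training table below is fabricated purely for illustration; the paper fits regression and neural-network models to real price data.

```python
# Linear-regression price model: configuration features -> price.
import numpy as np

# Columns: frequency (GHz), L2 size (MB), cores. Prices in dollars.
configs = np.array([[2.0, 2, 2], [2.2, 2, 2], [2.4, 4, 2],
                    [2.6, 4, 4], [2.8, 8, 4], [3.0, 8, 4]], float)
prices = np.array([120, 150, 210, 320, 450, 560], float)

X = np.hstack([configs, np.ones((len(configs), 1))])   # add intercept
coef, *_ = np.linalg.lstsq(X, prices, rcond=None)

def estimate(freq, l2_mb, cores):
    """Predicted price for a configuration (illustrative model)."""
    return np.array([freq, l2_mb, cores, 1.0]) @ coef

# Revenue effect of shifting one chip from the 2.2 to the 2.4 GHz bin:
print(f"uplift ~ ${estimate(2.4, 4, 2) - estimate(2.2, 2, 2):.0f}")
```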

Research paper thumbnail of Detecting/preventing information leakage on the memory bus due to malicious hardware

An increasing concern amongst designers and integrators of military and defense-related systems is the underlying security of the individual microprocessor components that make up these systems. Malicious circuitry can be inserted and hidden at several stages of the design process through the use of third-party Intellectual Property (IP), design tools, and manufacturing facilities. Such hardware Trojan circuitry has been shown to be capable of shutting down the main processor after a random number of cycles, broadcasting sensitive information over the bus, and bypassing software authentication mechanisms. In this work, we propose an architecture that can prevent information leakage due to such malicious hardware. Our technique is based on guaranteeing certain behavior in the memory system, which is checked by an external guardian core that approves each memory request. By sitting between off-chip memory and the main core, the guardian core can monitor bus activity and verify the compiler-defined correctness of all memory writes. Experimental results on a conventional x86 platform demonstrate that application binaries can be statically re-instrumented to coordinate with the guardian core to monitor off-chip accesses, resulting in less than 60% overhead for the majority of the studied benchmarks.
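
A toy model of the guardian-core check, under the assumption that the re-instrumented binary declares which address ranges each code region may write; the policy table and bus trace below are invented for illustration.

```python
# Guardian-core sketch: approve each bus write against a
# compiler-defined table of writable address ranges per code region.

POLICY = {
    "crypto_kernel": [(0x1000, 0x1FFF)],   # its own scratch buffer
    "logger":        [(0x8000, 0x8FFF)],
}

def guardian_approve(region, addr):
    """Approve a write iff it falls inside the region's declared ranges."""
    return any(lo <= addr <= hi for lo, hi in POLICY.get(region, []))

bus_trace = [
    ("crypto_kernel", 0x1040),   # legitimate write to scratch
    ("crypto_kernel", 0x9000),   # Trojan leaking key material off-range
    ("logger",        0x8010),
]
for region, addr in bus_trace:
    verdict = "ok" if guardian_approve(region, addr) else "BLOCKED"
    print(f"{region:14s} write @ {addr:#06x}: {verdict}")
```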

Research paper thumbnail of Quantifying and coping with parametric variations in 3D-stacked microarchitectures

Variability in device characteristics, i.e., parametric variations, is an important problem for shrinking process technologies. These variations manifest themselves as variations in performance, power consumption, and reliability in the manufactured chips, as well as low yield levels. Their implications on performance and yield are particularly profound in 3D architectures: a defect on even a single layer can render the entire stack useless. In this paper, we show that instead of suffering increased yield losses, we can actually exploit 3D technology to reduce yield losses by intelligently devising the architectures. We take advantage of layer-to-layer variations to reduce yield losses by splitting critical components among multiple layers. Our results indicate that our proposed method achieves a 30.6% lower yield-loss rate compared to the same pipeline implemented on a 2D architecture.
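
The yield argument can be illustrated with a small Monte Carlo model: if a critical path is split evenly across two layers, independent layer-to-layer delay variations average out, tightening the delay distribution. The variation magnitude and cutoff are assumptions, and this simplified model does not reproduce the paper's 30.6% figure.

```python
# Why splitting a critical component across layers can cut yield loss.
import random

random.seed(3)
TRIALS = 20_000
CUTOFF = 1.10   # chip misses its bin if delay exceeds nominal by 10%

def chip_delay(split_3d):
    l1 = random.gauss(1.0, 0.05)   # layer-1 systematic delay factor
    l2 = random.gauss(1.0, 0.05)   # layer-2 systematic delay factor
    if split_3d:
        return (l1 + l2) / 2       # path spans both layers: averages out
    return l1                      # 2D: path sits entirely on one layer

def yield_loss(split_3d):
    fails = sum(chip_delay(split_3d) > CUTOFF for _ in range(TRIALS))
    return fails / TRIALS

print(f"2D yield loss: {yield_loss(False):.2%}")
print(f"3D yield loss: {yield_loss(True):.2%}")
```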
