Abhishek Das | Stanford University

Papers by Abhishek Das

Merrimac: Supercomputing with Streams

Merrimac uses stream architecture and advanced interconnection networks to give an order of magnitude more performance per unit cost than cluster-based scientific computers built from the same technology. Organizing the computation into streams and exploiting the resulting locality using a register hierarchy enables a stream architecture to reduce the memory bandwidth required by representative applications by an order of magnitude or more. Hence a processing node with a fixed bandwidth (expensive) can support an order of magnitude more arithmetic units (inexpensive). This in turn allows a given level of performance to be achieved with fewer nodes (a 1-PFLOPS machine, for example, with just 8,192 nodes), resulting in greater reliability and simpler system management. We sketch the design of Merrimac, a streaming scientific computer that can be scaled from a $20K 2-TFLOPS workstation to a $20M 2-PFLOPS supercomputer, and present the results of some initial application experiments on this architecture.
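
The bandwidth-versus-arithmetic tradeoff in the abstract can be sketched with a simple roofline-style calculation (the numbers below are illustrative assumptions, not Merrimac's actual specifications): when stream locality raises the arithmetic intensity of off-chip traffic, the same fixed memory bandwidth sustains far more arithmetic.

```python
# Roofline-style sketch: attainable throughput is capped either by the
# arithmetic units or by memory bandwidth times arithmetic intensity.
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    return min(peak_gflops, flops_per_byte * bandwidth_gb_s)

# Without stream locality: ~0.5 FLOP per byte fetched from DRAM.
# With a register hierarchy capturing producer-consumer locality, each
# byte of off-chip traffic supports ~10x more arithmetic (assumed ratio).
naive = attainable_gflops(128.0, 64.0, 0.5)    # bandwidth-bound: 32 GFLOPS
streamed = attainable_gflops(128.0, 64.0, 5.0) # compute-bound: 128 GFLOPS
print(naive, streamed)
```

With locality captured in registers, the node saturates its arithmetic units instead of its memory pins, which is why cheap ALUs can be multiplied behind an expensive fixed-bandwidth interface.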

Stream Scheduling: A Framework to Manage Bulk Operations in a Memory Hierarchy

Compiling for stream processing

Stream Scheduling: A Framework to Manage Bulk Operations in Memory Hierarchies

With the emergence of streaming and multi-core architectures, there is an increasing demand to map parallel algorithms efficiently across all such architectures. This paper describes a platform-independent optimization framework called Stream Scheduling that orchestrates the parallel execution of bulk computations and data transfers, and allocates storage at multiple levels of a memory hierarchy. By adjusting block sizes and applying software pipelining to bulk operations, it ensures that the computation-to-communication ratio is maximized at each level. We evaluate our framework on a diverse set of Sequoia applications, targeting systems with different memory hierarchies: a Cell blade, a distributed-memory cluster, and the Cell blade attached to a disk.
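
The software-pipelining idea behind this framework can be illustrated with a minimal double-buffering sketch (the function names and block structure below are assumptions for illustration, not the paper's actual interface): the bulk transfer of the next block overlaps the computation on the current one.

```python
# Double-buffered sketch of software pipelining over bulk operations:
# load(block) stands in for a bulk DMA transfer; compute(buf) stands in
# for a kernel run on the transferred data.
def process_blocks(data, block_size, load, compute):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    results = []
    buf = load(blocks[0])             # prologue: fetch the first block
    for nxt in blocks[1:]:
        prefetched = load(nxt)        # transfer of the next block...
        results.append(compute(buf))  # ...overlaps compute on the current one
        buf = prefetched
    results.append(compute(buf))      # epilogue: drain the last block
    return results

out = process_blocks(list(range(8)), 4, load=list, compute=sum)
print(out)  # [6, 22]
```

On real hardware the loads would be asynchronous; picking `block_size` to fill each level of the hierarchy while keeping both buffers resident is exactly the block-size tuning the abstract describes.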

Stream Processors: Programmability and Efficiency

Evaluating voltage islands in CMPs under process variations

Evaluating the Imagine Stream Architecture

ACM SIGARCH Computer Architecture News, 2004

Microarchitectures for Managing Chip Revenues under Process Variations

IEEE Computer Architecture Letters, 2007

An FPGA-Based Network Intrusion Detection Architecture

IEEE Transactions on Information Forensics and Security, 2008

Evaluating the effects of cache redundancy on profit

Detecting/preventing information leakage on the memory bus due to malicious hardware

An increasing concern amongst designers and integrators of military and defense-related systems is the underlying security of the individual microprocessor components that make up these systems. Malicious circuitry can be inserted and hidden at several stages of the design process through the use of third-party Intellectual Property (IP), design tools, and manufacturing facilities. Such hardware Trojan circuitry has been shown to be capable of shutting down the main processor after a random number of cycles, broadcasting sensitive information over the bus, and bypassing software authentication mechanisms. In this work, we propose an architecture that can prevent information leakage due to such malicious hardware. Our technique is based on guaranteeing certain behavior in the memory system, which will be checked at an external guardian core that "approves" each memory request. By sitting between off-chip memory and the main core, the guardian core can monitor bus activity and verify the compiler-defined correctness of all memory writes. Experimental results on a conventional x86 platform demonstrate that application binaries can be statically re-instrumented to coordinate with the guardian core to monitor off-chip access, resulting in less than 60% overhead for the majority of the studied benchmarks.
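
The guardian-core approval step can be sketched at a high level (the class, method names, and whitelist-style policy below are illustrative assumptions, not the paper's actual interface): an external checker on the memory bus approves each write against a compiler-defined description of where the program is allowed to write.

```python
# Illustrative sketch of the guardian-core idea: a checker between the
# main core and off-chip memory approves each memory write against a
# compiler-defined policy (modeled here as writable address ranges).
class GuardianCore:
    def __init__(self, writable_ranges):
        # [(lo, hi)] half-open ranges, derived from the instrumented binary.
        self.writable_ranges = writable_ranges

    def approve_write(self, addr):
        # A write is approved only if it falls in a compiler-declared range;
        # anything else is treated as potential Trojan-driven leakage.
        return any(lo <= addr < hi for lo, hi in self.writable_ranges)

guard = GuardianCore([(0x1000, 0x2000)])
print(guard.approve_write(0x1800))  # declared write region: approved
print(guard.approve_write(0x8000))  # unexpected bus write: blocked/flagged
```

A real design would check richer compiler-defined invariants than address ranges, but the structure is the same: the guardian sees every bus transaction, so malicious on-chip hardware cannot exfiltrate data over the memory bus without being flagged.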

Quantifying and coping with parametric variations in 3D-stacked microarchitectures
