Abhishek Das | Stanford University
Papers by Abhishek Das
Merrimac uses stream architecture and advanced interconnection networks to give an order of magnitude more performance per unit cost than cluster-based scientific computers built from the same technology. Organizing the computation into streams and exploiting the resulting locality using a register hierarchy enables a stream architecture to reduce the memory bandwidth required by representative applications by an order of magnitude or more. Hence a processing node with a fixed bandwidth (expensive) can support an order of magnitude more arithmetic units (inexpensive). This in turn allows a given level of performance to be achieved with fewer nodes (a 1-PFLOPS machine, for example, with just 8,192 nodes), resulting in greater reliability and simpler system management. We sketch the design of Merrimac, a streaming scientific computer that can be scaled from a $20K 2 TFLOPS workstation to a $20M 2 PFLOPS supercomputer, and present the results of some initial application experiments on this architecture.
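The figures quoted in this abstract can be checked with a little arithmetic. The sketch below uses only the numbers stated above (1 PFLOPS across 8,192 nodes; $20K for 2 TFLOPS; $20M for 2 PFLOPS). The function names are illustrative and do not come from the paper.

```python
def per_node_gflops(total_flops: float, nodes: int) -> float:
    """Average per-node performance, in GFLOPS, needed to reach total_flops."""
    return total_flops / nodes / 1e9


def dollars_per_gflops(cost_dollars: float, total_flops: float) -> float:
    """System cost divided by peak performance, in dollars per GFLOPS."""
    return cost_dollars / (total_flops / 1e9)


# 1 PFLOPS spread over 8,192 nodes (figures from the abstract)
node_rate = per_node_gflops(1e15, 8192)

# The two endpoints of the scaling range quoted in the abstract
workstation = dollars_per_gflops(20e3, 2e12)
supercomputer = dollars_per_gflops(20e6, 2e15)

print(f"{node_rate:.1f} GFLOPS per node")
print(f"${workstation:.0f}/GFLOPS (workstation), "
      f"${supercomputer:.0f}/GFLOPS (supercomputer)")
```

Under these numbers each node must sustain roughly 122 GFLOPS, and both ends of the range come out at the same cost per GFLOPS, so the quoted scaling implies cost growing linearly with performance.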
With the emergence of streaming and multi-core architectures, there is an increasing demand to map parallel algorithms efficiently across all architectures. This paper describes a platform-independent optimization framework called Stream Scheduling that orchestrates parallel execution of bulk computations and data transfers, and allocates storage at multiple levels of a memory hierarchy. By adjusting block sizes and applying software pipelining to bulk operations, it ensures that the computation-to-communication ratio is maximized at each level. We evaluate our framework on a diverse set of Sequoia applications, targeting systems with different memory hierarchies: a Cell blade, a distributed-memory cluster, and the Cell blade attached to a disk.
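As a rough illustration of why adjusting block sizes raises the computation-to-communication ratio, the sketch below uses blocked matrix multiplication as a stand-in workload. It is not the Stream Scheduling algorithm from the paper: the cost model (2b^3 flops per 3b^2 words moved for a b x b block), the double-buffering assumption, the memory capacities, and all names are hypothetical.

```python
def compute_to_comm_ratio(b: int) -> float:
    """Flops per word transferred for one b x b block multiply-accumulate.

    Assumed model: 2*b**3 flops; two b x b input blocks are loaded and one
    b x b output block is written, so 3*b**2 words move.
    """
    return (2 * b**3) / (3 * b**2)


def largest_block(capacity_words: int, buffers: int = 2) -> int:
    """Largest b such that `buffers` copies of three b x b blocks fit.

    buffers=2 models double buffering: one set of blocks is computed on
    while the next set is being transferred (software pipelining).
    """
    if capacity_words < buffers * 3:
        raise ValueError("capacity too small for even a 1 x 1 block")
    b = 1
    while buffers * 3 * (b + 1) ** 2 <= capacity_words:
        b += 1
    return b


# Hypothetical local-memory sizes, in words, for three levels of a hierarchy
for capacity in (4_096, 65_536, 1_048_576):
    b = largest_block(capacity)
    print(capacity, b, round(compute_to_comm_ratio(b), 1))
```

In this simplified model the ratio grows linearly with the block edge b, so the best block at each level is the largest one that still leaves room for the buffers needed to overlap transfers with computation.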
ACM Sigarch Computer Architecture News, 2004
ACM Sigarch Computer Architecture News, 2004
IEEE Computer Architecture Letters, 2007
IEEE Transactions on Information Forensics and Security, 2008
An increasing concern amongst designers and integrators of military and defense-related systems is the underlying security of the individual microprocessor components that make up these systems. Malicious circuitry can be inserted and hidden at several stages of the design process through the use of third-party Intellectual Property (IP), design tools, and manufacturing facilities. Such hardware Trojan circuitry has been shown to be capable of shutting down the main processor after a random number of cycles, broadcasting sensitive information over the bus, and bypassing software authentication mechanisms. In this work, we propose an architecture that can prevent information leakage due to such malicious hardware. Our technique is based on guaranteeing certain behavior in the memory system, which will be checked at an external guardian core that "approves" each memory request. By sitting between off-chip memory and the main core, the guardian core can monitor bus activity and verify the compiler-defined correctness of all memory writes. Experimental results on a conventional x86 platform demonstrate that application binaries can be statically re-instrumented to coordinate with the guardian core to monitor off-chip access, resulting in less than 60% overhead for the majority of the studied benchmarks.