Rajeev Barua | University of Maryland
Papers by Rajeev Barua
Abstract: Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on memory-level parallelism (MLP). Multi- and many-core processors with shared parallel caches further increase MLP demand. Current cache hierarchies, however, have been unable to keep up with this trend, with modern designs allowing only 4–16 concurrent cache misses.
A hardware method for functional unit assignment is presented, based on the principle that a functional unit's power consumption is approximated by the switching activity of its inputs. Since computing the Hamming distance of the inputs in hardware is expensive, only a portion of the input bits is examined. Integers often have many identical top bits, due to sign extension, and floating-point values often have many zeros in the least significant digits, due to the casting of integer values into floating point, among other reasons.
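The idea of examining only part of each operand can be illustrated with a minimal Python sketch. It is not the paper's hardware method: the function names, the 32-bit width, and the choice of the top four bits are all assumptions made for illustration.

```python
def hamming_distance(a: int, b: int, width: int = 32) -> int:
    """Count the bit positions in which two width-bit values differ."""
    mask = (1 << width) - 1
    return bin((a ^ b) & mask).count("1")


def partial_hamming_distance(a: int, b: int, width: int = 32, top_bits: int = 4) -> int:
    """Hamming distance over only the top `top_bits` bits (hypothetical helper).

    Sign extension tends to make the top bits of an integer all zeros or all
    ones, so a few top bits can act as a cheap proxy for the full distance.
    """
    mask = (1 << width) - 1
    shift = width - top_bits
    # Masking first gives negative Python ints their two's-complement pattern.
    return hamming_distance((a & mask) >> shift, (b & mask) >> shift, top_bits)


if __name__ == "__main__":
    # A negative and a small positive operand differ in every sampled top bit.
    print(partial_hamming_distance(-1, 1))
    # Two small positive operands agree in all sampled top bits.
    print(partial_hamming_distance(3, 5))
```

In this sketch a large partial distance suggests that routing the new operand to a unit that last saw the other value would toggle many input bits, while a distance of zero suggests little switching in the sampled bits.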
Abstract: Binary rewriting software transforms executables, maintaining the original binary's functionality while improving it in one or more metrics, such as runtime performance, energy use, memory use, security, and reliability. Existing static binary rewriters are unable to rewrite binaries that do not contain relocation information, which is typically discarded by linkers unless specifically instructed otherwise.
Abstract: This paper presents the first automatic scheme to allocate local (stack) data in recursive functions to scratch-pad memory (SPM) in embedded systems. A scratch-pad is a fast, directly addressed, compiler-managed SRAM that replaces the hardware-managed cache. It is motivated by its significantly lower access time, energy consumption, real-time bounds, area and overall runtime.
We present a practical tool for inserting security features against low-level software attacks into third-party, proprietary or otherwise binary-only software. We are motivated by the inability of software users to select and use low-overhead protection schemes when source code is unavailable to them, by the lack of information as to what (if any) security mechanisms software producers have used in their toolchains, and by the high overhead and inaccuracy of solutions that treat software as a black box.
Issues in writing a Parallel Compiler starting from a Serial Compiler. Alexandros Tzannes, Rajeev Barua, George C. Caragea, Uzi Vishkin, University of Maryland, College Park. Motivation: "The Free Lunch is Over" [Herb Sutter]; CPU clock speed stopped increasing; dual cores (on chip) are now mainstream and quad cores are around the corner; Intel has a 5-year roadmap for an 80-core teraflop processor.
Abstract: This paper presents a novel compilation system that allows sequential programs, written in C or FORTRAN, to be compiled directly into distributed-memory hardware. This capability is interesting not only because certain applications require the performance of custom hardware, but also because current trends in computer architecture are moving towards more hardware-like substrates. Our system works by successfully combining two resource-efficient computing disciplines: Smart Memories and Virtual Wires.
Abstract: We present Lazy Binary Splitting (LBS), a user-level scheduler of nested parallelism for shared-memory multiprocessors that builds on existing Eager Binary Splitting work-stealing (EBS) implemented in Intel's Threading Building Blocks (TBB), but improves performance and ease of programming. In its simplest form (SP), EBS requires manual tuning by repeatedly running the application under carefully controlled conditions to determine a stop-splitting threshold (sst) for every do-all loop in the code.
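As a rough illustration of the eager scheme described above, the following Python sketch recursively halves a loop's iteration range until each piece is no larger than a stop-splitting threshold. It only lists the resulting chunks sequentially; it does not model work stealing, TBB's API, or LBS itself, and the function name and parameters are hypothetical.

```python
from typing import List, Tuple


def eager_binary_split(lo: int, hi: int, sst: int) -> List[Tuple[int, int]]:
    """Return the leaf chunks of the half-open range [lo, hi).

    The range is halved recursively until a piece holds at most `sst`
    iterations. `sst` stands in for the stop-splitting threshold.
    """
    if sst < 1:
        raise ValueError("sst must be at least 1")
    if hi - lo <= sst:
        return [(lo, hi)]  # small enough: run this chunk serially
    mid = lo + (hi - lo) // 2
    # In a real work-stealing runtime one half would be made available to
    # other workers; here both halves are simply expanded in order.
    return eager_binary_split(lo, mid, sst) + eager_binary_split(mid, hi, sst)


if __name__ == "__main__":
    for chunk in eager_binary_split(0, 10, 3):
        print(chunk)
```

A small threshold yields many tiny chunks and a large one yields few coarse chunks, which is why the abstract describes the threshold as something that needs tuning for every loop.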
Abstract: Explicit Multi-Threading (XMT) is a general-purpose many-core computing platform, with the vision of a 1000-core chip that is easy to program but does not compromise on performance. This paper presents a publicly available tool chain for XMT, complete with a highly configurable cycle-accurate simulator and an optimizing compiler. The XMT tool chain has matured and has been validated to a point where its description merits publication.
Abstract: Cache memories in embedded systems play an important role in reducing the execution time of applications. Various kinds of extensions have been added to cache hardware to enable software involvement in replacement decisions, thus improving the run-time over a purely hardware-managed cache. Novel embedded systems, such as Intel's XScale and ARM Cortex processors, provide the facility of locking one or more lines in the cache; this feature is called cache locking.
Summary: Memory access violations are a leading source of unreliability in C programs. As evidence of this problem, a variety of methods exist that retrofit C with software checks to detect memory errors at runtime. However, these methods generally suffer from one or more drawbacks, including the inability to detect all errors, the use of incompatible metadata, the need for manual code modifications, and high runtime overheads.
Abstract: A binary rewriter is a piece of software that accepts a binary executable program as input, and produces an improved executable as output. This paper describes the first technique in the literature to decompile the input binary into an existing compiler's high-level intermediate form (IR). The compiler's back-end is then used to generate the output binary from the IR. Doing so enables the use of the rich set of compiler analysis and transformation passes available in mature compilers.
Abstract: Today, nearly all general-purpose computers are parallel, but nearly all software running on them is serial. However, bridging this disconnect by manually rewriting source code to be parallel is prohibitively expensive. Automatic parallelization technology is therefore an attractive alternative. We present a method to perform automatic parallelization in a binary rewriter. The input to the binary rewriter is the serial binary executable program and the output is a parallel binary executable.
Abstract: Field-Programmable Gate Arrays (FPGAs) are currently used in two major classes of systems, namely in logic emulation of digital circuits, and in reconfigurable computer systems. This paper reviews the current status of research into both of these areas, especially reconfigurable computers, and comments on their future prospects. Three papers are reviewed in some detail, but several others are also reviewed, and placed in a larger context.
Abstract: This index covers all technical items (papers, correspondence, reviews, etc.) that appeared in this periodical during the year, and items from previous years that were commented upon or corrected in this year. Departments and other items may also be covered if they have been judged to have archival value. The Author Index contains the primary entry for each item, listed under the first author's name.
Abstract: The most radical of the architectures that appear in this issue are Raw processors: highly parallel architectures with hundreds of very simple processors coupled to a small portion of the on-chip memory. Each processor, or tile, also contains a small bank of configurable logic, allowing synthesis of complex operations directly in configurable hardware. Unlike the others, this architecture does not use a traditional instruction set architecture.
Out-of-memory errors are a serious source of unreliability in most embedded systems. Applications run out of main memory because of the frequent difficulty of estimating the memory requirement before deployment, either because it depends on input data, or because certain language features prevent estimation. The typical lack of disks and virtual memory in embedded systems has a serious consequence when an out-of-memory error occurs. Without swap space, the system crashes if its memory footprint exceeds the available memory by even one byte.
This paper presents the first memory allocation scheme for embedded systems having a scratch-pad memory whose size is unknown at compile-time. A scratch-pad memory (SPM) is a fast compiler-managed SRAM that replaces the hardware-managed cache. All existing memory allocation schemes for SPM require the SPM size to be known at compile-time. Unfortunately, because of this constraint, the resulting executable is tied to that size of SPM and is not portable to other processor implementations having a different SPM size. Size-portable code is valuable when programs are downloaded during deployment, either via a network or portable media. Code downloads are used for fixing bugs or for enhancing functionality. The presence of different SPM sizes in different devices is common because of the evolution of VLSI technology over the years. The result is that SPM cannot be used in such situations with downloaded code.