Richard (Rich) Vuduc | Georgia Institute of Technology
Papers by Richard (Rich) Vuduc
Abstract Accurately modeling application performance for specific architectures allows us to understand and analyze the impact of various architectural features on performance, which will ultimately lead to improved performance and better architecture design choices for efficiency and scalability on future systems.
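As a minimal illustration of the kind of model in question (a generic roofline-style bound, assumed here for exposition and not taken from the paper): if a kernel performs W floating-point operations and transfers Q bytes, a machine with peak compute rate P flop/s and memory bandwidth B bytes/s cannot run it faster than T >= max(W/P, Q/B); which of the two terms dominates indicates whether a given architectural feature (say, wider vectors vs. more memory bandwidth) can actually help.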
Abstract This column examines how GPU computing might affect the architecture of future exascale supercomputers. Specifically, the authors argue that a system with slower but better-balanced processors might yield higher performance and consume less energy than a system with very fast but imbalanced processors.
The active contour models or "snakes" algorithm is a well-known technique for finding contours of objects in images. In this paper, we implement a version of snakes using the dynamic programming approach suggested by Amini et al. [1] to address the problem of motion tracking. Our implementation effectively and efficiently finds object boundaries, but meets with only moderate success when applied to motion tracking. The primary difficulty is that the snakes gradually lose their "grasp" of the edges as objects move away.
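Since the abstract only names the method, here is a minimal, hypothetical C++ sketch of one dynamic-programming relaxation step for an open snake, in the spirit of the Amini et al. formulation; the candidate-move set, the tension weight, and the toy external-energy function are illustrative assumptions, not the paper's code.

```cpp
#include <cstdio>
#include <limits>
#include <vector>

struct Pt { double x, y; };

// Toy external energy: a real implementation would use the negative image
// gradient magnitude at (x, y); this stand-in just attracts points to x = 10.
double ext_energy(const Pt& p) { return 0.01 * (p.x - 10.0) * (p.x - 10.0); }

double dist2(const Pt& a, const Pt& b) {
    double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Each control point may move by one of a few candidate offsets; dynamic
// programming picks the joint assignment minimizing external energy plus a
// first-order smoothness (tension) term weighted by alpha.
std::vector<Pt> dp_step(const std::vector<Pt>& s, double alpha) {
    static const Pt offs[5] = {{0,0},{1,0},{-1,0},{0,1},{0,-1}};
    const int n = (int)s.size(), K = 5;
    std::vector<std::vector<double>> cost(n, std::vector<double>(K));
    std::vector<std::vector<int>> from(n, std::vector<int>(K, 0));

    for (int k = 0; k < K; ++k)
        cost[0][k] = ext_energy({s[0].x + offs[k].x, s[0].y + offs[k].y});
    for (int i = 1; i < n; ++i)
        for (int k = 0; k < K; ++k) {
            Pt c{s[i].x + offs[k].x, s[i].y + offs[k].y};
            double best = std::numeric_limits<double>::max();
            for (int j = 0; j < K; ++j) {
                Pt p{s[i-1].x + offs[j].x, s[i-1].y + offs[j].y};
                double e = cost[i-1][j] + alpha * dist2(c, p);
                if (e < best) { best = e; from[i][k] = j; }
            }
            cost[i][k] = best + ext_energy(c);
        }

    int k = 0;  // backtrack from the cheapest final state
    for (int j = 1; j < K; ++j) if (cost[n-1][j] < cost[n-1][k]) k = j;
    std::vector<Pt> out(n);
    for (int i = n - 1; i >= 0; --i) {
        out[i] = {s[i].x + offs[k].x, s[i].y + offs[k].y};
        k = from[i][k];
    }
    return out;
}

int main() {
    std::vector<Pt> snake = {{0, 0}, {2, 0}, {4, 0}, {6, 0}};
    for (int it = 0; it < 20; ++it) snake = dp_step(snake, 0.5);
    for (const Pt& p : snake) std::printf("(%.1f, %.1f)\n", p.x, p.y);
}
```

Iterating such steps until no point moves yields the converged contour; the "losing grasp" failure mode mentioned above arises when objects move farther per frame than the candidate-offset radius can follow.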
Abstract In this paper, we empirically evaluate fundamental design trade-offs among the most recent multicore processors and accelerator technologies. Our primary aim is to aid application designers in better mapping their software to the most suitable architecture, with an additional goal of influencing future computing system design.
Abstract General-purpose graphics processing units (GPGPU) have emerged as an important class of shared memory parallel processing architectures, with widespread deployment in every computer class from high-end supercomputers to embedded mobile platforms. Relative to more traditional multicore systems of today, GPGPUs have distinctly higher degrees of hardware multithreading (hundreds of hardware thread contexts vs. tens), a return to wide vector units (several tens vs. ...).
The School of Computational Science and Engineering (CSE) division was established in 2005 to strengthen and better reflect the critical role that computation plays in the science and engineering disciplines at Georgia Tech and in the broader technology community. The division is currently developing programs that immerse students both in computing and important computational problems within specific domain contexts.
Abstract We present initial work on perturbation techniques that cause the manifestation of timing-related bugs in distributed memory Message Passing Interface (MPI)-based applications. These techniques improve the coverage of possible message orderings in MPI applications that rely on nondeterministic point-to-point communication and work with small processor counts to alleviate the need to test at larger scales.
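The general idea lends itself to a small illustration. Below is a hedged C++/MPI sketch (not the paper's actual tool): a wrapper that injects a random delay before wildcard receives, so that repeated runs exercise different message arrival orders; interposing such wrappers, for example via the PMPI profiling layer, is one assumed deployment route.

```cpp
#include <mpi.h>
#include <cstdlib>
#include <unistd.h>

int perturbed_recv(void* buf, int count, MPI_Datatype type, int src, int tag,
                   MPI_Comm comm, MPI_Status* st) {
    if (src == MPI_ANY_SOURCE)      // only wildcard receives are order-sensitive
        usleep(rand() % 10000);     // random delay of up to 10 ms
    return MPI_Recv(buf, count, type, src, tag, comm, st);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    srand(1234 + rank);             // vary the seed across runs for coverage

    int msg = rank;
    if (rank != 0) {
        MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        for (int i = 1; i < size; ++i) {   // arrival order is nondeterministic
            MPI_Status st;
            perturbed_recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                           MPI_COMM_WORLD, &st);
        }
    }
    MPI_Finalize();
    return 0;
}
```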
Abstract Achieving peak performance from the computational kernels that dominate application performance often requires extensive machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementations of a kernel, and (2) selecting the fastest implementation by a combination of heuristic modeling, heuristic pruning, and empirical search (i.e., actually running the code).
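The two-step recipe can be made concrete with a minimal C++ sketch of the empirical-search half: time every candidate implementation on a representative input and keep the fastest. The candidate variants here are trivial stand-ins, assumed purely for illustration.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

using Kernel = void (*)(const double*, double*, int);

void variant_plain(const double* x, double* y, int n) {
    for (int i = 0; i < n; ++i) y[i] = 2.0 * x[i];
}

void variant_unroll4(const double* x, double* y, int n) {
    int i = 0;
    for (; i + 3 < n; i += 4) {     // 4-way manual unrolling
        y[i]     = 2.0 * x[i];
        y[i + 1] = 2.0 * x[i + 1];
        y[i + 2] = 2.0 * x[i + 2];
        y[i + 3] = 2.0 * x[i + 3];
    }
    for (; i < n; ++i) y[i] = 2.0 * x[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n);
    const Kernel candidates[] = {variant_plain, variant_unroll4};

    int best = 0;
    double best_t = 1e30;
    for (int k = 0; k < 2; ++k) {   // empirical search: run and time each one
        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 100; ++rep) candidates[k](x.data(), y.data(), n);
        double t = std::chrono::duration<double>(
                       std::chrono::steady_clock::now() - t0).count();
        std::printf("candidate %d: %.4f s\n", k, t);
        if (t < best_t) { best_t = t; best = k; }
    }
    std::printf("selected candidate %d\n", best);
}
```

Real autotuners prune this search with heuristic models precisely because exhaustively timing thousands of variants, as this brute-force loop would, quickly becomes expensive.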
This dissertation presents an automated system to generate highly efficient, platform-adapted implementations of sparse matrix kernels. These computational kernels lie at the heart of diverse applications in scientific computing, engineering, economic modeling, and information retrieval, to name a few. Informally, sparse kernels are computational operations on matrices whose entries are mostly zero, so that operations with and storage of these zero elements may be eliminated. The challenge in developing high-performance implementations of such kernels is choosing the data structure and code that best exploit the structural properties of the matrix (generally unknown until application run time) for high performance on the underlying machine architecture (e.g., memory hierarchy configuration and CPU pipeline structure). We show that conventional implementations of important sparse kernels like sparse matrix-vector multiply (SpMV) have historically run at 10% or less of peak machine speed on cache-based superscalar architectures. Our implementations of SpMV, automatically tuned using a methodology based on empirical search, can, by contrast, achieve up to 31% of peak machine speed, and can be up to 4× faster.
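For readers unfamiliar with the kernel in question, here is the textbook compressed sparse row (CSR) SpMV that serves as the conventional baseline; this is the standard formulation, not the dissertation's tuned code, whose variants restructure this loop (e.g., via register blocking).

```cpp
#include <vector>

struct CSR {
    int nrows;
    std::vector<int> rowptr;   // length nrows+1; row i spans [rowptr[i], rowptr[i+1])
    std::vector<int> colidx;   // column index of each stored nonzero
    std::vector<double> val;   // value of each stored nonzero
};

// y = A*x: only the stored nonzeros are visited, so all-zero entries cost nothing.
void spmv(const CSR& A, const double* x, double* y) {
    for (int i = 0; i < A.nrows; ++i) {
        double sum = 0.0;
        for (int k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            sum += A.val[k] * x[A.colidx[k]];   // irregular access into x
        y[i] = sum;
    }
}
```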
Abstract This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multi-core systems. We consider single- and double-precision implementations with numerous performance enhancements, including low-level tuning, numerical approximation, data structure transformations, OpenMP parallelization, and algorithmic tuning.
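Of the enhancements listed, OpenMP parallelization is the simplest to illustrate; the following hedged C++ sketch parallelizes an FMM-style near-field (direct interaction) phase over target boxes. The Box layout and neighbor lists are assumptions for exposition, not the paper's data structures.

```cpp
#include <cmath>
#include <vector>

struct Box {
    std::vector<int> pts;    // indices of particles in this box
    std::vector<int> nbrs;   // indices of neighbor boxes (including itself)
};

void near_field(const std::vector<Box>& boxes,
                const std::vector<double>& x, const std::vector<double>& y,
                const std::vector<double>& q, std::vector<double>& phi) {
    // Each particle belongs to exactly one box, so writes to phi never race.
    #pragma omp parallel for schedule(dynamic)   // boxes carry uneven work
    for (int b = 0; b < (int)boxes.size(); ++b)
        for (int i : boxes[b].pts)               // targets in this box
            for (int nb : boxes[b].nbrs)         // neighbor source boxes
                for (int j : boxes[nb].pts) {
                    if (i == j) continue;
                    double dx = x[i] - x[j], dy = y[i] - y[j];
                    phi[i] += q[j] / std::sqrt(dx * dx + dy * dy);
                }
}
```

The dynamic schedule is one plausible choice here because box populations, and hence per-iteration work, are typically uneven.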
Abstract Given a program and a multisocket, multicore system, what is the process by which one understands and improves its performance and scalability? We describe an approach in the context of improving within-node scalability of the fast multipole method (FMM). Our process consists of a systematic sequence of modeling, analysis, and tuning steps, beginning with simple models, and gradually increasing their complexity in the quest for deeper performance understanding and better scalability.
Abstract Structural and behavioral visualization of large-scale legacy systems to aid program comprehension is still a major challenge. The challenge is even greater when applications are implemented in flexible and expressive languages such as C and C++. In this paper, we consider visualization of static and dynamic aspects of large-scale scientific C/C++ applications. For our investigation, we reuse and integrate specialized analysis and visualization tools.
Abstract In this paper we show how inhomogeneous broadening produces dephasing, inhibits cooperative emission, and thus reduces the intensity of the superfluorescence (SF) pulse. We also show how electronic relaxation or time-dependent hyperfine interactions can mollify the effect of inhomogeneous broadening so that SF can be recovered.
Abstract Software is among the most complex human artifacts, and visualization is widely acknowledged as important to understanding software. In this paper, we consider the problem of understanding a software system's architecture through visualization.
Abstract We compare two procedures for automatically building a fast library routine from a set of code fragments. These code fragments solve the same problem (i.e., have identical inputs and outputs), but we assume that each fragment has been performance-tuned for different kinds of inputs. Suppose we are given a sampling of possible inputs, and the execution times of each algorithm on each sample input. Then, our procedure builds a single library routine with static rules for quickly selecting a code fragment.
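As a concrete (and purely illustrative) picture of the end product, the following C++ sketch shows a library routine that dispatches between two hypothetical tuned fragments using a static, offline-derived rule; the fragments, the size feature, and the threshold value are assumptions for exposition.

```cpp
#include <algorithm>

void sort_small(int* a, int n) {        // fragment tuned for tiny inputs
    for (int i = 1; i < n; ++i)
        for (int j = i; j > 0 && a[j - 1] > a[j]; --j)
            std::swap(a[j - 1], a[j]);
}

void sort_large(int* a, int n) {        // stand-in for a fragment tuned for large inputs
    std::sort(a, a + n);
}

const int kThreshold = 32;              // static rule derived from sample-input timings

void sort_auto(int* a, int n) {
    if (n < kThreshold) sort_small(a, n);   // rule: small inputs -> fragment 1
    else                sort_large(a, n);   // rule: large inputs -> fragment 2
}
```

The point of the comparison in the paper is how such rules are derived from the sample timings; once derived, the rules are static, so dispatch costs only a cheap predicate at call time.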
Abstract Sparse matrix–vector multiplication is an important computational kernel that performs poorly on most modern processors due to a low compute-to-memory ratio and irregular memory access patterns. Optimization is difficult because of the complexity of cache-based memory systems and because performance is highly dependent on the non-zero structure of the matrix.
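To make the low compute-to-memory ratio concrete (a back-of-the-envelope estimate, not a figure from the paper): each stored nonzero in a typical CSR format contributes two flops (a multiply and an add) but forces at least 12 bytes of traffic (an 8-byte value plus a 4-byte column index), before even counting the irregular reads of the source vector. At roughly 2/12 ≈ 0.17 flops per byte, the kernel is memory-bandwidth-bound on virtually all cache-based machines.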