Richard (Rich) Vuduc | Georgia Institute of Technology

Papers by Richard (Rich) Vuduc

On Statistical Models in Automatic Tuning

Modeling and Analysis for Performance and Power

Abstract: Accurately modeling application performance for specific architectures allows us to understand and analyze the impact of various architectural features on performance, which will ultimately lead to improved performance and better architecture design choices for efficiency and scalability on future systems.

What GPU Computing Means for High-End Systems

Abstract: This column examines how GPU computing might affect the architecture of future exascale supercomputers. Specifically, the authors argue that a system with slower but better-balanced processors might yield higher performance and consume less energy than a system with very fast but imbalanced processors.

Motion Tracking Using Snakes and Dynamic Programming

The active contour models, or "snakes," algorithm is a well-known technique for finding contours of objects in images. In this paper, we implement a version of snakes using the dynamic programming approach suggested by Amini et al. [1] to address the problem of motion tracking. Our implementation effectively and efficiently finds object boundaries, but meets with only moderate success when applied to motion tracking. The primary difficulty is that the snakes gradually lose their "grasp" of the edges as objects move away.
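The dynamic-programming step can be sketched as follows: each contour point chooses among a small set of candidate positions, and a recurrence minimizes an external image energy plus an internal smoothness penalty between neighbors. This is a minimal illustrative sketch, not the paper's implementation; the candidate sets, energy values, and `alpha` weight below are invented for demonstration.

```python
def track_contour_dp(candidates, alpha=1.0):
    """Pick one candidate position per contour point so that the total
    energy (external image energy + alpha * squared distance between
    neighboring points) is minimized, via dynamic programming.

    candidates[i] is a list of (position, external_energy) pairs;
    every point is assumed to have the same number of candidates."""
    n = len(candidates)
    K = len(candidates[0])
    INF = float("inf")
    # cost[i][k]: best total energy for points 0..i with point i at candidate k
    cost = [[INF] * K for _ in range(n)]
    back = [[0] * K for _ in range(n)]
    for k, (_, e_ext) in enumerate(candidates[0]):
        cost[0][k] = e_ext
    for i in range(1, n):
        for k, (pos, e_ext) in enumerate(candidates[i]):
            for j, (prev_pos, _) in enumerate(candidates[i - 1]):
                d2 = (pos[0] - prev_pos[0]) ** 2 + (pos[1] - prev_pos[1]) ** 2
                c = cost[i - 1][j] + alpha * d2 + e_ext
                if c < cost[i][k]:
                    cost[i][k] = c
                    back[i][k] = j
    # Backtrack the optimal assignment.
    k = min(range(K), key=lambda k: cost[n - 1][k])
    path = [k]
    for i in range(n - 1, 0, -1):
        k = back[i][k]
        path.append(k)
    path.reverse()
    return [candidates[i][path[i]][0] for i in range(n)]
```

With K candidates per point, the search costs O(n·K²), which is what makes the DP formulation tractable compared with exhaustive search over all K^n contours.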

Understanding the design trade-offs among current multicore systems for numerical computations

Abstract: In this paper, we empirically evaluate fundamental design trade-offs among the most recent multicore processors and accelerator technologies. Our primary aim is to aid application designers in better mapping their software to the most suitable architecture, with an additional goal of influencing future computing system design.

Numerical Algorithms with Tunable Parallelism

Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)

Abstract: General-purpose graphics processing units (GPGPU) have emerged as an important class of shared memory parallel processing architectures, with widespread deployment in every computer class from high-end supercomputers to embedded mobile platforms. Relative to more traditional multicore systems of today, GPGPUs have distinctly higher degrees of hardware multithreading (hundreds of hardware thread contexts vs. tens), a return to wide vector units (several tens vs. …).

School of Computational Science and Engineering (CSE)

The School of Computational Science and Engineering (CSE) division was established in 2005 to strengthen and better reflect the critical role that computation plays in the science and engineering disciplines at Georgia Tech and in the broader technology community. The division is currently developing programs that immerse students both in computing and in important computational problems within specific domain contexts.

An Investigation Of The Possible Enhancement Of Nuclear Superfluorescence Through Crystalline And Hyperfine Interaction Effects

Improving distributed memory applications testing by message perturbation

Abstract: We present initial work on perturbation techniques that cause the manifestation of timing-related bugs in distributed memory Message Passing Interface (MPI)-based applications. These techniques improve the coverage of possible message orderings in MPI applications that rely on nondeterministic point-to-point communication and work with small processor counts to alleviate the need to test at larger scales.
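The coverage problem can be made concrete by modeling a wildcard receive: messages from different senders may be delivered in any interleaving. The sketch below enumerates orderings exhaustively for a tiny example (the perturbation techniques described above instead sample orderings randomly, which scales to real runs); the receiver and its ordering bug are invented for illustration.

```python
import itertools

def all_orderings(messages):
    """Every delivery order a wildcard (MPI_ANY_SOURCE-style) receive
    could legally observe for these messages."""
    return list(itertools.permutations(messages))

def buggy_receiver(delivered):
    """A receiver that silently assumes the 'init' message arrives before
    any 'data' message -- a classic ordering assumption that only fails
    under unlucky timing."""
    state = None
    for tag, payload in delivered:
        if tag == "init":
            state = payload
        elif tag == "data":
            if state is None:
                return "BUG: data before init"
            state += payload
    return state
```

For three messages there are only six orderings, but the bug manifests in four of them; without perturbation, a tester who always observes the "natural" init-first ordering would never see it.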

Statistical models for empirical search-based performance tuning

Abstract: Achieving peak performance from the computational kernels that dominate application performance often requires extensive machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementations of a kernel, and (2) selecting the fastest implementation by a combination of heuristic modeling, heuristic pruning, and empirical search (i.e., actually running the code).
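The generate-and-select loop can be illustrated with a toy autotuner that searches over the tile size of a blocked matrix multiply. The kernel and search space here are invented for illustration, not taken from the paper, and the sketch does pure empirical search with no modeling or pruning.

```python
import random
import time

def matmul_blocked(A, B, bs):
    """Blocked (tiled) dense matrix multiply; bs is the tile size being tuned."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for i in range(ii, min(ii + bs, n)):
                for k in range(kk, min(kk + bs, n)):
                    a = A[i][k]
                    Bk = B[k]
                    Ci = C[i]
                    for j in range(n):
                        Ci[j] += a * Bk[j]
    return C

def autotune(n, block_sizes):
    """Empirical search: time every candidate implementation on a sample
    input and keep the fastest (candidates here differ only in tile size)."""
    random.seed(0)
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]
    best_bs, best_t = None, float("inf")
    for bs in block_sizes:
        t0 = time.perf_counter()
        matmul_blocked(A, B, bs)
        elapsed = time.perf_counter() - t0
        if elapsed < best_t:
            best_bs, best_t = bs, elapsed
    return best_bs
```

Real autotuners shrink the candidate set with heuristic models before timing anything, precisely because exhaustively running every variant, as above, quickly becomes the dominant cost.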

Automatic performance tuning of sparse matrix kernels

This dissertation presents an automated system to generate highly efficient, platform-adapted implementations of sparse matrix kernels. These computational kernels lie at the heart of diverse applications in scientific computing, engineering, economic modeling, and information retrieval, to name a few. Informally, sparse kernels are computational operations on matrices whose entries are mostly zero, so that operations with and storage of these zero elements may be eliminated. The challenge in developing high-performance implementations of such kernels is choosing the data structure and code that best exploit the structural properties of the matrix (generally unknown until application run time) for high performance on the underlying machine architecture (e.g., memory hierarchy configuration and CPU pipeline structure). We show that conventional implementations of important sparse kernels like sparse matrix-vector multiply (SpMV) have historically run at 10% or less of peak machine speed on cache-based superscalar architectures. Our implementations of SpMV, automatically tuned using a methodology based on empirical search, can by contrast achieve up to 31% of peak machine speed, and can be up to 4× faster.
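To make the kernel concrete, here is the textbook baseline: SpMV over the compressed sparse row (CSR) format. This is only the untuned reference loop, not the dissertation's tuned variants (which specialize it, e.g., with register blocking).

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A*x for a sparse matrix stored in CSR form.
    values[k] is the k-th nonzero, col_idx[k] its column, and
    row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # One multiply-add per nonzero, but also one indexed
            # (irregular) load of x -- hence the low compute-to-memory
            # ratio that makes untuned SpMV memory-bandwidth bound.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

Each nonzero costs two flops but moves roughly 12 bytes (an 8-byte value plus a 4-byte column index), which is why data-structure choices that shrink index overhead matter so much here.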

Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures

Abstract: This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multi-core systems. We consider single- and double-precision arithmetic with numerous performance enhancements, including low-level tuning, numerical approximation, data structure transformations, OpenMP parallelization, and algorithmic tuning.

Diagnosis, tuning, and redesign for multicore performance: A case study of the fast multipole method

Abstract: Given a program and a multisocket, multicore system, what is the process by which one understands and improves its performance and scalability? We describe an approach in the context of improving within-node scalability of the fast multipole method (FMM). Our process consists of a systematic sequence of modeling, analysis, and tuning steps, beginning with simple models and gradually increasing their complexity in the quest for deeper performance understanding and better scalability.

Architectural Visualization of C/C++ Source Code for Program Comprehension

Abstract: Structural and behavioral visualization of large-scale legacy systems to aid program comprehension is still a major challenge. The challenge is even greater when applications are implemented in flexible and expressive languages such as C and C++. In this paper, we consider visualization of static and dynamic aspects of large-scale scientific C/C++ applications. For our investigation, we reuse and integrate specialized analysis and visualization tools.

Superfluorescence in the presence of inhomogeneous broadening and relaxation

Abstract: In this paper we show how inhomogeneous broadening produces dephasing, inhibits cooperative emission, and thus reduces the intensity of the superfluorescence (SF) pulse. We also show how electronic relaxation or time-dependent hyperfine interactions can mollify the effect of inhomogeneous broadening so that SF can be recovered.

Comprehending Software Architecture using a Single-View Visualization

Abstract: Software is among the most complex human artifacts, and visualization is widely acknowledged as important to understanding software. In this paper, we consider the problem of understanding a software system's architecture through visualization.

More automatic assembly of highly tuned code fragments

Abstract: We compare two procedures for automatically building a fast library routine from a set of code fragments. These code fragments solve the same problem (i.e., have identical inputs and outputs), but we assume that each fragment has been performance-tuned for different kinds of inputs. Suppose we are given a sampling of possible inputs, and the execution times of each algorithm on each sample input. Then, our procedure builds a single library routine with static rules for quickly selecting a code fragment.
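A minimal sketch of deriving such a static selection rule from sampled timings: the nearest-sample rule below is one simple instance, chosen for illustration only (the paper compares more sophisticated procedures), and the fragment names, feature (input size), and timings are invented.

```python
def build_dispatch(samples, timings):
    """Derive a static dispatch rule from empirical data.

    samples: list of sampled input sizes (the feature we dispatch on).
    timings: dict mapping fragment name -> list of measured times,
             one per sample, aligned with `samples`."""
    # For each sampled input, record which fragment was fastest.
    winners = []
    for i, size in enumerate(samples):
        best = min(timings, key=lambda name: timings[name][i])
        winners.append((size, best))
    winners.sort()

    def dispatch(size):
        # Static rule: reuse the winner of the nearest sampled size.
        _, best = min(winners, key=lambda w: abs(w[0] - size))
        return best

    return dispatch
```

The key property is that `dispatch` does no timing at call time: all measurement cost is paid once, offline, and the library routine only evaluates a cheap rule on the input's features.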

Sparsity: Optimization framework for sparse matrix kernels

Abstract: Sparse matrix–vector multiplication is an important computational kernel that performs poorly on most modern processors due to a low compute-to-memory ratio and irregular memory access patterns. Optimization is difficult because of the complexity of cache-based memory systems and because performance is highly dependent on the non-zero structure of the matrix.
