Hyesoon Kim - Academia.edu

Papers by Hyesoon Kim

Design Space Exploration of On-chip Ring Interconnection for a CPU-GPU Architecture

Memory-level and Thread-level Parallelism Aware GPU Architecture Performance Analytical Model

Vortex: OpenCL Compatible RISC-V GPGPU

arXiv (Cornell University), Feb 27, 2020

Vortex: Extending the RISC-V ISA for GPGPU and 3D-Graphics

MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021

Batch-Aware Unified Memory Management in GPUs for Irregular Workloads

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020

CODA: Enabling Co-location of Computation and Data for Multiple GPU Systems

ACM Transactions on Architecture and Code Optimization, 2018

To exploit the parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques used in traditional GPU systems to hide memory latency and improve thread-level parallelism (TLP), memory interleaving and thread block scheduling, are at odds with efficient use of multiple GPUs. Distributing data across multiple GPUs to improve overall memory bandwidth utilization incurs high remote traffic when data and compute are misaligned. Nondeterministic thread block scheduling to improve compute resource utilization impedes co-placement of compute and data. Our goal in this work is to enable co-placement of compute and data in the presence of fine-grained interleaved memory with a low-cost approach. To this end, we propose a mechanism that identifies exclusively accessed data and places that data, along with the thread block that accesses it, in the same GPU. The key ideas are (1) the amount of data exclusively ...
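A minimal sketch of the co-placement idea the abstract describes, under illustrative assumptions (the modulo mapping, page granularity, and function names are hypothetical, not the paper's mechanism): a thread block and any page it exclusively accesses resolve to the same GPU because both placements derive from the same index function, while shared pages fall back to fine-grained interleaving.

```python
# Hypothetical sketch, not the paper's implementation: derive both the
# thread-block placement and the placement of its exclusively accessed
# pages from the same index function, so compute and data are co-located.

NUM_GPUS = 4
PAGE_SIZE = 4096  # bytes; illustrative placement granularity

def gpu_for_block(block_id: int) -> int:
    """Deterministic thread-block scheduling: block -> GPU."""
    return block_id % NUM_GPUS

def gpu_for_page(addr: int, exclusive_block: int | None) -> int:
    """Place exclusively accessed pages on the owning block's GPU;
    fall back to fine-grained interleaving for shared pages."""
    if exclusive_block is not None:
        return gpu_for_block(exclusive_block)   # co-placement
    return (addr // PAGE_SIZE) % NUM_GPUS       # interleave shared data

# A block and its private page always resolve to the same GPU:
assert gpu_for_page(0x10000, exclusive_block=7) == gpu_for_block(7)
```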

Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns

38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05)

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Proceedings of the 36th Annual International Symposium on Computer Architecture, 2009

An integrated GPU power and performance model

Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010

Performance-aware speculation control using wrong path usefulness prediction

2008 IEEE 14th International Symposium on High Performance Computer Architecture, 2008

Cache Filtering Techniques to Reduce the Negative Impact of Useless Speculative Memory References on Processor Performance

16th Symposium on Computer Architecture and High Performance Computing

DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function

IEEE Computer Architecture Letters, 2012

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

2007 IEEE 13th International Symposium on High Performance Computer Architecture, 2007

Techniques for Efficient Processing in Runahead Execution Engines

ACM SIGARCH Computer Architecture News, 2005

Runahead execution is a technique that improves processor performance by pre-executing the running application instead of stalling the processor when a long-latency cache miss occurs. Previous research has shown that this technique significantly improves processor performance. However, the efficiency of runahead execution, which directly affects the dynamic energy consumed by a runahead processor, has not been explored. A runahead processor executes significantly more instructions than a traditional out-of-order processor, sometimes without providing any performance benefit, which makes it inefficient. In this paper, we describe the causes of inefficiency in runahead execution and propose techniques to make a runahead processor more efficient, thereby reducing its energy consumption and possibly increasing its performance. Our analyses and results provide two major insights: (1) the efficiency of runahead execution can be greatly improved with simple techniques that reduce the number...
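A toy model of the behavior the abstract describes, under stated assumptions (the one-instruction-per-cycle trace and the fixed miss latency are illustrative simplifications, not the paper's simulator): on a long-latency miss the core keeps fetching speculative "runahead" instructions whose results are discarded, and the efficiency question is how much extra work this adds per committed instruction.

```python
# Toy sketch of runahead execution's efficiency cost; all parameters
# are hypothetical. Each trace entry is (instruction, is_long_miss).

from dataclasses import dataclass

@dataclass
class RunaheadStats:
    normal_insts: int = 0
    runahead_insts: int = 0  # speculatively pre-executed, then discarded

    @property
    def overhead(self) -> float:
        """Extra (runahead) instructions per committed instruction."""
        return self.runahead_insts / max(self.normal_insts, 1)

def run(trace, miss_latency: int = 200) -> RunaheadStats:
    stats = RunaheadStats()
    runahead_until = -1  # cycle at which the blocking miss returns
    for cycle, (_inst, is_long_miss) in enumerate(trace):
        if cycle < runahead_until:
            stats.runahead_insts += 1  # runahead mode: result thrown away
        else:
            stats.normal_insts += 1
            if is_long_miss:
                runahead_until = cycle + miss_latency  # enter runahead
    return stats
```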

TBPoint: Reducing Simulation Time for Large-Scale GPGPU Kernels

2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

A Power-Performance Analysis of Memory-intensive Parallel Applications on a Manycore Platform

Evaluating scalability of multi-threaded applications on a many-core platform

Power Modeling for GPU Architectures Using McPAT

ACM Transactions on Design Automation of Electronic Systems, 2014

Graphics Processing Units (GPUs) are very popular for both graphics and general-purpose applications. Since GPUs operate many processing units and manage multiple levels of the memory hierarchy, they consume a significant amount of power. Although several power models for CPUs are available, the power consumption of GPUs has not yet been studied extensively. In this article we develop a new power model for GPUs by utilizing McPAT, a CPU power tool. We generate initial power model data from McPAT with a detailed GPU configuration, and then adjust the models by comparing them with empirical data. We use NVIDIA's Fermi architecture for building the power model, and our model estimates GPU power consumption with an average error of 7.7% and 12.8% for the microbenchmarks and Merge benchmarks, respectively.
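A small sketch of the calibration step the abstract outlines (the component breakdown, numbers, and least-squares fit are illustrative assumptions, not the paper's method): take per-component power estimates from a McPAT run for a GPU-like configuration, fit scale factors against measured power, and report the average relative error.

```python
# Illustrative model-adjustment sketch; all data below is hypothetical.
import numpy as np

# Rows: benchmarks; columns: modeled per-component power (W) from McPAT
# (e.g. cores, memory, interconnect for a detailed GPU configuration).
mcpat_components = np.array([
    [45.0, 30.0, 12.0],   # benchmark A
    [60.0, 25.0, 10.0],   # benchmark B
    [38.0, 40.0, 15.0],   # benchmark C
])
measured_total = np.array([95.0, 102.0, 99.0])  # empirical power (W)

# Least-squares per-component scale factors so the adjusted model
# matches the empirical measurements.
scale, *_ = np.linalg.lstsq(mcpat_components, measured_total, rcond=None)
predicted = mcpat_components @ scale

avg_error = np.mean(np.abs(predicted - measured_total) / measured_total)
print(f"scale factors: {scale}, average error: {avg_error:.1%}")
```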

Performance analysis and tuning for general purpose graphics processing units (GPGPU)

Synthesis Lectures on Computer Architecture, 2012

A performance analysis framework for identifying potential benefits in GPGPU applications

ACM SIGPLAN Notices, 2012
