Hyesoon Kim - Academia.edu
Papers by Hyesoon Kim
arXiv (Cornell University), Feb 27, 2020
MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021
Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020
ACM Transactions on Architecture and Code Optimization, 2018
To exploit the parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques used in traditional GPU systems to hide memory latency and improve thread-level parallelism (TLP), memory interleaving and thread block scheduling, are at odds with efficient use of multiple GPUs. Distributing data across multiple GPUs to improve overall memory bandwidth utilization incurs high remote traffic when the data and compute are misaligned. Nondeterministic thread block scheduling to improve compute resource utilization impedes co-placement of compute and data. Our goal in this work is to enable co-placement of compute and data in the presence of fine-grained interleaved memory with a low-cost approach. To this end, we propose a mechanism that identifies exclusively accessed data and places that data, along with the thread block that accesses it, in the same GPU. The key ideas are (1) the amount of data exclusively ...
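A minimal Python sketch of the placement idea described in this abstract (all names and parameters here are hypothetical; the paper's actual mechanism is a hardware scheme): under fine-grained round-robin interleaving, each thread block is scheduled to the GPU that homes the memory chunks the block accesses exclusively.

```python
# Sketch (hypothetical names): co-placing thread blocks with the memory
# chunks they exclusively access, under fine-grained interleaving.
from collections import defaultdict

NUM_GPUS = 4
CHUNK_SIZE = 256  # interleaving granularity in bytes (assumed)

def home_gpu(address):
    """Fine-grained interleaving: chunks are striped round-robin across GPUs."""
    return (address // CHUNK_SIZE) % NUM_GPUS

def schedule_blocks(block_accesses):
    """block_accesses: dict of thread-block id -> set of addresses it touches.
    Returns a dict of thread-block id -> GPU the block should run on."""
    # Find chunks touched by exactly one thread block ("exclusively accessed").
    chunk_readers = defaultdict(set)
    for block, addrs in block_accesses.items():
        for addr in addrs:
            chunk_readers[addr // CHUNK_SIZE].add(block)

    placement = {}
    for block, addrs in block_accesses.items():
        # Count, per GPU, the chunks this block accesses exclusively.
        votes = defaultdict(int)
        for addr in addrs:
            if chunk_readers[addr // CHUNK_SIZE] == {block}:
                votes[home_gpu(addr)] += 1
        # Run the block where most of its exclusive data lives;
        # fall back to round-robin if it has no exclusive data.
        placement[block] = (max(votes, key=votes.get)
                            if votes else block % NUM_GPUS)
    return placement

# Example: block 0 exclusively touches a chunk homed on GPU 2,
# so it is placed there; block 1 splits between GPUs 0 and 1.
accesses = {0: {2 * CHUNK_SIZE, 2 * CHUNK_SIZE + 8}, 1: {0, CHUNK_SIZE}}
print(schedule_blocks(accesses))  # e.g. {0: 2, 1: 0}
```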
38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05)
Proceedings of the 36th Annual International Symposium on Computer Architecture, 2009
Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010
2008 IEEE 14th International Symposium on High Performance Computer Architecture, 2008
16th Symposium on Computer Architecture and High Performance Computing
IEEE Computer Architecture Letters, 2012
2007 IEEE 13th International Symposium on High Performance Computer Architecture, 2007
ACM SIGARCH Computer Architecture News, 2005
Runahead execution is a technique that improves processor performance by pre-executing the running application instead of stalling the processor when a long-latency cache miss occurs. Previous research has shown that this technique significantly improves processor performance. However, the efficiency of runahead execution, which directly affects the dynamic energy consumed by a runahead processor, has not been explored. A runahead processor executes significantly more instructions than a traditional out-of-order processor, sometimes without providing any performance benefit, which makes it inefficient. In this paper, we describe the causes of inefficiency in runahead execution and propose techniques to make a runahead processor more efficient, thereby reducing its energy consumption and possibly increasing its performance. Our analyses and results provide two major insights: (1) the efficiency of runahead execution can be greatly improved with simple techniques that reduce the number...
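One plausible way to quantify the efficiency this abstract refers to (a sketch based on the abstract's framing, not necessarily the paper's exact definition) is the performance gained per unit of extra work: percent IPC improvement divided by percent increase in executed instructions.

```python
# Sketch of an efficiency metric for runahead execution: compare the
# speedup gained against the extra instructions executed.

def runahead_efficiency(base_ipc, runahead_ipc, base_insts, runahead_insts):
    """Return percent performance gained per percent of extra instructions.

    base_*     : IPC and executed-instruction count without runahead
    runahead_* : the same measurements with runahead execution enabled
    """
    perf_gain = (runahead_ipc - base_ipc) / base_ipc          # e.g. 0.20 = +20%
    extra_work = (runahead_insts - base_insts) / base_insts   # e.g. 0.25 = +25%
    if extra_work == 0:
        return float("inf")  # speedup with no extra instructions
    return perf_gain / extra_work

# Example: +22% IPC for +27% executed instructions gives efficiency < 1,
# i.e. the processor does proportionally more extra work than it gains.
print(runahead_efficiency(1.00, 1.22, 100e9, 127e9))  # ~0.81
```

By this measure, techniques that cut useless runahead instructions raise efficiency even when they leave IPC unchanged, which matches the abstract's emphasis on reducing instruction count.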
2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014
ACM Transactions on Design Automation of Electronic Systems, 2014
Graphics Processing Units (GPUs) are very popular for both graphics and general-purpose applications. Since GPUs operate many processing units and manage multiple levels of memory hierarchy, they consume a significant amount of power. Although several power models for CPUs are available, the power consumption of GPUs has received little study. In this article we develop a new power model for GPUs by utilizing McPAT, a CPU power tool. We generate initial power model data from McPAT with a detailed GPU configuration, and then adjust the models by comparing them with empirical data. We use NVIDIA's Fermi architecture for building the power model, and our model estimates GPU power consumption with an average error of 7.7% for the microbenchmarks and 12.8% for the Merge benchmarks.
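A minimal sketch of the calibration step described above, with hypothetical numbers and a simple linear scaling (the article's actual adjustment may be more detailed): fit a scale factor that maps McPAT's initial estimates onto measured power, then report the adjusted model's average percentage error.

```python
# Sketch: adjust initial McPAT power estimates against empirical data,
# then compute the model's average (percentage) error.

def calibrate(mcpat_watts, measured_watts):
    """Least-squares scale factor mapping McPAT estimates to measurements."""
    num = sum(m * e for m, e in zip(mcpat_watts, measured_watts))
    den = sum(m * m for m in mcpat_watts)
    return num / den

def avg_error(estimates, measured):
    """Mean absolute percentage error of the model."""
    return sum(abs(e - m) / m for e, m in zip(estimates, measured)) / len(measured)

# Hypothetical numbers: McPAT under a Fermi-like configuration vs. a power meter.
mcpat    = [95.0, 110.0, 130.0, 150.0]
measured = [105.0, 125.0, 138.0, 170.0]

scale = calibrate(mcpat, measured)
adjusted = [scale * p for p in mcpat]
print(f"scale = {scale:.3f}, avg error = {100 * avg_error(adjusted, measured):.1f}%")
```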
Synthesis Lectures on Computer Architecture, 2012
ACM SIGPLAN Notices, 2012