Hyesoon Kim - Academia.edu
Papers by Hyesoon Kim
arXiv (Cornell University), Feb 27, 2020
MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021
Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020
ACM Transactions on Architecture and Code Optimization, 2018
To exploit the parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques used in traditional GPU systems to hide memory latency and improve thread-level parallelism (TLP), memory interleaving and thread block scheduling, are at odds with efficient use of multiple GPUs. Distributing data across multiple GPUs to improve overall memory bandwidth utilization incurs high remote traffic when the data and compute are misaligned. Nondeterministic thread block scheduling to improve compute resource utilization impedes co-placement of compute and data. Our goal in this work is to enable co-placement of compute and data in the presence of fine-grained interleaved memory with a low-cost approach. To this end, we propose a mechanism that identifies exclusively accessed data and places that data, along with the thread block that accesses it, in the same GPU. The key ideas are (1) the amount of data exclusively ...
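A minimal Python sketch of the placement idea described in this abstract (all names and parameters here are hypothetical; the paper's actual mechanism is a hardware scheme): under fine-grained round-robin interleaving, each thread block is scheduled to the GPU that homes the memory chunks the block accesses exclusively.

```python
# Sketch (hypothetical names): co-placing thread blocks with the memory
# chunks they exclusively access, under fine-grained interleaving.
from collections import defaultdict

NUM_GPUS = 4
CHUNK_SIZE = 256  # interleaving granularity in bytes (assumed)

def home_gpu(address):
    """Fine-grained interleaving: chunks are striped round-robin across GPUs."""
    return (address // CHUNK_SIZE) % NUM_GPUS

def schedule_blocks(block_accesses):
    """block_accesses: dict of thread-block id -> set of addresses it touches.
    Returns a dict of thread-block id -> GPU the block should run on."""
    # Find chunks touched by exactly one thread block ("exclusively accessed").
    chunk_readers = defaultdict(set)
    for block, addrs in block_accesses.items():
        for addr in addrs:
            chunk_readers[addr // CHUNK_SIZE].add(block)

    placement = {}
    for block, addrs in block_accesses.items():
        # Count, per GPU, the chunks this block accesses exclusively.
        votes = defaultdict(int)
        for addr in addrs:
            if chunk_readers[addr // CHUNK_SIZE] == {block}:
                votes[home_gpu(addr)] += 1
        # Run the block where most of its exclusive data lives;
        # fall back to round-robin if it has no exclusive data.
        placement[block] = (max(votes, key=votes.get)
                            if votes else block % NUM_GPUS)
    return placement

# Example: block 0 exclusively touches a chunk homed on GPU 2,
# so it is placed there; block 1 splits between GPUs 0 and 1.
accesses = {0: {2 * CHUNK_SIZE, 2 * CHUNK_SIZE + 8}, 1: {0, CHUNK_SIZE}}
print(schedule_blocks(accesses))  # e.g. {0: 2, 1: 0}
```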
38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05)
Proceedings of the 36th Annual International Symposium on Computer Architecture, 2009
Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010
2008 IEEE 14th International Symposium on High Performance Computer Architecture, 2008
16th Symposium on Computer Architecture and High Performance Computing
IEEE Computer Architecture Letters, 2012
2007 IEEE 13th International Symposium on High Performance Computer Architecture, 2007
ACM SIGARCH Computer Architecture News, 2005
Runahead execution is a technique that improves processor performance by pre-executing the running application instead of stalling the processor when a long-latency cache miss occurs. Previous research has shown that this technique significantly improves processor performance. However, the efficiency of runahead execution, which directly affects the dynamic energy consumed by a runahead processor, has not been explored. A runahead processor executes significantly more instructions than a traditional out-of-order processor, sometimes without providing any performance benefit, which makes it inefficient. In this paper, we describe the causes of inefficiency in runahead execution and propose techniques to make a runahead processor more efficient, thereby reducing its energy consumption and possibly increasing its performance. Our analyses and results provide two major insights: (1) the efficiency of runahead execution can be greatly improved with simple techniques that reduce the number...
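One plausible way to quantify the efficiency this abstract refers to (a sketch based on the abstract's framing, not necessarily the paper's exact definition) is the performance gained per unit of extra work: percent IPC improvement divided by percent increase in executed instructions.

```python
# Sketch of an efficiency metric for runahead execution: compare the
# speedup gained against the extra instructions executed.

def runahead_efficiency(base_ipc, runahead_ipc, base_insts, runahead_insts):
    """Return percent performance gained per percent of extra instructions.

    base_*     : IPC and executed-instruction count without runahead
    runahead_* : the same measurements with runahead execution enabled
    """
    perf_gain = (runahead_ipc - base_ipc) / base_ipc          # e.g. 0.20 = +20%
    extra_work = (runahead_insts - base_insts) / base_insts   # e.g. 0.25 = +25%
    if extra_work == 0:
        return float("inf")  # speedup with no extra instructions
    return perf_gain / extra_work

# Example: +22% IPC for +27% executed instructions gives efficiency < 1,
# i.e. the processor does proportionally more extra work than it gains.
print(runahead_efficiency(1.00, 1.22, 100e9, 127e9))  # ~0.81
```

By this measure, techniques that cut useless runahead instructions raise efficiency even when they leave IPC unchanged, which matches the abstract's emphasis on reducing instruction count.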
2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014
ACM Transactions on Design Automation of Electronic Systems, 2014
Graphics Processing Units (GPUs) are very popular for both graphics and general-purpose applications. Since GPUs operate many processing units and manage multiple levels of memory hierarchy, they consume a significant amount of power. Although several power models for CPUs are available, the power consumption of GPUs has received little study. In this article we develop a new power model for GPUs by utilizing McPAT, a CPU power tool. We generate initial power model data from McPAT with a detailed GPU configuration, and then adjust the models by comparing them with empirical data. We use NVIDIA's Fermi architecture for building the power model, and our model estimates GPU power consumption with an average error of 7.7% for the microbenchmarks and 12.8% for the Merge benchmarks.
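A minimal sketch of the calibration step described above, with hypothetical numbers and a simple linear scaling (the article's actual adjustment may be more detailed): fit a scale factor that maps McPAT's initial estimates onto measured power, then report the adjusted model's average percentage error.

```python
# Sketch: adjust initial McPAT power estimates against empirical data,
# then compute the model's average (percentage) error.

def calibrate(mcpat_watts, measured_watts):
    """Least-squares scale factor mapping McPAT estimates to measurements."""
    num = sum(m * e for m, e in zip(mcpat_watts, measured_watts))
    den = sum(m * m for m in mcpat_watts)
    return num / den

def avg_error(estimates, measured):
    """Mean absolute percentage error of the model."""
    return sum(abs(e - m) / m for e, m in zip(estimates, measured)) / len(measured)

# Hypothetical numbers: McPAT under a Fermi-like configuration vs. a power meter.
mcpat    = [95.0, 110.0, 130.0, 150.0]
measured = [105.0, 125.0, 138.0, 170.0]

scale = calibrate(mcpat, measured)
adjusted = [scale * p for p in mcpat]
print(f"scale = {scale:.3f}, avg error = {100 * avg_error(adjusted, measured):.1f}%")
```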
Synthesis Lectures on Computer Architecture, 2012
ACM SIGPLAN Notices, 2012