Accelerated high-performance computing through efficient multi-process GPU resource sharing
Related papers
GPU Resource Sharing and Virtualization on High Performance Computing Systems
2011 International Conference on Parallel Processing, 2011
Modern Graphics Processing Units (GPUs) are widely used as application accelerators in the High Performance Computing (HPC) field due to their massive floating-point computational capabilities and highly data-parallel computing architecture. Contemporary high performance computers equipped with co-processors such as GPUs primarily execute parallel applications using the Single Program Multiple Data (SPMD) model, which requires balanced computing resources from both the microprocessors and the co-processors to ensure full system utilization. While the inclusion of GPUs in HPC systems provides more computing resources and significant performance improvements, the asymmetric distribution of GPUs relative to microprocessors can result in underutilization of the overall system computing resources. In this paper, we propose a GPU resource virtualization approach that allows underutilized microprocessors to share the GPUs. We analyze factors affecting parallel execution performance on GPUs and conduct a theoretical performance estimation based on the most recent GPU architectures as well as the SPMD model. We then present the implementation details of the virtualization infrastructure, followed by an experimental verification of the proposed concepts using an NVIDIA Fermi GPU computing node. The results demonstrate a considerable performance gain over traditional SPMD execution without virtualization. Furthermore, the proposed solution enables full utilization of the asymmetric system resources by sharing the GPUs among microprocessors, while incurring low overhead from the virtualization layer.
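The sharing pattern at the heart of this line of work can be sketched in a few lines of CUDA. The sketch below is an illustration only, not the paper's virtualization infrastructure; the process count, kernel, and buffer size are made-up assumptions. It shows several host processes, as in an SPMD job with more CPU cores than GPUs, each creating a context on the same physical device and submitting independent work.

// Minimal sketch: several host processes time-sharing one physical GPU.
// NOT the paper's virtualization layer; it only shows the multi-process
// sharing pattern that such a layer is designed to make efficient.
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>
#include <cuda_runtime.h>

__global__ void scale(float *v, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= a;
}

int main() {
    const int kProcs = 4;            // e.g., one process per CPU core sharing the GPU
    for (int p = 0; p < kProcs; ++p) {
        if (fork() == 0) {           // child: initialize CUDA only *after* fork
            const int n = 1 << 20;
            float *d;
            cudaSetDevice(0);        // every process maps to the same physical GPU
            cudaMalloc(&d, n * sizeof(float));
            scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
            cudaDeviceSynchronize();
            cudaFree(d);
            printf("process %d done on shared GPU 0\n", p);
            return 0;
        }
    }
    while (wait(nullptr) > 0) {}     // parent waits for all sharers
    return 0;
}

Without an efficient sharing layer, the per-process contexts and serialized submissions in a pattern like this are exactly where the overheads the paper measures come from.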
Exploring Graphics Processing Unit (GPU) Resource Sharing Efficiency for High Performance Computing
Computers, 2013
The increasing incorporation of Graphics Processing Units (GPUs) as accelerators has been one of the forefront High Performance Computing (HPC) trends and provides unprecedented performance; however, the prevalent adoption of the Single-Program Multiple-Data (SPMD) programming model brings with it challenges of resource underutilization. In other words, under SPMD, every CPU needs GPU capability available to it. However, since CPUs generally outnumber GPUs, the asymmetric resource distribution gives rise to overall computing resource underutilization. In this paper, we propose to efficiently share the GPU under SPMD and formally define a series of GPU sharing scenarios. We provide a performance-modeling analysis for each sharing scenario with accurate experimental validation. On this modeling basis, we further conduct experimental studies to explore potential GPU sharing efficiency improvements from multiple perspectives. Further theoretical and experimental analyses of GPU sharing performance are also presented. Our results demonstrate not only the significant performance gain that efficient GPU sharing brings to SPMD programs, but also the further improved sharing efficiency achieved by optimization techniques grounded in our modeling.
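As a rough illustration of the kind of closed-form reasoning such modeling involves (a generic first-order sketch, not the paper's actual model): suppose m SPMD processes share one GPU and each offloads a data transfer of duration t_d followed by a kernel of duration t_k, with a per-offload sharing overhead t_o. If the offloads serialize completely, the shared turnaround time is roughly T_serial ≈ m·(t_d + t_k + t_o); if each process's transfer can overlap the other processes' kernel executions, it drops to roughly T_overlap ≈ t_d + m·(t_k + t_o). The ratio of the two bounds how much additional benefit transfer/compute overlap can add on top of plain time-sharing; the paper's models and experiments quantify effects of this kind for each defined sharing scenario.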
Towards efficient GPU sharing on multicore processors
Proceedings of the second international workshop on Performance modeling, benchmarking and simulation of high performance computing systems - PMBS '11, 2011
Scalable systems employing a mix of GPUs with CPUs are becoming increasingly prevalent in high-performance computing. The presence of such accelerators introduces significant challenges and complexities to both language developers and end users. This paper provides a close study of efficient coordination mechanisms to handle parallel requests from multiple hosts of control to a GPU under hybrid programming. Using a set of microbenchmarks and applications on a GPU cluster, we show that thread- and process-based context hosting have different tradeoffs. Experimental results on application benchmarks suggest that both thread-based context funneling and process-based context switching natively perform similarly on the latest Fermi GPUs, while manually guided context funneling is currently the best way to achieve optimal performance.
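A minimal sketch of manually guided context funneling, assuming C++11 threads and a single shared device (this illustrates the pattern, not the paper's benchmark code): producer threads enqueue GPU requests, and one dedicated thread that owns the device issues every kernel, so all work is serialized through a single context.

// Context-funneling sketch: many producers, one GPU-owning thread.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void add1(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1.0f;
}

struct Request { float *buf; int n; };

std::queue<Request> q;
std::mutex m;
std::condition_variable cv;
bool done = false;

void funnel() {                        // the single thread that launches all kernels
    cudaSetDevice(0);
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return done || !q.empty(); });
        if (q.empty() && done) break;
        Request r = q.front(); q.pop();
        lk.unlock();
        add1<<<(r.n + 255) / 256, 256>>>(r.buf, r.n);
        cudaDeviceSynchronize();       // work from all producers is serialized here
    }
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaSetDevice(0);
    cudaMalloc(&d, n * sizeof(float));
    std::thread gpu_owner(funnel);
    std::vector<std::thread> producers;
    for (int t = 0; t < 4; ++t)        // four hosts of control sharing one GPU
        producers.emplace_back([&] {
            std::lock_guard<std::mutex> lk(m);
            q.push({d, n});
            cv.notify_one();
        });
    for (auto &p : producers) p.join();
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_one();
    gpu_owner.join();
    cudaFree(d);
    printf("all requests funneled through one GPU-owning thread\n");
    return 0;
}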
GPU-SM: shared memory multi-GPU programming
Proceedings of the 8th Workshop on General Purpose Processing using GPUs - GPGPU 2015, 2015
Discrete GPUs in modern multi-GPU systems can transparently access each other's memories through the PCIe interconnect. Future systems will improve this capability by including better GPU interconnects such as NVLink. However, remote memory access across GPUs has gone largely unnoticed among programmers, and multi-GPU systems are still programmed like distributed systems in which each GPU only accesses its own memory. This increases the complexity of the host code, as programmers need to explicitly communicate data across GPU memories. In this paper we present GPU-SM, a set of guidelines for programming multi-GPU systems like NUMA shared memory systems with minimal performance overheads. Using GPU-SM, data structures can be decomposed across several GPU memories, and data that resides on a different GPU is accessed remotely through the PCIe interconnect. The programmability benefits of the shared-memory model on GPUs are shown using finite difference and image filtering applications. We also present a detailed performance analysis of the PCIe interconnect and the impact of remote accesses on kernel performance. While PCIe imposes long latency and limited bandwidth compared to local GPU memory, we show that the highly multithreaded GPU execution model can help reduce these costs. Evaluation of finite difference and image filtering GPU-SM implementations shows close-to-linear speedups on a system with 4 GPUs, with much simpler code than the original implementations (e.g., a 40% SLOC reduction in the host code of finite difference).
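The remote-access mechanism GPU-SM builds on is exposed by the standard CUDA peer-access API. A minimal sketch, with error handling omitted and the buffer size chosen arbitrarily: a kernel launched on GPU 0 increments a buffer that physically resides on GPU 1, so every load and store crosses the interconnect.

// Peer access: GPU 0 dereferences memory allocated on GPU 1 over PCIe.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void inc_remote(int *remote, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) remote[i] += 1;               // load/store travels over the interconnect
}

int main() {
    int n = 1 << 20, can01 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can GPU 0 reach GPU 1's memory?
    if (!can01) { printf("no peer access between GPU 0 and GPU 1\n"); return 0; }

    int *buf_on_gpu1;
    cudaSetDevice(1);
    cudaMalloc(&buf_on_gpu1, n * sizeof(int));
    cudaMemset(buf_on_gpu1, 0, n * sizeof(int));

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);        // map GPU 1's memory into GPU 0's address space
    inc_remote<<<(n + 255) / 256, 256>>>(buf_on_gpu1, n);
    cudaDeviceSynchronize();

    cudaSetDevice(1);
    cudaFree(buf_on_gpu1);
    return 0;
}

GPU-SM's contribution is the set of guidelines for decomposing data and tolerating the extra latency of such remote accesses, which this bare API call does nothing to hide.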
A complete and efficient CUDA-sharing solution for HPC clusters
Parallel Computing, 2014
In this paper we detail the key features, architectural design, and implementation of rCUDA, an advanced framework to enable remote and transparent GPGPU acceleration in HPC clusters. rCUDA allows decoupling GPUs from nodes, forming pools of shared accelerators, which brings enhanced flexibility to cluster configurations. This opens the door to configurations with fewer accelerators than nodes, and also permits a single node to exploit the whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly interact with any GPU in the cluster, independently of its physical location. Thus, GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU servers, depending on the cluster administrator's policy. This proposal leads to savings not only in space but also in energy, acquisition, and maintenance costs. The performance evaluation in this paper, with a series of benchmarks and a production application, clearly demonstrates the viability of this proposal. Concretely, experiments with the matrix-matrix product reveal excellent performance compared with regular executions on the local GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up to 11x speedup employing 8 remote accelerators from a single node with respect to a 12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration in dedicated GPGPU servers, and data transfer performance of similar GPU virtualization frameworks are also evaluated.
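rCUDA achieves its transparency by intercepting the CUDA API of an unmodified application and forwarding the calls to a GPU server elsewhere in the cluster. The sketch below is not rCUDA source code and performs no networking; it only illustrates the interception half of that idea with a generic LD_PRELOAD shim around cudaMalloc. The build and run commands are assumptions, and interception of this kind only works when the application links the CUDA runtime dynamically (e.g., nvcc -cudart shared).

// Generic API-interception sketch (NOT rCUDA code): a preloaded shared
// library that wraps cudaMalloc. A remote-GPU framework would, at the
// marked point, marshal the call and send it to the node hosting the GPU.
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <cuda_runtime_api.h>

cudaError_t cudaMalloc(void **devPtr, size_t size) {
    // Look up the "real" symbol in the next library on the search path.
    static cudaError_t (*real_cudaMalloc)(void **, size_t) = NULL;
    if (!real_cudaMalloc)
        real_cudaMalloc = (cudaError_t (*)(void **, size_t))dlsym(RTLD_NEXT, "cudaMalloc");

    fprintf(stderr, "[shim] cudaMalloc(%zu bytes) intercepted\n", size);
    // A remote-execution layer would forward the request here instead of
    // falling through to the local driver, as this sketch does.
    return real_cudaMalloc(devPtr, size);
}

// Build (assumed):  gcc -shared -fPIC shim.c -o libshim.so -ldl -I/usr/local/cuda/include
// Run   (assumed):  LD_PRELOAD=./libshim.so ./unmodified_cuda_app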
GPU virtualization for high performance general purpose computing on the ESX hypervisor
high performance computing symposium, 2014
Graphics Processing Units (GPUs) have become important components in high performance computing (HPC) systems for their massively parallel computing capability and energy efficiency. Virtualization technologies are increasingly applied to HPC to reduce administration costs and improve system utilization. However, virtualizing the GPU to support general purpose computing presents many challenges because of the complexity of this device. On VMware's ESX hypervisor, DirectPath I/O can provide virtual machines (VMs) high-performance access to physical GPUs. However, this technology does not allow multiplexing for sharing GPUs among VMs and is not compatible with vMotion, VMware's technology for transparently migrating VMs among hosts inside clusters. In this paper, we address these issues by implementing a solution that uses "remote API execution" and takes advantage of DirectPath I/O to enable general-purpose GPU computing on ESX. This solution, named vmCUDA, allows CUDA applications running concurrently in multiple VMs on ESX to share GPU(s). Our solution requires neither recompilation nor even editing of the source code of CUDA applications. Our performance evaluation has shown that vmCUDA introduces an overhead of 0.6%-3.5% for applications with moderate data sizes and 14%-20% for those with large data (e.g., 12.5 GB to 237.5 GB in our experiments).
Enabling GPU and Many-Core Systems in Heterogeneous HPC Environments using Memory Considerations
2010
Increasing the utilization of many-core systems has been one of the forefront topics in recent years. Although many-core architectures were merely theoretical models a few years ago, they have become an important part of the high performance computing market. The semiconductor industry has developed Graphics Processing Unit (GPU) systems that provide access to many cores (e.g., Larrabee, Fermi, or Tesla) and can be used for General Purpose (GP) computing.
Generalizing the utility of GPUs in large-scale heterogeneous computing systems
2012
Graphics processing units (GPUs) have been widely used as accelerators in large-scale heterogeneous computing systems. However, current programming models can only support the utilization of local GPUs. When using non-local GPUs, programmers need to explicitly call API functions for data communication across computing nodes. As such, programming GPUs in large-scale computing systems is more challenging than programming local GPUs, since local and remote GPUs have to be dealt with separately. In this work, we propose a virtual OpenCL (VOCL) framework to support the transparent virtualization of GPUs. This framework, based on the OpenCL programming model, exposes physical GPUs as decoupled virtual resources that can be transparently managed independently of the application execution. To reduce the virtualization overhead, we optimize GPU memory accesses and kernel launches. We also extend the VOCL framework to support live task migration across physical GPUs to achieve load balance and/or quick system maintenance. Our experimental results indicate that VOCL can greatly simplify the task of programming cluster-based GPUs at a reasonable virtualization cost.
Efficient GPGPU Computing with Cross-Core Resource Sharing and Core Reconfiguration
2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017
GPUs are capable of running a variety of applications; however, their generic parallel architecture can lead to inefficient use of resources and reduced power efficiency due to algorithmic or architectural constraints. In this work, taking inspiration from CGRAs (coarse-grained reconfigurable architectures), we demonstrate resource sharing and re-distribution as a solution that can be leveraged by reconfiguring the GPU on a kernel-by-kernel basis. We explore four different schemes that trade the number of active SMs (streaming multiprocessors) for increased occupancy and local memory resources per SM, and demonstrate improved power and energy efficiency with limited impact on performance. Our most aggressive scheme, BigSM, saves up to 54% energy, and 26% on average.
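The occupancy-versus-resources trade-off that this paper attacks in hardware can be observed in software on a stock GPU with the CUDA occupancy API. The sketch below does not reconfigure SMs; it merely queries, for an arbitrary example kernel and block size, how the number of resident blocks per SM shrinks as the dynamic shared memory requested per block grows (the 48 KB limit and 8 KB step are assumptions).

// Query-only illustration of the occupancy / local-memory trade-off.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void stencil(float *v, int n) {
    extern __shared__ float tile[];          // dynamic shared memory per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? v[i] : 0.0f;
    __syncthreads();
    if (i < n) v[i] = tile[threadIdx.x] * 0.5f;
}

int main() {
    const int blockSize = 256;
    // The kernel is never launched; we only ask how it would pack onto an SM.
    for (size_t shmem = 0; shmem <= 48 * 1024; shmem += 8 * 1024) {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, stencil,
                                                      blockSize, shmem);
        printf("%5zu B shared mem/block -> %d resident blocks per SM\n",
               shmem, blocksPerSM);
    }
    return 0;
}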
GPU clusters for high-performance computing
Proceedings - IEEE International Conference on Cluster Computing, ICCC, 2009
Large-scale GPU clusters are gaining popularity in the scientific computing community. However, their deployment and production use are associated with a number of new challenges. In this paper, we present our efforts to address some of the challenges with building and running GPU clusters in HPC environments. We touch upon such issues as balanced cluster architecture, resource sharing in a cluster environment, programming models, and applications for GPU clusters.