About GPU sharing strategies in GKE

This page explains the characteristics and best types of workloads for each GPU sharing strategy available in Google Kubernetes Engine (GKE), such as multi-instance GPUs, GPU time-sharing, and NVIDIA MPS. GPU sharing helps you to minimize underutilized capacity in your cluster and to provide workloads with just enough capacity to complete tasks.

This page is for Platform admins and operators and for Data and AI specialists who want to run GPU-based workloads that consume GPU capacity as efficiently as possible. To learn more about common roles that we reference in Google Cloud content, see Common GKE user roles and tasks.

Before reading this page, ensure that you're familiar with the following concepts:

How GPU requests work in Kubernetes

Kubernetes enables workloads to request precisely the resource amounts they need to function. Although you can request fractional CPU units for workloads, you can't request fractional GPU units. Pod manifests must request GPU resources in integers, which means that an entire physical GPU is allocated to one container even if the container only needs a fraction of the resources to function correctly. This is inefficient and can be costly, especially when you're running multiple workloads with similar low GPU requirements.
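As a minimal sketch of the integer-only constraint, the following Pod manifest requests one whole GPU through the nvidia.com/gpu resource. The Pod name, container name, and image are hypothetical placeholders, not values taken from this page.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: whole-gpu-example                   # hypothetical name
spec:
  containers:
  - name: cuda-container                    # hypothetical name
    image: example.com/my-cuda-app:latest   # placeholder image (assumption)
    resources:
      limits:
        # GPU quantities must be whole numbers. A value such as 0.5 is not
        # accepted, so this container gets an entire physical GPU even if
        # it uses only a fraction of it.
        nvidia.com/gpu: 1
```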

Best practice:

Use GPU sharing strategies to improve GPU utilization when your workloads don't need all of the GPU resources.

What are GPU sharing strategies?

GPU sharing strategies allow multiple containers to efficiently use your attached GPUs and save running costs. GKE provides the following GPU sharing strategies: multi-instance GPU, GPU time-sharing, and NVIDIA MPS.

Which GPU sharing strategy to use

The following table summarizes and compares the characteristics of the available GPU sharing strategies:

General
Multi-instance GPU: Parallel GPU sharing among containers
GPU time-sharing: Rapid context switching
NVIDIA MPS: Parallel GPU sharing among containers

Isolation
Multi-instance GPU: A single GPU is divided in up to seven slices and each container on the same physical GPU has dedicated compute, memory, and bandwidth. Therefore, a container in a partition has a predictable throughput and latency even when other containers saturate other partitions.
GPU time-sharing: Each container accesses the full capacity of the underlying physical GPU by doing context switching between processes running on a GPU. However, time-sharing provides no memory limit enforcement between shared Jobs and the rapid context switching for shared access may introduce overhead.
NVIDIA MPS: NVIDIA MPS has limited resource isolation, but gains more flexibility in other dimensions, for example GPU types and max shared units, which simplify resource allocation.

Suitable for these workloads
Multi-instance GPU: Recommended for workloads running in parallel and that need certain resiliency and QoS. For example, when running AI inference workloads, multi-instance GPU allows multiple inference queries to run simultaneously for quick responses, without slowing each other down.
GPU time-sharing: Recommended for bursty and interactive workloads that have idle periods. These workloads are not cost-effective with a fully dedicated GPU. By using time-sharing, workloads get quick access to the GPU when they are in active phases. GPU time-sharing is optimal for scenarios to avoid idling costly GPUs where full isolation and continuous GPU access might not be necessary, for example, when multiple users test or prototype workloads. Workloads that use time-sharing need to tolerate certain performance and latency compromises.
NVIDIA MPS: Recommended for batch processing for small jobs because MPS maximizes the throughput and concurrent use of a GPU. MPS allows batch jobs to efficiently process in parallel for small to medium sized workloads. NVIDIA MPS is optimal for cooperative processes acting as a single application. For example, MPI jobs with inter-MPI rank parallelism. With these jobs, each small CUDA process (typically MPI ranks) can run concurrently on the GPU to fully saturate the whole GPU. Workloads that use CUDA MPS need to tolerate the memory protection and error containment limitations.

Monitoring
Multi-instance GPU: GPU utilization metrics are not available for multi-instance GPUs.
GPU time-sharing: Use Cloud Monitoring to monitor the performance of your GPU time-sharing. To learn more about the available metrics, see Monitor GPU time-sharing or NVIDIA MPS nodes.
NVIDIA MPS: Use Cloud Monitoring to monitor the performance of your NVIDIA MPS. To learn more about the available metrics, see Monitor GPU time-sharing or NVIDIA MPS nodes.

Request shared GPUs in workloads
Multi-instance GPU: Run multi-instance GPUs
GPU time-sharing: Run GPUs with time-sharing
NVIDIA MPS: Run GPUs with NVIDIA MPS

Best practice:

To maximize your GPU utilization, combine GPU sharing strategies. For each multi-instance GPU partition, use either time-sharing or NVIDIA MPS. You can then run multiple containers on each partition, with those containers sharing access to the resources on that partition. We recommend that you use either of the following combinations: multi-instance GPU with GPU time-sharing, or multi-instance GPU with NVIDIA MPS.
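As an illustrative sketch of one such combination, the following Pod spec fragment selects nodes whose multi-instance GPU partitions are shared through time-sharing. It assumes that the node labels named elsewhere on this page can be combined in a single nodeSelector; the partition size 1g.5gb and the client count "3" are example values, not recommendations.

```yaml
# Fragment of a Pod spec; only the nodeSelector is shown.
spec:
  nodeSelector:
    # Example partition size (assumption); depends on the GPU model.
    cloud.google.com/gke-gpu-partition-size: 1g.5gb
    # Use "mps" instead to share each partition with NVIDIA MPS.
    cloud.google.com/gke-gpu-sharing-strategy: time-sharing
    # Label values are strings, so the number is quoted.
    cloud.google.com/gke-max-shared-clients-per-gpu: "3"
```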

How the GPU sharing strategies work

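You can specify the maximum number of containers allowed to share a physical GPU. In the node labels described later on this page, this limit corresponds to cloud.google.com/gke-max-shared-clients-per-gpu.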

The following sections explain the scheduling behavior and operation of each GPU sharing strategy.

Multi-instance GPU

You can request multi-instance GPU in workloads by specifying the cloud.google.com/gke-gpu-partition-size label in the Pod spec nodeSelector field, under spec: nodeSelector.

GKE schedules workloads to appropriate available nodes by matching these labels. If there are no appropriate available nodes, GKE uses autoscaling and node auto-provisioning to create new nodes or node pools that match this label.
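The request described above might look like the following minimal sketch. The Pod name, container name, and image are hypothetical, and 1g.5gb is only an example partition size; the sizes that are valid depend on the GPU model in your node pool.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-example                         # hypothetical name
spec:
  nodeSelector:
    # GKE matches this label against nodes with the same partition size.
    cloud.google.com/gke-gpu-partition-size: 1g.5gb   # example value (assumption)
  containers:
  - name: cuda-container                    # hypothetical name
    image: example.com/my-cuda-app:latest   # placeholder image (assumption)
    resources:
      limits:
        # One GPU here means one GPU partition, because each GPU instance
        # is treated as a discrete physical GPU.
        nvidia.com/gpu: 1
```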

GPU time-sharing or NVIDIA MPS

You can request GPU time-sharing or NVIDIA MPS in workloads by specifying the cloud.google.com/gke-max-shared-clients-per-gpu and cloud.google.com/gke-gpu-sharing-strategy labels in the Pod spec nodeSelector field, under spec: nodeSelector.

The following table describes how scheduling behavior changes based on the combination of node labels that you specify in your manifests.

Node labels: cloud.google.com/gke-max-shared-clients-per-gpu and cloud.google.com/gke-gpu-sharing-strategy
GKE schedules workloads in available nodes that match both the labels. If there are no available nodes, GKE uses autoscaling and node auto-provisioning to create new nodes or node pools that match both the labels.

Node labels: only cloud.google.com/gke-max-shared-clients-per-gpu
Autopilot: GKE rejects the workload.
Standard: GKE schedules workloads in available nodes that match the label. If there are no available nodes, GKE uses autoscaling and node auto-provisioning to create new nodes or node pools that match the label. By default, auto-provisioned nodes are given the following label and value for each strategy:
GPU time-sharing: cloud.google.com/gke-gpu-sharing-strategy: time-sharing
NVIDIA MPS: cloud.google.com/gke-gpu-sharing-strategy: mps

Node labels: only cloud.google.com/gke-gpu-sharing-strategy
Autopilot: GKE rejects the workload.
Standard: GKE schedules workloads in available nodes that use specific sharing strategies. If there are multiple shared node pools with different values for cloud.google.com/gke-max-shared-clients-per-gpu, the workload can be scheduled on any available node. If there are no available nodes in any node pools, the cluster autoscaler scales up the node pool with the lowest value for cloud.google.com/gke-max-shared-clients-per-gpu. If all node pools are at capacity, node auto-provisioning creates a new node pool with a default value of cloud.google.com/gke-max-shared-clients-per-gpu=2.

The GPU request process that you complete is the same for the GPU time-sharing and NVIDIA MPS strategies.
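A minimal sketch of such a request follows, written as a Deployment so that several replicas can share a GPU. The names, the image, the replica count, and the client count "3" are hypothetical; as far as the request described on this page is concerned, only the value of the sharing strategy label differs between the two strategies.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shared-gpu-example                  # hypothetical name
spec:
  replicas: 3                               # example: three Pods that may share one GPU
  selector:
    matchLabels:
      app: shared-gpu-example
  template:
    metadata:
      labels:
        app: shared-gpu-example
    spec:
      nodeSelector:
        # Set to "mps" to request NVIDIA MPS instead of time-sharing.
        cloud.google.com/gke-gpu-sharing-strategy: time-sharing
        # Label values are strings, so the number is quoted.
        cloud.google.com/gke-max-shared-clients-per-gpu: "3"
      containers:
      - name: cuda-container                # hypothetical name
        image: example.com/my-cuda-app:latest   # placeholder image (assumption)
        resources:
          limits:
            # Request one GPU per container. This quantity is not a
            # measure of the compute power the container receives.
            nvidia.com/gpu: 1
```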

If you're developing GPU applications that run on GPU time-sharing or NVIDIA MPS, you can only request one GPU for each container. GKE rejects a request for more than one GPU in a container to avoid unexpected behavior. In addition, the number of GPUs requested with time-sharing and NVIDIA MPS is not a measure of the compute power available to the container.

The following table shows you what to expect when you request specific quantities of GPUs.

GPU requests that apply to GPU time-sharing and NVIDIA MPS

One GPU time-sharing or NVIDIA MPS per container
GKE allows the request, even if the node has one physical GPU or multiple physical GPUs.

More than one GPU time-sharing per container
GKE rejects the request. This behavior is the same when requesting more than one multi-instance GPU instance in a container, because each GPU instance is considered to be a discrete physical GPU.

More than one NVIDIA MPS per container
Based on the number of physical GPUs in the node, GKE does the following:
GKE allows the request when the node only has one physical GPU.
GKE rejects the request when the node has multiple physical GPUs. This behavior is the same when requesting more than one multi-instance GPU instance in a container, because each GPU instance is considered to be a discrete physical GPU.

If GKE rejects the workload, you see an error message similar to the following:

status:
  message: 'Pod Allocate failed due to rpc error: code = Unknown desc = [invalid request
    for sharing GPU (time-sharing), at most 1 nvidia.com/gpu can be requested on GPU nodes], which is unexpected'
  phase: Failed
  reason: UnexpectedAdmissionError

Monitor GPU time-sharing or NVIDIA MPS nodes

Use Cloud Monitoring to monitor the performance of your GPU time-sharing or NVIDIA MPS nodes. GKE sends metrics for each GPU node to Cloud Monitoring. These GPU time-sharing or NVIDIA MPS node metrics apply at the node level (node/accelerator/).

You can check the following metrics for each GPU time-sharing or NVIDIA MPS node in Cloud Monitoring:

These metrics are different from the metrics for regular GPUs that are not GPU time-sharing or NVIDIA MPS nodes. The metrics for regular physical GPUs apply at the container level (container/accelerator) and are not collected for containers scheduled on a GPU that uses GPU time-sharing or NVIDIA MPS.
