Review compute instance and cluster configurations

This document describes the configurations in AI Hypercomputer to consider before you create Compute Engine instances and clusters. Reviewing the available configurations helps you optimize performance for your workloads and minimize downtime and performance issues.

Configuration factors for compute instance and cluster creation

Before you create compute instances and clusters to run your workloads, consider which configuration to use:

  1. The provisioning model
  2. The cluster deployment tools
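  3. If you use the reservation-bound provisioning model, then you must also consider the following factors: the reservation block deployment type, the reservation operational mode, and the maintenance scheduling type.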

Provisioning models

Based on the consumption option that you choose for creating compute instances or clusters, you can use one of the following provisioning models to obtain the necessary resources for creating instances:

Reservation-bound provisioning model

The reservation-bound provisioning model links your created compute instances to the capacity that you previously reserved. When you reserve capacity, Compute Engine creates an empty reservation. Then, at the reservation start time, the following occurs:

You can then use the reserved resources to create instances without additional charges. You only pay for resources that aren't included in the reservation, such as disks or IP addresses.

To specify the reservation-bound provisioning model when you create compute instances or MIGs, do the following:

For more information about setting these parameters when you create instances or MIGs after you reserve capacity, see Compute instance and cluster creation overview. If you use Cluster Toolkit to deploy your clusters, then the cluster blueprint sets the provisioning model for you.
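The following Python sketch illustrates how a reservation-bound request might be expressed as a fragment of a Compute Engine instance request body. It assumes the REST field names `scheduling.provisioningModel` and `reservationAffinity`. The function name and the reservation name `example-reservation` are hypothetical placeholders, and a real request needs further fields (machine type, disks, network) that are omitted here.

```python
# Sketch only: builds the scheduling and reservation-affinity parts of an
# instance request body. It does not call any API.

def reservation_bound_fragment(reservation_name: str) -> dict:
    """Return the request-body fragment that ties an instance to a reservation."""
    if not reservation_name:
        raise ValueError("reservation_name must be non-empty")
    return {
        "scheduling": {
            # Assumed enum value for the reservation-bound provisioning model.
            "provisioningModel": "RESERVATION_BOUND",
        },
        "reservationAffinity": {
            # Consume capacity only from the named reservation.
            "consumeReservationType": "SPECIFIC_RESERVATION",
            "key": "compute.googleapis.com/reservation-name",
            "values": [reservation_name],
        },
    }


# "example-reservation" is a placeholder, not a real resource.
fragment = reservation_bound_fragment("example-reservation")
```

You would merge this fragment into the full instance or instance template body before sending the request.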

Flex-start provisioning model

The flex-start provisioning model lets you create standalone Flex-start VMs or add Flex-start VMs to a managed instance group (MIG) when your requested capacity is available. When you add Flex-start VMs to a MIG by using resize requests, the MIG creates the instances all at once. This approach helps you avoid unnecessary charges for partial capacity that Compute Engine might deliver while you wait for the full capacity needed to start your workload. The flex-start provisioning model provisions resources from a secure capacity pool, which helps to increase your chances of obtaining high-demand resources like GPUs.
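As an illustration, the following Python sketch builds the scheduling fragment that a Flex-start VM request might carry. It assumes the REST field names `scheduling.provisioningModel`, `scheduling.maxRunDuration`, and `scheduling.instanceTerminationAction`, and it assumes that Flex-start VMs must not consume reservations. The function name and the four-hour run duration are hypothetical.

```python
# Sketch only: builds a request-body fragment for a Flex-start VM.
# It does not call any API.

def flex_start_fragment(max_run_hours: int, on_expiry: str = "DELETE") -> dict:
    """Return the scheduling fragment for a Flex-start VM with a bounded run time."""
    if max_run_hours <= 0:
        raise ValueError("max_run_hours must be positive")
    if on_expiry not in ("STOP", "DELETE"):
        raise ValueError("on_expiry must be 'STOP' or 'DELETE'")
    return {
        "scheduling": {
            # Assumed enum value for the flex-start provisioning model.
            "provisioningModel": "FLEX_START",
            # What happens to the instance when the run duration ends.
            "instanceTerminationAction": on_expiry,
            # int64 values are serialized as strings in the JSON body.
            "maxRunDuration": {"seconds": str(max_run_hours * 3600)},
        },
        # Assumption: Flex-start VMs do not draw from reservations.
        "reservationAffinity": {"consumeReservationType": "NO_RESERVATION"},
    }


body = flex_start_fragment(max_run_hours=4)
```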

To specify the flex-start provisioning model when creating a standalone instance or an instance template for a MIG, do the following:

For more information about creating instances or clusters that use the flex-start provisioning model, see the following documents:

Spot provisioning model

The spot provisioning model lets you create deeply discounted compute instances based on availability. However, Compute Engine might stop or delete the created instances at any time to reclaim capacity. This process is called preemption.

To specify the spot provisioning model when you create instances or MIGs, do the following:

For more information about setting these parameters when you create instances or MIGs, see Compute instance and cluster creation overview.
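The stop-or-delete choice on preemption can be sketched as a request-body fragment. This Python example assumes the REST field names `scheduling.provisioningModel` and `scheduling.instanceTerminationAction`. The function name is hypothetical, and the rest of the instance body is omitted.

```python
# Sketch only: builds the scheduling fragment for a Spot VM.
# It does not call any API.

def spot_fragment(on_preemption: str = "STOP") -> dict:
    """Return the scheduling fragment for a Spot VM.

    on_preemption selects whether the instance is stopped or deleted when
    Compute Engine reclaims the capacity.
    """
    if on_preemption not in ("STOP", "DELETE"):
        raise ValueError("on_preemption must be 'STOP' or 'DELETE'")
    return {
        "scheduling": {
            "provisioningModel": "SPOT",
            "instanceTerminationAction": on_preemption,
        }
    }


spot_delete = spot_fragment("DELETE")
```

Choosing `DELETE` suits stateless workers that you can recreate, while `STOP` keeps the instance and its disks so that you can restart it later.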

Cluster deployment tools

Cluster Toolkit is an open source deployment tool that is recommended for creating GPU-accelerated clusters. Cluster Toolkit can deploy either Google Kubernetes Engine (GKE) or Slurm clusters.

Alternatively, you can choose to provision your groups of compute instances by using one of the following methods, and then incorporate your own workload scheduler as needed:

Reservation block deployment types

If you use the reservation-bound provisioning model when creating A4X Max, A4X, A4, A3 Ultra, A3 Mega, and A3 High (8 GPUs) compute instances or clusters, the machines you receive are automatically deployed within blocks of densely allocated hosts. This deployment offers the following benefits:

Reservation operational mode

If you use the reservation-bound provisioning model, then the machine type that you reserve determines the reservation operational mode for your reserved capacity. Each mode defines how to respond to host errors or faulty host reports, as well as your level of visibility and control over the reservation's infrastructure.

Each reservation operational mode defines the following:

When you reserve capacity to create compute instances or clusters, you must choose one of the following reservation operational modes: managed mode or all capacity mode.

Managed mode

In managed mode, Google Cloud automatically manages the maintenance and recovery process of your compute instances after host errors or faulty host reports. This approach is ideal when your workload requires high stability and you prefer an automated process to minimize downtime.

The managed mode has the following features:

All capacity mode

In all capacity mode, you are responsible for managing the recovery process for your compute instances. You must manually start maintenance after host errors or faulty host reports. Unlike the managed mode, you can also view and start maintenance for your reservation sub-blocks. These features give you full, granular control over the maintenance and recovery process for your instances.
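The difference between the two modes can be summarized as a small lookup table. The following Python sketch is illustrative only: the class, field, and constant names are hypothetical and don't correspond to any API. The values restate the descriptions of the two modes in this document.

```python
from dataclasses import dataclass
from enum import Enum


class OperationalMode(Enum):
    """Hypothetical labels for the two reservation operational modes."""
    MANAGED = "managed"
    ALL_CAPACITY = "all capacity"


@dataclass(frozen=True)
class ModeTraits:
    # True if Google Cloud handles maintenance and recovery after host errors.
    automatic_recovery: bool
    # True if you can view and start maintenance for reservation sub-blocks.
    sub_block_maintenance_control: bool


MODE_TRAITS = {
    OperationalMode.MANAGED: ModeTraits(
        automatic_recovery=True,
        sub_block_maintenance_control=False,
    ),
    OperationalMode.ALL_CAPACITY: ModeTraits(
        automatic_recovery=False,
        sub_block_maintenance_control=True,
    ),
}
```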

The all capacity mode has the following features:

Maintenance scheduling types

If you use the reservation-bound provisioning model, then Cluster Director provides options for scheduling host maintenance for the running compute instances in your cluster. When you reserve capacity, you can specify whether the instances are grouped and share a synchronized maintenance schedule (grouped), or are loosely coupled and have independent maintenance schedules (independent).

Grouped maintenance scheduling

The grouped maintenance scheduling type helps ensure that, no matter when Compute Engine provisions a compute instance, all instances running the same workload have the same planned maintenance frequency. This tightly coupled maintenance lets you optimize your job's performance by giving you complete control over your used and unused capacity.

The grouped maintenance scheduling type is useful in the following cases:

Independent maintenance scheduling

The independent maintenance scheduling type gives instances different maintenance schedules. This configuration is ideal if you want to run inference or limited-scale training, where workloads run more efficiently when they have separate maintenance schedules.
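The guidance for choosing a maintenance scheduling type can be sketched as a simple helper. In the following Python example, the function name and the workload labels are hypothetical. Falling back to grouped scheduling for any other workload is an assumption, based on the description of grouped scheduling as suited to instances that run the same tightly coupled workload.

```python
# Workload kinds that this document names as a good fit for independent
# maintenance scheduling. The labels themselves are hypothetical.
INDEPENDENT_WORKLOADS = frozenset({"inference", "limited-scale training"})


def suggest_scheduling_type(workload: str) -> str:
    """Suggest 'independent' or 'grouped' maintenance scheduling."""
    if workload.strip().lower() in INDEPENDENT_WORKLOADS:
        return "independent"
    # Assumption: other workloads are treated as tightly coupled jobs.
    return "grouped"


suggestion = suggest_scheduling_type("Inference")
```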

What's next?