Configure GKE for ML Diagnostics

If you are using Google Kubernetes Engine (GKE) for your ML workload, use this guide to configure your GKE cluster and install the required GKE artifacts.

The configuration of your workload depends on whether you use on-demand profiling or programmatic profiling.

If you are using a version of GKE that is later than 1.35.0-gke.3065000, you can set up a GKE cluster for ML Diagnostics with a single gcloud CLI command. For more information, see Set up with gcloud CLI.

For GKE versions prior to 1.35.0-gke.3065000, you need to manually configure the GKE cluster to install the cert-manager, injection-webhook, and connection-operator artifacts. For more information, see Manual installation.
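
The choice between the two setup paths can be scripted. The following is a minimal sketch, not part of any official tooling: the function names version_fields, version_gt, and setup_path are hypothetical, and the sketch assumes that version strings follow the MAJOR.MINOR.PATCH-gke.BUILD pattern. This guide does not say which path applies to exactly 1.35.0-gke.3065000, so the sketch assumes manual installation in that case.

```shell
# Threshold taken from this guide.
THRESHOLD="1.35.0-gke.3065000"

# Split MAJOR.MINOR.PATCH-gke.BUILD into four space-separated numbers.
version_fields() {
  echo "$1" | sed -E 's/^([0-9]+)\.([0-9]+)\.([0-9]+)-gke\.([0-9]+)$/\1 \2 \3 \4/'
}

# Succeed (exit status 0) only if version $1 is strictly later than version $2.
version_gt() {
  # After this line, $1-$4 hold the first version and $5-$8 the second.
  set -- $(version_fields "$1") $(version_fields "$2")
  if [ "$1" -ne "$5" ]; then [ "$1" -gt "$5" ]; return; fi
  if [ "$2" -ne "$6" ]; then [ "$2" -gt "$6" ]; return; fi
  if [ "$3" -ne "$7" ]; then [ "$3" -gt "$7" ]; return; fi
  [ "$4" -gt "$8" ]
}

# Print "gcloud" or "manual" for a given GKE version.
setup_path() {
  if version_gt "$1" "$THRESHOLD"; then
    echo "gcloud"   # single-command setup
  else
    echo "manual"   # Helm-based installation (assumed for the exact threshold too)
  fi
}
```

For example, setup_path 1.36.0-gke.100 (a made-up version string) prints gcloud. The version of an existing cluster can be read with gcloud container clusters describe CLUSTER_NAME --format="value(currentMasterVersion)", together with the cluster's location flag. The sketch does not validate malformed input.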

Set up with gcloud CLI

For GKE versions later than 1.35.0-gke.3065000, use one of the following gcloud CLI commands to deploy the required ML Diagnostics components (both connection-operator and injection-webhook) into your GKE cluster.

For new GKE clusters:

gcloud beta container clusters create CLUSTER_NAME --enable-managed-mldiagnostics

For existing GKE clusters:

gcloud beta container clusters update CLUSTER_NAME --enable-managed-mldiagnostics

To disable ML Diagnostics, use the following:

gcloud beta container clusters update CLUSTER_NAME --no-enable-managed-mldiagnostics

You can also enable ML Diagnostics from the GKE pages of the Google Cloud console.

For more information on the gcloud CLI commands that set up a GKE cluster for ML Diagnostics, refer to the enable-managed-mldiagnostics flag in the reference pages for the gcloud beta container clusters create and gcloud beta container clusters update commands.

Manual installation

For GKE versions prior to 1.35.0-gke.3065000, you need to manually install the following on the GKE cluster: cert-manager, injection-webhook, and connection-operator.

For more information on setting up Google Kubernetes Engine, see Configure Google Kubernetes Engine cluster.

Cert-manager

cert-manager acts as the certificate controller for your cluster: it issues the certificates your applications need and renews them so that they do not expire unintentionally.

Use Helm to install cert-manager:

helm repo add jetstack https://charts.jetstack.io
helm repo update

helm install \
  cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.13.0 \
  --set installCRDs=true \
  --set global.leaderElection.namespace=cert-manager \
  --timeout 10m
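
Before installing the remaining components, it can help to confirm that cert-manager is running. The following is a minimal sketch rather than a documented step: the KUBECTL variable and the wait_for_cert_manager function are hypothetical names, the five-minute timeout is an arbitrary assumption, and the namespace is the one used in the Helm command above.

```shell
# Allow the kubectl binary to be overridden, for example KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Block until every Deployment in the cert-manager namespace reports Available,
# or fail after five minutes.
wait_for_cert_manager() {
  "$KUBECTL" wait deployment --all \
    --namespace cert-manager \
    --for=condition=Available \
    --timeout=300s
}
```

Call wait_for_cert_manager after the helm install command returns. Setting KUBECTL=echo prints the kubectl arguments without contacting a cluster.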

Injection-webhook

injection-webhook passes metadata into the SDK. Use helm upgrade --install, which installs the chart for the first time or upgrades an existing installation:

helm upgrade --install mldiagnostics-injection-webhook \
  --namespace=gke-mldiagnostics \
  --create-namespace \
  --version 0.25.0 \
  oci://us-docker.pkg.dev/ai-on-gke/mldiagnostics-webhook-and-operator-helm/mldiagnostics-injection-webhook

Connection-operator

connection-operator enables on-demand profiling on GKE. Use the following table to find the correct mldiagnostics-connection-operator version:

JAX version    Helm chart version
0.8.x          0.24.0
0.9.x+         0.24.0+

Use Helm to install the required version.

For JAX 0.8.x:

helm upgrade --install mldiagnostics-connection-operator \
  --namespace=gke-mldiagnostics \
  --create-namespace \
  --version 0.24.0 \
  oci://us-docker.pkg.dev/ai-on-gke/mldiagnostics-webhook-and-operator-helm/mldiagnostics-connection-operator \
  --set 'mldiagnosticsConnectionOperator.controller.args={--metrics-bind-address=:8443,--health-probe-bind-address=:8081,--sidecar-timeout=65m,--disable-hostname-override}'

For JAX 0.9.x+:

helm upgrade --install mldiagnostics-connection-operator \
  --namespace=gke-mldiagnostics \
  --create-namespace \
  --version 0.24.0 \
  oci://us-docker.pkg.dev/ai-on-gke/mldiagnostics-webhook-and-operator-helm/mldiagnostics-connection-operator
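
The two commands above differ only in the extra --set argument for JAX 0.8.x, so the choice can be derived from the JAX version. The following is a minimal sketch under that reading: the JAX_08_ARGS variable and the connection_operator_extra_args function are hypothetical names, and treating JAX versions earlier than 0.8 as an error is an assumption, since the table does not cover them.

```shell
# Controller arguments that this guide passes only for JAX 0.8.x.
JAX_08_ARGS='mldiagnosticsConnectionOperator.controller.args={--metrics-bind-address=:8443,--health-probe-bind-address=:8081,--sidecar-timeout=65m,--disable-hostname-override}'

# Print the extra Helm arguments for a JAX version (MAJOR.MINOR[.PATCH]).
# Prints nothing for 0.9 and later; fails for versions the table does not cover.
connection_operator_extra_args() {
  major="${1%%.*}"      # text before the first dot
  rest="${1#*.}"        # text after the first dot
  minor="${rest%%.*}"   # second numeric field
  if [ "$major" -eq 0 ] && [ "$minor" -lt 8 ]; then
    echo "JAX $1 is not covered by the version table" >&2
    return 1
  elif [ "$major" -eq 0 ] && [ "$minor" -eq 8 ]; then
    echo "--set $JAX_08_ARGS"
  fi
}
```

In a POSIX shell, appending an unquoted $(connection_operator_extra_args 0.8.2) to the helm upgrade --install command adds the --set argument, while a 0.9.x version adds nothing.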

Label workload

For programmatic profiling, you need to trigger the injection-webhook to inject metadata into pods. Label either the workload or its namespace with managed-mldiagnostics-gke=true before deploying the workload:

To label the workload, add the label to its metadata:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: single-host-tpu-v3-jobset2
  namespace: default
  labels:
    managed-mldiagnostics-gke: "true"

To label the namespace instead:

kubectl create namespace ai-workloads
kubectl label namespace ai-workloads managed-mldiagnostics-gke=true
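
Because the label must be in place before the workload is deployed, it can be worth checking for it first. The following is a minimal sketch using standard kubectl label selectors: the KUBECTL and LABEL variables and both function names are hypothetical, and the JobSet query assumes that the JobSet custom resource is installed in the cluster.

```shell
# Allow the kubectl binary to be overridden, for example KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"
LABEL="managed-mldiagnostics-gke=true"

# List the namespaces that carry the label.
labeled_namespaces() {
  "$KUBECTL" get namespaces --selector "$LABEL"
}

# List the JobSets in namespace $1 that carry the label.
labeled_jobsets() {
  "$KUBECTL" get jobsets --namespace "$1" --selector "$LABEL"
}
```

For example, labeled_namespaces should list ai-workloads after the commands above have run, and labeled_jobsets default should list a JobSet that was created with the label in its metadata.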