Deployment Guides — SkyPilot documentation
Below we include minimal guides to set up a new Kubernetes cluster in different environments, including hosted services on the cloud.
Local Development Cluster
Run a local Kubernetes cluster on your laptop with sky local up.
On-prem Clusters (RKE2, K3s, etc.)
For on-prem deployments with kubeadm, RKE2, K3s or other distributions.
Google Cloud - GKE
Google’s hosted Kubernetes service.
Amazon - EKS
Amazon’s hosted Kubernetes service.
On-demand Cloud VMs
We provide scripts to deploy k8s on on-demand cloud VMs.
Deploying locally on your laptop
To try out SkyPilot on Kubernetes on your laptop, or to run SkyPilot tasks locally without requiring any cloud access, we provide the sky local up CLI to create a 1-node Kubernetes cluster locally.
Under the hood, sky local up uses kind, a tool for creating a Kubernetes cluster on your local machine. It runs the Kubernetes cluster inside a Docker container, so no additional setup is required.
- Install Docker and kind.
- Run sky local up to launch a Kubernetes cluster and automatically configure your kubeconfig file.
- Run sky check and verify that Kubernetes is enabled in SkyPilot. You can now run SkyPilot tasks on this locally hosted Kubernetes cluster using sky launch.
- After you are done using the cluster, you can remove it with sky local down. This will destroy the local Kubernetes cluster and switch your kubeconfig back to its original context. An example end-to-end session is shown below.
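For reference, a minimal end-to-end session might look like this (the launch command and its flags are illustrative, not required):
$ sky local up
$ sky check
$ sky launch -y --infra k8s -- echo "hello from local Kubernetes"
$ sky local down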
Note
We recommend allocating at least 4 CPUs to your Docker runtime to ensure kind has enough resources. See instructions to increase CPU allocation here.
Note
kind does not support multiple nodes or GPUs, and it is not recommended for use in a production environment. If you want to run a private on-prem cluster, see the on-prem deployment section below.
Deploying on Google Cloud GKE
- Create a GKE standard cluster with at least 1 node. We recommend creating nodes with at least 4 vCPUs.
Example: create a GKE cluster with 2 nodes, each having 16 CPUs.
PROJECT_ID=$(gcloud config get-value project)
CLUSTER_NAME=testcluster
gcloud beta container --project "${PROJECT_ID}" clusters create "${CLUSTER_NAME}" --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.29.4-gke.1043002" --release-channel "regular" --machine-type "e2-standard-16" --image-type "COS_CONTAINERD" --disk-type "pd-balanced" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "2" --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM --enable-ip-alias --network "projects/${PROJECT_ID}/global/networks/default" --subnetwork "projects/${PROJECT_ID}/regions/us-central1/subnetworks/default" --no-enable-intra-node-visibility --default-max-pods-per-node "110" --security-posture=standard --workload-vulnerability-scanning=disabled --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-managed-prometheus --enable-shielded-nodes --node-locations "us-central1-c"
- Get the kubeconfig for your cluster. The following command will automatically update ~/.kube/config with a new kubecontext for the GKE cluster:
$ gcloud container clusters get-credentials <cluster-name> --region <region>
Example:
gcloud container clusters get-credentials testcluster --region us-central1-c
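Once the kubeconfig is updated, a quick sanity check (illustrative) is to confirm the new context is active and the nodes are reachable:
$ kubectl config current-context
$ kubectl get nodes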
- [If using GPUs] For GKE versions newer than 1.30.1-gke.115600, NVIDIA drivers are pre-installed and no additional setup is required. If you are using an older GKE version, you may need to manually install NVIDIA drivers for GPU support. You can do so by deploying the daemonset corresponding to the GPU and OS on your nodes:
For Container Optimized OS (COS) based nodes with GPUs other than Nvidia L4 (e.g., V100, A100, ...):
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
For Container Optimized OS (COS) based nodes with L4 GPUs:
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
For Ubuntu based nodes with GPUs other than Nvidia L4 (e.g., V100, A100, ...):
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
For Ubuntu based nodes with L4 GPUs:
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded-R525.yaml
Tip
To verify that GPU drivers are set up, run kubectl describe nodes and check that the nvidia.com/gpu resource is listed under the Capacity section.
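For example, the following illustrative command prints the nvidia.com/gpu entries reported by each node:
$ kubectl describe nodes | grep -i 'nvidia.com/gpu'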
- Verify your Kubernetes cluster is correctly set up for SkyPilot by running sky check.
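For example (output omitted; Kubernetes should be reported as enabled):
$ sky check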
- [If using GPUs] Check available GPUs in the Kubernetes cluster with sky show-gpus --infra k8s:
$ sky show-gpus --infra k8s
GPU REQUESTABLE_QTY_PER_NODE UTILIZATION
L4 1, 2, 4 6 of 8 free
A100 1, 2 2 of 4 free
Kubernetes per node GPU availability
NODE GPU UTILIZATION
my-cluster-0 L4 4 of 4 free
my-cluster-1 L4 2 of 4 free
my-cluster-2 A100 2 of 2 free
my-cluster-3 A100 0 of 2 free
Note
GKE Autopilot clusters are currently not supported; only GKE standard clusters are supported.
Deploying on Amazon EKS
- Create an EKS cluster with at least 1 node. We recommend creating nodes with at least 4 vCPUs.
- Get the kubeconfig for your cluster. The following command will automatically update ~/.kube/config with a new kubecontext for the EKS cluster:
$ aws eks update-kubeconfig --name <cluster-name> --region <region>
Example:
aws eks update-kubeconfig --name testcluster --region us-west-2
- [If using GPUs] EKS clusters already come with Nvidia drivers set up. However, you will need to label the nodes with the GPU type. Use the SkyPilot node labelling tool to do so:
python -m sky.utils.kubernetes.gpu_labeler
This will create a job on each node to read the GPU type from nvidia-smi and assign a skypilot.co/accelerator label to the node (a quick way to check the applied labels is shown after this list). You can check the status of these jobs by running:
kubectl get jobs -n kube-system
- Verify your Kubernetes cluster is correctly set up for SkyPilot by running sky check.
- [If using GPUs] Check available GPUs in the Kubernetes cluster with sky show-gpus --infra k8s:
$ sky show-gpus --infra k8s
GPU REQUESTABLE_QTY_PER_NODE UTILIZATION
A100 1, 2 2 of 2 free
Kubernetes per node GPU availability
NODE GPU UTILIZATION
my-cluster-0 A100 2 of 2 free
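Once the labeler jobs have finished, you can also spot-check the applied labels; the command below is illustrative and uses the skypilot.co/accelerator key from the labeling step above:
$ kubectl get nodes -L skypilot.co/accelerator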
Deploying on on-prem clusters
If you have a list of IP addresses and the SSH credentials for your on-prem cluster, you can follow our Using Existing Machines guide to set up SkyPilot on your on-prem cluster.
Alternatively, you can also deploy Kubernetes on your on-prem clusters using off-the-shelf tools such as kubeadm, k3s, or Rancher. Please follow their respective guides to deploy your Kubernetes cluster.
Notes for specific Kubernetes distributions
Some Kubernetes distributions require additional steps to set up GPU support.
Rancher Kubernetes Engine 2 (RKE2)
Nvidia GPU operator installation on RKE2 through Helm requires extra flags to set nvidia as the default runtime for containerd.
$ helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator $HELM_OPTIONS \
  --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
  --set 'toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl' \
  --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
  --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock' \
  --set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
  --set 'toolkit.env[2].value=nvidia' \
  --set 'toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT' \
  --set-string 'toolkit.env[3].value=true'
Refer to instructions on Nvidia GPU Operator installation with Helm on RKE2 for details.
K3s
Installing the Nvidia GPU operator on K3s is similar to the RKE2 instructions from Nvidia, but requires changing the CONTAINERD_CONFIG variable to /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl. Here is an example command to install the Nvidia GPU operator on K3s:
$ helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator $HELM_OPTIONS \
  --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
  --set 'toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml' \
  --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
  --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock' \
  --set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
  --set 'toolkit.env[2].value=nvidia'
Check the status of the GPU operator installation by running kubectl get pods -n gpu-operator. It takes a few minutes to install, and some CrashLoopBackOff errors are expected during the installation process.
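For example, you can watch the pods until they settle into Running or Completed states (illustrative):
$ kubectl get pods -n gpu-operator --watch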
Tip
If your gpu-operator installation stays stuck in CrashLoopBackOff, you may need to create a symlink to the ldconfig binary to work around a known issue with the nvidia-docker runtime. Run the following command on your nodes:
$ ln -s /sbin/ldconfig /sbin/ldconfig.real
After the GPU operator is installed, create the nvidia RuntimeClass required by K3s. This runtime class will automatically be used by SkyPilot to schedule GPU pods:
$ kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
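You can confirm that the RuntimeClass was created with (illustrative):
$ kubectl get runtimeclass nvidia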
Deploying on cloud VMs
You can also spin up on-demand cloud VMs and deploy Kubernetes on them.
We provide scripts to take care of provisioning VMs, installing Kubernetes, setting up GPU support and configuring your local kubeconfig. Refer to our Deploying Kubernetes on VMs guide for more details.