Quickstart: PyTorch — SkyPilot Docs
This example uses SkyPilot to train a GPT-like model (inspired by Karpathy’s minGPT) with Distributed Data Parallel (DDP) in PyTorch.
CLI
We define a SkyPilot YAML with the resource requirements, the setup commands, and the commands to run:
train.yaml
```yaml
name: minGPT-ddp

resources:
  cpus: 4+
  accelerators: L4:4  # Or A100:8, H100:8

# Optional: upload a working directory to remote ~/sky_workdir.
# Commands in "setup" and "run" will be executed under it.
#
# workdir: .

# Optional: upload local files.
# Format:
#   /remote/path: /local/path
#
# file_mounts:
#   ~/.vimrc: ~/.vimrc
#   ~/.netrc: ~/.netrc

setup: |
  git clone --depth 1 https://github.com/pytorch/examples || true
  cd examples
  git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
  pip install -r requirements.txt

run: |
  cd examples/mingpt
  export LOGLEVEL=INFO

  echo "Starting minGPT-ddp training"

  torchrun \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    main.py
```

Tip
In the YAML, the workdir and file_mounts fields are commented out. To learn about how to use them to mount local dirs/files or object store buckets (S3, GCS, R2) into your cluster, see Syncing Code, Git, and Files.
Tip
The SKYPILOT_NUM_GPUS_PER_NODE environment variable is automatically set by SkyPilot to the number of GPUs per node. See Environment Variables and Secrets for more.
Then, launch training:
$ sky launch -c mingpt train.yaml
Python
We use the Python SDK to create a task with the resource requirements, the setup commands, and the commands to run:
train.py
```python
import textwrap

import sky

minGPT_ddp_task = sky.Task(
    name='minGPT-ddp',
    resources=sky.Resources(
        cpus='4+',
        accelerators='L4:4',
    ),
    # Optional: upload a working directory to remote ~/sky_workdir.
    # Commands in "setup" and "run" will be executed under it.
    #
    # workdir='.',
    #
    # Optional: upload local files.
    # Format:
    #   /remote/path: /local/path
    #
    # file_mounts={
    #     '~/.vimrc': '~/.vimrc',
    #     '~/.netrc': '~/.netrc',
    # },
    setup=textwrap.dedent("""\
        git clone --depth 1 https://github.com/pytorch/examples || true
        cd examples
        git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
        pip install -r requirements.txt
    """),
    run=textwrap.dedent("""\
        cd examples/mingpt
        export LOGLEVEL=INFO
        torchrun --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE main.py
    """),
)

cluster_name = 'mingpt'
launch_request = sky.launch(task=minGPT_ddp_task, cluster_name=cluster_name)
job_id, _ = sky.stream_and_get(launch_request)
sky.tail_logs(cluster_name, job_id, follow=True)
```
Tip
In the code, the workdir and file_mounts fields are commented out. To learn about how to use them to mount local dirs/files or object store buckets (S3, GCS, R2) into your cluster, see Syncing Code, Git, and Files.
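As a quick illustration of this tip, here is a hedged sketch of what the commented-out fields could look like when enabled. The `/data` mount point and the `s3://my-training-data` bucket are made-up placeholders, and `set_file_mounts` is used here as an alternative to the constructor argument shown above:

```python
# Sketch only: enable the optional fields on the task defined in train.py above.
# Uncommenting `workdir='.'` uploads the current directory to ~/sky_workdir on the
# cluster; the setup/run commands are then executed inside it.

# file_mounts sources can be local paths or object store URIs (S3, GCS, R2).
# '/data' and the bucket name below are placeholders for this sketch.
minGPT_ddp_task.set_file_mounts({
    '~/.vimrc': '~/.vimrc',            # copy a local file to the same remote path
    '/data': 's3://my-training-data',  # copy the bucket's contents to /data
})
```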
Tip
The SKYPILOT_NUM_GPUS_PER_NODE environment variable is automatically set by SkyPilot to the number of GPUs per node. See Environment Variables and Secrets for more.
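If your own training script needs the GPU count directly (rather than passing it to torchrun on the command line), it can read the variable itself; a minimal sketch, where the fallback value of 1 is an assumption for running outside SkyPilot:

```python
import os

# SKYPILOT_NUM_GPUS_PER_NODE is set by SkyPilot on every node of the cluster.
# Fall back to 1 so the same script still runs outside SkyPilot (e.g. locally).
num_gpus_per_node = int(os.environ.get('SKYPILOT_NUM_GPUS_PER_NODE', '1'))
print(f'GPUs per node: {num_gpus_per_node}')
```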
Then, run the code:

$ python train.py
This will provision the cheapest cluster with the required resources, execute the setup commands, then execute the run commands.
After the training job starts running, you can safely Ctrl-C to detach from logging and the job will continue to run remotely on the cluster. To stop the job, use the sky cancel <cluster_name> <job_id> command (refer to CLI reference).
After training, transfer artifacts such as logs and checkpoints using familiar tools.
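For example, `sky launch` adds an SSH alias named after the cluster to your SSH config, so standard tools such as rsync or scp can pull files back. A sketch (the remote path below is an assumption based on where the setup step clones the repo; adjust it to wherever your job writes checkpoints):

```python
import subprocess

CLUSTER = 'mingpt'                 # cluster name used at launch
REMOTE_DIR = '~/examples/mingpt/'  # assumed output location; adjust as needed
LOCAL_DIR = './artifacts/'

# Plain rsync over the SSH alias that SkyPilot created for the cluster.
subprocess.run(['rsync', '-avz', f'{CLUSTER}:{REMOTE_DIR}', LOCAL_DIR], check=True)
```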
Tip
Feel free to copy-paste the YAML or Python code above and customize it for your own project.
Scale up the training with multiple nodes
To scale training across multiple nodes, add num_nodes to your task YAML:
```yaml
resources:
  cpus: 4+
  accelerators: L4:4

# Use 2 nodes with 4 GPUs each (8 GPUs total)
num_nodes: 2
```
And update the run command to use torchrun’s multi-node flags:
```yaml
run: |
  cd examples/mingpt
  export LOGLEVEL=INFO

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"

  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    --node_rank=${SKYPILOT_NODE_RANK} \
    main.py
```

Tip
SkyPilot automatically sets these environment variables for distributed training:
- SKYPILOT_NUM_NODES: Total number of nodes
- SKYPILOT_NUM_GPUS_PER_NODE: Number of GPUs per node
- SKYPILOT_NODE_RANK: Rank of the current node (0, 1, 2, …)
- SKYPILOT_NODE_IPS: IP addresses of all nodes, one per line
See Environment Variables and Secrets for more.
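If you are following the Python SDK path from earlier, the same scale-up can be sketched by setting `num_nodes` on the task and reusing the multi-node run command above. This mirrors the YAML rather than being a separate official example; the cluster name `mingpt-multinode` is a placeholder:

```python
import textwrap

import sky

multi_node_task = sky.Task(
    name='minGPT-ddp',
    resources=sky.Resources(cpus='4+', accelerators='L4:4'),
    num_nodes=2,  # 2 nodes x 4 GPUs each = 8 GPUs total
    setup=textwrap.dedent("""\
        git clone --depth 1 https://github.com/pytorch/examples || true
        cd examples
        git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
        pip install -r requirements.txt
    """),
    run=textwrap.dedent("""\
        cd examples/mingpt
        export LOGLEVEL=INFO

        MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
        echo "Starting distributed training, head node: $MASTER_ADDR"

        torchrun \\
          --nnodes=$SKYPILOT_NUM_NODES \\
          --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \\
          --master_addr=$MASTER_ADDR \\
          --master_port=8008 \\
          --node_rank=${SKYPILOT_NODE_RANK} \\
          main.py
    """),
)

launch_request = sky.launch(task=multi_node_task, cluster_name='mingpt-multinode')
job_id, _ = sky.stream_and_get(launch_request)
sky.tail_logs('mingpt-multinode', job_id, follow=True)
```

A fresh cluster name is used here because the single-node cluster launched earlier has a different node count.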
For more details, see Distributed Training with PyTorch.