GitHub - AI-Hypercomputer/gpu-recipes: Recipes for reproducing training and serving benchmarks for large machine learning models using GPUs on Google Cloud.
Cloud GPU performance benchmark recipes
This repository contains recipes that provide instructions to reproduce specific workload performance measurements, which are part of a confidential benchmarking program. These recipes focus on helping you reliably achieve performance metrics, such as throughput, that demonstrate the combined hardware and software stack on GPUs.
Note: The recipes in this repository are not designed as general-purpose code samples or tutorials for using Compute Engine-based products.
Intended audience
This content is for you if you are a customer or partner who needs to:
- Validate hardware performance with your suppliers.
- Inform purchasing decisions using the benchmarking data.
- Reproduce optimal performance scenarios before you customize workflows for your own requirements.
How to use these recipes
To reproduce a benchmark, follow these steps:
- Identify your requirements: determine the model, GPU type, workload, framework, and orchestrator that you are interested in.
- Select a recipe: based on your requirements, use the Benchmark support matrix to find a recipe that meets your needs.
- Follow the recipe: each recipe provides procedures to complete the following tasks:
  - prepare your environment.
  - run the benchmark.
  - analyze the benchmark results. The output includes not only the results but also detailed logs for further analysis.
You can automate your infrastructure setup using Cluster Toolkit. For more information, see Automated GPU environment deployment with Cluster Toolkit.
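The recipe selection step can be pictured as a filter over rows of the support matrix. The following is a minimal sketch only: the `Recipe` class, the `find_recipes` helper, and the `RECIPES` list are hypothetical names introduced for illustration and are not part of this repository. The four rows are copied from the tables in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """One row of the support matrix (hypothetical structure, for illustration)."""
    model: str
    machine: str
    framework: str
    workload: str
    orchestrator: str


# Illustrative subset of rows copied from the tables in this README.
RECIPES = [
    Recipe("Llama-3.1-70B", "A3 Ultra", "MaxText", "Pre-training", "GKE"),
    Recipe("Llama-3.1-70B", "A4", "MaxText", "Pre-training", "GKE"),
    Recipe("Llama-3.1-70B", "A4", "Megatron-Bridge (25.09)", "Pre-training", "Slurm"),
    Recipe("DeepSeek R1 671B", "A4", "vLLM", "Inference", "GKE"),
]


def find_recipes(recipes, **requirements):
    """Return the recipes whose fields equal every stated requirement."""
    return [
        r for r in recipes
        if all(getattr(r, field) == value for field, value in requirements.items())
    ]


# Example: which recipes cover Llama-3.1-70B on A4 machines?
matches = find_recipes(RECIPES, model="Llama-3.1-70B", machine="A4")
for r in matches:
    print(r.framework, "on", r.orchestrator)
```

Leaving a requirement out, such as the orchestrator, keeps every matching row, which is useful when you have not yet decided between GKE and Slurm.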
Benchmarks support matrix
Training benchmarks A3 Mega
| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
|---|---|---|---|---|---|
| GPT3-175B | A3 Mega (NVIDIA H100) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3-70B | A3 Mega (NVIDIA H100) | NeMo (25.07) | Pre-training | GKE | Link |
| Mixtral-8x7B | A3 Mega (NVIDIA H100) | NeMo (25.07) | Pre-training | GKE | Link |
Training benchmarks A3 Ultra
| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
|---|---|---|---|---|---|
| Llama-3.1-70B | A3 Ultra (NVIDIA H200) | MaxText | Pre-training | GKE | Link |
| Llama-3.1-70B | A3 Ultra (NVIDIA H200) | NeMo (24.07) | Pre-training | GKE | Link |
| Llama-3-70B | A3 Ultra (NVIDIA H200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Llama-3-70B | A3 Ultra (NVIDIA H200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
| Llama-3-8B | A3 Ultra (NVIDIA H200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
| Llama-3.1-405B | A3 Ultra (NVIDIA H200) | MaxText | Pre-training | GKE | Link |
| Llama-3.1-405B | A3 Ultra (NVIDIA H200) | NeMo (24.12) | Pre-training | GKE | Link |
| Mixtral-8x7B | A3 Ultra (NVIDIA H200) | NeMo (24.07) | Pre-training | GKE | Link |
| DeepSeek-V3 | A3 Ultra (NVIDIA H200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| GPT OSS 120B | A3 Ultra (NVIDIA H200) | NeMo (26.02) | Pre-training | GKE | Link |
| Qwen-3-30B | A3 Ultra (NVIDIA H200) | NeMo (26.02) | Pre-training | GKE | Link |
| Wan-2.1 | A3 Ultra (NVIDIA H200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
Training benchmarks A4
| Models | GPU Machine Type | Framework / Library | Workload Type | Orchestrator | Link to the recipe |
|---|---|---|---|---|---|
| Llama-3.1-70B | A4 (NVIDIA B200) | MaxText | Pre-training | GKE | Link |
| Llama-3.1-70B | A4 (NVIDIA B200) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3.1-70B | A4 (NVIDIA B200) | NeMo (26.02) | Pre-training | GKE | Link |
| Llama-3.1-70B | A4 (NVIDIA B200) | Megatron-Bridge (25.09) | Pre-training | Slurm | Link |
| Llama-3.1-405B | A4 (NVIDIA B200) | MaxText | Pre-training | GKE | Link |
| Llama-3.1-405B | A4 (NVIDIA B200) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4 (NVIDIA B200) | NeMo (26.02) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4 (NVIDIA B200) | Megatron-Bridge (25.09) | Pre-training | Slurm | Link |
| Mixtral-8x7B | A4 (NVIDIA B200) | NeMo (25.07) | Pre-training | GKE | Link |
| PaliGemma2 | A4 (NVIDIA B200) | Hugging Face Accelerate | Finetuning | GKE | Link |
| DeepSeek-V3 | A4 (NVIDIA B200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| DeepSeek-V3 | A4 (NVIDIA B200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| GPT OSS 120B | A4 (NVIDIA B200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Llama-3-8B | A4 (NVIDIA B200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Qwen-3-235B | A4 (NVIDIA B200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| Qwen-3-235B | A4 (NVIDIA B200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Qwen-3-235B | A4 (NVIDIA B200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
| Qwen-3-30B | A4 (NVIDIA B200) | NeMo (26.02) | Pre-training | GKE | Link |
| Wan-2.1-14B | A4 (NVIDIA B200) | NeMo (25.11) | Pre-training | GKE | Link |
Training benchmarks A4X
| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
|---|---|---|---|---|---|
| Llama-3.1-8B | A4X (NVIDIA GB200) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3.1-8B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| Llama-3.1-8B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
| Llama-3.1-70B | A4X (NVIDIA GB200) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3.1-70B | A4X (NVIDIA GB200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4X (NVIDIA GB200) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4X (NVIDIA GB200) | NeMo (26.02) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4X (NVIDIA GB200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4X (NVIDIA GB200) | Megatron-Bridge (25.09) | Pre-training | Slurm | Link |
| Nemotron-4-340B | A4X (NVIDIA GB200) | NeMo (25.09) | Pre-training | GKE | Link |
| Wan-2.1-14B | A4X (NVIDIA GB200) | NeMo (25.11) | Pre-training | GKE | Link |
| Wan-2.1-14B | A4X (NVIDIA GB200) | NeMo (26.02) | Pre-training | GKE | Link |
| Wan-2.1-14B | A4X (NVIDIA GB200) | NeMo (25.11) | Pre-training | Slurm | Link |
| DeepSeek-V3 | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| Qwen-3-235B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| Qwen-3-235B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
| Qwen-3-30B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| Qwen-3-30B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
Inference benchmarks A3 Mega
| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
|---|---|---|---|---|---|
| Llama-4 | A3 Mega (NVIDIA H100) | SGLang | Inference | GKE | Link |
| DeepSeek R1 671B | A3 Mega (NVIDIA H100) | SGLang | Inference | GKE | Link |
| DeepSeek R1 671B | A3 Mega (NVIDIA H100) | vLLM | Inference | GKE | Link |
Inference benchmarks A3 Ultra
| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
|---|---|---|---|---|---|
| GPT OSS 120B | A3 Ultra (NVIDIA H200) | vLLM | Inference | GKE | Link |
| Llama-4 | A3 Ultra (NVIDIA H200) | vLLM | Inference | GKE | Link |
| Llama-3.1-405B | A3 Ultra (NVIDIA H200) | TensorRT-LLM | Inference | GKE | Link |
| DeepSeek R1 671B | A3 Ultra (NVIDIA H200) | SGLang | Inference | GKE | Link |
| DeepSeek R1 671B | A3 Ultra (NVIDIA H200) | vLLM | Inference | GKE | Link |
Inference benchmarks A4
| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
|---|---|---|---|---|---|
| DeepSeek R1 671B | A4 (NVIDIA B200) | vLLM | Inference | GKE | Link |
| DeepSeek R1 671B | A4 (NVIDIA B200) | SGLang | Inference | GKE | Link |
| DeepSeek R1 671B | A4 (NVIDIA B200) | TensorRT-LLM | Inference | GKE | Link |
| Llama 3.1 405B | A4 (NVIDIA B200) | TensorRT-LLM | Inference | GKE | Link |
| Qwen 2.5 VL 7B | A4 (NVIDIA B200) | TensorRT-LLM | Inference | GKE | Link |
| Qwen 3 235B A22B | A4 (NVIDIA B200) | TensorRT-LLM | Inference | GKE | Link |
| Qwen 3 32B | A4 (NVIDIA B200) | TensorRT-LLM | Inference | GKE | Link |
Inference benchmarks A4X
| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
|---|---|---|---|---|---|
| DeepSeek R1 671B | A4X (NVIDIA GB200) | vLLM (v0.14.0rc1) | Inference | GKE | Link |
| Wan2.2 T2V A14B Diffusers | A4X (NVIDIA GB200) | SGLang (latest) | Inference | GKE | Link |
| Wan2.2 I2V A14B Diffusers | A4X (NVIDIA GB200) | SGLang (latest) | Inference | GKE | Link |
| DeepSeek R1 671B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link; Link for using Google Cloud Storage (GCS) as storage option; Link for using Lustre as storage option |
| Llama 3.1 405B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Llama 3.1 70B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Llama 3.1 8B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Qwen 2.5 VL 7B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Qwen 3 235B A22B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Qwen 3 32B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Qwen 3 4B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
Inference benchmarks G4
| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
|---|---|---|---|---|---|
| Qwen3 8B | G4 (NVIDIA RTX PRO 6000 Blackwell) | vLLM | Inference | GCE | Link |
| Qwen3 30B A3B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Qwen3 4B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Qwen3 8B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Qwen3 32B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Qwen3 32B | G4 (NVIDIA RTX PRO 6000 Blackwell) | vLLM | Inference | GCE | Link |
| Llama3.1 70B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| DeepSeek R1 | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Qwen3 235B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Wan2.2 14B | G4 (NVIDIA RTX PRO 6000 Blackwell) | SGLang | Inference | GCE | Link |
Checkpointing benchmarks
| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
|---|---|---|---|---|---|
| Llama-3.1-70B | A3 Mega (NVIDIA H100) | NeMo | Pre-training using Google Cloud Storage buckets for checkpoints | GKE | Link |
Goodput benchmarks
| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
|---|---|---|---|---|---|
| Llama-3.1-70B | A3 Mega (NVIDIA H100) | NeMo | Pre-training using the Google Cloud Resiliency library | GKE | Link |
| Llama-3.1-405B | A3 Ultra (NVIDIA H200) | NeMo | Pre-training using the Google Cloud Resiliency library | GKE | Link |
| Mixtral-8x7B | A3 Ultra (NVIDIA H200) | NeMo | Pre-training using the Google Cloud Resiliency library | GKE | Link |
Repository organization
- ./training: this directory contains recipes with instructions to reproduce training benchmarks with GPUs.
- ./inference: this directory contains recipes with instructions to reproduce inference benchmarks with GPUs.
- ./src: this directory contains the shared dependencies required to run benchmarks, such as Docker images and Helm charts.
- ./docs: this directory contains supporting documentation that explains benchmark methodologies or configurations.
Repository scope
This repository provides the steps that you can use to reproduce a specific benchmark. The actual performance measurements and the complete, confidential benchmark report are not included.
Methodology
Performance benchmarks measure the performance of various workloads on the platform. These benchmarks are primarily used to validate performance with hardware suppliers and to provide you with data for purchasing decisions.
Maintenance policy
Benchmark data is considered a point-in-time measurement and completed benchmarks are not repeated. We maintain and update the recipes in this repository on a best-effort basis.
Resources
For general guidance on how to get started using Compute products, refer to the official documentation and tutorials:
- Compute Engine overview
- Compute Engine samples
- Cloud GPU documentation
- AI Hypercomputer documentation
- Automated GPU environment deployment with Cluster Toolkit
Report issues
If you have questions or encounter problems with this repository, report them through GitHub Issues or reach out to your Google Cloud account team for assistance.
Contributor notes
Note: This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.