GitHub - AI-Hypercomputer/gpu-recipes: Recipes for reproducing training and serving benchmarks for large machine learning models using GPUs on Google Cloud.

Cloud GPU performance benchmark recipes

This repository contains recipes that provide instructions to reproduce specific workload performance measurements, which are part of a confidential benchmarking program. These recipes focus on helping you reliably achieve performance metrics, such as throughput, that demonstrate the capabilities of the combined hardware and software stack on GPUs.

Note: The recipes in this repository are not designed as general-purpose code samples or tutorials for using Compute Engine-based products.

Intended audience

This content is for you if you are a customer or partner who needs to:

Validate hardware performance with your suppliers.
Inform purchasing decisions using the benchmarking data.
Reproduce optimal performance scenarios before you customize workflows for your own requirements.

How to use these recipes

To reproduce a benchmark, follow these steps:

Identify your requirements: determine the model, GPU type, workload, framework, and orchestrator that you are interested in.
Select a recipe: based on your requirements, use the Benchmarks support matrix to find a recipe that meets your needs.
Follow the recipe: each recipe will provide you with procedures to complete the following tasks:
- prepare your environment.
- run the benchmark.
- analyze the benchmark results. This includes not just the results but also detailed logs for further analysis.

You can automate your infrastructure setup using Cluster Toolkit. For more information, see Automated GPU environment deployment with Cluster Toolkit.
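The recipe selection step above can be sketched in a few lines of Python. This is only an illustration, not part of the repository: the `Recipe` class and `find_recipes` helper are hypothetical names, and the sample rows are a small subset copied from the support matrix below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """One row of the support matrix (hypothetical representation)."""
    model: str
    machine: str
    framework: str
    workload: str
    orchestrator: str


# A few rows copied from the support matrix; the full matrix is much longer.
MATRIX = [
    Recipe("Llama-3.1-70B", "A3 Ultra (NVIDIA H200)", "MaxText", "Pre-training", "GKE"),
    Recipe("Llama-3.1-70B", "A4 (NVIDIA B200)", "Megatron-Bridge (25.09)", "Pre-training", "Slurm"),
    Recipe("Llama-3.1-405B", "A4 (NVIDIA B200)", "MaxText", "Pre-training", "GKE"),
    Recipe("DeepSeek R1 671B", "A4 (NVIDIA B200)", "vLLM", "Inference", "GKE"),
    Recipe("Qwen3 8B", "G4 (NVIDIA RTX PRO 6000 Blackwell)", "vLLM", "Inference", "GCE"),
]


def find_recipes(matrix, **requirements):
    """Return rows where every requested field contains the wanted text.

    Matching is a case-insensitive substring test, so machine="B200"
    matches "A4 (NVIDIA B200)".
    """
    def matches(recipe):
        return all(
            wanted.lower() in getattr(recipe, field).lower()
            for field, wanted in requirements.items()
        )
    return [r for r in matrix if matches(r)]


if __name__ == "__main__":
    # Requirements: pre-training on B200 GPUs, any framework or orchestrator.
    for r in find_recipes(MATRIX, machine="B200", workload="Pre-training"):
        print(r.model, "|", r.framework, "|", r.orchestrator)
```

Substring matching keeps the query short, but a loose value such as `machine="A4"` would also match A4X rows in the full matrix, so filtering on the GPU name is safer.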

Benchmarks support matrix

Training benchmarks A3 Mega

Models	GPU Machine Type	Framework	Workload Type	Orchestrator	Link to the recipe
GPT3-175B	A3 Mega (NVIDIA H100)	NeMo (25.07)	Pre-training	GKE	Link
Llama-3-70B	A3 Mega (NVIDIA H100)	NeMo (25.07)	Pre-training	GKE	Link
Mixtral-8x7B	A3 Mega (NVIDIA H100)	NeMo (25.07)	Pre-training	GKE	Link

Training benchmarks A3 Ultra

Models	GPU Machine Type	Framework	Workload Type	Orchestrator	Link to the recipe
Llama-3.1-70B	A3 Ultra (NVIDIA H200)	MaxText	Pre-training	GKE	Link
Llama-3.1-70B	A3 Ultra (NVIDIA H200)	NeMo (24.07)	Pre-training	GKE	Link
Llama-3-70B	A3 Ultra (NVIDIA H200)	Megatron-Bridge (26.02)	Pre-training	GKE	Link
Llama-3-70B	A3 Ultra (NVIDIA H200)	Megatron-Bridge (25.11)	Pre-training	Slurm	Link
Llama-3-8B	A3 Ultra (NVIDIA H200)	Megatron-Bridge (25.11)	Pre-training	Slurm	Link
Llama-3.1-405B	A3 Ultra (NVIDIA H200)	MaxText	Pre-training	GKE	Link
Llama-3.1-405B	A3 Ultra (NVIDIA H200)	NeMo (24.12)	Pre-training	GKE	Link
Mixtral-8x7B	A3 Ultra (NVIDIA H200)	NeMo (24.07)	Pre-training	GKE	Link
DeepSeek-V3	A3 Ultra (NVIDIA H200)	Megatron-Bridge (26.02)	Pre-training	GKE	Link
GPT OSS 120B	A3 Ultra (NVIDIA H200)	NeMo (26.02)	Pre-training	GKE	Link
Qwen-3-30B	A3 Ultra (NVIDIA H200)	NeMo (26.02)	Pre-training	GKE	Link
Wan-2.1	A3 Ultra (NVIDIA H200)	Megatron-Bridge (26.02)	Pre-training	GKE	Link

Training benchmarks A4

Models	GPU Machine Type	Framework / Library	Workload Type	Orchestrator	Link to the recipe
Llama-3.1-70B	A4 (NVIDIA B200)	MaxText	Pre-training	GKE	Link
Llama-3.1-70B	A4 (NVIDIA B200)	NeMo (25.07)	Pre-training	GKE	Link
Llama-3.1-70B	A4 (NVIDIA B200)	NeMo (26.02)	Pre-training	GKE	Link
Llama-3.1-70B	A4 (NVIDIA B200)	Megatron-Bridge (25.09)	Pre-training	Slurm	Link
Llama-3.1-405B	A4 (NVIDIA B200)	MaxText	Pre-training	GKE	Link
Llama-3.1-405B	A4 (NVIDIA B200)	NeMo (25.07)	Pre-training	GKE	Link
Llama-3.1-405B	A4 (NVIDIA B200)	NeMo (26.02)	Pre-training	GKE	Link
Llama-3.1-405B	A4 (NVIDIA B200)	Megatron-Bridge (25.09)	Pre-training	Slurm	Link
Mixtral-8x7B	A4 (NVIDIA B200)	NeMo (25.07)	Pre-training	GKE	Link
PaliGemma2	A4 (NVIDIA B200)	Hugging Face Accelerate	Fine-tuning	GKE	Link
DeepSeek-V3	A4 (NVIDIA B200)	Megatron-Bridge (25.11)	Pre-training	GKE	Link
DeepSeek-V3	A4 (NVIDIA B200)	Megatron-Bridge (26.02)	Pre-training	GKE	Link
GPT OSS 120B	A4 (NVIDIA B200)	Megatron-Bridge (26.02)	Pre-training	GKE	Link
Llama-3-8B	A4 (NVIDIA B200)	Megatron-Bridge (26.02)	Pre-training	GKE	Link
Qwen-3-235B	A4 (NVIDIA B200)	Megatron-Bridge (25.11)	Pre-training	GKE	Link
Qwen-3-235B	A4 (NVIDIA B200)	Megatron-Bridge (26.02)	Pre-training	GKE	Link
Qwen-3-235B	A4 (NVIDIA B200)	Megatron-Bridge (25.11)	Pre-training	Slurm	Link
Qwen-3-30B	A4 (NVIDIA B200)	NeMo (26.02)	Pre-training	GKE	Link
Wan-2.1-14B	A4 (NVIDIA B200)	NeMo (25.11)	Pre-training	GKE	Link

Training benchmarks A4X

Models	GPU Machine Type	Framework	Workload Type	Orchestrator	Link to the recipe
Llama-3.1-8B	A4X (NVIDIA GB200)	NeMo (25.07)	Pre-training	GKE	Link
Llama-3.1-8B	A4X (NVIDIA GB200)	Megatron-Bridge (25.11)	Pre-training	GKE	Link
Llama-3.1-8B	A4X (NVIDIA GB200)	Megatron-Bridge (25.11)	Pre-training	Slurm	Link
Llama-3.1-70B	A4X (NVIDIA GB200)	NeMo (25.07)	Pre-training	GKE	Link
Llama-3.1-70B	A4X (NVIDIA GB200)	Megatron-Bridge (26.02)	Pre-training	GKE	Link
Llama-3.1-405B	A4X (NVIDIA GB200)	NeMo (25.07)	Pre-training	GKE	Link
Llama-3.1-405B	A4X (NVIDIA GB200)	NeMo (26.02)	Pre-training	GKE	Link
Llama-3.1-405B	A4X (NVIDIA GB200)	Megatron-Bridge (26.02)	Pre-training	GKE	Link
Llama-3.1-405B	A4X (NVIDIA GB200)	Megatron-Bridge (25.09)	Pre-training	Slurm	Link
Nemotron-4-340B	A4X (NVIDIA GB200)	NeMo (25.09)	Pre-training	GKE	Link
Wan-2.1-14B	A4X (NVIDIA GB200)	NeMo (25.11)	Pre-training	GKE	Link
Wan-2.1-14B	A4X (NVIDIA GB200)	NeMo (26.02)	Pre-training	GKE	Link
Wan-2.1-14B	A4X (NVIDIA GB200)	NeMo (25.11)	Pre-training	Slurm	Link
DeepSeek-V3	A4X (NVIDIA GB200)	Megatron-Bridge (25.11)	Pre-training	GKE	Link
Qwen-3-235B	A4X (NVIDIA GB200)	Megatron-Bridge (25.11)	Pre-training	GKE	Link
Qwen-3-235B	A4X (NVIDIA GB200)	Megatron-Bridge (25.11)	Pre-training	Slurm	Link
Qwen-3-30B	A4X (NVIDIA GB200)	Megatron-Bridge (25.11)	Pre-training	GKE	Link
Qwen-3-30B	A4X (NVIDIA GB200)	Megatron-Bridge (25.11)	Pre-training	Slurm	Link

Inference benchmarks A3 Mega

Models	GPU Machine Type	Framework	Workload Type	Orchestrator	Link to the recipe
Llama-4	A3 Mega (NVIDIA H100)	SGLang	Inference	GKE	Link
DeepSeek R1 671B	A3 Mega (NVIDIA H100)	SGLang	Inference	GKE	Link
DeepSeek R1 671B	A3 Mega (NVIDIA H100)	vLLM	Inference	GKE	Link

Inference benchmarks A3 Ultra

Models	GPU Machine Type	Framework	Workload Type	Orchestrator	Link to the recipe
GPT OSS 120B	A3 Ultra (NVIDIA H200)	vLLM	Inference	GKE	Link
Llama-4	A3 Ultra (NVIDIA H200)	vLLM	Inference	GKE	Link
Llama-3.1-405B	A3 Ultra (NVIDIA H200)	TensorRT-LLM	Inference	GKE	Link
DeepSeek R1 671B	A3 Ultra (NVIDIA H200)	SGLang	Inference	GKE	Link
DeepSeek R1 671B	A3 Ultra (NVIDIA H200)	vLLM	Inference	GKE	Link

Inference benchmarks A4

Models	GPU Machine Type	Framework	Workload Type	Orchestrator	Link to the recipe
DeepSeek R1 671B	A4 (NVIDIA B200)	vLLM	Inference	GKE	Link
DeepSeek R1 671B	A4 (NVIDIA B200)	SGLang	Inference	GKE	Link
DeepSeek R1 671B	A4 (NVIDIA B200)	TensorRT-LLM	Inference	GKE	Link
Llama 3.1 405B	A4 (NVIDIA B200)	TensorRT-LLM	Inference	GKE	Link
Qwen 2.5 VL 7B	A4 (NVIDIA B200)	TensorRT-LLM	Inference	GKE	Link
Qwen 3 235B A22B	A4 (NVIDIA B200)	TensorRT-LLM	Inference	GKE	Link
Qwen 3 32B	A4 (NVIDIA B200)	TensorRT-LLM	Inference	GKE	Link

Inference benchmarks A4X

Models	GPU Machine Type	Framework	Workload Type	Orchestrator	Link to the recipe
DeepSeek R1 671B	A4X (NVIDIA GB200)	vLLM (v0.14.0rc1)	Inference	GKE	Link
Wan2.2 T2V A14B Diffusers	A4X (NVIDIA GB200)	SGLang (latest)	Inference	GKE	Link
Wan2.2 I2V A14B Diffusers	A4X (NVIDIA GB200)	SGLang (latest)	Inference	GKE	Link
DeepSeek R1 671B	A4X (NVIDIA GB200)	TensorRT-LLM (1.3.0rc5)	Inference	GKE	Link; Link for using Google Cloud Storage (GCS) as storage option; Link for using Lustre as storage option
Llama 3.1 405B	A4X (NVIDIA GB200)	TensorRT-LLM (1.3.0rc5)	Inference	GKE	Link
Llama 3.1 70B	A4X (NVIDIA GB200)	TensorRT-LLM (1.3.0rc5)	Inference	GKE	Link
Llama 3.1 8B	A4X (NVIDIA GB200)	TensorRT-LLM (1.3.0rc5)	Inference	GKE	Link
Qwen 2.5 VL 7B	A4X (NVIDIA GB200)	TensorRT-LLM (1.3.0rc5)	Inference	GKE	Link
Qwen 3 235B A22B	A4X (NVIDIA GB200)	TensorRT-LLM (1.3.0rc5)	Inference	GKE	Link
Qwen 3 32B	A4X (NVIDIA GB200)	TensorRT-LLM (1.3.0rc5)	Inference	GKE	Link
Qwen 3 4B	A4X (NVIDIA GB200)	TensorRT-LLM (1.3.0rc5)	Inference	GKE	Link

Inference benchmarks G4

Models	GPU Machine Type	Framework	Workload Type	Orchestrator	Link to the recipe
Qwen3 8B	G4 (NVIDIA RTX PRO 6000 Blackwell)	vLLM	Inference	GCE	Link
Qwen3 30B A3B	G4 (NVIDIA RTX PRO 6000 Blackwell)	TensorRT-LLM	Inference	GCE	Link
Qwen3 4B	G4 (NVIDIA RTX PRO 6000 Blackwell)	TensorRT-LLM	Inference	GCE	Link
Qwen3 8B	G4 (NVIDIA RTX PRO 6000 Blackwell)	TensorRT-LLM	Inference	GCE	Link
Qwen3 32B	G4 (NVIDIA RTX PRO 6000 Blackwell)	TensorRT-LLM	Inference	GCE	Link
Qwen3 32B	G4 (NVIDIA RTX PRO 6000 Blackwell)	vLLM	Inference	GCE	Link
Llama3.1 70B	G4 (NVIDIA RTX PRO 6000 Blackwell)	TensorRT-LLM	Inference	GCE	Link
DeepSeek R1	G4 (NVIDIA RTX PRO 6000 Blackwell)	TensorRT-LLM	Inference	GCE	Link
Qwen3 235B	G4 (NVIDIA RTX PRO 6000 Blackwell)	TensorRT-LLM	Inference	GCE	Link
Wan2.2 14B	G4 (NVIDIA RTX PRO 6000 Blackwell)	SGLang	Inference	GCE	Link

Checkpointing benchmarks

Models	GPU Machine Type	Framework	Workload Type	Orchestrator	Link to the recipe
Llama-3.1-70B	A3 Mega (NVIDIA H100)	NeMo	Pre-training using Google Cloud Storage buckets for checkpoints	GKE	Link

Goodput benchmarks

Models	GPU Machine Type	Framework	Workload Type	Orchestrator	Link to the recipe
Llama-3.1-70B	A3 Mega (NVIDIA H100)	NeMo	Pre-training using the Google Cloud Resiliency library	GKE	Link
Llama-3.1-405B	A3 Ultra (NVIDIA H200)	NeMo	Pre-training using the Google Cloud Resiliency library	GKE	Link
Mixtral-8x7B	A3 Ultra (NVIDIA H200)	NeMo	Pre-training using the Google Cloud Resiliency library	GKE	Link

Repository organization

./training: this directory contains recipes with instructions to reproduce training benchmarks with GPUs.
./inference: this directory contains recipes with instructions to reproduce inference benchmarks with GPUs.
./src: this directory contains the shared dependencies required to run benchmarks, such as Docker images and Helm charts.
./docs: this directory contains supporting documentation for explanations of benchmark methodologies or configurations.

Repository scope

This repository provides the steps that you can use to reproduce a specific benchmark. The actual performance measurements and the complete, confidential benchmark report are not included.

Methodology

Performance benchmarks measure the performance of various workloads on the platform. These benchmarks are primarily used to validate performance with hardware suppliers and to provide you with data for purchasing decisions.

Maintenance policy

Benchmark data is considered a point-in-time measurement and completed benchmarks are not repeated. We maintain and update the recipes in this repository on a best-effort basis.

Resources

For general guidance on how to get started using Compute products, refer to the official documentation and tutorials.

Report issues

If you have questions or encounter problems with this repository, report them through GitHub Issues or reach out to your Google Cloud account team for assistance.

Contributor notes

Note: This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.