GitHub - AI-Hypercomputer/gpu-recipes: Recipes for reproducing training and serving benchmarks for large machine learning models using GPUs on Google Cloud.

Cloud GPU performance benchmark recipes

This repository contains recipes that provide instructions to reproduce specific workload performance measurements, which are part of a confidential benchmarking program. These recipes focus on helping you reliably achieve performance metrics, such as throughput, that demonstrate the performance of the combined hardware and software stack on GPUs.

Note: The recipes in this repository are not designed as general-purpose code samples or tutorials for using Compute Engine-based products.

Intended audience

This content is for you if you are a customer or partner who needs to:

How to use these recipes

To reproduce a benchmark, follow these steps:

  1. Identify your requirements: determine the model, GPU type, workload, framework, and orchestrator that you are interested in.
  2. Select a recipe: based on your requirements, use the Benchmarks support matrix to find a recipe that meets your needs.
  3. Follow the recipe: each recipe will provide you with procedures to complete the following tasks:
    • prepare your environment.
    • run the benchmark.
    • analyze the benchmark results. This includes not just the results but also detailed logs for further analysis.

You can automate your infrastructure setup using Cluster Toolkit. For more information, see Automated GPU environment deployment with Cluster Toolkit.
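
Steps 1 and 2 above amount to filtering the support matrix by your requirements. The following is a minimal Python sketch of that lookup, not part of this repository: the names `Recipe`, `MATRIX`, and `find_recipes` are hypothetical, and `MATRIX` holds only a handful of rows copied from the support matrix for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """One row of the support matrix (hypothetical type, for illustration)."""
    model: str
    machine: str
    framework: str
    workload: str
    orchestrator: str


# A few rows copied from the support matrix; the real matrix has many more.
MATRIX = [
    Recipe("Llama-3.1-70B", "A3 Ultra (NVIDIA H200)", "MaxText", "Pre-training", "GKE"),
    Recipe("Llama-3.1-70B", "A4 (NVIDIA B200)", "Megatron-Bridge (25.09)", "Pre-training", "Slurm"),
    Recipe("Llama-3.1-405B", "A4 (NVIDIA B200)", "MaxText", "Pre-training", "GKE"),
    Recipe("DeepSeek R1 671B", "A4 (NVIDIA B200)", "vLLM", "Inference", "GKE"),
]


def find_recipes(matrix, **requirements):
    """Return rows whose fields contain every requested value.

    Matching is a case-insensitive substring test, so machine="A4" would
    also match "A4X" rows; pass a longer string to narrow the result.
    """
    def matches(recipe):
        return all(
            wanted.lower() in getattr(recipe, field).lower()
            for field, wanted in requirements.items()
        )

    return [recipe for recipe in matrix if matches(recipe)]


if __name__ == "__main__":
    # Step 1: state the requirements. Step 2: look them up in the matrix.
    for recipe in find_recipes(MATRIX, model="Llama-3.1-70B", orchestrator="Slurm"):
        print(recipe)
```

Each keyword argument corresponds to one column of the matrix, so leaving a requirement out simply leaves that column unconstrained.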

Benchmarks support matrix

Training benchmarks A3 Mega

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| GPT3-175B | A3 Mega (NVIDIA H100) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3-70B | A3 Mega (NVIDIA H100) | NeMo (25.07) | Pre-training | GKE | Link |
| Mixtral-8x7B | A3 Mega (NVIDIA H100) | NeMo (25.07) | Pre-training | GKE | Link |

Training benchmarks A3 Ultra

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-70B | A3 Ultra (NVIDIA H200) | MaxText | Pre-training | GKE | Link |
| Llama-3.1-70B | A3 Ultra (NVIDIA H200) | NeMo (24.07) | Pre-training | GKE | Link |
| Llama-3-70B | A3 Ultra (NVIDIA H200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Llama-3-70B | A3 Ultra (NVIDIA H200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
| Llama-3-8B | A3 Ultra (NVIDIA H200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
| Llama-3.1-405B | A3 Ultra (NVIDIA H200) | MaxText | Pre-training | GKE | Link |
| Llama-3.1-405B | A3 Ultra (NVIDIA H200) | NeMo (24.12) | Pre-training | GKE | Link |
| Mixtral-8x7B | A3 Ultra (NVIDIA H200) | NeMo (24.07) | Pre-training | GKE | Link |
| DeepSeek-V3 | A3 Ultra (NVIDIA H200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| GPT OSS 120B | A3 Ultra (NVIDIA H200) | NeMo (26.02) | Pre-training | GKE | Link |
| Qwen-3-30B | A3 Ultra (NVIDIA H200) | NeMo (26.02) | Pre-training | GKE | Link |
| Wan-2.1 | A3 Ultra (NVIDIA H200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |

Training benchmarks A4

| Models | GPU Machine Type | Framework / Library | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-70B | A4 (NVIDIA B200) | MaxText | Pre-training | GKE | Link |
| Llama-3.1-70B | A4 (NVIDIA B200) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3.1-70B | A4 (NVIDIA B200) | NeMo (26.02) | Pre-training | GKE | Link |
| Llama-3.1-70B | A4 (NVIDIA B200) | Megatron-Bridge (25.09) | Pre-training | Slurm | Link |
| Llama-3.1-405B | A4 (NVIDIA B200) | MaxText | Pre-training | GKE | Link |
| Llama-3.1-405B | A4 (NVIDIA B200) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4 (NVIDIA B200) | NeMo (26.02) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4 (NVIDIA B200) | Megatron-Bridge (25.09) | Pre-training | Slurm | Link |
| Mixtral-8x7B | A4 (NVIDIA B200) | NeMo (25.07) | Pre-training | GKE | Link |
| PaliGemma2 | A4 (NVIDIA B200) | Hugging Face Accelerate | Finetuning | GKE | Link |
| DeepSeek-V3 | A4 (NVIDIA B200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| DeepSeek-V3 | A4 (NVIDIA B200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| GPT OSS 120B | A4 (NVIDIA B200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Llama-3-8B | A4 (NVIDIA B200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Qwen-3-235B | A4 (NVIDIA B200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| Qwen-3-235B | A4 (NVIDIA B200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Qwen-3-235B | A4 (NVIDIA B200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
| Qwen-3-30B | A4 (NVIDIA B200) | NeMo (26.02) | Pre-training | GKE | Link |
| Wan-2.1-14B | A4 (NVIDIA B200) | NeMo (25.11) | Pre-training | GKE | Link |

Training benchmarks A4X

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B | A4X (NVIDIA GB200) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3.1-8B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| Llama-3.1-8B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
| Llama-3.1-70B | A4X (NVIDIA GB200) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3.1-70B | A4X (NVIDIA GB200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4X (NVIDIA GB200) | NeMo (25.07) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4X (NVIDIA GB200) | NeMo (26.02) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4X (NVIDIA GB200) | Megatron-Bridge (26.02) | Pre-training | GKE | Link |
| Llama-3.1-405B | A4X (NVIDIA GB200) | Megatron-Bridge (25.09) | Pre-training | Slurm | Link |
| Nemotron-4-340B | A4X (NVIDIA GB200) | NeMo (25.09) | Pre-training | GKE | Link |
| Wan-2.1-14B | A4X (NVIDIA GB200) | NeMo (25.11) | Pre-training | GKE | Link |
| Wan-2.1-14B | A4X (NVIDIA GB200) | NeMo (26.02) | Pre-training | GKE | Link |
| Wan-2.1-14B | A4X (NVIDIA GB200) | NeMo (25.11) | Pre-training | Slurm | Link |
| DeepSeek-V3 | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| Qwen-3-235B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| Qwen-3-235B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |
| Qwen-3-30B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | GKE | Link |
| Qwen-3-30B | A4X (NVIDIA GB200) | Megatron-Bridge (25.11) | Pre-training | Slurm | Link |

Inference benchmarks A3 Mega

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| Llama-4 | A3 Mega (NVIDIA H100) | SGLang | Inference | GKE | Link |
| DeepSeek R1 671B | A3 Mega (NVIDIA H100) | SGLang | Inference | GKE | Link |
| DeepSeek R1 671B | A3 Mega (NVIDIA H100) | vLLM | Inference | GKE | Link |

Inference benchmarks A3 Ultra

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| GPT OSS 120B | A3 Ultra (NVIDIA H200) | vLLM | Inference | GKE | Link |
| Llama-4 | A3 Ultra (NVIDIA H200) | vLLM | Inference | GKE | Link |
| Llama-3.1-405B | A3 Ultra (NVIDIA H200) | TensorRT-LLM | Inference | GKE | Link |
| DeepSeek R1 671B | A3 Ultra (NVIDIA H200) | SGLang | Inference | GKE | Link |
| DeepSeek R1 671B | A3 Ultra (NVIDIA H200) | vLLM | Inference | GKE | Link |

Inference benchmarks A4

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| DeepSeek R1 671B | A4 (NVIDIA B200) | vLLM | Inference | GKE | Link |
| DeepSeek R1 671B | A4 (NVIDIA B200) | SGLang | Inference | GKE | Link |
| DeepSeek R1 671B | A4 (NVIDIA B200) | TensorRT-LLM | Inference | GKE | Link |
| Llama 3.1 405B | A4 (NVIDIA B200) | TensorRT-LLM | Inference | GKE | Link |
| Qwen 2.5 VL 7B | A4 (NVIDIA B200) | TensorRT-LLM | Inference | GKE | Link |
| Qwen 3 235B A22B | A4 (NVIDIA B200) | TensorRT-LLM | Inference | GKE | Link |
| Qwen 3 32B | A4 (NVIDIA B200) | TensorRT-LLM | Inference | GKE | Link |

Inference benchmarks A4X

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| DeepSeek R1 671B | A4X (NVIDIA GB200) | vLLM (v0.14.0rc1) | Inference | GKE | Link |
| Wan2.2 T2V A14B Diffusers | A4X (NVIDIA GB200) | SGLang (latest) | Inference | GKE | Link |
| Wan2.2 I2V A14B Diffusers | A4X (NVIDIA GB200) | SGLang (latest) | Inference | GKE | Link |
| DeepSeek R1 671B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link; Link for Using Google Cloud Storage (GCS) as Storage Option; Link for Using Lustre as Storage Option |
| Llama 3.1 405B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Llama 3.1 70B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Llama 3.1 8B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Qwen 2.5 VL 7B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Qwen 3 235B A22B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Qwen 3 32B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |
| Qwen 3 4B | A4X (NVIDIA GB200) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | Link |

Inference benchmarks G4

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| Qwen3 8B | G4 (NVIDIA RTX PRO 6000 Blackwell) | vLLM | Inference | GCE | Link |
| Qwen3 30B A3B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Qwen3 4B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Qwen3 8B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Qwen3 32B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Qwen3 32B | G4 (NVIDIA RTX PRO 6000 Blackwell) | vLLM | Inference | GCE | Link |
| Llama3.1 70B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| DeepSeek R1 | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Qwen3 235B | G4 (NVIDIA RTX PRO 6000 Blackwell) | TensorRT-LLM | Inference | GCE | Link |
| Wan2.2 14B | G4 (NVIDIA RTX PRO 6000 Blackwell) | SGLang | Inference | GCE | Link |

Checkpointing benchmarks

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-70B | A3 Mega (NVIDIA H100) | NeMo | Pre-training using Google Cloud Storage buckets for checkpoints | GKE | Link |

Goodput benchmarks

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-70B | A3 Mega (NVIDIA H100) | NeMo | Pre-training using the Google Cloud Resiliency library | GKE | Link |
| Llama-3.1-405B | A3 Ultra (NVIDIA H200) | NeMo | Pre-training using the Google Cloud Resiliency library | GKE | Link |
| Mixtral-8x7B | A3 Ultra (NVIDIA H200) | NeMo | Pre-training using the Google Cloud Resiliency library | GKE | Link |

Repository organization

Repository scope

This repository provides the steps that you can use to reproduce a specific benchmark. The actual performance measurements and the complete, confidential benchmark report are not included.

Methodology

Performance benchmarks measure the performance of various workloads on the platform. These benchmarks are primarily used to validate performance with hardware suppliers and to provide you with data for purchasing decisions.

Maintenance policy

Benchmark data is considered a point-in-time measurement and completed benchmarks are not repeated. We maintain and update the recipes in this repository on a best-effort basis.

Resources

For general guidance on how to get started using Compute products, refer to the official documentation and tutorials:

Report issues

If you have questions or encounter problems with this repository, report them through GitHub Issues or reach out to your Google Cloud account team for assistance.

Contributor notes

Note: This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.