Democratizing Large-Scale Mixture-of-Experts Training with NVIDIA PyTorch Parallelism

Training massive mixture-of-experts (MoE) models has long been the domain of a few advanced users with deep infrastructure and distributed-systems expertise. For most developers, the challenge wasn’t building smarter models—it was scaling them efficiently across hundreds or even thousands of GPUs without breaking the bank.

With NVIDIA NeMo Automodel, an open-source library within the NVIDIA NeMo framework, developers can now train large-scale MoE models directly in PyTorch, using the tools they already know. Built on accelerated PyTorch distributed with NVIDIA performance optimizations, NeMo Automodel democratizes large-scale MoE training, making it:

Simple – Train billion-parameter models directly in PyTorch without managing complex parallelism or specialized systems.
Accessible – Empower researchers, startups, and enterprises to experiment with MoE architectures previously out of reach.
Efficient – Scale from eight to over 1,000 GPUs while maintaining strong performance and cost-effectiveness through built-in optimizations.

In this post, you’ll see how NeMo Automodel combines PyTorch-native distributed parallelism with NVIDIA acceleration to make large-scale MoE training easier, faster, and more accessible than ever. You’ll also find a detailed quick-start guide to reproduce benchmark results, run your own experiments, and explore configuration options—so you can experience the benefits firsthand.

Why training large MoEs is hard

Training MoEs efficiently at scale requires solving several interconnected challenges:

Expert parallelism: Distribute hundreds of experts across GPUs without overwhelming communication bandwidth.
Token routing overhead: Move tokens quickly and efficiently to the correct experts.
Memory management: Shard massive parameter sets to fit within GPU memory constraints.
Communication-computation fusion: Minimize latency from all-to-all communication and token permutation operations.
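To make the token routing challenge concrete, here is a minimal sketch of a generic top-k softmax gate in plain Python. It is an illustration only, not NeMo Automodel's router: the logits are made up, the function names are hypothetical, and real models may score experts differently. It shows how each token picks its experts and how uneven the resulting per-expert load can be.

```python
import math
from collections import Counter

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_id, weight), ...] for one token.

    Picks the k highest-probability experts and renormalizes their
    weights so they sum to 1 (a common, but not universal, convention).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in top)
    return [(e, probs[e] / norm) for e in top]

# Hypothetical router logits: 4 tokens, 4 experts.
logits = [
    [2.0, 1.0, 0.1, -1.0],
    [0.0, 3.0, 2.5, 0.2],
    [1.5, -0.5, 1.4, 0.0],
    [-1.0, 0.3, 0.2, 2.2],
]
assignments = [route_top_k(row, k=2) for row in logits]

# Tokens each expert receives; imbalance here becomes uneven
# all-to-all traffic and uneven compute once experts live on different GPUs.
load = Counter(expert for token in assignments for expert, _ in token)
print(dict(load))
```

Even in this tiny example one expert receives three token copies and another only one, which is why dispatch efficiency and load balance matter at scale.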

As a result of these system challenges, achieving more than 150 TFLOPs/GPU on H100 systems at BF16 precision has historically been difficult—leaving performance untapped.

NVIDIA NeMo Automodel, an open-source library within the NVIDIA NeMo framework, removes these barriers by building on top of native PyTorch parallelisms. It incorporates advanced infrastructure optimizations—previously reserved for expert ML engineers—directly into the PyTorch ecosystem.

Developers can now use PyTorch APIs while achieving over 200 TFLOPs per GPU on H100s with BF16 precision for a variety of popular 100B+ MoE architectures. For instance, DeepSeek V3 reached 250 TFLOPs/sec/GPU on 256 GPUs.

This makes large-scale MoE training accessible—empowering the broader community to research, experiment, and innovate with billion-parameter models.

Inside NeMo Automodel: architecture and optimizations

NeMo Automodel bridges PyTorch-native distributed parallelisms with NVIDIA acceleration technologies, creating a unified, high-performance training stack for MoEs.

Scaling efficiently via PyTorch distributed parallelisms

Built on PyTorch distributed, NeMo Automodel seamlessly scales models using:

Fully Sharded Data Parallelism (FSDP): Shards model parameters, gradients, and optimizer states across data-parallel ranks to minimize memory use.
Expert Parallelism (EP): Distributes MoE experts across GPUs, supporting models with hundreds of experts.
Pipeline Parallelism (PP): Splits model layers into stages for memory-efficient multi-node large model training.
Context Parallelism (CP): Partitions long sequences for extended-context training.
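A rough back-of-the-envelope calculation shows why sharding is unavoidable for the largest models. The sketch below assumes a common mixed-precision Adam accounting of 16 bytes per parameter (BF16 weights and gradients plus FP32 master weights and two optimizer moments); that figure is an assumption for illustration, not something NeMo Automodel documents here. It also assumes an idealized even split and ignores activations.

```python
def per_gpu_state_gb(num_params, bytes_per_param, num_shards):
    """Training state per GPU in GB, assuming a perfectly even split."""
    return num_params * bytes_per_param / num_shards / 1e9

NUM_PARAMS = 671e9       # DeepSeek V3 671B, as listed in the benchmark table
BF16_WEIGHT_BYTES = 2    # BF16 weights only

# Assumption: 2 (BF16 weights) + 2 (BF16 grads)
# + 12 (FP32 master weights and two Adam moments) bytes per parameter.
MIXED_PRECISION_ADAM_BYTES = 16

# Weights alone, unsharded: far beyond a single 80 GB GPU.
weights_unsharded = per_gpu_state_gb(NUM_PARAMS, BF16_WEIGHT_BYTES, 1)

# Full training state split evenly over 256 GPUs (idealized).
full_state_256 = per_gpu_state_gb(NUM_PARAMS, MIXED_PRECISION_ADAM_BYTES, 256)

print(f"BF16 weights, 1 GPU:      {weights_unsharded:.0f} GB")
print(f"Full state, 256-way split: {full_state_256:.1f} GB per GPU")
```

Under these assumptions the weights alone need about 1,342 GB, while an even 256-way split of the full training state comes to roughly 42 GB per GPU. Real layouts that combine FSDP, EP, and PP do not split every tensor uniformly, so treat this as an order-of-magnitude guide.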

Accelerating training with NVIDIA Transformer Engine

Using NVIDIA Transformer Engine kernels, including cuDNN RMSNorm, cuDNN Linear, and DotProductAttention, NeMo Automodel accelerates transformer blocks and supports different attention mechanisms such as multi-head latent attention (MLA), grouped-query attention (GQA), and sliding-window attention (SWA).

Smarter expert routing and computation with Megatron-Core DeepEP and GroupedGEMM

To achieve high efficiency at massive scale, NeMo Automodel integrates advanced token routing and expert computation components from Megatron-Core, designed specifically for MoE training.

DeepEP token dispatcher (Experimental): Scales token routing to 64+ expert parallelism degrees with highly efficient all-to-all communication and optional permute/unpermute fusion. By leveraging DeepSeek’s DeepEP optimization, NeMo Automodel minimizes communication overhead and maintains balanced expert utilization, enabling smoother scaling across hundreds of GPUs.
GroupedGEMM for MoE Experts: Aggregates multiple local expert computations into a single batched GEMM operation. This reduces kernel-launch overhead, increases GPU occupancy, and significantly improves throughput and hardware utilization, especially when multiple experts share the same device.
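The permute, grouped compute, and unpermute pattern behind these two components can be sketched in plain Python. This is a conceptual stand-in, not the Megatron-Core or DeepEP implementation: scalar multipliers play the role of expert weight matrices, each token goes to a single expert, and all function names are hypothetical.

```python
def permute_by_expert(token_values, expert_ids, num_experts):
    """Reorder tokens so each expert's tokens are contiguous.

    Returns the permuted values, the permutation (original indices in
    their new order), and the number of tokens per expert.
    """
    # sorted() is stable, so tokens keep their relative order within an expert.
    order = sorted(range(len(token_values)), key=lambda i: expert_ids[i])
    permuted = [token_values[i] for i in order]
    group_sizes = [0] * num_experts
    for e in expert_ids:
        group_sizes[e] += 1
    return permuted, order, group_sizes

def grouped_apply(permuted, group_sizes, expert_weights):
    """Stand-in for a grouped GEMM over contiguous per-expert slices.

    A real grouped GEMM handles all groups in one kernel launch; this
    Python loop only mirrors the data layout that makes that possible.
    """
    out, start = [], 0
    for expert, size in enumerate(group_sizes):
        w = expert_weights[expert]
        out.extend(w * x for x in permuted[start:start + size])
        start += size
    return out

def unpermute(values, order):
    """Scatter results back to the original token order."""
    restored = [None] * len(values)
    for pos, original_index in enumerate(order):
        restored[original_index] = values[pos]
    return restored

tokens = [1.0, 2.0, 3.0, 4.0, 5.0]
expert_ids = [2, 0, 1, 0, 2]            # one expert per token, for simplicity
expert_weights = [10.0, 100.0, 1000.0]  # scalar stand-ins for weight matrices

permuted, order, group_sizes = permute_by_expert(tokens, expert_ids, num_experts=3)
outputs = unpermute(grouped_apply(permuted, group_sizes, expert_weights), order)
print(outputs)
```

Each output equals its token multiplied by the weight of the expert it was routed to, in the original token order. Fusing the permute and unpermute steps with communication removes extra passes over this data.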

Breakthrough performance: cost-effective MoE training for everyone

The table below shows pre-training benchmarks on DGX H100 systems with BF16 precision across major MoE architectures:

Model	#GPUs	GBS (Global Batch Size)	Parallelism [TP, PP, CP, EP, VP, FSDP]	Optimizations	TFLOPs/sec/GPU	Tokens/sec/GPU
DeepSeek V3 671B	256	512	1,4,1,64,8,64	TE + DeepEP	250	1,002
DeepSeek V3 671B	1024	8192	1,4,1,64,8,256	TE + DeepEP	216	865
Kimi K2	256	512	1,8,1,32,4,32	TE + DeepEP	189	924
Qwen3 MoE 30B	8	512	1,1,1,8,-,8	TE + DeepEP	277	12,040
GPT-OSS 20B	8	256	1,1,1,-,-,8	TE + DeepEP + FlexAttn	279	13,058
GPT-OSS 120B	64	512	1,1,1,-,-,64	TE + DeepEP + FlexAttn	231	7,626

Table 1. Pre-training performance of representative mixture-of-experts (MoE) architectures on DGX H100 systems (BF16 precision). Note: All benchmarks use consistent measurement methodology with mock data, for a sequence length of 4096, and balanced expert routing. Peak H100 BF16 performance is 989 TFLOPs.
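To put the Table 1 numbers in context, the short script below computes each row's achieved fraction of the 989 TFLOPs peak quoted in the note. It also checks an observation about the parallelism column: in every row, TP x PP x CP x FSDP equals the GPU count. That relationship is read off the table, not a documented rule, so treat it as a sanity check rather than a specification.

```python
H100_BF16_PEAK_TFLOPS = 989  # peak figure stated in the Table 1 note

# (model, gpus, tp, pp, cp, fsdp, tflops_per_gpu), copied from Table 1.
ROWS = [
    ("DeepSeek V3 671B", 256, 1, 4, 1, 64, 250),
    ("DeepSeek V3 671B", 1024, 1, 4, 1, 256, 216),
    ("Kimi K2", 256, 1, 8, 1, 32, 189),
    ("Qwen3 MoE 30B", 8, 1, 1, 1, 8, 277),
    ("GPT-OSS 20B", 8, 1, 1, 1, 8, 279),
    ("GPT-OSS 120B", 64, 1, 1, 1, 64, 231),
]

def fraction_of_peak(tflops_per_gpu, peak=H100_BF16_PEAK_TFLOPS):
    """Achieved throughput as a fraction of the hardware peak."""
    return tflops_per_gpu / peak

for model, gpus, tp, pp, cp, fsdp, tflops in ROWS:
    consistent = tp * pp * cp * fsdp == gpus
    print(f"{model:18s} {gpus:5d} GPUs  "
          f"{fraction_of_peak(tflops):.1%} of peak  "
          f"TPxPPxCPxFSDP == GPUs: {consistent}")
```

The reported throughputs correspond to roughly 19% to 28% of the quoted peak, and the product check holds for all six rows.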

NeMo Automodel delivers industry-leading efficiency and scalability across diverse MoE architectures and GPU counts. Models sustain roughly 190 to 280 TFLOPs/sec per GPU and process up to about 13,000 tokens/sec per GPU, demonstrating near-linear scaling from eight to 1,024 GPUs, with the DeepSeek V3 671B model reaching 250 TFLOPs/sec per GPU on 256 GPUs. All of this is achieved with native PyTorch parallelisms coupled with NVIDIA optimizations, unlocking high hardware utilization and cost-effective large-scale MoE training for everyone in the PyTorch community.
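One way to quantify the scaling claim is to compare aggregate throughput between the two DeepSeek V3 671B rows of Table 1. The sketch below does that with hypothetical helper names. Note that the two runs use different global batch sizes (512 and 8192), so this is an indicative comparison rather than a controlled scaling study.

```python
def aggregate_tokens_per_sec(tokens_per_sec_per_gpu, num_gpus):
    """Cluster-wide throughput from a per-GPU figure."""
    return tokens_per_sec_per_gpu * num_gpus

def scaling_efficiency(small, large):
    """Ratio of achieved speedup to ideal speedup; 1.0 is perfectly linear.

    small and large are (num_gpus, tokens_per_sec_per_gpu) tuples.
    """
    (g0, t0), (g1, t1) = small, large
    speedup = aggregate_tokens_per_sec(t1, g1) / aggregate_tokens_per_sec(t0, g0)
    return speedup / (g1 / g0)

# DeepSeek V3 671B rows from Table 1: 256 GPUs and 1,024 GPUs.
eff = scaling_efficiency((256, 1002), (1024, 865))
print(f"Scaling efficiency, 256 -> 1,024 GPUs: {eff:.1%}")
```

With four times the GPUs, aggregate throughput grows by about 3.45x, which works out to roughly 86% scaling efficiency for this pair of runs.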

If you need even more speed beyond native PyTorch parallelism, NeMo Megatron-Bridge provides additional performance optimizations.

Empowering developers through native PyTorch distributed training

By leveraging native PyTorch distributed parallelisms, NeMo Automodel brings high-performance large-scale MoE training directly into the PyTorch ecosystem. This approach eliminates dependency on external or proprietary model-parallel libraries, giving developers full flexibility to scale using tools and APIs they already know.

Most importantly, it reflects NVIDIA's commitment to strengthening PyTorch and the broader open source AI ecosystem, making large-model training not just faster, but more open, interoperable, and accessible to the entire developer community.

Key benefits for developers:

Faster iteration cycles: Achieve higher throughput for quicker experimentation and model development.
Lower training costs: Better GPU utilization means fewer GPU-hours per training run.
Scalable performance: Consistent, near-linear scaling from eight GPUs to over 1,000 GPUs enables flexible infrastructure planning.
Native PyTorch integration: Leverages PyTorch distributed to remove reliance on external model-parallel frameworks, keeping everything within the PyTorch workflow.
Ecosystem commitment: Demonstrates NVIDIA's long-term investment in advancing PyTorch, ensuring future innovations are directly integrated into the core framework.
Production-ready: Includes proven, battle-tested configurations for leading open-source MoE architectures.
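The link between throughput and training cost is simple arithmetic: GPU-hours for a fixed token budget are inversely proportional to tokens/sec per GPU. The sketch below uses the two DeepSeek V3 671B throughput figures from Table 1 and an arbitrary one-billion-token budget chosen only for illustration; multiply by your own hourly GPU price to get a cost.

```python
def gpu_hours(num_tokens, tokens_per_sec_per_gpu):
    """GPU-hours needed to process num_tokens.

    For a fixed per-GPU throughput this does not depend on GPU count:
    more GPUs finish sooner but consume the same GPU-hours.
    """
    return num_tokens / tokens_per_sec_per_gpu / 3600

ONE_BILLION_TOKENS = 1e9  # illustrative budget, not from the benchmarks

at_865 = gpu_hours(ONE_BILLION_TOKENS, 865)    # DeepSeek V3 671B, 1,024 GPUs
at_1002 = gpu_hours(ONE_BILLION_TOKENS, 1002)  # DeepSeek V3 671B, 256 GPUs

print(f"{at_865:.1f} GPU-hours at 865 tokens/sec/GPU")
print(f"{at_1002:.1f} GPU-hours at 1,002 tokens/sec/GPU")
```

At these rates, one billion tokens cost about 321 GPU-hours versus about 277 GPU-hours, so a 16% gain in per-GPU throughput translates directly into about 14% fewer GPU-hours.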

Quick start: train and benchmark large MoE models

Getting started with NeMo Automodel is fast and familiar for any PyTorch developer.

You can use the provided benchmark scripts and configuration files to reproduce results or train your own large-scale MoE models with NVIDIA-optimized performance.

Minimum requirements

At least eight GPUs (80 GB memory each) are recommended to reproduce the benchmark results and run fine-tuning experiments efficiently.

Follow these simple steps to run a benchmark or fine-tuning experiment:

1. Pull the NeMo Docker image and start a container

docker pull nvcr.io/nvidia/nemo:25.09
docker run -it -d --ulimit memlock=-1 --ulimit stack=67108864 --gpus all nvcr.io/nvidia/nemo:25.09 bash

2. Once inside the container, clone the repo and navigate to Automodel

git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel

Run a benchmark

Example: Benchmark Qwen3 MoE 30B on eight GPUs

torchrun --nproc-per-node 8 nemo_automodel/recipes/llm/benchmark.py \
  --config examples/benchmark/configs/qwen3_moe_30b_te_deepep.yaml

Run fine-tuning

Example: Fine-tune Qwen3 MoE 30B

Note:

You’ll need to download the model checkpoint from Hugging Face first: hf download Qwen/Qwen3-30B-A3B
If you encounter a dataset instantiation error, upgrade the datasets library: pip install --upgrade datasets

torchrun --nproc-per-node 8 examples/llm_finetune/finetune.py --config examples/llm_finetune/qwen/qwen3_moe_30b_te_deepep.yaml

Available configuration files:

deepseek_v3_te_deepep.yaml – DeepSeek V3 (671B parameters)
kimi_k2_te_deepep.yaml – Optimized configuration for Kimi K2
qwen3_moe_30b_te_deepep.yaml – Qwen3 MoE 30B with full NVIDIA optimizations
gptoss_20b_te_deepep.yaml – GPT-OSS 20B with FlexAttention
gptoss_120b_te_deepep.yaml – GPT-OSS 120B production configuration

Check out the docs for complete performance documentation and implementation details.

Join us in democratizing open MoE training with native PyTorch

This release marks a major milestone in democratizing large-scale mixture-of-experts (MoE) training with accelerated PyTorch. But it’s only the beginning.

We’re actively working on:

Expanding model support: Adding new MoE and hybrid architectures.
Deeper optimizations: Further kernel-level and communication improvements for even higher efficiency.
Technical deep dives: Detailed explainers of NeMo Automodel MoE design and performance techniques.
Broader benchmarking: Extending performance validation across diverse hardware and cluster configurations.

We’d love for you to get started with NeMo Automodel and be part of this journey—try the configurations, share your results, and contribute feedback through GitHub Issues. Your insights help shape the next generation of scalable, open AI training tools.

Learn how NVIDIA Blackwell NVL72 runs 10x faster and delivers 1/10 the token cost for MoE models in this blog.