Democratizing Large-Scale Mixture-of-Experts Training with NVIDIA PyTorch Parallelism

Training massive mixture-of-experts (MoE) models has long been the domain of a few advanced users with deep infrastructure and distributed-systems expertise. For most developers, the challenge wasn’t building smarter models—it was scaling them efficiently across hundreds or even thousands of GPUs without breaking the bank.

With NVIDIA NeMo Automodel, an open-source library within the NVIDIA NeMo framework, developers can now train large-scale MoE models directly in PyTorch, using the same familiar tools they already know. Built on accelerated PyTorch distributed with NVIDIA performance optimizations, NeMo Automodel democratizes large-scale MoE training.

In this post, you’ll see how NeMo Automodel combines PyTorch-native distributed parallelism with NVIDIA acceleration to make large-scale MoE training easier, faster, and more accessible than ever. You’ll also find a detailed quick-start guide to reproduce benchmark results, run your own experiments, and explore configuration options—so you can experience the benefits firsthand.

Why training large MoEs is hard

Training MoEs efficiently at scale requires solving several interconnected challenges:

  1. Expert parallelism: Distribute hundreds of experts across GPUs without overwhelming communication bandwidth.
  2. Token routing overhead: Move tokens quickly and efficiently to the correct experts.
  3. Memory management: Shard massive parameter sets to fit within GPU memory constraints.
  4. Communication-computation fusion: Minimize latency from all-to-all communication and token permutation operations.
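To make the first two challenges concrete, here is a minimal pure-Python sketch of top-k routing followed by expert-parallel dispatch. It is not NeMo Automodel code: the sizes, the random scores, the contiguous expert placement, and the helper names `route_top_k` and `dispatch_counts` are all assumptions chosen for illustration. It only counts how many (token, expert) pairs each GPU would receive, which is the volume an all-to-all exchange has to move.

```python
import random
from collections import Counter


def route_top_k(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def dispatch_counts(assignments, num_experts, num_gpus):
    """Count how many (token, expert) pairs land on each GPU.

    Assumption: experts are placed in contiguous blocks, so expert e
    lives on GPU e // (num_experts // num_gpus).
    """
    experts_per_gpu = num_experts // num_gpus
    counts = Counter()
    for experts in assignments:
        for e in experts:
            counts[e // experts_per_gpu] += 1
    return [counts[g] for g in range(num_gpus)]


random.seed(0)
# Illustrative sizes only; real models use learned router scores.
NUM_TOKENS, NUM_EXPERTS, NUM_GPUS, TOP_K = 1024, 64, 8, 2
scores = [[random.random() for _ in range(NUM_EXPERTS)] for _ in range(NUM_TOKENS)]
assignments = [route_top_k(s, TOP_K) for s in scores]
per_gpu = dispatch_counts(assignments, NUM_EXPERTS, NUM_GPUS)

mean_load = sum(per_gpu) / NUM_GPUS
print("pairs per GPU:", per_gpu)
print("max/mean load:", round(max(per_gpu) / mean_load, 3))
```

The max/mean ratio is a rough proxy for routing imbalance: the busiest GPU sets the pace for the whole step, which is why balanced routing and fast dispatch matter at scale.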

As a result of these system challenges, achieving more than 150 TFLOPs/GPU on H100 systems at BF16 precision has historically been difficult—leaving performance untapped.

NVIDIA NeMo Automodel, an open-source library within the NVIDIA NeMo framework, removes these barriers by building on top of native PyTorch parallelisms. It incorporates advanced infrastructure optimizations—previously reserved for expert ML engineers—directly into the PyTorch ecosystem.

Developers can now use PyTorch APIs while achieving over 200 TFLOPs per GPU on H100s with BF16 precision for a variety of popular 100B+ MoE architectures. For instance, DeepSeek V3 reached 250 TFLOPs/sec/GPU on 256 GPUs.

This makes large-scale MoE training accessible—empowering the broader community to research, experiment, and innovate with billion-parameter models.

Inside NeMo Automodel: architecture and optimizations

NeMo Automodel bridges PyTorch-native distributed parallelisms with NVIDIA acceleration technologies, creating a unified, high-performance training stack for MoEs.

Scaling efficiently via PyTorch distributed parallelisms

Built on PyTorch distributed, NeMo Automodel seamlessly scales models using:

Accelerating training with NVIDIA Transformer Engine

Using NVIDIA Transformer Engine kernels, including cuDNN RMSNorm, cuDNN Linear, and DotProductAttention, NeMo Automodel accelerates transformer blocks and supports different attention mechanisms such as multi-head latent attention (MLA), grouped-query attention (GQA), and sliding-window attention (SWA).

Smarter expert routing and computation with Megatron-Core DeepEP and GroupedGEMM

To achieve high efficiency at massive scale, NeMo Automodel integrates advanced token routing and expert computation components from Megatron-Core, designed specifically for MoE training.
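The general pattern behind grouped expert computation can be sketched in plain Python: permute tokens so that those bound for the same expert are adjacent, run each expert once over its group, then restore the original order. This is a conceptual stand-in only. It does not use the DeepEP or GroupedGEMM APIs, and the function `grouped_expert_forward` and the toy scaling "experts" are hypothetical names introduced for this example.

```python
def grouped_expert_forward(tokens, expert_ids, experts):
    """Apply experts[expert_ids[i]] to tokens[i], one contiguous group per expert."""
    # 1. Permute: order token positions so tokens for the same expert are adjacent.
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])

    # 2. Grouped compute: walk the sorted positions and process one run per expert.
    permuted_out = []
    start = 0
    while start < len(order):
        e = expert_ids[order[start]]
        end = start
        while end < len(order) and expert_ids[order[end]] == e:
            end += 1
        group = [tokens[i] for i in order[start:end]]
        permuted_out.extend(experts[e](group))  # one call per expert group
        start = end

    # 3. Unpermute: scatter results back to the original token order.
    out = [None] * len(tokens)
    for pos, i in enumerate(order):
        out[i] = permuted_out[pos]
    return out


# Toy "experts": each one scales its inputs by a different factor.
experts = [lambda xs, s=s: [x * s for x in xs] for s in (1, 10, 100)]
tokens = [1, 2, 3, 4, 5]
expert_ids = [2, 0, 1, 0, 2]

result = grouped_expert_forward(tokens, expert_ids, experts)
print(result)
```

In a real training stack, the per-group call would be a batched matrix multiply on the GPU, and the permute and unpermute steps would be the token permutation operations mentioned earlier.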

Breakthrough performance: cost-effective MoE training for everyone

The table below shows pre-training benchmarks on DGX H100 systems with BF16 precision across major MoE architectures:

Model | #GPUs | GBS (global batch size) | Parallelism [TP, PP, CP, EP, VP, FSDP] | Optimizations | TFLOPs/sec/GPU | Tokens/sec/GPU
--- | --- | --- | --- | --- | --- | ---
DeepSeek V3 671B | 256 | 512 | 1, 4, 1, 64, 8, 64 | TE + DeepEP | 250 | 1,002
DeepSeek V3 671B | 1024 | 8192 | 1, 4, 1, 64, 8, 256 | TE + DeepEP | 216 | 865
Kimi K2 | 256 | 512 | 1, 8, 1, 32, 4, 32 | TE + DeepEP | 189 | 924
Qwen3 MoE 30B | 8 | 512 | 1, 1, 1, 8, -, 8 | TE + DeepEP | 277 | 12,040
GPT-OSS 20B | 8 | 256 | 1, 1, 1, -, -, 8 | TE + DeepEP + FlexAttn | 279 | 13,058
GPT-OSS 120B | 64 | 512 | 1, 1, 1, -, -, 64 | TE + DeepEP + FlexAttn | 231 | 7,626

Table 1. Pre-training performance of representative mixture-of-experts (MoE) architectures on DGX H100 systems (BF16 precision). Note: All benchmarks use consistent measurement methodology with mock data, for a sequence length of 4096, and balanced expert routing. Peak H100 BF16 performance is 989 TFLOPs.
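As a quick reading aid, the achieved throughput in Table 1 can be expressed as a fraction of the 989 TFLOPs peak quoted in the note. The short calculation below uses only numbers from the table; the helper name `utilization` is hypothetical.

```python
PEAK_H100_BF16_TFLOPS = 989  # peak figure quoted in the Table 1 note

# Achieved TFLOPs/sec/GPU, copied from Table 1.
achieved = {
    "DeepSeek V3 671B (256 GPUs)": 250,
    "DeepSeek V3 671B (1024 GPUs)": 216,
    "Kimi K2 (256 GPUs)": 189,
    "Qwen3 MoE 30B (8 GPUs)": 277,
    "GPT-OSS 20B (8 GPUs)": 279,
    "GPT-OSS 120B (64 GPUs)": 231,
}


def utilization(achieved_tflops, peak_tflops=PEAK_H100_BF16_TFLOPS):
    """Fraction of peak throughput that a run sustains."""
    return achieved_tflops / peak_tflops


for name, tflops in achieved.items():
    print(f"{name}: {utilization(tflops):.1%} of peak")
```

By this measure, the benchmarked runs sustain roughly 19% to 28% of peak BF16 throughput.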

NeMo Automodel delivers industry-leading efficiency and scalability across diverse MoE architectures and GPU counts. Models sustain roughly 190 to 280 TFLOPs/sec per GPU and process up to about 13,000 tokens/sec per GPU, with near-linear scaling on configurations from eight to 1,024 GPUs. For example, the DeepSeek V3 671B model reaches 250 TFLOPs/sec per GPU on 256 GPUs and still sustains 216 TFLOPs/sec per GPU on 1,024 GPUs. All of this is achieved through native PyTorch parallelisms coupled with NVIDIA optimizations, unlocking high hardware utilization and cost-effective large-scale MoE training for everyone in the PyTorch community.
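The per-GPU figures can be turned into cluster-level numbers with a little arithmetic. The sketch below uses the two DeepSeek V3 671B rows from Table 1 and assumes that the global batch size counts sequences of 4096 tokens each, which the post does not state explicitly. The function names are hypothetical.

```python
SEQ_LEN = 4096  # sequence length stated in the Table 1 note


def cluster_tokens_per_sec(tokens_per_sec_per_gpu, num_gpus):
    """Aggregate throughput across all GPUs."""
    return tokens_per_sec_per_gpu * num_gpus


def step_time_sec(global_batch_size, tokens_per_sec_per_gpu, num_gpus, seq_len=SEQ_LEN):
    """Time for one optimizer step, assuming GBS is measured in sequences."""
    tokens_per_step = global_batch_size * seq_len
    return tokens_per_step / cluster_tokens_per_sec(tokens_per_sec_per_gpu, num_gpus)


def per_gpu_retention(per_gpu_large, per_gpu_small):
    """Share of per-GPU throughput kept when moving to the larger cluster."""
    return per_gpu_large / per_gpu_small


# DeepSeek V3 671B rows from Table 1.
small = {"gpus": 256, "gbs": 512, "tok": 1002}
large = {"gpus": 1024, "gbs": 8192, "tok": 865}

small_total = cluster_tokens_per_sec(small["tok"], small["gpus"])
large_total = cluster_tokens_per_sec(large["tok"], large["gpus"])
small_step = step_time_sec(small["gbs"], small["tok"], small["gpus"])
retention = per_gpu_retention(large["tok"], small["tok"])

print(f"256 GPUs:  {small_total:,} tokens/sec, {small_step:.2f} s per step")
print(f"1024 GPUs: {large_total:,} tokens/sec")
print(f"per-GPU throughput retained at 4x the GPUs: {retention:.1%}")
```

Because the two runs use different global batch sizes, this is a rough comparison rather than a strict strong-scaling measurement.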

If you need even more speed beyond native PyTorch parallelism, NeMo Megatron-Bridge provides additional performance optimizations.

Empowering developers through native PyTorch distributed training

By leveraging native PyTorch distributed parallelisms, NeMo Automodel brings high-performance large-scale MoE training directly into the PyTorch ecosystem. This approach eliminates dependency on external or proprietary model-parallel libraries, giving developers full flexibility to scale using tools and APIs they already know.

Most importantly, it reflects NVIDIA's commitment to strengthening PyTorch and the broader open source AI ecosystem, making large-model training not just faster, but more open, interoperable, and accessible to the entire developer community.

Key benefits for developers:

Quick start: train and benchmark large MoE models

Getting started with NeMo Automodel is fast and familiar for any PyTorch developer.

You can use the provided benchmark scripts and configuration files to reproduce results or train your own large-scale MoE models with NVIDIA-optimized performance.

Minimum requirements

At least eight GPUs (80 GB memory each) are recommended to reproduce the benchmark results and run fine-tuning experiments efficiently.
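A back-of-the-envelope estimate helps show why eight 80 GB GPUs is a sensible floor for a 30B-parameter model. Everything in this sketch is an assumption made for illustration: the nominal 30e9 parameter count is taken from the model name, and the bytes-per-parameter breakdown is a common mixed-precision Adam setup, not a description of what NeMo Automodel actually allocates. Activations and communication buffers are ignored.

```python
GB = 1e9  # decimal gigabytes


def per_gpu_gb(num_params, bytes_per_param, num_gpus):
    """Memory per GPU when state is sharded evenly across num_gpus."""
    return num_params * bytes_per_param / num_gpus / GB


NUM_PARAMS = 30e9  # nominal size from the name "Qwen3 MoE 30B" (assumption)
NUM_GPUS = 8

# Assumed bytes per parameter (hypothetical recipe, for illustration only).
BF16_WEIGHTS = 2
BF16_GRADS = 2
FP32_OPTIMIZER = 12  # fp32 master copy plus two Adam moments, 4 bytes each

weights_only = per_gpu_gb(NUM_PARAMS, BF16_WEIGHTS, NUM_GPUS)
training_state = per_gpu_gb(
    NUM_PARAMS, BF16_WEIGHTS + BF16_GRADS + FP32_OPTIMIZER, NUM_GPUS
)

print(f"sharded BF16 weights:   {weights_only:.1f} GB per GPU")
print(f"sharded training state: {training_state:.1f} GB per GPU")
```

Under these assumptions, the fully sharded training state takes about 60 GB of each 80 GB GPU, leaving the remainder for activations and buffers. Actual usage depends on the recipe and configuration.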

Follow these simple steps to run a benchmark or fine-tuning experiment:

1. Pull the NeMo docker image and start a container

docker pull nvcr.io/nvidia/nemo:25.09
docker run -it -d --ulimit memlock=-1 --ulimit stack=67108864 --gpus all nvcr.io/nvidia/nemo:25.09 bash

2. Once inside the container, clone the repo and navigate to Automodel

git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel

Run a benchmark

Example: Benchmark Qwen3 MoE 30B on eight GPUs

torchrun --nproc-per-node 8 nemo_automodel/recipes/llm/benchmark.py \
  --config examples/benchmark/configs/qwen3_moe_30b_te_deepep.yaml
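If you plan to run several configurations, it can help to build the launch command programmatically. The helper below is a hypothetical convenience wrapper, not part of NeMo Automodel. It uses only the script path, config path, and flags shown in the command above.

```python
import shlex


def torchrun_command(script, config, nproc_per_node=8):
    """Build the torchrun argument list for a recipe script and a YAML config."""
    return [
        "torchrun",
        "--nproc-per-node", str(nproc_per_node),
        script,
        "--config", config,
    ]


cmd = torchrun_command(
    "nemo_automodel/recipes/llm/benchmark.py",
    "examples/benchmark/configs/qwen3_moe_30b_te_deepep.yaml",
)
print(shlex.join(cmd))

# To launch from inside the Automodel checkout:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell-quoting problems when paths are passed to `subprocess.run`.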

Run fine-tuning

Example: Fine-tune Qwen3 MoE 30B

Note:

torchrun --nproc-per-node 8 examples/llm_finetune/finetune.py --config examples/llm_finetune/qwen/qwen3_moe_30b_te_deepep.yaml

Available configuration files:

Check out the docs for complete performance documentation and implementation details.

Join us in democratizing open MoE training with native PyTorch

This release marks a major milestone in democratizing large-scale mixture-of-experts (MoE) training with accelerated PyTorch. But it’s only the beginning.

We’re actively working on:

Expanding model support: Adding new MoE and hybrid architectures.

We’d love for you to get started with NeMo Automodel and be part of this journey—try the configurations, share your results, and contribute feedback through GitHub Issues. Your insights help shape the next generation of scalable, open AI training tools.

Learn how NVIDIA Blackwell NVL72 runs 10x faster and delivers 1/10 the token cost for MoE models in this blog.