GitHub - NVIDIA-NeMo/Automodel: 🚀 PyTorch Distributed native training library for LLMs/VLMs with OOTB Hugging Face support

📣 News and Discussions

Overview

NeMo AutoModel is a PyTorch DTensor-native SPMD open-source training library under the NVIDIA NeMo Framework. It streamlines and scales training and fine-tuning for LLMs, VLMs, diffusion models, and retrieval models. Built for flexibility, reproducibility, and scale, it supports both small-scale experiments and large multi-GPU, multi-node deployments, enabling fast experimentation in research and production environments.

What you can expect:

Why PyTorch Distributed and SPMD

TL;DR: SPMD turns “how to parallelize” into a runtime layout choice, not a code fork.
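To make the TL;DR concrete, here is a toy, pure-Python sketch. It does not use PyTorch or DTensor, and `shard`, `local_step`, and `run_spmd` are hypothetical names chosen for illustration. The same per-rank program runs unchanged whether the data is split across 1, 2, or 8 simulated ranks; only the layout argument (`world_size`) changes.

```python
from typing import List


def shard(data: List[float], world_size: int) -> List[List[float]]:
    """Split data into world_size contiguous shards (the 'layout')."""
    base, extra = divmod(len(data), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)
        shards.append(data[start:start + size])
        start += size
    return shards


def local_step(local_data: List[float]) -> float:
    """The single program every simulated rank runs on its own shard."""
    return sum(x * x for x in local_data)


def run_spmd(data: List[float], world_size: int) -> float:
    """Run local_step on every shard, then combine with a sum (a stand-in for all-reduce)."""
    partials = [local_step(s) for s in shard(data, world_size)]
    return sum(partials)


data = [float(i) for i in range(16)]
# Only the layout changes between runs; local_step is never edited.
results = {ws: run_spmd(data, ws) for ws in (1, 2, 8)}
print(results)
```

All three layouts produce the same result because the combining step is a plain sum. A real DTensor program expresses the layout through device meshes and placements rather than a list of Python lists.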

Feature List

✅ Available now (v0.4.0 / 26.04 container) | 🔜 Coming next

High-throughput scalable training

SOTA algorithms

Model Coverage and 🤗 Ecosystem compatibility

Agentic Development and UX

Getting Started

We recommend using uv for reproducible Python environments.

Setup environment before running any recipes

uv venv

Choose ONE:

uv sync --frozen # LLM recipes (default)

uv sync --frozen --extra vlm # VLM recipes (fixes: ImportError: qwen_vl_utils is not installed)

uv sync --frozen --extra cuda # Optional CUDA deps (e.g., Transformer Engine, bitsandbytes)

uv sync --frozen --extra all # Most optional deps (includes vlm and cuda)

uv sync --frozen --all-extras # Everything (includes fa, moe, etc.)

One-off runs (examples):

uv run --extra vlm <command>

uv run --extra cuda <command>

uv run python -c "import nemo_automodel; print('NeMo AutoModel ready')"

Run a Recipe

All recipes are launched via the automodel CLI (or its short alias am). Each YAML config specifies the recipe class and all training parameters:

LLM example: multi-GPU fine-tuning with FSDP2

automodel examples/llm_finetune/llama3_2/llama3_2_1b_hellaswag.yaml --nproc-per-node 8

VLM example: single-GPU fine-tuning (Gemma-3-VL) with LoRA

automodel examples/vlm_finetune/gemma3/gemma3_vl_4b_cord_v2_peft.yaml

Both commands also work with uv run:

uv run automodel examples/llm_finetune/llama3_2/llama3_2_1b_hellaswag.yaml --nproc-per-node 8

Tip

Login-node / CI installs: If you only need to submit jobs (SLURM, k8s, NeMo-Run) and don't need to train locally, install the lightweight CLI package: pip install nemo-automodel[cli]

LLM Pre-training

LLM Pre-training Single Node

We provide an example pre-training experiment using the FineWeb dataset with a nano-GPT model, ideal for quick experimentation on a single node.

automodel examples/llm_pretrain/nanogpt_pretrain.yaml --nproc-per-node 8

LLM Supervised Fine-Tuning (SFT)

We provide an example SFT experiment using the SQuAD dataset.

LLM SFT Single Node

The default SFT configuration is set to run on a single GPU. To start the experiment:

automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml

This fine-tunes the Llama3.2-1B model on the SQuAD dataset using a single GPU.

To use multiple GPUs on a single node, add the --nproc-per-node argument:

automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml --nproc-per-node 8

LLM SFT Multi Node

To launch on a SLURM cluster, copy the reference sbatch script and adapt it to your cluster:

cp slurm.sub my_cluster.sub

Edit my_cluster.sub: change CONFIG, #SBATCH directives, container, mounts, etc.

sbatch my_cluster.sub

All cluster-specific settings (nodes, GPUs, partition, container, mounts) live in your sbatch script. NeMo-Run (nemo_run:) sections are also supported; see our cluster guide for details.

LLM Parameter-Efficient Fine-Tuning (PEFT)

We provide a PEFT example using the HellaSwag dataset.

LLM PEFT Single Node

Memory-efficient SFT with LoRA

automodel examples/llm_finetune/llama3_2/llama3_2_1b_hellaswag_peft.yaml

Override any YAML parameter via the command line:

automodel examples/llm_finetune/llama3_2/llama3_2_1b_hellaswag_peft.yaml \
    --step_scheduler.local_batch_size 16
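The override above uses a dotted key that mirrors the nesting of the YAML file. The following is a minimal sketch of how such a key could map onto a nested config. It is not the library's actual parser; `parse_value` and `apply_overrides` are hypothetical helpers, and the starting value of 8 is made up for illustration.

```python
from typing import Any, Dict, List


def parse_value(text: str) -> Any:
    """Interpret a CLI string as bool, int, or float; otherwise keep it as a string."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_overrides(config: Dict[str, Any], argv: List[str]) -> Dict[str, Any]:
    """Apply `--a.b.c value` pairs to a nested dict, creating levels as needed."""
    if len(argv) % 2 != 0:
        raise ValueError("overrides must come in '--key value' pairs")
    for flag, raw in zip(argv[0::2], argv[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"expected a --dotted.key, got {flag!r}")
        *parents, leaf = flag[2:].split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return config


# Stand-in for a parsed YAML file (hypothetical starting value).
config = {"step_scheduler": {"local_batch_size": 8}}
apply_overrides(config, ["--step_scheduler.local_batch_size", "16"])
print(config)
```

Each dot descends one level in the config, so the key on the command line reads the same way as the path to the field in the YAML file.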

Note

Launching a multi-node PEFT example uses the same sbatch slurm.sub workflow as the SFT case above.

VLM Supervised Fine-Tuning (SFT)

We provide a VLM SFT example using Qwen2.5-VL for end-to-end fine-tuning on image-text data.

VLM SFT Single Node

Qwen2.5-VL on 8 GPUs

automodel examples/vlm_finetune/qwen2_5/qwen2_5_vl_3b_rdr.yaml --nproc-per-node 8

VLM Parameter-Efficient Fine-Tuning (PEFT)

We provide a VLM PEFT (LoRA) example for memory-efficient adaptation with Gemma3 VLM.

VLM PEFT Single Node

Gemma-3-VL PEFT on 8 GPUs

automodel examples/vlm_finetune/gemma3/gemma3_vl_4b_medpix_peft.yaml --nproc-per-node 8

Supported Models

NeMo AutoModel provides native support for a wide range of models available on the Hugging Face Hub, enabling efficient fine-tuning across domains. Below is a small sample of ready-to-use families; you can train them as-is or swap in any compatible 🤗 causal LM. Nearly any LLM/VLM on the 🤗 Hub can be specified:

| Domain | Model Family | Model ID | Recipes |
|--------|--------------|----------|---------|
| LLM | GPT-OSS | GPT-OSS-20B | SFT |
| LLM | GPT-OSS | GPT-OSS-120B | SFT |
| LLM | DeepSeek | DeepSeek-V3 | Pretrain |
| LLM | Moonlight | Moonlight-16B-TE | Pretrain, SFT |
| LLM | Ling 2.0 | inclusionAI/Ling-mini-2.0 | LoRA SFT, SFT |
| LLM | Ling 2.0 | inclusionAI/Ling-flash-2.0 | LoRA SFT, SFT |
| LLM | Ling 2.0 | inclusionAI/Ling-1T | LoRA SFT, SFT |
| LLM | ERNIE 4.5 | baidu/ERNIE-4.5-0.3B-PT | SFT |
| LLM | ERNIE 4.5 | baidu/ERNIE-4.5-21B-A3B-PT | SFT |
| LLM | MiMo V2 Flash | XiaomiMiMo/MiMo-V2-Flash | SFT |
| LLM | LLaMA | meta-llama/Llama-3.2-1B | SFT, PEFT |
| LLM | LLaMA | meta-llama/Llama-3.2-3B-Instruct | SFT, PEFT |
| LLM | LLaMA | meta-llama/Llama-3.1-8B | FP8 |
| LLM | LLaMA | meta-llama/Llama-3.3-70B-Instruct | SFT, PEFT |
| LLM | Mistral | mistralai/Mistral-7B-v0.1 | SFT, PEFT, FP8 |
| LLM | Mistral | mistralai/Mistral-Nemo-Base-2407 | SFT, PEFT, FP8 |
| LLM | Mistral | mistralai/Mixtral-8x7B-Instruct-v0.1 | SFT, PEFT |
| LLM | Qwen | Qwen/Qwen2.5-7B | SFT, PEFT, FP8 |
| LLM | Qwen | Qwen/Qwen3-0.6B | SFT, PEFT |
| LLM | Qwen | Qwen/QwQ-32B | SFT, PEFT |
| LLM | Gemma | google/gemma-3-270m | SFT, PEFT |
| LLM | Gemma | google/gemma-2-9b-it | SFT, PEFT, FP8 |
| LLM | Gemma | google/gemma-7b | SFT, PEFT |
| LLM | Phi | microsoft/phi-2 | SFT, PEFT |
| LLM | Phi | microsoft/Phi-3-mini-4k-instruct | SFT, PEFT |
| LLM | Phi | microsoft/phi-4 | SFT, PEFT, FP8 |
| LLM | Seed | ByteDance-Seed/Seed-Coder-8B-Instruct | SFT, PEFT, FP8 |
| LLM | Seed | ByteDance-Seed/Seed-OSS-36B-Instruct | SFT, PEFT |
| LLM | Baichuan | baichuan-inc/Baichuan2-7B-Chat | SFT, PEFT, FP8 |
| VLM | Gemma | google/gemma-3-4b-it | SFT, PEFT |
| VLM | Gemma | google/gemma-3n-e4b-it | SFT, PEFT |

Note

Check out more LLM and VLM examples. Any causal LM on the Hugging Face Hub can be used with the base recipe template: override it on the CLI with --model.pretrained_model_name_or_path <model-id>, or change the same field in the YAML config.
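When trying several models against one base recipe, it can help to generate the command lines programmatically. The sketch below only assembles and prints commands; it does not run them. `build_command` is a hypothetical helper, and combining `--nproc-per-node` with a dotted override in a single invocation, in this order, is an assumption based on the separate examples in this README.

```python
import shlex
from typing import Dict, List, Optional


def build_command(config_path: str,
                  nproc_per_node: Optional[int] = None,
                  overrides: Optional[Dict[str, object]] = None) -> List[str]:
    """Assemble an argv list for the automodel CLI without executing it."""
    argv = ["automodel", config_path]
    if nproc_per_node is not None:
        argv += ["--nproc-per-node", str(nproc_per_node)]
    for key, value in (overrides or {}).items():
        argv += [f"--{key}", str(value)]
    return argv


base_config = "examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml"
model_ids = ["meta-llama/Llama-3.2-1B", "Qwen/Qwen3-0.6B"]  # taken from the table above

commands = [
    build_command(base_config, nproc_per_node=8,
                  overrides={"model.pretrained_model_name_or_path": model_id})
    for model_id in model_ids
]
for argv in commands:
    # shlex.join quotes arguments where needed so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Keeping the command as a list until the last moment avoids quoting mistakes, and the same list could be handed to `subprocess.run` once the printed commands look right.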

Performance

NeMo AutoModel delivers high training throughput on NVIDIA GPUs, ranging from 212 to 279 model TFLOPs/sec/GPU in the configurations below. Highlights from our benchmark results:

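| Model | #GPUs | Seq Length | Model TFLOPs/sec/GPU | Tokens/sec/GPU | Kernel Optimizations |
|-------|-------|------------|----------------------|----------------|----------------------|
| DeepSeek V3 671B | 256 | 4096 | 250 | 1,002 | TE + DeepEP |
| GPT-OSS 20B | 8 | 4096 | 279 | 13,058 | TE + DeepEP + FlexAttn |
| Qwen3 MoE 30B | 8 | 4096 | 212 | 11,842 | TE + DeepEP |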

For complete benchmark results including configuration details, see the Performance Summary.
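The per-GPU figures in the table above can be turned into cluster-wide throughput and a rough wall-clock estimate. The sketch below assumes throughput stays constant for the whole run and uses a hypothetical budget of one billion tokens; neither assumption comes from the benchmark itself.

```python
from typing import Dict, Tuple

# (num_gpus, tokens_per_sec_per_gpu), copied from the benchmark table.
BENCHMARKS: Dict[str, Tuple[int, int]] = {
    "DeepSeek V3 671B": (256, 1_002),
    "GPT-OSS 20B": (8, 13_058),
    "Qwen3 MoE 30B": (8, 11_842),
}


def aggregate_tokens_per_sec(num_gpus: int, tokens_per_sec_per_gpu: int) -> int:
    """Cluster-wide throughput: per-GPU rate times the number of GPUs."""
    return num_gpus * tokens_per_sec_per_gpu


def hours_for_budget(total_tokens: float, tokens_per_sec: float) -> float:
    """Wall-clock hours to process total_tokens at a constant rate."""
    return total_tokens / tokens_per_sec / 3600.0


TOKEN_BUDGET = 1e9  # hypothetical budget, chosen only for illustration

aggregate = {name: aggregate_tokens_per_sec(*row) for name, row in BENCHMARKS.items()}
hours = {name: hours_for_budget(TOKEN_BUDGET, rate) for name, rate in aggregate.items()}

for name in BENCHMARKS:
    print(f"{name}: {aggregate[name]:,} tokens/sec, {hours[name]:.2f} h per 1e9 tokens")
```

These are back-of-the-envelope numbers: real runs also spend time on checkpointing, evaluation, and data loading, so treat the result as a lower bound on wall-clock time.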

🔌 Interoperability

πŸ—‚οΈ Project Structure

NeMo-Automodel/
├── cli/                            # `automodel` / `am` CLI entry-point
│   └── app.py
├── docker/                         # Container build files
├── docs/                           # Documentation and guides
├── examples/
│   ├── convergence/                # Convergence test configs
│   ├── diffusion/                  # Diffusion pretrain/finetune configs
│   ├── dllm_sft/                   # Discrete diffusion LM SFT configs
│   ├── dllm_generate/              # Discrete diffusion LM generation
│   ├── llm_benchmark/              # LLM benchmarking configs
│   ├── llm_finetune/               # LLM finetune YAML configs
│   ├── llm_kd/                     # LLM knowledge-distillation configs
│   ├── llm_pretrain/               # LLM pretrain configs
│   ├── llm_seq_cls/                # LLM sequence classification configs
│   ├── retrieval/                  # Bi-encoder / cross-encoder configs
│   ├── vlm_benchmark/              # VLM benchmarking configs
│   ├── vlm_finetune/               # VLM finetune configs
│   └── vlm_generate/               # VLM generation configs
├── nemo_automodel/
│   ├── _diffusers/                 # HF Diffusers integration (NeMoAutoDiffusionPipeline)
│   ├── _transformers/              # HF Transformers integration
│   ├── components/                 # Core library
│   │   ├── _peft/                  # PEFT implementations (LoRA, QLoRA)
│   │   ├── attention/              # Attention implementations
│   │   ├── checkpoint/             # Distributed checkpointing
│   │   ├── config/
│   │   ├── datasets/               # LLM, VLM, diffusion, retrieval datasets
│   │   ├── distributed/            # FSDP2, Megatron FSDP, pipelining, CP, etc.
│   │   ├── launcher/               # Launcher backends (SLURM, NeMo-Run, SkyPilot)
│   │   ├── loggers/                # Loggers
│   │   ├── loss/                   # Optimized loss functions
│   │   ├── models/                 # User-defined model examples
│   │   ├── moe/                    # Optimized kernels for MoE models
│   │   ├── optim/                  # Optimizer/LR scheduler components (incl. Dion)
│   │   ├── quantization/           # FP8, QAT, QLoRA
│   │   ├── training/               # Train utils
│   │   └── utils/                  # Misc utils
│   ├── recipes/
│   │   ├── llm/                    # Main LLM train loop
│   │   ├── vlm/                    # Main VLM train loop
│   │   ├── diffusion/              # Diffusion training loop
│   │   ├── dllm/                   # Discrete diffusion LM training loop
│   │   └── retrieval/              # Retrieval / biencoder training loop
│   └── shared/
├── tools/                          # Developer tooling
└── tests/                          # Comprehensive test suite

Citation

If you use NeMo AutoModel in your research, please cite it using the following BibTeX entry:

@misc{nemo-automodel,
title = {NeMo AutoModel: DTensor-native SPMD library for scalable and efficient training},
howpublished = {\url{https://github.com/NVIDIA-NeMo/Automodel}},
year = {2025--2026},
note = {GitHub repository},
}

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

📄 License

NVIDIA NeMo AutoModel is licensed under the Apache License 2.0.