NVIDIA Nemotron-3-Nano-30B-A3B User Guide (original) (raw)

This guide describes how to run Nemotron-3-Nano-30B-A3B using vLLM. There are FP8 and BF16 versions.

Deployment Steps

We recommend using vLLM 0.12.0 release for full support. However, vLLM 0.11.2 also supports the model.

Pull Docker Image

Pull the vLLM v0.12.0 release docker image.

pull_image.sh

[](#%5F%5Fcodelineno-0-1)# On x86_64 systems: [](#%5F%5Fcodelineno-0-2)docker pull --platform linux/amd64 vllm/vllm-openai:v0.12.0 [](#%5F%5Fcodelineno-0-3)# On aarch64 systems: [](#%5F%5Fcodelineno-0-4)# docker pull --platform linux/aarch64 vllm/vllm-openai:v0.12.0 [](#%5F%5Fcodelineno-0-5) [](#%5F%5Fcodelineno-0-6)docker tag vllm/vllm-openai:v0.12.0 vllm/vllm-openai:deploy

DGX Spark Docker Image

Build container from source based on 0.12.0 or later release https://github.com/vllm-project/vllm/blob/v0.12.0/docker/Dockerfile

[](#%5F%5Fcodelineno-1-1)git clone https://github.com/vllm-project/vllm.git [](#%5F%5Fcodelineno-1-2) [](#%5F%5Fcodelineno-1-3)cd vllm [](#%5F%5Fcodelineno-1-4) [](#%5F%5Fcodelineno-1-5)DOCKER_BUILDKIT=1 docker build \ [](#%5F%5Fcodelineno-1-6) --build-arg max_jobs=12 \ [](#%5F%5Fcodelineno-1-7) --build-arg RUN_WHEEL_CHECK=false \ [](#%5F%5Fcodelineno-1-8) --build-arg CUDA_VERSION=13.0.1 \ [](#%5F%5Fcodelineno-1-9) --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 \ [](#%5F%5Fcodelineno-1-10) --build-arg torch_cuda_arch_list='12.1' \ [](#%5F%5Fcodelineno-1-11) --platform "linux/arm64" \ [](#%5F%5Fcodelineno-1-12) --tag <docker-image-tag-name> \ [](#%5F%5Fcodelineno-1-13) --target vllm-openai \ [](#%5F%5Fcodelineno-1-14) --progress plain \ [](#%5F%5Fcodelineno-1-15) -f docker/Dockerfile \ [](#%5F%5Fcodelineno-1-16).

Pull vLLM NGC docker image release version 25.12.post1-py3

[](#%5F%5Fcodelineno-2-1)docker pull nvcr.io/nvidia/vllm:25.12.post1-py3

Jetson Thor Docker Image

[](#%5F%5Fcodelineno-3-1)docker pull ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor

Run Docker Container

Run the docker container using the docker image vllm/vllm-openai:deploy.

run_container.sh

[](#%5F%5Fcodelineno-4-1)docker run -e HF_TOKEN="$HF_TOKEN" -e HF_HOME="$HF_HOME" --ipc=host --gpus all --entrypoint "/bin/bash" --rm -it vllm/vllm-openai:deploy

Note: You can mount additional directories and paths using the -v <local_path>:<path> flag if needed, such as mounting the downloaded weight paths.

The -e HF_TOKEN="$HF_TOKEN" -e HF_HOME="$HF_HOME" flags are added so that the models are downloaded using your HuggingFace token and the downloaded models can be cached in $HF_HOME. Refer to HuggingFace documentation for more information about these environment variables and refer to HuggingFace Quickstart guide about steps to generate your HuggingFace access token.

Run Docker Container on DGX Spark

With the docker container built from source or the pulled vLLM NGC container

Run Docker Container on Jetson Thor

With the pulled vLLM Jetson Thor container

Launch the vLLM Server

Below is an example command to launch the vLLM server with Nemotron-3-Nano-30B-A3B-BF16/FP8 model.

launch_server.sh

[](#%5F%5Fcodelineno-5-1)# Set up a few environment variables for better performance for Blackwell architecture. [](#%5F%5Fcodelineno-5-2)# They will be removed when the performance optimizations have been verified and enabled by default. [](#%5F%5Fcodelineno-5-3) [](#%5F%5Fcodelineno-5-4)# Supported dtypes for this model are: FP8, BF16 [](#%5F%5Fcodelineno-5-5)DTYPE="FP8" [](#%5F%5Fcodelineno-5-6) [](#%5F%5Fcodelineno-5-7)if [ "$DTYPE" = "FP8" ]; then [](#%5F%5Fcodelineno-5-8) # On FP8 only - set KV cache dtype to FP8 [](#%5F%5Fcodelineno-5-9) KV_CACHE_DTYPE="fp8" [](#%5F%5Fcodelineno-5-10) [](#%5F%5Fcodelineno-5-11) # Enable use of FlashInfer FP8 MoE [](#%5F%5Fcodelineno-5-12) export VLLM_USE_FLASHINFER_MOE_FP8=1 [](#%5F%5Fcodelineno-5-13) export VLLM_FLASHINFER_MOE_BACKEND=throughput [](#%5F%5Fcodelineno-5-14)else [](#%5F%5Fcodelineno-5-15) KV_CACHE_DTYPE="auto" [](#%5F%5Fcodelineno-5-16)fi [](#%5F%5Fcodelineno-5-17) [](#%5F%5Fcodelineno-5-18)# Launch the vLLM server [](#%5F%5Fcodelineno-5-19)vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-$DTYPE \ [](#%5F%5Fcodelineno-5-20) --trust-remote-code \ [](#%5F%5Fcodelineno-5-21) --async-scheduling \ [](#%5F%5Fcodelineno-5-22) --kv-cache-dtype $KV_CACHE_DTYPE \ [](#%5F%5Fcodelineno-5-23) --tensor-parallel-size 1 &

After the server is set up, the client can now send prompt requests to the server and receive results.

DGX Spark vLLM Server Launch

Downloading the custom parser

[](#%5F%5Fcodelineno-6-1)wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py

BF16 model variant

[](#%5F%5Fcodelineno-7-1)vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \ [](#%5F%5Fcodelineno-7-2) --max-num-seqs 8 \ [](#%5F%5Fcodelineno-7-3) --tensor-parallel-size 1 \ [](#%5F%5Fcodelineno-7-4) --max-model-len 262144 \ [](#%5F%5Fcodelineno-7-5) --port 8000 \ [](#%5F%5Fcodelineno-7-6) --trust-remote-code \ [](#%5F%5Fcodelineno-7-7) --enable-auto-tool-choice \ [](#%5F%5Fcodelineno-7-8) --tool-call-parser qwen3_coder \ [](#%5F%5Fcodelineno-7-9) --reasoning-parser-plugin nano_v3_reasoning_parser.py \ [](#%5F%5Fcodelineno-7-10) --reasoning-parser nano_v3

Jetson Thor vLLM Server Launch

BF16 model variant

[](#%5F%5Fcodelineno-8-1)vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \ [](#%5F%5Fcodelineno-8-2) --max-num-seqs 8 \ [](#%5F%5Fcodelineno-8-3) --tensor-parallel-size 1 \ [](#%5F%5Fcodelineno-8-4) --max-model-len 262144 \ [](#%5F%5Fcodelineno-8-5) --port 8000 \ [](#%5F%5Fcodelineno-8-6) --trust-remote-code \ [](#%5F%5Fcodelineno-8-7) --enable-auto-tool-choice \ [](#%5F%5Fcodelineno-8-8) --tool-call-parser qwen3_coder \ [](#%5F%5Fcodelineno-8-9) --reasoning-parser-plugin nano_v3_reasoning_parser.py \ [](#%5F%5Fcodelineno-8-10) --reasoning-parser nano_v3

FP8 model variant

[](#%5F%5Fcodelineno-9-1)vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \ [](#%5F%5Fcodelineno-9-2) --max-num-seqs 8 \ [](#%5F%5Fcodelineno-9-3) --tensor-parallel-size 1 \ [](#%5F%5Fcodelineno-9-4) --max-model-len 262144 \ [](#%5F%5Fcodelineno-9-5) --port 8000 \ [](#%5F%5Fcodelineno-9-6) --trust-remote-code \ [](#%5F%5Fcodelineno-9-7) --enable-auto-tool-choice \ [](#%5F%5Fcodelineno-9-8) --tool-call-parser qwen3_coder \ [](#%5F%5Fcodelineno-9-9) --reasoning-parser-plugin nano_v3_reasoning_parser.py \ [](#%5F%5Fcodelineno-9-10) --reasoning-parser nano_v3

Configs and Parameters

You can specify the IP address and the port that you would like to run the server with using these flags:

Below are the config flags that we do not recommend changing or tuning with:

Below are a few tunable parameters you can modify based on your serving requirements:

Refer to the "Balancing between Throughput and Latencies" about how to adjust these tunable parameters to meet your deployment requirements.

### Benchmarking Performance

To benchmark the performance, you can use the vllm bench serve command.

run_performance.sh

[](#%5F%5Fcodelineno-10-1)# Set DTYPE env var to match the benchmarked checkpoint (FP8 or BF16) [](#%5F%5Fcodelineno-10-2)vllm bench serve \ [](#%5F%5Fcodelineno-10-3) --host 0.0.0.0 \ [](#%5F%5Fcodelineno-10-4) --port 8000 \ [](#%5F%5Fcodelineno-10-5) --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-$DTYPE \ [](#%5F%5Fcodelineno-10-6) --trust-remote-code \ [](#%5F%5Fcodelineno-10-7) --dataset-name random \ [](#%5F%5Fcodelineno-10-8) --random-input-len 1024 \ [](#%5F%5Fcodelineno-10-9) --random-output-len 1024 \ [](#%5F%5Fcodelineno-10-10) --num-warmups 20 \ [](#%5F%5Fcodelineno-10-11) --ignore-eos \ [](#%5F%5Fcodelineno-10-12) --max-concurrency 1024 \ [](#%5F%5Fcodelineno-10-13) --num-prompts 2048 \ [](#%5F%5Fcodelineno-10-14) --save-result --result-filename vllm_benchmark_serving_results.json

Explanations for the flags:

Interpreting Performance Benchmarking Output

Sample output by the vllm bench serve command, with the FP8 model on H200:

[](#%5F%5Fcodelineno-11-1)============ Serving Benchmark Result ============ [](#%5F%5Fcodelineno-11-2)Successful requests: 2048 [](#%5F%5Fcodelineno-11-3)Failed requests: 0 [](#%5F%5Fcodelineno-11-4)Maximum request concurrency: 1024 [](#%5F%5Fcodelineno-11-5)Benchmark duration (s): 132.49 [](#%5F%5Fcodelineno-11-6)Total input tokens: 2097155 [](#%5F%5Fcodelineno-11-7)Total generated tokens: 2097152 [](#%5F%5Fcodelineno-11-8)Request throughput (req/s): 15.46 [](#%5F%5Fcodelineno-11-9)Output token throughput (tok/s): 15828.30 [](#%5F%5Fcodelineno-11-10)Peak output token throughput (tok/s): 21157.00 [](#%5F%5Fcodelineno-11-11)Peak concurrent requests: 1088.00 [](#%5F%5Fcodelineno-11-12)Total Token throughput (tok/s): 31656.63 [](#%5F%5Fcodelineno-11-13)---------------Time to First Token---------------- [](#%5F%5Fcodelineno-11-14)Mean TTFT (ms): 4490.58 [](#%5F%5Fcodelineno-11-15)Median TTFT (ms): 1534.84 [](#%5F%5Fcodelineno-11-16)P99 TTFT (ms): 15465.31 [](#%5F%5Fcodelineno-11-17)-----Time per Output Token (excl. 1st token)------ [](#%5F%5Fcodelineno-11-18)Mean TPOT (ms): 59.45 [](#%5F%5Fcodelineno-11-19)Median TPOT (ms): 61.04 [](#%5F%5Fcodelineno-11-20)P99 TPOT (ms): 63.01 [](#%5F%5Fcodelineno-11-21)---------------Inter-token Latency---------------- [](#%5F%5Fcodelineno-11-22)Mean ITL (ms): 59.45 [](#%5F%5Fcodelineno-11-23)Median ITL (ms): 52.75 [](#%5F%5Fcodelineno-11-24)P99 ITL (ms): 131.46 [](#%5F%5Fcodelineno-11-25)==================================================

Explanations for key metrics:

Balancing between Throughput and Latencies

In LLM inference, the "throughput" can be defined as the number of generated tokens per second (the Output token throughput metric above) or the number of processed tokens per second (the Total Token Throughput metric above). These two throughput metrics are highly correlated. We usually divide the throughput by the number of GPUs used to get the "per-GPU throughput" when comparing across different parallelism configurations. The higher per-GPU throughput is, the fewer GPUs are needed to serve the same amount of the incoming requests.

On the other hand, the “latency” can be defined as the latency from when a request is sent until the first output token is generated (the TTFT metric), the latency between two generated tokens after the first one has been generated (the TPOT metric), or the end-to-end latency from when a request is sent to when the final token is generated (the E2EL metric). The TTFT affects the E2EL more when the input (prompt) sequence lengths are much longer than the output (generated) sequence lengths, while the TPOT affects the E2EL more in the opposite cases.

To achieve higher throughput, tokens from multiple requests must be batched and processed together, but that increases the latencies. Therefore, a balance must be made between throughput and latencies depending on the deployment requirements.

The two main tunable configs for Nemotron Nano 3 are the --tensor-parallel-size (TP) and --max-num-seqs (BS). How they affect the throughput and latencies can be summarized as the following:

Note that the statements above assume that the concurrency setting on the client side, like the --max-concurrency flag in the performance benchmarking command, matches the --max-num-seqs (BS) setting on the server side.