NVIDIA Nemotron-3-Nano-30B-A3B User Guide (original) (raw)

This guide describes how to run Nemotron-3-Nano-30B-A3B using vLLM. There are FP8 and BF16 versions.

Deployment Steps¶

We recommend using vLLM 0.12.0 release for full support. However, vLLM 0.11.2 also supports the model.

Pull Docker Image¶

Pull the vLLM v0.12.0 release docker image.

pull_image.sh

[](#%5F%5Fcodelineno-0-1)# On x86_64 systems: [](#%5F%5Fcodelineno-0-2)docker pull --platform linux/amd64 vllm/vllm-openai:v0.12.0 [](#%5F%5Fcodelineno-0-3)# On aarch64 systems: [](#%5F%5Fcodelineno-0-4)# docker pull --platform linux/aarch64 vllm/vllm-openai:v0.12.0 [](#%5F%5Fcodelineno-0-5) [](#%5F%5Fcodelineno-0-6)docker tag vllm/vllm-openai:v0.12.0 vllm/vllm-openai:deploy

DGX Spark Docker Image¶

Build container from source based on 0.12.0 or later release https://github.com/vllm-project/vllm/blob/v0.12.0/docker/Dockerfile

[](#%5F%5Fcodelineno-1-1)git clone https://github.com/vllm-project/vllm.git [](#%5F%5Fcodelineno-1-2) [](#%5F%5Fcodelineno-1-3)cd vllm [](#%5F%5Fcodelineno-1-4) [](#%5F%5Fcodelineno-1-5)DOCKER_BUILDKIT=1 docker build \ [](#%5F%5Fcodelineno-1-6) --build-arg max_jobs=12 \ [](#%5F%5Fcodelineno-1-7) --build-arg RUN_WHEEL_CHECK=false \ [](#%5F%5Fcodelineno-1-8) --build-arg CUDA_VERSION=13.0.1 \ [](#%5F%5Fcodelineno-1-9) --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 \ [](#%5F%5Fcodelineno-1-10) --build-arg torch_cuda_arch_list='12.1' \ [](#%5F%5Fcodelineno-1-11) --platform "linux/arm64" \ [](#%5F%5Fcodelineno-1-12) --tag <docker-image-tag-name> \ [](#%5F%5Fcodelineno-1-13) --target vllm-openai \ [](#%5F%5Fcodelineno-1-14) --progress plain \ [](#%5F%5Fcodelineno-1-15) -f docker/Dockerfile \ [](#%5F%5Fcodelineno-1-16).

Pull vLLM NGC docker image release version 25.12.post1-py3

[](#%5F%5Fcodelineno-2-1)docker pull nvcr.io/nvidia/vllm:25.12.post1-py3

Jetson Thor Docker Image¶

[](#%5F%5Fcodelineno-3-1)docker pull ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor

Run Docker Container¶

Run the docker container using the docker image vllm/vllm-openai:deploy.

run_container.sh

[](#%5F%5Fcodelineno-4-1)docker run -e HF_TOKEN="$HF_TOKEN" -e HF_HOME="$HF_HOME" --ipc=host --gpus all --entrypoint "/bin/bash" --rm -it vllm/vllm-openai:deploy

Note: You can mount additional directories and paths using the -v <local_path>:<path> flag if needed, such as mounting the downloaded weight paths.

The -e HF_TOKEN="$HF_TOKEN" -e HF_HOME="$HF_HOME" flags are added so that the models are downloaded using your HuggingFace token and the downloaded models can be cached in $HF_HOME. Refer to HuggingFace documentation for more information about these environment variables and refer to HuggingFace Quickstart guide about steps to generate your HuggingFace access token.

Run Docker Container on DGX Spark¶

With the docker container built from source or the pulled vLLM NGC container

Run Docker Container on Jetson Thor¶

With the pulled vLLM Jetson Thor container

Launch the vLLM Server¶

Below is an example command to launch the vLLM server with Nemotron-3-Nano-30B-A3B-BF16/FP8 model.

launch_server.sh

[](#%5F%5Fcodelineno-5-1)# Set up a few environment variables for better performance for Blackwell architecture. [](#%5F%5Fcodelineno-5-2)# They will be removed when the performance optimizations have been verified and enabled by default. [](#%5F%5Fcodelineno-5-3) [](#%5F%5Fcodelineno-5-4)# Supported dtypes for this model are: FP8, BF16 [](#%5F%5Fcodelineno-5-5)DTYPE="FP8" [](#%5F%5Fcodelineno-5-6) [](#%5F%5Fcodelineno-5-7)if [ "$DTYPE" = "FP8" ]; then [](#%5F%5Fcodelineno-5-8) # On FP8 only - set KV cache dtype to FP8 [](#%5F%5Fcodelineno-5-9) KV_CACHE_DTYPE="fp8" [](#%5F%5Fcodelineno-5-10) [](#%5F%5Fcodelineno-5-11) # Enable use of FlashInfer FP8 MoE [](#%5F%5Fcodelineno-5-12) export VLLM_USE_FLASHINFER_MOE_FP8=1 [](#%5F%5Fcodelineno-5-13) export VLLM_FLASHINFER_MOE_BACKEND=throughput [](#%5F%5Fcodelineno-5-14)else [](#%5F%5Fcodelineno-5-15) KV_CACHE_DTYPE="auto" [](#%5F%5Fcodelineno-5-16)fi [](#%5F%5Fcodelineno-5-17) [](#%5F%5Fcodelineno-5-18)# Launch the vLLM server [](#%5F%5Fcodelineno-5-19)vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-$DTYPE \ [](#%5F%5Fcodelineno-5-20) --trust-remote-code \ [](#%5F%5Fcodelineno-5-21) --async-scheduling \ [](#%5F%5Fcodelineno-5-22) --kv-cache-dtype $KV_CACHE_DTYPE \ [](#%5F%5Fcodelineno-5-23) --tensor-parallel-size 1 &

After the server is set up, the client can now send prompt requests to the server and receive results.

DGX Spark vLLM Server Launch¶

Downloading the custom parser

[](#%5F%5Fcodelineno-6-1)wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py

BF16 model variant

[](#%5F%5Fcodelineno-7-1)vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \ [](#%5F%5Fcodelineno-7-2) --max-num-seqs 8 \ [](#%5F%5Fcodelineno-7-3) --tensor-parallel-size 1 \ [](#%5F%5Fcodelineno-7-4) --max-model-len 262144 \ [](#%5F%5Fcodelineno-7-5) --port 8000 \ [](#%5F%5Fcodelineno-7-6) --trust-remote-code \ [](#%5F%5Fcodelineno-7-7) --enable-auto-tool-choice \ [](#%5F%5Fcodelineno-7-8) --tool-call-parser qwen3_coder \ [](#%5F%5Fcodelineno-7-9) --reasoning-parser-plugin nano_v3_reasoning_parser.py \ [](#%5F%5Fcodelineno-7-10) --reasoning-parser nano_v3

Jetson Thor vLLM Server Launch¶

BF16 model variant

[](#%5F%5Fcodelineno-8-1)vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \ [](#%5F%5Fcodelineno-8-2) --max-num-seqs 8 \ [](#%5F%5Fcodelineno-8-3) --tensor-parallel-size 1 \ [](#%5F%5Fcodelineno-8-4) --max-model-len 262144 \ [](#%5F%5Fcodelineno-8-5) --port 8000 \ [](#%5F%5Fcodelineno-8-6) --trust-remote-code \ [](#%5F%5Fcodelineno-8-7) --enable-auto-tool-choice \ [](#%5F%5Fcodelineno-8-8) --tool-call-parser qwen3_coder \ [](#%5F%5Fcodelineno-8-9) --reasoning-parser-plugin nano_v3_reasoning_parser.py \ [](#%5F%5Fcodelineno-8-10) --reasoning-parser nano_v3

FP8 model variant

[](#%5F%5Fcodelineno-9-1)vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \ [](#%5F%5Fcodelineno-9-2) --max-num-seqs 8 \ [](#%5F%5Fcodelineno-9-3) --tensor-parallel-size 1 \ [](#%5F%5Fcodelineno-9-4) --max-model-len 262144 \ [](#%5F%5Fcodelineno-9-5) --port 8000 \ [](#%5F%5Fcodelineno-9-6) --trust-remote-code \ [](#%5F%5Fcodelineno-9-7) --enable-auto-tool-choice \ [](#%5F%5Fcodelineno-9-8) --tool-call-parser qwen3_coder \ [](#%5F%5Fcodelineno-9-9) --reasoning-parser-plugin nano_v3_reasoning_parser.py \ [](#%5F%5Fcodelineno-9-10) --reasoning-parser nano_v3

Configs and Parameters¶

You can specify the IP address and the port that you would like to run the server with using these flags:

host: IP address of the server. By default, it uses 127.0.0.1.
port: The port to listen to by the server. By default, it uses port 8000.

Below are the config flags that we do not recommend changing or tuning with:

kv-cache-dtype: KV cache data type. We recommend setting it to "fp8" when using the FP8 model, otherwise set to "auto".
async-scheduling: Enable asynchronous scheduling to reduce the host overheads between decoding steps. We recommend always adding this flag for best performance.

Below are a few tunable parameters you can modify based on your serving requirements:

mamba-ssm-cache-dtype: Mamba SSM cache data type. For best model accuracy set to float32. When using vLLM from main branch or any release newer than 0.12.0, setting to float16 improves performance while degrading accuracy only slightly when compared to float32. The default value with this model until (and including) vLLM release 0.12.0 is bfloat16, in newer release or on main branch of vLLM, the default value would be either what's specified in the mamba_ssm_cache_dtype field in the model's HF config.json, or if it's not found there then float16 would be used.
tensor-parallel-size: Tensor parallelism size. Increasing this will increase the number of GPUs that are used for inference.
max-num-seqs: Maximum number of sequences per batch.
By default, this is set to a large number like 1024 on GPUs with large memory sizes.
If the actual concurrency is smaller, setting this to a smaller number matching the max concurrency may improve the performance and improve the per-user latencies.
max-model-len: Maximum number of total tokens, including the input tokens and output tokens, for each request.
By default, this is set to the maximum sequence length supported by the model.
If the actual input+output sequence length is shorter than the default, setting this to a smaller number may improve the performance.
For example, if the maximum input sequence length is 1024 tokens and maximum output sequence length is 1024, then this can be set to 2048 for better performance.

Refer to the "Balancing between Throughput and Latencies" about how to adjust these tunable parameters to meet your deployment requirements.

### Benchmarking Performance

To benchmark the performance, you can use the vllm bench serve command.

run_performance.sh

[](#%5F%5Fcodelineno-10-1)# Set DTYPE env var to match the benchmarked checkpoint (FP8 or BF16) [](#%5F%5Fcodelineno-10-2)vllm bench serve \ [](#%5F%5Fcodelineno-10-3) --host 0.0.0.0 \ [](#%5F%5Fcodelineno-10-4) --port 8000 \ [](#%5F%5Fcodelineno-10-5) --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-$DTYPE \ [](#%5F%5Fcodelineno-10-6) --trust-remote-code \ [](#%5F%5Fcodelineno-10-7) --dataset-name random \ [](#%5F%5Fcodelineno-10-8) --random-input-len 1024 \ [](#%5F%5Fcodelineno-10-9) --random-output-len 1024 \ [](#%5F%5Fcodelineno-10-10) --num-warmups 20 \ [](#%5F%5Fcodelineno-10-11) --ignore-eos \ [](#%5F%5Fcodelineno-10-12) --max-concurrency 1024 \ [](#%5F%5Fcodelineno-10-13) --num-prompts 2048 \ [](#%5F%5Fcodelineno-10-14) --save-result --result-filename vllm_benchmark_serving_results.json

Explanations for the flags:

--dataset-name: Which dataset to use for benchmarking. We use a random dataset here.
--random-input-len: Specifies the average input sequence length.
--random-output-len: Specifies the average output sequence length.
--num-warmups: Specifies the number of warmup requests. It helps to ensure the benchmark reflects the actual steady-state performance, ignoring the initial overheads.
--ignore-eos: Disables early returning when eos (end-of-sentence) token is generated.
--max-concurrency: Maximum number of in-flight requests. We recommend matching this with the --max-num-seqs flag used to launch the server.
--num-prompts: Total number of prompts used for performance benchmarking. We recommend setting it to at least five times of the --max-concurrency to measure the steady state performance.
--save-result --result-filename: Output location for the performance benchmarking result.

Interpreting Performance Benchmarking Output¶

Sample output by the vllm bench serve command, with the FP8 model on H200:

[](#%5F%5Fcodelineno-11-1)============ Serving Benchmark Result ============ [](#%5F%5Fcodelineno-11-2)Successful requests: 2048 [](#%5F%5Fcodelineno-11-3)Failed requests: 0 [](#%5F%5Fcodelineno-11-4)Maximum request concurrency: 1024 [](#%5F%5Fcodelineno-11-5)Benchmark duration (s): 132.49 [](#%5F%5Fcodelineno-11-6)Total input tokens: 2097155 [](#%5F%5Fcodelineno-11-7)Total generated tokens: 2097152 [](#%5F%5Fcodelineno-11-8)Request throughput (req/s): 15.46 [](#%5F%5Fcodelineno-11-9)Output token throughput (tok/s): 15828.30 [](#%5F%5Fcodelineno-11-10)Peak output token throughput (tok/s): 21157.00 [](#%5F%5Fcodelineno-11-11)Peak concurrent requests: 1088.00 [](#%5F%5Fcodelineno-11-12)Total Token throughput (tok/s): 31656.63 [](#%5F%5Fcodelineno-11-13)---------------Time to First Token---------------- [](#%5F%5Fcodelineno-11-14)Mean TTFT (ms): 4490.58 [](#%5F%5Fcodelineno-11-15)Median TTFT (ms): 1534.84 [](#%5F%5Fcodelineno-11-16)P99 TTFT (ms): 15465.31 [](#%5F%5Fcodelineno-11-17)-----Time per Output Token (excl. 1st token)------ [](#%5F%5Fcodelineno-11-18)Mean TPOT (ms): 59.45 [](#%5F%5Fcodelineno-11-19)Median TPOT (ms): 61.04 [](#%5F%5Fcodelineno-11-20)P99 TPOT (ms): 63.01 [](#%5F%5Fcodelineno-11-21)---------------Inter-token Latency---------------- [](#%5F%5Fcodelineno-11-22)Mean ITL (ms): 59.45 [](#%5F%5Fcodelineno-11-23)Median ITL (ms): 52.75 [](#%5F%5Fcodelineno-11-24)P99 ITL (ms): 131.46 [](#%5F%5Fcodelineno-11-25)==================================================

Explanations for key metrics:

Median Time to First Token (TTFT): The typical time elapsed from when a request is sent until the first output token is generated.
Median Time Per Output Token (TPOT): The typical time required to generate each token after the first one.
Median Inter-Token Latency (ITL): The typical time delay between a response for the completion of one output token (or output tokens) and the next response for the completion of token(s).
Output token throughput: The rate at which the system generates the output (generated) tokens.
Total Token Throughput: The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.

Balancing between Throughput and Latencies¶

In LLM inference, the "throughput" can be defined as the number of generated tokens per second (the Output token throughput metric above) or the number of processed tokens per second (the Total Token Throughput metric above). These two throughput metrics are highly correlated. We usually divide the throughput by the number of GPUs used to get the "per-GPU throughput" when comparing across different parallelism configurations. The higher per-GPU throughput is, the fewer GPUs are needed to serve the same amount of the incoming requests.

On the other hand, the “latency” can be defined as the latency from when a request is sent until the first output token is generated (the TTFT metric), the latency between two generated tokens after the first one has been generated (the TPOT metric), or the end-to-end latency from when a request is sent to when the final token is generated (the E2EL metric). The TTFT affects the E2EL more when the input (prompt) sequence lengths are much longer than the output (generated) sequence lengths, while the TPOT affects the E2EL more in the opposite cases.

To achieve higher throughput, tokens from multiple requests must be batched and processed together, but that increases the latencies. Therefore, a balance must be made between throughput and latencies depending on the deployment requirements.

The two main tunable configs for Nemotron Nano 3 are the --tensor-parallel-size (TP) and --max-num-seqs (BS). How they affect the throughput and latencies can be summarized as the following:

At the same BS, higher TP typically results in lower latencies but also lower throughput.
At the same TP size, higher BS typically results in higher throughput but worse latencies, but the maximum BS is limited by the amount of available GPU memory for the kv-cache after the weights are loaded.
Therefore, increasing TP (which would lower the throughput at the same BS) may allow higher BS to run (which would increase the throughput), and the net throughput gain/loss depends on models and configurations.

Note that the statements above assume that the concurrency setting on the client side, like the --max-concurrency flag in the performance benchmarking command, matches the --max-num-seqs (BS) setting on the server side.