Why nemotron 3 NVFP4 models are not deterministic using vLLM? (original) (raw)
Hello everyone,
I cannot run nemotron 3 NVFP4 models using vLLM on a single DGX Spark in a determinstic way.
I run the same prompt two consecutive time and the outputs are different.
The client request configuration has temperature at 0 and a fixed seed.
Here it is the vLLM command used:
docker run --rm -it --gpus all
-e VLLM_FLASHINFER_MOE_BACKEND=throughput
-e VLLM_USE_FLASHINFER_MOE_FP4=1
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
-e HF_HUB_OFFLINE=1
-e TRANSFORMERS_OFFLINE=1
-e VLLM_ENABLE_V1_MULTIPROCESSING=0
-e VLLM_BATCH_INVARIANT=1
-e VLLM_LOGGING_LEVEL=INFO
-v ~/.cache/huggingface:/root/.cache/huggingface
-v /home/shared/AI/VLLM_Patches/nano_v3_reasoning_parser.py:/app/nano_v3_reasoning_parser.py
-p 8000:8000
–name vllm-container
vllm/vllm-openai:v0.20.0-aarch64-cu130-ubuntu2404
–enable-log-outputs
–enable-log-requests
–model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
–served-model-name nvidia/nemotron-3-nano
–host 0.0.0.0
–port 8000
–seed 42
–override-generation-config ‘{“temperature”: 0.0, “top_p”: 0.1, “top_k”: 1}’
–generation-config vllm
–attention-backend flashinfer
–kv-cache-dtype fp8
–tensor-parallel-size 1
–pipeline-parallel-size 1
–data-parallel-size 1
–trust-remote-code
–gpu-memory-utilization 0.70
–enable-chunked-prefill
–max-num-seqs 1
–max-num-batched-tokens 32768
–max-model-len 500000
–reasoning-parser-plugin /app/nano_v3_reasoning_parser.py
–reasoning-parser nano_v3
–enable-auto-tool-choice
–tool-call-parser qwen3_coder
Do you see any missing or incorrect configurations in my vLLM command?
0rand June 9, 2026, 2:47pm 2
Maybe because all AI/LLM are non-deterministic in nature and by design? They predict, not calculate. Use Wikipedia. Oh wait, it gets edited all the time. Also non-deterministic. LOL
But with the nemotron FP8 quantized model I can achieve the determinism (same output generated from the same prompt on different runs on the same device, using temperature at 0)…
mangosq June 10, 2026, 10:45am 4
Temp and seed are only part of the equation (sampling in vllm). Other systems within vllm also introduce randomness.
But most importantly floating point calculation by the gpu will always produce slightly different results for the same input values. Smaller quants like nvfp4 store weights with less precision than bigger quants like fp8, thus producing more variability than fp8.
And even if we could invent a gpu that can store integers with the same scale and range as floats using the same number of bits, one might argue that the floating point variability adds to the illusion of creativity for llms, potentially allowing it to better avoid reasoning loops.