nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 · vLLM MTP unusable on RTX 6000 Pro, as spec decoding consumes 20GB+ VRAM at start-up, causing OOM

With MTP disabled, vLLM starts serving with a VRAM footprint of roughly 77GB. Enabling speculative decoding causes an OOM even with --gpu-memory-utilization 0.95 against 96GB of VRAM.
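
For anyone checking the numbers, here is a rough back-of-the-envelope sketch in Python. It takes the figures in this post at face value (96GB card, 0.95 utilization, roughly 77GB baseline, and the "20GB+" from the title as a lower bound). The variable names are just for illustration, and whether the 77GB already includes preallocated KV cache is not something this post establishes.

```python
# Figures as reported in this thread; all approximate.
total_vram_gb = 96.0             # nominal VRAM of the card
gpu_memory_utilization = 0.95    # value passed to --gpu-memory-utilization
baseline_footprint_gb = 77.0     # reported footprint with MTP disabled
reported_mtp_overhead_gb = 20.0  # "20GB+" from the title, used as a lower bound

# Memory vLLM is allowed to use under the utilization cap.
budget_gb = total_vram_gb * gpu_memory_utilization

# What is left once the baseline footprint is accounted for.
headroom_gb = budget_gb - baseline_footprint_gb

# Positive shortfall means the extra allocation cannot fit in the budget.
shortfall_gb = reported_mtp_overhead_gb - headroom_gb

print(f"budget:    {budget_gb:.1f} GB")
print(f"headroom:  {headroom_gb:.1f} GB")
print(f"shortfall: {shortfall_gb:.1f} GB")
```

Under those assumptions the budget is about 91.2GB, leaving about 14.2GB of headroom, so an extra 20GB or more would overshoot by roughly 6GB.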

Try FP8 KV cache if it's not on already?

Which config are you using?

@Chris-Alexiuk

--async-scheduling
--served-model-name nvidia/nemotron-3-super
--dtype auto
--kv-cache-dtype fp8
--tensor-parallel-size 1
--pipeline-parallel-size 1
--data-parallel-size 1
--trust-remote-code
--attention-backend TRITON_ATTN
--gpu-memory-utilization 0.95
--no-enable-chunked-prefill
--max-num-seqs 20
--host 0.0.0.0
--port 8000
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser-plugin "./super_v3_reasoning_parser.py"
--reasoning-parser super_v3
--max-model-len 32763
--no-enable-prefix-caching
--speculative-config '{"method": "nemotron_h_mtp", "num_speculative_tokens": 5}'

(EngineCore_DP0 pid=256) MemoryError: CUDA out of memory. Tried to allocate 2.60 GiB. GPU 0 has a total capacity of 94.97 GiB of which 1.18 GiB is free. Including non-PyTorch memory, this process has 93.77 GiB memory in use. Of the allocated memory 91.62 GiB is allocated by PyTorch, and 1.41 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(EngineCore_DP0 pid=256) Exception raised from malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1574 (most recent call first):
(EngineCore_DP0 pid=256) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x9d (0x7304aeadefdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
(EngineCore_DP0 pid=256) frame #1: + 0x42815 (0x7304aebae815 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
(EngineCore_DP0 pid=256) frame #2: + 0x42aa2 (0x7304aebaeaa2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
(EngineCore_DP0 pid=256) frame #3: + 0x4315f (0x7304aebaf15f in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
(EngineCore_DP0 pid=256) frame #4: at::detail::empty_generic(c10::ArrayRef, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, std::optionalc10::MemoryFormat) + 0x5cc (0x7304131aba9c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
(EngineCore_DP0 pid=256) frame #5: at::detail::empty_cuda(c10::ArrayRef, c10::ScalarType, std::optionalc10::Device, std::optionalc10::MemoryFormat) + 0x7f (0x7303f4018c8f in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
(EngineCore_DP0 pid=256) frame #6: at::detail::empty_cuda(c10::ArrayRef, std::optionalc10::ScalarType, std::optionalc10::Layout, std::optionalc10::Device, std::optional, std::optionalc10::MemoryFormat) + 0x6e (0x7303f4018e9e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
(EngineCore_DP0 pid=256) frame #7: at::native::empty_cuda(c10::ArrayRef, std::optionalc10::ScalarType, std::optionalc10::Layout, std::optionalc10::Device, std::optional, std::optionalc10::MemoryFormat) + 0x43 (0x7303f4315243 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
(EngineCore_DP0 pid=256) frame #8: + 0x365b8a2 (0x7303f69888a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
(EngineCore_DP0 pid=256) frame #9: + 0x365b9bc (0x7303f69889bc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
(EngineCore_DP0 pid=256) frame #10: at::_ops::empty_memory_format::redispatch(c10::DispatchKeySet, c10::ArrayRefc10::SymInt, std::optionalc10::ScalarType, std::optionalc10::Layout, std::optionalc10::Device, std::optional, std::optionalc10::MemoryFormat) + 0xe6 (0x7304142dbe56 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
(EngineCore_DP0 pid=256) frame #11: + 0x2ee5b39 (0x730414743b39 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
(EngineCore_DP0 pid=256) frame #12: at::_ops::empty_memory_format::call(c10::ArrayRefc10::SymInt, std::optionalc10::ScalarType, std::optionalc10::Layout, std::optionalc10::Device, std::optional, std::optionalc10::MemoryFormat) + 0x154 (0x7304143b0994 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
(EngineCore_DP0 pid=256) frame #13: + 0x76ba25 (0x730427433a25 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
(EngineCore_DP0 pid=256) frame #14: TVMFFIEnvTensorAlloc + 0x68 (0x73003c359968 in /usr/local/lib/python3.12/dist-packages/tvm_ffi/lib/libtvm_ffi.so)
(EngineCore_DP0 pid=256) frame #15: FusedMoeRunner::getWorkspaceInfo(long, long, long, int, int, tensorrt_llm::kernels::cutlass_kernels::ActivationType, tensorrt_llm::kernels::cutlass_kernels::MOEParallelismConfig, bool) + 0x17e (0x72ecc78274ce in /usr/local/lib/python3.12/dist-packages/flashinfer_jit_cache/jit_cache/fused_moe_120/fused_moe_120.so)
(EngineCore_DP0 pid=256) frame #16: + 0x6cea22 (0x72ecc784ea22 in /usr/local/lib/python3.12/dist-packages/flashinfer_jit_cache/jit_cache/fused_moe_120/fused_moe_120.so)
(EngineCore_DP0 pid=256) frame #17: FusedMoeRunner::GetFunction(tvm::ffi::String const&)::{lambda(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor, void>, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, long, long, long, long, long, long, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long, void>, void>, bool, long)#5}::operator()(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor, void>, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, long, long, long, long, long, long, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long, void>, void>, bool, long) const + 0x2de (0x72ecc78554de in /usr/local/lib/python3.12/dist-packages/flashinfer_jit_cache/jit_cache/fused_moe_120/fused_moe_120.so)
(EngineCore_DP0 pid=256) frame #18: tvm::ffi::Function::FromTyped<FusedMoeRunner::GetFunction(tvm::ffi::String const&)::{lambda(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor, void>, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, long, long, long, long, long, long, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long, void>, void>, bool, long)#5}>(FusedMoeRunner::GetFunction(tvm::ffi::String const&)::{lambda(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor, void>, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, long, long, long, long, long, long, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long, void>, void>, bool, long)#5}&&)::{lambda(tvm::ffi::AnyView const*, int, tvm::ffi::Any*)#1}::operator()(tvm::ffi::AnyView const*, int, tvm::ffi::Any*) + 0x12657 (0x72ecc78787f7 in /usr/local/lib/python3.12/dist-packages/flashinfer_jit_cache/jit_cache/fused_moe_120/fused_moe_120.so)
(EngineCore_DP0 pid=256) frame #19: tvm::ffi::details::FunctionObjImpl<tvm::ffi::Function::FromTyped<FusedMoeRunner::GetFunction(tvm::ffi::String const&)::{lambda(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor, void>, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, long, long, long, long, long, long, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long, void>, void>, bool, long)#5}>(FusedMoeRunner::GetFunction(tvm::ffi::String const&)::{lambda(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor, void>, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, long, long, long, long, long, long, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long, void>, void>, bool, long)#5}&&)::{lambda(tvm::ffi::AnyView const*, int, tvm::ffi::Any*)#1}>::SafeCall(void*, TVMFFIAny const*, int, TVMFFIAny*) + 0x373 (0x72ecc782dad3 in /usr/local/lib/python3.12/dist-packages/flashinfer_jit_cache/jit_cache/fused_moe_120/fused_moe_120.so)
(EngineCore_DP0 pid=256) frame #20: + 0x57914 (0x730376ee4914 in /usr/local/lib/python3.12/dist-packages/tvm_ffi/core.abi3.so)
(EngineCore_DP0 pid=256) frame #21: _PyObject_Call + 0x93 (0x580f33 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #22: _PyEval_EvalFrameDefault + 0x4fd7 (0x54e8b7 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #23: + 0xa49f56 (0x730427711f56 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
(EngineCore_DP0 pid=256) frame #24: + 0xdc0da5 (0x730427a88da5 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
(EngineCore_DP0 pid=256) frame #25: + 0x66a757c (0x730417f0557c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
(EngineCore_DP0 pid=256) frame #26: + 0xb13884 (0x7304277db884 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
(EngineCore_DP0 pid=256) frame #27: + 0xb13d88 (0x7304277dbd88 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
(EngineCore_DP0 pid=256) frame #28: + 0x9f7c72 (0x7304276bfc72 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
(EngineCore_DP0 pid=256) frame #29: + 0x40e460 (0x7304270d6460 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
(EngineCore_DP0 pid=256) frame #30: VLLM::EngineCore() [0x56cc2d]
(EngineCore_DP0 pid=256) frame #31: _PyObject_Call + 0x93 (0x580f33 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #32: _PyEval_EvalFrameDefault + 0x4fd7 (0x54e8b7 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #33: _PyObject_FastCallDictTstate + 0x1d8 (0x541b48 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #34: _PyObject_Call_Prepend + 0x59 (0x57e2e9 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #35: VLLM::EngineCore() [0x6690fd]
(EngineCore_DP0 pid=256) frame #36: _PyObject_MakeTpCall + 0x2fb (0x53f2ab in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #37: _PyEval_EvalFrameDefault + 0x700 (0x549fe0 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #38: _PyObject_FastCallDictTstate + 0x1d8 (0x541b48 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #39: _PyObject_Call_Prepend + 0x59 (0x57e2e9 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #40: VLLM::EngineCore() [0x6690fd]
(EngineCore_DP0 pid=256) frame #41: _PyObject_MakeTpCall + 0x2fb (0x53f2ab in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #42: _PyEval_EvalFrameDefault + 0x700 (0x549fe0 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #43: _PyObject_FastCallDictTstate + 0x1d8 (0x541b48 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #44: _PyObject_Call_Prepend + 0x59 (0x57e2e9 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #45: VLLM::EngineCore() [0x6690fd]
(EngineCore_DP0 pid=256) frame #46: _PyObject_MakeTpCall + 0x2fb (0x53f2ab in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #47: _PyEval_EvalFrameDefault + 0x700 (0x549fe0 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #48: _PyObject_FastCallDictTstate + 0x1d8 (0x541b48 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #49: _PyObject_Call_Prepend + 0xbe (0x57e34e in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #50: VLLM::EngineCore() [0x6690fd]
(EngineCore_DP0 pid=256) frame #51: _PyObject_Call + 0x93 (0x580f33 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #52: _PyEval_EvalFrameDefault + 0x4fd7 (0x54e8b7 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #53: _PyObject_FastCallDictTstate + 0x1d8 (0x541b48 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #54: _PyObject_Call_Prepend + 0xbe (0x57e34e in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #55: VLLM::EngineCore() [0x6690fd]
(EngineCore_DP0 pid=256) frame #56: _PyObject_Call + 0x93 (0x580f33 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #57: _PyEval_EvalFrameDefault + 0x4fd7 (0x54e8b7 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #58: _PyObject_FastCallDictTstate + 0x1d8 (0x541b48 in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #59: _PyObject_Call_Prepend + 0xbe (0x57e34e in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #60: VLLM::EngineCore() [0x6690fd]
(EngineCore_DP0 pid=256) frame #61: _PyObject_MakeTpCall + 0x2fb (0x53f2ab in VLLM::EngineCore)
(EngineCore_DP0 pid=256) frame #62: _PyEval_EvalFrameDefault + 0x700 (0x549fe0 in VLLM::EngineCore)
(EngineCore_DP0 pid=256)
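
To make the first line of that trace easier to read, here is a small Python sketch that pulls the figures out of the OOM message and does the subtraction. The message text is copied from the log above (shortened to the part with numbers). The regex patterns and the names `PATTERNS` and `parse_oom` are my own for illustration, not anything from vLLM or PyTorch.

```python
import re

# First line of the trace above, trimmed to the sentences that carry numbers.
message = (
    "CUDA out of memory. Tried to allocate 2.60 GiB. "
    "GPU 0 has a total capacity of 94.97 GiB of which 1.18 GiB is free. "
    "Including non-PyTorch memory, this process has 93.77 GiB memory in use. "
    "Of the allocated memory 91.62 GiB is allocated by PyTorch, "
    "and 1.41 GiB is reserved by PyTorch but unallocated."
)

# Hypothetical patterns written against this specific message wording.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) GiB",
    "capacity": r"total capacity of ([\d.]+) GiB",
    "free": r"of which ([\d.]+) GiB is free",
    "in_use": r"this process has ([\d.]+) GiB memory in use",
    "allocated": r"([\d.]+) GiB is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) GiB is reserved by PyTorch",
}


def parse_oom(text: str) -> dict[str, float]:
    """Extract the GiB figures from a PyTorch CUDA OOM message."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"could not find {name!r} in message")
        fields[name] = float(match.group(1))
    return fields


fields = parse_oom(message)

# How far the failed allocation overshoots the free memory.
shortfall_gib = fields["requested"] - fields["free"]

# Memory held by the process outside PyTorch's allocator.
non_pytorch_gib = (
    fields["in_use"] - fields["allocated"] - fields["reserved_unallocated"]
)

# Fraction of the card this one process occupies at the time of failure.
used_fraction = fields["in_use"] / fields["capacity"]

print(f"shortfall:   {shortfall_gib:.2f} GiB")
print(f"non-PyTorch: {non_pytorch_gib:.2f} GiB")
print(f"in use:      {used_fraction:.1%} of capacity")
```

Read this way, the 2.60 GiB request misses by about 1.42 GiB, and the reserved-but-unallocated 1.41 GiB plus the 1.18 GiB free still adds up to slightly less than the request. The process is holding roughly 98.7% of the card at the point of failure, which is above the 0.95 passed to --gpu-memory-utilization; the trace alone does not say why.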

I don't think this model is intended for 6000 Pro. Even if MTP added no VRAM footprint you'd have very limited context length, no?

IMHO this model was actually released more as a technology demonstrator than as something directly intended for end-users. "Here's all the materials and best-practices to properly make your own extremely performant state-of-the-art local model with NVFP4 to make the best use of Nvidia hardware." Similar to Unreal Engine putting out a demonstration game/scene when a new version of UE comes out.
