CUDA illegal memory access with MTP speculative decoding on Nemotron-3-Super-120B-NVFP4 (vLLM cu130-nightly, single DGX Spark GB10) (original) (raw)

Hi all — running the official NVIDIA Nemotron-3-Super Spark Deployment Guide and hitting a hard blocker. MTP speculative decoding crashes at runtime with CUDA error: an
illegal memory access was encountered. Hoping someone has seen this / knows a workaround before I open an upstream issue.

Environment

Config (following NVIDIA official Spark Deployment Guide)

vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
–quantization nvfp4
–kv-cache-dtype fp8
–gpu-memory-utilization 0.75
–max-model-len 32768
–max-num-seqs 4
–moe-backend marlin
–attention-backend TRITON_ATTN
–enable-chunked-prefill
–enable-prefix-caching
–enforce-eager
–enable-auto-tool-choice --tool-call-parser hermes
–speculative-config ‘{“method”:“mtp”,“num_speculative_tokens”:3,“moe_backend”:“triton”}’
–reasoning-parser-plugin /app/super_v3_reasoning_parser.py
–reasoning-parser super_v3
–served-model-name nemotron-120b
–trust-remote-code --host 0.0.0.0 --port 8000

Env:
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

What works

What breaks — first request triggers CUDA illegal memory access

First POST /v1/chat/completions causes the engine to die:

(EngineCore pid=175) ERROR 04-14 12:39:46 [core.py:1112]
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress’ in CUDA docs for more information.
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.

The API server then logs EngineDeadError and every subsequent request returns 500.

What I already tried

  1. Without MTP (same image, same model, just remove --speculative-config and reasoning parser): works fine, ~15 tok/s — this is my current stable baseline.
  2. VLLM_TEST_FORCE_FP8_MARLIN=1 on/off: no effect on the CUDA crash.
  3. --enforce-eager: already set. Removing it surfaces an earlier init-time error:
    AttributeError: ‘MergedColumnParallelLinear’ object has no attribute ‘workspace’
  4. at fp8_linear.apply_weights(…) → workspace=layer.workspace (marlin FP8 kernel, modelopt_mixed quantization path). Adding --enforce-eager bypasses this specific path but
    then the runtime CUDA error shows up on the first actual request.
  5. --moe-backend triton (main model, unified backend): rejected — moe_backend=‘triton’ is not supported for NvFP4 MoE.
  6. --moe-backend flashinfer_trtllm: rejected — FLASHINFER_TRTLLM does not support the deployment configuration since kernel does not support current device cuda (expected on
    GB10 per the PSA threads).
  7. v0.19.0-cu130-ubuntu2404 (without MTP): same MergedColumnParallelLinear has no workspace at init, so this bug exists on the stable 0.19 release too, not just nightly.
  8. v0.17.1-cu130 (stable, no MTP support): this is my working baseline. Adding --speculative-config fails with Unexpected keyword argument ‘moe_backend’ inside
    SpeculativeConfig (as expected — the inner moe_backend override landed later).

Questions

  1. Has anyone successfully run Nemotron-3-Super-120B-A12B-NVFP4 with MTP on a single GB10? If yes, exact image tag + flags + env would help a lot.
  2. Is the MergedColumnParallelLinear.workspace missing attribute in modelopt_mixed + marlin FP8 linear kernel a known issue? I can file a vLLM GitHub issue if not — just want
    to avoid duplicates.
  3. Is the runtime illegal memory access likely the same root cause as #2 (workspace not allocated leading to OOB indexing), or a separate problem in the triton unquantized MoE
    path for the draft layer? TurboQuant thread (link below) mentions _next_pow2() padding for Mamba/GDN layers, wondering if the draft MTP path needs similar handling.
  4. If MTP is currently broken on GB10 for this model, is there a known-good alternative for getting above the ~15 tok/s single-stream baseline on a single Spark? I saw
    Qwen3.5-122B-A10B-NVFP4 hitting 38–51 tok/s with MTP elsewhere in this forum — is the situation purely “Qwen MTP works, Nemotron MTP doesn’t” right now?

Related threads I’ve already read

Happy to run any additional diagnostic flags (CUDA_LAUNCH_BLOCKING=1, TORCH_USE_CUDA_DSA, TORCHDYNAMO_VERBOSE=1, TORCH_LOGS=“+dynamo”) and post results. Full docker logs
available on request.

Thanks!
@bjk110 @Albond

chibri April 15, 2026, 2:35pm 2

GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks · GitHub will get you where you want to be.

See PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM - #205 by eugr

After spark-vllm-docker is installed and built, this should work unless something broke in a recent build:

./run-recipe.sh nemotron-3-super-nvfp4-flashinfer --solo --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'

Neill June 4, 2026, 8:34pm 3

Using Eugr’s spark-vllm-docker will work as a resolution, and additionally, the latest VLLM has specific fixes for this issue.