Jetson AGX Thor + vLLM (26.02): MoE performance significantly below reference — missing fused MoE config?

Hi AastaLLL,

Thank you for the guidance. I followed your recommendation to switch to the Jetson Thor container (Package vllm · GitHub) and repeated the full benchmark suite using a 3-run stabilization protocol: warmup plus 3 runs at concurrency 1 and 8 (C1 and C8), with input/output sequence lengths (ISL/OSL) of 2048/128. The results improved dramatically over my original NGC container runs.


VALIDATED RESULTS (R3 — stable run, Jetson Thor container)

Qwen3 30B-A3B (W4A16)

Llama 3.1 8B (W4A16)

Qwen3 32B (W4A16)

GPT OSS 120B (NVFP4, Thor-exclusive)

Key finding on benchmark methodology: R1 (the first run after warmup) consistently underperforms R2 and R3 by 12-23%, which I attribute to cold-start effects (KV cache empty, GPU not yet fully engaged). R2 and R3 stabilize and match or exceed published references. Single-run benchmarks on this platform will produce misleading results regardless of which container is used.
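For anyone who wants to check their own runs for the same effect, here is a minimal sketch of the comparison. The function name and the example throughputs are hypothetical placeholders, not measured values from this post; it only assumes you have one throughput number per run, in order.

```python
from statistics import mean


def cold_start_penalty(runs):
    """Return how far the first run falls below the mean of the later runs, in percent."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    first, later = runs[0], runs[1:]
    steady = mean(later)  # treat the later runs as the steady state
    return 100.0 * (steady - first) / steady


# Hypothetical throughputs in tok/s for R1, R2, R3 (illustrative only).
example_runs = [80.0, 100.0, 100.0]
print(f"R1 is {cold_start_penalty(example_runs):.1f}% below steady state")
```

A positive value means R1 was slower than the later runs; reporting the mean of R2 and R3 rather than R1 avoids folding the cold-start penalty into the headline number.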


QWEN3.5 35B-A3B NVFP4 CHECKPOINT ISSUE

During this process I identified two separate problems with the Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 checkpoint.

Issue 1 — Basic serve command runs but does not achieve published performance

The Jetson AI Lab command without MTP flags loads and serves without crashing:

vllm serve Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 \
  --gpu-memory-utilization 0.8 --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder

However, the performance does not match your published reference of 35 tok/s at C1 / 125 tok/s at C8.

Our R3 results with this checkpoint and command:

Server logs confirm bf16 fallback is occurring rather than true NVFP4 quantized inference. This is consistent with the weight namespace issue described below preventing correct NVFP4 loading.

Issue 2 — MTP speculative decoding causes hard crash

Adding the MTP flag as advertised on the Jetson AI Lab page:

--speculative-config '{"method": "mtp", "num_speculative_tokens": 4}'

Causes a hard crash:

ValueError: There is no module or parameter named 'language_model' in Qwen3_5MoeMTP.

Root cause: Inspection of the checkpoint's safetensors file shows all 123,973 weight keys are prefixed with 'language_model.' but vLLM's Qwen3_5MoeMTP loader expects them under the flat 'model.' namespace. The remap_weight_names() function in qwen3_5_mtp.py does not handle this prefix.
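The key inspection can be reproduced without loading any weights, since a .safetensors file begins with an 8-byte little-endian header length followed by a JSON header that lists every tensor name. The sketch below reads only that header; the function names are my own, and the path in the usage comment is a placeholder for wherever the checkpoint file sits locally.

```python
import json
import struct
from collections import Counter


def read_safetensors_keys(path):
    """Read tensor names from a .safetensors header without loading weights."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional non-tensor entry in the header.
    return [name for name in header if name != "__metadata__"]


def top_level_prefixes(keys):
    """Count keys by their first dot-separated component."""
    return Counter(key.split(".", 1)[0] for key in keys)


# Usage (placeholder path):
#   keys = read_safetensors_keys("model.safetensors")
#   print(len(keys), top_level_prefixes(keys).most_common(5))
```

If the checkpoint is sharded across several files, each shard has its own header, so the counts need to be summed over all shards.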

Additionally, the checkpoint contains Mamba/SSM-style weights (linear_attn, A_log, conv1d, dt_bias) that appear inconsistent with the Qwen3.5 MoE attention architecture, which suggests the safetensors file may be from a different model entirely.
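A quick way to see how widespread those weights are is to count the key names containing each marker substring. This is only a sketch: the sample keys below are hypothetical and made up for illustration, and in practice the list would come from the checkpoint's actual tensor names.

```python
from collections import Counter

# Substrings taken from the weight names mentioned above.
MARKERS = ("linear_attn", "A_log", "conv1d", "dt_bias")


def count_markers(keys, markers=MARKERS):
    """Count how many key names contain each marker substring."""
    counts = Counter()
    for key in keys:
        for marker in markers:
            if marker in key:
                counts[marker] += 1
    return counts


# Hypothetical key names, for illustration only.
sample_keys = [
    "language_model.model.layers.0.linear_attn.A_log",
    "language_model.model.layers.0.linear_attn.conv1d.weight",
    "language_model.model.layers.0.linear_attn.dt_bias",
    "language_model.model.layers.3.self_attn.q_proj.weight",
]
print(dict(count_markers(sample_keys)))
```

Comparing these counts against the same scan of the unquantized base model's keys would show whether the quantized checkpoint really differs in architecture or only in naming.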

The Jetson AI Lab page explicitly advertises MTP speculative decoding for this exact checkpoint and container combination, but it cannot be made to work with the published weights.

Workaround: The unquantized base model (Qwen/Qwen3.5-35B-A3B) serves successfully and MTP loads without crashing. This trades NVFP4 quantization speed for a working MTP configuration. Our R3 results with the base model:

I have prepared a detailed bug report documenting the weight key structure, full traceback, and verification commands. Happy to share the full report or attach it here if useful to the NVIDIA team. I am also prepared to file issues at the Kbenkhaled HuggingFace repository and the jetson-containers GitHub if that would help get this resolved.

Thank you again for pointing me to the correct container path — the performance difference was dramatic and the methodology findings should be useful to other Thor users.

WayNo

qwen35_mtp_bug_report.docx (11.3 KB)