Qwen3.6-35B-A3B-NVFP4 hangs after attention backend selection across 3 vLLM images, including NVIDIA's own official recipe

Update
vLLM DGX Spark Debug Log

System: NVIDIA GB10 (SM_12.1), Driver 580.159.03, CUDA 13.0.2, DGX OS 7 (Ubuntu 24.04), aarch64
RAM: 121GB unified (GPU+CPU shared pool) | Disk: 3.7TB
Session: 2026-06-14
Status: ✅ FULLY RESOLVED — Qwen3.6-35B-A3B-NVFP4 serving real inference on port 8000

Forum Update Post — Final Resolution (2026-06-14)

Thread: "Qwen3.6-35B-A3B-NVFP4 hangs after attention backend selection across 3 vLLM images, including NVIDIA's own official recipe"

Update: Qwen3.6-35B-A3B-NVFP4 serving real inference. Full resolution below.

Summary of Issues Found

Issue 1 (FlashInfer JIT hang): Any Docker image where VLLM_HAS_FLASHINFER_CUBIN is False deadlocks silently when FlashInfer is selected as a backend. The images I originally tested (eugr/spark-vllm-docker at the time, vllm/vllm-openai:nightly-aarch64, and ghcr.io/bjk110/vllm-spark) ship no pre-compiled FlashInfer cubins for SM12.1 (GB10), so FlashInfer falls back to JIT compilation, which deadlocks on SM12.1. Fix in old images: force Triton backends via --attention-config '{"backend": "TRITON_ATTN"}' and set VLLM_USE_FLASHINFER_SAMPLER=0.

Issue 2 (NVFP4 weight loader KeyError): On the bjk110 image (ghcr.io/bjk110/vllm-spark), serving nvidia/Qwen3.6-35B-A3B-NVFP4 crashed with KeyError: 'layers.0.mlp.experts.w2_input_scale'. The checkpoint stores its tensors as model.language_model.layers.X.mlp.experts.Y.{gate_proj,up_proj,down_proj}.{weight,input_scale,weight_scale,weight_scale_2} (VLM-wrapped, per-expert per-projection). The vLLM weight loader in the bjk110 image expected a different, fused key format.
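
For anyone scripting the Issue 1 workaround, here is a minimal Python sketch that turns the cubin check into the extra settings described above. The helper names (flashinfer_has_cubins, triton_fallback) are hypothetical, and treating an unreadable flag as "no cubins" is my own conservative assumption. The flag and env var values are the ones from this thread.

```python
import json
from typing import Optional


def flashinfer_has_cubins() -> Optional[bool]:
    """Read VLLM_HAS_FLASHINFER_CUBIN, or return None if it cannot be read."""
    try:
        import vllm.envs as envs  # only importable inside a vLLM image
    except ImportError:
        return None
    return getattr(envs, "VLLM_HAS_FLASHINFER_CUBIN", None)


def triton_fallback(has_cubins: Optional[bool]) -> dict:
    """Extra env vars and CLI args to use when FlashInfer would have to JIT."""
    if has_cubins:
        # Precompiled cubins are present, so no workaround is needed.
        return {"env": {}, "args": []}
    # False or unknown: force the Triton attention backend and disable the
    # FlashInfer sampler, as described for the old images in this thread.
    return {
        "env": {"VLLM_USE_FLASHINFER_SAMPLER": "0"},
        "args": ["--attention-config", json.dumps({"backend": "TRITON_ATTN"})],
    }
```

Inside the container, triton_fallback(flashinfer_has_cubins()) gives the settings to append to the docker run / vllm serve command. The accepted flag spelling depends on the vLLM version, so check vllm serve --help in your image.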

Final Working Solution

Build the eugr image with today’s (2026-06-14) prebuilt wheels, which include:

vLLM 0.22.1rc1.dev511+gc621af169.d20260614 — fixes the NVFP4 weight loader
FlashInfer 0.6.13 with pre-compiled SM12.1 cubins — no more JIT hang, can use FlashInfer natively

git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
./build-and-copy.sh --tf5

--tf5 is required: the model’s config declares transformers_version: 5.7.0.dev0.
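
To check whether another checkpoint needs the same flag, a small Python sketch like the one below can read transformers_version from the model's config.json. It assumes the field sits at the top level of config.json, and the helper names (load_config, needs_tf5) are made up for illustration.

```python
import json
from pathlib import Path


def load_config(model_dir: str) -> dict:
    """Load config.json from a local model snapshot directory."""
    return json.loads((Path(model_dir) / "config.json").read_text())


def needs_tf5(config: dict) -> bool:
    """True if the config was written by transformers with major version >= 5."""
    version = config.get("transformers_version", "")
    major = version.split(".")[0]
    # A missing or malformed version string is treated as "not required".
    return major.isdigit() and int(major) >= 5
```

For this model, needs_tf5({"transformers_version": "5.7.0.dev0"}) is True, which matches the need for --tf5.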

Build takes ~4 minutes (downloads prebuilt wheels, no compilation).

Then serve:

docker run -d --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --name vllm_qwen36 \
  -e HF_TOKEN=<your_hf_token> \
  -e VLLM_USE_FLASHINFER_SAMPLER=0 \
  -e VLLM_LOGGING_LEVEL=INFO \
  vllm-node-tf5:latest \
  vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 1 \
    --trust-remote-code \
    --gpu-memory-utilization 0.85 \
    --max-model-len 32768 \
    --enforce-eager \
    --kv-cache-dtype fp8 \
    --attention-backend FLASHINFER

Verified

$ curl http://localhost:8000/v1/models
{"object":"list","data":[{"id":"nvidia/Qwen3.6-35B-A3B-NVFP4",...,"max_model_len":32768}]}

$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"nvidia/Qwen3.6-35B-A3B-NVFP4",
       "messages":[{"role":"user","content":"Say hello in one word"}],
       "max_tokens":10}'
# → {"choices":[{"message":{"content":"Thinking Process:\n1.  **Analyze"...}}],"system_fingerprint":"vllm-0.22.1rc1.dev511+gc621af169.d20260614-..."}

Real tokens, no hang, system_fingerprint confirms vLLM version.
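
The same check can be scripted. Below is a rough Python equivalent of the curl call using only the standard library. The function names are hypothetical, and the response fields it reads (choices[0].message.content, system_fingerprint) are the ones visible in the output above.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=10):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response: dict):
    """Return (generated text, system_fingerprint) from a chat response."""
    content = response["choices"][0]["message"]["content"]
    return content, response.get("system_fingerprint")


def main():
    # Assumes the server from the docker run command above is listening locally.
    req = build_chat_request(
        "http://localhost:8000",
        "nvidia/Qwen3.6-35B-A3B-NVFP4",
        "Say hello in one word",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        print(summarize(json.load(resp)))
```

Call main() once the container logs show the server is up. A timeout here, rather than an HTTP error, is a hint that the engine is hung and not misconfigured.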

Key Diagnostics for Others

python3 -c "import vllm.envs as e; print(e.VLLM_HAS_FLASHINFER_CUBIN)" — if False, your image needs either precompiled cubins or Triton fallback.
If hitting KeyError: 'layers.X.mlp.experts.w2_input_scale' loading NVFP4: use today’s eugr prebuilt-vllm-current wheel (0.22.1rc1.dev511+). The bjk110 image’s weight loader didn’t handle the VLM-wrapped per-expert per-projection scale format.
Corrupt HF cache: blobs with .incomplete suffix = partial downloads. Delete and re-download with HF_TOKEN (nvidia/* models are gated).
VLLM_ATTENTION_BACKEND env var: not present in vllm.envs in the vLLM v0.21.0 image I tested, and silently ignored if set. Override via --attention-config or --attention-backend depending on version.
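
For the corrupt-cache check, here is a small Python sketch that lists (and optionally deletes) .incomplete blobs. It is only an illustration: the function names are hypothetical, and the cache location you pass in is whatever you mount into the container (~/.cache/huggingface in the docker run command above).

```python
from pathlib import Path


def find_incomplete_blobs(cache_dir):
    """Return every *.incomplete file found under the given cache directory."""
    root = Path(cache_dir).expanduser()
    return sorted(p for p in root.rglob("*.incomplete") if p.is_file())


def delete_incomplete_blobs(cache_dir, dry_run=True):
    """List partial downloads and delete them only when dry_run is False."""
    blobs = find_incomplete_blobs(cache_dir)
    if not dry_run:
        for blob in blobs:
            blob.unlink()
    return blobs
```

Run delete_incomplete_blobs("~/.cache/huggingface") first to see what would be removed, then pass dry_run=False and re-download with HF_TOKEN set.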

Full Debug Timeline

Root Cause — CONFIRMED

FlashInfer has no pre-compiled cubins for SM12.1 (VLLM_HAS_FLASHINFER_CUBIN = False).
When any FlashInfer component is selected as a backend, it falls back to JIT CUDA kernel compilation.
JIT compilation on SM12.1 deadlocks silently. The process allocates GPU memory, then hangs with 40–160+ sleeping threads, wchan=futex_do_wait, zero GPU utilization, zero disk IO.
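
To confirm the same hang signature on your own box, the per-thread wait channels can be read from /proc. This is a minimal Python sketch, assuming a Linux kernel that exposes /proc/<pid>/task/<tid>/wchan. The function name and the proc_root parameter are hypothetical conveniences.

```python
from collections import Counter
from pathlib import Path


def wchan_histogram(pid, proc_root="/proc"):
    """Count the kernel wait channel (wchan) of every thread of a process."""
    counts = Counter()
    for task in (Path(proc_root) / str(pid) / "task").iterdir():
        try:
            # An empty file is counted as "0", i.e. not blocked in the kernel.
            wchan = (task / "wchan").read_text().strip() or "0"
        except OSError:
            # The thread exited meanwhile, or we lack permission to read it.
            continue
        counts[wchan] += 1
    return counts
```

Pass the host PID of the hung EngineCore process. A result dominated by futex_do_wait, combined with zero GPU utilization, matches the deadlock described above.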

The hang appears immediately after Using FLASHINFER attention backend because attention backend init triggers the JIT path first. FlashInfer is also selected for MoE (FlashInfer CUTLASS) and FP8 linear (FlashInferFP8ScaledMMLinearKernel) — those would also hang if reached.

Secondary cause: All HuggingFace model caches were corrupt (incomplete downloads, all .incomplete blobs). This was masked because FlashInfer was hanging before weight loading was ever attempted.

Attempt 1 — Headless baseline + NCCL_DEBUG + VLLM_LOGGING_LEVEL=DEBUG

Model: nvidia/Qwen3.6-35B-A3B-NVFP4
Result: ❌ HUNG — headless made no difference.
Last log line: Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
Key debug finding:

DEBUG: {FLASH_ATTN: [kv_cache_dtype not supported], FLEX_ATTENTION: [kv_cache_dtype not supported],
        TURBOQUANT: [kv_cache_dtype not supported]}

--kv-cache-dtype fp8 eliminates all other backends → FlashInfer selected → JIT hang.

Attempt 2 — VLLM_ATTENTION_BACKEND=TRITON_ATTN (wrong approach)

Result: ❌ HUNG — VLLM_ATTENTION_BACKEND does not exist in vLLM v0.21.0. Silently ignored.
Finding: Inspected vllm.envs module — no such variable. Override must be via --attention-config CLI arg.

Attempt 3 — --attention-config + --kernel-config (correct approach)

--attention-config '{"backend": "TRITON_ATTN"}' --kernel-config '{"moe_backend": "triton"}'
-e VLLM_USE_FLASHINFER_SAMPLER=0

Result: ✅ PAST HANG — both backends switched to Triton, weight loading began.
Then stalled: EngineCore had .incomplete HF blob files open, i.e. it was attempting an unauthenticated download from a gated model.
Discovery: All 4 cached models had fully incomplete weight files. Cache cleaned with find ... -name "*.incomplete" -delete.

Attempt 4 — TRITON_ATTN + --load-format dummy (stack validation)

Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 with --load-format dummy
Result: ✅ Server up, port 8000 responding, curl /v1/models → 200 OK.
Confirmed vLLM stack is fully functional with Triton backends.

Attempt 5 — Real weights: Nemotron FP8 (31GB, authenticated download)

Downloaded nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 via container with HF_TOKEN.
Result: ✅ FULL INFERENCE WORKING

INFO Application startup complete.
curl /v1/chat/completions → real tokens returned

Attempt 6 — Qwen3.6-35B-A3B-NVFP4 on bjk110 image (KeyError)

Image: ghcr.io/bjk110/vllm-spark:v022-d568
Config: --attention-config '{"backend": "TRITON_ATTN"}', kv_cache_dtype: fp8
Result: ❌ KeyError: 'layers.0.mlp.experts.w2_input_scale' at qwen3_5.py:393
Root cause: bjk110 image vLLM weight loader expected fused key format; NVFP4 checkpoint uses VLM-wrapped per-expert per-projection scales.
Checkpoint structure: model.language_model.layers.X.mlp.experts.Y.{gate_proj,up_proj,down_proj}.{weight,input_scale,weight_scale,weight_scale_2} — 3 shards (10GB + 10GB + 3GB), architecture Qwen3_5MoeForConditionalGeneration.
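
To see which key format a checkpoint uses before pointing vLLM at it, the tensor names can be scanned with a regex. The Python sketch below assumes you have the names as a list, for example the keys of weight_map in a model.safetensors.index.json (I am assuming this checkpoint ships such an index). The function name and the returned dict layout are made up for illustration.

```python
import re

# Per-expert, per-projection naming as seen in the NVFP4 checkpoint.
PER_EXPERT = re.compile(
    r"layers\.(\d+)\.mlp\.experts\.(\d+)\."
    r"(gate_proj|up_proj|down_proj)\."
    r"(weight|input_scale|weight_scale|weight_scale_2)$"
)


def summarize_expert_keys(keys):
    """Split expert tensor names into per-expert keys and everything else."""
    layers, experts, other = set(), set(), []
    matched = 0
    for key in keys:
        m = PER_EXPERT.search(key)
        if m:
            matched += 1
            layers.add(int(m.group(1)))
            experts.add(int(m.group(2)))
        elif ".mlp.experts." in key:
            # Expert tensor in some other layout, e.g. a fused w2_* key.
            other.append(key)
    return {
        "per_expert": matched,
        "layers": sorted(layers),
        "experts": sorted(experts),
        "other_expert_keys": other,
    }
```

If per_expert is non-zero and other_expert_keys is empty, the checkpoint uses the layout shown above, and the loader in your image has to understand it.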

Attempt 7 — eugr vllm-node-tf5 image (FINAL — SUCCESS)

Trigger: eugr_nv (NVIDIA moderator) pushed prebuilt-vllm-current release 2026-06-14 with vLLM 0.22.1rc1.dev511.

Image built:

git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
./build-and-copy.sh --tf5   # 4 min, downloads prebuilt wheels
# → vllm-node-tf5:latest

Config (vllm_qwen36_config.yaml):

model: nvidia/Qwen3.6-35B-A3B-NVFP4
host: 0.0.0.0
port: 8000
tensor_parallel_size: 1
trust_remote_code: true
gpu_memory_utilization: 0.85
max_model_len: 32768
enforce_eager: true
kv_cache_dtype: fp8
attention_backend: FLASHINFER

Result: ✅ FULL INFERENCE WORKING

curl http://localhost:8000/v1/models
→ {"data":[{"id":"nvidia/Qwen3.6-35B-A3B-NVFP4","max_model_len":32768,...}]}

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"nvidia/Qwen3.6-35B-A3B-NVFP4","messages":[{"role":"user","content":"Say hello in one word"}],"max_tokens":10}'
→ real tokens, system_fingerprint: vllm-0.22.1rc1.dev511+gc621af169.d20260614

Why it worked: New vLLM 0.22.1rc1 fixes the weight loader for Qwen3_5MoeForConditionalGeneration (NVFP4 VLM format). FlashInfer 0.6.13 includes pre-compiled SM12.1 cubins — no more JIT hang, attention_backend: FLASHINFER works natively.

Thank you @eugr_nv for the quick turnaround.