Qwen3-Next Usage Guide (original) (raw)
Qwen3-Next is an advanced large language model created by the Qwen team from Alibaba Cloud. It features several key improvements:
- A hybrid attention mechanism
- A highly sparse Mixture-of-Experts (MoE) structure
- Training-stability-friendly optimizations
- A multi-token prediction mechanism for faster inference
Installing vLLM¶
[](#%5F%5Fcodelineno-0-1)uv venv [](#%5F%5Fcodelineno-0-2)source .venv/bin/activate [](#%5F%5Fcodelineno-0-3)uv pip install -U vllm --torch-backend auto
Launching Qwen3-Next with vLLM¶
You can use 4x H200/H20 or 4x A100/A800 GPUs to launch this model.
Basic Multi-GPU Setup¶
[](#%5F%5Fcodelineno-1-1)vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \ [](#%5F%5Fcodelineno-1-2) --tensor-parallel-size 4 \ [](#%5F%5Fcodelineno-1-3) --served-model-name qwen3-next \ [](#%5F%5Fcodelineno-1-4) --enable-prefix-caching
If you encounter torch.AcceleratorError: CUDA error: an illegal memory access was encountered, you can add --compilation_config.cudagraph_mode=PIECEWISE to the startup parameters to resolve this issue. This IMA error may occur in Data Parallel (DP) mode.
For FP8 model¶
For SM90/SM100 machines:
[](#%5F%5Fcodelineno-2-1)vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \ [](#%5F%5Fcodelineno-2-2) --tensor-parallel-size 4 \ [](#%5F%5Fcodelineno-2-3) --enable-prefix-caching
We can accelerate the performance on SM100 machines using the FP8 FlashInfer TRTLLM MoE kernel.
[](#%5F%5Fcodelineno-3-1)VLLM_USE_FLASHINFER_MOE_FP8=1 \ [](#%5F%5Fcodelineno-3-2)VLLM_FLASHINFER_MOE_BACKEND=latency \ [](#%5F%5Fcodelineno-3-3)VLLM_USE_DEEP_GEMM=0 \ [](#%5F%5Fcodelineno-3-4)VLLM_USE_TRTLLM_ATTENTION=0 \ [](#%5F%5Fcodelineno-3-5)VLLM_ATTENTION_BACKEND=FLASH_ATTN \ [](#%5F%5Fcodelineno-3-6)vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \ [](#%5F%5Fcodelineno-3-7)--tensor-parallel-size 4
Advanced Configuration with MTP¶
Qwen3-Next also supports Multi-Token Prediction (MTP in short), you can launch the model server with the following arguments to enable MTP.
[](#%5F%5Fcodelineno-4-1)vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \ [](#%5F%5Fcodelineno-4-2)--tokenizer-mode auto --gpu-memory-utilization 0.8 \ [](#%5F%5Fcodelineno-4-3)--speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}' \ [](#%5F%5Fcodelineno-4-4)--tensor-parallel-size 4 --no-enable-chunked-prefill
The speculative-config argument configures speculative decoding settings using a JSON format. The method "qwen3_next_mtp" specifies that the system should use Qwen3-Next's specialized multi-token prediction method. The "num_speculative_tokens": 2 setting means the model will speculate 2 tokens ahead during generation.
Performance Metrics¶
Benchmarking¶
We use the following script to demonstrate how to benchmark Qwen/Qwen3-Next-80B-A3B-Instruct.
[](#%5F%5Fcodelineno-5-1)vllm bench serve \ [](#%5F%5Fcodelineno-5-2) --backend vllm \ [](#%5F%5Fcodelineno-5-3) --model Qwen/Qwen3-Next-80B-A3B-Instruct \ [](#%5F%5Fcodelineno-5-4) --served-model-name qwen3-next \ [](#%5F%5Fcodelineno-5-5) --endpoint /v1/completions \ [](#%5F%5Fcodelineno-5-6) --dataset-name random \ [](#%5F%5Fcodelineno-5-7) --random-input 2048 \ [](#%5F%5Fcodelineno-5-8) --random-output 1024 \ [](#%5F%5Fcodelineno-5-9) --max-concurrency 10 \ [](#%5F%5Fcodelineno-5-10) --num-prompt 100
Usage Tips¶
Tune MoE kernel¶
When starting the model service, you may encounter the following warning in the server log(Suppose the GPU is NVIDIA_H20-3e):
[](#%5F%5Fcodelineno-6-1)(VllmWorker TP2 pid=47571) WARNING 09-09 15:47:25 [fused_moe.py:727] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/vllm_path/vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_H20-3e.json']
You can use benchmark_moe to perform MoE Triton kernel tuning for your hardware. Once tuning is complete, a JSON file with a name like E=512,N=128,device_name=NVIDIA_H20-3e.json will be generated. You can specify the directory containing this file for your deployment hardware using the environment variable VLLM_TUNED_CONFIG_FOLDER, like:
[](#%5F%5Fcodelineno-7-1)VLLM_TUNED_CONFIG_FOLDER=your_moe_tuned_dir vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \ [](#%5F%5Fcodelineno-7-2) --tensor-parallel-size 4 \ [](#%5F%5Fcodelineno-7-3) --served-model-name qwen3-next
You should see the following information printed in the server log. This indicates that the tuned MoE configuration has been loaded, which will improve the model service performance.
[](#%5F%5Fcodelineno-8-1)(VllmWorker TP2 pid=60498) INFO 09-09 16:23:07 [fused_moe.py:720] Using configuration from /your_moe_tuned_dir/E=512,N=128,device_name=NVIDIA_H20-3e.json for MoE layer.
Data Parallel Deployment¶
vLLM supports multi-parallel groups. You can refer to Data Parallel Deployment documentation and try parallel combinations that are more suitable for this model.
Function calling¶
vLLM also supports calling user-defined functions. Make sure to run your Qwen3-Next models with the following arguments.
[](#%5F%5Fcodelineno-9-1)vllm serve ... --tool-call-parser hermes --enable-auto-tool-choice
AMD GPU Support¶
Recommended approaches by hardware type are:
MI300X/MI325X/MI355X
Please follow the steps here to install and run Qwen3-Next models on AMD MI300X/MI325X/MI355X GPU.
Step 1: Installing vLLM (AMD ROCm Backend: MI300X, MI325X, MI355X)¶
Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the documentation.
[](#%5F%5Fcodelineno-10-1)uv venv [](#%5F%5Fcodelineno-10-2)source .venv/bin/activate [](#%5F%5Fcodelineno-10-3)uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.14.1/rocm700
Step 2: Start the vLLM server¶
Run the vllm online serving
[](#%5F%5Fcodelineno-11-1)SAFETENSORS_FAST_GPU=1 \ [](#%5F%5Fcodelineno-11-2)VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \ [](#%5F%5Fcodelineno-11-3)vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \ [](#%5F%5Fcodelineno-11-4)--tensor-parallel-size 4 \ [](#%5F%5Fcodelineno-11-5)--max-model-len 32768 \ [](#%5F%5Fcodelineno-11-6)--no-enable-prefix-caching \ [](#%5F%5Fcodelineno-11-7)--trust-remote-code
Step 3: Run Benchmark¶
Open a new terminal and run the following command to execute the benchmark script inside the container.
[](#%5F%5Fcodelineno-12-1) vllm bench serve \ [](#%5F%5Fcodelineno-12-2) --model "Qwen/Qwen3-Next-80B-A3B-Instruct" \ [](#%5F%5Fcodelineno-12-3) --dataset-name random \ [](#%5F%5Fcodelineno-12-4) --random-input-len 8192 \ [](#%5F%5Fcodelineno-12-5) --random-output-len 1024 \ [](#%5F%5Fcodelineno-12-6) --request-rate 10000 \ [](#%5F%5Fcodelineno-12-7) --num-prompts 16 \ [](#%5F%5Fcodelineno-12-8) --ignore-eos \ [](#%5F%5Fcodelineno-12-9) --trust-remote-code