Speed Benchmark

We report the speed performance of the Qwen3 series in BF16 and in quantized formats (FP8, GPTQ, AWQ). Specifically, we report the inference speed (tokens/s) under different context lengths and, for the Hugging Face Transformers backend, the GPU memory footprint (MB).
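For reference, a minimal sketch of how the Transformers-side metrics (decode speed and peak GPU memory) can be measured is shown below. This is not the official benchmark script: the checkpoint, context length, and number of generated tokens are placeholders, and the exact decoding protocol used for the tables may differ.

```python
# Minimal sketch (not the official benchmark script): time generation and
# record peak GPU memory for a Transformers model on a single CUDA GPU.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-0.6B"   # any benchmarked checkpoint
context_length = 6144          # one of the input lengths in the tables below
max_new_tokens = 2048          # hypothetical decode length

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Dummy prompt of the desired length (token content does not matter for speed).
input_ids = torch.randint(
    0, tokenizer.vocab_size, (1, context_length), device="cuda"
)
attention_mask = torch.ones_like(input_ids)

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
output = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=max_new_tokens,
    do_sample=False,
)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

generated = output.shape[1] - input_ids.shape[1]
print(f"speed: {generated / elapsed:.2f} tokens/s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**2:.0f} MB")
```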

Results

Qwen3-0.6B (SGLang)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-0.6B | 1 | BF16 | 1 | 414.17 | |
| | | FP8 | 1 | 458.03 | |
| | | GPTQ-Int8 | 1 | 344.92 | |
| | 6144 | BF16 | 1 | 1426.46 | |
| | | FP8 | 1 | 1572.95 | |
| | | GPTQ-Int8 | 1 | 1234.29 | |
| | 14336 | BF16 | 1 | 2478.02 | |
| | | FP8 | 1 | 2689.08 | |
| | | GPTQ-Int8 | 1 | 2198.82 | |
| | 30720 | BF16 | 1 | 3577.42 | |
| | | FP8 | 1 | 3819.86 | |
| | | GPTQ-Int8 | 1 | 3342.06 | |

Qwen3-0.6B (Transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-0.6B | 1 | BF16 | 1 | 58.57 | 1394 |
| | | FP8 | 1 | 24.60 | 1217 |
| | | GPTQ-Int8 | 1 | 26.56 | 986 |
| | 6144 | BF16 | 1 | 154.82 | 2066 |
| | | FP8 | 1 | 73.96 | 1943 |
| | | GPTQ-Int8 | 1 | 93.84 | 1658 |
| | 14336 | BF16 | 1 | 168.48 | 2963 |
| | | FP8 | 1 | 104.99 | 2839 |
| | | GPTQ-Int8 | 1 | 219.61 | 2554 |
| | 30720 | BF16 | 1 | 175.93 | 4755 |
| | | FP8 | 1 | 132.78 | 4632 |
| | | GPTQ-Int8 | 1 | 345.71 | 4347 |

Qwen3-1.7B (SGLang)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-1.7B | 1 | BF16 | 1 | 227.80 | |
| | | FP8 | 1 | 333.90 | |
| | | GPTQ-Int8 | 1 | 257.40 | |
| | 6144 | BF16 | 1 | 838.28 | |
| | | FP8 | 1 | 1198.20 | |
| | | GPTQ-Int8 | 1 | 945.91 | |
| | 14336 | BF16 | 1 | 1525.71 | |
| | | FP8 | 1 | 2095.61 | |
| | | GPTQ-Int8 | 1 | 1707.63 | |
| | 30720 | BF16 | 1 | 2439.03 | |
| | | FP8 | 1 | 3165.32 | |
| | | GPTQ-Int8 | 1 | 2706.16 | |

Qwen3-1.7B (Transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-1.7B | 1 | BF16 | 1 | 59.83 | 3412 |
| | | FP8 | 1 | 23.83 | 2726 |
| | | GPTQ-Int8 | 1 | 28.06 | 2229 |
| | 6144 | BF16 | 1 | 238.53 | 4213 |
| | | FP8 | 1 | 90.87 | 3462 |
| | | GPTQ-Int8 | 1 | 110.82 | 2901 |
| | 14336 | BF16 | 1 | 352.59 | 5109 |
| | | FP8 | 1 | 153.37 | 4359 |
| | | GPTQ-Int8 | 1 | 222.78 | 3798 |
| | 30720 | BF16 | 1 | 418.13 | 6902 |
| | | FP8 | 1 | 235.61 | 6151 |
| | | GPTQ-Int8 | 1 | 386.85 | 5590 |

Qwen3-4B (SGLang)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-4B | 1 | BF16 | 1 | 133.13 | |
| | | FP8 | 1 | 200.61 | |
| | | AWQ-INT4 | 1 | 199.71 | |
| | 6144 | BF16 | 1 | 466.19 | |
| | | FP8 | 1 | 662.26 | |
| | | AWQ-INT4 | 1 | 640.07 | |
| | 14336 | BF16 | 1 | 789.25 | |
| | | FP8 | 1 | 1066.23 | |
| | | AWQ-INT4 | 1 | 1006.23 | |
| | 30720 | BF16 | 1 | 1165.75 | |
| | | FP8 | 1 | 1467.71 | |
| | | AWQ-INT4 | 1 | 1358.84 | |
| | 63488 | BF16 | 1 | 1423.98 | |
| | | FP8 | 1 | 1660.67 | |
| | | AWQ-INT4 | 1 | 1513.97 | |
| | 129042 | BF16 | 1 | 1371.04 | |
| | | FP8 | 1 | 1497.27 | |
| | | AWQ-INT4 | 1 | 1375.71 | |

Qwen3-4B (Transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-4B | 1 | BF16 | 1 | 45.94 | 7973 |
| | | FP8 | 1 | 17.33 | 5281 |
| | | AWQ-INT4 | 1 | 51.57 | 2915 |
| | 6144 | BF16 | 1 | 159.95 | 8860 |
| | | FP8 | 1 | 60.55 | 6144 |
| | | AWQ-INT4 | 1 | 183.04 | 3881 |
| | 14336 | BF16 | 1 | 195.31 | 10012 |
| | | FP8 | 1 | 96.81 | 7297 |
| | | AWQ-INT4 | 1 | 265.22 | 5151 |
| | 30720 | BF16 | 1 | 217.97 | 12317 |
| | | FP8 | 1 | 138.84 | 9611 |
| | | AWQ-INT4 | 1 | 481.69 | 7742 |

Qwen3-8B (SGLang)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-8B | 1 | BF16 | 1 | 81.73 | |
| | | FP8 | 1 | 150.25 | |
| | | AWQ-INT4 | 1 | 144.11 | |
| | 6144 | BF16 | 1 | 296.25 | |
| | | FP8 | 1 | 516.64 | |
| | | AWQ-INT4 | 1 | 477.89 | |
| | 14336 | BF16 | 1 | 524.70 | |
| | | FP8 | 1 | 859.92 | |
| | | AWQ-INT4 | 1 | 770.44 | |
| | 30720 | BF16 | 1 | 832.67 | |
| | | FP8 | 1 | 1242.24 | |
| | | AWQ-INT4 | 1 | 1075.91 | |
| | 63488 | BF16 | 1 | 1112.78 | |
| | | FP8 | 1 | 1476.46 | |
| | | AWQ-INT4 | 1 | 1254.91 | |
| | 129042 | BF16 | 1 | 1173.32 | |
| | | FP8 | 1 | 1393.21 | |
| | | AWQ-INT4 | 1 | 1198.06 | |

Qwen3-8B (Transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-8B | 1 | BF16 | 1 | 45.32 | 15947 |
| | | FP8 | 1 | 15.46 | 9323 |
| | | AWQ-INT4 | 1 | 51.33 | 6177 |
| | 6144 | BF16 | 1 | 146.12 | 16811 |
| | | FP8 | 1 | 55.07 | 10187 |
| | | AWQ-INT4 | 1 | 163.23 | 7113 |
| | 14336 | BF16 | 1 | 183.29 | 17963 |
| | | FP8 | 1 | 89.64 | 11340 |
| | | AWQ-INT4 | 1 | 242.97 | 8409 |
| | 30720 | BF16 | 1 | 208.98 | 20267 |
| | | FP8 | 1 | 130.93 | 13644 |
| | | AWQ-INT4 | 1 | 438.62 | 11001 |

Qwen3-14B (SGLang)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-14B | 1 | BF16 | 1 | 47.10 | |
| | | FP8 | 1 | 97.11 | |
| | | AWQ-INT4 | 1 | 96.49 | |
| | 6144 | BF16 | 1 | 174.85 | |
| | | FP8 | 1 | 342.95 | |
| | | AWQ-INT4 | 1 | 321.62 | |
| | 14336 | BF16 | 1 | 317.56 | |
| | | FP8 | 1 | 587.33 | |
| | | AWQ-INT4 | 1 | 525.74 | |
| | 30720 | BF16 | 1 | 525.80 | |
| | | FP8 | 1 | 880.72 | |
| | | AWQ-INT4 | 1 | 744.74 | |
| | 63488 | BF16 | 1 | 742.36 | |
| | | FP8 | 1 | 1089.04 | |
| | | AWQ-INT4 | 1 | 884.06 | |
| | 129042 | BF16 | 1 | 826.15 | |
| | | FP8 | 1 | 1049.64 | |
| | | AWQ-INT4 | 1 | 857.56 | |

Qwen3-14B (Transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-14B | 1 | BF16 | 1 | 40.66 | 28402 |
| | | FP8 | 1 | 13.02 | 16012 |
| | | AWQ-INT4 | 1 | 44.67 | 9962 |
| | 6144 | BF16 | 1 | 108.52 | 29495 |
| | | FP8 | 1 | 44.86 | 16972 |
| | | AWQ-INT4 | 1 | 128.08 | 11020 |
| | 14336 | BF16 | 1 | 136.36 | 30775 |
| | | FP8 | 1 | 71.96 | 18253 |
| | | AWQ-INT4 | 1 | 220.62 | 12438 |
| | 30720 | BF16 | 1 | 155.38 | 33336 |
| | | FP8 | 1 | 102.63 | 20813 |
| | | AWQ-INT4 | 1 | 363.25 | 15323 |

Qwen3-32B (SGLang)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-32B | 1 | BF16 | 1 | 20.72 | |
| | | FP8 | 1 | 46.17 | |
| | | AWQ-INT4 | 1 | 47.67 | |
| | 6144 | BF16 | 1 | 77.82 | |
| | | FP8 | 1 | 165.71 | |
| | | AWQ-INT4 | 1 | 159.99 | |
| | 14336 | BF16 | 1 | 143.08 | |
| | | FP8 | 1 | 287.60 | |
| | | AWQ-INT4 | 1 | 260.44 | |
| | 30720 | BF16 | 1 | 240.75 | |
| | | FP8 | 1 | 436.59 | |
| | | AWQ-INT4 | 1 | 366.84 | |
| | 63488 | BF16 | 1 | 342.96 | |
| | | FP8 | 1 | 532.18 | |
| | | AWQ-INT4 | 1 | 425.23 | |
| | 129042 | BF16 | 2 | 711.40 | TP=2 |
| | | FP8 | 1 | 491.45 | |
| | | AWQ-INT4 | 1 | 395.96 | |

Qwen3-32B (Transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-32B | 1 | BF16 | 1 | 26.24 | 62751 |
| | | FP8 | 1 | 7.37 | 33379 |
| | | AWQ-INT4 | 1 | 41.8 | 19109 |
| | 6144 | BF16 | 1 | 51.41 | 64583 |
| | | FP8 | 1 | 23.57 | 34915 |
| | | AWQ-INT4 | 1 | 68.71 | 20795 |
| | 14336 | BF16 | 1 | 62.41 | 66632 |
| | | FP8 | 1 | 36.30 | 36963 |
| | | AWQ-INT4 | 1 | 107.02 | 23105 |
| | 30720 | BF16 | 1 | 69.16 | 70728 |
| | | FP8 | 1 | 49.44 | 41060 |
| | | AWQ-INT4 | 1 | 188.11 | 27718 |

Qwen3-30B-A3B (SGLang)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1 | BF16 | 1 | 137.18 | |
| | | FP8 | 1 | 155.55 | |
| | | GPTQ-INT4 | 1 | 31.29 | GPTQ-Marlin |
| | 6144 | BF16 | 1 | 490.10 | |
| | | FP8 | 1 | 551.34 | |
| | | GPTQ-INT4 | 1 | 120.13 | GPTQ-Marlin |
| | 14336 | BF16 | 1 | 849.62 | |
| | | FP8 | 1 | 945.13 | |
| | | GPTQ-INT4 | 1 | 227.27 | GPTQ-Marlin |
| | 30720 | BF16 | 1 | 1283.94 | |
| | | FP8 | 1 | 1405.91 | |
| | | GPTQ-INT4 | 1 | 404.45 | GPTQ-Marlin |
| | 63488 | BF16 | 1 | 1538.79 | |
| | | FP8 | 1 | 1647.89 | |
| | | GPTQ-INT4 | 1 | 617.09 | GPTQ-Marlin |
| | 129042 | BF16 | 1 | 1385.65 | |
| | | FP8 | 1 | 1442.14 | |
| | | GPTQ-INT4 | 1 | 704.82 | GPTQ-Marlin |

Qwen3-30B-A3B (Transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) | Note |
|---|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1 | BF16 | 1 | 1.89 | 58462 | |
| | | FP8 | 1 | 0.44 | 30296 | |
| | | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| | 6144 | BF16 | 1 | 7.45 | 59037 | |
| | | FP8 | 1 | 1.77 | 30872 | |
| | | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| | 14336 | BF16 | 1 | 14.47 | 59806 | |
| | | FP8 | 1 | 3.5 | 31641 | |
| | | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| | 30720 | BF16 | 1 | 27.03 | 61342 | |
| | | FP8 | 1 | 6.86 | 33177 | |
| | | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |

Qwen3-235B-A22B (SGLang)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-235B-A22B | 1 | BF16 | 8 | 74.50 | TP=8 |
| | | FP8 | 4 | 71.65 | TP=4 |
| | | GPTQ-INT4 | 4 | 14.69 | TP=4, GPTQ-Marlin |
| | 6144 | BF16 | 8 | 289.03 | TP=8 |
| | | FP8 | 4 | 275.16 | TP=4 |
| | | GPTQ-INT4 | 4 | 56.97 | TP=4, GPTQ-Marlin |
| | 14336 | BF16 | 8 | 546.73 | TP=8 |
| | | FP8 | 4 | 514.23 | TP=4 |
| | | GPTQ-INT4 | 4 | 109.13 | TP=4, GPTQ-Marlin |
| | 30720 | BF16 | 8 | 979.41 | TP=8 |
| | | FP8 | 4 | 887.90 | TP=4 |
| | | GPTQ-INT4 | 4 | 198.99 | TP=4, GPTQ-Marlin |
| | 63488 | BF16 | 8 | 1493.91 | TP=8 |
| | | FP8 | 4 | 1269.34 | TP=4 |
| | | GPTQ-INT4 | 4 | 422.77 | TP=4, GPTQ-Marlin |
| | 129042 | BF16 | 8 | 1639.54 | TP=8 |
| | | FP8 | 4 | 1319.66 | TP=4 |
| | | GPTQ-INT4 | 4 | 552.28 | TP=4, GPTQ-Marlin |
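
The multi-GPU rows above (TP=2/4/8) correspond to tensor-parallel serving. A minimal sketch of running such a configuration through SGLang's offline Engine API is shown below; it is not the benchmark harness used for these numbers, the checkpoint and prompt are placeholders, and argument names such as `tp_size` follow SGLang's server arguments and may differ across versions.

```python
# Minimal sketch (not the official benchmark harness): run a Qwen3 checkpoint
# with tensor parallelism via SGLang's offline Engine API.
import sglang as sgl

llm = sgl.Engine(
    model_path="Qwen/Qwen3-235B-A22B-FP8",  # placeholder checkpoint
    tp_size=4,                              # matches the TP=4 rows above
)

prompts = ["Give me a short introduction to large language models."]
sampling_params = {"temperature": 0.0, "max_new_tokens": 128}

outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])

llm.shutdown()
```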