Speed Benchmark - Qwen

We report the speed performance of the bfloat16 models and the quantized models (including FP8, GPTQ, and AWQ) of the Qwen3 series. Specifically, we report the inference speed (tokens/s) and the GPU memory footprint (MB) under different context lengths.
Results

Qwen3-0.6B (SGLang)
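
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-0.6B | 1 | BF16 | 1 | 414.17 | |
| Qwen3-0.6B | 1 | FP8 | 1 | 458.03 | |
| Qwen3-0.6B | 1 | GPTQ-Int8 | 1 | 344.92 | |
| Qwen3-0.6B | 6144 | BF16 | 1 | 1426.46 | |
| Qwen3-0.6B | 6144 | FP8 | 1 | 1572.95 | |
| Qwen3-0.6B | 6144 | GPTQ-Int8 | 1 | 1234.29 | |
| Qwen3-0.6B | 14336 | BF16 | 1 | 2478.02 | |
| Qwen3-0.6B | 14336 | FP8 | 1 | 2689.08 | |
| Qwen3-0.6B | 14336 | GPTQ-Int8 | 1 | 2198.82 | |
| Qwen3-0.6B | 30720 | BF16 | 1 | 3577.42 | |
| Qwen3-0.6B | 30720 | FP8 | 1 | 3819.86 | |
| Qwen3-0.6B | 30720 | GPTQ-Int8 | 1 | 3342.06 | |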
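
To make the quantization comparison easier to read, the speeds in the Qwen3-0.6B (SGLang) table can be normalized against BF16 at each input length. The following is a minimal sketch, not part of the benchmark itself: the numbers are copied from the table, while the dictionary layout and the helper name `relative_to_bf16` are illustrative assumptions.

```python
# Speeds (tokens/s) copied from the Qwen3-0.6B (SGLang) table.
# Layout: {input_length: {quantization: speed}} (an illustrative choice).
speeds = {
    1: {"BF16": 414.17, "FP8": 458.03, "GPTQ-Int8": 344.92},
    6144: {"BF16": 1426.46, "FP8": 1572.95, "GPTQ-Int8": 1234.29},
    14336: {"BF16": 2478.02, "FP8": 2689.08, "GPTQ-Int8": 2198.82},
    30720: {"BF16": 3577.42, "FP8": 3819.86, "GPTQ-Int8": 3342.06},
}


def relative_to_bf16(row):
    """Return each quantization's speed divided by the BF16 speed of the same row."""
    base = row["BF16"]
    return {quant: speed / base for quant, speed in row.items()}


# Hypothetical helper applied to every input length.
ratios = {length: relative_to_bf16(row) for length, row in speeds.items()}

for length, row in ratios.items():
    formatted = ", ".join(f"{quant}: {ratio:.2f}x" for quant, ratio in row.items())
    print(f"input length {length:>6}: {formatted}")
```

A ratio above 1 means the quantized model is faster than BF16 at that input length, and a ratio below 1 means it is slower. The same helper works for any other table on this page once its rows are entered in the same layout.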
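
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-0.6B | 1 | BF16 | 1 | 58.57 | 1394 |
| Qwen3-0.6B | 1 | FP8 | 1 | 24.60 | 1217 |
| Qwen3-0.6B | 1 | GPTQ-Int8 | 1 | 26.56 | 986 |
| Qwen3-0.6B | 6144 | BF16 | 1 | 154.82 | 2066 |
| Qwen3-0.6B | 6144 | FP8 | 1 | 73.96 | 1943 |
| Qwen3-0.6B | 6144 | GPTQ-Int8 | 1 | 93.84 | 1658 |
| Qwen3-0.6B | 14336 | BF16 | 1 | 168.48 | 2963 |
| Qwen3-0.6B | 14336 | FP8 | 1 | 104.99 | 2839 |
| Qwen3-0.6B | 14336 | GPTQ-Int8 | 1 | 219.61 | 2554 |
| Qwen3-0.6B | 30720 | BF16 | 1 | 175.93 | 4755 |
| Qwen3-0.6B | 30720 | FP8 | 1 | 132.78 | 4632 |
| Qwen3-0.6B | 30720 | GPTQ-Int8 | 1 | 345.71 | 4347 |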
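
Qwen3-1.7B (SGLang)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-1.7B | 1 | BF16 | 1 | 227.80 | |
| Qwen3-1.7B | 1 | FP8 | 1 | 333.90 | |
| Qwen3-1.7B | 1 | GPTQ-Int8 | 1 | 257.40 | |
| Qwen3-1.7B | 6144 | BF16 | 1 | 838.28 | |
| Qwen3-1.7B | 6144 | FP8 | 1 | 1198.20 | |
| Qwen3-1.7B | 6144 | GPTQ-Int8 | 1 | 945.91 | |
| Qwen3-1.7B | 14336 | BF16 | 1 | 1525.71 | |
| Qwen3-1.7B | 14336 | FP8 | 1 | 2095.61 | |
| Qwen3-1.7B | 14336 | GPTQ-Int8 | 1 | 1707.63 | |
| Qwen3-1.7B | 30720 | BF16 | 1 | 2439.03 | |
| Qwen3-1.7B | 30720 | FP8 | 1 | 3165.32 | |
| Qwen3-1.7B | 30720 | GPTQ-Int8 | 1 | 2706.16 | |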
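
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-1.7B | 1 | BF16 | 1 | 59.83 | 3412 |
| Qwen3-1.7B | 1 | FP8 | 1 | 23.83 | 2726 |
| Qwen3-1.7B | 1 | GPTQ-Int8 | 1 | 28.06 | 2229 |
| Qwen3-1.7B | 6144 | BF16 | 1 | 238.53 | 4213 |
| Qwen3-1.7B | 6144 | FP8 | 1 | 90.87 | 3462 |
| Qwen3-1.7B | 6144 | GPTQ-Int8 | 1 | 110.82 | 2901 |
| Qwen3-1.7B | 14336 | BF16 | 1 | 352.59 | 5109 |
| Qwen3-1.7B | 14336 | FP8 | 1 | 153.37 | 4359 |
| Qwen3-1.7B | 14336 | GPTQ-Int8 | 1 | 222.78 | 3798 |
| Qwen3-1.7B | 30720 | BF16 | 1 | 418.13 | 6902 |
| Qwen3-1.7B | 30720 | FP8 | 1 | 235.61 | 6151 |
| Qwen3-1.7B | 30720 | GPTQ-Int8 | 1 | 386.85 | 5590 |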
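
Qwen3-4B (SGLang)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-4B | 1 | BF16 | 1 | 133.13 | |
| Qwen3-4B | 1 | FP8 | 1 | 200.61 | |
| Qwen3-4B | 1 | AWQ-INT4 | 1 | 199.71 | |
| Qwen3-4B | 6144 | BF16 | 1 | 466.19 | |
| Qwen3-4B | 6144 | FP8 | 1 | 662.26 | |
| Qwen3-4B | 6144 | AWQ-INT4 | 1 | 640.07 | |
| Qwen3-4B | 14336 | BF16 | 1 | 789.25 | |
| Qwen3-4B | 14336 | FP8 | 1 | 1066.23 | |
| Qwen3-4B | 14336 | AWQ-INT4 | 1 | 1006.23 | |
| Qwen3-4B | 30720 | BF16 | 1 | 1165.75 | |
| Qwen3-4B | 30720 | FP8 | 1 | 1467.71 | |
| Qwen3-4B | 30720 | AWQ-INT4 | 1 | 1358.84 | |
| Qwen3-4B | 63488 | BF16 | 1 | 1423.98 | |
| Qwen3-4B | 63488 | FP8 | 1 | 1660.67 | |
| Qwen3-4B | 63488 | AWQ-INT4 | 1 | 1513.97 | |
| Qwen3-4B | 129042 | BF16 | 1 | 1371.04 | |
| Qwen3-4B | 129042 | FP8 | 1 | 1497.27 | |
| Qwen3-4B | 129042 | AWQ-INT4 | 1 | 1375.71 | |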
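
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-4B | 1 | BF16 | 1 | 45.94 | 7973 |
| Qwen3-4B | 1 | FP8 | 1 | 17.33 | 5281 |
| Qwen3-4B | 1 | AWQ-INT4 | 1 | 51.57 | 2915 |
| Qwen3-4B | 6144 | BF16 | 1 | 159.95 | 8860 |
| Qwen3-4B | 6144 | FP8 | 1 | 60.55 | 6144 |
| Qwen3-4B | 6144 | AWQ-INT4 | 1 | 183.04 | 3881 |
| Qwen3-4B | 14336 | BF16 | 1 | 195.31 | 10012 |
| Qwen3-4B | 14336 | FP8 | 1 | 96.81 | 7297 |
| Qwen3-4B | 14336 | AWQ-INT4 | 1 | 265.22 | 5151 |
| Qwen3-4B | 30720 | BF16 | 1 | 217.97 | 12317 |
| Qwen3-4B | 30720 | FP8 | 1 | 138.84 | 9611 |
| Qwen3-4B | 30720 | AWQ-INT4 | 1 | 481.69 | 7742 |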
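
Qwen3-8B (SGLang)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-8B | 1 | BF16 | 1 | 81.73 | |
| Qwen3-8B | 1 | FP8 | 1 | 150.25 | |
| Qwen3-8B | 1 | AWQ-INT4 | 1 | 144.11 | |
| Qwen3-8B | 6144 | BF16 | 1 | 296.25 | |
| Qwen3-8B | 6144 | FP8 | 1 | 516.64 | |
| Qwen3-8B | 6144 | AWQ-INT4 | 1 | 477.89 | |
| Qwen3-8B | 14336 | BF16 | 1 | 524.70 | |
| Qwen3-8B | 14336 | FP8 | 1 | 859.92 | |
| Qwen3-8B | 14336 | AWQ-INT4 | 1 | 770.44 | |
| Qwen3-8B | 30720 | BF16 | 1 | 832.67 | |
| Qwen3-8B | 30720 | FP8 | 1 | 1242.24 | |
| Qwen3-8B | 30720 | AWQ-INT4 | 1 | 1075.91 | |
| Qwen3-8B | 63488 | BF16 | 1 | 1112.78 | |
| Qwen3-8B | 63488 | FP8 | 1 | 1476.46 | |
| Qwen3-8B | 63488 | AWQ-INT4 | 1 | 1254.91 | |
| Qwen3-8B | 129042 | BF16 | 1 | 1173.32 | |
| Qwen3-8B | 129042 | FP8 | 1 | 1393.21 | |
| Qwen3-8B | 129042 | AWQ-INT4 | 1 | 1198.06 | |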
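
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-8B | 1 | BF16 | 1 | 45.32 | 15947 |
| Qwen3-8B | 1 | FP8 | 1 | 15.46 | 9323 |
| Qwen3-8B | 1 | AWQ-INT4 | 1 | 51.33 | 6177 |
| Qwen3-8B | 6144 | BF16 | 1 | 146.12 | 16811 |
| Qwen3-8B | 6144 | FP8 | 1 | 55.07 | 10187 |
| Qwen3-8B | 6144 | AWQ-INT4 | 1 | 163.23 | 7113 |
| Qwen3-8B | 14336 | BF16 | 1 | 183.29 | 17963 |
| Qwen3-8B | 14336 | FP8 | 1 | 89.64 | 11340 |
| Qwen3-8B | 14336 | AWQ-INT4 | 1 | 242.97 | 8409 |
| Qwen3-8B | 30720 | BF16 | 1 | 208.98 | 20267 |
| Qwen3-8B | 30720 | FP8 | 1 | 130.93 | 13644 |
| Qwen3-8B | 30720 | AWQ-INT4 | 1 | 438.62 | 11001 |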
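
Qwen3-14B (SGLang)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-14B | 1 | BF16 | 1 | 47.10 | |
| Qwen3-14B | 1 | FP8 | 1 | 97.11 | |
| Qwen3-14B | 1 | AWQ-INT4 | 1 | 96.49 | |
| Qwen3-14B | 6144 | BF16 | 1 | 174.85 | |
| Qwen3-14B | 6144 | FP8 | 1 | 342.95 | |
| Qwen3-14B | 6144 | AWQ-INT4 | 1 | 321.62 | |
| Qwen3-14B | 14336 | BF16 | 1 | 317.56 | |
| Qwen3-14B | 14336 | FP8 | 1 | 587.33 | |
| Qwen3-14B | 14336 | AWQ-INT4 | 1 | 525.74 | |
| Qwen3-14B | 30720 | BF16 | 1 | 525.80 | |
| Qwen3-14B | 30720 | FP8 | 1 | 880.72 | |
| Qwen3-14B | 30720 | AWQ-INT4 | 1 | 744.74 | |
| Qwen3-14B | 63488 | BF16 | 1 | 742.36 | |
| Qwen3-14B | 63488 | FP8 | 1 | 1089.04 | |
| Qwen3-14B | 63488 | AWQ-INT4 | 1 | 884.06 | |
| Qwen3-14B | 129042 | BF16 | 1 | 826.15 | |
| Qwen3-14B | 129042 | FP8 | 1 | 1049.64 | |
| Qwen3-14B | 129042 | AWQ-INT4 | 1 | 857.56 | |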
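
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-14B | 1 | BF16 | 1 | 40.66 | 28402 |
| Qwen3-14B | 1 | FP8 | 1 | 13.02 | 16012 |
| Qwen3-14B | 1 | AWQ-INT4 | 1 | 44.67 | 9962 |
| Qwen3-14B | 6144 | BF16 | 1 | 108.52 | 29495 |
| Qwen3-14B | 6144 | FP8 | 1 | 44.86 | 16972 |
| Qwen3-14B | 6144 | AWQ-INT4 | 1 | 128.08 | 11020 |
| Qwen3-14B | 14336 | BF16 | 1 | 136.36 | 30775 |
| Qwen3-14B | 14336 | FP8 | 1 | 71.96 | 18253 |
| Qwen3-14B | 14336 | AWQ-INT4 | 1 | 220.62 | 12438 |
| Qwen3-14B | 30720 | BF16 | 1 | 155.38 | 33336 |
| Qwen3-14B | 30720 | FP8 | 1 | 102.63 | 20813 |
| Qwen3-14B | 30720 | AWQ-INT4 | 1 | 363.25 | 15323 |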
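
Qwen3-32B (SGLang)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-32B | 1 | BF16 | 1 | 20.72 | |
| Qwen3-32B | 1 | FP8 | 1 | 46.17 | |
| Qwen3-32B | 1 | AWQ-INT4 | 1 | 47.67 | |
| Qwen3-32B | 6144 | BF16 | 1 | 77.82 | |
| Qwen3-32B | 6144 | FP8 | 1 | 165.71 | |
| Qwen3-32B | 6144 | AWQ-INT4 | 1 | 159.99 | |
| Qwen3-32B | 14336 | BF16 | 1 | 143.08 | |
| Qwen3-32B | 14336 | FP8 | 1 | 287.60 | |
| Qwen3-32B | 14336 | AWQ-INT4 | 1 | 260.44 | |
| Qwen3-32B | 30720 | BF16 | 1 | 240.75 | |
| Qwen3-32B | 30720 | FP8 | 1 | 436.59 | |
| Qwen3-32B | 30720 | AWQ-INT4 | 1 | 366.84 | |
| Qwen3-32B | 63488 | BF16 | 1 | 342.96 | |
| Qwen3-32B | 63488 | FP8 | 1 | 532.18 | |
| Qwen3-32B | 63488 | AWQ-INT4 | 1 | 425.23 | |
| Qwen3-32B | 129042 | BF16 | 2 | 711.40 | TP=2 |
| Qwen3-32B | 129042 | FP8 | 1 | 491.45 | |
| Qwen3-32B | 129042 | AWQ-INT4 | 1 | 395.96 | |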
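
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-32B | 1 | BF16 | 1 | 26.24 | 62751 |
| Qwen3-32B | 1 | FP8 | 1 | 7.37 | 33379 |
| Qwen3-32B | 1 | AWQ-INT4 | 1 | 41.8 | 19109 |
| Qwen3-32B | 6144 | BF16 | 1 | 51.41 | 64583 |
| Qwen3-32B | 6144 | FP8 | 1 | 23.57 | 34915 |
| Qwen3-32B | 6144 | AWQ-INT4 | 1 | 68.71 | 20795 |
| Qwen3-32B | 14336 | BF16 | 1 | 62.41 | 66632 |
| Qwen3-32B | 14336 | FP8 | 1 | 36.30 | 36963 |
| Qwen3-32B | 14336 | AWQ-INT4 | 1 | 107.02 | 23105 |
| Qwen3-32B | 30720 | BF16 | 1 | 69.16 | 70728 |
| Qwen3-32B | 30720 | FP8 | 1 | 49.44 | 41060 |
| Qwen3-32B | 30720 | AWQ-INT4 | 1 | 188.11 | 27718 |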
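
The GPU Memory column also shows how the footprint grows with input length. As a rough illustration, the sketch below computes the average memory growth per additional input token for Qwen3-32B between the shortest and longest measured input lengths. The numbers are copied from the table; the helper name `growth_per_token` and the dictionary layout are assumptions made for this example, and the source does not state what accounts for the growth.

```python
# GPU memory (MB) copied from the Qwen3-32B table that has the GPU Memory column.
# Layout: {input_length: {quantization: memory_mb}} (an illustrative choice).
memory_mb = {
    1: {"BF16": 62751, "FP8": 33379, "AWQ-INT4": 19109},
    6144: {"BF16": 64583, "FP8": 34915, "AWQ-INT4": 20795},
    14336: {"BF16": 66632, "FP8": 36963, "AWQ-INT4": 23105},
    30720: {"BF16": 70728, "FP8": 41060, "AWQ-INT4": 27718},
}


def growth_per_token(table, quant):
    """Average MB added per extra input token between the shortest and longest inputs."""
    lengths = sorted(table)
    shortest, longest = lengths[0], lengths[-1]
    delta_mb = table[longest][quant] - table[shortest][quant]
    return delta_mb / (longest - shortest)


# Hypothetical helper applied to every quantization listed in the first row.
growth = {quant: growth_per_token(memory_mb, quant) for quant in memory_mb[1]}

for quant, mb in growth.items():
    print(f"{quant}: {mb:.3f} MB per additional input token")
```

This is a two-point average, so it hides any non-linearity between the measured input lengths; fitting a line through all four points would be a natural refinement.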
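
Qwen3-30B-A3B (SGLang)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1 | BF16 | 1 | 137.18 | |
| Qwen3-30B-A3B | 1 | FP8 | 1 | 155.55 | |
| Qwen3-30B-A3B | 1 | GPTQ-INT4 | 1 | 31.29 | GPTQ-Marlin |
| Qwen3-30B-A3B | 6144 | BF16 | 1 | 490.10 | |
| Qwen3-30B-A3B | 6144 | FP8 | 1 | 551.34 | |
| Qwen3-30B-A3B | 6144 | GPTQ-INT4 | 1 | 120.13 | GPTQ-Marlin |
| Qwen3-30B-A3B | 14336 | BF16 | 1 | 849.62 | |
| Qwen3-30B-A3B | 14336 | FP8 | 1 | 945.13 | |
| Qwen3-30B-A3B | 14336 | GPTQ-INT4 | 1 | 227.27 | GPTQ-Marlin |
| Qwen3-30B-A3B | 30720 | BF16 | 1 | 1283.94 | |
| Qwen3-30B-A3B | 30720 | FP8 | 1 | 1405.91 | |
| Qwen3-30B-A3B | 30720 | GPTQ-INT4 | 1 | 404.45 | GPTQ-Marlin |
| Qwen3-30B-A3B | 63488 | BF16 | 1 | 1538.79 | |
| Qwen3-30B-A3B | 63488 | FP8 | 1 | 1647.89 | |
| Qwen3-30B-A3B | 63488 | GPTQ-INT4 | 1 | 617.09 | GPTQ-Marlin |
| Qwen3-30B-A3B | 129042 | BF16 | 1 | 1385.65 | |
| Qwen3-30B-A3B | 129042 | FP8 | 1 | 1442.14 | |
| Qwen3-30B-A3B | 129042 | GPTQ-INT4 | 1 | 704.82 | GPTQ-Marlin |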
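
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) | Note |
|---|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1 | BF16 | 1 | 1.89 | 58462 | |
| Qwen3-30B-A3B | 1 | FP8 | 1 | 0.44 | 30296 | |
| Qwen3-30B-A3B | 1 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 6144 | BF16 | 1 | 7.45 | 59037 | |
| Qwen3-30B-A3B | 6144 | FP8 | 1 | 1.77 | 30872 | |
| Qwen3-30B-A3B | 6144 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 14336 | BF16 | 1 | 14.47 | 59806 | |
| Qwen3-30B-A3B | 14336 | FP8 | 1 | 3.5 | 31641 | |
| Qwen3-30B-A3B | 14336 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 30720 | BF16 | 1 | 27.03 | 61342 | |
| Qwen3-30B-A3B | 30720 | FP8 | 1 | 6.86 | 33177 | |
| Qwen3-30B-A3B | 30720 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |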
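
Qwen3-235B-A22B (SGLang)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-235B-A22B | 1 | BF16 | 8 | 74.50 | TP=8 |
| Qwen3-235B-A22B | 1 | FP8 | 4 | 71.65 | TP=4 |
| Qwen3-235B-A22B | 1 | GPTQ-INT4 | 4 | 14.69 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 6144 | BF16 | 8 | 289.03 | TP=8 |
| Qwen3-235B-A22B | 6144 | FP8 | 4 | 275.16 | TP=4 |
| Qwen3-235B-A22B | 6144 | GPTQ-INT4 | 4 | 56.97 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 14336 | BF16 | 8 | 546.73 | TP=8 |
| Qwen3-235B-A22B | 14336 | FP8 | 4 | 514.23 | TP=4 |
| Qwen3-235B-A22B | 14336 | GPTQ-INT4 | 4 | 109.13 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 30720 | BF16 | 8 | 979.41 | TP=8 |
| Qwen3-235B-A22B | 30720 | FP8 | 4 | 887.90 | TP=4 |
| Qwen3-235B-A22B | 30720 | GPTQ-INT4 | 4 | 198.99 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 63488 | BF16 | 8 | 1493.91 | TP=8 |
| Qwen3-235B-A22B | 63488 | FP8 | 4 | 1269.34 | TP=4 |
| Qwen3-235B-A22B | 63488 | GPTQ-INT4 | 4 | 422.77 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 129042 | BF16 | 8 | 1639.54 | TP=8 |
| Qwen3-235B-A22B | 129042 | FP8 | 4 | 1319.66 | TP=4 |
| Qwen3-235B-A22B | 129042 | GPTQ-INT4 | 4 | 552.28 | TP=4, GPTQ-Marlin |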