NVIDIA Data Center Deep Learning Product Performance: AI Inference
MLPerf Inference v5.0 Performance Benchmarks
Offline Scenario, Closed Division
| Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset |
|---|---|---|---|---|---|---|
| Llama3.1 405B | 13,886 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 1,538 tokens/sec | 8x B200 | SYS-421GE-NBRT-LCC | NVIDIA B200-SXM-180GB | 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 574 tokens/sec | 8x H200 | Cisco UCS C885A M8 | NVIDIA H200-SXM-141GB | 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 98,858 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200-SXM-180GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 35,453 tokens/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024) |
| Mixtral 8x7B | 128,795 tokens/sec | 8x B200 | SYS-421GE-NBRT-LCC | NVIDIA B200-SXM-180GB | 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048) |
| Mixtral 8x7B | 63,515 tokens/sec | 8x H200 | ThinkSystem SR780a V3 | NVIDIA H200-SXM-141GB | 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048) |
| Stable Diffusion XL | 30 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200-SXM-180GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
| Stable Diffusion XL | 19 samples/sec | 8x H200 | AS-4125GS-TNHR2-LCC | NVIDIA H200-SXM-141GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
| RGAT | 450,175 samples/sec | 8x H200 | ThinkSystem SR780a V3 | NVIDIA H200-SXM-141GB | 99% of FP32 (72.86%) | IGBH |
| GPT-J | 21,626 tokens/sec | 8x H200 | ThinkSystem SR780a V3 | NVIDIA H200-SXM-141GB | 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| ResNet-50 | 773,300 samples/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | 76.46% Top1 | ImageNet (224x224) |
| RetinaNet | 15,200 samples/sec | 8x H200 | AS-4125GS-TNHR2-LCC | NVIDIA H200-SXM-141GB | 0.3755 mAP | OpenImages (800x800) |
| DLRMv2 | 654,489 samples/sec | 8x H200 | HPE Cray XD670 with Cray ClusterStor | NVIDIA H200-SXM-141GB | 99% of FP32 (AUC=80.31%) | Synthetic Multihot Criteo Dataset |
| 3D-UNET | 55 samples/sec | 8x H200 | HPE Cray XD670 with Cray ClusterStor | NVIDIA H200-SXM-141GB | 99.9% of FP32 (0.86330 mean DICE score) | KiTS 2019 |
Server Scenario - Closed Division
| Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
|---|---|---|---|---|---|---|---|
| Llama3.1 405B | 8,850 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 1,080 tokens/sec | 8x B200 | SYS-A21GE-NBRT | NVIDIA B200-SXM-180GB | 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 294 tokens/sec | 8x H200 | Cisco UCS C885A M8 | NVIDIA H200-SXM-141GB | 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B Interactive | 62,266 tokens/sec | 8x B200 | SYS-A21GE-NBRT | NVIDIA B200-SXM-180GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B Interactive | 20,235 tokens/sec | 8x H200 | G893-SD1 | NVIDIA H200-SXM-141GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 98,443 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200-SXM-180GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 33,072 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Mixtral 8x7B | 129,047 tokens/sec | 8x B200 | SYS-421GE-NBRT-LCC | NVIDIA B200-SXM-180GB | 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048) |
| Mixtral 8x7B | 61,802 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048) |
| Stable Diffusion XL | 29 samples/sec | 8x B200 | SYS-A21GE-NBRT | NVIDIA B200-SXM-180GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
| Stable Diffusion XL | 18 samples/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
| GPT-J | 21,813 queries/sec | 8x H200 | Cisco UCS C885A M8 | NVIDIA H200-SXM-141GB | 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | 20 s | CNN Dailymail |
| ResNet-50 | 676,219 queries/sec | 8x H200 | G893-SD1 | NVIDIA H200-SXM-141GB | 76.46% Top1 | 15 ms | ImageNet (224x224) |
| RetinaNet | 14,589 queries/sec | 8x H200 | AS-4125GS-TNHR2-LCC | NVIDIA H200-SXM-141GB | 0.3755 mAP | 100 ms | OpenImages (800x800) |
| DLRMv2 | 590,167 queries/sec | 8x H200 | HPE Cray XD670 with Cray ClusterStor | NVIDIA H200-SXM-141GB | 99% of FP32 (AUC=80.31%) | 60 ms | Synthetic Multihot Criteo Dataset |
MLPerf™ v5.0 Inference, Closed Division. Accuracy targets: Llama3.1 405B 99% of FP16; Llama2 70B Interactive 99.9% of FP32; Llama2 70B 99.9% of FP32; Mixtral 8x7B 99% of FP16; Stable Diffusion XL; ResNet-50 v1.5; RetinaNet; RNN-T; RGAT; 3D U-Net 99.9% of FP32; GPT-J 99.9% of FP32; DLRM 99% of FP32.
Results from entries: 5.0-0011, 5.0-0033, 5.0-0041, 5.0-0051, 5.0-0053, 5.0-0056, 5.0-0058, 5.0-0060, 5.0-0070, 5.0-0072, 5.0-0074.
The MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
Llama2 70B max sequence length = 1,024. Mixtral 8x7B max sequence length = 2,048.
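The server-scenario LLM constraints above are expressed as TTFT (time to first token) and TPOT (time per output token) limits. As a rough illustration only (not MLPerf's official LoadGen logic), a single response's token timestamps can be checked against such a limit:

```python
def meets_latency_constraint(token_times_ms, ttft_limit_ms, tpot_limit_ms):
    """Check one response against TTFT/TPOT limits (illustrative sketch).

    token_times_ms: arrival time of each output token, in ms after the
    request was issued. TTFT is the first entry; TPOT is the average gap
    between consecutive tokens.
    """
    if not token_times_ms:
        return False
    if token_times_ms[0] > ttft_limit_ms:      # first token arrived too late
        return False
    if len(token_times_ms) == 1:               # no decode gaps to measure
        return True
    tpot = (token_times_ms[-1] - token_times_ms[0]) / (len(token_times_ms) - 1)
    return tpot <= tpot_limit_ms

# Llama2 70B server constraint from the table: TTFT 2000 ms, TPOT 200 ms
print(meets_latency_constraint([1500, 1650, 1800], 2000, 200))  # True
print(meets_latency_constraint([2500, 2600], 2000, 200))        # False (TTFT too high)
```

The function names and timestamps are hypothetical; MLPerf evaluates these constraints at configured percentiles across the whole query stream, not per response.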
LLM Inference Performance of NVIDIA Data Center Products
B200 Inference Performance - Per User
| Model | Attention | MoE | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek R1 671B | TP8 | EP8 | 1,024 | 2,048 | 253 output tokens/sec/user | 8x B200 | DGX B200 | FP4 | TensorRT-LLM | NVIDIA B200 |
Attention: Tensor Parallelism = 8. MoE: Expert Parallelism = 8. TensorRT-LLM version: internal release. Batch size = 1. Input tokens are not included in TPS calculations.
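Per the note above, the per-user figure counts only generated (output) tokens and is measured at batch size 1. A minimal sketch of the metric, with an illustrative latency value (not a measured one) chosen to land near the reported 253 tokens/sec/user:

```python
def tokens_per_sec_per_user(num_output_tokens, total_latency_s, num_users=1):
    """Per-user decode throughput: generated (output) tokens only,
    divided by end-to-end latency, per concurrent user.
    Input (prompt) tokens are excluded, matching the note above."""
    return num_output_tokens / total_latency_s / num_users

# Hypothetical example: 2,048 output tokens for one user in ~8.1 s
print(round(tokens_per_sec_per_user(2048, 8.1)))  # 253
```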
B200 Inference Performance - Max Throughput
| Model | Attention | MoE | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek R1 671B | DP8 | EP8 | 1,024 | 2,048 | 30,389 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM | NVIDIA B200 |
Attention: Data Parallelism = 8. MoE: Expert Parallelism = 8. TensorRT-LLM version: internal release. Input tokens are not included in TPS calculations.
H200 Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v3.1 405B | 1 | 8 | 128 | 128 | 3,874 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 128 | 2048 | 5,938 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 128 | 4096 | 5,168 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 8 | 1 | 2048 | 128 | 764 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14a | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 5000 | 500 | 669 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 500 | 2000 | 5,084 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 3,400 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 2,941 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 535 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 1 | 128 | 128 | 4,021 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 1 | 128 | 2048 | 4,166 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 2 | 128 | 4096 | 6,527 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 1 | 2048 | 128 | 466 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 1 | 5000 | 500 | 560 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 2 | 500 | 2000 | 6,848 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 1 | 1000 | 1000 | 2,823 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 2 | 2048 | 2048 | 4,184 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 2 | 20000 | 2000 | 641 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 128 | 128 | 29,526 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 128 | 2048 | 25,399 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 17,371 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,794 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,988 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 21,021 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 17,538 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 11,969 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,804 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 128 | 128 | 31,938 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 128 | 2048 | 27,409 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 128 | 4096 | 18,505 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 2048 | 128 | 3,834 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 5000 | 500 | 4,042 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 500 | 2000 | 22,355 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 1000 | 1000 | 18,426 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 2048 | 2048 | 12,347 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 20000 | 2000 | 1,823 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 128 | 128 | 17,158 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 128 | 2048 | 15,095 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 2 | 128 | 4096 | 21,565 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 2048 | 128 | 2,010 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 5000 | 500 | 2,309 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 500 | 2000 | 12,105 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 1000 | 1000 | 10,371 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 2 | 2048 | 2048 | 14,018 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 2 | 20000 | 2000 | 2,227 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 128 | 128 | 25,179 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 128 | 2048 | 32,623 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 128 | 4096 | 25,753 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 2048 | 128 | 3,095 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 5000 | 500 | 4,209 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 500 | 2000 | 27,430 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 1000 | 1000 | 20,097 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 2048 | 2048 | 15,799 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 20000 | 2000 | 2,897 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200 |
TP: Tensor Parallelism. PP: Pipeline Parallelism. For more information on pipeline parallelism, see the Llama v3.1 405B blog. Output tokens/second on Llama v3.1 405B is inclusive of the time to generate the first token (tokens/s = total generated tokens / total latency).
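The throughput definition in the note above is simple enough to state directly. A minimal sketch, with hypothetical batch and latency numbers (not measurements from the table):

```python
def output_tokens_per_sec(total_generated_tokens, total_latency_s):
    """Throughput as defined in the note above:
    tokens/s = total generated tokens / total latency,
    inclusive of the time to generate the first token."""
    return total_generated_tokens / total_latency_s

# Illustrative only: a batch of 64 requests, each producing 2,048 output
# tokens, finishing in 44.6 s of wall-clock time overall.
print(round(output_tokens_per_sec(64 * 2048, 44.6)))  # 2939
```

Note this aggregate metric differs from the per-user metric used in the B200 "Per User" table, which divides by concurrent users and runs at batch size 1.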
GH200 Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,637 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 4 | 128 | 2048 | 10,358 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 4 | 128 | 4096 | 6,628 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 1 | 2048 | 128 | 425 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 1 | 5000 | 500 | 422 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 4 | 500 | 2000 | 9,091 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 1 | 1000 | 1000 | 1,746 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 4 | 2048 | 2048 | 4,865 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 4 | 20000 | 2000 | 959 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 128 | 128 | 29,853 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 128 | 2048 | 21,770 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 14,190 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,844 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,933 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 17,137 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 16,483 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 10,266 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,560 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 128 | 128 | 32,498 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 128 | 2048 | 23,337 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 128 | 4096 | 15,018 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 2048 | 128 | 3,813 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 5000 | 500 | 3,950 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 500 | 2000 | 18,556 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 1000 | 1000 | 17,252 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 2048 | 2048 | 10,756 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 20000 | 2000 | 1,601 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 128 | 128 | 16,859 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 128 | 2048 | 11,120 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 4 | 128 | 4096 | 30,066 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 2048 | 128 | 1,994 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 5000 | 500 | 2,078 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 500 | 2000 | 9,193 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 1000 | 1000 | 8,849 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 2048 | 2048 | 5,545 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 20000 | 2000 | 861 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
TP: Tensor Parallelism. PP: Pipeline Parallelism.
H100 Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,378 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 2 | 128 | 4096 | 3,897 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 2 | 2048 | 128 | 774 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 2 | 500 | 2000 | 4,973 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 2 | 1000 | 1000 | 4,391 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 2 | 2048 | 2048 | 2,898 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 4 | 20000 | 2000 | 920 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 1 | 128 | 128 | 15,962 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 128 | 2048 | 23,010 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 128 | 4096 | 14,237 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 1 | 2048 | 128 | 1,893 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 5000 | 500 | 3,646 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 500 | 2000 | 18,186 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.14.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 1000 | 1000 | 15,932 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.14.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 2048 | 2048 | 10,686 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 20000 | 2000 | 1,757 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
TP: Tensor Parallelism. PP: Pipeline Parallelism.
L40S Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v3.1 8B | 1 | 1 | 128 | 128 | 9,105 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 128 | 2048 | 5,366 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 3,026 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 1,067 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 981 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 4,274 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 4,055 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 2,225 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 328 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Mixtral 8x7B | 4 | 1 | 128 | 128 | 15,278 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 2 | 2 | 128 | 2048 | 9,087 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 1 | 4 | 128 | 4096 | 5,736 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Mixtral 8x7B | 4 | 1 | 2048 | 128 | 2,098 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 2 | 2 | 5000 | 500 | 1,558 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 2 | 2 | 500 | 2000 | 7,974 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 2 | 2 | 1000 | 1000 | 6,579 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 2 | 2 | 2048 | 2048 | 4,217 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
TP: Tensor Parallelism. PP: Pipeline Parallelism.
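In every table above, a row's GPU count is the product of its pipeline-parallel (PP) and tensor-parallel (TP) degrees: each of the PP pipeline stages holds a TP-way sharded copy of its layers. A trivial helper illustrating that relationship:

```python
def gpus_required(pp, tp):
    """Total GPUs for one configuration:
    pipeline-parallel stages (pp) x tensor-parallel shards per stage (tp)."""
    return pp * tp

print(gpus_required(2, 2))  # 4  (e.g. the PP=2, TP=2 Mixtral 8x7B rows on 4x L40S)
print(gpus_required(1, 8))  # 8  (e.g. the PP=1, TP=8 Llama v3.1 405B rows on 8x H200)
```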
Inference Performance of NVIDIA Data Center Products