NVIDIA Data Center Deep Learning Product Performance: AI Inference

MLPerf Inference v5.0 Performance Benchmarks

Offline Scenario, Closed Division

Network Throughput GPU Server GPU Version Target Accuracy Dataset
Llama3.1 405B 13,886 tokens/sec 72x GB200 NVIDIA GB200 NVL72 NVIDIA GB200 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) Subset of LongBench, LongDataCollections, Ruler, GovReport
1,538 tokens/sec 8x B200 SYS-421GE-NBRT-LCC NVIDIA B200-SXM-180GB 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) Subset of LongBench, LongDataCollections, Ruler, GovReport
574 tokens/sec 8x H200 Cisco UCS C885A M8 NVIDIA H200-SXM-141GB 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B 98,858 tokens/sec 8x B200 NVIDIA DGX B200 NVIDIA B200-SXM-180GB 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) OpenOrca (max_seq_len=1024)
35,453 tokens/sec 8x H200 ThinkSystem SR680a V3 NVIDIA H200-SXM-141GB 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) OpenOrca (max_seq_len=1024)
Mixtral 8x7B 128,795 tokens/sec 8x B200 SYS-421GE-NBRT-LCC NVIDIA B200-SXM-180GB 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
63,515 tokens/sec 8x H200 ThinkSystem SR780a V3 NVIDIA H200-SXM-141GB 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
Stable Diffusion XL 30 samples/sec 8x B200 NVIDIA DGX B200 NVIDIA B200-SXM-180GB FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] Subset of coco-2014 val
19 samples/sec 8x H200 AS-4125GS-TNHR2-LCC NVIDIA H200-SXM-141GB FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] Subset of coco-2014 val
RGAT 450,175 samples/sec 8x H200 ThinkSystem SR780a V3 NVIDIA H200-SXM-141GB 99% of FP32 (72.86%) IGBH
GPT-J 21,626 tokens/sec 8x H200 ThinkSystem SR780a V3 NVIDIA H200-SXM-141GB 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) CNN Dailymail (v3.0.0, max_seq_len=2048)
ResNet-50 773,300 samples/sec 8x H200 ThinkSystem SR680a V3 NVIDIA H200-SXM-141GB 76.46% Top1 ImageNet (224x224)
RetinaNet 15,200 samples/sec 8x H200 AS-4125GS-TNHR2-LCC NVIDIA H200-SXM-141GB 0.3755 mAP OpenImages (800x800)
DLRMv2 654,489 samples/sec 8x H200 HPE Cray XD670 with Cray ClusterStor NVIDIA H200-SXM-141GB 99% of FP32 (AUC=80.31%) Synthetic Multihot Criteo Dataset
3D-UNET 55 samples/sec 8x H200 HPE Cray XD670 with Cray ClusterStor NVIDIA H200-SXM-141GB 99.9% of FP32 (0.86330 mean DICE score) KiTS 2019

Server Scenario, Closed Division

Network Throughput GPU Server GPU Version Target Accuracy MLPerf Server Latency Constraints (ms) Dataset
Llama3.1 405B 8,850 tokens/sec 72x GB200 NVIDIA GB200 NVL72 NVIDIA GB200 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) TTFT/TPOT: 6000 ms/175 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
1,080 tokens/sec 8x B200 SYS-A21GE-NBRT NVIDIA B200-SXM-180GB 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) TTFT/TPOT: 6000 ms/175 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
294 tokens/sec 8x H200 Cisco UCS C885A M8 NVIDIA H200-SXM-141GB 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) TTFT/TPOT: 6000 ms/175 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B Interactive 62,266 tokens/sec 8x B200 SYS-A21GE-NBRT NVIDIA B200-SXM-180GB 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) TTFT/TPOT: 450 ms/40 ms OpenOrca (max_seq_len=1024)
20,235 tokens/sec 8x H200 G893-SD1 NVIDIA H200-SXM-141GB 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) TTFT/TPOT: 450 ms/40 ms OpenOrca (max_seq_len=1024)
Llama2 70B 98,443 tokens/sec 8x B200 NVIDIA DGX B200 NVIDIA B200-SXM-180GB 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) TTFT/TPOT: 2000 ms/200 ms OpenOrca (max_seq_len=1024)
33,072 tokens/sec 8x H200 NVIDIA H200 NVIDIA H200-SXM-141GB-CTS 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) TTFT/TPOT: 2000 ms/200 ms OpenOrca (max_seq_len=1024)
Mixtral 8x7B 129,047 tokens/sec 8x B200 SYS-421GE-NBRT-LCC NVIDIA B200-SXM-180GB 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) TTFT/TPOT: 2000 ms/200 ms OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
61,802 tokens/sec 8x H200 NVIDIA H200 NVIDIA H200-SXM-141GB-CTS 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) TTFT/TPOT: 2000 ms/200 ms OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
Stable Diffusion XL 29 samples/sec 8x B200 SYS-A21GE-NBRT NVIDIA B200-SXM-180GB FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] 20 s Subset of coco-2014 val
18 samples/sec 8x H200 NVIDIA H200 NVIDIA H200-SXM-141GB-CTS FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] 20 s Subset of coco-2014 val
GPT-J 21,813 tokens/sec 8x H200 Cisco UCS C885A M8 NVIDIA H200-SXM-141GB 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) 20 s CNN Dailymail
ResNet-50 676,219 queries/sec 8x H200 G893-SD1 NVIDIA H200-SXM-141GB 76.46% Top1 15 ms ImageNet (224x224)
RetinaNet 14,589 queries/sec 8x H200 AS-4125GS-TNHR2-LCC NVIDIA H200-SXM-141GB 0.3755 mAP 100 ms OpenImages (800x800)
DLRMv2 590,167 queries/sec 8x H200 HPE Cray XD670 with Cray ClusterStor NVIDIA H200-SXM-141GB 99% of FP32 (AUC=80.31%) 60 ms Synthetic Multihot Criteo Dataset

MLPerf™ v5.0 Inference Closed: Llama3.1 405B 99% of FP16, Llama2 70B Interactive 99.9% of FP32, Llama2 70B 99.9% of FP32, Mixtral 8x7B 99% of FP16, Stable Diffusion XL, ResNet-50 v1.5, RetinaNet, RNN-T, RGAT, 3D U-Net 99.9% of FP32 accuracy target, GPT-J 99.9% of FP32 accuracy target, DLRMv2 99% of FP32 accuracy target: 5.0-0011, 5.0-0033, 5.0-0041, 5.0-0051, 5.0-0053, 5.0-0056, 5.0-0058, 5.0-0060, 5.0-0070, 5.0-0072, 5.0-0074. The MLPerf™ name and logo are trademarks of MLCommons Association. See https://mlcommons.org/ for more information.
Llama2 70B Max Sequence Length = 1,024
Mixtral 8x7B Max Sequence Length = 2,048
Results for additional MLPerf™ scenarios and the full latency constraints are published at mlcommons.org.
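The server-scenario latency constraints above can be checked mechanically: a response meets the constraint if its time-to-first-token (TTFT) and its average time-per-output-token (TPOT) both fall under the benchmark's limits. Below is a minimal sketch of that check; the function name and the token-arrival timings are illustrative, not part of the MLPerf harness:

```python
def meets_latency_constraints(token_times_ms, ttft_limit_ms, tpot_limit_ms):
    """Check one response stream against MLPerf-style TTFT/TPOT limits.

    token_times_ms: arrival time of each generated token, measured in ms
    from the moment the request was issued.
    """
    # TTFT is the arrival time of the first generated token.
    ttft = token_times_ms[0]
    if ttft > ttft_limit_ms:
        return False
    # TPOT is averaged over the decode phase (tokens after the first).
    if len(token_times_ms) > 1:
        decode_time = token_times_ms[-1] - token_times_ms[0]
        tpot = decode_time / (len(token_times_ms) - 1)
        if tpot > tpot_limit_ms:
            return False
    return True

# Llama2 70B Interactive limits: TTFT 450 ms, TPOT 40 ms.
# First token at 400 ms, then one token every 35 ms: passes both limits.
times = [400.0] + [400.0 + 35.0 * i for i in range(1, 10)]
print(meets_latency_constraints(times, 450, 40))  # True
```

This is why the Interactive row reports lower throughput than the standard Llama2 70B row: the tighter 450 ms/40 ms budget limits how aggressively requests can be batched.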

LLM Inference Performance of NVIDIA Data Center Products

B200 Inference Performance - Per User

Model Attention MoE Input Length Output Length Throughput GPU Server Precision Framework GPU Version
DeepSeek R1 671B TP8 EP8 1,024 2,048 253 output tokens/sec/user 8x B200 DGX B200 FP4 TensorRT-LLM NVIDIA B200

Attention: Tensor Parallelism = 8
MoE: Expert Parallelism = 8
TensorRT-LLM version: internal release
Batch Size = 1
Input tokens not included in TPS calculations
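The per-user figure counts only generated (output) tokens, as noted above. A sketch of the arithmetic, with an illustrative latency (the 8.1 s figure is assumed for the example, not taken from the measurement):

```python
def output_tokens_per_sec_per_user(output_tokens, total_latency_s, num_users=1):
    # Input (prompt) tokens are excluded from the TPS calculation;
    # only generated tokens count toward throughput.
    return output_tokens / total_latency_s / num_users

# e.g. 2,048 output tokens generated for a single user in ~8.1 s
print(round(output_tokens_per_sec_per_user(2048, 8.1)))  # 253
```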

B200 Inference Performance - Max Throughput

Model Attention MoE Input Length Output Length Throughput GPU Server Precision Framework GPU Version
DeepSeek R1 671B DP8 EP8 1,024 2,048 30,389 output tokens/sec 8x B200 DGX B200 FP4 TensorRT-LLM NVIDIA B200

Attention: Data Parallelism = 8
MoE: Expert Parallelism = 8
TensorRT-LLM version: internal release
Input tokens not included in TPS calculations

H200 Inference Performance - Max Throughput

Model PP TP Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Llama v3.1 405B 1 8 128 128 3,874 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 405B 1 8 128 2048 5,938 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 405B 1 8 128 4096 5,168 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 405B 8 1 2048 128 764 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.14a NVIDIA H200
Llama v3.1 405B 1 8 5000 500 669 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 405B 1 8 500 2000 5,084 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 405B 1 8 1000 1000 3,400 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 405B 1 8 2048 2048 2,941 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 405B 1 8 20000 2000 535 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 70B 1 1 128 128 4,021 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 70B 1 1 128 2048 4,166 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 70B 1 2 128 4096 6,527 output tokens/sec 2x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 70B 1 1 2048 128 466 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 70B 1 1 5000 500 560 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 70B 1 2 500 2000 6,848 output tokens/sec 2x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 70B 1 1 1000 1000 2,823 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 70B 1 2 2048 2048 4,184 output tokens/sec 2x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 70B 1 2 20000 2000 641 output tokens/sec 2x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 8B 1 1 128 128 29,526 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 8B 1 1 128 2048 25,399 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 8B 1 1 128 4096 17,371 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 8B 1 1 2048 128 3,794 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 8B 1 1 5000 500 3,988 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 8B 1 1 500 2000 21,021 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 8B 1 1 1000 1000 17,538 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 8B 1 1 2048 2048 11,969 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Llama v3.1 8B 1 1 20000 2000 1,804 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mistral 7B 1 1 128 128 31,938 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mistral 7B 1 1 128 2048 27,409 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mistral 7B 1 1 128 4096 18,505 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mistral 7B 1 1 2048 128 3,834 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mistral 7B 1 1 5000 500 4,042 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mistral 7B 1 1 500 2000 22,355 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mistral 7B 1 1 1000 1000 18,426 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mistral 7B 1 1 2048 2048 12,347 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mistral 7B 1 1 20000 2000 1,823 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mixtral 8x7B 1 1 128 128 17,158 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mixtral 8x7B 1 1 128 2048 15,095 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mixtral 8x7B 1 2 128 4096 21,565 output tokens/sec 2x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mixtral 8x7B 1 1 2048 128 2,010 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mixtral 8x7B 1 1 5000 500 2,309 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mixtral 8x7B 1 1 500 2000 12,105 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mixtral 8x7B 1 1 1000 1000 10,371 output tokens/sec 1x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mixtral 8x7B 1 2 2048 2048 14,018 output tokens/sec 2x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mixtral 8x7B 1 2 20000 2000 2,227 output tokens/sec 2x H200 DGX H200 FP8 TensorRT-LLM 0.15.0 NVIDIA H200
Mixtral 8x22B 1 8 128 128 25,179 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.14.0 NVIDIA H200
Mixtral 8x22B 1 8 128 2048 32,623 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.15.0 NVIDIA H200
Mixtral 8x22B 1 8 128 4096 25,753 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mixtral 8x22B 1 8 2048 128 3,095 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.15.0 NVIDIA H200
Mixtral 8x22B 1 8 5000 500 4,209 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.15.0 NVIDIA H200
Mixtral 8x22B 1 8 500 2000 27,430 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mixtral 8x22B 1 8 1000 1000 20,097 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.15.0 NVIDIA H200
Mixtral 8x22B 1 8 2048 2048 15,799 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.17.0 NVIDIA H200
Mixtral 8x22B 1 8 20000 2000 2,897 output tokens/sec 8x H200 DGX H200 FP8 TensorRT-LLM 0.14.0 NVIDIA H200

TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, see the NVIDIA Llama v3.1 405B technical blog post.
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency)
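In the tables above, the PP and TP columns multiply to give the GPU count for each row, and the throughput column follows the formula in the note above (total generated tokens divided by total latency). A small helper with illustrative numbers (the token count and latency below are assumed for the example, not measured values):

```python
def gpu_count(pp: int, tp: int) -> int:
    # Pipeline and tensor parallelism compose multiplicatively:
    # each pipeline stage is itself sharded across tp GPUs.
    return pp * tp

def output_tokens_per_sec(total_generated_tokens: int, total_latency_s: float) -> float:
    # Inclusive of the time to generate the first token.
    return total_generated_tokens / total_latency_s

print(gpu_count(1, 8))  # 8, e.g. the 8x H200 Llama v3.1 405B rows
print(round(output_tokens_per_sec(128_000, 33.0), 1))  # 3878.8
```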

GH200 Inference Performance - Max Throughput

Model PP TP Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Llama v3.1 70B 1 1 128 128 3,637 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Llama v3.1 70B 1 4 128 2048 10,358 output tokens/sec 4x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.13.0 NVIDIA GH200 96B
Llama v3.1 70B 1 4 128 4096 6,628 output tokens/sec 4x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.13.0 NVIDIA GH200 96B
Llama v3.1 70B 1 1 2048 128 425 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Llama v3.1 70B 1 1 5000 500 422 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Llama v3.1 70B 1 4 500 2000 9,091 output tokens/sec 4x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.13.0 NVIDIA GH200 96B
Llama v3.1 70B 1 1 1000 1000 1,746 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Llama v3.1 70B 1 4 2048 2048 4,865 output tokens/sec 4x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.13.0 NVIDIA GH200 96B
Llama v3.1 70B 1 4 20000 2000 959 output tokens/sec 4x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.13.0 NVIDIA GH200 96B
Llama v3.1 8B 1 1 128 128 29,853 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Llama v3.1 8B 1 1 128 2048 21,770 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Llama v3.1 8B 1 1 128 4096 14,190 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Llama v3.1 8B 1 1 2048 128 3,844 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Llama v3.1 8B 1 1 5000 500 3,933 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Llama v3.1 8B 1 1 500 2000 17,137 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Llama v3.1 8B 1 1 1000 1000 16,483 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Llama v3.1 8B 1 1 2048 2048 10,266 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Llama v3.1 8B 1 1 20000 2000 1,560 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mistral 7B 1 1 128 128 32,498 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mistral 7B 1 1 128 2048 23,337 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mistral 7B 1 1 128 4096 15,018 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mistral 7B 1 1 2048 128 3,813 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mistral 7B 1 1 5000 500 3,950 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mistral 7B 1 1 500 2000 18,556 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mistral 7B 1 1 1000 1000 17,252 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mistral 7B 1 1 2048 2048 10,756 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mistral 7B 1 1 20000 2000 1,601 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mixtral 8x7B 1 1 128 128 16,859 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mixtral 8x7B 1 1 128 2048 11,120 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mixtral 8x7B 1 4 128 4096 30,066 output tokens/sec 4x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.13.0 NVIDIA GH200 96B
Mixtral 8x7B 1 1 2048 128 1,994 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mixtral 8x7B 1 1 5000 500 2,078 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mixtral 8x7B 1 1 500 2000 9,193 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mixtral 8x7B 1 1 1000 1000 8,849 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mixtral 8x7B 1 1 2048 2048 5,545 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B
Mixtral 8x7B 1 1 20000 2000 861 output tokens/sec 1x GH200 NVIDIA Grace Hopper x4 P4496 FP8 TensorRT-LLM 0.17.0 NVIDIA GH200 96B

TP: Tensor Parallelism
PP: Pipeline Parallelism

H100 Inference Performance - Max Throughput

Model PP TP Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Llama v3.1 70B 1 1 128 128 3,378 output tokens/sec 1x H100 DGX H100 FP8 TensorRT-LLM 0.17.0 H100-SXM5-80GB
Llama v3.1 70B 1 2 128 4096 3,897 output tokens/sec 2x H100 DGX H100 FP8 TensorRT-LLM 0.17.0 H100-SXM5-80GB
Llama v3.1 70B 1 2 2048 128 774 output tokens/sec 2x H100 DGX H100 FP8 TensorRT-LLM 0.15.0 H100-SXM5-80GB
Llama v3.1 70B 1 2 500 2000 4,973 output tokens/sec 2x H100 DGX H100 FP8 TensorRT-LLM 0.17.0 H100-SXM5-80GB
Llama v3.1 70B 1 2 1000 1000 4,391 output tokens/sec 2x H100 DGX H100 FP8 TensorRT-LLM 0.17.0 H100-SXM5-80GB
Llama v3.1 70B 1 2 2048 2048 2,898 output tokens/sec 2x H100 DGX H100 FP8 TensorRT-LLM 0.17.0 H100-SXM5-80GB
Llama v3.1 70B 1 4 20000 2000 920 output tokens/sec 4x H100 DGX H100 FP8 TensorRT-LLM 0.17.0 H100-SXM5-80GB
Mixtral 8x7B 1 1 128 128 15,962 output tokens/sec 1x H100 DGX H100 FP8 TensorRT-LLM 0.17.0 H100-SXM5-80GB
Mixtral 8x7B 1 2 128 2048 23,010 output tokens/sec 2x H100 DGX H100 FP8 TensorRT-LLM 0.15.0 H100-SXM5-80GB
Mixtral 8x7B 1 2 128 4096 14,237 output tokens/sec 2x H100 DGX H100 FP8 TensorRT-LLM 0.17.0 H100-SXM5-80GB
Mixtral 8x7B 1 1 2048 128 1,893 output tokens/sec 1x H100 DGX H100 FP8 TensorRT-LLM 0.17.0 H100-SXM5-80GB
Mixtral 8x7B 1 2 5000 500 3,646 output tokens/sec 2x H100 DGX H100 FP8 TensorRT-LLM 0.17.0 H100-SXM5-80GB
Mixtral 8x7B 1 2 500 2000 18,186 output tokens/sec 2x H100 DGX H100 FP8 TensorRT-LLM 0.14.0 H100-SXM5-80GB
Mixtral 8x7B 1 2 1000 1000 15,932 output tokens/sec 2x H100 DGX H100 FP8 TensorRT-LLM 0.14.0 H100-SXM5-80GB
Mixtral 8x7B 1 2 2048 2048 10,686 output tokens/sec 2x H100 DGX H100 FP8 TensorRT-LLM 0.17.0 H100-SXM5-80GB
Mixtral 8x7B 1 2 20000 2000 1,757 output tokens/sec 2x H100 DGX H100 FP8 TensorRT-LLM 0.17.0 H100-SXM5-80GB

TP: Tensor Parallelism
PP: Pipeline Parallelism

L40S Inference Performance - Max Throughput

Model PP TP Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Llama v3.1 8B 1 1 128 128 9,105 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.17.0 NVIDIA L40S
Llama v3.1 8B 1 1 128 2048 5,366 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.17.0 NVIDIA L40S
Llama v3.1 8B 1 1 128 4096 3,026 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.17.0 NVIDIA L40S
Llama v3.1 8B 1 1 2048 128 1,067 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.17.0 NVIDIA L40S
Llama v3.1 8B 1 1 5000 500 981 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.17.0 NVIDIA L40S
Llama v3.1 8B 1 1 500 2000 4,274 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.17.0 NVIDIA L40S
Llama v3.1 8B 1 1 1000 1000 4,055 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.17.0 NVIDIA L40S
Llama v3.1 8B 1 1 2048 2048 2,225 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.17.0 NVIDIA L40S
Llama v3.1 8B 1 1 20000 2000 328 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.17.0 NVIDIA L40S
Mixtral 8x7B 4 1 128 128 15,278 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.15.0 NVIDIA L40S
Mixtral 8x7B 2 2 128 2048 9,087 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.15.0 NVIDIA L40S
Mixtral 8x7B 1 4 128 4096 5,736 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.17.0 NVIDIA L40S
Mixtral 8x7B 4 1 2048 128 2,098 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.15.0 NVIDIA L40S
Mixtral 8x7B 2 2 5000 500 1,558 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.15.0 NVIDIA L40S
Mixtral 8x7B 2 2 500 2000 7,974 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.15.0 NVIDIA L40S
Mixtral 8x7B 2 2 1000 1000 6,579 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.15.0 NVIDIA L40S
Mixtral 8x7B 2 2 2048 2048 4,217 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.15.0 NVIDIA L40S

TP: Tensor Parallelism
PP: Pipeline Parallelism

Inference Performance of NVIDIA Data Center Products

H200 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
Stable Diffusion v2.1 (512x512) 1 4.33 images/sec - 231.26 1x H200 DGX H200 24.10-py3 INT8 Synthetic TensorRT 10.5.0.26 NVIDIA H200
4 6.8 images/sec - 588.08 1x H200 DGX H200 24.10-py3 INT8 Synthetic TensorRT 10.5.0.26 NVIDIA H200
Stable Diffusion XL 1 0.86 images/sec - 1157.27 1x H200 DGX H200 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA H200
ResNet-50v1.5 8 20,758 images/sec 67 images/sec/watt 0.39 1x H200 DGX H200 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA H200
128 64,817 images/sec 107 images/sec/watt 1.97 1x H200 DGX H200 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA H200
EfficientNet-B0 8 16,727 images/sec 77 images/sec/watt 0.48 1x H200 DGX H200 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA H200
128 56,866 images/sec 122 images/sec/watt 2.25 1x H200 DGX H200 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA H200
EfficientNet-B4 8 4,523 images/sec 14 images/sec/watt 1.77 1x H200 DGX H200 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA H200
128 8,993 images/sec 15 images/sec/watt 14.23 1x H200 DGX H200 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA H200
HF Swin Base 8 4,938 samples/sec 11 samples/sec/watt 1.62 1x H200 DGX H200 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA H200
32 8,091 samples/sec 12 samples/sec/watt 3.95 1x H200 DGX H200 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA H200
HF Swin Large 8 3,330 samples/sec 6 samples/sec/watt 2.4 1x H200 DGX H200 25.03-py3 Mixed Synthetic TensorRT 10.9.0.34 NVIDIA H200
32 4,694 samples/sec 7 samples/sec/watt 6.82 1x H200 DGX H200 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA H200
HF ViT Base 8 8,695 samples/sec 19 samples/sec/watt 0.92 1x H200 DGX H200 25.03-py3 FP8 Synthetic TensorRT 10.9.0.34 NVIDIA H200
64 15,570 samples/sec 23 samples/sec/watt 4.11 1x H200 DGX H200 25.03-py3 FP8 Synthetic TensorRT 10.9.0.34 NVIDIA H200
HF ViT Large 8 3,634 samples/sec 6 samples/sec/watt 2.2 1x H200 DGX H200 25.03-py3 FP8 Synthetic TensorRT 10.9.0.34 NVIDIA H200
64 5,454 samples/sec 8 samples/sec/watt 11.74 1x H200 DGX H200 25.03-py3 FP8 Synthetic TensorRT 10.9.0.34 NVIDIA H200
QuartzNet 8 6,755 samples/sec 24 samples/sec/watt 1.18 1x H200 DGX H200 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA H200
128 34,234 samples/sec 90 samples/sec/watt 3.74 1x H200 DGX H200 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA H200
RetinaNet-RN34 8 3,024 images/sec 8 images/sec/watt 2.65 1x H200 DGX H200 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA H200

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256
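The Efficiency column is throughput divided by average board power, and the Latency column is the time to process one batch at the reported throughput. A sketch of both relationships (the 600 W power draw is an assumed round number for illustration; the throughput and batch size are taken from the ResNet-50v1.5 batch-128 row above):

```python
def images_per_sec_per_watt(throughput_imgs_per_sec: float, avg_power_w: float) -> float:
    # Efficiency normalizes throughput by average board power.
    return throughput_imgs_per_sec / avg_power_w

def batch_latency_ms(batch_size: int, throughput_imgs_per_sec: float) -> float:
    # Time to complete one batch at steady-state throughput.
    return batch_size / throughput_imgs_per_sec * 1000.0

print(round(images_per_sec_per_watt(64817, 600)))  # ~108 at an assumed 600 W draw
print(round(batch_latency_ms(128, 64817), 2))  # 1.97 ms, matching the ResNet-50 row
```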

GH200 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
Stable Diffusion v2.1 (512x512) 1 4.27 images/sec - 234.4 1x GH200 NVIDIA P3880 24.09-py3 INT8 Synthetic TensorRT 10.4.0.26 GH200 96GB
4 5.82 images/sec - 687.91 1x GH200 NVIDIA P3880 24.09-py3 INT8 Synthetic TensorRT 10.4.0.26 GH200 96GB
Stable Diffusion XL 1 0.68 images/sec - 1149.44 1x GH200 NVIDIA P3880 24.10-py3 INT8 Synthetic TensorRT 10.5.0 GH200 96GB
ResNet-50v1.5 8 20,736 images/sec 61 images/sec/watt 0.39 1x GH200 NVIDIA P3880 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 GH200 144GB HBM3e
128 66,791 images/sec 106 images/sec/watt 1.92 1x GH200 NVIDIA P3880 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 GH200 144GB HBM3e
EfficientNet-B0 8 16,814 images/sec 68 images/sec/watt 0.48 1x GH200 NVIDIA P3880 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 GH200 144GB HBM3e
128 57,461 images/sec 117 images/sec/watt 2.23 1x GH200 NVIDIA P3880 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 GH200 144GB HBM3e
EfficientNet-B4 8 4,489 images/sec 13 images/sec/watt 1.78 1x GH200 NVIDIA P3880 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 GH200 144GB HBM3e
128 8,992 images/sec 15 images/sec/watt 14.24 1x GH200 NVIDIA P3880 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 GH200 144GB HBM3e
HF Swin Base 8 4,894 samples/sec 11 samples/sec/watt 1.63 1x GH200 NVIDIA P3880 25.03-py3 Mixed Synthetic TensorRT 10.9.0.34 GH200 144GB HBM3e
32 8,003 samples/sec 12 samples/sec/watt 4 1x GH200 NVIDIA P3880 25.03-py3 Mixed Synthetic TensorRT 10.9.0.34 GH200 144GB HBM3e
HF Swin Large 8 3,300 samples/sec 6 samples/sec/watt 2.42 1x GH200 NVIDIA P3880 25.03-py3 Mixed Synthetic TensorRT 10.9.0.34 GH200 144GB HBM3e
32 4,495 samples/sec 7 samples/sec/watt 7.12 1x GH200 NVIDIA P3880 25.03-py3 Mixed Synthetic TensorRT 10.9.0.34 GH200 144GB HBM3e
HF ViT Base 8 8,588 samples/sec 19 samples/sec/watt 0.93 1x GH200 NVIDIA P3880 25.03-py3 FP8 Synthetic TensorRT 10.9.0.34 GH200 144GB HBM3e
64 15,089 samples/sec 23 samples/sec/watt 4.24 1x GH200 NVIDIA P3880 25.03-py3 FP8 Synthetic TensorRT 10.9.0.34 GH200 144GB HBM3e
HF ViT Large 8 3,707 samples/sec 6 samples/sec/watt 2.16 1x GH200 NVIDIA P3880 24.12-py3 FP8 Synthetic TensorRT 10.7.0 GH200 96GB
64 5,703 samples/sec 7 samples/sec/watt 11.22 1x GH200 NVIDIA P3880 24.12-py3 FP8 Synthetic TensorRT 10.7.0 GH200 96GB
QuartzNet 8 6,763 samples/sec 22 samples/sec/watt 1.18 1x GH200 NVIDIA P3880 25.03-py3 Mixed Synthetic TensorRT 10.9.0.34 GH200 144GB HBM3e
128 34,497 samples/sec 88 samples/sec/watt 3.71 1x GH200 NVIDIA P3880 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 GH200 144GB HBM3e
RetinaNet-RN34 8 2,971 images/sec 5 images/sec/watt 2.69 1x GH200 NVIDIA P3880 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 GH200 144GB HBM3e

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256

H100 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
Stable Diffusion v2.1 (512x512) 1 4.22 images/sec - 236.8 1x H100 DGX H100 24.10-py3 INT8 Synthetic TensorRT 10.5.0.26 H100 SXM5-80GB
4 6.41 images/sec - 624.6 1x H100 DGX H100 24.10-py3 INT8 Synthetic TensorRT 10.5.0.26 H100 SXM5-80GB
Stable Diffusion XL 1 0.83 images/sec - 1210.08 1x H100 DGX H100 24.10-py3 INT8 Synthetic TensorRT 10.5.0 H100 SXM5-80GB
ResNet-50v1.5 8 21,620 images/sec 63 images/sec/watt 0.37 1x H100 DGX H100 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 H100-SXM5-80GB
128 59,718 images/sec 99 images/sec/watt 2.14 1x H100 DGX H100 25.01-py3 INT8 Synthetic TensorRT 10.8.0.40 H100-SXM5-80GB
EfficientNet-B0 8 16,425 images/sec 67 images/sec/watt 0.49 1x H100 DGX H100 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 H100-SXM5-80GB
128 55,418 images/sec 115 images/sec/watt 2.31 1x H100 DGX H100 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 H100-SXM5-80GB
EfficientNet-B4 8 4,544 images/sec 13 images/sec/watt 1.76 1x H100 DGX H100 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 H100-SXM5-80GB
128 8,149 images/sec 14 images/sec/watt 15.71 1x H100 DGX H100 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 H100-SXM5-80GB
HF Swin Base 8 4,677 samples/sec 10 samples/sec/watt 1.71 1x H100 DGX H100 25.03-py3 Mixed Synthetic TensorRT 10.9.0.34 H100-SXM5-80GB
32 7,238 samples/sec 11 samples/sec/watt 4.42 1x H100 DGX H100 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 H100-SXM5-80GB
HF Swin Large 8 3,102 samples/sec 6 samples/sec/watt 2.58 1x H100 DGX H100 25.03-py3 Mixed Synthetic TensorRT 10.9.0.34 H100-SXM5-80GB
32 4,396 samples/sec 6 samples/sec/watt 7.28 1x H100 DGX H100 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 H100-SXM5-80GB
HF ViT Base 8 8,280 samples/sec 17 samples/sec/watt 0.97 1x H100 DGX H100 25.03-py3 FP8 Synthetic TensorRT 10.9.0.34 H100-SXM5-80GB
64 13,907 samples/sec 21 samples/sec/watt 4.6 1x H100 DGX H100 25.03-py3 FP8 Synthetic TensorRT 10.9.0.34 H100-SXM5-80GB
HF ViT Large 8 3,691 samples/sec 5 samples/sec/watt 2.17 1x H100 DGX H100 24.12-py3 FP8 Synthetic TensorRT 10.7.0.23 H100-SXM5-80GB
64 5,323 samples/sec 8 samples/sec/watt 12.02 1x H100 DGX H100 25.03-py3 FP8 Synthetic TensorRT 10.9.0.34 H100-SXM5-80GB
QuartzNet 8 6,774 samples/sec 22 samples/sec/watt 1.18 1x H100 DGX H100 25.03-py3 Mixed Synthetic TensorRT 10.9.0.34 H100-SXM5-80GB
128 35,152 samples/sec 95 samples/sec/watt 3.64 1x H100 DGX H100 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 H100-SXM5-80GB
RetinaNet-RN34 8 2,759 images/sec 15 images/sec/watt 2.9 1x H100 DGX H100 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 H100-SXM5-80GB

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256
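The latency and throughput columns above are consistent with simple batch arithmetic: for batched synthetic-input inference, latency is roughly batch size divided by throughput. A minimal sketch (the helper name is ours, not part of the benchmark suite), using the H100 ResNet-50v1.5 rows above:

```python
# Sanity-check: reported latency should be close to batch_size / throughput
# for a fully pipelined batched run. Figures are copied from the H100 table.

def implied_latency_ms(batch_size: int, throughput_per_sec: float) -> float:
    """Latency in milliseconds implied by a batch size and a throughput."""
    return batch_size / throughput_per_sec * 1000.0

# ResNet-50v1.5 on H100: BS 8 at 21,620 images/sec -> ~0.37 ms (table: 0.37)
print(round(implied_latency_ms(8, 21_620), 2))
# ResNet-50v1.5 on H100: BS 128 at 59,718 images/sec -> ~2.14 ms (table: 2.14)
print(round(implied_latency_ms(128, 59_718), 2))
```

Small discrepancies elsewhere in the tables come from rounding in the published throughput figures.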

L40S Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
Stable Diffusion v2.1 (512x512) 1 2.49 images/sec - 401.48 1x L40S Supermicro SYS-521GE-TNRT 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA L40S
4 2.91 images/sec - 1372.72 1x L40S Supermicro SYS-521GE-TNRT 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA L40S
Stable Diffusion XL 1 0.37 images/sec - 2678.19 1x L40S Supermicro SYS-521GE-TNRT 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA L40S
ResNet-50v1.5 8 22,998 images/sec 70 images/sec/watt 0.35 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA L40S
32 28,845 images/sec 83 images/sec/watt 4.44 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA L40S
EfficientDet-D0 8 4,680 images/sec 16 images/sec/watt 1.71 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA L40S
EfficientNet-B0 8 20,539 images/sec 95 images/sec/watt 0.39 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA L40S
32 42,709 images/sec 127 images/sec/watt 3 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA L40S
EfficientNet-B4 8 5,163 images/sec 17 images/sec/watt 1.55 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA L40S
16 4,034 images/sec 12 images/sec/watt 31.73 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA L40S
HF Swin Base 8 3,773 samples/sec 11 samples/sec/watt 2.12 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 Mixed Synthetic TensorRT 10.9.0.34 NVIDIA L40S
16 4,258 samples/sec 12 samples/sec/watt 7.52 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA L40S
HF Swin Large 8 1,933 samples/sec 6 samples/sec/watt 4.14 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA L40S
16 1,999 samples/sec 6 samples/sec/watt 16.01 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA L40S
HF ViT Base 8 6,137 samples/sec 18 samples/sec/watt 1.3 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 FP8 Synthetic TensorRT 10.9.0.34 NVIDIA L40S
HF ViT Large 8 1,978 samples/sec 6 samples/sec/watt 4.05 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 FP8 Synthetic TensorRT 10.9.0.34 NVIDIA L40S
QuartzNet 8 7,559 samples/sec 29 samples/sec/watt 1.06 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA L40S
128 22,020 samples/sec 63 samples/sec/watt 5.81 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA L40S
RetinaNet-RN34 8 1,466 images/sec 6 images/sec/watt 5.46 1x L40S Supermicro SYS-521GE-TNRT 25.03-py3 INT8 Synthetic TensorRT 10.9.0.34 NVIDIA L40S

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256

L4 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
Stable Diffusion v2.1 (512x512) 1 0.82 images/sec - 1221.73 1x L4 GIGABYTE G482-Z54-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA L4
Stable Diffusion XL 1 0.11 images/sec - 9098.4 1x L4 GIGABYTE G482-Z54-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA L4
ResNet-50v1.5 8 9,649 images/sec 134 images/sec/watt 0.83 1x L4 GIGABYTE G482-Z54-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA L4
32 10,101 images/sec 111 images/sec/watt 16.27 1x L4 GIGABYTE G482-Z54-00 24.12-py3 INT8 Synthetic TensorRT 10.7.0 NVIDIA L4
BERT-BASE 8 3,323 sequences/sec 46 sequences/sec/watt 2.41 1x L4 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA L4
24 4,052 sequences/sec 56 sequences/sec/watt 5.92 1x L4 GIGABYTE G482-Z54-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA L4
BERT-LARGE 8 1,081 sequences/sec 15 sequences/sec/watt 7.4 1x L4 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA L4
13 1,314 sequences/sec 19 sequences/sec/watt 9.9 1x L4 GIGABYTE G482-Z54-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA L4
EfficientNet-B4 8 1,844 images/sec 26 images/sec/watt 4.34 1x L4 GIGABYTE G482-Z54-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA L4
HF Swin Base 8 1,221 samples/sec 17 samples/sec/watt 6.55 1x L4 GIGABYTE G482-Z54-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA L4
HF Swin Large 8 621 samples/sec 9 samples/sec/watt 12.89 1x L4 GIGABYTE G482-Z54-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA L4
HF ViT Base 16 1,844 samples/sec 26 samples/sec/watt 4.34 1x L4 GIGABYTE G482-Z54-00 25.02-py3 FP8 Synthetic TensorRT 10.8.0.43 NVIDIA L4
HF ViT Large 8 617 samples/sec 9 samples/sec/watt 12.96 1x L4 GIGABYTE G482-Z54-00 25.02-py3 FP8 Synthetic TensorRT 10.8.0.43 NVIDIA L4
Megatron BERT Large QAT 24 1,789 sequences/sec 25 sequences/sec/watt 13.42 1x L4 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA L4
QuartzNet 8 3,886 samples/sec 54 samples/sec/watt 2.06 1x L4 GIGABYTE G482-Z54-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA L4
128 6,144 samples/sec 85 samples/sec/watt 20.83 1x L4 GIGABYTE G482-Z54-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA L4
RetinaNet-RN34 8 355 images/sec 5 images/sec/watt 22.51 1x L4 GIGABYTE G482-Z54-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA L4

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
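The Efficiency column is throughput per watt, so dividing throughput by efficiency recovers the approximate average board power during the run. A small sketch (helper name is ours) using the L4 rows above; both work out to roughly the L4's rated 72 W:

```python
# Efficiency = throughput / power, so power = throughput / efficiency.
# Figures are copied from the L4 table; results are approximate because
# the published efficiency values are rounded.

def implied_power_watts(throughput: float, per_watt: float) -> float:
    """Approximate average board power (W) implied by an efficiency figure."""
    return throughput / per_watt

# ResNet-50v1.5, BS 8: 9,649 images/sec at 134 images/sec/watt -> ~72 W
print(round(implied_power_watts(9_649, 134), 1))
# QuartzNet, BS 8: 3,886 samples/sec at 54 samples/sec/watt -> ~72 W
print(round(implied_power_watts(3_886, 54), 1))
```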

A40 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
ResNet-50v1.5 8 11,177 images/sec 40 images/sec/watt 0.72 1x A40 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A40
128 15,473 images/sec 52 images/sec/watt 8.27 1x A40 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A40
BERT-BASE 8 4,257 sequences/sec 15 sequences/sec/watt 1.88 1x A40 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA A40
128 5,667 sequences/sec 19 sequences/sec/watt 22.59 1x A40 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA A40
BERT-LARGE 8 1,573 sequences/sec 5 sequences/sec/watt 5.08 1x A40 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA A40
128 1,966 sequences/sec 7 sequences/sec/watt 65.11 1x A40 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA A40
EfficientNet-B0 8 11,130 images/sec 61 images/sec/watt 0.72 1x A40 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A40
128 20,078 images/sec 67 images/sec/watt 6.38 1x A40 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A40
EfficientNet-B4 8 2,145 images/sec 8 images/sec/watt 3.73 1x A40 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A40
128 2,689 images/sec 9 images/sec/watt 47.59 1x A40 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A40
HF Swin Base 8 1,697 samples/sec 6 samples/sec/watt 4.71 1x A40 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A40
32 1,842 samples/sec 6 samples/sec/watt 17.38 1x A40 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A40
HF Swin Large 8 959 samples/sec 3 samples/sec/watt 8.34 1x A40 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A40
32 1,010 samples/sec 3 samples/sec/watt 31.68 1x A40 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A40
HF ViT Base 8 2,175 samples/sec 7 samples/sec/watt 3.68 1x A40 GIGABYTE G482-Z52-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA A40
64 2,324 samples/sec 8 samples/sec/watt 27.54 1x A40 GIGABYTE G482-Z52-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA A40
HF ViT Large 8 694 samples/sec 2 samples/sec/watt 11.53 1x A40 GIGABYTE G482-Z52-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA A40
64 750 samples/sec 2 samples/sec/watt 85.34 1x A40 GIGABYTE G482-Z52-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA A40
Megatron BERT Large QAT 8 2,059 sequences/sec 7 sequences/sec/watt 3.89 1x A40 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA A40
128 2,650 sequences/sec 9 sequences/sec/watt 48.31 1x A40 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA A40
QuartzNet 8 4,388 samples/sec 21 samples/sec/watt 1.82 1x A40 GIGABYTE G482-Z52-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA A40
128 8,453 samples/sec 28 samples/sec/watt 15.14 1x A40 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A40
RetinaNet-RN34 8 706 images/sec 2 images/sec/watt 11.34 1x A40 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A40

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

A30 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
ResNet-50v1.5 8 10,261 images/sec 71 images/sec/watt 0.78 1x A30 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A30
128 16,465 images/sec 101 images/sec/watt 7.77 1x A30 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A30
BERT-BASE 1 For Batch Size 1, please refer to the Triton Inference Server page
2 For Batch Size 2, please refer to the Triton Inference Server page
8 4,334 sequences/sec 26 sequences/sec/watt 1.85 1x A30 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA A30
128 5,820 sequences/sec 35 sequences/sec/watt 21.99 1x A30 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA A30
BERT-LARGE 1 For Batch Size 1, please refer to the Triton Inference Server page
2 For Batch Size 2, please refer to the Triton Inference Server page
8 1,500 sequences/sec 10 sequences/sec/watt 5.33 1x A30 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA A30
128 2,053 sequences/sec 13 sequences/sec/watt 62.34 1x A30 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA A30
EfficientNet-B0 8 8,993 images/sec 81 images/sec/watt 0.89 1x A30 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A30
128 17,119 images/sec 105 images/sec/watt 7.48 1x A30 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A30
EfficientNet-B4 8 1,875 images/sec 13 images/sec/watt 4.27 1x A30 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A30
128 2,397 images/sec 15 images/sec/watt 53.4 1x A30 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A30
HF Swin Base 8 1,646 samples/sec 10 samples/sec/watt 4.86 1x A30 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A30
32 1,851 samples/sec 11 samples/sec/watt 17.28 1x A30 GIGABYTE G482-Z52-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA A30
HF Swin Large 8 907 samples/sec 6 samples/sec/watt 8.82 1x A30 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A30
32 1,000 samples/sec 6 samples/sec/watt 32 1x A30 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A30
HF ViT Base 8 2,058 samples/sec 13 samples/sec/watt 3.89 1x A30 GIGABYTE G482-Z52-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA A30
64 2,271 samples/sec 14 samples/sec/watt 28.18 1x A30 GIGABYTE G482-Z52-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA A30
HF ViT Large 8 675 samples/sec 4 samples/sec/watt 11.86 1x A30 GIGABYTE G482-Z52-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA A30
64 708 samples/sec 4 samples/sec/watt 90.34 1x A30 GIGABYTE G482-Z52-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA A30
QuartzNet 8 3,434 samples/sec 29 samples/sec/watt 2.33 1x A30 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A30
128 9,997 samples/sec 73 samples/sec/watt 12.8 1x A30 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A30
RetinaNet-RN34 8 703 images/sec 4 images/sec/watt 11.39 1x A30 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A30

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256

A10 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
ResNet-50v1.5 8 8,499 images/sec 57 images/sec/watt 0.94 1x A10 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A10
128 10,654 images/sec 71 images/sec/watt 12.01 1x A10 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A10
BERT-BASE 1 For Batch Size 1, please refer to the Triton Inference Server page
2 For Batch Size 2, please refer to the Triton Inference Server page
8 3,109 sequences/sec 21 sequences/sec/watt 2.57 1x A10 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA A10
128 3,822 sequences/sec 26 sequences/sec/watt 33.49 1x A10 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.5.0 NVIDIA A10
BERT-LARGE 1 For Batch Size 1, please refer to the Triton Inference Server page
2 For Batch Size 2, please refer to the Triton Inference Server page
8 1,086 sequences/sec 7 sequences/sec/watt 7.36 1x A10 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.6.0 NVIDIA A10
128 1,265 sequences/sec 8 sequences/sec/watt 101.17 1x A10 GIGABYTE G482-Z52-00 24.10-py3 INT8 Synthetic TensorRT 10.6.0 NVIDIA A10
EfficientNet-B0 8 9,679 images/sec 65 images/sec/watt 0.83 1x A10 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A10
128 14,418 images/sec 96 images/sec/watt 8.88 1x A10 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A10
EfficientNet-B4 8 1,633 images/sec 11 images/sec/watt 4.9 1x A10 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A10
128 1,863 images/sec 12 images/sec/watt 68.72 1x A10 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A10
HF Swin Base 8 1,214 samples/sec 8 samples/sec/watt 6.59 1x A10 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A10
32 1,258 samples/sec 8 samples/sec/watt 25.44 1x A10 GIGABYTE G482-Z52-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA A10
HF Swin Large 8 623 samples/sec 4 samples/sec/watt 12.84 1x A10 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A10
32 656 samples/sec 4 samples/sec/watt 48.75 1x A10 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A10
HF ViT Base 8 1,370 samples/sec 9 samples/sec/watt 5.84 1x A10 GIGABYTE G482-Z52-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA A10
64 1,503 samples/sec 10 samples/sec/watt 42.59 1x A10 GIGABYTE G482-Z52-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA A10
HF ViT Large 8 453 samples/sec 3 samples/sec/watt 17.68 1x A10 GIGABYTE G482-Z52-00 25.02-py3 Mixed Synthetic TensorRT 10.8.0.43 NVIDIA A10
Megatron BERT Large QAT 8 1,566 sequences/sec 10 sequences/sec/watt 5.11 1x A10 GIGABYTE G482-Z52-00 24.12-py3 INT8 Synthetic TensorRT 10.7.0 NVIDIA A10
128 1,801 sequences/sec 12 sequences/sec/watt 71.06 1x A10 GIGABYTE G482-Z52-00 24.12-py3 INT8 Synthetic TensorRT 10.7.0 NVIDIA A10
QuartzNet 8 3,842 samples/sec 26 samples/sec/watt 2.08 1x A10 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A10
128 5,867 samples/sec 39 samples/sec/watt 21.82 1x A10 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A10
RetinaNet-RN34 8 516 images/sec 4 images/sec/watt 15.5 1x A10 GIGABYTE G482-Z52-00 25.02-py3 INT8 Synthetic TensorRT 10.8.0.43 NVIDIA A10

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

Inference Performance of NVIDIA GPUs in the Cloud

A100 Inference Performance in the Cloud

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
ResNet-50v1.5 8 13,768 images/sec - images/sec/watt 0.58 1x A100 GCP A2-HIGHGPU-1G 23.10-py3 INT8 Synthetic - A100-SXM4-40GB
128 30,338 images/sec - images/sec/watt 4.22 1x A100 GCP A2-HIGHGPU-1G 23.10-py3 INT8 Synthetic - A100-SXM4-40GB
BERT-LARGE 8 2,308 sequences/sec - sequences/sec/watt 3.47 1x A100 GCP A2-HIGHGPU-1G 23.10-py3 INT8 Synthetic - A100-SXM4-40GB
128 4,045 sequences/sec - sequences/sec/watt 31.64 1x A100 GCP A2-HIGHGPU-1G 23.10-py3 INT8 Synthetic - A100-SXM4-40GB

BERT Large: Sequence Length = 128
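The A100 rows above also illustrate the usual batching tradeoff: larger batches raise throughput at the cost of latency. A short sketch using the ResNet-50v1.5 figures from the table (variable names are ours):

```python
# Batching tradeoff from the A100 (GCP) table: going from BS 8 to BS 128
# roughly 2.2x's throughput while latency grows roughly 7.3x.

throughput_gain = 30_338 / 13_768   # images/sec at BS 128 vs BS 8
latency_growth = 4.22 / 0.58        # latency (ms) at BS 128 vs BS 8
print(round(throughput_gain, 1), round(latency_growth, 1))
```

Offline, throughput-oriented serving favors the large batch; latency-sensitive serving favors the small one.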