NVIDIA Data Center Deep Learning Product Performance: AI Inference
MLPerf Inference v5.0 Performance Benchmarks
Offline Scenario, Closed Division
| Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset |
|---|---|---|---|---|---|---|
| Llama3.1 405B | 13,886 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 1,538 tokens/sec | 8x B200 | SYS-421GE-NBRT-LCC | NVIDIA B200-SXM-180GB | 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 574 tokens/sec | 8x H200 | Cisco UCS C885A M8 | NVIDIA H200-SXM-141GB | 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 98,858 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200-SXM-180GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 35,453 tokens/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024) |
| Mixtral 8x7B | 128,795 tokens/sec | 8x B200 | SYS-421GE-NBRT-LCC | NVIDIA B200-SXM-180GB | 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048) |
| Mixtral 8x7B | 63,515 tokens/sec | 8x H200 | ThinkSystem SR780a V3 | NVIDIA H200-SXM-141GB | 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048) |
| Stable Diffusion XL | 30 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200-SXM-180GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
| Stable Diffusion XL | 19 samples/sec | 8x H200 | AS-4125GS-TNHR2-LCC | NVIDIA H200-SXM-141GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
| RGAT | 450,175 samples/sec | 8x H200 | ThinkSystem SR780a V3 | NVIDIA H200-SXM-141GB | 99% of FP32 (72.86%) | IGBH |
| GPT-J | 21,626 tokens/sec | 8x H200 | ThinkSystem SR780a V3 | NVIDIA H200-SXM-141GB | 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| ResNet-50 | 773,300 samples/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | 76.46% Top1 | ImageNet (224x224) |
| RetinaNet | 15,200 samples/sec | 8x H200 | AS-4125GS-TNHR2-LCC | NVIDIA H200-SXM-141GB | 0.3755 mAP | OpenImages (800x800) |
| DLRMv2 | 654,489 samples/sec | 8x H200 | HPE Cray XD670 with Cray ClusterStor | NVIDIA H200-SXM-141GB | 99% of FP32 (AUC=80.31%) | Synthetic Multihot Criteo Dataset |
| 3D-UNET | 55 samples/sec | 8x H200 | HPE Cray XD670 with Cray ClusterStor | NVIDIA H200-SXM-141GB | 99.9% of FP32 (0.86330 mean DICE score) | KiTS 2019 |
Server Scenario - Closed Division
| Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
|---|---|---|---|---|---|---|---|
| Llama3.1 405B | 8,850 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 1,080 tokens/sec | 8x B200 | SYS-A21GE-NBRT | NVIDIA B200-SXM-180GB | 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 294 tokens/sec | 8x H200 | Cisco UCS C885A M8 | NVIDIA H200-SXM-141GB | 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B Interactive | 62,266 tokens/sec | 8x B200 | SYS-A21GE-NBRT | NVIDIA B200-SXM-180GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B Interactive | 20,235 tokens/sec | 8x H200 | G893-SD1 | NVIDIA H200-SXM-141GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 98,443 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200-SXM-180GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 33,072 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Mixtral 8x7B | 129,047 tokens/sec | 8x B200 | SYS-421GE-NBRT-LCC | NVIDIA B200-SXM-180GB | 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048) |
| Mixtral 8x7B | 61,802 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048) |
| Stable Diffusion XL | 29 samples/sec | 8x B200 | SYS-A21GE-NBRT | NVIDIA B200-SXM-180GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
| Stable Diffusion XL | 18 samples/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
| GPT-J | 21,813 queries/sec | 8x H200 | Cisco UCS C885A M8 | NVIDIA H200-SXM-141GB | 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | 20 s | CNN Dailymail |
| ResNet-50 | 676,219 queries/sec | 8x H200 | G893-SD1 | NVIDIA H200-SXM-141GB | 76.46% Top1 | 15 ms | ImageNet (224x224) |
| RetinaNet | 14,589 queries/sec | 8x H200 | AS-4125GS-TNHR2-LCC | NVIDIA H200-SXM-141GB | 0.3755 mAP | 100 ms | OpenImages (800x800) |
| DLRMv2 | 590,167 queries/sec | 8x H200 | HPE Cray XD670 with Cray ClusterStor | NVIDIA H200-SXM-141GB | 99% of FP32 (AUC=80.31%) | 60 ms | Synthetic Multihot Criteo Dataset |
MLPerf™ v5.0 Inference, Closed Division. Accuracy targets: Llama3.1 405B 99% of FP16; Llama2 70B Interactive 99.9% of FP32; Llama2 70B 99.9% of FP32; Mixtral 8x7B 99% of FP16; Stable Diffusion XL; ResNet-50 v1.5; RetinaNet; RNN-T; RGAT; 3D U-Net 99.9% of FP32; GPT-J 99.9% of FP32; DLRM 99% of FP32.
Results from entries: 5.0-0011, 5.0-0033, 5.0-0041, 5.0-0051, 5.0-0053, 5.0-0056, 5.0-0058, 5.0-0060, 5.0-0070, 5.0-0072, 5.0-0074.
The MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
Llama2 70B max sequence length = 1,024. Mixtral 8x7B max sequence length = 2,048.
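The server-scenario LLM constraints above are expressed as TTFT (time to first token) and TPOT (time per output token) limits. As a rough illustration only (not MLPerf's official LoadGen logic), a single response's token timestamps can be checked against such a limit:

```python
def meets_latency_constraint(token_times_ms, ttft_limit_ms, tpot_limit_ms):
    """Check one response against TTFT/TPOT limits (illustrative sketch).

    token_times_ms: arrival time of each output token, in ms after the
    request was issued. TTFT is the first entry; TPOT is the average gap
    between consecutive tokens.
    """
    if not token_times_ms:
        return False
    if token_times_ms[0] > ttft_limit_ms:      # first token arrived too late
        return False
    if len(token_times_ms) == 1:               # no decode gaps to measure
        return True
    tpot = (token_times_ms[-1] - token_times_ms[0]) / (len(token_times_ms) - 1)
    return tpot <= tpot_limit_ms

# Llama2 70B server constraint from the table: TTFT 2000 ms, TPOT 200 ms
print(meets_latency_constraint([1500, 1650, 1800], 2000, 200))  # True
print(meets_latency_constraint([2500, 2600], 2000, 200))        # False (TTFT too high)
```

The function names and timestamps are hypothetical; MLPerf evaluates these constraints at configured percentiles across the whole query stream, not per response.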
LLM Inference Performance of NVIDIA Data Center Products
B200 Inference Performance - Per User
| Model | Attention | MoE | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek R1 671B | TP8 | EP8 | 1,024 | 2,048 | 253 output tokens/sec/user | 8x B200 | DGX B200 | FP4 | TensorRT-LLM | NVIDIA B200 |
Attention: Tensor Parallelism = 8. MoE: Expert Parallelism = 8. TensorRT-LLM version: internal release. Batch size = 1. Input tokens are not included in TPS calculations.
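Per the note above, the per-user figure counts only generated (output) tokens and is measured at batch size 1. A minimal sketch of the metric, with an illustrative latency value (not a measured one) chosen to land near the reported 253 tokens/sec/user:

```python
def tokens_per_sec_per_user(num_output_tokens, total_latency_s, num_users=1):
    """Per-user decode throughput: generated (output) tokens only,
    divided by end-to-end latency, per concurrent user.
    Input (prompt) tokens are excluded, matching the note above."""
    return num_output_tokens / total_latency_s / num_users

# Hypothetical example: 2,048 output tokens for one user in ~8.1 s
print(round(tokens_per_sec_per_user(2048, 8.1)))  # 253
```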
B200 Inference Performance - Max Throughput
| Model | Attention | MoE | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek R1 671B | DP8 | EP8 | 1,024 | 2,048 | 30,389 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM | NVIDIA B200 |
Attention: Data Parallelism = 8. MoE: Expert Parallelism = 8. TensorRT-LLM version: internal release. Input tokens are not included in TPS calculations.
H200 Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v3.1 405B | 1 | 8 | 128 | 128 | 3,874 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 128 | 2048 | 5,938 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 128 | 4096 | 5,168 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 8 | 1 | 2048 | 128 | 764 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14a | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 5000 | 500 | 669 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 500 | 2000 | 5,084 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 3,400 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 2,941 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 535 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 1 | 128 | 128 | 4,021 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 1 | 128 | 2048 | 4,166 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 2 | 128 | 4096 | 6,527 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 1 | 2048 | 128 | 466 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 1 | 5000 | 500 | 560 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 2 | 500 | 2000 | 6,848 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 1 | 1000 | 1000 | 2,823 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 2 | 2048 | 2048 | 4,184 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 70B | 1 | 2 | 20000 | 2000 | 641 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 128 | 128 | 29,526 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 128 | 2048 | 25,399 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 17,371 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,794 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,988 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 21,021 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 17,538 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 11,969 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,804 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 128 | 128 | 31,938 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 128 | 2048 | 27,409 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 128 | 4096 | 18,505 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 2048 | 128 | 3,834 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 5000 | 500 | 4,042 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 500 | 2000 | 22,355 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 1000 | 1000 | 18,426 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 2048 | 2048 | 12,347 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mistral 7B | 1 | 1 | 20000 | 2000 | 1,823 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 128 | 128 | 17,158 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 128 | 2048 | 15,095 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 2 | 128 | 4096 | 21,565 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 2048 | 128 | 2,010 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 5000 | 500 | 2,309 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 500 | 2000 | 12,105 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 1 | 1000 | 1000 | 10,371 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 2 | 2048 | 2048 | 14,018 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x7B | 1 | 2 | 20000 | 2000 | 2,227 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 128 | 128 | 25,179 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 128 | 2048 | 32,623 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 128 | 4096 | 25,753 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 2048 | 128 | 3,095 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 5000 | 500 | 4,209 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 500 | 2000 | 27,430 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 1000 | 1000 | 20,097 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 2048 | 2048 | 15,799 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
| Mixtral 8x22B | 1 | 8 | 20000 | 2000 | 2,897 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200 |
TP: Tensor Parallelism. PP: Pipeline Parallelism. For more information on pipeline parallelism, see the Llama v3.1 405B blog. Output tokens/second on Llama v3.1 405B is inclusive of the time to generate the first token (tokens/s = total generated tokens / total latency).
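The throughput definition in the note above is simple enough to state directly. A minimal sketch, with hypothetical batch and latency numbers (not measurements from the table):

```python
def output_tokens_per_sec(total_generated_tokens, total_latency_s):
    """Throughput as defined in the note above:
    tokens/s = total generated tokens / total latency,
    inclusive of the time to generate the first token."""
    return total_generated_tokens / total_latency_s

# Illustrative only: a batch of 64 requests, each producing 2,048 output
# tokens, finishing in 44.6 s of wall-clock time overall.
print(round(output_tokens_per_sec(64 * 2048, 44.6)))  # 2939
```

Note this aggregate metric differs from the per-user metric used in the B200 "Per User" table, which divides by concurrent users and runs at batch size 1.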
GH200 Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,637 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 4 | 128 | 2048 | 10,358 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 4 | 128 | 4096 | 6,628 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 1 | 2048 | 128 | 425 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 1 | 5000 | 500 | 422 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 4 | 500 | 2000 | 9,091 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 1 | 1000 | 1000 | 1,746 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 4 | 2048 | 2048 | 4,865 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Llama v3.1 70B | 1 | 4 | 20000 | 2000 | 959 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 128 | 128 | 29,853 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 128 | 2048 | 21,770 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 14,190 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,844 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,933 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 17,137 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 16,483 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 10,266 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,560 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 128 | 128 | 32,498 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 128 | 2048 | 23,337 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 128 | 4096 | 15,018 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 2048 | 128 | 3,813 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 5000 | 500 | 3,950 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 500 | 2000 | 18,556 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 1000 | 1000 | 17,252 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 2048 | 2048 | 10,756 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mistral 7B | 1 | 1 | 20000 | 2000 | 1,601 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 128 | 128 | 16,859 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 128 | 2048 | 11,120 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 4 | 128 | 4096 | 30,066 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 2048 | 128 | 1,994 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 5000 | 500 | 2,078 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 500 | 2000 | 9,193 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 1000 | 1000 | 8,849 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 2048 | 2048 | 5,545 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
| Mixtral 8x7B | 1 | 1 | 20000 | 2000 | 861 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB |
TP: Tensor Parallelism. PP: Pipeline Parallelism.
H100 Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,378 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 2 | 128 | 4096 | 3,897 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 2 | 2048 | 128 | 774 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 2 | 500 | 2000 | 4,973 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 2 | 1000 | 1000 | 4,391 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 2 | 2048 | 2048 | 2,898 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Llama v3.1 70B | 1 | 4 | 20000 | 2000 | 920 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 1 | 128 | 128 | 15,962 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 128 | 2048 | 23,010 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 128 | 4096 | 14,237 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 1 | 2048 | 128 | 1,893 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 5000 | 500 | 3,646 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 500 | 2000 | 18,186 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.14.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 1000 | 1000 | 15,932 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.14.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 2048 | 2048 | 10,686 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
| Mixtral 8x7B | 1 | 2 | 20000 | 2000 | 1,757 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.17.0 | H100-SXM5-80GB |
TP: Tensor Parallelism. PP: Pipeline Parallelism.
L40S Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v3.1 8B | 1 | 1 | 128 | 128 | 9,105 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 128 | 2048 | 5,366 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 3,026 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 1,067 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 981 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 4,274 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 4,055 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 2,225 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 328 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Mixtral 8x7B | 4 | 1 | 128 | 128 | 15,278 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 2 | 2 | 128 | 2048 | 9,087 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 1 | 4 | 128 | 4096 | 5,736 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
| Mixtral 8x7B | 4 | 1 | 2048 | 128 | 2,098 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 2 | 2 | 5000 | 500 | 1,558 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 2 | 2 | 500 | 2000 | 7,974 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 2 | 2 | 1000 | 1000 | 6,579 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
| Mixtral 8x7B | 2 | 2 | 2048 | 2048 | 4,217 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
TP: Tensor Parallelism. PP: Pipeline Parallelism.
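In every table above, a row's GPU count is the product of its pipeline-parallel (PP) and tensor-parallel (TP) degrees: each of the PP pipeline stages holds a TP-way sharded copy of its layers. A trivial helper illustrating that relationship:

```python
def gpus_required(pp, tp):
    """Total GPUs for one configuration:
    pipeline-parallel stages (pp) x tensor-parallel shards per stage (tp)."""
    return pp * tp

print(gpus_required(2, 2))  # 4  (e.g. the PP=2, TP=2 Mixtral 8x7B rows on 4x L40S)
print(gpus_required(1, 8))  # 8  (e.g. the PP=1, TP=8 Llama v3.1 405B rows on 8x H200)
```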
Inference Performance of NVIDIA Data Center Products