This document is relevant for: Inf1

Inf1 Inference Performance

The following tables contain reference inference performance numbers for the models covered in the tutorials. Follow the links in each row to replicate similar results in your own environment. Refer to the Setup Environment documentation to create a new environment based on the latest Neuron release.

Last update: September 16th, 2024

Encoder Models

Throughput optimized

| Model | Scripts | Framework | Inst. Type | Avg Throughput (inferences/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT base | HuggingFace Pretrained BERT | PyTorch 1.13 | inf1.xlarge | 1056 | 20 | 21 | $0.029 | Batch | 2.20.0 | Data Parallel | 4 | fp32, bert-base-cased-finetuned-mrpc, sequence-length=128 |
| BERT base | HuggingFace distilBERT with Tensorflow2 | Tensorflow 2.10 | inf1.6xlarge | 2123 | 30 | 32 | $0.074 | Batch | 2.20.0 | Data Parallel | 16 | fp32, distilbert-base-uncased-finetuned-sst-2-english, sequence-length=128 |
| BERT base | Using NeuronCore Pipeline | PyTorch 1.13 | inf1.6xlarge | 2009 | 6 | 6 | $0.078 | Real Time | 2.20.0 | Model Pipeline | 1 | fp32, bert-base-uncased, sequence-length=128 |
| BERT base (bert-base-cased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1095 | 58 | 65 | $0.028 | Batch | 2.20.0 | Data Parallel | 8 | fp32, sequence-length=128 |
| BERT base (bert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1181 | 41 | 45 | $0.026 | Batch | 2.20.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1877 | 34 | 53 | $0.016 | Batch | 2.20.0 | Data Parallel | 8 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased-finetuned-sst-2-english) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1875 | 34 | 54 | $0.016 | Batch | 2.20.0 | Data Parallel | 8 | fp32, sequence-length=128 |
| DistilRoBERTa base (distilroberta-base) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1513 | 15 | 26 | $0.020 | Batch | 2.20.0 | Data Parallel | 6 | fp32, sequence-length=128 |
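The "Compile + Benchmark" and HuggingFace rows above first compile the model for Inferentia with torch-neuron and then measure it. The sketch below shows the general shape of that compilation step using the first row's settings (fp32, sequence-length=128, batch size 4); it is a minimal example under those assumptions, and the linked scripts remain the authoritative versions, so details such as input construction and file names may differ.

```python
import torch
import torch_neuron  # AWS Neuron SDK for Inf1; registers torch.neuron.trace
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Settings taken from the first table row; the exact tutorial code may differ
# from this sketch.
model_name = "bert-base-cased-finetuned-mrpc"
batch_size = 4
sequence_length = 128

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, torchscript=True)
model.eval()

# Inf1 compilation is shape-specific, so build example inputs with fixed shapes.
example = tokenizer(
    ["A short example sentence."] * batch_size,
    padding="max_length",
    max_length=sequence_length,
    truncation=True,
    return_tensors="pt",
)
example_inputs = (example["input_ids"], example["attention_mask"])

# Compile the model for NeuronCores and save the TorchScript artifact
# (the file name is illustrative).
model_neuron = torch.neuron.trace(model, example_inputs)
model_neuron.save("bert_neuron_b4.pt")
```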

Latency optimized

| Model | Scripts | Framework | Inst. Type | Avg Throughput (inferences/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT base (bert-base-cased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 126 | 8 | 8 | $0.243 | Real Time | 2.20.0 | Data Parallel | 1 | fp32, sequence-length=128 |
| BERT base (bert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 285 | 10 | 11 | $0.107 | Real Time | 2.20.0 | Data Parallel | 3 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 538 | 11 | 12 | $0.057 | Real Time | 2.20.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased-finetuned-sst-2-english) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 593 | 10 | 11 | $0.051 | Real Time | 2.20.0 | Data Parallel | 5 | fp32, sequence-length=128 |
| DistilRoBERTa base (distilroberta-base) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 417 | 7 | 8 | $0.073 | Real Time | 2.20.0 | Data Parallel | 3 | fp32, sequence-length=128 |
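Run Mode "Data Parallel" in both tables means the compiled model is replicated across the instance's NeuronCores and incoming batches are split between the replicas. Below is a minimal sketch of that mode using torch-neuron's DataParallel wrapper; the compiled file name and input shapes are illustrative, assuming the BERT artifact compiled above and the 4 NeuronCores of an inf1.xlarge.

```python
import torch
import torch_neuron

# Load a model previously compiled with torch.neuron.trace (file name is illustrative).
model_neuron = torch.jit.load("bert_neuron_b4.pt")

# Replicate the model across the available NeuronCores; inputs are split along
# dim 0 into chunks of the compiled batch size and run on the replicas in parallel.
model_parallel = torch.neuron.DataParallel(model_neuron)

# 4 NeuronCores x compiled batch size 4 -> a 16-sample batch keeps every core busy.
input_ids = torch.zeros(16, 128, dtype=torch.long)
attention_mask = torch.ones(16, 128, dtype=torch.long)
logits = model_parallel(input_ids, attention_mask)
```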

Note

Throughput and latency numbers in this table were computed using NeuronPerf. To reproduce these results, install NeuronPerf and run the provided scripts.
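For reference, the snippet below sketches a NeuronPerf compile-and-benchmark run. It follows the pattern of NeuronPerf's PyTorch quickstart, with a toy model standing in for the encoder models above; the install URL and file name are assumptions, and the per-model scripts linked in the tables remain the source of the published numbers.

```python
# Assumed install command (standard Neuron pip repository):
#   pip install neuronperf --extra-index-url=https://pip.repos.neuron.amazonaws.com
import torch
import neuronperf as npf
import neuronperf.torch  # framework-specific compile/benchmark entry points

# Toy model standing in for the encoder models benchmarked above.
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(128, 2)

    def forward(self, x):
        return self.linear(x)

model = Model().eval()
batch_sizes = [1, 8]                                  # latency- and throughput-oriented settings
inputs = [torch.zeros(b, 128) for b in batch_sizes]   # one example input per batch size

# Compile once per batch size, then benchmark; reports include throughput and
# latency percentiles.
filename = npf.torch.compile(model, inputs, batch_sizes=batch_sizes, filename="model_index.json")
reports = npf.torch.benchmark(filename, inputs, batch_sizes=batch_sizes)
npf.print_reports(reports)
```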

Convolutional Neural Networks (CNN) Models

| Model | Tutorial | Framework | Inst. Type | Avg Throughput (inferences/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenPose | Running OpenPose on Inferentia | Tensorflow 1.15 | inf1.xlarge | 58 | 60 | 67 | $0.531 | Real Time | 2.12.0 | Data Parallel | 1 | fp16 |
| Resnet-50 | ResNet-50 optimization example | Tensorflow 1.15 | inf1.xlarge | 2207 | 18 | 23 | $0.014 | Batch | 2.12.0 | Data Parallel | 10 | fp16 |
| Resnet-50 | Resnet50 model for Inferentia | PyTorch 1.13 | inf1.xlarge | 922 | 22 | 23 | $0.033 | Batch | 2.20.0 | Data Parallel | 5 | fp32 |
| YOLO v4 | Evaluate YOLO v4 on Inferentia | PyTorch 1.13 | inf1.2xlarge | 180 | 40 | 51 | $0.268 | Real Time | 2.20.0 | Data Parallel | 1 | fp32 |

Note

Throughput and latency numbers in this table were generated using Neuron Tutorials.

Note

Cost per 1M inferences is calculated using the US East (N. Virginia) Reserved Instance (RI) effective hourly rate.

Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to running at the batch size that maximizes throughput and minimizes cost per inference.
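As a rough illustration of how the cost column relates to throughput, the sketch below divides an hourly instance rate by the number of inferences completed in an hour. The hourly rate used here is an assumption chosen for illustration, not a published price; the tables use the actual RI-effective rates.

```python
def cost_per_million(hourly_rate_usd: float, throughput_per_sec: float) -> float:
    """Cost of 1M inferences on one instance running at a sustained throughput."""
    inferences_per_hour = throughput_per_sec * 3600
    return hourly_rate_usd / inferences_per_hour * 1_000_000

# Assumed rate for illustration only; with ~$0.11/hour and 1056 inferences/sec
# this lands near the $0.029 shown in the first encoder row.
print(f"${cost_per_million(0.11, 1056):.3f} per 1M inferences")
```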

This document is relevant for: Inf1