This document is relevant for: Inf1

Inf1 Inference Performance

The following tables contain reference inference performance numbers for the models covered in the tutorials. Follow the links in each row to replicate similar results in your own environment. Refer to the Setup Environment documentation to create a new environment based on the latest Neuron release.

Last update: September 16th, 2024

Encoder Models

Throughput optimized

| Model | Scripts | Framework | Inst. Type | Avg Throughput (inferences/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT base | HuggingFace Pretrained BERT | PyTorch 1.13 | inf1.xlarge | 1056 | 20 | 21 | $0.029 | Batch | 2.20.0 | Data Parallel | 4 | fp32, bert-base-cased-finetuned-mrpc, sequence-length=128 |
| BERT base | HuggingFace distilBERT with Tensorflow2 | Tensorflow 2.10 | inf1.6xlarge | 2123 | 30 | 32 | $0.074 | Batch | 2.20.0 | Data Parallel | 16 | fp32, distilbert-base-uncased-finetuned-sst-2-english, sequence-length=128 |
| BERT base | Using NeuronCore Pipeline | PyTorch 1.13 | inf1.6xlarge | 2009 | 6 | 6 | $0.078 | Real Time | 2.20.0 | Model Pipeline | 1 | fp32, bert-base-uncased, sequence-length=128 |
| BERT base (bert-base-cased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1095 | 58 | 65 | $0.028 | Batch | 2.20.0 | Data Parallel | 8 | fp32, sequence-length=128 |
| BERT base (bert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1181 | 41 | 45 | $0.026 | Batch | 2.20.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1877 | 34 | 53 | $0.016 | Batch | 2.20.0 | Data Parallel | 8 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased-finetuned-sst-2-english) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1875 | 34 | 54 | $0.016 | Batch | 2.20.0 | Data Parallel | 8 | fp32, sequence-length=128 |
| DistilRoBERTa base (distilroberta-base) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1513 | 15 | 26 | $0.020 | Batch | 2.20.0 | Data Parallel | 6 | fp32, sequence-length=128 |
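The "Compile + Benchmark" and HuggingFace rows above first compile the model for Inferentia with torch-neuron and then measure it. The sketch below shows the general shape of that compilation step using the first row's settings (fp32, sequence-length=128, batch size 4); it is a minimal example under those assumptions, and the linked scripts remain the authoritative versions, so details such as input construction and file names may differ.

```python
import torch
import torch_neuron  # AWS Neuron SDK for Inf1; registers torch.neuron.trace
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Settings taken from the first table row; the exact tutorial code may differ
# from this sketch.
model_name = "bert-base-cased-finetuned-mrpc"
batch_size = 4
sequence_length = 128

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, torchscript=True)
model.eval()

# Inf1 compilation is shape-specific, so build example inputs with fixed shapes.
example = tokenizer(
    ["A short example sentence."] * batch_size,
    padding="max_length",
    max_length=sequence_length,
    truncation=True,
    return_tensors="pt",
)
example_inputs = (example["input_ids"], example["attention_mask"])

# Compile the model for NeuronCores and save the TorchScript artifact
# (the file name is illustrative).
model_neuron = torch.neuron.trace(model, example_inputs)
model_neuron.save("bert_neuron_b4.pt")
```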

Latency optimized

| Model | Scripts | Framework | Inst. Type | Avg Throughput (inferences/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT base (bert-base-cased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 126 | 8 | 8 | $0.243 | Real Time | 2.20.0 | Data Parallel | 1 | fp32, sequence-length=128 |
| BERT base (bert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 285 | 10 | 11 | $0.107 | Real Time | 2.20.0 | Data Parallel | 3 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 538 | 11 | 12 | $0.057 | Real Time | 2.20.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased-finetuned-sst-2-english) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 593 | 10 | 11 | $0.051 | Real Time | 2.20.0 | Data Parallel | 5 | fp32, sequence-length=128 |
| DistilRoBERTa base (distilroberta-base) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 417 | 7 | 8 | $0.073 | Real Time | 2.20.0 | Data Parallel | 3 | fp32, sequence-length=128 |
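Run Mode "Data Parallel" in both tables means the compiled model is replicated across the instance's NeuronCores and incoming batches are split between the replicas. Below is a minimal sketch of that mode using torch-neuron's DataParallel wrapper; the compiled file name and input shapes are illustrative, assuming the BERT artifact compiled above and the 4 NeuronCores of an inf1.xlarge.

```python
import torch
import torch_neuron

# Load a model previously compiled with torch.neuron.trace (file name is illustrative).
model_neuron = torch.jit.load("bert_neuron_b4.pt")

# Replicate the model across the available NeuronCores; inputs are split along
# dim 0 into chunks of the compiled batch size and run on the replicas in parallel.
model_parallel = torch.neuron.DataParallel(model_neuron)

# 4 NeuronCores x compiled batch size 4 -> a 16-sample batch keeps every core busy.
input_ids = torch.zeros(16, 128, dtype=torch.long)
attention_mask = torch.ones(16, 128, dtype=torch.long)
logits = model_parallel(input_ids, attention_mask)
```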

Note

Throughput and latency numbers in this table were computed using NeuronPerf. To reproduce these results, install NeuronPerf and run the provided scripts.
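For reference, the snippet below sketches a NeuronPerf compile-and-benchmark run. It follows the pattern of NeuronPerf's PyTorch quickstart, with a toy model standing in for the encoder models above; the install URL and file name are assumptions, and the per-model scripts linked in the tables remain the source of the published numbers.

```python
# Assumed install command (standard Neuron pip repository):
#   pip install neuronperf --extra-index-url=https://pip.repos.neuron.amazonaws.com
import torch
import neuronperf as npf
import neuronperf.torch  # framework-specific compile/benchmark entry points

# Toy model standing in for the encoder models benchmarked above.
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(128, 2)

    def forward(self, x):
        return self.linear(x)

model = Model().eval()
batch_sizes = [1, 8]                                  # latency- and throughput-oriented settings
inputs = [torch.zeros(b, 128) for b in batch_sizes]   # one example input per batch size

# Compile once per batch size, then benchmark; reports include throughput and
# latency percentiles.
filename = npf.torch.compile(model, inputs, batch_sizes=batch_sizes, filename="model_index.json")
reports = npf.torch.benchmark(filename, inputs, batch_sizes=batch_sizes)
npf.print_reports(reports)
```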

Convolutional Neural Networks (CNN) Models

| Model | Tutorial | Framework | Inst. Type | Avg Throughput (inferences/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenPose | Running OpenPose on Inferentia | Tensorflow 1.15 | inf1.xlarge | 58 | 60 | 67 | $0.531 | Real Time | 2.12.0 | Data Parallel | 1 | fp16 |
| Resnet-50 | ResNet-50 optimization example | Tensorflow 1.15 | inf1.xlarge | 2207 | 18 | 23 | $0.014 | Batch | 2.12.0 | Data Parallel | 10 | fp16 |
| Resnet-50 | Resnet50 model for Inferentia | PyTorch 1.13 | inf1.xlarge | 922 | 22 | 23 | $0.033 | Batch | 2.20.0 | Data Parallel | 5 | fp32 |
| YOLO v4 | Evaluate YOLO v4 on Inferentia | PyTorch 1.13 | inf1.2xlarge | 180 | 40 | 51 | $0.268 | Real Time | 2.20.0 | Data Parallel | 1 | fp32 |

Note

Throughput and latency numbers in this table were generated using Neuron Tutorials.

Note

Cost per 1M inferences is calculated using the US East (N. Virginia) Reserved Instance (RI) effective hourly rate.

Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to running at the batch size that maximizes throughput and minimizes cost per inference.
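As a rough illustration of how the cost column relates to throughput, the sketch below divides an hourly instance rate by the number of inferences completed in an hour. The hourly rate used here is an assumption chosen for illustration, not a published price; the tables use the actual RI-effective rates.

```python
def cost_per_million(hourly_rate_usd: float, throughput_per_sec: float) -> float:
    """Cost of 1M inferences on one instance running at a sustained throughput."""
    inferences_per_hour = throughput_per_sec * 3600
    return hourly_rate_usd / inferences_per_hour * 1_000_000

# Assumed rate for illustration only; with ~$0.11/hour and 1056 inferences/sec
# this lands near the $0.029 shown in the first encoder row.
print(f"${cost_per_million(0.11, 1056):.3f} per 1M inferences")
```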

This document is relevant for: Inf1