Benchmark Tool — OpenVINO™ documentation
This page demonstrates how to use the Benchmark Tool to estimate deep learning inference performance on supported devices. Note that the MULTI plugin mentioned here is a legacy feature and is currently just a mapping to the AUTO plugin.
Note
Use either Python or C++ version, depending on the language of your application.
Basic Usage#
Python
The Python benchmark_app is automatically installed when you install OpenVINO using PyPI. Before running benchmark_app, make sure the openvino_env virtual environment is activated, and navigate to the directory where your model is located.
The benchmark application works with models in the OpenVINO IR (model.xml and model.bin) and ONNX (model.onnx) formats. Make sure to convert your models if necessary.
To run a benchmark with default options on a model, use the following command:
benchmark_app -m model.xml
C++
To use the C++ benchmark_app, you must first build it following the Build the Sample Applications instructions and then set up paths and environment variables by following the Get Ready for Running the Sample Applications instructions. Navigate to the directory where the benchmark_app C++ sample binary was built.
Note
If you installed OpenVINO Runtime using PyPI or Anaconda Cloud, only the Benchmark Python Tool is available, and you should follow the usage instructions on that page instead.
The benchmark application works with models in the OpenVINO IR, TensorFlow, TensorFlow Lite, PaddlePaddle, PyTorch and ONNX formats. If you need it, OpenVINO also allows you to convert your models.
To run a benchmark with default options on a model, use the following command:
./benchmark_app -m model.xml
By default, the application loads the specified model and performs inference on batches of randomly-generated data inputs on CPU for 60 seconds. It displays information about the benchmark parameters as it loads the model. When the benchmark is completed, it reports the minimum, average, and maximum inference latency and the average throughput.
You may be able to improve benchmark results beyond the default configuration by configuring some of the execution parameters for your model. For example, you can use “throughput” or “latency” performance hints to optimize the runtime for higher FPS or reduced inference time. Read on to learn more about the configuration options available for benchmark_app.
Configuration Options#
You can easily configure and fine-tune benchmarks with various execution parameters, for example to achieve better performance on your device. The list of all configuration options is given in the Advanced Usage section.
Performance hints: latency and throughput#
With high-level “performance hints”, which automatically adjust parameters such as the number of processing streams and inference batch size, you can aim for low-latency or high-throughput inference.
The performance hints do not require any device-specific settings and they are completely portable between devices. The parameters are automatically configured based on the device. Therefore, you can easily port applications between hardware targets without having to re-determine the best runtime parameters for a new device.
If not specified, throughput is used as the default. To set the hint explicitly, use -hint latency or -hint throughput when running benchmark_app:
Python
benchmark_app -m model.xml -hint latency
benchmark_app -m model.xml -hint throughput
C++
./benchmark_app -m model.xml -hint latency
./benchmark_app -m model.xml -hint throughput
Note
Make sure the environment is optimized for maximum performance when the benchmark is running. Otherwise, different environment settings, such as power optimization settings, processor overclocking, or thermal throttling, may give different results.
If you specify a single option multiple times, only the last value will be used. For example, the -m flag:
Python
benchmark_app -m model.xml -m model2.xml
C++
./benchmark_app -m model.xml -m model2.xml
Latency#
Latency is the amount of time it takes to process a single inference request. Low latency is useful in applications where data needs to be inferred and acted on as quickly as possible (such as autonomous driving). For conventional devices, low latency is achieved by reducing the number of parallel processing streams so the system can utilize as many resources as possible to quickly calculate each inference request. However, advanced devices like multi-socket CPUs and modern GPUs are capable of running multiple inference requests while delivering the same latency.
When benchmark_app is run with -hint latency, it determines the optimal number of parallel inference requests for minimizing latency while still maximizing the parallelization capabilities of the hardware. It automatically sets the number of processing streams and inference batch size to achieve the best latency.
Throughput#
Throughput is the amount of data processed by an inference pipeline at a time. It is usually measured in frames per second (FPS) or inferences per second. High throughput is beneficial for applications where large amounts of data need to be inferred simultaneously (such as multi-camera video streams). To achieve high throughput, the runtime focuses on fully saturating the device with enough data to process. It utilizes as much memory and as many parallel streams as possible to maximize the amount of data that can be processed simultaneously.
When benchmark_app is run with -hint throughput, it maximizes the number of parallel inference requests to utilize all the threads available on the device. On GPU, it automatically sets the inference batch size to fill up the GPU memory available.
For more information on performance hints, see the High-level Performance Hints page. For more details on optimal runtime configurations and how they are automatically determined using performance hints, see Runtime Inference Optimizations.
Device#
The benchmark app supports CPU and GPU devices. To run a benchmark on a chosen device, set the -d <device> argument. When run with default parameters, benchmark_app creates 4 and 16 inference requests for CPU and GPU, respectively.
In order to use GPU, the system must have the appropriate drivers installed. If no device is specified, benchmark_app will use CPU by default.
For example, to run a benchmark on GPU, use:
Python
benchmark_app -m model.xml -d GPU
C++
./benchmark_app -m model.xml -d GPU
You may also specify AUTO as the device to let benchmark_app automatically select the best device for benchmarking, with the CPU supporting it while the model is being loaded. You can use AUTO when you aim for better performance. For more information, see the Automatic device selection page.
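For example:
Python
benchmark_app -m model.xml -d AUTO
C++
./benchmark_app -m model.xml -d AUTO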
Note
- If either the latency or throughput hint is set, it will automatically configure streams, batch sizes, and the number of parallel infer requests for optimal performance, based on the specified device.
- Optionally, you can specify the number of parallel infer requests with the -nireq option. Setting a high value may improve throughput at the expense of latency, while a low value may give the opposite result (see the example below).
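For instance, to pin the number of parallel infer requests to four (an illustrative value; the optimal number depends on your device):
Python
benchmark_app -m model.xml -nireq 4
C++
./benchmark_app -m model.xml -nireq 4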
Number of iterations#
By default, the benchmark app will run for a predefined duration, repeatedly performing inference with the model and measuring the resulting inference speed. There are several options for setting the number of inference iterations:
- Explicitly specify the number of iterations the model runs, using the -niter <number_of_iterations> option.
- Set the -t <seconds> option to run the app for a specified amount of time.
- Set both of them (execution will continue until both conditions are met).
- If neither -niter nor -t is specified, the app will run for a predefined duration that depends on the device.
The more iterations a model runs, the better the statistics will be for determining average latency and throughput.
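For example, to run for 500 iterations, or for 20 seconds (both values are illustrative):
Python
benchmark_app -m model.xml -niter 500
benchmark_app -m model.xml -t 20
C++
./benchmark_app -m model.xml -niter 500
./benchmark_app -m model.xml -t 20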
Maximum inference rate#
By default, the benchmark app runs inference at the maximum rate the device allows. The maximum inference rate can be configured with the -max_irate <MAXIMUM_INFERENCE_RATE> option. Limiting the number of executions per second with this option may result in more accurate measurements and reduced power consumption.
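For example, to cap the benchmark at roughly 30 inferences per second (the value is illustrative only):
Python
benchmark_app -m model.xml -max_irate 30
C++
./benchmark_app -m model.xml -max_irate 30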
Inputs#
The tool runs benchmarks on user-provided input images in .jpg, .bmp, or .png formats. Use -i <PATH_TO_INPUT> to specify the path to an image or a folder of images:
Python
benchmark_app -m model.xml -i test1.jpg
C++
./benchmark_app -m model.xml -i test1.jpg
The tool will repeatedly loop through the provided inputs and run inference for the specified amount of time or number of iterations. If the -i flag is not used, the tool will automatically generate random data to fit the input shape of the model.
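For example, to loop over a folder of images for 30 seconds (the folder name and duration are illustrative):
Python
benchmark_app -m model.xml -i test_images/ -t 30
C++
./benchmark_app -m model.xml -i test_images/ -t 30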
Examples#
For more usage examples and step-by-step instructions, see the Examples of Running the Tool section.
Advanced Usage#
Note
By default, OpenVINO samples, tools, and demos expect input with BGR channel order. If you trained your model to work with RGB order, you need to manually rearrange the default channel order in the sample or demo application, or reconvert your model. For more information, refer to the Color Conversion section of Preprocessing API.
Per-layer performance and logging#
The application also collects per-layer Performance Measurement (PM) counters for each executed infer request if you enable statistics dumping by setting the -report_type parameter to one of the possible values:
- no_counters - includes the specified configuration options, resulting FPS, and latency.
- average_counters - extends the no_counters report and additionally includes average PM counter values for each layer from the model.
- detailed_counters - extends the average_counters report and additionally includes per-layer PM counters and latency for each executed infer request.
Depending on the type, the report is saved to the benchmark_no_counters_report.csv, benchmark_average_counters_report.csv, or benchmark_detailed_counters_report.csv file located in the path specified with -report_folder. The application also saves executable graph information to an XML file located in a folder specified with the -exec_graph_path parameter.
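For example, to collect average per-layer counters and dump the executable graph (the report folder and graph file names are illustrative):
Python
benchmark_app -m model.xml -report_type average_counters -report_folder reports -exec_graph_path exec_graph.xml
C++
./benchmark_app -m model.xml -report_type average_counters -report_folder reports -exec_graph_path exec_graph.xml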
All configuration options#
Run the application with the -h or --help flags to get information on available options and parameters:
The help information is also displayed when you run the application without any parameters.
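For example:
Python
benchmark_app -h
C++
./benchmark_app -h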
More information on inputs#
The benchmark tool supports topologies with one or more inputs. If a topology is not data-sensitive, you can skip the input parameter, and the inputs will be filled with random values. If a model has only image inputs, provide a folder with images or a path to an image as input. If a model has specific inputs besides images, prepare binary files or NumPy arrays filled with data of the appropriate precision and provide paths to them as input. If a model has mixed input types, the input folder should contain all required files. Image inputs are filled with image files one by one. Binary inputs are filled with binary files one by one.
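For example, assuming a model with a single non-image input, you could pass a NumPy file to the Python tool or a raw binary file to the C++ tool (input_data.npy and input_data.bin are hypothetical files prepared to match the input's precision and shape):
Python
benchmark_app -m model.xml -i input_data.npy
C++
./benchmark_app -m model.xml -i input_data.bin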
Examples of Running the Tool#
This section provides step-by-step instructions on how to run the Benchmark Tool with the asl-recognition Intel model on CPU or GPU devices. It uses random data as input.
Note
Internet access is required to execute the following steps successfully. If you have access to the Internet through a proxy server only, make sure it is configured in your OS.
Run the tool, specifying the location of the .xml model file of OpenVINO Intermediate Representation (IR), the inference device and a performance hint. The following examples show how to run the Benchmark Tool on CPU and GPU in latency and throughput mode respectively:
- On CPU (latency mode):
Python
benchmark_app -m omz_models/intel/asl-recognition-0004/FP16/asl-recognition-0004.xml -d CPU -hint latency
C++
./benchmark_app -m omz_models/intel/asl-recognition-0004/FP16/asl-recognition-0004.xml -d CPU -hint latency
- On GPU (throughput mode):
Python
benchmark_app -m omz_models/intel/asl-recognition-0004/FP16/asl-recognition-0004.xml -d GPU -hint throughput
C++
./benchmark_app -m omz_models/intel/asl-recognition-0004/FP16/asl-recognition-0004.xml -d GPU -hint throughput
The application outputs the number of executed iterations, total duration of execution, latency, and throughput. Additionally, if you set the parameters:
- -report_type - the application outputs a statistics report,
- -pc - the application outputs performance counters,
- -exec_graph_path - the application reports serialized executable graph information.
All measurements including per-layer PM counters are reported in milliseconds.
An example of running benchmark_app on CPU in latency mode and its output are shown below:
Python
benchmark_app -m omz_models/intel/asl-recognition-0004/FP16/asl-recognition-0004.xml -d CPU -hint latency
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[ INFO ] Input command: /home/openvino/tools/benchmark_tool/benchmark_app.py -m omz_models/intel/intel/asl-recognition-0004/FP16/asl-recognition-0004.xml -d CPU -hint latency
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2022.3.0-7750-c1109a7317e-feature/py_cpp_align
[ INFO ]
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2022.3.0-7750-c1109a7317e-feature/py_cpp_align
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 147.82 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ] input (node: input) : f32 / [N,C,D,H,W] / {1,3,16,224,224}
[ INFO ] Model outputs:
[ INFO ] output (node: output) : f32 / [...] / {1,100}
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 1
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ] input (node: input) : f32 / [N,C,D,H,W] / {1,3,16,224,224}
[ INFO ] Model outputs:
[ INFO ] output (node: output) : f32 / [...] / {1,100}
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 974.64 ms
[ INFO ] Start of compilation memory usage: Peak 1000 KB
[ INFO ] End of compilation memory usage: Peak 10000 KB
[ INFO ] Compile model ram used 9000 KB
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ] NETWORK_NAME: torch-jit-export
[ INFO ] OPTIMAL_NUMBER_OF_INFER_REQUESTS: 2
[ INFO ] NUM_STREAMS: 2
[ INFO ] AFFINITY: Affinity.CORE
[ INFO ] INFERENCE_NUM_THREADS: 0
[ INFO ] PERF_COUNT: False
[ INFO ] INFERENCE_PRECISION_HINT: <Type: 'float32'>
[ INFO ] PERFORMANCE_HINT: PerformanceMode.LATENCY
[ INFO ] PERFORMANCE_HINT_NUM_REQUESTS: 0
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'input'!. This input will be filled with random values!
[ INFO ] Fill input 'input' with random values
[Step 10/11] Measuring performance (Start inference asynchronously, 2 inference requests, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 38.41 ms
[Step 11/11] Dumping statistics report
[ INFO ] Count: 5380 iterations
[ INFO ] Duration: 60036.78 ms
[ INFO ] Latency:
[ INFO ] Median: 22.04 ms
[ INFO ] Average: 22.09 ms
[ INFO ] Min: 20.78 ms
[ INFO ] Max: 33.51 ms
[ INFO ] Throughput: 89.61 FPS
C++
./benchmark_app -m omz_models/intel/asl-recognition-0004/FP16/asl-recognition-0004.xml -d CPU -hint latency
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[ INFO ] Input command: /home/openvino/bin/intel64/DEBUG/benchmark_app -m omz_models/intel/asl-recognition-0004/FP16/asl-recognition-0004.xml -d CPU -hint latency
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2022.3.0-7750-c1109a7317e-feature/py_cpp_align
[ INFO ]
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2022.3.0-7750-c1109a7317e-feature/py_cpp_align
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Device(CPU) performance hint is set to LATENCY
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 141.11 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Network inputs:
[ INFO ] input (node: input) : f32 / [N,C,D,H,W] / {1,3,16,224,224}
[ INFO ] Network outputs:
[ INFO ] output (node: output) : f32 / [...] / {1,100}
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 0
[Step 6/11] Configuring input of the model
[ INFO ] Model batch size: 1
[ INFO ] Network inputs:
[ INFO ] input (node: input) : f32 / [N,C,D,H,W] / {1,3,16,224,224}
[ INFO ] Network outputs:
[ INFO ] output (node: output) : f32 / [...] / {1,100}
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 989.62 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ] NETWORK_NAME: torch-jit-export
[ INFO ] OPTIMAL_NUMBER_OF_INFER_REQUESTS: 2
[ INFO ] NUM_STREAMS: 2
[ INFO ] AFFINITY: CORE
[ INFO ] INFERENCE_NUM_THREADS: 0
[ INFO ] PERF_COUNT: NO
[ INFO ] INFERENCE_PRECISION_HINT: f32
[ INFO ] PERFORMANCE_HINT: LATENCY
[ INFO ] PERFORMANCE_HINT_NUM_REQUESTS: 0
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given: all inputs will be filled with random values!
[ INFO ] Test Config 0
[ INFO ] input ([N,C,D,H,W], f32, {1, 3, 16, 224, 224}, static): random (binary data is expected)
[Step 10/11] Measuring performance (Start inference asynchronously, 2 inference requests, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 37.27 ms
[Step 11/11] Dumping statistics report
[ INFO ] Count: 5470 iterations
[ INFO ] Duration: 60028.56 ms
[ INFO ] Latency:
[ INFO ] Median: 21.79 ms
[ INFO ] Average: 21.92 ms
[ INFO ] Min: 20.60 ms
[ INFO ] Max: 37.19 ms
[ INFO ] Throughput: 91.12 FPS
The Benchmark Tool can also be used with dynamically shaped models to measure expected inference time for various input data shapes. See the -shape and -data_shape argument descriptions in the All configuration options section to learn more about using dynamic shapes. Below is an example of using benchmark_app with dynamic models and a portion of the resulting output:
Python
benchmark_app -m omz_models/intel/asl-recognition-0004/FP16/asl-recognition-0004.xml -d CPU -shape [-1,3,16,224,224] -data_shape [1,3,16,224,224][2,3,16,224,224][4,3,16,224,224] -pcseq
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'input'!. This input will be filled with random values!
[ INFO ] Fill input 'input' with random values
[ INFO ] Defined 3 tensor groups:
[ INFO ] input: {1, 3, 16, 224, 224}
[ INFO ] input: {2, 3, 16, 224, 224}
[ INFO ] input: {4, 3, 16, 224, 224}
[Step 10/11] Measuring performance (Start inference asynchronously, 11 inference requests, limits: 60000 ms duration)
[ INFO ] Benchmarking in full mode (inputs filling are included in measurement loop).
[ INFO ] First inference took 201.15 ms
[Step 11/11] Dumping statistics report
[ INFO ] Count: 2811 iterations
[ INFO ] Duration: 60271.71 ms
[ INFO ] Latency:
[ INFO ] Median: 207.70 ms
[ INFO ] Average: 234.56 ms
[ INFO ] Min: 85.73 ms
[ INFO ] Max: 773.55 ms
[ INFO ] Latency for each data shape group:
[ INFO ] 1. input: {1, 3, 16, 224, 224}
[ INFO ] Median: 118.08 ms
[ INFO ] Average: 115.05 ms
[ INFO ] Min: 85.73 ms
[ INFO ] Max: 339.25 ms
[ INFO ] 2. input: {2, 3, 16, 224, 224}
[ INFO ] Median: 207.25 ms
[ INFO ] Average: 205.16 ms
[ INFO ] Min: 166.98 ms
[ INFO ] Max: 545.55 ms
[ INFO ] 3. input: {4, 3, 16, 224, 224}
[ INFO ] Median: 384.16 ms
[ INFO ] Average: 383.48 ms
[ INFO ] Min: 305.51 ms
[ INFO ] Max: 773.55 ms
[ INFO ] Throughput: 108.82 FPS
C++
./benchmark_app -m omz_models/intel/asl-recognition-0004/FP16/asl-recognition-0004.xml -d CPU -shape [-1,3,16,224,224] -data_shape [1,3,16,224,224][2,3,16,224,224][4,3,16,224,224] -pcseq
[Step 9/11] Creating infer requests and preparing input tensors
[ INFO ] Test Config 0
[ INFO ] input ([N,C,D,H,W], f32, {1, 3, 16, 224, 224}, dyn:{?,3,16,224,224}): random (binary data is expected)
[ INFO ] Test Config 1
[ INFO ] input ([N,C,D,H,W], f32, {2, 3, 16, 224, 224}, dyn:{?,3,16,224,224}): random (binary data is expected)
[ INFO ] Test Config 2
[ INFO ] input ([N,C,D,H,W], f32, {4, 3, 16, 224, 224}, dyn:{?,3,16,224,224}): random (binary data is expected)
[Step 10/11] Measuring performance (Start inference asynchronously, 11 inference requests, limits: 60000 ms duration)
[ INFO ] Benchmarking in full mode (inputs filling are included in measurement loop).
[ INFO ] First inference took 204.40 ms
[Step 11/11] Dumping statistics report
[ INFO ] Count: 2783 iterations
[ INFO ] Duration: 60326.29 ms
[ INFO ] Latency:
[ INFO ] Median: 208.20 ms
[ INFO ] Average: 237.47 ms
[ INFO ] Min: 85.06 ms
[ INFO ] Max: 743.46 ms
[ INFO ] Latency for each data shape group:
[ INFO ] 1. input: {1, 3, 16, 224, 224}
[ INFO ] Median: 120.36 ms
[ INFO ] Average: 117.19 ms
[ INFO ] Min: 85.06 ms
[ INFO ] Max: 348.66 ms
[ INFO ] 2. input: {2, 3, 16, 224, 224}
[ INFO ] Median: 207.81 ms
[ INFO ] Average: 206.39 ms
[ INFO ] Min: 167.19 ms
[ INFO ] Max: 578.33 ms
[ INFO ] 3. input: {4, 3, 16, 224, 224}
[ INFO ] Median: 387.40 ms
[ INFO ] Average: 388.99 ms
[ INFO ] Min: 327.50 ms
[ INFO ] Max: 743.46 ms
[ INFO ] Throughput: 107.61 FPS