Getting Performance Numbers — OpenVINO™ documentation

  1. Benchmarking methodology for OpenVINO
    1. OpenVINO benchmarking (general)
    2. OpenVINO Model Server benchmarking (general)
    3. OpenVINO Model Server benchmarking (LLM)
  2. How to obtain benchmark results
    1. General considerations
    2. OpenVINO benchmarking (general)
    3. OpenVINO benchmarking (LLM)

Benchmarking methodology for OpenVINO#

OpenVINO benchmarking (general)#

The OpenVINO benchmark setup consists of a single system with OpenVINO™ and the benchmark application installed. It measures the time spent on actual inference (excluding any pre- or post-processing) and reports it as inferences per second (or Frames Per Second).
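As an illustration of what is (and is not) included in the measurement, the following minimal Python sketch times only the inference call on a hypothetical local IR file (model.xml) with random dummy input; it is only a sketch of the approach, not the benchmark application itself.

```python
# Minimal sketch: time only the inference call, excluding any pre- or post-processing.
# Assumes a hypothetical IR file "model.xml" with a single static float32 input.
import time
import numpy as np
import openvino as ov

core = ov.Core()
compiled = core.compile_model("model.xml", "CPU")

shape = list(compiled.input(0).shape)              # static input shape assumed
data = np.random.rand(*shape).astype(np.float32)   # random dummy input

request = compiled.create_infer_request()
request.infer({0: data})                           # warm-up run, excluded from timing

iterations = 100
start = time.perf_counter()
for _ in range(iterations):
    request.infer({0: data})                       # only the inference call is timed
elapsed = time.perf_counter() - start

print(f"Average latency: {elapsed / iterations * 1000:.2f} ms")
print(f"Throughput: {iterations / elapsed:.2f} FPS")
```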

OpenVINO Model Server benchmarking (general)#

OpenVINO™ Model Server (OVMS) employs the Intel® Distribution of OpenVINO™ toolkit runtime libraries and exposes a set of models via a convenient inference API over gRPC or HTTP/REST. Its benchmark results are measured with a multiple-clients-single-server configuration, using two hardware platforms connected by Ethernet. Network bandwidth depends on both the platforms and the models used, and is provisioned so that it does not become a bottleneck for workload intensity. The connection is dedicated only to measuring performance.
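For reference, a single client request to an OVMS REST endpoint could look roughly like the sketch below, which uses the TensorFlow-Serving-style predict API; the host, REST port (8000), model name (resnet), and input shape are placeholders, and the actual benchmark drives many such clients in parallel.

```python
# Sketch of one OVMS REST client request; host, port, model name, and input shape
# are placeholders. The benchmark uses many concurrent clients, not a single request.
import numpy as np
import requests

dummy_batch = np.random.rand(1, 224, 224, 3).astype(np.float32)
payload = {"instances": dummy_batch.tolist()}

response = requests.post(
    "http://ovms-host:8000/v1/models/resnet:predict",
    json=payload,
    timeout=30,
)
predictions = response.json()["predictions"]
print(f"Received {len(predictions)} prediction(s)")
```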

The benchmark setup for OVMS consists of four main parts:

OVMS Benchmark Setup Diagram

OpenVINO Model Server benchmarking (LLM)#

In the benchmarking results presented here, the load from clients is simulated using the benchmark_serving.py script from vLLM and the ShareGPT dataset, which represents real-life usage scenarios. Both OpenVINO Model Server and vLLM expose OpenAI-compatible REST endpoints, so the methodology is identical for both.

In the experiments, the average request rate is varied to identify the tradeoff between total throughput and TPOT (time per output token) latency.

Note that the prefix_caching feature is not used in this benchmarking.
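For illustration, such a client load could be launched roughly as follows; the server address, model name, dataset path, request rate, and prompt count are placeholders, and the exact script options may differ between vLLM versions:

python benchmark_serving.py --backend openai --base-url http://ovms-host:8000 --model <model-name> --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 2 --num-prompts 500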

How to obtain benchmark results#

General considerations#

When evaluating the performance of a model with OpenVINO Runtime, make sure to measure a proper set of operations.

Performance conclusions should be built on reproducible data. Performance measurements should be made over a large number of invocations of the same routine, and since the first iteration is almost always significantly slower than the subsequent ones, an aggregated value (such as the mean or median of the remaining iterations) can be used for the execution time in final projections.
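The sketch below illustrates this kind of aggregation in Python, assuming a hypothetical run_inference callable that stands in for the routine under test:

```python
# Sketch of aggregating timings over many invocations of the same routine.
# "run_inference" is a placeholder for whatever is being measured.
import statistics
import time

def measure(run_inference, iterations=200):
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference()
        latencies.append(time.perf_counter() - start)

    first, rest = latencies[0], latencies[1:]
    print(f"First iteration: {first * 1000:.2f} ms (typically excluded as warm-up)")
    print(f"Median of remaining runs: {statistics.median(rest) * 1000:.2f} ms")
    print(f"Mean of remaining runs: {statistics.mean(rest) * 1000:.2f} ms")
```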

When comparing OpenVINO Runtime performance with the framework or reference code, make sure that both versions are as similar as possible, for example by measuring exactly the same inference execution and using identical models and inputs.

OpenVINO benchmarking (general)#

The default way of measuring OpenVINO performance is running a piece of code, referred to as the benchmark tool. For Python, it is part of the OpenVINO Runtime installation, while for C++, it is available as a code sample.

Running the benchmark application#

The benchmark_app includes a lot of device-specific options, but the primary usage is as simple as:

benchmark_app -m <model> -d <device> -i <input>
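For example, with a hypothetical IR file and input folder, a run on CPU could look like:

benchmark_app -m model.xml -d CPU -i input_images/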

Each of the OpenVINO supported devices offers performance settings that have command-line equivalents in the benchmark app.

While these settings provide really low-level control over optimal model performance on a specific device, it is recommended to always start performance evaluation with the OpenVINO High-Level Performance Hints first, like so:

for throughput prioritization

benchmark_app -hint tput -m <model> -d <device>

for latency prioritization

benchmark_app -hint latency -m <model> -d <device>
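The same hints can also be applied programmatically when compiling a model. The Python sketch below shows the rough equivalent, assuming a hypothetical model.xml, CPU as an example device, and the hint property names of the current Python API:

```python
# Sketch: applying high-level performance hints through the Python API.
# "model.xml" and the CPU device are placeholders.
import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()

# Equivalent of `-hint tput`: prioritize throughput
compiled_tput = core.compile_model(
    "model.xml", "CPU", {hints.performance_mode: hints.PerformanceMode.THROUGHPUT}
)

# Equivalent of `-hint latency`: prioritize latency
compiled_latency = core.compile_model(
    "model.xml", "CPU", {hints.performance_mode: hints.PerformanceMode.LATENCY}
)
```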

Internal Inference Performance Counters and Execution Graphs#

More detailed insights into inference performance breakdown can be achieved with device-specific performance counters and/or execution graphs. Both C++ and Python versions of the benchmark_app support a -pc command-line parameter that outputs an internal execution breakdown.
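When inference is driven through the Python API rather than benchmark_app, a similar per-layer breakdown can be retrieved from the inference request, as in the following sketch; the model file is a placeholder, profiling must be enabled before compilation, and the property name assumes the current Python API:

```python
# Sketch: collecting per-layer performance counters via the Python API.
# "model.xml" is a placeholder; profiling is enabled at compile time.
import numpy as np
import openvino as ov
import openvino.properties as props

core = ov.Core()
compiled = core.compile_model("model.xml", "CPU", {props.enable_profiling: True})

request = compiled.create_infer_request()
shape = list(compiled.input(0).shape)
request.infer({0: np.random.rand(*shape).astype(np.float32)})

# Each entry roughly corresponds to one row of the table shown below
for info in request.profiling_info:
    print(info.node_name, info.status, info.node_type, info.exec_type,
          info.real_time, info.cpu_time)
```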

For example, the table below shows part of the performance counters for CPU inference of a TensorFlow implementation of ResNet-50. Keep in mind that since the device is CPU, the realTime wall clock and the cpuTime columns are the same. Information about layer precision is also stored in the performance counters.

| layerName | execStatus | layerType | execType | realTime (ms) | cpuTime (ms) |
|---|---|---|---|---|---|
| resnet_model/batch_normalization_15/FusedBatchNorm/Add | EXECUTED | Convolution | jit_avx512_1x1_I8 | 0.377 | 0.377 |
| resnet_model/conv2d_16/Conv2D/fq_input_0 | NOT_RUN | FakeQuantize | undef | 0 | 0 |
| resnet_model/batch_normalization_16/FusedBatchNorm/Add | EXECUTED | Convolution | jit_avx512_I8 | 0.499 | 0.499 |
| resnet_model/conv2d_17/Conv2D/fq_input_0 | NOT_RUN | FakeQuantize | undef | 0 | 0 |
| resnet_model/batch_normalization_17/FusedBatchNorm/Add | EXECUTED | Convolution | jit_avx512_1x1_I8 | 0.399 | 0.399 |
| resnet_model/add_4/fq_input_0 | NOT_RUN | FakeQuantize | undef | 0 | 0 |
| resnet_model/add_4 | NOT_RUN | Eltwise | undef | 0 | 0 |
| resnet_model/add_5/fq_input_1 | NOT_RUN | FakeQuantize | undef | 0 | 0 |

The execStatus column of the table includes the following possible values:

- EXECUTED - the layer was executed by a standalone primitive.

- NOT_RUN - the layer was not executed by a standalone primitive, or was fused with another operation and executed in another layer's primitive.

The execType column of the table includes inference primitives with specific suffixes. The layers can have the following marks:

- The I8 suffix is for layers that had 8-bit data type input and were computed in 8-bit precision.

- The FP32 suffix is for layers computed in 32-bit precision.

All Convolution layers are executed in int8 precision. The rest of the layers are fused into Convolutions using post-operation optimization, as described in CPU Device. The performance counters table contains layer names (as seen in OpenVINO IR), layer types, and execution statistics.

Both benchmark_app versions also support the exec_graph_path command-line option, which instructs OpenVINO to output the same per-layer execution statistics, but in the form of a plugin-specific, Netron-viewable graph written to the specified file.
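For example (the model and device are placeholders):

benchmark_app -m <model> -d <device> -exec_graph_path exec_graph.xml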

Especially when performance-debugging latency, note that the counters do not reflect the time spent in the plugin/device/driver/etc. queues. If the sum of the counters is too different from the latency of an inference request, consider testing with fewer inference requests. For example, running a single OpenVINO stream with multiple requests would produce nearly identical counters as running a single inference request, while the actual latency can be quite different.
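For instance, the single-stream, multiple-request scenario described above can be reproduced with benchmark_app options such as the following (the model is a placeholder and the request count is only illustrative):

benchmark_app -m <model> -d CPU -nstreams 1 -nireq 4 -pc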

Lastly, the performance statistics from both performance counters and execution graphs are averaged, so such data for inputs of dynamic shapes should be measured carefully, preferably by isolating the specific shape and executing it multiple times in a loop to gather reliable data.
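With benchmark_app, a specific shape can be pinned for such an isolated measurement, for example (the model, device, and shape values are placeholders):

benchmark_app -m <model> -d <device> -shape "[1,3,224,224]"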

Use ITT to Get Performance Insights#

In general, OpenVINO and its individual plugins are heavily instrumented with Intel® Instrumentation and Tracing Technology (ITT). Therefore, you can also compile OpenVINO from the source code with ITT enabled and use tools like Intel® VTune™ Profiler to get a detailed inference performance breakdown and additional insights into application-level performance on the timeline view.
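When building OpenVINO from source, ITT instrumentation is controlled through a CMake option; enabling it looks roughly like the following (the exact option name may vary between OpenVINO versions):

cmake -DENABLE_PROFILING_ITT=ON <path-to-openvino-source>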

OpenVINO benchmarking (LLM)#

Large Language Models require a different benchmarking approach from that used for static models. A detailed description will be added soon.