Optimizing for Throughput — OpenVINO™ documentation

As described in the section on latency-specific optimizations, one possible use case is delivering every single request with minimal delay. Throughput, on the other hand, concerns inference scenarios in which potentially large numbers of inference requests are served simultaneously to improve resource utilization.

The associated increase in latency is not linearly dependent on the number of requests executed in parallel. A trade-off between overall throughput and serial performance of individual requests can be achieved with the right performance configuration of OpenVINO.
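One way to reason about this trade-off (an illustration added here, not part of the original text) is Little's law, which relates the number of requests in flight to throughput and per-request latency. The numbers below are hypothetical:

```python
# Illustrative arithmetic (not OpenVINO API). Little's law:
#   concurrency = throughput * average_latency
def throughput_fps(concurrency: int, avg_latency_s: float) -> float:
    """Sustained throughput for a given number of in-flight requests."""
    return concurrency / avg_latency_s

# One request at a time, 10 ms per inference -> 100 FPS.
serial = throughput_fps(1, 0.010)

# Four requests in flight; per-request latency grows to a hypothetical
# 25 ms, yet overall throughput rises to 160 FPS.
parallel = throughput_fps(4, 0.025)

print(serial, parallel)  # 100.0 160.0
```

The point of the sketch: latency per request grew by 2.5x, but throughput still grew by 1.6x, which is the non-linear trade-off described above.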

Basic and Advanced Ways of Leveraging Throughput

There are two ways of leveraging throughput with individual devices:

- A basic (high-level) flow with performance hints, which is inherently portable and future-proof.
- An advanced (low-level) approach of explicit batching and streams.

In both cases, the application should be designed to execute multiple inference requests in parallel, as described in the following section.
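Keeping several requests in flight at once is the common requirement of both approaches. A minimal stdlib sketch of the pattern, where `infer` is a stand-in for a real inference request (not OpenVINO API):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for one inference request; a real application would invoke
# an OpenVINO infer request here instead of squaring the input.
def infer(sample: int) -> int:
    return sample * sample

samples = list(range(8))

# Keep several requests in flight at once instead of running them serially.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(infer, samples))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```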

Throughput-Oriented Application Design

In general, most throughput-oriented inference applications should:

- Expose substantial amounts of input parallelism (e.g. process multiple video or audio sources, text documents, etc.).
- Decompose the processing into multiple concurrent inference requests and run these requests in parallel.
- Use the asynchronous API with callbacks, to avoid any dependency on the completion order of the requests.

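The callback-driven style can be sketched with the standard library alone. `AsyncRequest` below is a toy stand-in (not OpenVINO API): `start_async()` kicks off work on a worker thread and invokes a callback when the result is ready, so the application never blocks on the completion order of any particular request:

```python
import threading

# Toy stand-in for an asynchronous inference request (not OpenVINO API).
class AsyncRequest:
    def __init__(self, callback):
        self.callback = callback

    def start_async(self, sample):
        # Run the "inference" (here: doubling) on a worker thread and
        # hand the result to the completion callback.
        threading.Thread(target=lambda: self.callback(sample * 2)).start()

results = []
done = threading.Event()

def on_done(result):
    results.append(result)       # list.append is thread-safe in CPython
    if len(results) == 4:
        done.set()               # all requests completed, in any order

requests = [AsyncRequest(on_done) for _ in range(4)]
for req, sample in zip(requests, [1, 2, 3, 4]):
    req.start_async(sample)

done.wait()  # the main thread is free to prepare more inputs meanwhile
print(sorted(results))  # [2, 4, 6, 8]
```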
Multi-Device Execution

OpenVINO offers the automatic, scalable multi-device inference mode, which is a simple, application-transparent way to improve throughput. There is no need to re-architect existing applications for any explicit multi-device support: no explicit network loading to each device, no separate per-device queues, no additional logic to balance inference requests between devices, etc. From the application's point of view, multi-device behaves like any other device, as it manages all processes internally. Just like with other throughput-oriented scenarios, there are several major prerequisites for optimal multi-device performance:

Keep in mind that the resulting performance is usually a fraction of the “ideal” (plain sum) value when the devices compete for certain resources, such as the memory bandwidth shared between the CPU and iGPU.
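Conceptually, multi-device execution spreads the application's requests across the physical devices behind one logical device. A toy round-robin dispatcher (purely illustrative; OpenVINO's actual scheduler also accounts for per-device capacity and load):

```python
import itertools

# Two hypothetical physical devices behind one logical device.
devices = ["CPU", "GPU"]

# Round-robin dispatch: the application submits requests to a single
# logical device while they are alternately assigned to physical ones.
assignment = {}
rr = itertools.cycle(devices)
for request_id in range(6):
    assignment[request_id] = next(rr)

print(assignment)
# {0: 'CPU', 1: 'GPU', 2: 'CPU', 3: 'GPU', 4: 'CPU', 5: 'GPU'}
```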

Note

While the legacy approach of optimizing the parameters of each device separately works, the Automatic Device Selection allows you to configure all devices (that are part of the specific multi-device configuration) at once.