Automatic Batching — OpenVINO™ documentation

The Automatic Batching Execution mode (or Auto-batching for short) performs automatic batching on-the-fly to improve device utilization by grouping inference requests together, without programming effort from the user. With Automatic Batching, gathering the input and scattering the output from the individual inference requests required for the batch happen transparently, without affecting the application code.

Auto-batching can be used directly as a virtual device or as an option for inference on CPU/GPU/NPU (by means of a configuration hint). These two ways let the user enable the BATCH device explicitly or implicitly, with the underlying logic remaining the same. One difference is that the CPU device does not support implicit enabling of the BATCH device: a command such as ./benchmark_app -m <model> -d CPU -hint tput will not apply the BATCH device implicitly, while ./benchmark_app -m <model> -d "BATCH:CPU(16)" loads it explicitly.
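The explicit form can be sketched as follows. The helper function below is hypothetical (only the "BATCH:<device>" / "BATCH:<device>(<n>)" device-string format itself comes from OpenVINO), and the actual compile call is shown commented out because it requires an installed OpenVINO runtime and a model:

```python
# Hypothetical helper: compose an explicit BATCH device string.
# Only the "BATCH:<device>" / "BATCH:<device>(<n>)" format is OpenVINO's;
# the function itself is an illustration.
def batch_device(device, batch_size=None):
    if batch_size is None:
        return f"BATCH:{device}"            # automatic batch size selection
    return f"BATCH:{device}({batch_size})"  # explicit batch size

print(batch_device("CPU", 16))  # BATCH:CPU(16) - CPU requires the explicit form
print(batch_device("GPU"))      # BATCH:GPU - batch size chosen automatically
# compiled_model = core.compile_model(model, batch_device("CPU", 16))
```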

Auto-batching primarily targets existing code written for inferencing many requests, each with batch size 1. To get the corresponding performance improvements, the application must run multiple inference requests simultaneously. Auto-batching can also be used via a particular virtual device.

This article provides a preview of the Automatic Batching function, including how it works, its configurations, and testing performance.

How Automatic Batching Works#

Enabling Automatic Batching

Batching is a straightforward way of leveraging the compute power of GPU and saving on communication overheads. Automatic Batching is “implicitly” triggered on the GPU when ov::hint::PerformanceMode::THROUGHPUT is specified for the ov::hint::performance_mode property for the compile_model or set_property calls.

Python

import openvino.properties as props
import openvino.properties.hint as hints

config = {hints.performance_mode: hints.PerformanceMode.THROUGHPUT}
compiled_model = core.compile_model(model, "GPU", config)

C++

auto compiled_model = core.compile_model(model, "GPU", ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));

To enable Auto-batching in legacy applications that are not aware of performance hints, use the explicit device notation, such as BATCH:GPU.

Disabling Automatic Batching

Auto-batching can be disabled (for example, for the GPU device) to prevent it from being triggered by ov::hint::PerformanceMode::THROUGHPUT. To do that, set ov::hint::allow_auto_batching to false in addition to the ov::hint::performance_mode, as shown below:

Python

# disabling the automatic batching
# leaving intact the other configuration options that the device selects for the 'throughput' hint
config = {hints.performance_mode: hints.PerformanceMode.THROUGHPUT,
          hints.allow_auto_batching: False}
compiled_model = core.compile_model(model, "GPU", config)

C++

// disabling the automatic batching
// leaving intact the other configuration options that the device selects for the 'throughput' hint
auto compiled_model = core.compile_model(model,
                                         "GPU",
                                         ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
                                         ov::hint::allow_auto_batching(false));

Configuring Automatic Batching#

Following the OpenVINO naming convention, the batching device is assigned the label of BATCH. The configuration options are as follows:

| Parameter name | Parameter description | Examples |
|----------------|-----------------------|----------|
| AUTO_BATCH_DEVICE | The name of the device to apply Automatic Batching to, with an optional batch size value in brackets. | BATCH:GPU triggers automatic batch size selection. BATCH:GPU(4) specifies the batch size directly. |
| ov::auto_batch_timeout | The timeout value, in ms (1000 by default). | Reduce the value (e.g., to "100") to avoid a performance penalty when the data arrives too unevenly, or, on the contrary, make it large enough to accommodate input preparation (e.g., when it is a serial process). |
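As a sketch, the timeout can be passed via its raw string key ("AUTO_BATCH_TIMEOUT" is the key behind ov::auto_batch_timeout; the value is in milliseconds). The 100 ms value below is just an example, and the compile call is commented out because it needs a GPU-enabled runtime:

```python
# Sketch: shortening the auto-batch timeout so uneven input arrival
# does not stall the batched execution. "AUTO_BATCH_TIMEOUT" is the
# raw key behind ov::auto_batch_timeout; 100 ms is an example value.
config = {"AUTO_BATCH_TIMEOUT": "100"}
# compiled_model = core.compile_model(model, "BATCH:GPU", config)

print(config["AUTO_BATCH_TIMEOUT"])  # 100
```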

Automatic Batch Size Selection#

In both the THROUGHPUT hint and the explicit BATCH device cases, the optimal batch size is selected automatically: the implementation queries the ov::optimal_batch_size property from the device, passing the model graph as a parameter. The actual value depends on the model and device specifics, for example, the on-device memory for dGPUs. Support for Auto-batching is not limited to GPU. However, if a device does not support ov::optimal_batch_size yet, an explicit batch size must be specified for Auto-batching to work, e.g., BATCH:<device>(16).

This “automatic batch size selection” works on the presumption that the application queries ov::optimal_number_of_infer_requests, creates that number of requests, and runs them simultaneously:

Python

# when the batch size is automatically selected by the implementation
# it is important to query/create and run a sufficient number of requests
config = {hints.performance_mode: hints.PerformanceMode.THROUGHPUT}
compiled_model = core.compile_model(model, "GPU", config)
num_requests = compiled_model.get_property(props.optimal_number_of_infer_requests)

C++

// when the batch size is automatically selected by the implementation
// it is important to query/create and run a sufficient number of requests
auto compiled_model = core.compile_model(model, "GPU", ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
auto num_requests = compiled_model.get_property(ov::optimal_number_of_infer_requests);
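The pattern of issuing all of those requests at once can be sketched generically. In a real application, each call would be an OpenVINO async inference (e.g., via ov.AsyncInferQueue); here, infer is a stub so the scheduling pattern itself is runnable:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: launch num_requests inferences concurrently so Auto-batching
# can gather a full batch. `infer` is a stand-in for a real async
# inference call (e.g., submitting to an ov.AsyncInferQueue).
def run_concurrently(num_requests, infer):
    with ThreadPoolExecutor(max_workers=num_requests) as pool:
        # pool.map preserves the order of the submitted requests
        return list(pool.map(infer, range(num_requests)))

results = run_concurrently(4, lambda i: i * 2)  # stub "inference"
print(results)  # [0, 2, 4, 6]
```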

Optimizing Performance by Limiting Batch Size#

If not enough inputs are collected, the timeout value makes the transparent execution fall back to the execution of individual requests. This value can be configured via the AUTO_BATCH_TIMEOUT property. Because the timeout adds to the execution time of the requests, it can heavily penalize performance. To avoid this when your parallel slack is bounded, provide OpenVINO with an additional hint.

For example, when the application processes only 4 video streams, there is no need to use a batch larger than 4. The most future-proof way to communicate this limit on parallelism is to equip the performance hint with the optional ov::hint::num_requests configuration key set to 4. This limits the batch size for the GPU and the number of inference streams for the CPU, as each device interprets ov::hint::num_requests while converting the hint into actual device configuration options:

Python

config = {hints.performance_mode: hints.PerformanceMode.THROUGHPUT,
          hints.num_requests: "4"}
# limiting the available parallel slack for the 'throughput'
# so that certain parameters (like selected batch size) are automatically accommodated accordingly 
compiled_model = core.compile_model(model, "GPU", config)

C++

// limiting the available parallel slack for the 'throughput' hint via ov::hint::num_requests
// so that certain parameters (like the selected batch size) are automatically accommodated accordingly
auto compiled_model = core.compile_model(model,
                                         "GPU",
                                         ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
                                         ov::hint::num_requests(4));

For the explicit usage, you can limit the batch size by using BATCH:GPU(4), where 4 is the number of requests running in parallel.

Automatic Batching as an explicit device#

The examples below show how the BATCH device can be applied explicitly to perform inference:

./benchmark_app -m <model> -d "BATCH:GPU"
./benchmark_app -m <model> -d "BATCH:GPU(16)"
./benchmark_app -m <model> -d "BATCH:CPU(16)"

Automatic Batching as underlying device configured to other devices#

In the following examples, the BATCH device is configured for another device when the tput/ctput mode is used:

./benchmark_app -m <model> -d GPU -hint tput
./benchmark_app -m <model> -d AUTO -hint tput
./benchmark_app -m <model> -d AUTO -hint ctput
./benchmark_app -m <model> -d AUTO:GPU -hint ctput

Note

If you run ./benchmark_app, do not set the batch size with -b <batch_size>; otherwise, Automatic Batching will not be applied.

Other Performance Considerations#

To achieve the best performance with Automatic Batching, the application should run a sufficient number of inference requests in parallel (at least ov::optimal_number_of_infer_requests), communicate any bound on the parallel slack via ov::hint::num_requests, and tune ov::auto_batch_timeout to the arrival pattern of the inputs.

Limitations#

The following notes describe limitations of the current Automatic Batching implementation:

Note

The BATCH device supports GPU by default, but the GPU may still not trigger auto-batching in tput mode if the model or the available GPU memory does not allow it. This means you should check the supported_properties of the compiled model (compiled with the GPU tput hint) before setting or getting the ov::auto_batch_timeout property.
To make sure the BATCH device is involved, pass an ov::Model to core.compile_model. A model file path passed to core.compile_model is forwarded directly to the GPU plugin for performance reasons, without involving BATCH.
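The distinction in the note can be illustrated with a hypothetical check; the helper name is made up, and only the rule itself (ov.Model object vs. file-path string) comes from the note above:

```python
# Hypothetical illustration of the note: an ov.Model object (obtained
# from core.read_model) goes through the BATCH logic, while a plain
# file-path string is forwarded to the GPU plugin directly.
def involves_batch_device(model_arg):
    return not isinstance(model_arg, str)  # path strings bypass BATCH

print(involves_batch_device("model.xml"))  # False: path goes straight to the plugin
# model = core.read_model("model.xml")     # an ov.Model object
# involves_batch_device(model)             # would be True
```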

Testing Performance with Benchmark_app#

Using the benchmark_app sample is the best way to evaluate the performance of Automatic Batching.

Note that benchmark_app performs a warm-up run of a single request. As Auto-batching requires significantly more requests to fill a batch, this warm-up run hits the default timeout value (1000 ms), as reported in the following example:

[ INFO ] First inference took 1000.18ms

This value is also exposed in the final execution statistics when benchmark_app exits:

[ INFO ] Latency:
[ INFO ]    Max: 1000.18 ms

This is NOT the actual latency of the batched execution, so it is recommended to refer to other metrics in the same log, for example, the “Median” or “Average” latency.

Additional Resources#