Performance Hints and Thread Scheduling — OpenVINO™ documentation

To simplify the configuration of hardware devices, it is recommended to use the ov::hint::PerformanceMode::LATENCY and ov::hint::PerformanceMode::THROUGHPUT high-level performance hints. Both performance hints ensure optimal portability and scalability of applications across various platforms and models.

For additional details on the above configurations, refer to Multi-stream Execution.
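A performance hint is passed when compiling the model. The following is a minimal sketch, assuming a core (ov::Core), model, and device set up as in the examples later in this document:

    // Optimize for low latency of a single inference request
    auto compiled_latency = core.compile_model(model, device,
        ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY));

    // Optimize for overall throughput across many parallel inference requests
    auto compiled_throughput = core.compile_model(model, device,
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));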

Latency Hint#

In this scenario, the default setting of ov::hint::scheduling_core_type is determined by the model precision and the ratio of E-cores to P-cores, as shown in the table below.

Note

P-cores is short for Performance-cores and E-cores stands for Efficient-cores. These types of cores are available starting with the 12th Gen Intel® Core™ processors.

| Ratio of E-cores to P-cores | INT8 Model | FP32 Model |
| --- | --- | --- |
| E-cores / P-cores < 2 | P-cores | P-cores |
| 2 <= E-cores / P-cores < 4 | P-cores | P-cores and E-cores |
| 4 <= E-cores / P-cores | P-cores and E-cores | P-cores and E-cores |

Note

Both P-cores and E-cores may be used for any configuration starting with 14th Gen Intel® Core™ processors on Windows.

Then the default settings for low-level performance properties on Windows and Linux are as follows:

| Property | Windows | Linux |
| --- | --- | --- |
| ov::num_streams | 1 | 1 |
| ov::inference_num_threads | Equal to the number of P-cores, or P-cores + E-cores, on one socket | Equal to the number of P-cores, or P-cores + E-cores, on one socket |
| ov::hint::scheduling_core_type | Core type table of the latency hint (above) | Core type table of the latency hint (above) |
| ov::hint::enable_hyper_threading | No | No |
| ov::hint::enable_cpu_pinning | No / Not supported | Yes, except when P-cores and E-cores are used together |
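These defaults can be read back from the compiled model. The following is a non-authoritative sketch, reusing core, model, and device from the examples below, that queries the stream count, which should be 1 under the latency hint:

    // Sketch: inspect the defaults chosen under the LATENCY hint
    auto compiled_latency = core.compile_model(model, device,
        ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY));
    auto n_streams = compiled_latency.get_property(ov::num_streams);            // expected: 1
    auto n_threads = compiled_latency.get_property(ov::inference_num_threads);  // P-cores, or P-cores + E-cores, on one socket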

Throughput Hint#

In this scenario, thread scheduling first evaluates the memory pressure of the model being inferred on the current platform, and determines the number of threads per stream, as shown below.

| Memory Pressure | Threads per Stream |
| --- | --- |
| Low | 1 P-core or 2 E-cores |
| Medium | 2 |
| High | 3, 4, or 5 |

The value of ov::num_streams is then calculated by dividing ov::inference_num_threads by the number of threads per stream. For example, with 16 available threads and 2 threads per stream, ov::num_streams is 8. The default settings for low-level performance properties on Windows and Linux are as follows:

| Property | Windows | Linux |
| --- | --- | --- |
| ov::num_streams | Calculated as described above | Calculated as described above |
| ov::inference_num_threads | Number of P-cores and E-cores | Number of P-cores and E-cores |
| ov::hint::scheduling_core_type | P-cores and E-cores | P-cores and E-cores |
| ov::hint::enable_hyper_threading | Yes / No | Yes / No |
| ov::hint::enable_cpu_pinning | No | Yes |
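The derived values can likewise be inspected on the compiled model. A sketch, under the same assumptions as the examples below (and requiring <iostream> for printing):

    // Sketch: read back the stream and thread counts derived under THROUGHPUT
    auto compiled_throughput = core.compile_model(model, device,
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    auto n_streams = compiled_throughput.get_property(ov::num_streams);
    auto n_threads = compiled_throughput.get_property(ov::inference_num_threads);
    std::cout << "streams: " << static_cast<int32_t>(n_streams)
              << ", threads: " << n_threads << std::endl;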

Note

By default, different core types are not mixed within a single stream in this scenario. Likewise, cores from different NUMA nodes are not mixed within a single stream.

Multi-Threading Optimization#

The following properties can be used to limit the available CPU resources for model inference. If the platform or operating system supports this behavior, OpenVINO Runtime will perform multi-threading scheduling based on the limited available CPU resources.

Python

    # Use one logical processor for inference
    compiled_model_1 = core.compile_model(
        model=model,
        device_name=device_name,
        config={properties.inference_num_threads(): 1},
    )

    # Use logical processors of Efficient-cores for inference on hybrid platforms
    compiled_model_2 = core.compile_model(
        model=model,
        device_name=device_name,
        config={properties.hint.scheduling_core_type(): properties.hint.SchedulingCoreType.ECORE_ONLY},
    )

    # Use one logical processor per CPU core for inference when hyper-threading is on
    compiled_model_3 = core.compile_model(
        model=model,
        device_name=device_name,
        config={properties.hint.enable_hyper_threading(): False},
    )

C++

    // Use one logical processor for inference
    auto compiled_model_1 = core.compile_model(model, device, ov::inference_num_threads(1));

    // Use logical processors of Efficient-cores for inference on hybrid platform
    auto compiled_model_2 = core.compile_model(model, device, ov::hint::scheduling_core_type(ov::hint::SchedulingCoreType::ECORE_ONLY));

    // Use one logical processor per CPU core for inference when hyper threading is on
    auto compiled_model_3 = core.compile_model(model, device, ov::hint::enable_hyper_threading(false));

Note

ov::hint::scheduling_core_type and ov::hint::enable_hyper_threading are only supported on Intel® x86-64 CPUs on Linux and Windows in the current release.

In some use cases, OpenVINO Runtime will enable CPU thread pinning by default for better performance. Users can also turn this feature on or off using the property ov::hint::enable_cpu_pinning. Disabling thread pinning may be beneficial in complex applications where several workloads are executed in parallel.

Python

    # Disable CPU thread pinning for inference when the system supports it
    compiled_model_4 = core.compile_model(
        model=model,
        device_name=device_name,
        config={properties.hint.enable_cpu_pinning(): False},
    )

C++

    // Disable CPU thread pinning for inference when the system supports it
    auto compiled_model_4 = core.compile_model(model, device, ov::hint::enable_cpu_pinning(false));

For details on multi-stream execution, check the optimization guide.

Composability of Different Threading Runtimes#

OpenVINO is built with the oneTBB threading library by default. oneTBB provides a feature called worker_wait, similar to OpenMP busy-wait, which makes OpenVINO inference threads actively wait for a while after a task is done. The intention is to avoid CPU inactivity during the transition time between inference tasks.

In a pipeline that runs OpenVINO inference on the CPU alongside other sequential application logic, using different threading runtimes (e.g., oneTBB for OpenVINO inference and OpenMP for the rest of the application logic) causes both runtimes to occupy CPU cores for additional time after their tasks are done, leading to overhead.

Recommended solutions: