Manage deep learning models with OpenVINO Model Server | Red Hat Developer

OpenVINO is a toolkit developed by Intel for deploying and optimizing AI models across various hardware platforms. At its core, model serving involves three main components: the model to deploy, an inference runtime, and a model server. In this context, OpenVINO is the runtime, which supports multiple model types, while OpenVINO Model Server (OVMS) is built on top of that runtime and is designed to streamline the deployment and management of deep learning models in production environments.

Model-serving engines like OVMS facilitate the deployment and management of models, abstracting hardware-specific complexities and enabling seamless integration with existing infrastructure (for example, by exposing metrics). Inference runtimes are responsible for executing inference requests in real time and making decisions based on incoming data. Together, they form a robust framework for efficient and scalable AI deployment.

Key features

Architecture overview

OpenVINO Model Server comprises multiple components, including an inference engine and a hardware abstraction layer. It interacts with client applications via API protocols and uses an internal model registry to pull models from storage. These components work together to ensure efficient model deployment, execution, and management, as illustrated in Figure 1.


Figure 1: Architecture overview of OpenVINO Model Server.

The components of OVMS include:

Model conversion to OpenVINO intermediate representation

Preparing models for serving with OpenVINO involves a two-step process. First, the models are converted into an OpenVINO format optimized and sized for specific hardware types. Then, the models are loaded into OVMS.

  1. Conversion process: Models trained in frameworks such as TensorFlow and PyTorch, or exported to ONNX, are first converted to the OpenVINO Intermediate Representation (IR) format using the OpenVINO Model Optimizer. The Model Optimizer applies techniques such as model quantization, graph optimization, and transformation to improve performance and efficiency and to ensure compatibility with OpenVINO, as sketched in the example after this list.
  2. Loading IR models: Once converted, the OpenVINO IR models are loaded into OVMS. This enables the models to be deployed efficiently across different hardware platforms supported by OpenVINO.
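To make the conversion step concrete, here is a minimal sketch using OpenVINO's Python conversion API (openvino.convert_model and openvino.save_model, available in recent releases); the source model file and target directory layout are illustrative placeholders:

import openvino as ov

# Convert a trained model (here, a hypothetical ONNX export) into OpenVINO IR.
ov_model = ov.convert_model("model.onnx")

# Save the IR pair (model.xml + model.bin) into the versioned directory layout
# that OVMS expects: <model_name>/<version>/
ov.save_model(ov_model, "models/my_model/1/model.xml")

The resulting models/my_model directory can then be placed in the storage location that OVMS is configured to read from.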

Supported frameworks and models

OpenVINO Model Server supports a wide range of deep learning frameworks and model formats, enabling seamless deployment and integration.

Deployment scenarios

OpenVINO Model Server offers significant flexibility by supporting deployment across various hardware footprints. This adaptability enables OVMS to address diverse use cases and environments, ensuring optimal performance and efficiency. The deployment options include:

Performance and optimization

OpenVINO provides various optimization techniques to enhance model performance and efficiency. These include model quantization, which reduces model size and computational complexity. OVMS further improves throughput by implementing batch processing, which allows the server to process multiple inference requests simultaneously. Relevant tuning mechanisms include:

Automatic batching:

Asynchronous API:

Number of parallel inference streams:

CPU pinning and CPU affinity:

Benchmarking: OpenVINO Model Server performance benchmarks demonstrate its efficiency and scalability across different hardware platforms and inference workloads.
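These tuning knobs (automatic batching, the asynchronous API, the number of parallel streams, and CPU pinning) also exist at the level of the OpenVINO runtime itself. The following is a minimal sketch using the standard openvino Python package and its string property names; OVMS exposes analogous parameters through its server configuration, and exact property names can vary between releases:

import openvino as ov

core = ov.Core()
model = core.read_model("models/my_model/1/model.xml")  # hypothetical IR path

# Compile with throughput-oriented hints: the runtime can batch requests
# automatically and distribute them across several parallel streams.
compiled = core.compile_model(
    model,
    "CPU",
    {
        "PERFORMANCE_HINT": "THROUGHPUT",  # favor aggregate throughput
        "NUM_STREAMS": "4",                # number of parallel inference streams
        "ENABLE_CPU_PINNING": True,        # pin inference threads to CPU cores
    },
)

# Asynchronous API: queue several requests without blocking on each result.
# infer_queue.start_async(...) is called per request, and infer_queue.wait_all()
# collects the results once all submitted requests have completed.
infer_queue = ov.AsyncInferQueue(compiled, jobs=4)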

Security considerations

Ensuring the security of models and data is crucial when deploying AI solutions. OpenVINO Model Server incorporates several security measures to protect sensitive information and maintain the integrity of its operations.
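As one illustration, when the model server (or a gateway or route in front of it) is configured for TLS or mutual TLS, clients verify the server certificate and can present their own. This is a generic sketch using the requests library with placeholder endpoint and certificate paths, not an OVMS-specific API:

import requests

input_data = [...]  # placeholder: payload prepared for the model's expected input

# Call a TLS-protected inference endpoint, verifying the server certificate
# and presenting a client certificate and key for mutual TLS.
response = requests.post(
    "https://<secure-endpoint>/v1/models/<model-name>:predict",
    json={"instances": input_data},
    verify="ca.crt",                    # CA bundle used to verify the server
    cert=("client.crt", "client.key"),  # client certificate and key (mTLS)
)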

Integration and extensibility

APIs and SDKs: OVMS offers APIs and SDKs for integrating with custom applications and frameworks. These interfaces enable seamless integration of OVMS into existing workflows and environments, allowing users to leverage its capabilities without disrupting their development processes.

Custom plugins: OVMS supports custom plugins for extending its functionality and integrating with specialized hardware or software components. Users can develop and deploy custom plugins to meet specific requirements or optimize performance for their use cases.
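As a sketch of SDK-level integration, the ovmsclient Python package distributed for OVMS can call the gRPC interface directly; the endpoint, model name, input name, and input shape below are illustrative assumptions and must match the served model's metadata:

import numpy as np
from ovmsclient import make_grpc_client

# Connect to the OVMS gRPC endpoint and run a prediction.
client = make_grpc_client("<ovms-host>:9000")

# The input name and shape must match what the served model expects.
inputs = {"input": np.zeros((1, 3, 224, 224), dtype=np.float32)}
result = client.predict(inputs=inputs, model_name="<model-name>")
print(result)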

OpenShift AI integration with OpenVINO and KServe

Red Hat OpenShift AI is an integrated MLOps platform for building, training, deploying, and monitoring predictive and generative AI models at scale across hybrid cloud environments. OpenShift AI uses KServe, a flexible machine learning model serving framework, to serve and support multiple generative AI inference runtimes. OpenShift AI also includes OpenVINO Model Server as one of its supported serving engines and model formats.

How do OpenShift AI users benefit from OpenVINO?

OpenShift AI users can significantly enhance their AI/ML workflows by integrating OpenVINO for optimized model performance and KServe for scalable model serving. This combination provides flexibility, efficiency, and comprehensive monitoring capabilities.

Workflow with KServe and OpenVINO

To illustrate this integration, below is a diagram and a detailed explanation of the workflow (Figure 2).

Figure 2: The KServe and OpenVINO workflow. Client applications send requests through an ingress/load balancer to the KServe InferenceService, which routes them to a pod running OVMS with the OpenVINO runtime; inference then executes on the available hardware accelerators.

Practical example: Running an inference request and getting metrics

Step-by-step guide:

  1. Model optimization:
    • Train your model: Use a preferred framework such as TensorFlow, PyTorch, or ONNX.
    • Optimize the trained model: Use the OpenVINO Model Optimizer to convert your model into the Intermediate Representation (IR) format.
  2. Deploying with KServe on OpenShift:
    • Install OpenVINO toolkit operator: Use the OpenShift console to deploy the OpenVINO toolkit operator for managing model deployments.
    • Deploy OVMS: Deploy the OpenVINO Model Server on OpenShift using KServe to manage model inference requests (see the sketch after these steps).
  3. Running inference:
    • Send inference requests: Use client applications to send inference requests to the KServe endpoints.
    • Monitor metrics: Use KServe's built-in monitoring tools to gather metrics on model performance and resource utilization.
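For the deployment step, one way to create the KServe resource programmatically is through the Kubernetes Python client. The following is a rough sketch only; the namespace, resource names, model format label, runtime name, and storage URI are assumptions that depend on how OpenShift AI and the OVMS serving runtime are set up in your cluster:

from kubernetes import client, config

# Load credentials from the local kubeconfig (e.g., after `oc login`).
config.load_kube_config()
api = client.CustomObjectsApi()

# A KServe InferenceService pointing at an OpenVINO IR model in object storage.
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "my-model", "namespace": "my-project"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "openvino_ir"},
                "runtime": "ovms",  # assumes an OVMS ServingRuntime is available
                "storageUri": "s3://<bucket>/my_model",
            }
        }
    },
}

api.create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="my-project",
    plural="inferenceservices",
    body=inference_service,
)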

Sample inference request

This inference request demonstrates how to send data gathered from the application to the model served behind a remote API endpoint, whether on premises or in the cloud.

Example code:

import requests

# Send a prediction request via the KServe v1 REST protocol (input_data is the prepared payload).
response = requests.post("http://<kserve-endpoint>/v1/models/<model-name>:predict",
                         json={"instances": input_data})
print(response.json())
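To cover the metrics part of this example, OVMS can expose Prometheus-format metrics on its REST interface when metrics are enabled in the server configuration. A minimal sketch of scraping them follows; the endpoint is a placeholder, and the exact path depends on how the service is exposed:

import requests

# Fetch Prometheus-format metrics exposed by OVMS on its REST port
# (requires metrics to be enabled in the OVMS configuration).
metrics = requests.get("http://<ovms-rest-endpoint>/metrics")
print(metrics.text)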

Conclusion

OpenVINO Model Server (OVMS) streamlines the deployment and management of deep learning models across various environments by leveraging the powerful optimization capabilities of the OpenVINO toolkit. With its support for popular frameworks like TensorFlow, PyTorch, Caffe, and ONNX, as well as multiple model formats, OVMS offers flexibility and ease of integration. Its robust architecture includes features such as model quantization, which reduces computational complexity, and batch processing to enhance throughput, making it ideal for handling high-volume inference workloads. Furthermore, OVMS ensures the security of models and data through comprehensive authentication, authorization, and encryption mechanisms. Whether deployed on-premise, in the cloud, or at the edge, OVMS provides scalable, efficient, and secure AI deployment, enabling organizations to harness the full potential of their AI models with minimal complexity and maximum performance.

By integrating OVMS with Red Hat OpenShift AI and the KServe model serving framework embedded in OpenShift AI, users can achieve enhanced flexibility, scalability, and monitoring capabilities, making it an ideal solution for modern AI/ML workflows. To explore how the joint Red Hat and Intel AI solution can further benefit your AI ecosystem, check out the Red Hat and Intel AI Solution Brief.

Last updated: January 15, 2025