PyTorch on Google Cloud: How to deploy PyTorch models on Vertex AI

Rajesh Thallam

Solutions Architect, Generative AI Solutions

Vaibhav Singh

Group Product Manager

This article is the next installment in the series on PyTorch on Google Cloud using Vertex AI. In the preceding article, we fine-tuned a Hugging Face Transformers model for a sentiment classification task using PyTorch on the Vertex Training service. In this post, we show how to deploy a PyTorch model on the Vertex Prediction service to serve predictions from trained model artifacts.

Now let’s walk through deploying a PyTorch model using TorchServe as a custom container, with the model artifacts deployed to a Vertex Endpoint. You can find the accompanying code for this blog post in the GitHub repository and the Jupyter Notebook.

Deploying a PyTorch Model on Vertex Prediction Service

Vertex Prediction service is Google Cloud's managed model serving platform. As a managed service, the platform handles infrastructure setup, maintenance, and management. Vertex Prediction supports both CPU and GPU inferencing and offers a selection of n1-standard machine shapes in Compute Engine, letting you customize the scale unit to fit your requirements. Together, these capabilities make the Vertex Prediction service an effective way to deploy your models to serve predictions.

TorchServe is the recommended framework to deploy PyTorch models in production. TorchServe’s CLI makes it easy to deploy a PyTorch model locally, and the model server can also be packaged as a container that the Vertex Prediction service scales out. The custom container capability of Vertex Prediction provides a flexible way to define the environment where the TorchServe model server runs.

In this blog post, we deploy a container running a TorchServe model server on the Vertex Prediction service to serve predictions from a fine-tuned transformer model from Hugging Face for the sentiment classification task. You can then send input requests with text to a Vertex Endpoint to classify sentiment as positive or negative.

https://storage.googleapis.com/gweb-cloudblog-publish/images/image3_urDYHlo.max-500x500.png

Figure 1. Serving with custom containers on Vertex Prediction service

Following are the steps to deploy a PyTorch model on Vertex Prediction:

  1. Download the trained model artifacts.
  2. Package the trained model artifacts including default or custom handlers by creating an archive file using the Torch Model Archiver tool.
  3. Build a custom container (Docker) compatible with the Vertex Prediction service to serve the model using TorchServe.
  4. Upload the model with the custom container image as a Vertex Model resource.
  5. Create a Vertex Endpoint and deploy the model resource to the endpoint to serve predictions.

1. Download the trained model artifacts

Model artifacts are the files created by the training application code that are required to serve predictions. TorchServe expects model artifacts to be in either a saved model binary (.bin) format or a traced model (.pth or .pt) format. In the previous post, we trained a Hugging Face Transformer model on the Vertex Training service, saved the model as a model binary (.bin) by calling the .save_model() method, and then saved the model artifacts to a Cloud Storage bucket.

Based on the training job name, you can get the location of the model artifacts from Vertex Training using the Cloud Console or the gcloud ai custom-jobs describe command, and then download the artifacts from the Cloud Storage bucket.
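As a rough sketch of the download step, the helper below copies every object under a Cloud Storage prefix to a local directory using the google-cloud-storage client library. The function names and the example URI are hypothetical, and the sketch assumes the artifact location has already been looked up from the training job.

```python
from pathlib import Path
from urllib.parse import urlparse


def split_gcs_uri(uri: str) -> tuple[str, str]:
    """Split 'gs://bucket/path/to/dir' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a Cloud Storage URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_artifacts(gcs_uri: str, local_dir: str) -> list[Path]:
    """Download every object under gcs_uri into local_dir (hypothetical helper)."""
    # Imported lazily so split_gcs_uri works without the client library installed.
    from google.cloud import storage

    bucket_name, prefix = split_gcs_uri(gcs_uri)
    client = storage.Client()
    downloaded = []
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):
            continue  # skip folder placeholder objects
        # Path relative to the prefix; fall back to the file name for a single object.
        relative = blob.name[len(prefix):].lstrip("/") or Path(blob.name).name
        target = Path(local_dir) / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(target))
        downloaded.append(target)
    return downloaded
```

A call such as `download_artifacts("gs://my-bucket/my-job/model/", "./model")` (bucket and path are placeholders) would mirror the artifact directory locally, ready for packaging.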

2. Create a custom model handler to handle prediction requests

TorchServe uses a base handler module to pre-process the input before it is fed to the model and to post-process the model output before the prediction response is sent. TorchServe provides default handlers for common use cases such as image classification, object detection, segmentation and text classification. For the sentiment analysis task, we will create a custom handler because the input text needs to be tokenized using the same tokenizer used at training time to avoid training-serving skew.

The custom handler presented here does the following:
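A minimal sketch of a handler along these lines is shown here; it is an illustration under stated assumptions, not the handler from the accompanying repository. It uses TorchServe's class-level entry point (a class exposing initialize and handle) rather than extending the base handler, so that the text helpers stay importable without TorchServe installed. The class name, the label mapping, and the assumption that the tokenizer and model files sit together in the model directory are all hypothetical.

```python
class SentimentHandler:
    """Class-level TorchServe entry point: initialize() once, handle() per batch."""

    # Hypothetical label mapping; in practice it should match the training setup.
    LABELS = {0: "Negative", 1: "Positive"}

    def __init__(self):
        self.initialized = False

    def initialize(self, context):
        # Heavy imports are deferred so the text helpers work without them.
        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        # Assumption: model weights, config, and tokenizer files were all archived together.
        model_dir = context.system_properties.get("model_dir")
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        self.model.to(self.device).eval()
        self.initialized = True

    @staticmethod
    def extract_text(row) -> str:
        """Pull the raw text out of one request row, decoding bytes to str."""
        text = row.get("data") or row.get("body")
        if isinstance(text, (bytes, bytearray)):
            text = text.decode("utf-8")
        return text

    def preprocess(self, requests):
        # Tokenize with the same tokenizer used in training to avoid skew.
        texts = [self.extract_text(row) for row in requests]
        return self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    def inference(self, inputs):
        import torch

        inputs = {name: tensor.to(self.device) for name, tensor in inputs.items()}
        with torch.no_grad():
            logits = self.model(**inputs).logits
        return logits.argmax(dim=-1).tolist()

    def postprocess(self, class_ids):
        # Map predicted class indices to human-readable labels.
        return [self.LABELS.get(class_id, str(class_id)) for class_id in class_ids]

    def handle(self, data, context):
        return self.postprocess(self.inference(self.preprocess(data)))
```

Keeping tokenization inside the handler means the serving path and the training path share one tokenizer configuration, which is the point of using a custom handler here.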

3. Create custom container image with TorchServe to serve predictions

When deploying a PyTorch model on the Vertex Prediction service, you must use a custom container image that runs an HTTP server, such as TorchServe in this case. The custom container image must meet the requirements to be compatible with the Vertex Prediction service. We create a Dockerfile with TorchServe as the base image that meets custom container image requirements and performs the following steps:

Let’s understand the functionality of TorchServe and Torch Model Archiver tools in these steps.

Torch Model Archiver

TorchServe provides a model archive utility to package a PyTorch model for deployment, and the resulting model archive file is used by TorchServe at serving time. The following torch-model-archiver command is added to the Dockerfile to generate a model archive file for the text classification model:
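As an illustration only, the Python sketch below assembles an equivalent torch-model-archiver invocation and runs it with subprocess. In the article the command lives in the Dockerfile as a shell command; the helper names, file names, and paths used here are hypothetical placeholders.

```python
import subprocess


def build_archiver_command(model_name, serialized_file, handler,
                           export_path, extra_files=(), version="1.0"):
    """Assemble the torch-model-archiver argument list (hypothetical helper)."""
    cmd = [
        "torch-model-archiver",
        "--force",                            # overwrite an existing archive
        "--model-name", model_name,           # name of the resulting .mar file
        "--version", version,
        "--serialized-file", serialized_file, # saved model weights
        "--handler", handler,                 # default handler name or custom handler file
        "--export-path", export_path,         # directory that receives the archive
    ]
    if extra_files:
        # Extra files (tokenizer, config, label mapping) are passed comma-separated.
        cmd += ["--extra-files", ",".join(extra_files)]
    return cmd


def run_archiver(**kwargs):
    """Run the archiver; requires the torch-model-archiver CLI to be installed."""
    subprocess.run(build_archiver_command(**kwargs), check=True)
```

Building the argument list separately from running it makes the command easy to inspect or log before it is executed.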

TorchServe

TorchServe wraps PyTorch models into a set of REST APIs served by an HTTP web server. Adding the torchserve command to the CMD or ENTRYPOINT of the custom container launches this server. In this article we will only explore the prediction and health check APIs. The Explainable AI API for PyTorch models on Vertex endpoints is currently supported only for tabular data.

**TorchServe Config** (--ts-config parameter): The TorchServe config allows you to customize the inference and management addresses and ports. We also set the service_envelope field to json to indicate the expected input format for TorchServe. Refer to the TorchServe documentation to configure other parameters. We create a config.properties file and pass it as the TorchServe config.
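A small sketch of generating such a config.properties file from Python follows. The port numbers 7080 and 7081 and the helper names are assumptions made for illustration; any ports work as long as the container exposes them and the model upload step references the same inference port.

```python
from pathlib import Path


def render_ts_config(inference_port=7080, management_port=7081, envelope="json"):
    """Render the contents of a TorchServe config.properties file."""
    lines = [
        # Bind to all interfaces so the server is reachable from outside the container.
        f"inference_address=http://0.0.0.0:{inference_port}",
        f"management_address=http://0.0.0.0:{management_port}",
        # Tells TorchServe which request envelope to expect.
        f"service_envelope={envelope}",
    ]
    return "\n".join(lines) + "\n"


def write_ts_config(path, **kwargs):
    """Write the rendered config to disk so it can be passed via --ts-config."""
    Path(path).write_text(render_ts_config(**kwargs))
```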

4. Build and push the custom container image

Before pushing the image to the Container Registry, you can test the Docker image locally by sending input requests to a local TorchServe deployment running inside Docker.
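One way to send such a request from Python, using only the standard library, is sketched below. It assumes the container is running locally with the inference port mapped to localhost:7080 and that the request body wraps base64-encoded text in an instances list; the model name, port, and body layout are assumptions to adapt to your own setup.

```python
import base64
import json
from urllib import request


def build_request_body(text: str) -> bytes:
    """Wrap base64-encoded text in a JSON 'instances' envelope."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("utf-8")
    return json.dumps({"instances": [{"data": {"b64": encoded}}]}).encode("utf-8")


def predict_local(text, model_name, host="http://localhost:7080", timeout=30):
    """POST one sentence to the TorchServe prediction API of a local container."""
    req = request.Request(
        f"{host}/predictions/{model_name}",
        data=build_request_body(text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```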

This request uses a test sentence. If successful, the server returns the prediction in the following format:

Now push the custom container image to the Container Registry. This image will be deployed to the Vertex Endpoint in the next step.

NOTE: You can also build and push the custom container image to the Artifact Registry repository instead of the Container Registry repository.

5. Deploying the serving container to Vertex Endpoint

We have packaged the model and built the serving container image. The next step is to deploy it to a Vertex Endpoint. A model must be deployed to an endpoint before it can be used to serve online predictions. Deploying a model associates physical resources with the model so it can serve online predictions with low latency. We use the Vertex SDK for Python to upload the model and deploy it to an endpoint. The following steps apply to any model, whether trained on the Vertex Training service or elsewhere, such as on-premises.

Upload model

We upload the model artifacts to Vertex AI and create a Model resource for the deployment. In this example the artifact is the serving container image URI. Notice that the predict and health routes (mandatory routes) and container port(s) are also specified at this step.
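A hedged sketch of the upload step with the Vertex SDK for Python is shown here. The /predictions/<model name> and /ping routes are TorchServe's prediction and health check paths; the port 7080, the helper names, and all argument values are assumptions that must match how your container is configured.

```python
def build_upload_kwargs(display_name, image_uri, model_name, port=7080):
    """Collect the arguments for aiplatform.Model.upload (hypothetical helper)."""
    return {
        "display_name": display_name,
        "serving_container_image_uri": image_uri,
        # TorchServe serves predictions at /predictions/<model name> ...
        "serving_container_predict_route": f"/predictions/{model_name}",
        # ... and answers health checks at /ping.
        "serving_container_health_route": "/ping",
        # Must match the inference port configured for TorchServe.
        "serving_container_ports": [port],
    }


def upload_model(project, location, **upload_kwargs):
    """Create a Vertex Model resource from the serving container image."""
    # Imported lazily so the helper above can be used without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    return aiplatform.Model.upload(**upload_kwargs)
```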

After the model is uploaded, you can view the model in the Models page on the Google Cloud Console under the Vertex AI section.

https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_LC5PaZs.max-900x900.png

Figure 2. Models page on Google Cloud console under the Vertex AI section

Create endpoint

Create a service endpoint to deploy one or more models. An endpoint provides a service URL where the prediction requests are sent. You can skip this step if you are deploying the model to an existing endpoint.
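The sketch below covers both cases mentioned above: it reuses an endpoint with a matching display name if one exists and creates a new one otherwise. The helper names are hypothetical, and the sketch assumes aiplatform.init has already been called with your project and location.

```python
def display_name_filter(display_name: str) -> str:
    """Build a list filter that matches resources by exact display name."""
    return f'display_name="{display_name}"'


def get_or_create_endpoint(display_name: str):
    """Return an existing endpoint with this display name, or create one."""
    # Imported lazily so the filter helper works without the SDK installed.
    from google.cloud import aiplatform

    existing = aiplatform.Endpoint.list(filter=display_name_filter(display_name))
    if existing:
        return existing[0]
    return aiplatform.Endpoint.create(display_name=display_name)
```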

After the endpoint is created, you can view the endpoint in the Endpoints page on the Google Cloud Console under the Vertex AI section.

https://storage.googleapis.com/gweb-cloudblog-publish/images/image5_poBVxnv.max-500x500.png

Figure 3. Endpoints page on Google Cloud console under the Vertex AI section

Deploy the model to endpoint

The final step is deploying the model to an endpoint. The deploy method provides the interface to specify the endpoint where the model is deployed and compute parameters including machine type, scaling minimum and maximum replica counts, and traffic split.
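As a sketch of how these parameters fit together, the helper below validates the compute settings before passing them to the deploy method of a Model object. The default machine type, replica counts, and helper names are illustrative assumptions, not recommendations.

```python
def build_deploy_kwargs(deployed_name, machine_type="n1-standard-4",
                        min_replicas=1, max_replicas=1, traffic_percentage=100):
    """Validate and collect compute settings for Model.deploy (hypothetical helper)."""
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    if not 0 <= traffic_percentage <= 100:
        raise ValueError("traffic_percentage must be between 0 and 100")
    return {
        "deployed_model_display_name": deployed_name,
        "machine_type": machine_type,
        "min_replica_count": min_replicas,   # lower bound for autoscaling
        "max_replica_count": max_replicas,   # upper bound for autoscaling
        "traffic_percentage": traffic_percentage,
        "sync": True,                        # block until deployment finishes
    }


def deploy_model(model, endpoint, **deploy_kwargs):
    """Deploy an uploaded aiplatform.Model to the given endpoint."""
    return model.deploy(endpoint=endpoint, **deploy_kwargs)
```

Validating replica counts and traffic share up front surfaces configuration mistakes before the comparatively slow deployment call starts.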

After deploying the model to the endpoint, you can manage and monitor the deployed models from the Endpoints page on the Google Cloud Console under the Vertex AI section.

https://storage.googleapis.com/gweb-cloudblog-publish/images/image4_rQ6YjQB.max-1100x1100.png

Figure 4. Manage and monitor models deployed on Endpoint from Google Cloud console under the Vertex AI section

Test the deployment

Now that the model is deployed, we can use the endpoint.predict() method to send base64-encoded text in a prediction request and get the predicted sentiment in the response.
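A minimal sketch of such a request follows. It assumes each instance wraps the base64-encoded text under data and b64 keys, which must agree with what the serving handler expects; the helper names are hypothetical.

```python
import base64


def build_instances(texts):
    """Base64-encode each sentence and wrap it as one prediction instance."""
    return [
        {"data": {"b64": base64.b64encode(text.encode("utf-8")).decode("utf-8")}}
        for text in texts
    ]


def classify(endpoint, texts):
    """Send sentences to a deployed aiplatform.Endpoint and return its predictions."""
    response = endpoint.predict(instances=build_instances(texts))
    return list(response.predictions)
```

Here `endpoint` is the Endpoint object the model was deployed to, and the returned list holds one prediction per input sentence.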
