Vertex AI custom training overview

Vertex AI provides a managed training service that lets you operationalize large-scale model training. You can use Vertex AI to run training applications based on any machine learning (ML) framework on Google Cloud infrastructure. For the following popular ML frameworks, Vertex AI also has integrated support that simplifies the preparation process for model training and serving:

- TensorFlow
- PyTorch
- scikit-learn
- XGBoost

This page explains the benefits of custom training on Vertex AI, the workflow involved, and the various training options that are available.

Vertex AI operationalizes training at scale

There are several challenges to operationalizing model training. These challenges include the time and cost needed to train models, the depth of skills required to manage the compute infrastructure, and the need to provide enterprise-level security. Vertex AI addresses these challenges while providing a host of other benefits.

Fully managed compute infrastructure

Model training on Vertex AI is a fully managed service that requires no administration of physical infrastructure. You can train ML models without the need to provision or manage servers, and you pay only for the compute resources that you consume. Vertex AI also handles job logging, queuing, and monitoring.

High-performance

Vertex AI training jobs are optimized for ML model training, which can provide faster performance than running your training application directly on a GKE cluster. You can also identify and debug performance bottlenecks in your training job by using Cloud Profiler.
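For TensorFlow training code, profiling can be enabled with a small addition to the training application. A minimal sketch, assuming the google-cloud-aiplatform SDK is installed with its cloud_profiler extra:

```python
# Minimal sketch: enable Cloud Profiler inside a TensorFlow training
# application. Requires: pip install google-cloud-aiplatform[cloud_profiler]
from google.cloud.aiplatform.training_utils import cloud_profiler

def main():
    # Initialize the profiler before the training loop starts.
    cloud_profiler.init()
    # ... build the model and run training as usual (for example, model.fit()).

if __name__ == "__main__":
    main()
```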

Distributed training

Reduction Server is an all-reduce algorithm in Vertex AI that can increase throughput and reduce the latency of multi-node distributed training on NVIDIA graphics processing units (GPUs). This optimization helps reduce the time and cost of completing large training jobs.
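To give a sense of how Reduction Server fits into a job configuration, here is a hedged sketch of a worker pool layout in which a third pool runs the Reduction Server container. The TRAINING_IMAGE_URI placeholder, machine types, replica counts, and the reducer image URI are illustrative assumptions; check the Reduction Server documentation for current values.

```python
# Hedged sketch: worker_pool_specs for multi-node GPU training with a third
# pool of Reduction Server reducers. All values below are illustrative.
TRAINING_IMAGE_URI = "us-central1-docker.pkg.dev/my-project/my-repo/trainer:latest"

worker_pool_specs = [
    {   # Pool 0: primary (chief) replica with GPUs
        "machine_spec": {
            "machine_type": "n1-standard-16",
            "accelerator_type": "NVIDIA_TESLA_V100",
            "accelerator_count": 2,
        },
        "replica_count": 1,
        "container_spec": {"image_uri": TRAINING_IMAGE_URI},
    },
    {   # Pool 1: additional GPU workers
        "machine_spec": {
            "machine_type": "n1-standard-16",
            "accelerator_type": "NVIDIA_TESLA_V100",
            "accelerator_count": 2,
        },
        "replica_count": 3,
        "container_spec": {"image_uri": TRAINING_IMAGE_URI},
    },
    {   # Pool 2: CPU-only Reduction Server reducers
        "machine_spec": {"machine_type": "n1-highcpu-16"},
        "replica_count": 4,
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/vertex-ai/training/reductionserver:latest",
        },
    },
]
```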

Hyperparameter optimization

Hyperparameter tuning jobs run multiple trials of your training application using different hyperparameter values. You specify a range of values to test, and Vertex AI discovers the optimal values for your model within that range.
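As a sketch of what this looks like with the Vertex AI SDK for Python; the project, bucket, image URI, metric name, and parameter names are illustrative assumptions, and the training code must report the metric (for example, with the hypertune library) and accept the parameters as command-line flags:

```python
# Minimal sketch: a hyperparameter tuning job with the Vertex AI SDK for Python.
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-staging-bucket")

custom_job = aiplatform.CustomJob(
    display_name="my-training-job",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {
            "image_uri": "us-central1-docker.pkg.dev/my-project/my-repo/trainer:latest",
        },
    }],
)

hpt_job = aiplatform.HyperparameterTuningJob(
    display_name="my-hpt-job",
    custom_job=custom_job,
    metric_spec={"accuracy": "maximize"},  # metric reported by the trials
    parameter_spec={
        "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
        "batch_size": hpt.DiscreteParameterSpec(values=[32, 64, 128], scale="linear"),
    },
    max_trial_count=16,      # total trials
    parallel_trial_count=4,  # concurrent trials
)
hpt_job.run()
```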

Enterprise security

Vertex AI provides enterprise security features, such as customer-managed encryption keys (CMEK) and VPC Service Controls, that help you protect your training data and code.

ML operations (MLOps) integrations

Vertex AI provides a suite of integrated MLOps tools and features that you can use for the following purposes:

- Orchestrate end-to-end ML workflows.
- Perform feature engineering.
- Run experiments.
- Manage and iterate your models.
- Track ML metadata.
- Monitor and evaluate model quality.
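For example, experiment runs can be tracked with the SDK. A hedged sketch; the project, experiment, and run names are illustrative:

```python
# Hedged sketch: logging parameters and metrics for an experiment run.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1",
                experiment="my-experiment")

aiplatform.start_run("run-1")
aiplatform.log_params({"learning_rate": 0.01, "epochs": 10})
# ... train the model ...
aiplatform.log_metrics({"accuracy": 0.95})
aiplatform.end_run()
```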

Workflow for custom training

The following diagram shows a high-level overview of the custom training workflow on Vertex AI. The sections that follow describe each step in detail.

(Diagram: high-level workflow for custom training on Vertex AI)

Load and prepare training data

For the best performance and support, use a Google Cloud service, such as Cloud Storage or BigQuery, as your data source.

For a comparison of these services, see Data preparation overview.

You can also specify a Vertex AI managed dataset as the data source when using a training pipeline to train your model. Training a custom model and an AutoML model using the same dataset lets you compare the performance of the two models.
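A hedged sketch of a training pipeline that trains on a managed tabular dataset with the Vertex AI SDK for Python; the dataset ID, script path, and container URIs are placeholders:

```python
# Hedged sketch: a training pipeline (CustomTrainingJob) that trains on a
# Vertex AI managed tabular dataset.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-staging-bucket")

dataset = aiplatform.TabularDataset(
    "projects/my-project/locations/us-central1/datasets/DATASET_ID")

job = aiplatform.CustomTrainingJob(
    display_name="train-on-managed-dataset",
    script_path="trainer/task.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12:latest",
    model_serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest"),
)

# Vertex AI passes the data splits to the training code through environment
# variables such as AIP_TRAINING_DATA_URI.
model = job.run(dataset=dataset, model_display_name="my-model")
```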

Prepare your training application

To prepare your training application for use on Vertex AI, do the following:

1. Implement training code best practices.
2. Select a container type.
3. Package your training application.

Implement training code best practices

Your training application should implement the training code best practices for Vertex AI. These best practices relate to how your training application accesses Google Cloud services, reads the environment variables that Vertex AI sets, and exports model artifacts.
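For example, a training application can honor the artifact-output convention by reading the AIP_MODEL_DIR environment variable that Vertex AI sets; the local fallback path below is an assumption for running outside Vertex AI:

```python
# Minimal sketch: training code that writes model artifacts to the Cloud
# Storage location that Vertex AI provides.
import os

# Vertex AI sets AIP_MODEL_DIR to the Cloud Storage URI where the job
# should write its model artifacts.
model_dir = os.environ.get("AIP_MODEL_DIR", "local-model-output")

# ... train the model, then save artifacts under model_dir, for example:
# model.save(model_dir)  # framework-specific save call
```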

Select a container type

Vertex AI runs your training application in a Docker container image. A Docker container image is a self-contained software package that includes code and all dependencies, which can run in almost any computing environment. You can either specify the URI of a prebuilt container image to use, or create and upload a custom container image that has your training application and dependencies pre-installed.

The following table shows the differences between prebuilt and custom container images:

Specifications | Prebuilt container images | Custom container images
--- | --- | ---
ML framework | Each container image is specific to an ML framework. | Use any ML framework, or none.
ML framework version | Each container image is specific to an ML framework version. | Use any ML framework version, including minor versions and nightly builds.
Application dependencies | Common dependencies for the ML framework are pre-installed. You can specify additional dependencies to install in your training application. | Pre-install the dependencies that your training application needs.
Application delivery format | Python source distribution or a single Python file. | Pre-install the training application in the custom container image.
Effort to set up | Low | High
Recommended for | Python training applications based on an ML framework and framework version that has a prebuilt container image available. | Greater customization and control. Non-Python training applications. Private or custom dependencies. Training applications that use an ML framework or framework version that has no prebuilt container image available.

Package your training application

After you've determined the type of container image to use, package your training application into the corresponding format:

- For a prebuilt container image, package your training application as a Python source distribution or a single Python file, as shown in the sketch after this list.
- For a custom container image, build your training application and its dependencies into the container image itself.
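For the source-distribution route, a minimal setup.py sketch; the package name and extra dependency are assumptions:

```python
# Hedged sketch: a minimal setup.py for packaging a training application as
# a Python source distribution for use with a prebuilt container image.
from setuptools import find_packages, setup

setup(
    name="trainer",
    version="0.1",
    packages=find_packages(),
    install_requires=["pandas"],  # dependencies installed at training time
    description="Training application for Vertex AI.",
)
```

You can then build the archive with `python setup.py sdist --formats=gztar` and upload it to Cloud Storage.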

Configure training job

A Vertex AI training job performs the following tasks:

- Provisions one or more virtual machines (VMs) according to your compute configuration.
- Runs your containerized training application on the provisioned VMs.
- Deletes the VMs after the training job completes.

Vertex AI offers three types of training jobs for running your training application:

- A custom job runs your training application on the compute resources that you specify.
- A hyperparameter tuning job runs multiple trials of your training application with different hyperparameter values until it finds optimal values.
- A training pipeline runs a custom job or hyperparameter tuning job and can optionally import the resulting model into Vertex AI.

When creating a training job, specify the compute resources to use for running your training application and configure your container settings.

Compute configurations

Specify the compute resources to use for a training job. Vertex AI supports single-node training, where the training job runs on one VM, and distributed training, where the training job runs on multiple VMs.

The compute resources that you can specify for your training job are as follows:

- The machine type for each VM, which determines its vCPUs and memory.
- Optional hardware accelerators, such as GPUs or TPUs.
- The type and size of each VM's boot disk.
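As a sketch, here is a single-node configuration expressed as a worker pool specification for the Vertex AI SDK for Python; the machine type, accelerator, disk values, and image URI are illustrative:

```python
# Hedged sketch: a single-node compute configuration as a worker pool
# specification. All values below are illustrative.
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-8",   # determines vCPUs and memory
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 1,
        },
        "replica_count": 1,  # one VM; use more replicas/pools for distributed training
        "disk_spec": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 100},
        "container_spec": {
            "image_uri": "us-central1-docker.pkg.dev/my-project/my-repo/trainer:latest",
        },
    },
]
```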

Container configurations

The container configuration that you provide depends on whether you're using a prebuilt or custom container image.
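A hedged sketch of the two configuration styles that a worker pool specification accepts; all URIs, paths, and arguments are placeholders:

```python
# Hedged sketch: the two container configuration styles inside a worker
# pool specification.

# Prebuilt container image: point the executor at your packaged Python code.
python_package_spec = {
    "executor_image_uri": "us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12:latest",
    "package_uris": ["gs://my-bucket/trainer-0.1.tar.gz"],
    "python_module": "trainer.task",
    "args": ["--epochs", "10"],
}

# Custom container image: the training application is baked into the image.
container_spec = {
    "image_uri": "us-central1-docker.pkg.dev/my-project/my-repo/trainer:latest",
    "args": ["--epochs", "10"],
}
```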

Create a training job

After your data and training application are prepared, run your training application by creating one of the following training jobs:

- A custom job
- A hyperparameter tuning job
- A training pipeline

To create the training job, you can use the Google Cloud console, Google Cloud CLI, Vertex AI SDK for Python, or the Vertex AI API.
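A minimal sketch using the Vertex AI SDK for Python to create and run a custom job, reusing a worker_pool_specs list like the one shown earlier; the project and bucket names are placeholders:

```python
# Minimal sketch: create and run a custom job with the Vertex AI SDK for Python.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-staging-bucket")

job = aiplatform.CustomJob(
    display_name="my-custom-job",
    worker_pool_specs=worker_pool_specs,  # see the compute configuration sketch
)
job.run()  # blocks until completion; job.submit() returns immediately instead
```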

(Optional) Import model artifacts into Vertex AI

Your training application likely outputs one or more model artifacts to a specified location, usually a Cloud Storage bucket. Before you can get inferences in Vertex AI from your model artifacts, first import the model artifacts into Vertex AI Model Registry.

As with container images for training, Vertex AI gives you the choice of using prebuilt or custom container images for inferences. If a prebuilt container image for inferences is available for your ML framework and framework version, we recommend using it.
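A hedged sketch of importing model artifacts with the SDK; the artifact URI and serving container image are illustrative:

```python
# Hedged sketch: import model artifacts into Vertex AI Model Registry with
# a prebuilt inference container.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="my-model",
    artifact_uri="gs://my-bucket/model-output/",  # where training wrote artifacts
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest"),
)
```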

What's next