AMD GPU Operator

📖 GPU Operator Documentation Site

For the most detailed and up-to-date documentation, please visit our Instinct Documentation site: https://instinct.docs.amd.com/projects/gpu-operator

Introduction

AMD GPU Operator simplifies the deployment and management of AMD Instinct GPU accelerators within Kubernetes clusters. This project enables seamless configuration and operation of GPU-accelerated workloads, including machine learning, Generative AI, and other GPU-intensive applications.

Components

Features

Compatibility

Prerequisites

helm repo add jetstack https://charts.jetstack.io --force-update

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.15.1 \
  --set crds.enabled=true
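
Before installing the operator, it can help to confirm that cert-manager has finished rolling out. The following is a minimal sketch, assuming kubectl is configured for the target cluster and cert-manager was installed into the cert-manager namespace as above. The function name wait_for_cert_manager and the 120-second default timeout are illustrative choices, not part of the operator's tooling.

```shell
# Hypothetical helper: block until every cert-manager Deployment reports Available.
# The optional first argument overrides the (arbitrary) 120s default timeout.
wait_for_cert_manager() {
  kubectl wait deployment --all \
    --for=condition=Available \
    --namespace cert-manager \
    --timeout="${1:-120s}"
}
```

Calling wait_for_cert_manager (or wait_for_cert_manager 300s on a slow cluster) returns a non-zero status if the deployments do not become available in time, which makes it usable as a gate in an install script.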

Quick Start

1. Add the AMD Helm Repository

helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update

2. Install the Operator

Basic installation

helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu \
  --create-namespace \
  --version=v1.2.0
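
Once the chart is installed, a quick check of the release and its pods can confirm that the operator came up. This sketch assumes the release name amd-gpu-operator and namespace kube-amd-gpu from the basic installation command. The function name check_gpu_operator is hypothetical.

```shell
# Hypothetical helper: show the Helm release status, then list the operator pods.
# The pod listing only runs if the release lookup succeeds.
check_gpu_operator() {
  helm status amd-gpu-operator --namespace kube-amd-gpu &&
    kubectl get pods --namespace kube-amd-gpu
}
```

Run check_gpu_operator after helm install. The exact set of pods depends on the chart version and the options chosen, so treat the listing as a sanity check and not a fixed expected output.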

Installation Options

Warning

It is strongly recommended to use AMD-optimized KMM images included in the operator release. This is not required when installing the GPU Operator on Red Hat OpenShift.

3. Install Custom Resource

After installing the AMD GPU Operator, create a DeviceConfig custom resource to trigger the operator to start working. Define the DeviceConfig in a YAML file, then create the resource by running kubectl apply -f deviceconfigs.yaml. For the custom resource definition and more detailed information, refer to the Custom Resource Installation Guide.
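
As an illustration of what deviceconfigs.yaml might contain, the sketch below writes a minimal manifest to disk. The apiVersion, the spec field names, and the node selector label are assumptions recalled from the operator's documentation, and the resource name example-deviceconfig is hypothetical. Check all of them against the Custom Resource Installation Guide for your operator version before applying.

```shell
# Write a minimal DeviceConfig manifest to deviceconfigs.yaml.
# Field names and values below are assumptions, not an authoritative schema.
cat > deviceconfigs.yaml <<'EOF'
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: example-deviceconfig    # hypothetical resource name
  namespace: kube-amd-gpu       # namespace used by the Helm install
spec:
  driver:
    enable: false               # assumes the amdgpu driver is already installed on the nodes
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"   # assumed label on GPU nodes
EOF
```

The resulting file can then be applied with kubectl apply -f deviceconfigs.yaml. Sections for other components are omitted here so that the operator's defaults apply.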

Grafana Dashboards

Grafana dashboards are provided for visualizing GPU metrics collected by device-metrics-exporter.

Contributing

Please refer to our Developer Guide.

Support

For bugs and feature requests, please file an issue on our GitHub Issues page.

License

The AMD GPU Operator is licensed under the Apache License 2.0.