Deploy a simple MLP training script as a Kubernetes job - AWS Neuron Documentation

This document is relevant for: Trn1, Trn2

Deploy a simple MLP training script as a Kubernetes job

This tutorial uses MLP training as a teaching example of how to deploy a training application with Kubernetes on Trn1 instances. For a more advanced example, refer to Tutorial: Launch a Multi-Node PyTorch Neuron Training Job on Trainium Using TorchX and EKS.

Prerequisites

EKS Setup For Neuron, to set up Kubernetes (k8s) support on your cluster.
Trn1 instances as worker nodes, with attached roles allowing:
- ECR read access, to retrieve container images from ECR: arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
A container image built by following Run Training in PyTorch Neuron Container.
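If the worker node role does not yet carry the ECR read policy, one way to attach it is with the AWS CLI. The sketch below assumes the AWS CLI is installed and configured with permission to modify IAM roles. The function names and the role name my-trn1-node-role are placeholders for illustration, not values from this tutorial.

```shell
# Managed policy named in the prerequisites above.
ECR_READ_POLICY_ARN="arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"

# Attach the ECR read-only policy to the given IAM role.
# $1: IAM role name of the Trn1 worker nodes (placeholder, supply your own).
attach_ecr_read_policy() {
  aws iam attach-role-policy \
    --role-name "$1" \
    --policy-arn "$ECR_READ_POLICY_ARN"
}

# List the policies attached to the role to confirm the change.
list_role_policies() {
  aws iam list-attached-role-policies --role-name "$1"
}

# Example usage (hypothetical role name):
#   attach_ecr_read_policy my-trn1-node-role
#   list_role_policies my-trn1-node-role
```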

Deploy an MLP training image

1. Create a file named mlp_train.yaml with the contents below.

Note

In the image: field, replace the value with the location of your own container image.

apiVersion: v1
kind: Pod
metadata:
  name: trn1-mlp
spec:
  restartPolicy: Never
  schedulerName: default-scheduler
  hostNetwork: true
  nodeSelector:
    # A YAML mapping cannot repeat a key, so select exactly one instance type.
    beta.kubernetes.io/instance-type: trn1.32xlarge
    # beta.kubernetes.io/instance-type: trn1.2xlarge
  containers:
    - name: trn1-mlp
      command: ["/usr/local/bin/python3"]
      args: ["/opt/ml/mlp_train.py"]
      # Replace with the location of your own image.
      image: 647554078242.dkr.ecr.us-east-1.amazonaws.com/sunda-pt:k8s_mlp_0907
      imagePullPolicy: IfNotPresent
      env:
        - name: NEURON_RT_LOG_LEVEL
          value: "INFO"
      resources:
        limits:
          aws.amazon.com/neuron: 2
        requests:
          aws.amazon.com/neuron: 2
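The pod can only be scheduled if some node carries the instance-type label used in nodeSelector and advertises enough aws.amazon.com/neuron devices. A quick way to check is sketched below. It assumes kubectl is configured for the cluster and that the Neuron device plugin from the EKS setup is running; the function name show_neuron_nodes is illustrative.

```shell
# List each node with its instance-type label and allocatable Neuron devices.
# Dots inside a key are escaped so kubectl treats the key as a single field.
show_neuron_nodes() {
  kubectl get nodes \
    "-o=custom-columns=NAME:.metadata.name,TYPE:.metadata.labels.beta\.kubernetes\.io/instance-type,NEURON:.status.allocatable.aws\.amazon\.com/neuron"
}

# Example usage:
#   show_neuron_nodes
```

A node that lacks the label or the resource shows <none> in the corresponding column.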

2. Deploy the pod.

kubectl apply -f mlp_train.yaml
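Training takes some time, so it can help to wait until the pod reaches a terminal phase before reading the logs. The following is a minimal polling sketch, assuming the pod name trn1-mlp from the manifest; the helper names and the 10-second default interval are illustrative choices, not part of the tutorial.

```shell
# Print the current phase of a pod (Pending, Running, Succeeded, Failed, Unknown).
pod_phase() {
  kubectl get pod "$1" -o jsonpath='{.status.phase}'
}

# Poll until the pod reaches a terminal phase.
# Returns 0 on Succeeded, 1 on Failed, 2 if kubectl itself fails.
# $1: pod name, $2: seconds between polls (optional, defaults to 10).
wait_for_pod() {
  pod="$1"
  interval="${2:-10}"
  while :; do
    phase="$(pod_phase "$pod")" || return 2
    case "$phase" in
      Succeeded) return 0 ;;
      Failed)    return 1 ;;
    esac
    sleep "$interval"
  done
}

# Example usage:
#   wait_for_pod trn1-mlp && echo "training pod completed"
```

Because the manifest sets restartPolicy: Never, the pod is not restarted after the script exits, so it ends in either Succeeded or Failed.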

3. Check the logs to make sure training completed.

kubectl logs trn1-mlp

The log should end with output similar to the following:

Final loss is 0.1977
----------End Training ---------------
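For scripted checks, the same verification can be done by searching the log for the end-of-training marker. This is a small sketch, assuming the pod name trn1-mlp and the log wording shown above; the function names are illustrative.

```shell
# Succeed only if the pod log contains the end-of-training marker.
training_finished() {
  kubectl logs "$1" | grep -q "End Training"
}

# Print the log line that reports the final loss.
final_loss_line() {
  kubectl logs "$1" | grep "Final loss"
}

# Example usage:
#   training_finished trn1-mlp && final_loss_line trn1-mlp
```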
