Deploy a simple MLP training script as a Kubernetes job - AWS Neuron Documentation
This document is relevant for: Trn1, Trn2
Deploy a simple MLP training script as a Kubernetes job#
This tutorial uses MLP training as a teaching example of how to deploy a training application with Kubernetes on Trn1 instances. For a more advanced example, refer to Tutorial: Launch a Multi-Node PyTorch Neuron Training Job on Trainium Using TorchX and EKS.
Prerequisites#
- EKS Setup For Neuron: set up Kubernetes support on your cluster.
- Trn1 instances as worker nodes, with attached roles that allow:
  - ECR read access to retrieve container images from ECR: arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
- A container image built by following Run Training in PyTorch Neuron Container.
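Before deploying, it can help to confirm that the worker nodes actually advertise the `aws.amazon.com/neuron` resource that the pod requests later in this tutorial. The following is a minimal sketch, assuming `kubectl` is already configured for the cluster; the helper name `list_neuron_nodes` is hypothetical. A node that does not expose the resource would typically show `<none>` in the second column.

```shell
# Sketch: print each node's name and its allocatable Neuron device count.
# list_neuron_nodes is a hypothetical helper name, not part of any AWS tooling.
list_neuron_nodes() {
  # Dots inside the resource name are escaped so kubectl reads
  # "aws.amazon.com/neuron" as a single key of .status.allocatable.
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,NEURON:.status.allocatable.aws\.amazon\.com/neuron'
}

# Usage (requires access to the cluster):
#   list_neuron_nodes
```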
Deploy an MLP training image#
1. Create a file named mlp_train.yaml with the contents below.
Note
In the image: field, specify the location of your own container image.
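apiVersion: v1
kind: Pod
metadata:
  name: trn1-mlp
spec:
  restartPolicy: Never
  schedulerName: default-scheduler
  hostNetwork: true
  nodeSelector:
    # A YAML mapping cannot repeat a key, so select exactly one instance type.
    # To target trn1.2xlarge instead, change the value below and lower the
    # aws.amazon.com/neuron counts if that instance has fewer Neuron devices.
    # Newer clusters label nodes with node.kubernetes.io/instance-type; the beta label is deprecated.
    beta.kubernetes.io/instance-type: trn1.32xlarge
  containers:
    - name: trn1-mlp
      command: ["/usr/local/bin/python3"]
      args: ["/opt/ml/mlp_train.py"]
      # Replace with the location of your own image.
      image: 647554078242.dkr.ecr.us-east-1.amazonaws.com/sunda-pt:k8s_mlp_0907
      imagePullPolicy: IfNotPresent
      env:
        - name: NEURON_RT_LOG_LEVEL
          value: "INFO"
      resources:
        limits:
          aws.amazon.com/neuron: 2
        requests:
          aws.amazon.com/neuron: 2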
2. Deploy the pod.
kubectl apply -f mlp_train.yaml
3. Check the logs to make sure training completed.
kubectl logs trn1-mlp
Your log should contain the following:
Final loss is 0.1977
----------End Training ---------------
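The deploy-and-check steps above can also be scripted. The following is a minimal sketch, assuming the pod name `trn1-mlp` from the manifest and a configured `kubectl`; the helper names `wait_for_pod` and `training_finished`, the retry count, and the delay are all hypothetical choices. Because the manifest sets `restartPolicy: Never`, the pod is expected to end in either the `Succeeded` or the `Failed` phase.

```shell
# Sketch: poll the pod phase until it reaches a terminal state.
# Returns 0 on Succeeded, 1 on Failed, 2 if the retries run out.
# Arguments: pod name, number of tries (default 60), delay in seconds (default 10).
wait_for_pod() {
  local pod="$1" tries="${2:-60}" delay="${3:-10}" i=0 phase
  while [ "$i" -lt "$tries" ]; do
    phase="$(kubectl get pod "$pod" -o jsonpath='{.status.phase}')"
    case "$phase" in
      Succeeded) return 0 ;;
      Failed) return 1 ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  return 2
}

# Sketch: read a training log on stdin and succeed only if it contains
# both lines shown in the expected output above.
training_finished() {
  local log
  log="$(cat)"
  printf '%s\n' "$log" | grep -q 'Final loss is' &&
    printf '%s\n' "$log" | grep -q 'End Training'
}

# Usage (requires access to the cluster):
#   wait_for_pod trn1-mlp && kubectl logs trn1-mlp | training_finished \
#     && echo "training completed"
```

Polling the phase rather than the log keeps the check independent of log verbosity; the log check then confirms that the script itself reached its final message.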