How Amazon SageMaker AI Provides Training Information

This section explains how SageMaker AI makes training information, such as training data, hyperparameters, and other configuration information, available to your Docker container.

When you send a CreateTrainingJob request to SageMaker AI to start model training, you specify the Amazon Elastic Container Registry (Amazon ECR) path of the Docker image that contains the training algorithm, the Amazon Simple Storage Service (Amazon S3) location where training data is stored, and algorithm-specific parameters. SageMaker AI makes this information available inside the Docker container so that your training algorithm can use it. For information about creating a training job, see CreateTrainingJob. For more information on the way that SageMaker AI containers organize information, see SageMaker Training and Inference Toolkits.

Hyperparameters

SageMaker AI makes the hyperparameters in a CreateTrainingJob request available in the Docker container in the /opt/ml/input/config/hyperparameters.json file.

The following is an example of a hyperparameter configuration in hyperparameters.json that specifies the num_round and eta hyperparameters in the CreateTrainingJob operation for XGBoost.

{
    "num_round": "128",
    "eta": "0.001"
}            
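Because SageMaker AI passes all hyperparameter values as JSON strings, training code typically casts them to the expected types after reading the file. A minimal sketch, assuming the XGBoost-style hyperparameters above (the helper names are illustrative, not part of any SageMaker API):

```python
import json

def parse_hyperparameters(raw):
    """Cast hyperparameter values from the string form SageMaker AI delivers."""
    return {
        "num_round": int(raw["num_round"]),
        "eta": float(raw["eta"]),
    }

def load_hyperparameters(path="/opt/ml/input/config/hyperparameters.json"):
    """Read the file SageMaker AI writes into the container and cast values."""
    with open(path) as f:
        return parse_hyperparameters(json.load(f))
```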

For a complete list of hyperparameters that can be used for the SageMaker AI built-in XGBoost algorithm, see XGBoost Hyperparameters.

The hyperparameters that you can tune depend on the algorithm that you are training. For the hyperparameters available for a SageMaker AI built-in algorithm, see the Hyperparameters section under that algorithm's entry in Use Amazon SageMaker AI Built-in Algorithms or Pre-trained Models.

Environment Variables

SageMaker AI sets the following environment variables in your container:

TRAINING_JOB_NAME – Specified in the TrainingJobName parameter of the CreateTrainingJob request.

TRAINING_JOB_ARN – The Amazon Resource Name (ARN) of the training job.
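Training code can read job metadata such as TRAINING_JOB_NAME with a plain environment lookup. A minimal sketch, with a fallback default for running the container outside SageMaker AI (the default value is illustrative):

```python
import os

def training_job_name(default="local-test-job"):
    """Read the training job name from the TRAINING_JOB_NAME environment
    variable, falling back to a default when running outside SageMaker AI."""
    return os.environ.get("TRAINING_JOB_NAME", default)
```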

Input Data Configuration

SageMaker AI makes the data channel information in the InputDataConfig parameter from your CreateTrainingJob request available in the /opt/ml/input/config/inputdataconfig.json file in your Docker container.

For example, suppose that you specify three data channels (train, evaluation, and validation) in your request. SageMaker AI provides the following JSON:


{
  "train" : {"ContentType":  "trainingContentType",
             "TrainingInputMode": "File",
             "S3DistributionType": "FullyReplicated",
             "RecordWrapperType": "None"},
  "evaluation" : {"ContentType":  "evalContentType",
                  "TrainingInputMode": "File",
                  "S3DistributionType": "FullyReplicated",
                  "RecordWrapperType": "None"},
  "validation" : {"TrainingInputMode": "File",
                  "S3DistributionType": "FullyReplicated",
                  "RecordWrapperType": "None"}
} 
Note

SageMaker AI provides only relevant information about each data channel (for example, the channel name and the content type) to the container, as shown in the previous example. S3DistributionType will be set as FullyReplicated if you specify EFS or FSxLustre as input data sources.
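In File mode, the data for each channel named in inputdataconfig.json is staged under /opt/ml/input/data/`channel_name`. A minimal sketch of resolving those directories from the parsed configuration (the helper names are illustrative):

```python
import json
import os

# File-mode data is staged here, one subdirectory per channel name.
DATA_ROOT = "/opt/ml/input/data"

def channel_paths(config):
    """Map each channel name in inputdataconfig.json to its local directory."""
    return {name: os.path.join(DATA_ROOT, name) for name in config}

def load_channel_paths(path="/opt/ml/input/config/inputdataconfig.json"):
    with open(path) as f:
        return channel_paths(json.load(f))
```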

Training Data

The TrainingInputMode parameter in the AlgorithmSpecification of the CreateTrainingJob request specifies how the training dataset is made available to your container. The following input modes are available: File mode, which copies the dataset (or mounts the file system) into the container's local directory before training starts; Pipe mode, which streams data from Amazon S3 to the container through a named pipe; and FastFile mode, which exposes Amazon S3 objects as local files that stream on demand as they are read.

Note

Channels that use file system data sources such as Amazon EFS and Amazon FSx must use File mode. In this case, the directory path provided in the channel is mounted at /opt/ml/input/data/`channel_name`.

Note

Channels that use FastFile mode must use an S3DataType of "S3Prefix".
FastFile mode presents a folder view that uses the forward slash (/) as the delimiter for grouping Amazon S3 objects into folders. S3Uri prefixes must not correspond to a partial folder name. For example, if an Amazon S3 dataset contains s3://amzn-s3-demo-bucket/train-01/data.csv, then neither s3://amzn-s3-demo-bucket/train nor s3://amzn-s3-demo-bucket/train-01 is allowed as an S3Uri prefix.
A trailing forward slash is recommended to define a channel corresponding to a folder, for example, the s3://amzn-s3-demo-bucket/train-01/ channel for the train-01 folder. Without the trailing forward slash, the channel would be ambiguous if another folder s3://amzn-s3-demo-bucket/train-011/ or file s3://amzn-s3-demo-bucket/train-01.txt existed.
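The trailing-slash recommendation can be sketched as a small validation helper that flags ambiguous prefixes before a job is submitted. This is an illustrative check, not part of any SageMaker API:

```python
def is_unambiguous_prefix(prefix, object_keys):
    """Return True if an S3 key prefix is safe for a FastFile channel:
    it either ends with "/" (names a whole folder) or exactly matches
    one object key (names a single file)."""
    return prefix.endswith("/") or prefix in object_keys
```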

SageMaker AI model training supports high-performance S3 Express One Zone directory buckets as a data input location for file mode, fast file mode, and pipe mode. To use S3 Express One Zone, input the location of the S3 Express One Zone directory bucket instead of an Amazon S3 general purpose bucket. Provide the ARN for the IAM role with the required access control and permissions policy. Refer to the AmazonSageMakerFullAccess policy for details. You can only encrypt your SageMaker AI output data in directory buckets with server-side encryption with Amazon S3 managed keys (SSE-S3). Server-side encryption with AWS KMS keys (SSE-KMS) is not currently supported for storing SageMaker AI output data in directory buckets. For more information, see S3 Express One Zone.

Distributed Training Configuration

If you're performing distributed training with multiple containers, SageMaker AI makes information about all containers available in the /opt/ml/input/config/resourceconfig.json file.

To enable inter-container communication, this JSON file contains information for all containers. SageMaker AI makes this file available for both File and Pipe mode algorithms. The file provides the following information:

current_host – The name of the current container on the container network, for example, algo-1. Host values can change at any time, so don't write code that depends on specific values.

hosts – The list of names of all containers on the container network, sorted lexicographically, for example, algo-1, algo-2, and algo-3 for a three-node cluster. Containers can use these names to address other containers on the container network.

network_interface_name – The name of the network interface that is exposed to your container. For example, containers running the Message Passing Interface (MPI) can use this information to set the network interface name.

The following is an example file on node 1 in a three-node cluster:

{
    "current_host": "algo-1",
    "hosts": ["algo-1", "algo-2", "algo-3"],
    "network_interface_name": "eth1"
}
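Distributed training code often turns this file into a node rank and world size. A minimal sketch, assuming the host naming shown above (the helper names are illustrative, not part of any SageMaker API):

```python
import json

def cluster_info(config):
    """Derive this node's rank and the cluster size from the parsed
    resourceconfig.json contents."""
    hosts = sorted(config["hosts"])
    rank = hosts.index(config["current_host"])
    return {"rank": rank, "world_size": len(hosts), "is_primary": rank == 0}

def load_cluster_info(path="/opt/ml/input/config/resourceconfig.json"):
    with open(path) as f:
        return cluster_info(json.load(f))
```

On node 1 of the example cluster, cluster_info would report rank 0 of 3 and mark the node as primary, a common convention for electing the coordinator in a distributed job.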