Supported frameworks, AWS Regions, and instance types

Before using the SageMaker AI distributed data parallelism (SMDDP) library, check which ML frameworks and instance types are supported, and whether your AWS account and AWS Region have sufficient quotas.

Supported frameworks

The following tables show the deep learning frameworks and their versions that SageMaker AI and SMDDP support. The SMDDP library is available in SageMaker AI Framework Containers, integrated in Docker containers distributed by the SageMaker model parallelism (SMP) library v2, or downloadable as a binary file.

PyTorch

| PyTorch version | SMDDP library version | SageMaker AI Framework Container images pre-installed with SMDDP | SMP Docker images pre-installed with SMDDP | URL of the binary file** |
| --- | --- | --- | --- | --- |
| v2.4.1 | smdistributed-dataparallel==v2.5.0 | Not available | 658645717510.dkr.ecr.&lt;region&gt;.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.4.1/cu121/2024-10-09/smdistributed_dataparallel-2.5.0-cp311-cp311-linux_x86_64.whl |
| v2.3.0 | smdistributed-dataparallel==v2.3.0 | 763104351884.dkr.ecr.&lt;region&gt;.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker | Currently not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed_dataparallel-2.3.0-cp311-cp311-linux_x86_64.whl |
| v2.2.0 | smdistributed-dataparallel==v2.2.0 | 763104351884.dkr.ecr.&lt;region&gt;.amazonaws.com/pytorch-training:2.2.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.&lt;region&gt;.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed_dataparallel-2.2.0-cp310-cp310-linux_x86_64.whl |
| v2.1.0 | smdistributed-dataparallel==v2.1.0 | 763104351884.dkr.ecr.&lt;region&gt;.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.&lt;region&gt;.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed_dataparallel-2.1.0-cp310-cp310-linux_x86_64.whl |
| v2.0.1 | smdistributed-dataparallel==v2.0.1 | 763104351884.dkr.ecr.&lt;region&gt;.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl |
| v2.0.0 | smdistributed-dataparallel==v1.8.0 | 763104351884.dkr.ecr.&lt;region&gt;.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.0/cu118/2023-03-20/smdistributed_dataparallel-1.8.0-cp310-cp310-linux_x86_64.whl |
| v1.13.1 | smdistributed-dataparallel==v1.7.0 | 763104351884.dkr.ecr.&lt;region&gt;.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.13.1/cu117/2023-01-09/smdistributed_dataparallel-1.7.0-cp39-cp39-linux_x86_64.whl |
| v1.12.1 | smdistributed-dataparallel==v1.6.0 | 763104351884.dkr.ecr.&lt;region&gt;.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.1/cu113/2022-12-05/smdistributed_dataparallel-1.6.0-cp38-cp38-linux_x86_64.whl |
| v1.12.0 | smdistributed-dataparallel==v1.5.0 | 763104351884.dkr.ecr.&lt;region&gt;.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed_dataparallel-1.5.0-cp38-cp38-linux_x86_64.whl |
| v1.11.0 | smdistributed-dataparallel==v1.4.1 | 763104351884.dkr.ecr.&lt;region&gt;.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.11.0/cu113/2022-04-14/smdistributed_dataparallel-1.4.1-cp38-cp38-linux_x86_64.whl |

** The URLs of the binary files are for installing the SMDDP library in custom containers. For more information, see Create your own Docker container with the SageMaker AI distributed data parallel library.
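When scripting a custom container build, the binary URLs from the table above can be captured in a small lookup helper. The sketch below is illustrative and not part of the SMDDP library; it includes only two rows from the table and would need to be extended for other versions.

```python
# Hypothetical helper mapping a supported PyTorch version to the SMDDP binary
# wheel URL from the table above (only two rows included for brevity).
SMDDP_WHEELS = {
    "2.3.0": "https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed_dataparallel-2.3.0-cp311-cp311-linux_x86_64.whl",
    "2.2.0": "https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed_dataparallel-2.2.0-cp310-cp310-linux_x86_64.whl",
}

def smddp_wheel_url(pytorch_version: str) -> str:
    """Return the SMDDP wheel URL for a PyTorch version, or raise if not listed."""
    try:
        return SMDDP_WHEELS[pytorch_version]
    except KeyError:
        raise ValueError(f"no SMDDP binary listed for PyTorch {pytorch_version}") from None

# In a Dockerfile, the returned URL would then be used as:
#   RUN pip install <wheel-url>
```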

Note

The SMDDP library v1.4.0 and later works as a backend of PyTorch distributed (torch.distributed) data parallelism (torch.nn.parallel.DistributedDataParallel). With this change, the legacy smdistributed APIs for the PyTorch distributed package have been deprecated.

If you need to use a previous version of the library (v1.3.0 or earlier), see the archived SageMaker AI distributed data parallelism documentation in the SageMaker AI Python SDK documentation.
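In a training script, SMDDP v1.4.0 and later is used by initializing the PyTorch process group with the "smddp" backend, which is registered by importing the smdistributed module. A minimal sketch, assuming the same script may also run outside an SMDDP-enabled container (the NCCL fallback is our addition, not part of the library):

```python
import importlib.util

# The "smddp" backend is registered with torch.distributed by importing
# smdistributed.dataparallel.torch.torch_smddp; that module only exists inside
# SMDDP-enabled SageMaker containers, so fall back to NCCL elsewhere.
if importlib.util.find_spec("smdistributed") is not None:
    import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401
    backend = "smddp"
else:
    backend = "nccl"

# Inside the training job (requires torch, so commented out here):
# import torch.distributed as dist
# dist.init_process_group(backend=backend)
```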

PyTorch Lightning

The SMDDP library is available for PyTorch Lightning in the following SageMaker AI Framework Containers for PyTorch and the SMP Docker containers.

PyTorch Lightning v2

| PyTorch Lightning version | PyTorch version | SMDDP library version | SageMaker AI Framework Container images pre-installed with SMDDP | SMP Docker images pre-installed with SMDDP | URL of the binary file** |
| --- | --- | --- | --- | --- | --- |
| 2.2.5 | 2.3.0 | smdistributed-dataparallel==v2.3.0 | 763104351884.dkr.ecr.&lt;region&gt;.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker | Currently not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed_dataparallel-2.3.0-cp311-cp311-linux_x86_64.whl |
| 2.2.0 | 2.2.0 | smdistributed-dataparallel==v2.2.0 | 763104351884.dkr.ecr.&lt;region&gt;.amazonaws.com/pytorch-training:2.2.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.&lt;region&gt;.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed_dataparallel-2.2.0-cp310-cp310-linux_x86_64.whl |
| 2.1.2 | 2.1.0 | smdistributed-dataparallel==v2.1.0 | 763104351884.dkr.ecr.&lt;region&gt;.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.&lt;region&gt;.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed_dataparallel-2.1.0-cp310-cp310-linux_x86_64.whl |
| 2.1.0 | 2.0.1 | smdistributed-dataparallel==v2.0.1 | 763104351884.dkr.ecr.&lt;region&gt;.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl |

PyTorch Lightning v1

| PyTorch Lightning version | PyTorch version | SMDDP library version | SageMaker AI Framework Container images pre-installed with SMDDP | URL of the binary file** |
| --- | --- | --- | --- | --- |
| 1.7.2, 1.7.0, 1.6.4, 1.6.3, 1.5.10 | 1.12.0 | smdistributed-dataparallel==v1.5.0 | 763104351884.dkr.ecr.&lt;region&gt;.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed_dataparallel-1.5.0-cp38-cp38-linux_x86_64.whl |

** The URLs of the binary files are for installing the SMDDP library in custom containers. For more information, see Create your own Docker container with the SageMaker AI distributed data parallel library.

Note

PyTorch Lightning and its utility libraries, such as Lightning Bolts, are not preinstalled in the PyTorch DLCs. When you construct a SageMaker AI PyTorch estimator and submit a training job request in Step 2, you need to provide a requirements.txt that installs pytorch-lightning and lightning-bolts in the SageMaker AI PyTorch training container.

# requirements.txt
pytorch-lightning
lightning-bolts

For more information about specifying the source directory in which to place the requirements.txt file along with your training script, and about submitting a job, see Using third-party libraries in the Amazon SageMaker AI Python SDK documentation.
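Putting these pieces together, SMDDP is enabled through the estimator's distribution argument. The sketch below is illustrative only: the source directory name, role, and versions are placeholders, and the estimator lines are commented out because they require the sagemaker SDK and AWS credentials.

```python
# Enabling SMDDP on a SageMaker PyTorch estimator (illustrative values).
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

# from sagemaker.pytorch import PyTorch
# estimator = PyTorch(
#     entry_point="train.py",
#     source_dir="code",               # contains train.py and the requirements.txt above
#     role="<execution-role-arn>",     # placeholder
#     framework_version="2.2.0",
#     py_version="py310",
#     instance_type="ml.p4d.24xlarge",
#     instance_count=2,
#     distribution=distribution,
# )
# estimator.fit()
```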

Hugging Face Transformers

The AWS Deep Learning Containers for Hugging Face use the SageMaker Training Containers for PyTorch and TensorFlow as their base images. To look up the Hugging Face Transformers library versions and paired PyTorch and TensorFlow versions, see the latest Hugging Face Containers and the Prior Hugging Face Container Versions.

TensorFlow (deprecated)

Important

The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow later than v2.11.0. The following table lists previous DLCs for TensorFlow with the SMDDP library installed.

| TensorFlow version | SMDDP library version |
| --- | --- |
| 2.9.1, 2.10.1, 2.11.0 | smdistributed-dataparallel==v1.4.1 |
| 2.8.3 | smdistributed-dataparallel==v1.3.0 |

AWS Regions

The SMDDP library is available in all of the AWS Regions where the AWS Deep Learning Containers for SageMaker AI and the SMP Docker images are in service.

Supported instance types

The SMDDP library requires one of the following instance types.

Instance type
ml.p3dn.24xlarge*
ml.p4d.24xlarge
ml.p4de.24xlarge

Tip

To properly run distributed training on the EFA-enabled instance types, you should enable traffic between the instances by setting up the security group of your VPC to allow all inbound and outbound traffic to and from the security group itself. To learn how to set up the security group rules, see Step 1: Prepare an EFA-enabled security group in the Amazon EC2 User Guide.
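The self-referencing rules described above can also be created programmatically. A hedged sketch using boto3: the security group ID is a placeholder, and the API calls are commented out because they require AWS credentials.

```python
# Build a permission entry that allows all traffic to and from members of the
# same security group (the self-referencing rule that EFA requires).
def self_referencing_permission(security_group_id: str) -> list:
    return [{
        "IpProtocol": "-1",  # all protocols and ports
        "UserIdGroupPairs": [{"GroupId": security_group_id}],
    }]

perms = self_referencing_permission("sg-0123456789abcdef0")  # placeholder group ID

# With boto3 (requires AWS credentials):
# import boto3
# ec2 = boto3.client("ec2")
# ec2.authorize_security_group_ingress(GroupId="sg-0123456789abcdef0", IpPermissions=perms)
# ec2.authorize_security_group_egress(GroupId="sg-0123456789abcdef0", IpPermissions=perms)
```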

Important

* The SMDDP library has discontinued support for optimizing its collective communication operations on P3 instances. While you can still use the SMDDP-optimized AllReduce collective on ml.p3dn.24xlarge instances, there will be no further development to enhance performance on this instance type. Note that the SMDDP-optimized AllGather collective is available only on P4 instances.

For the specifications of these instance types, see the Accelerated Computing section on the Amazon EC2 Instance Types page. For information about instance pricing, see Amazon SageMaker Pricing.
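The support matrix above can be expressed as a small validation helper, for example to fail fast before submitting a job. This is an illustrative helper of our own, not a SageMaker API:

```python
# SMDDP-supported instance types from the table above; the optimized AllGather
# collective is available only on P4 instances.
SMDDP_INSTANCE_TYPES = {"ml.p3dn.24xlarge", "ml.p4d.24xlarge", "ml.p4de.24xlarge"}
P4_INSTANCE_TYPES = {"ml.p4d.24xlarge", "ml.p4de.24xlarge"}

def smddp_allgather_available(instance_type: str) -> bool:
    """Raise for unsupported types; return whether optimized AllGather applies."""
    if instance_type not in SMDDP_INSTANCE_TYPES:
        raise ValueError(f"{instance_type} is not supported by the SMDDP library")
    return instance_type in P4_INSTANCE_TYPES
```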

If you encounter an error message similar to the following, follow the instructions at Request a service quota increase for SageMaker AI resources.

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling
the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge
for training job usage' is 0 Instances, with current utilization of 0 Instances
and a request delta of 1 Instances.
Please contact AWS support to request an increase for this limit.