Supported frameworks, AWS Regions, and instance types
Before using the SageMaker AI distributed data parallelism (SMDDP) library, check which ML frameworks and instance types are supported and whether your AWS account has sufficient quotas in your AWS Region.
Supported frameworks
The following tables show the deep learning frameworks and their versions that SageMaker AI and SMDDP support. The SMDDP library is available in SageMaker AI Framework Containers, integrated in Docker containers distributed by the SageMaker model parallelism (SMP) library v2, or downloadable as a binary file.
PyTorch
PyTorch version | SMDDP library version | SageMaker AI Framework Container images pre-installed with SMDDP | SMP Docker images pre-installed with SMDDP | URL of the binary file** |
---|---|---|---|---|
v2.4.1 | smdistributed-dataparallel==v2.5.0 | Not available | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.4.1/cu121/2024-10-09/smdistributed_dataparallel-2.5.0-cp311-cp311-linux_x86_64.whl |
v2.3.0 | smdistributed-dataparallel==v2.3.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker | Currently not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed_dataparallel-2.3.0-cp311-cp311-linux_x86_64.whl |
v2.2.0 | smdistributed-dataparallel==v2.2.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.2.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed_dataparallel-2.2.0-cp310-cp310-linux_x86_64.whl |
v2.1.0 | smdistributed-dataparallel==v2.1.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed_dataparallel-2.1.0-cp310-cp310-linux_x86_64.whl |
v2.0.1 | smdistributed-dataparallel==v2.0.1 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl |
v2.0.0 | smdistributed-dataparallel==v1.8.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.0/cu118/2023-03-20/smdistributed_dataparallel-1.8.0-cp310-cp310-linux_x86_64.whl |
v1.13.1 | smdistributed-dataparallel==v1.7.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.13.1/cu117/2023-01-09/smdistributed_dataparallel-1.7.0-cp39-cp39-linux_x86_64.whl |
v1.12.1 | smdistributed-dataparallel==v1.6.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.1/cu113/2022-12-05/smdistributed_dataparallel-1.6.0-cp38-cp38-linux_x86_64.whl |
v1.12.0 | smdistributed-dataparallel==v1.5.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed_dataparallel-1.5.0-cp38-cp38-linux_x86_64.whl |
v1.11.0 | smdistributed-dataparallel==v1.4.1 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.11.0/cu113/2022-04-14/smdistributed_dataparallel-1.4.1-cp38-cp38-linux_x86_64.whl |
** The URLs of the binary files are for installing the SMDDP library in custom containers. For more information, see Create your own Docker container with the SageMaker AI distributed data parallel library.
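When installing the binary in a custom container, it can help to confirm that the wheel matches the container's PyTorch, CUDA, and Python versions. The following is a minimal sketch that reads those values out of a wheel URL. The URL pattern is inferred from the entries in the table above and is an assumption, not a documented contract; `describe_wheel` and `matches_interpreter` are hypothetical helper names.

```python
import re
import sys

# Pattern inferred from the wheel URLs listed in the table (an assumption).
WHEEL_URL = re.compile(
    r"/binary/pytorch/(?P<torch>[\d.]+)/(?P<cuda>cu\d+)/(?P<date>\d{4}-\d{2}-\d{2})/"
    r"smdistributed_dataparallel-(?P<smddp>[\d.]+)-(?P<py>cp\d+)-"
)


def describe_wheel(url: str) -> dict:
    """Extract the PyTorch, CUDA, build date, SMDDP, and CPython tags from a wheel URL."""
    match = WHEEL_URL.search(url)
    if match is None:
        raise ValueError(f"unrecognized SMDDP wheel URL: {url}")
    return match.groupdict()


def matches_interpreter(url: str, version_info=sys.version_info) -> bool:
    """Return True if the wheel's cpXY tag matches the given Python version."""
    expected = f"cp{version_info[0]}{version_info[1]}"
    return describe_wheel(url)["py"] == expected


url = (
    "https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/"
    "2024-05-23/smdistributed_dataparallel-2.3.0-cp311-cp311-linux_x86_64.whl"
)
info = describe_wheel(url)
print(info["torch"], info["cuda"], info["smddp"], info["py"])
```

A check like this could run in a container build step before `pip install`, so that a mismatched wheel fails early rather than at import time.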
Note
The SMDDP library v1.4.0 and later works as a backend of PyTorch distributed (torch.distributed) data parallelism (torch.nn.parallel.DistributedDataParallel). As a result, the following smdistributed APIs for the PyTorch distributed package are deprecated:
smdistributed.dataparallel.torch.distributed is deprecated. Use the torch.distributed package instead.
smdistributed.dataparallel.torch.parallel.DistributedDataParallel is deprecated. Use the torch.nn.parallel.DistributedDataParallel API instead.
If you need to use an earlier version of the library (v1.3.0 or before), see the archived SageMaker AI distributed data parallelism documentation in the SageMaker AI Python SDK documentation.
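The migration described in this note can be sketched as follows: import the SMDDP PyTorch module so that it registers itself as a backend, then use the native `torch.distributed` and `torch.nn.parallel.DistributedDataParallel` APIs. This is an illustrative sketch, not the library's reference implementation. The fallback to `nccl` when SMDDP is absent is an assumption added for local testing, and `choose_backend`, `smddp_is_installed`, `init_distributed`, and `wrap_model` are hypothetical helper names.

```python
import importlib.util


def choose_backend(smddp_installed: bool) -> str:
    """Prefer the SMDDP backend when the library is present, else fall back to NCCL."""
    return "smddp" if smddp_installed else "nccl"


def smddp_is_installed() -> bool:
    """Check for the top-level smdistributed package without importing it."""
    return importlib.util.find_spec("smdistributed") is not None


def init_distributed() -> str:
    """Initialize torch.distributed; call this inside a launched training process."""
    import torch.distributed as dist

    backend = choose_backend(smddp_is_installed())
    if backend == "smddp":
        # Importing this module registers "smddp" as a torch.distributed backend.
        import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401
    dist.init_process_group(backend=backend)
    return backend


def wrap_model(model):
    """Wrap a model with native PyTorch DDP instead of the deprecated smdistributed class.

    Assumes the model has already been moved to this process's GPU.
    """
    from torch.nn.parallel import DistributedDataParallel as DDP

    return DDP(model)
```

The PyTorch and SMDDP imports are deferred into the functions so that the module can be imported on a machine without either package. In a training script, `init_distributed()` would be called once before `wrap_model(...)`.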
PyTorch Lightning
The SMDDP library is available for PyTorch Lightning in the following SageMaker AI Framework Containers for PyTorch and the SMP Docker containers.
PyTorch Lightning v2
PyTorch Lightning version | PyTorch version | SMDDP library version | SageMaker AI Framework Container images pre-installed with SMDDP | SMP Docker images pre-installed with SMDDP | URL of the binary file** |
---|---|---|---|---|---|
2.2.5 | 2.3.0 | smdistributed-dataparallel==v2.3.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker | Currently not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed_dataparallel-2.3.0-cp311-cp311-linux_x86_64.whl |
2.2.0 | 2.2.0 | smdistributed-dataparallel==v2.2.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.2.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed_dataparallel-2.2.0-cp310-cp310-linux_x86_64.whl |
2.1.2 | 2.1.0 | smdistributed-dataparallel==v2.1.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed_dataparallel-2.1.0-cp310-cp310-linux_x86_64.whl |
2.1.0 | 2.0.1 | smdistributed-dataparallel==v2.0.1 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl |
PyTorch Lightning v1
PyTorch Lightning version | PyTorch version | SMDDP library version | SageMaker AI Framework Container images pre-installed with SMDDP | URL of the binary file** |
---|---|---|---|---|
1.7.2, 1.7.0, 1.6.4, 1.6.3, 1.5.10 | 1.12.0 | smdistributed-dataparallel==v1.5.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed_dataparallel-1.5.0-cp38-cp38-linux_x86_64.whl |
** The URLs of the binary files are for installing the SMDDP library in custom containers. For more information, see Create your own Docker container with the SageMaker AI distributed data parallel library.
Note
PyTorch Lightning and its utility libraries, such as Lightning Bolts, are not preinstalled in the PyTorch DLCs. When you construct a SageMaker AI PyTorch estimator and submit a training job request in Step 2, you need to provide requirements.txt to install pytorch-lightning and lightning-bolts in the SageMaker AI PyTorch training container.
# requirements.txt
pytorch-lightning
lightning-bolts
For more information about specifying the source directory in which to place the requirements.txt file along with your training script, and about submitting a job, see Using third-party libraries in the Amazon SageMaker AI Python SDK documentation.
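As a rough sketch of how the pieces fit together, the estimator below points `source_dir` at a directory that holds both the training script and `requirements.txt`, and enables SMDDP through the `distribution` argument. The script name `train.py`, the directory name `src`, the instance count, and the role ARN are hypothetical placeholders. The framework and Python versions are taken from one row of the tables above and should be adjusted to your setup.

```python
def estimator_kwargs(role: str, source_dir: str = "src") -> dict:
    """Build arguments for sagemaker.pytorch.PyTorch with SMDDP enabled."""
    return {
        "entry_point": "train.py",   # hypothetical script located inside source_dir
        "source_dir": source_dir,    # should also contain requirements.txt
        "role": role,
        "framework_version": "2.2.0",
        "py_version": "py310",
        "instance_count": 2,         # illustrative value
        "instance_type": "ml.p4d.24xlarge",
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }


def submit(role: str) -> None:
    """Create the estimator and start the training job (requires the SageMaker Python SDK)."""
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(**estimator_kwargs(role))
    estimator.fit()


# Hypothetical role ARN, used only to show the resulting configuration.
kwargs = estimator_kwargs("arn:aws:iam::111122223333:role/ExampleSageMakerRole")
print(kwargs["distribution"])
```

Keeping the arguments in a plain dictionary makes the configuration easy to inspect or log before any AWS call is made.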
Hugging Face Transformers
The AWS Deep Learning Containers for Hugging Face use the SageMaker Training Containers for PyTorch and TensorFlow as their base images. To look up the Hugging Face Transformers library versions and paired PyTorch and TensorFlow versions, see the latest Hugging Face Containers and the Prior Hugging Face Container Versions.
TensorFlow (deprecated)
Important
The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow later than v2.11.0. The following table lists previous DLCs for TensorFlow with the SMDDP library installed.
TensorFlow version | SMDDP library version |
---|---|
2.9.1, 2.10.1, 2.11.0 | smdistributed-dataparallel==v1.4.1 |
2.8.3 | smdistributed-dataparallel==v1.3.0 |
AWS Regions
The SMDDP library is available in all of the AWS Regions where the AWS Deep Learning Containers for SageMaker AI and the SMP Docker images are in service.
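The image names in the tables above are Amazon ECR URIs, in which the AWS Region is the segment between `dkr.ecr.` and `.amazonaws.com`. The following sketch assembles a URI for a chosen Region. The account IDs are the ones shown in the tables; some Regions may use a different registry account or domain, so treat the result as a starting point to verify rather than a guaranteed address. `image_uri` is a hypothetical helper name.

```python
DLC_ACCOUNT = "763104351884"  # SageMaker AI Framework Containers, as listed in the tables
SMP_ACCOUNT = "658645717510"  # SMP Docker images, as listed in the tables


def image_uri(account: str, region: str, repository: str, tag: str) -> str:
    """Assemble an ECR image URI of the form account.dkr.ecr.region.amazonaws.com/repo:tag."""
    if not region:
        raise ValueError("region must be a non-empty AWS Region code")
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


uri = image_uri(
    DLC_ACCOUNT,
    "us-west-2",
    "pytorch-training",
    "2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker",
)
print(uri)
```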
Supported instance types
The SMDDP library requires one of the following instance types.
Instance type |
---|
ml.p3dn.24xlarge* |
ml.p4d.24xlarge |
ml.p4de.24xlarge |
Tip
To properly run distributed training on the EFA-enabled instance types, you should enable traffic between the instances by setting up the security group of your VPC to allow all inbound and outbound traffic to and from the security group itself. To learn how to set up the security group rules, see Step 1: Prepare an EFA-enabled security group in the Amazon EC2 User Guide.
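One way to express the self-referencing rule from this tip is with the boto3 EC2 client, as sketched below. The security group ID is a hypothetical placeholder, and the helper names are illustrative. The calls fail if an identical rule already exists, so production code would need to handle that case.

```python
def self_referencing_permissions(security_group_id: str) -> list:
    """Build IpPermissions that allow all protocols and ports from the group itself."""
    return [
        {
            "IpProtocol": "-1",  # -1 means all protocols
            "UserIdGroupPairs": [{"GroupId": security_group_id}],
        }
    ]


def allow_intra_group_traffic(security_group_id: str) -> None:
    """Add inbound and outbound rules that reference the same security group."""
    import boto3  # requires AWS credentials and EC2 permissions

    ec2 = boto3.client("ec2")
    permissions = self_referencing_permissions(security_group_id)
    ec2.authorize_security_group_ingress(
        GroupId=security_group_id, IpPermissions=permissions
    )
    ec2.authorize_security_group_egress(
        GroupId=security_group_id, IpPermissions=permissions
    )


# Hypothetical security group ID, used only to show the rule structure.
perms = self_referencing_permissions("sg-0123456789abcdef0")
print(perms)
```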
Important
* The SMDDP library has discontinued support for optimizing its collective communication operations on P3 instances. While you can still use the SMDDP-optimized AllReduce collective on ml.p3dn.24xlarge instances, there will be no further development to enhance performance on this instance type. Note that the SMDDP-optimized AllGather collective is only available for P4 instances.
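The instance-type constraints above can be captured in a small pre-flight check, sketched here. It encodes only what this section states: three supported instance types, with the optimized AllGather collective limited to P4 instances. `optimized_collectives` is a hypothetical helper, not part of the SMDDP library.

```python
SUPPORTED_INSTANCE_TYPES = (
    "ml.p3dn.24xlarge",
    "ml.p4d.24xlarge",
    "ml.p4de.24xlarge",
)


def optimized_collectives(instance_type: str) -> set:
    """Return the SMDDP-optimized collectives this section lists for an instance type."""
    if instance_type not in SUPPORTED_INSTANCE_TYPES:
        raise ValueError(
            f"{instance_type} is not supported by SMDDP; "
            f"choose one of {', '.join(SUPPORTED_INSTANCE_TYPES)}"
        )
    if instance_type.startswith("ml.p4"):
        return {"AllReduce", "AllGather"}
    # P3 instances keep the optimized AllReduce but do not get AllGather.
    return {"AllReduce"}


print(sorted(optimized_collectives("ml.p4d.24xlarge")))
```

Running such a check before submitting a job surfaces an unsupported instance type locally instead of after the job has been queued.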
For specs of the instance types, see the Accelerated Computing section in the Amazon EC2 Instance Types page. For information about instance pricing, see Amazon SageMaker Pricing.
If you encounter an error message similar to the following, follow the instructions at Request a service quota increase for SageMaker AI resources.
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling
the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge
for training job usage' is 0 Instances, with current utilization of 0 Instances
and a request delta of 1 Instances.
Please contact AWS support to request an increase for this limit.
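To decide how large an increase to request, the numbers in this message can be read out programmatically. The sketch below parses a message shaped like the example above. The regular expression is derived from that single example, so the exact wording is an assumption and may differ for other resources. `parse_quota_error` is a hypothetical helper name.

```python
import re

# Derived from the example message above; the wording is assumed, not guaranteed.
QUOTA_ERROR = re.compile(
    r"service limit '(?P<quota>[^']+)' is (?P<limit>\d+) Instances, "
    r"with current utilization of (?P<used>\d+) Instances "
    r"and a request delta of (?P<delta>\d+) Instances"
)


def parse_quota_error(message: str):
    """Return the quota name and instance counts, or None if the message does not match."""
    normalized = " ".join(message.split())  # collapse line breaks and repeated spaces
    match = QUOTA_ERROR.search(normalized)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "quota": fields["quota"],
        "limit": int(fields["limit"]),
        "used": int(fields["used"]),
        "delta": int(fields["delta"]),
    }


message = """ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling
the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge
for training job usage' is 0 Instances, with current utilization of 0 Instances
and a request delta of 1 Instances.
Please contact AWS support to request an increase for this limit."""

details = parse_quota_error(message)
# Minimum quota value that would let this request succeed.
required = details["used"] + details["delta"]
print(details["quota"], required)
```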