(Archived) SageMaker model parallelism library v1.x (original)
Important
As of December 19, 2023, the SageMaker model parallelism (SMP) library v2 is released. The SMP v1 capabilities are deprecated in favor of the SMP library v2 and are no longer supported in future releases. The following section and topics are archived and specific to using the SMP library v1. For information about using the SMP library v2, see SageMaker model parallelism library v2.
Use Amazon SageMaker AI's model parallel library to train large deep learning (DL) models that are difficult to train due to GPU memory limitations. The library automatically and efficiently splits a model across multiple GPUs and instances. Using the library, you can achieve a target prediction accuracy faster by efficiently training larger DL models with billions or trillions of parameters.
You can use the library to automatically partition your own TensorFlow and PyTorch models across multiple GPUs and multiple nodes with minimal code changes. You can access the library's API through the SageMaker Python SDK.
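To make the SDK integration concrete, the following is a minimal sketch of the `distribution` configuration that a SageMaker framework estimator (for example, `sagemaker.pytorch.PyTorch`) accepts to enable the SMP v1 library. The specific parameter values shown (partition count, microbatches, pipeline schedule, processes per host) are illustrative assumptions, not recommendations; see the API documentation linked below for the full parameter reference.

```python
# Sketch of the `distribution` configuration for enabling SMP v1 on a
# SageMaker framework estimator. Values below are illustrative only.
smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,           # number of model partitions (assumed value)
        "microbatches": 4,         # split each batch for pipelined execution (assumed value)
        "pipeline": "interleaved", # pipeline schedule (assumed choice)
    },
}

# SMP v1 training jobs run on top of MPI, typically one process per GPU.
mpi_options = {
    "enabled": True,
    "processes_per_host": 8,  # assumed value for an 8-GPU instance
}

# Passed as the `distribution` argument of the estimator, e.g.
# PyTorch(entry_point="train.py", ..., distribution=distribution)
distribution = {
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options,
}
```

With this configuration in place, the estimator launches the training script under MPI with the model parallel runtime enabled, and the library partitions the model according to the parameters above.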
Use the following sections to learn more about model parallelism and the SageMaker model parallel library. This library's API documentation is located at Distributed Training APIs in the SageMaker Python SDK v2.199.0 documentation.
Topics
- Introduction to Model Parallelism
- Supported Frameworks and AWS Regions
- Core Features of the SageMaker Model Parallelism Library
- Run a SageMaker Distributed Training Job with Model Parallelism
- Checkpointing and Fine-Tuning a Model with Model Parallelism
- Amazon SageMaker AI model parallelism library v1 examples
- SageMaker Distributed Model Parallelism Best Practices
- The SageMaker Distributed Model Parallelism Library Configuration Tips and Pitfalls
- Model Parallel Troubleshooting
- SMP release notes