The SageMaker Distributed Model Parallel Library Overview (sagemaker 2.199.0 documentation)

The Amazon SageMaker distributed model parallel library is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. The library automatically and efficiently splits a model across multiple GPUs and instances and coordinates training across them, so you can build larger models with more parameters, which can improve prediction accuracy.

You can use the library to automatically partition your existing TensorFlow and PyTorch workloads across multiple GPUs with minimal code changes. You access the library’s API through the SageMaker Python SDK.
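As a rough illustration of what enabling the library through the SDK can look like, the sketch below builds the `distribution` dictionary that a SageMaker framework estimator accepts. The helper name `build_model_parallel_distribution` is hypothetical, and the values for `partitions`, `microbatches`, and `processes_per_host` are illustrative assumptions, not recommendations. The key names reflect the library's documented configuration as understood here; check them against the version you use.

```python
def build_model_parallel_distribution(partitions=2, microbatches=4,
                                      processes_per_host=8):
    """Return a `distribution` dict enabling the model parallel library.

    Hypothetical helper for illustration; all default values are assumptions.
    """
    return {
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    # Number of partitions to split the model into.
                    "partitions": partitions,
                    # Number of microbatches each batch is split into.
                    "microbatches": microbatches,
                },
            }
        },
        # The library launches training processes through MPI.
        "mpi": {"enabled": True, "processes_per_host": processes_per_host},
    }


distribution = build_model_parallel_distribution()

# Passing the dict to an estimator requires the `sagemaker` package, AWS
# credentials, and an IAM role, so it is shown here only as a comment.
# Values in angle brackets are placeholders, not real settings.
#
# from sagemaker.pytorch import PyTorch
# estimator = PyTorch(
#     entry_point="train.py",            # your training script (assumed name)
#     role="<your-iam-role-arn>",
#     instance_type="<gpu-instance-type>",
#     instance_count=1,
#     framework_version="<pytorch-version>",
#     py_version="<python-version>",
#     distribution=distribution,
# )
# estimator.fit()
```

Keeping the configuration in a plain dictionary makes it easy to inspect or vary the partition and microbatch counts before launching a training job.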