The SageMaker Distributed Model Parallelism Library Configuration Tips and Pitfalls

Review the following tips and pitfalls before using Amazon SageMaker AI's model parallelism library. This list includes tips that are applicable across frameworks. For TensorFlow- and PyTorch-specific tips, see Modify a TensorFlow Training Script and Modify a PyTorch Training Script, respectively.

Batch Size and Number of Microbatches

Manual Partitioning

Data Preparation

Returning Tensors from smp.DistributedModel

The @smp.step Decorator

Delaying Parameter Initialization

For very large models with over 100 billion parameters, weight initialization through CPU memory might result in an out-of-memory error. To get around this, the library offers the smp.delay_param_initialization context manager. It delays the physical allocation of parameters until they move to GPU during the first execution of an smp.step-decorated function, which avoids unnecessary CPU memory usage during the initialization of training. Use the context manager when you create a model object, as shown in the following code.

with smp.delay_param_initialization(enabled=True):    
    model = MyModel()
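
For broader context, the following is a minimal sketch of where the context manager can sit in an SMP training script. MyModel, the optimizer choice, and the loss computation are illustrative placeholders, not part of the original example:

import torch
import smdistributed.modelparallel.torch as smp

smp.init()

# Parameters are not physically allocated here; allocation is deferred
# until they move to GPU on the first smp.step execution.
with smp.delay_param_initialization(enabled=True):
    model = MyModel()  # placeholder model class

model = smp.DistributedModel(model)
optimizer = smp.DistributedOptimizer(torch.optim.Adam(model.parameters()))

@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = torch.nn.functional.nll_loss(output, target)  # illustrative loss
    model.backward(loss)  # use the library's backward, not loss.backward()
    return loss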

Tensor Parallelism for PyTorch

When using tensor parallelism, create the optimizer after wrapping the model with smp.DistributedModel. Wrapping replaces the model parameters with their distributed counterparts, so an optimizer created beforehand holds references to the outdated parameters:

## WRONG
model = MyModel()
optimizer = SomeOptimizer(model.parameters())
model = smp.DistributedModel(model)  # optimizer now has outdated parameters!

Instead, create the optimizer with the parameters of the smp.DistributedModel, as follows:

## CORRECT
model = smp.DistributedModel(MyModel())
optimizer = SomeOptimizer(model.parameters())

Also keep in mind that a module distributed with tensor parallelism does not keep its original parameter shapes: sharded dimensions are divided by smp.tp_size(). The following example illustrates this (imports added for completeness):

import torch.nn as nn
import smdistributed.modelparallel.torch as smp

with smp.tensor_parallelism():
    linear = nn.Linear(60, 60)

# Will pass: the module still has its original shape.
assert tuple(linear.weight.shape) == (60, 60)

distributed_linear = smp.DistributedModel(linear)

# Will fail: the number of input channels has been divided by smp.tp_size().
assert tuple(distributed_linear.module.weight.shape) == (60, 60)
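
In practice, this means that any code inspecting raw parameter shapes after wrapping, such as custom checkpointing or re-initialization logic, should expect the sharded shapes rather than the original ones.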