Train models with billions of parameters (PyTorch Lightning 2.5.1.post0 documentation)

Audience: Users who want to train massive models of billions of parameters efficiently across multiple GPUs and machines.

Lightning provides advanced and optimized model-parallel training strategies to support massive models of billions of parameters.


When NOT to use model-parallel strategies

Model-parallel techniques help when models are fairly large; roughly 500M+ parameters is where we’ve seen benefits. For small models (for example ResNet-50, with around 25M parameters) where the weights, activations, optimizer states and gradients all fit in GPU memory, you do not need a model-parallel strategy. Instead, use regular distributed data-parallel (DDP) training to scale your batch size and speed up training across multiple GPUs and machines. There are several DDP optimizations you can explore if memory and speed are a concern.
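To make the "fits in GPU memory" check concrete, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with Adam, where weights, gradients and optimizer states are commonly counted at about 16 bytes per parameter, and it ignores activations apart from a reserved headroom fraction. The function names, the 16-byte figure and the 50% headroom are illustrative assumptions, not part of Lightning.

```python
# Rough estimate of training memory for model states only (no activations).
# Assumption: mixed precision + Adam, about 16 bytes per parameter:
#   2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 weights, momentum, variance).

def estimate_model_state_gib(num_params: int, bytes_per_param: int = 16) -> float:
    """Return the approximate size of weights, grads and optimizer states in GiB."""
    return num_params * bytes_per_param / 1024**3


def needs_model_parallel(
    num_params: int,
    gpu_memory_gib: float,
    activation_headroom: float = 0.5,  # hypothetical fraction kept free for activations
) -> bool:
    """Return True if the model states alone exceed the per-GPU memory budget."""
    budget_gib = gpu_memory_gib * (1.0 - activation_headroom)
    return estimate_model_state_gib(num_params) > budget_gib


if __name__ == "__main__":
    for name, n in [("50M model", 50_000_000), ("7B model", 7_000_000_000)]:
        size = estimate_model_state_gib(n)
        verdict = "model-parallel" if needs_model_parallel(n, 80) else "DDP"
        print(f"{name}: ~{size:.1f} GiB of model states on an 80 GiB GPU -> {verdict}")
```

If the estimate fits comfortably on one GPU, plain DDP (for example `Trainer(strategy="ddp")`) is usually enough. Treat the numbers as an order-of-magnitude guide only, since activation memory depends heavily on batch size and architecture.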


Choosing the right strategy for your use case

If you’ve determined that your model is large enough that you need to leverage model parallelism, you have two training strategies to choose from: FSDP, the native solution that comes built-in with PyTorch, or the popular third-party DeepSpeed library. Both have a very similar feature set and have been used to train the largest SOTA models in the world. Our recommendation is

The table below points out a few important differences between the two.

Differences between FSDP and DeepSpeed

                         | FSDP                              | DeepSpeed
-------------------------|-----------------------------------|------------------------------------------
Dependencies             | None                              | Requires the deepspeed package
Configuration options    | Simpler and easier to get started | More comprehensive, allows finer control
Configuration            | Via Trainer                       | Via Trainer or configuration file
Activation checkpointing | Yes                               | Yes, but requires changing the model code
Offload parameters       | CPU                               | CPU or disk
Distributed checkpoints  | Coming soon                       | Yes
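As a minimal sketch of the "Via Trainer" configuration row, both strategies can be selected by passing a strategy name to the Trainer. The helper names `trainer_kwargs` and `make_trainer` are hypothetical, as is the choice of DeepSpeed ZeRO stage 3 as the DeepSpeed default; the strategy strings `"fsdp"` and `"deepspeed_stage_3"` are registered Lightning strategy names. Lightning is imported lazily so the argument-building part runs without it installed.

```python
# Sketch: select FSDP or DeepSpeed through Trainer arguments.
# The helper names below are illustrative, not part of the Lightning API.

# Assumption: map a family name to one registered Lightning strategy string.
STRATEGY_NAMES = {
    "fsdp": "fsdp",
    "deepspeed": "deepspeed_stage_3",  # needs the `deepspeed` package installed
}


def trainer_kwargs(family: str, devices: int = 4, num_nodes: int = 1) -> dict:
    """Build keyword arguments for the Trainer for the given strategy family."""
    if family not in STRATEGY_NAMES:
        raise ValueError(f"unknown strategy family: {family!r}")
    return {
        "accelerator": "gpu",
        "devices": devices,
        "num_nodes": num_nodes,
        "strategy": STRATEGY_NAMES[family],
    }


def make_trainer(family: str, **overrides):
    """Create a Trainer; imports Lightning only when actually called."""
    import lightning as L

    return L.Trainer(**{**trainer_kwargs(family), **overrides})


if __name__ == "__main__":
    print(trainer_kwargs("fsdp"))
    print(trainer_kwargs("deepspeed", devices=8, num_nodes=2))
```

Calling `make_trainer("fsdp")` on a multi-GPU machine would then return a Trainer ready for `trainer.fit(model)`. Finer-grained options such as activation checkpointing or offloading are set on the strategy objects themselves and are covered in the strategy-specific guides.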

Get started

Once you’ve chosen the right strategy for your use case, follow the full guide below to get started.


Third-party strategies

Cutting-edge Lightning strategies are also being developed by third parties outside of Lightning. If you want to try some of the latest features for model-parallel training, check out these strategies.