Quantization

Quantization techniques represent data with less information while trying not to lose too much accuracy. This often means converting a data type so that the same information is stored in fewer bits. For example, if your model weights are stored as 32-bit floating-point values and they're quantized to 16-bit floating-point values, the model size is halved, which makes the model easier to store and reduces memory usage. Lower precision can also speed up inference because calculations with fewer bits take less time.
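To make the size reduction concrete, here is a minimal sketch that packs the same values as 32-bit and 16-bit floats using only the Python standard library. The `weights` list is a hypothetical stand-in for real model weights, and this is not how Diffusers quantizes models internally; it only illustrates the storage and precision trade-off described above.

```python
import struct

# Hypothetical weights, chosen only for illustration.
weights = [0.1234567, -1.5, 3.1415927, 0.0009765]
count = len(weights)

# Pack the same values as 32-bit ("f") and 16-bit ("e") IEEE 754 floats.
fp32_bytes = struct.pack(f"<{count}f", *weights)
fp16_bytes = struct.pack(f"<{count}e", *weights)

# Decode the 16-bit representation to measure how much precision was lost.
restored = struct.unpack(f"<{count}e", fp16_bytes)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"fp32 size: {len(fp32_bytes)} bytes")
print(f"fp16 size: {len(fp16_bytes)} bytes")
print(f"largest rounding error: {max_error:.6f}")
```

The 16-bit buffer is half the size of the 32-bit one, and the rounding error shows the accuracy cost of the smaller representation.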

Interested in adding a new quantization method to Diffusers? Refer to the Contribute new quantization method guide to learn more.

When to use what?

Diffusers currently supports the following quantization methods.

BitsandBytes
TorchAO
GGUF
Quanto

This resource provides a good overview of the pros and cons of different quantization techniques.
