Per Layer Streaming Quantization · Issue #655 · pytorch/ao
Description
A tradeoff users have often complained about, most recently @aredden, is that they either (see the sketch after this list):
- quantize on CPU and then push the model to GPU -> Slow quantization but VRAM efficient
- push the model to GPU and then quantize on GPU -> Fast quantization but needs lots of VRAM
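For concreteness, a rough sketch of those two workflows, assuming torchao's `quantize_`/`int8_weight_only` API and a hypothetical `load_model()` that returns an fp model on CPU:

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

# Option 1: quantize on CPU, then push -> VRAM efficient, but slow quantization
model = load_model()                    # hypothetical loader, fp weights on CPU
quantize_(model, int8_weight_only())    # quantize while everything is on CPU
model.to("cuda")                        # only the quantized weights hit VRAM

# Option 2: push first, then quantize on GPU -> fast, but peak VRAM must hold
# the full unquantized model
model = load_model().to("cuda")
quantize_(model, int8_weight_only())
```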
Instead, we could have a utility that sends one layer at a time to the GPU, quantizes it there, and only then sends over the next layer, synchronously. Granted, this workflow seems to interact in a clunky way with torch.compile, where we don't compile things layer-wise and generally expect the model to already be on the device where it's compiled.
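A minimal sketch of what such a utility could look like (not an existing torchao helper): walk the model's weight-bearing leaf modules, move each one to the GPU, quantize it there, and only then touch the next layer. `quantize_fn` is a placeholder for whatever per-module quantization call the user wants, e.g. wrapping torchao's `quantize_`; whether that call can be applied to a single `nn.Linear` is an assumption here.

```python
import torch.nn as nn


def quantize_layerwise_streaming(model: nn.Module, quantize_fn, device="cuda"):
    """Hypothetical helper: stream layers to `device` one at a time and
    quantize them there, so peak VRAM stays close to one fp layer plus the
    already-quantized layers."""
    for module in model.modules():
        # only stream leaf modules that actually hold large weights
        if isinstance(module, nn.Linear):
            module.to(device)      # copy this layer's fp weights to the GPU
            quantize_fn(module)    # quantize on the GPU (fast)
            # the (smaller) quantized weights stay on the GPU; the next
            # layer's fp weights are only copied over after this one is done
    return model
```

Usage might look like `quantize_layerwise_streaming(model, lambda m: quantize_(m, int8_weight_only()))`, after which a final `model.to("cuda")` would move any remaining non-linear parameters (embeddings, norms) over as well.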