Per Layer Streaming Quantization · Issue #655 · pytorch/ao


@msaroufim

A tradeoff users have often complained about, most recently @aredden, is that they either:

  1. Quantize on CPU and then push the model to GPU -> slow quantization but VRAM efficient
  2. Push the model to GPU and then quantize on GPU -> fast quantization but needs lots of VRAM

Instead, we could have a utility that sends one layer at a time to the GPU, quantizes it there, and then sends over the next layer. Granted, this workflow seems to interact in a clunky way with torch.compile, where we don't compile things layer-wise and generally expect the whole model to already be on the device where it's compiled.
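
As a rough illustration, the utility could look something like the sketch below. It assumes torchao's `quantize_` and `int8_weight_only` APIs; the `quantize_streaming` name and the choice of int8 weight-only quantization are hypothetical, just to make the streaming pattern concrete. Each `nn.Linear` is moved to the GPU on its own, quantized there, and the now-smaller quantized layer stays resident.

```python
import torch
import torch.nn as nn

from torchao.quantization import int8_weight_only, quantize_


def quantize_streaming(model: nn.Module, device: str = "cuda") -> nn.Module:
    # Hypothetical utility sketch: stream one linear layer at a time to
    # `device`, quantize it on the device (fast), and keep the quantized
    # weights resident so peak VRAM stays close to a single fp layer.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            module.to(device)                      # move just this layer
            quantize_(module, int8_weight_only())  # quantize on the GPU
            torch.cuda.empty_cache()               # optional: release the freed fp weights
    return model
```

With this pattern only the layer currently being quantized ever holds full-precision weights on the GPU, so peak VRAM is roughly the quantized model plus one unquantized layer, while the quantization math itself still runs on the GPU.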