Per Layer Streaming Quantization · Issue #655 · pytorch/ao


@msaroufim

A tradeoff users have often complained about, most recently @aredden, is that they either:

  1. Quantize on CPU and then push the model to GPU -> slow quantization but VRAM efficient
  2. Push the model to GPU and then quantize on GPU -> fast quantization but needs lots of VRAM

Instead, we could have a utility that sends one layer at a time to the GPU, quantizes it there, and then sends over the next layer. Granted, this workflow seems to interact in a clunky way with torch.compile, where we don't compile things layer-wise and generally expect the whole model to already be on the device where it's compiled.
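
As a rough illustration, the utility could look something like the sketch below. It assumes torchao's `quantize_` and `int8_weight_only` APIs; the `quantize_streaming` name and the choice of int8 weight-only quantization are hypothetical, just to make the streaming pattern concrete. Each `nn.Linear` is moved to the GPU on its own, quantized there, and the now-smaller quantized layer stays resident.

```python
import torch
import torch.nn as nn

from torchao.quantization import int8_weight_only, quantize_


def quantize_streaming(model: nn.Module, device: str = "cuda") -> nn.Module:
    # Hypothetical utility sketch: stream one linear layer at a time to
    # `device`, quantize it on the device (fast), and keep the quantized
    # weights resident so peak VRAM stays close to a single fp layer.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            module.to(device)                      # move just this layer
            quantize_(module, int8_weight_only())  # quantize on the GPU
            torch.cuda.empty_cache()               # optional: release the freed fp weights
    return model
```

With this pattern only the layer currently being quantized ever holds full-precision weights on the GPU, so peak VRAM is roughly the quantized model plus one unquantized layer, while the quantization math itself still runs on the GPU.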