Add support for int4 weight-only QAT by andrewor14 · Pull Request #383 · pytorch/ao
Summary: This commit adds support for int4 weight-only QAT, which simulates the numerics of the existing
Int4WeightOnlyQuantizer. The main motivation is to provide an end-to-end path for running QAT and then lowering to the efficient int4 tinygemm CUDA kernel. To enable this, we add new fake quantization primitives that match the numerics of the tinygemm kernel, which required refactoring the existing quant primitives to skip dtype casting.
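To illustrate the idea (not the PR's actual implementation), the sketch below shows what groupwise int4 fake quantization looks like: weights are quantized to the unsigned 4-bit range [0, 15] per group and immediately dequantized, so the tensor never leaves its original float dtype. The function name, group size, and epsilon are illustrative assumptions, not identifiers from this PR.

```python
import torch

def fake_quantize_int4_groupwise(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Hypothetical sketch of groupwise asymmetric int4 fake quantization.

    Quantize-dequantize in float, so gradients can flow through a
    straight-through estimator during QAT and no integer dtype cast occurs.
    """
    orig_shape = w.shape
    wg = w.reshape(-1, group_size)
    # Per-group min/max define an asymmetric quantization range.
    w_min = wg.min(dim=-1, keepdim=True).values
    w_max = wg.max(dim=-1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-6) / 15  # 16 levels -> 15 steps
    zero_point = (-w_min / scale).round()
    # Quantize to [0, 15], then dequantize back to float immediately.
    q = (wg / scale + zero_point).round().clamp(0, 15)
    dq = (q - zero_point) * scale
    return dq.reshape(orig_shape)
```

During training, the simulated weights would be used in place of the raw weights in the linear forward pass, so the model learns under the same numerics it will see after lowering to the int4 kernel.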
Test Plan:
python test/quantization/test_qat.py -k test_qat_4w_linear
Reviewers: jerryzh168, msaroufim
Subscribers: jerryzh168, msaroufim, HDCharles, supriyar