# HIGGS
HIGGS is a zero-shot quantization algorithm that combines Hadamard preprocessing with MSE-optimal quantization grids to achieve lower quantization error and state-of-the-art performance.
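To make the two ingredients concrete, here is a toy NumPy sketch of the general idea: rotate the weights with an orthonormal Hadamard matrix, then round to a small fixed grid. The grid below is an arbitrary 2-bit example chosen for illustration, not the MSE-optimal grid HIGGS actually derives, and this sketch is not the HIGGS implementation.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via Sylvester's construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_to_grid(x, grid):
    """Snap every entry of x to its nearest point on a 1-D grid."""
    idx = np.abs(x[:, None] - grid[None, :]).argmin(axis=1)
    return grid[idx]

rng = np.random.default_rng(0)
w = rng.standard_t(df=3, size=256)       # heavy-tailed toy "weights" (outlier-prone)
grid = np.array([-1.5, -0.5, 0.5, 1.5])  # arbitrary 2-bit grid, not the HIGGS grid

H = hadamard(w.size)
w_rot = H @ w                            # Hadamard preprocessing spreads outliers out,
q_rot = quantize_to_grid(w_rot, grid)    # making the rotated weights near-Gaussian
w_hat = H.T @ q_rot                      # de-rotate to reconstruct the weights

mse_rotated = np.mean((w - w_hat) ** 2)
mse_direct = np.mean((w - quantize_to_grid(w, grid)) ** 2)
print(mse_rotated, mse_direct)
```

With outlier-heavy weights, the rotation tends to reduce the grid's mean squared error, which is the effect the preprocessing targets; per the description above, HIGGS additionally makes the grid itself MSE-optimal rather than fixed.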
Runtime support for HIGGS is implemented through the FLUTE library. Only the 70B and 405B variants of Llama 3.1 and Llama 3.0, and the 8B and 27B variants of Gemma 2, are currently supported. HIGGS also doesn't currently support quantized training or backward passes in general.
Run the command below to install FLUTE.
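The FLUTE kernels are distributed on PyPI as `flute-kernel` (check the FLUTE repository for a wheel matching your CUDA and PyTorch versions):

```shell
pip install flute-kernel
```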
Create a `HiggsConfig` with the number of bits to quantize a model to.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=HiggsConfig(bits=4),
    device_map="auto",
)
```
Find models pre-quantized with HIGGS in the official ISTA-DASLab collection.
## torch.compile
HIGGS is fully compatible with torch.compile.
```python
import torch
from transformers import AutoModelForCausalLM, HiggsConfig

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=HiggsConfig(bits=4),
    device_map="auto",
)
model = torch.compile(model)
```
Refer to the table below for a benchmark of forward passes per second for Llama-3.1-8B-Instruct on an RTX 4090.
| Batch Size | BF16 (with `torch.compile`) | HIGGS 4bit (without `torch.compile`) | HIGGS 4bit (with `torch.compile`) |
|---|---|---|---|
| 1 | 59 | 41 | 124 |
| 4 | 57 | 42 | 123 |
| 16 | 56 | 41 | 120 |