Quantization

Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types, such as 8-bit integers (int8). This makes it possible to load larger models that would not otherwise fit into memory and to speed up inference. Transformers supports the AWQ and GPTQ quantization algorithms, as well as 8-bit and 4-bit quantization with bitsandbytes.

Quantization techniques that aren’t supported in Transformers can be added with the HfQuantizer class.

Learn how to quantize models in the Quantization guide.
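For example, a model can be loaded in 4-bit with bitsandbytes by passing a BitsAndBytesConfig to from_pretrained. A minimal sketch (the checkpoint name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bf16 compute for the matmuls.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # illustrative checkpoint; any causal LM works
    quantization_config=bnb_config,
    device_map="auto",
)
```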

QuantoConfig

class transformers.QuantoConfig

( weights = 'int8' activations = None modules_to_not_convert: typing.Optional[typing.List] = None **kwargs )

A wrapper class for all the attributes and features you can adjust on a model loaded with quanto quantization.

Safety checker that arguments are correct
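A minimal sketch of loading a model with quanto int8 weights (requires the optimum-quanto package; the checkpoint name is illustrative):

```python
from transformers import AutoModelForCausalLM, QuantoConfig

quanto_config = QuantoConfig(weights="int8")

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # illustrative checkpoint
    quantization_config=quanto_config,
    device_map="auto",
)
```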

AqlmConfig

class transformers.AqlmConfig

( in_group_size: int = 8 out_group_size: int = 1 num_codebooks: int = 1 nbits_per_codebook: int = 16 linear_weights_not_to_quantize: typing.Optional[typing.List[str]] = None **kwargs )

A wrapper class for the AQLM quantization parameters.

Safety checker that arguments are correct - also replaces some NoneType arguments with their default values.
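AQLM checkpoints published on the Hub already carry their AqlmConfig in config.json, so loading a pre-quantized model does not require passing a config explicitly. A minimal sketch (the repo name is illustrative; the aqlm package must be installed):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",  # illustrative pre-quantized AQLM repo
    torch_dtype="auto",
    device_map="auto",
)
```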

VptqConfig

class transformers.VptqConfig

< source >

( enable_proxy_error: bool = False config_for_layers: typing.Dict[str, typing.Any] = {} shared_layer_config: typing.Dict[str, typing.Any] = {} modules_to_not_convert: typing.Optional[typing.List] = None **kwargs )

Parameters

This is a wrapper class about vptq parameters.

Safety checker that arguments are correct

AwqConfig

class transformers.AwqConfig

( bits: int = 4 group_size: int = 128 zero_point: bool = True version: AWQLinearVersion = <AWQLinearVersion.GEMM: 'gemm'> backend: AwqBackendPackingMethod = <AwqBackendPackingMethod.AUTOAWQ: 'autoawq'> do_fuse: typing.Optional[bool] = None fuse_max_seq_len: typing.Optional[int] = None modules_to_fuse: typing.Optional[dict] = None modules_to_not_convert: typing.Optional[typing.List] = None exllama_config: typing.Optional[typing.Dict[str, int]] = None **kwargs )

A wrapper class for all the attributes and features you can adjust on a model loaded with AWQ quantization via the auto-awq library and its autoawq backend.

Safety checker that arguments are correct
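A minimal sketch of loading an already-quantized AWQ checkpoint while overriding a few options, here enabling fused modules for faster inference (the repo name is illustrative; requires the autoawq package):

```python
from transformers import AutoModelForCausalLM, AwqConfig

awq_config = AwqConfig(
    bits=4,
    do_fuse=True,
    fuse_max_seq_len=512,  # maximum sequence length supported by the fused modules
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",  # illustrative pre-quantized AWQ repo
    quantization_config=awq_config,
)
```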

EetqConfig

class transformers.EetqConfig

( weights: str = 'int8' modules_to_not_convert: typing.Optional[typing.List] = None **kwargs )

A wrapper class for all the attributes and features you can adjust on a model loaded with EETQ quantization.

Safety checker that arguments are correct
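A minimal sketch of on-the-fly int8 quantization with EETQ (requires the eetq package; the checkpoint name is illustrative):

```python
from transformers import AutoModelForCausalLM, EetqConfig

eetq_config = EetqConfig("int8")

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # illustrative checkpoint
    quantization_config=eetq_config,
    device_map="auto",
)
```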

GPTQConfig

class transformers.GPTQConfig

( bits: int tokenizer: typing.Any = None dataset: typing.Union[typing.List[str], str, NoneType] = None group_size: int = 128 damp_percent: float = 0.1 desc_act: bool = False sym: bool = True true_sequential: bool = True checkpoint_format: str = 'gptq' meta: typing.Optional[typing.Dict[str, typing.Any]] = None backend: typing.Optional[str] = None use_cuda_fp16: bool = False model_seqlen: typing.Optional[int] = None block_name_to_quantize: typing.Optional[str] = None module_name_preceding_first_block: typing.Optional[typing.List[str]] = None batch_size: int = 1 pad_token_id: typing.Optional[int] = None use_exllama: typing.Optional[bool] = None max_input_length: typing.Optional[int] = None exllama_config: typing.Optional[typing.Dict[str, typing.Any]] = None cache_block_outputs: bool = True modules_in_block_to_quantize: typing.Optional[typing.List[typing.List[str]]] = None **kwargs )

A wrapper class for all the attributes and features you can adjust on a model quantized with GPTQ through the optimum API, relying on the auto_gptq backend.

Get compatible class with optimum gptq config dict

Safety checker that arguments are correct

Get compatible dict for optimum gptq config
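A minimal sketch of quantizing a model with GPTQ at load time, using a built-in calibration dataset (requires the optimum and auto-gptq/gptqmodel packages; the checkpoint name is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# "c4" is one of the built-in calibration datasets; a list of strings also works.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
```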

BitsAndBytesConfig

class transformers.BitsAndBytesConfig

( load_in_8bit = False load_in_4bit = False llm_int8_threshold = 6.0 llm_int8_skip_modules = None llm_int8_enable_fp32_cpu_offload = False llm_int8_has_fp16_weight = False bnb_4bit_compute_dtype = None bnb_4bit_quant_type = 'fp4' bnb_4bit_use_double_quant = False bnb_4bit_quant_storage = None **kwargs )

A wrapper class for all the attributes and features you can adjust on a model loaded with bitsandbytes.

This config replaces the load_in_8bit and load_in_4bit arguments, so the two options are mutually exclusive.

Currently only supports LLM.int8(), FP4, and NF4 quantization. If more methods are added to bitsandbytes, then more arguments will be added to this class.

Returns True if the model is quantizable, False otherwise.

Safety checker that arguments are correct - also replaces some NoneType arguments with their default values.

This method returns the quantization method used for the model. If the model is not quantizable, it returns None.
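A minimal sketch of 8-bit loading with LLM.int8(), keeping the lm_head in full precision (the checkpoint name is illustrative):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

int8_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,             # outlier threshold for mixed-precision decomposition
    llm_int8_skip_modules=["lm_head"],  # keep the output head unquantized
)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",  # illustrative checkpoint
    quantization_config=int8_config,
    device_map="auto",
)
```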

to_diff_dict

( ) → Dict[str, Any]

Returns: Dict[str, Any], a dictionary of all the attributes that make up this configuration instance.

Removes all attributes from the config that correspond to the default config attributes, for better readability, and serializes it to a Python dictionary.

HfQuantizer

class transformers.quantizers.HfQuantizer

( quantization_config: QuantizationConfigMixin **kwargs )

Abstract base class for HuggingFace quantizers. For now it supports quantizing Transformers models for inference and/or quantization. This class is used only in transformers.PreTrainedModel.from_pretrained and cannot easily be used outside the scope of that method yet.

Attributes:

quantization_config (transformers.utils.quantization_config.QuantizationConfigMixin): The quantization config that defines the quantization parameters of the model you want to quantize.
modules_to_not_convert (List[str], optional): The list of module names not to convert when quantizing the model.
required_packages (List[str], optional): The list of required pip packages to install before using the quantizer.
requires_calibration (bool): Whether the quantization method requires calibrating the model before use.
requires_parameters_quantization (bool): Whether the quantization method requires creating a new Parameter. For example, bitsandbytes requires creating a new xxxParameter in order to properly quantize the model.
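A minimal subclass sketch, assuming only the hooks described in this section; a real integration also needs a matching quantization config class and registration in the auto quantizer mapping, which is omitted here, and the exact set of abstract members may vary across Transformers versions.

```python
from transformers.quantizers import HfQuantizer


class MyQuantizer(HfQuantizer):
    # No calibration data is needed for this hypothetical on-the-fly method.
    requires_calibration = False

    def validate_environment(self, *args, **kwargs):
        # Check for conflicting from_pretrained arguments or missing packages here.
        pass

    def _process_model_before_weight_loading(self, model, **kwargs):
        # The model is still on the meta device: swap nn.Linear modules for
        # quantized stand-ins before the real weights are loaded.
        return model

    def _process_model_after_weight_loading(self, model, **kwargs):
        # Finalize scales/buffers once the checkpoint weights are in place.
        return model

    @property
    def is_trainable(self):
        return False

    def is_serializable(self, safe_serialization=None):
        return True
```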

adjust_max_memory

( max_memory: typing.Dict[str, typing.Union[int, str]] )

Adjusts the max_memory argument for infer_auto_device_map() if extra memory is needed for quantization.

adjust_target_dtype

( torch_dtype: torch.dtype )

Override this method if you want to adjust the target_dtype variable used in from_pretrained to compute the device_map, in case the device_map is a str. E.g. for bitsandbytes we force-set target_dtype to torch.int8, and for 4-bit we pass a custom enum, accelerate.CustomDtype.int4.

check_quantized_param

( model: PreTrainedModel param_value: torch.Tensor param_name: str state_dict: typing.Dict[str, typing.Any] **kwargs )

Checks if a loaded state_dict component is part of a quantized parameter, plus some validation; only defined if requires_parameters_quantization == True, for quantization methods that require creating new parameters for quantization.

create_quantized_param

( *args **kwargs )

Takes the needed components from the state_dict and creates a quantized parameter; only applicable if requires_parameters_quantization == True.

Potentially dequantizes the model to retrieve the original model, with some loss in accuracy / performance. Note that not all quantization schemes support this.

get_special_dtypes_update

( model torch_dtype: torch.dtype )

Returns dtypes for modules that are not quantized, used to compute the device_map when a str is passed as the device_map. The method relies on the modules_to_not_convert that is modified in _process_model_before_weight_loading.

postprocess_model

( model: PreTrainedModel **kwargs )

Post-processes the model after the weights have been loaded. Make sure to override the abstract method _process_model_after_weight_loading.

preprocess_model

( model: PreTrainedModel **kwargs )

Sets model attributes and/or converts the model before weight loading. At this point the model should be initialized on the meta device, so you can freely manipulate the skeleton of the model in order to replace modules in place. Make sure to override the abstract method _process_model_before_weight_loading.

update_device_map

( device_map: typing.Optional[typing.Dict[str, typing.Any]] )

Override this method if you want to override the existing device map with a new one. E.g. for bitsandbytes, since accelerate is a hard requirement, if no device_map is passed, the device_map is set to "auto".

update_expected_keys

( model expected_keys: typing.List[str] loaded_keys: typing.List[str] )

Override this method if you want to adjust the expected_keys.

update_missing_keys

( model missing_keys: typing.List[str] prefix: str )

Override this method if you want to adjust the missing_keys.

update_missing_keys_after_loading

( model missing_keys: typing.List[str] prefix: str )

Override this method if you want to adjust the missing_keys after loading the model params, but before the model is post-processed.

update_torch_dtype

( torch_dtype: torch.dtype )

Some quantization methods require explicitly setting the dtype of the model to a target dtype. You need to override this method if you want to make sure that behavior is preserved.

Updates the tensor parallelism (tp) plan for the scales.

update_unexpected_keys

( model unexpected_keys: typing.List[str] prefix: str )

Override this method if you want to adjust the unexpected_keys.

This method is used to check for potential conflicts with arguments passed in from_pretrained. You need to define it for all future quantizers that are integrated with Transformers. If no explicit checks are needed, simply return nothing.

HiggsConfig

class transformers.HiggsConfig

( bits: int = 4 p: int = 2 modules_to_not_convert: typing.Optional[typing.List[str]] = None hadamard_size: int = 512 group_size: int = 256 tune_metadata: typing.Optional[typing.Dict[str, typing.Any]] = None **kwargs )

HiggsConfig is a configuration class for quantization using the HIGGS method.

Safety checker that arguments are correct - also replaces some NoneType arguments with their default values.

HqqConfig

class transformers.HqqConfig

( nbits: int = 4 group_size: int = 64 view_as_float: bool = False axis: typing.Optional[int] = None dynamic_config: typing.Optional[dict] = None skip_modules: typing.List[str] = ['lm_head'] **kwargs )

This is a wrapper around hqq's BaseQuantizeConfig.
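A minimal sketch of on-the-fly HQQ quantization (requires the hqq package; the checkpoint name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

hqq_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # illustrative checkpoint
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=hqq_config,
)
```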

from_dict

( config: typing.Dict[str, typing.Any] )

Override from_dict, used in AutoQuantizationConfig.from_dict in quantizers/auto.py

Safety checker that arguments are correct - also replaces some NoneType arguments with their default values.

to_diff_dict

( ) → Dict[str, Any]

Returns: Dict[str, Any], a dictionary of all the attributes that make up this configuration instance.

Removes all attributes from the config that correspond to the default config attributes, for better readability, and serializes it to a Python dictionary.

FbgemmFp8Config

class transformers.FbgemmFp8Config

( activation_scale_ub: float = 1200.0 modules_to_not_convert: typing.Optional[typing.List] = None **kwargs )

A wrapper class for all the attributes and features you can adjust on a model loaded with FBGEMM FP8 quantization.
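A minimal sketch of FP8 quantization with FBGEMM (requires the fbgemm-gpu package and a GPU with FP8 support; the checkpoint name is illustrative):

```python
from transformers import AutoModelForCausalLM, FbgemmFp8Config

fp8_config = FbgemmFp8Config()  # default activation_scale_ub=1200.0

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # illustrative checkpoint
    quantization_config=fp8_config,
    device_map="auto",
)
```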

CompressedTensorsConfig

class transformers.CompressedTensorsConfig

( config_groups: typing.Dict[str, typing.Union[ForwardRef('QuantizationScheme'), typing.List[str]]] = None format: str = 'dense' quantization_status: QuantizationStatus = 'initialized' kv_cache_scheme: typing.Optional[ForwardRef('QuantizationArgs')] = None global_compression_ratio: typing.Optional[float] = None ignore: typing.Optional[typing.List[str]] = None sparsity_config: typing.Dict[str, typing.Any] = None quant_method: str = 'compressed-tensors' run_compressed: bool = True **kwargs )

This is a wrapper class that handles the compressed-tensors quantization config options. It is a wrapper around compressed_tensors.QuantizationConfig.

from_dict

( config_dict return_unused_kwargs = False **kwargs ) → QuantizationConfigMixin

Returns: QuantizationConfigMixin, the configuration object instantiated from those parameters.

Instantiates a CompressedTensorsConfig from a Python dictionary of parameters. Optionally unwraps any args from the nested quantization_config.

Quantization config to be added to config.json

Serializes this instance to a Python dictionary. Returns: Dict[str, Any], a dictionary of all the attributes that make up this configuration instance.

to_diff_dict

( ) → Dict[str, Any]

Returns: Dict[str, Any], a dictionary of all the attributes that make up this configuration instance.

Removes all attributes from the config that correspond to the default config attributes, for better readability, and serializes it to a Python dictionary.

TorchAoConfig

class transformers.TorchAoConfig

( quant_type: typing.Union[str, ForwardRef('AOBaseConfig')] modules_to_not_convert: typing.Optional[typing.List] = None **kwargs )

from_dict

( config_dict return_unused_kwargs = False **kwargs )

Create configuration from a dictionary.

Create the appropriate quantization method based on configuration.

Validate configuration and set defaults.

Convert configuration to a dictionary.
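A minimal sketch of int4 weight-only quantization with torchao, using the string shorthand for the quantization type (requires the torchao package; the checkpoint name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, TorchAoConfig

# group_size is forwarded as a kwarg to the underlying torchao config.
ao_config = TorchAoConfig("int4_weight_only", group_size=128)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # illustrative checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=ao_config,
)
```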

BitNetConfig

class transformers.BitNetConfig

( modules_to_not_convert: typing.Optional[typing.List] = None **kwargs )

Safety checker that arguments are correct

SpQRConfig

class transformers.SpQRConfig

( bits: int = 3 beta1: int = 16 beta2: int = 16 shapes: typing.Optional[typing.Dict[str, int]] = None modules_to_not_convert: typing.Optional[typing.List[str]] = None **kwargs )

A wrapper class for the SpQR quantization parameters. Refer to the original publication for more details.

Safety checker that arguments are correct - also replaces some NoneType arguments with their default values.

FineGrainedFP8Config

class transformers.FineGrainedFP8Config

( activation_scheme: str = 'dynamic' weight_block_size: typing.Tuple[int, int] = (128, 128) modules_to_not_convert: typing.Optional[typing.List] = None **kwargs )

FineGrainedFP8Config is a configuration class for fine-grained FP8 quantization, used mainly for DeepSeek models.

Safety checker that arguments are correct
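A minimal sketch of fine-grained FP8 quantization with the default dynamic activation scheme and 128x128 weight blocks (needs a GPU with FP8 support; the checkpoint name is illustrative):

```python
from transformers import AutoModelForCausalLM, FineGrainedFP8Config

fp8_config = FineGrainedFP8Config()  # activation_scheme="dynamic", weight_block_size=(128, 128)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # illustrative checkpoint
    quantization_config=fp8_config,
    device_map="auto",
)
```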

QuarkConfig

class transformers.QuarkConfig

( **kwargs )
