
LLM Compressor

llmcompressor is an easy-to-use library for optimizing models for deployment with vllm.

✨ Read the announcement blog here! ✨

(Figure: LLM Compressor flow)

🚀 What's New!

Big updates have landed in LLM Compressor. Check out the latest release notes for new features.

Supported Formats

Supported Algorithms

When to Use Which Optimization

Please refer to docs/schemes.md for detailed information about available optimization schemes and their use cases.

Installation

pip install llmcompressor

Get Started

End-to-End Examples

End-to-end examples of applying quantization with llmcompressor are provided in the repository's examples directory.

User Guides

Deep dives into advanced usage of llmcompressor are covered in the user guides under docs.

Quick Tour

Let's quantize TinyLlama with 8-bit weights and activations using the GPTQ and SmoothQuant algorithms.

Note that the model can be swapped for a local or remote HF-compatible checkpoint, and the recipe may be changed to target different quantization algorithms or formats.
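For instance, the W8A8 recipe defined in the example below could be swapped for a weight-only variant. A minimal sketch, assuming the W4A16 preset name matches the schemes listed in docs/schemes.md:

from llmcompressor.modifiers.quantization import GPTQModifier

# Weight-only int4 quantization with GPTQ; activations are left unquantized.
# The "W4A16" scheme string is assumed from llmcompressor's preset names.
recipe = GPTQModifier(scheme="W4A16", targets="Linear", ignore=["lm_head"])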

Apply Quantization

Quantization is applied by selecting an algorithm and calling the oneshot API.

from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor import oneshot

# Select the quantization algorithms. In this case, we:
#   * apply SmoothQuant to make the activations easier to quantize
#   * quantize the weights to int8 with GPTQ (static per channel)
#   * quantize the activations to int8 (dynamic per token)
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

# Apply quantization using the built-in open_platypus dataset.
#   * See the examples directory for demos showing how to pass a custom calibration set.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-INT8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
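To calibrate on your own data instead of a built-in dataset, a Hugging Face Dataset can be passed to oneshot directly. The following is a minimal sketch based on the patterns in the examples directory; the dataset choice (ultrachat_200k) and the preprocessing steps are illustrative assumptions, not requirements:

from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor import oneshot

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select a small calibration split (any representative text dataset works).
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

# Render each chat to plain text, then tokenize.
def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(preprocess)
ds = ds.map(tokenize, remove_columns=ds.column_names)

# Pass the prepared dataset in place of the built-in dataset name.
oneshot(
    model=MODEL_ID,
    dataset=ds,
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-INT8",
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)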

Inference with vLLM

The checkpoints created by llmcompressor can be loaded and run in vllm:

Install:

pip install vllm

Run:

from vllm import LLM

model = LLM("TinyLlama-1.1B-Chat-v1.0-INT8")
output = model.generate("My name is")
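For more control over decoding, standard vLLM sampling parameters can be supplied to generate; a short usage sketch:

from vllm import LLM, SamplingParams

model = LLM("TinyLlama-1.1B-Chat-v1.0-INT8")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = model.generate(["My name is"], sampling_params)
print(outputs[0].outputs[0].text)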

Questions / Contribution