GitHub - CASE-Lab-UMD/LLM-Drop: The official implementation of the paper "Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping (TMLR)".
Shwai He*, Guoheng Sun*, Zheyu Shen, Ang Li
Project Page • News • Installation • Layout • Models • Benchmark • Citation
This is the official implementation for the paper Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping (TMLR). Early version: What Matters in Transformers? Not All Attention Is Needed.
Introduction
This project studies architectural redundancy in Transformer-based LLMs and provides practical pipelines for:
- Block Drop
- Layer Drop (Attention/MLP)
- Joint Layer Drop
- Post-training quantization (AWQ/GPTQ)
The dropping pipeline is built on LLaMA-Factory. Quantization support is built on AutoAWQ and AutoGPTQ.
News
- Feb 2026: The paper was published in Transactions on Machine Learning Research (TMLR).
- May 2025: Awarded the Qualcomm Innovation Fellowship (QIF) North America for the proposal "Less Attention, Much Faster: Toward a Future of Efficiency-Optimized Transformer Architectures."
- Nov 2024: Added support for more model families (Gemma2, Baichuan, DeepSeek, Yi, Solar).
- Sep 2024: Released dropped-model checkpoints in this Hugging Face collection.
- Jun 2024: Released arXiv preprint and code.
Installation
conda create -n llm-drop python=3.10 -y
conda activate llm-drop

git clone https://github.com/CASE-Lab-UMD/LLM-Drop.git
cd LLM-Drop

# Core dropping pipeline
pip install -e .

# Quantization dependencies (optional)
cd src/llmtuner/compression/quantization/AutoAWQ
pip install -e .

cd AutoAWQ_kernels
pip install -e .

cd ../../AutoGPTQ
pip install -vvv --no-build-isolation -e .

cd ../../../../../..
Repository Layout
- src/compress.py: main entry for dropping/compression workflow.
- scripts/dropping/*.sh: example scripts for block/layer dropping.
- scripts/benchmark/benchmark_lm_eval.sh: LM-Eval benchmark script.
- scripts/benchmark/benchmark_speed.sh: speed benchmark wrapper.
- src/benchmark_speed.py: speed benchmarking implementation.
- scripts/quantization/*.sh: AWQ/GPTQ quantization examples.
Prepare Models
- Download a base model from Hugging Face (for example, mistralai/Mistral-7B-v0.1).
- Add auto_map in the model config.json so Transformers can load custom dropped-model classes.
- Set drop lists in config.json:
- Drop attention layers:
"drop_mlp_list": [], "drop_attn_list": [25, 26, 24, 22]
- Drop MLP layers:
"drop_mlp_list": [26, 27, 25, 24], "drop_attn_list": []
- Drop full blocks:
"drop_mlp_list": [26, 25, 24, 27], "drop_attn_list": [26, 25, 24, 27]
Example auto_map for Mistral:
"auto_map": {
  "AutoConfig": "configuration_dropped_mistral.MistralConfig",
  "AutoModelForCausalLM": "modeling_dropped_mistral.MistralForCausalLM"
}
See model files under src/llmtuner/compression/prune/models.
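The config edits described above can also be scripted. The following is a minimal sketch, not part of the repository: the helper name patch_config and the local path in the usage line are hypothetical, and it assumes that the drop lists hold 0-based layer indices and that the config exposes the layer count under num_hidden_layers (the usual key in Transformers configs). Adjust it if your model differs.

```python
import json
from pathlib import Path


def patch_config(config_path, auto_map, drop_attn_list, drop_mlp_list):
    """Add auto_map and the two drop lists to a config.json file in place."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    # Assumption: indices are 0-based and bounded by num_hidden_layers.
    # The range check is skipped if the config does not define that key.
    num_layers = config.get("num_hidden_layers")
    if num_layers is not None:
        for idx in list(drop_attn_list) + list(drop_mlp_list):
            if not 0 <= idx < num_layers:
                raise ValueError(f"layer index {idx} outside [0, {num_layers})")

    config["auto_map"] = auto_map
    config["drop_attn_list"] = list(drop_attn_list)
    config["drop_mlp_list"] = list(drop_mlp_list)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# auto_map values copied from the Mistral example above.
MISTRAL_AUTO_MAP = {
    "AutoConfig": "configuration_dropped_mistral.MistralConfig",
    "AutoModelForCausalLM": "modeling_dropped_mistral.MistralForCausalLM",
}

if __name__ == "__main__":
    # Hypothetical local path to the downloaded model directory.
    patch_config("Mistral-7B-v0.1/config.json", MISTRAL_AUTO_MAP, [25, 26, 24, 22], [])
```

Passing the same indices in both lists reproduces the "Drop full blocks" configuration.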
Run Dropping
Block Drop
bash scripts/dropping/block_drop.sh
Layer Drop
bash scripts/dropping/layer_drop.sh
Joint Layer Drop
bash scripts/dropping/layer_drop_joint.sh
These scripts estimate module importance, select layers/blocks to drop, and generate updated model configs/checkpoints.
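To make the "estimate importance, then select" step concrete, here is a small illustrative sketch. It assumes importance is scored as one minus the cosine similarity between a module's input and output hidden states, so a module that barely changes its input counts as unimportant. That metric and the helper names are assumptions made for illustration. The scripts may compute importance differently.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def importance(hidden_in, hidden_out):
    """Assumed score: 1 - cos(input, output). Near 0 means the module changes little."""
    return 1.0 - cosine_similarity(hidden_in, hidden_out)


def select_layers_to_drop(scores, num_drop):
    """Return indices of the num_drop lowest-importance layers, least important first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:num_drop]


if __name__ == "__main__":
    # Toy per-layer scores, not measured values.
    toy_scores = [0.9, 0.05, 0.4, 0.01]
    print(select_layers_to_drop(toy_scores, 2))  # prints [3, 1]
```

The returned indices are what would go into drop_attn_list or drop_mlp_list, depending on which module type was scored.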
Benchmark
1) Task Performance
bash scripts/benchmark/benchmark_lm_eval.sh
Notes:
- This benchmark depends on EleutherAI/lm-evaluation-harness.
- For strict reproduction, the repo uses this fork: s1ghhh/lm-evaluation-harness.
- Use modeling files in src/llmtuner/model when loading Mistral/Llama with dropped configs.
2) Inference Speed
bash scripts/benchmark/benchmark_speed.sh
Before running, edit placeholders in scripts/benchmark/benchmark_speed.sh:
- model_path
- save_file
- model_type
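For orientation, a speed benchmark of this kind usually comes down to timing repeated generation calls after a warm-up. The sketch below is not the code in src/benchmark_speed.py. The function measure_throughput and the generate_fn callable are hypothetical stand-ins that show the general shape of such a measurement.

```python
import time


def measure_throughput(generate_fn, num_runs=5, warmup=1):
    """Time generate_fn, which must return the number of tokens it produced.

    Returns tokens per second over num_runs timed calls.
    """
    for _ in range(warmup):
        generate_fn()  # warm-up calls are not timed

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(num_runs):
        total_tokens += generate_fn()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


if __name__ == "__main__":
    def fake_generate():
        time.sleep(0.01)  # stand-in for a real generation call
        return 128

    print(f"{measure_throughput(fake_generate):.1f} tokens/s")
```

When timing a real model on a GPU, call torch.cuda.synchronize() before reading the clock; otherwise asynchronous kernels make the measured time too short.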
3) Quantization
bash scripts/quantization/awq.sh
bash scripts/quantization/gptq.sh
Before running, edit placeholders in those scripts (model_path, quant_path) and ensure CUDA-compatible package versions.
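One quick way to review package versions before quantizing is to list what is installed in the active environment. This sketch uses only the Python standard library. The distribution names in the usage line (torch, autoawq, auto-gptq) are assumptions about how the packages are registered and may differ for the editable installs above.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; adjust to match your environment.
    for name, version in installed_versions(["torch", "autoawq", "auto-gptq"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Compare the reported torch build against your local CUDA toolkit before building the quantization kernels.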
Citation
@article{he2026uncovering,
  title={Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping},
  author={Shwai He and Guoheng Sun and Zheyu Shen and Ang Li},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2026},
  url={https://openreview.net/forum?id=1I7PCbOPfe},
  note={}
}
Contact
- Shwai He: shwaihe@umd.edu
- Guoheng Sun: ghsun@umd.edu