# PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction (CVPR 2025)

## 🎯 News

[2025.3.6] [Update 🔥] Please post any models that need PyramidDrop compatibility in this issue.

[2025.2.27] 🚀 Our paper has been accepted by CVPR 2025!!!

[2024.10.24] 🚀 Our paper was featured as the #1 Paper of the Day on Hugging Face Daily Papers.

[2024.10.23] 🚀 We released the paper on arXiv and Hugging Face!

## 💡 Highlights

## 👨‍💻 Todo

## 🔧 Install

1. Clone this repository and navigate to the PyramidDrop folder:

```bash
git clone https://github.com/Cooperx521/PyramidDrop.git
cd PyramidDrop
```

2. Install the package:

```bash
conda create -n pdrop python=3.10 -y
conda activate pdrop
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

3. Install additional packages for training:

```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```

## ⭐️ Quick Start

### 1. Core Implementation

The main implementation of PyramidDrop lives in `llava/model/modeling_llama_pdrop.py`. The efficient LLM forward pass with PyramidDrop is implemented in `pdrop_forward`, and Rank & Drop in `pdrop_rank_drop`. To prepare the inputs for PyramidDrop, `prepare_inputs_labels_for_multimodal_pdrop` collects the text length, the image placeholder position, and the number of image tokens; the text length is used only during inference, to locate the position of the last instruction token. The entire multimodal forward process is encapsulated in the `LlavaLlamaForCausalLM_PDrop` class.
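
For orientation, here is a minimal, single-head sketch of the Rank & Drop idea. It is a simplification, not the repository's `pdrop_rank_drop` code: the name `rank_and_drop` and its signature are invented for illustration, and the real implementation works with the multi-head attention and RoPE of the actual LLaMA layers. The gist is that the last instruction token's query scores every image token's key, and only the top-ranked fraction survives into the next stage.

```python
import torch

def rank_and_drop(hidden_states, q_proj, k_proj,
                  image_start, num_image_tokens, last_instr_pos,
                  keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of image tokens, ranked by the
    attention score they receive from the last instruction token."""
    # Query of the last instruction token; keys of the image tokens only.
    q = q_proj(hidden_states[:, last_instr_pos:last_instr_pos + 1])      # (B, 1, D)
    img = hidden_states[:, image_start:image_start + num_image_tokens]   # (B, N, D)
    k = k_proj(img)                                                      # (B, N, D)

    # Scaled dot-product score of the instruction query against each image key.
    scores = (q @ k.transpose(1, 2)).squeeze(1) / k.shape[-1] ** 0.5     # (B, N)

    # Select the highest-scoring tokens, restoring their original order.
    num_keep = max(1, int(num_image_tokens * keep_ratio))
    top_idx = scores.topk(num_keep, dim=-1).indices.sort(dim=-1).values  # (B, num_keep)
    kept = torch.gather(img, 1, top_idx.unsqueeze(-1).expand(-1, -1, img.shape[-1]))

    # Reassemble: text before the image, kept image tokens, text after.
    return torch.cat([hidden_states[:, :image_start],
                      kept,
                      hidden_states[:, image_start + num_image_tokens:]], dim=1)
```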

### 2. Compatibility

The current implementation is based on transformers 4.37.2. To run it on another transformers version, locate that version's `modeling_llama.py` and make some simple modifications: adjust the `pdrop_forward` and `pdrop_rank_drop` functions to match that version.

To use PyramidDrop with your own model: if its LLM is LLaMA-based, you can add the functions from the Core Implementation above and run it with little effort; if it is not, you will need to adapt these functions to your LLM's forward function, as sketched below.
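
As a rough picture of where the hook sits, the loop below applies rank-and-drop at each stage boundary and updates the positional book-keeping, reusing the `rank_and_drop` sketch above. This is again illustrative, not the actual `pdrop_forward`: `pdrop_style_forward`, the `stage_boundaries` values, and the `torch.nn.Identity` stand-ins for the query/key projections are all assumptions.

```python
import torch

def pdrop_style_forward(layers, hidden_states,
                        image_start, num_image_tokens, last_instr_pos,
                        stage_boundaries=(8, 16, 24), keep_ratio=0.5):
    """Run a stack of decoder layers, dropping image tokens at stage ends.

    `layers` is any sequence of callables mapping (B, S, D) -> (B, S, D);
    `stage_boundaries` marks the layer indices where a new stage begins.
    """
    for idx, layer in enumerate(layers):
        if idx in stage_boundaries and num_image_tokens > 1:
            seq_before = hidden_states.shape[1]
            hidden_states = rank_and_drop(
                hidden_states,
                q_proj=torch.nn.Identity(),  # stand-in; use the layer's real
                k_proj=torch.nn.Identity(),  # q/k projections in practice
                image_start=image_start,
                num_image_tokens=num_image_tokens,
                last_instr_pos=last_instr_pos,
                keep_ratio=keep_ratio)
            # Fewer image tokens from here on; every position after the image
            # block (including the last instruction token) shifts left.
            dropped = seq_before - hidden_states.shape[1]
            num_image_tokens -= dropped
            last_instr_pos -= dropped
        hidden_states = layer(hidden_states)
    return hidden_states
```

The main porting work for a non-LLaMA LLM is exactly this book-keeping: recomputing attention masks, position ids, and the KV cache for the shortened sequence in whatever form that model's forward expects.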

## Efficient Training

PyramidDrop can be used during the training of LLaVA and Open-LLaVA-NeXT. First, prepare the pretraining and finetuning data following the instructions of those two repositories. Then you can use PyramidDrop directly to reduce the training cost; the training scripts can be found in `scripts/v1_5/pdrop_train` and `scripts/v1_6/pdrop_train`.
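
As a back-of-the-envelope illustration of the saving, assume four equal stages with a 0.5 ratio (the paper's default configuration) and the 576 image tokens of LLaVA-1.5; the numbers below illustrate the scheme rather than being read from the training scripts:

```python
# Image tokens entering each of four stages with a 0.5 keep ratio,
# starting from the 576 image tokens of LLaVA-1.5 (illustrative numbers).
tokens, ratio = 576, 0.5
per_stage = []
for _ in range(4):
    per_stage.append(tokens)
    tokens = int(tokens * ratio)

print(per_stage)                        # [576, 288, 144, 72]
print(sum(per_stage) / len(per_stage))  # 270.0 image tokens per layer on average
```

Averaged over the stack, the layers see roughly 270 image tokens instead of 576, which is where the reduction in training cost comes from.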

Options to note:

## Efficient Inference

We follow the original LLaVA evaluation for most benchmarks. For MMStar, DocVQA, InfoVQA, ChartQA, and OCRVQA we use VLMEvalKit.

See `Evaluation.md` to prepare for inference.

To run efficient inference with PyramidDrop on the original LLaVA-1.5 and LLaVA-NeXT, the evaluation scripts can be found in `scripts/v1_5/pdrop_eval` and `scripts/v1_6/pdrop_eval`.

## ❤️ Acknowledgments