omni-research/Tarsier2-7b-0115 · Hugging Face

Tarsier Model Card

Introduction

We propose Tarsier2-7B(-0115) as the latest member of the Tarsier series. Tarsier2-7B sets new state-of-the-art results across 16 public video understanding benchmarks, spanning tasks such as video captioning, video question answering, video grounding, and hallucination testing. On detailed video description, the main focus of the Tarsier series, Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in both automatic metrics and human evaluation.

Compared to Tarsier-7B, Tarsier2-7B is comprehensively upgraded in its base model (Qwen2-VL-7B) and in its training data and training stages.

Model details

**Model date:** Tarsier2-Recap-7b was trained in December 2024.

Paper or resources for more information: https://arxiv.org/abs/2501.07888 (paper), https://github.com/bytedance/tarsier (code)

Performance

Tarsier2-7B excels in various video understanding tasks, including video captioning, video question answering, video grounding, and hallucination testing.


Figure 2: Performance comparison of Tarsier2 with previous SOTA models at 7B-scale and GPT-4o.

License

Qwen/Qwen2-VL-7B-Instruct license.

Intended use

**Primary intended uses:** The primary use of Tarsier is research on large multimodal models, especially video description.

**Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

How to Use

See https://github.com/bytedance/tarsier?tab=readme-ov-file#usage.
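The usage instructions in the GitHub repository remain the authoritative reference for running the model. As a minimal sketch of one likely first step, the snippet below fetches the checkpoint files from this Hugging Face repository with `huggingface_hub.snapshot_download`. It assumes the `huggingface_hub` package is installed. The helper name `download_checkpoint` and the default target directory are hypothetical choices for illustration, not part of the Tarsier codebase. The snippet only downloads files and does not run inference.

```python
from pathlib import Path

# Repository name as given in this model card.
REPO_ID = "omni-research/Tarsier2-7b-0115"


def download_checkpoint(target_dir: str = "checkpoints/Tarsier2-7b-0115") -> Path:
    """Download a snapshot of the model repository and return its local path.

    `target_dir` is a hypothetical default; any writable directory works.
    """
    # Imported lazily so this module can be loaded even where
    # huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    # snapshot_download fetches all files of the repository and returns
    # the path of the local folder that contains them.
    local_path = snapshot_download(repo_id=REPO_ID, local_dir=target_dir)
    return Path(local_path)
```

Calling `download_checkpoint()` requires network access and enough disk space for a 7B-parameter checkpoint. The returned path can then be passed to the inference scripts described in the repository's usage section.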

Where to send questions or comments about the model: https://github.com/bytedance/tarsier/issues

Citation

If you find our work helpful, feel free to cite us as:

@misc{yuan2025tarsier2advancinglargevisionlanguage,
      title={Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding}, 
      author={Liping Yuan and Jiawei Wang and Haomiao Sun and Yuchen Zhang and Yuan Lin},
      year={2025},
      eprint={2501.07888},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.07888}, 
}