Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding

Liping Yuan∗ Jiawei Wang∗ Haomiao Sun∗ Yuchen Zhang∗ Yuan Lin†

ByteDance Research
{yuanliping.0o0,wangjiawei.424,sunhaomiao,zhangyuchen.zyc,linyuan.0}@bytedance.com
Project Site: https://github.com/bytedance/tarsier

Abstract

We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed for generating detailed and accurate video descriptions, while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advancements through three key upgrades: (1) Scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) Performing fine-grained temporal alignment during supervised fine-tuning; (3) Using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8% over GPT-4o and 5.8% over Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6% performance advantage over GPT-4o and +24.9% over Gemini-1.5-Pro. Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question-answering, video grounding, hallucination test, and embodied question-answering, demonstrating its versatility as a robust generalist vision-language model.

Benchmark Previous SOTA
DREAM-1K[105] Tarsier-7B[105]
MVBench[57] InternVL2.5-8B[20]
TVBench[25] IXC-2.5 7B[124]
TOMATO[94] Qwen2-VL-7B[106]
Vinoground[123] LLaVA-OV-7B[53]
TempCompass[69] Qwen2-VL-7B[106]
Video-MME[31] NVILA-7B[70]
LongVideoBench[110] Apollo-7B[130]
TemporalBench[12] LLaVA-Video-7B[127]
MLVU[128] InternVL2.5-8B[20]
MMBench-Video[30] MiniCPM-V-2.6 [119]
VideoHallucer[109] Qwen2-VL-7B[106]
EventHallusion[122] Tarsier-7B[105]
E.T. Bench[67] E.T. Chat[67]

Figure 1: Performance comparison of Tarsier2 with previous SOTA models at 7B-scale and GPT-4o. We report the overall average scores for benchmarks with multiple subtasks/metrics.

∗Equally contributed. †Corresponding author.

Contents
  1 Introduction
  2 Related Work
  3 Approach
    3.1 Pre-training
    3.2 Supervised fine-tuning
    3.3 Direct Preference Optimization
  4 Experiments
    4.1 Quantitative Results
      4.1.1 Video Captioning
      4.1.2 Short-Video Question Answering
      4.1.3 Long-Video Question Answering
      4.1.4 Hallucination
      4.1.5 Video Grounding
      4.1.6 Embodied Question Answering
    4.2 Ablation Study
      4.2.1 Pre-training
      4.2.2 SFT
      4.2.3 DPO
    4.3 Video Recaptioning using Tarsier2
  5 Conclusion
  A Training hyper-parameters
  B Public datasets of pre-training stage
  C Annotation process for SFT data
  D Detail setting of DPO training
  E Detailed results of individual datasets at different stages
  F Tarsier2-Recap-585K Data Composition
  G Qualitative Comparison of the SFT Process
  H DREAM-1K cases

1 Introduction

With the rapid advancement of large vision-language models (LVLMs) [21, 56, 61, 62, 105, 106], significant progress has also been made in video understanding. Leading proprietary models, such as GPT-4o [41] and Gemini-1.5-Pro [102], have achieved state-of-the-art (SOTA) performance across a variety of video understanding tasks. Several open-source models [61, 114, 23, 52, 20, 53] also perform strongly on a number of video understanding benchmarks [25, 57, 67, 109, 128], although they still lag behind proprietary models, particularly in complex, open-ended generation tasks. Despite these advancements, current models remain behind human-level video understanding [78, 86, 19], mainly due to persistent challenges such as accurately perceiving temporal dynamics, spatial-temporal reasoning, and model hallucinations.

In this paper, we introduce Tarsier2, a 7B-parameter LVLM model that can outperform both GPT-4o and Gemini-1.5-Pro in generating detailed video descriptions, a fundamental challenge in video understanding. Beyond video description generation, Tarsier2 also achieves SOTA performance across various video question-answering (VQA) benchmarks at the same model size, surpassing or closely matching the performance of proprietary models on these VQA benchmarks. Figure 1 provides a comprehensive comparison between Tarsier2, GPT-4o and previous SOTA results for open-source LVLMs with the same scale. Figure 2 presents examples illustrating Tarsier2’s video understanding capability across different tasks.

Figure 2: Overview of Tarsier2 capabilities. Based on its strong ability for detailed video description, Tarsier2 excels in a variety of video-centric tasks. Click the play buttons to view the videos.

Tarsier2 employs a simple model architecture consisting of a vision encoder, a vision adaptor, and a large language model (LLM). We meticulously design a three-stage training procedure: pre-training, supervised fine-tuning (SFT), and reinforcement learning (RL). In comparison with Tarsier [105], Tarsier2 features several key improvements that significantly enhance its performance:

We conduct extensive experiments to evaluate Tarsier2 against both proprietary and open-source LVLMs. For video description, Tarsier2 outperforms all other models, surpassing both proprietary and open-source LVLMs in evaluations on DREAM-1K [105] and E.T. Bench-Captioning [67]. In human side-by-side evaluations, Tarsier2-7B shows a +8.6% advantage over GPT-4o and a +24.9% advantage over Gemini-1.5-Pro. It also significantly outperforms the leading open-source model, Tarsier-34B, with a +51.4% advantage. Furthermore, Tarsier2-7B proves to be a versatile generalist model, setting new SOTA results on public benchmarks for video question-answering [25, 94, 123], hallucination test [122], video grounding [67] and embodied QA [93]. Finally, we present extensive ablation studies to identify the key factors contributing to the model’s strong performance. We also release a recaptioning dataset, Tarsier2-Recap-585K, and demonstrate its effectiveness in enhancing the capabilities of existing LVLMs for video description and general video understanding.

2 Related Work

Video-LLMs

Recently, research on Video LLMs has surged [56, 76, 75, 121, 61, 6, 104, 114, 52, 62, 127, 106, 54, 27, 2, 72, 20, 130], with efforts focusing on model architectures and video-text data collection. On the architecture side, current studies emphasize visual representation [114, 106, 130], visual token resampling [114, 20, 115, 58], and the integration of Vision Transformers (ViT) with LLMs [106, 55, 65, 8]. Tarsier2 adopts a simple architecture composed of a visual encoder, a visual adaptor, and an LLM. Despite its simplicity, we demonstrate that a meticulously designed training strategy enables Tarsier2 to achieve strong video understanding capabilities.

In terms of video-text data, while many efforts aim to collect datasets for training Video LLMs, their quantity and quality remain limited. For example, LLaVA-Video [127] is trained on just 1.3 million video-text pairs, and several open-source models, such as InternVL2.5 [20], Aria [54], and VILA-1.5 [62], are trained on fewer than 5 million pairs. Although larger datasets like HowTo100M [81], HD-VILA [116], Panda-70M [18], and InternVid-10M [108] exist, they either cover limited domains or contain overly simplistic or low-quality text. Furthermore, some studies do not disclose the volume of video data used [106, 130, 27, 54].

To address these challenges, our work focuses on improving the quantity and quality of video-text data. We newly collected 20 million video-text pairs, spanning a wide range of video genres. In total, 40 million pairs are used in the final pre-training stage. Additionally, we annotated 150K fine-grained video descriptions for the SFT stage.

Video Description

Video description, a foundational task in video understanding, has long been a central focus of research. Early work [112, 117, 17] typically involved pre-training video-language models and fine-tuning them on datasets such as MSVD[14], MSR-VTT[113], and VATEX[107], which provide single-sentence video summaries.

Recent advancements in LVLMs have improved video description, enabling more detailed outputs beyond simple summarization. However, generating comprehensive video descriptions presents challenges beyond model architecture. While multi-frame processing and temporal modeling are crucial, large-scale and richly annotated <video, description> datasets are equally important. Existing alignment datasets, such as HD-VILA [116] and HowTo100M [81], provide only concise descriptions, limiting detailed video understanding. To address this, datasets such as ShareGPT4Video [16] use a pipeline in which LVLMs (e.g., GPT-4V [5]) annotate frames and LLMs (e.g., GPT-4 [1]) aggregate the annotations. This improves detail but often leads to verbosity and hallucinations. Recent works [127, 99] use proprietary Video-LLMs, such as GPT-4o [41] and Gemini-1.5 [102], for annotation, but their high cost limits application to smaller datasets.

For Tarsier2, we collect a large dataset of video-text pairs. In particular, we automatically build meaningful video-text pairs from online commentary videos. These commentaries cover both low-level (atomic actions) and high-level (plot) visual elements, enhancing the model’s understanding at various levels of granularity. In addition to data collection, Tarsier2 also uses a meticulously designed three-stage training process, where DPO training after SFT further refines description accuracy and detail.

3 Approach

We initialized Tarsier2 with Qwen2-VL [106] weights and employed a three-stage training strategy. First, we pre-trained Tarsier2 on a large-scale corpus of 40 million video-text pairs. Next, we fine-tuned the model on moderate-sized, curated, human-annotated datasets in two phases: one targeting video descriptions with fine-grained grounding and the other focusing on natural, instruction-following video descriptions. Finally, we applied Direct Preference Optimization (DPO) [89] using automatically generated preference data to further enhance the quality of the video descriptions. The training process is detailed below; for a comprehensive list of hyper-parameters, please refer to Appendix A.

3.1 Pre-training

The pre-training stage encompasses a variety of tasks, including video captioning, video question answering, action recognition, action grounding, (multi-)image understanding, and text generation. The training data consists of 20 million samples from public datasets and 20 million newly collected in-house samples. Figure 3 illustrates the composition of the pre-training data, with a detailed breakdown presented in Appendix B. Our findings indicate that the in-house data significantly enhances the model’s performance, complementing the public datasets. In the following, we describe the pipeline used for in-house data collection.

Figure 3: Summary of datasets used in the pre-training stage of Tarsier2.

We collected a large group of videos from the Internet, spanning diverse genres such as animation, movies, TV series, short videos, stock footage, games and so on. The videos are categorized into three types:

Commentary videos represent a significant portion of the pre-training data. Unlike traditional video-text datasets, such as HowTo100M [81], which rely on ASR transcripts, commentary data demonstrates stronger alignment between video and text. This commentary not only describes low-level visual elements, such as atomic actions, but also highlights high-level information like plot details. This type of data can substantially enhance the model’s visual understanding at varying levels of granularity.

In addition to video caption data, we incorporate large-scale synthetic datasets for tasks such as object tracking, frame order prediction, image retrieval, video question-answering, and image captioning during pre-training.

Overall, our pre-training dataset consists of 40 million samples. We trained Tarsier2 on this dataset using 128 H100 GPUs, with all components of Tarsier2 set to be trainable. For each video, we sampled between 16 and 128 frames, depending on its duration. In total, the pre-training stage of Tarsier2 processed approximately 200 billion tokens.
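The text above states only that between 16 and 128 frames are sampled per video depending on duration; the exact rule is not given here. The following is a minimal sketch under two assumptions that are not from the paper: the frame budget is roughly one frame per second clamped to the stated range, and frames are taken at evenly spaced positions. The function names `num_frames_for` and `uniform_frame_indices` are hypothetical.

```python
def num_frames_for(duration_s: float, fps_target: float = 1.0,
                   min_frames: int = 16, max_frames: int = 128) -> int:
    """Hypothetical budget rule: about `fps_target` frames per second,
    clamped to the 16-128 range mentioned in the text."""
    return max(min_frames, min(max_frames, round(duration_s * fps_target)))


def uniform_frame_indices(total_frames: int, n: int) -> list[int]:
    """Pick up to n frame indices spread evenly over the video."""
    if total_frames <= 0:
        raise ValueError("video has no frames")
    n = min(n, total_frames)  # cannot pick more distinct frames than exist
    # Take the centre of each of n equal-width segments.
    return [int((k + 0.5) * total_frames / n) for k in range(n)]


if __name__ == "__main__":
    budget = num_frames_for(duration_s=45.0)      # 45 frames under the assumed rule
    print(budget, uniform_frame_indices(1350, budget)[:5])
```

Under this assumed rule, short clips fall back to the 16-frame minimum and long videos saturate at 128 frames, which bounds the number of visual tokens per sample.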

3.2 Supervised fine-tuning

During the SFT phase, our primary objectives are to further improve the model’s accuracy and comprehensiveness in video descriptions and ensure the outputs are human-like: well-structured, appropriately detailed, and capable of generating accurate long-form descriptions. To achieve this, we collected 150K video clips and conducted SFT in two stages.

Figure 4: An example of a video description with fine-grained temporal grounding. “<frame: $i$-$j$>” indicates that the following event is inferred from frames $i$ to $j$. Events are distinguished by color, with corresponding frames and descriptions marked in the same color to indicate their association.

In the first stage, each video clip in the SFT dataset is annotated with a detailed description with fine-grained temporal grounding. As shown in Figure 4, the annotations specify the frames corresponding to each event in the description. The annotation process is detailed in Appendix C. This fine-grained frame-event alignment enhances the model’s ability to accurately identify and describe events by focusing on temporal and visual cues, complementing traditional video-caption alignment. Our experiments demonstrate that this approach mitigates the omission of key events in generated video descriptions.
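To make the frame-event alignment concrete, the sketch below parses a grounded description written in the “<frame: i-j>” style shown in Figure 4 into (start, end, event) triples. The exact serialization used in the training data is not specified in this section, so the regular expression, the example string, and the function name `parse_grounded_description` are illustrative assumptions.

```python
import re

# Matches a tag such as "<frame: 3-7>" followed by the event text up to the next tag.
_TAG = re.compile(r"<frame:\s*(\d+)\s*-\s*(\d+)\s*>\s*([^<]*)")


def parse_grounded_description(text: str) -> list[tuple[int, int, str]]:
    """Split a grounded description into (first_frame, last_frame, event) triples."""
    events = []
    for start, end, event in _TAG.findall(text):
        first, last = int(start), int(end)
        if first > last:
            raise ValueError(f"frame range {first}-{last} is reversed")
        events.append((first, last, event.strip()))
    return events


if __name__ == "__main__":
    sample = ("<frame: 0-3> A man opens a bottle. "
              "<frame: 4-9> He pours wine into a glass.")
    for first, last, event in parse_grounded_description(sample):
        print(f"frames {first}-{last}: {event}")
```

Dropping the tags and joining the event strings recovers a plain description, which is one simple way to compare grounded and ungrounded annotations of the same clip.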

In the second stage of SFT, we refined the model’s output to achieve a more human-like style. We observed that the data used in the initial stage of SFT often fragmented complete events into multiple steps due to event-grounding requirements. For instance, the action of pouring wine might be divided into steps like opening the bottle, lifting it, and pouring. To address this, we incorporated more natural and human-like video description data. Specifically, in this stage, we designed diverse description instructions to reflect real-world variations in language, granularity, and style requirements. We then annotated each video’s description to align with its corresponding instruction, as detailed in Appendix C. This data allowed the model to better interpret varying instructions and generate more accurate and diverse video descriptions.

The training data for SFT-1 contains 150k video-description pairs, while SFT-2 comprises 50k diverse instructions and 150k refined video-description pairs. Each pair includes a video description aligned with one of the instructions. We trained Tarsier2 on this dataset using 32 H100 GPUs, with all components of Tarsier2 set to be trainable. For each video, we sampled 16 frames for training. The global training batch size was set to 64, and Tarsier2 was trained for 5000 iterations in each of the two phases. We used learning rates of 2e-5 and 2e-6 for the first and second SFT phases, respectively, which further improved performance.

3.3 Direct Preference Optimization

In this subsection, we introduce a novel automated method for collecting preference data for video description. By performing DPO [89] training on this data, we can further improve the model’s ability to generate high-quality, detailed video descriptions.

Figure 5: Preference data construction pipeline for DPO training.

Negative sampling

Existing works often sample multiple times from the same input (video and text prompt) to obtain preference pair candidates [111, 126, 100]. In practice, however, we found that (1) low-temperature sampling produces minimal variation in responses, and (2) high-temperature sampling often leads to uncontrollable or abnormal generations. To address these issues, we propose a new automated preference data collection approach that enhances controllability and consistently yields high-quality preference data.

In reinforcement learning (RL) terms, the VLM serves as a policy model $\pi_{\theta}$, typically initialized from the SFT model. Given an input prompt $x$, consisting of $N$ frames sampled from a video, $\pi_{\theta}$ generates a video description $y$. Then, the video frames are modified to produce a corrupted prompt $\tilde{x}$ through one of the following perturbations:

The corrupted prompt $\tilde{x}$ is input into $\pi_{\theta}$, generating a new description $\tilde{y}$. The resulting preference data is represented as $\{x, y_w = y, y_l = \tilde{y}\}$. The first two perturbations are designed to induce negative descriptions with temporal errors, while the latter two are designed to induce incomplete descriptions. Consequently, through DPO training, the model can be enhanced to produce descriptions with improved accuracy and completeness.

Figure 5 provides an example to illustrate the preference data construction pipeline. From a raw video, we first generate a positive response using the current model. Next, a corrupted video, created through clip-switching, is fed into the model to obtain a negative sample, which contains two hallucinations (highlighted in red).
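The construction can be sketched in a few lines. This section names clip-switching and down-sampling but does not define them precisely, so the implementations below are hypothetical readings: `clip_switch` swaps the order of two contiguous clips, and `down_sample` keeps every k-th frame. The `describe` callable is a stand-in for the policy model and is not an actual Tarsier2 API.

```python
import random
from typing import Callable, Sequence


def clip_switch(frames: Sequence, rng: random.Random) -> list:
    """Hypothetical clip-switching: split the frames in two and swap the halves."""
    n = len(frames)
    if n < 2:
        return list(frames)
    cut = rng.randrange(1, n)  # both clips are non-empty
    return list(frames[cut:]) + list(frames[:cut])


def down_sample(frames: Sequence, keep_every: int = 2) -> list:
    """Hypothetical down-sampling: keep every `keep_every`-th frame."""
    return list(frames[::keep_every])


def build_preference_pair(frames: Sequence,
                          describe: Callable[[Sequence], str],
                          perturb: Callable[[Sequence], Sequence]) -> dict:
    """Describe the clean frames (chosen) and the corrupted frames (rejected)."""
    chosen = describe(frames)
    rejected = describe(perturb(frames))
    # The pair is always stored against the clean prompt x.
    return {"x": list(frames), "y_w": chosen, "y_l": rejected}


if __name__ == "__main__":
    rng = random.Random(0)
    fake_model = lambda fs: "events seen in order: " + ", ".join(map(str, fs))
    pair = build_preference_pair(list(range(8)), fake_model,
                                 lambda fs: clip_switch(fs, rng))
    print(pair["y_w"])
    print(pair["y_l"])
```

Because both responses come from the same model and differ only in what the model was shown, the rejected description stays fluent and on-topic, which is the controllability that plain temperature sampling lacks.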

Preference data filtering

Given a prompt $x$, the response $\tilde{y}$ is generally worse than $y$. However, an effective filtering mechanism for valid preference data remains essential, as $\tilde{y}$ is not always strictly worse than $y$ (an obvious counterexample is a low-dynamic video, which is not significantly affected by the down-sampling perturbation). As shown on the right side of Figure 5, we use AutoDQ [105], an automatic method for evaluating the quality of video descriptions, with two metrics, $DQ_R$ and $DQ_P$. Given a reference description $d_{ref}$ and a description to be assessed $d_{pred}$, the AutoDQ scorer outputs a recall score ($DQ_R$: the ratio of events in $d_{ref}$ that are entailed by $d_{pred}$) and a precision score ($DQ_P$: the ratio of events in $d_{pred}$ that are entailed by $d_{ref}$). A preference pair $\{x, y_w = y, y_l = \tilde{y}\}$ is considered valid if the following conditions are met:

$$\Delta DQ_R \geq 0 \quad \text{and} \quad \Delta DQ_P \geq 0 \quad \text{and} \quad \Delta DQ_R + \Delta DQ_P \geq \delta, \tag{1}$$

where $\Delta DQ_R$ and $\Delta DQ_P$ denote the differences in AutoDQ recall and precision scores between $y_w$ and $y_l$, and $\delta$ is an adjustable threshold for tuning the filtering criteria.
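The filtering rule in Eq. (1) translates directly into a predicate. The sketch below assumes each difference is the score of the chosen response $y_w$ minus the score of the rejected response $y_l$, which is the direction implied by requiring both differences to be non-negative. The function name, the (recall, precision) tuple layout, and the example threshold are illustrative; the value of $\delta$ used in training is not given in this section.

```python
def is_valid_pair(dq_chosen: tuple[float, float],
                  dq_rejected: tuple[float, float],
                  delta: float) -> bool:
    """Apply Eq. (1). Each argument is an AutoDQ (recall, precision) pair."""
    d_recall = dq_chosen[0] - dq_rejected[0]
    d_precision = dq_chosen[1] - dq_rejected[1]
    # Chosen must be no worse on either metric, and better by at least delta in total.
    return d_recall >= 0 and d_precision >= 0 and (d_recall + d_precision) >= delta


if __name__ == "__main__":
    candidates = [
        ((0.6, 0.7), (0.4, 0.6)),  # better on both metrics
        ((0.6, 0.5), (0.4, 0.6)),  # precision got worse
        ((0.5, 0.5), (0.5, 0.5)),  # no measurable difference
    ]
    kept = [c for c in candidates if is_valid_pair(*c, delta=0.2)]
    print(len(kept), "of", len(candidates), "pairs kept")
```

Raising `delta` keeps only pairs with a clear quality gap, trading dataset size for a cleaner preference signal.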

During the DPO training phase, we use videos from the same training dataset, $\mathcal{D}$, as in the SFT phase to construct preference data. The policy model is then optimized by minimizing the DPO loss, expressed as:

$$\mathcal{L}_{DPO} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_{\theta}(y_w | x)}{\pi_{\rm ref}(y_w | x)} - \beta \log \frac{\pi_{\theta}(y_l | x)}{\pi_{\rm ref}(y_l | x)} \right) \right], \tag{2}$$

where $\pi_{\rm ref}$ denotes the model obtained during the SFT phase.
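As a numerical illustration of Eq. (2), the sketch below evaluates the per-pair loss from sequence log-probabilities in plain Python. It assumes the log-probabilities of each response under the policy and reference models have already been summed over tokens; the default `beta=0.1` is a placeholder, not the value used for Tarsier2 (the paper defers such settings to Appendix D).

```python
import math


def _softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_logp_w - ref_logp_w      # log pi_theta(y_w|x) / pi_ref(y_w|x)
    rejected_ratio = policy_logp_l - ref_logp_l    # log pi_theta(y_l|x) / pi_ref(y_l|x)
    margin = beta * (chosen_ratio - rejected_ratio)
    return _softplus(-margin)                      # equals -log(sigmoid(margin))


def batch_dpo_loss(pairs: list[tuple[float, float, float, float]],
                   beta: float = 0.1) -> float:
    """Mean loss over a batch, approximating the expectation in Eq. (2)."""
    return sum(dpo_loss(*p, beta=beta) for p in pairs) / len(pairs)


if __name__ == "__main__":
    # When the policy equals the reference, the margin is 0 and the loss is log 2.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Raising the chosen log-probability relative to the reference lowers the loss.
    print(dpo_loss(-9.0, -12.0, -10.0, -12.0))
```

At initialization the policy equals $\pi_{\rm ref}$, so every pair starts at a loss of $\log 2$; training lowers it by widening the gap between the chosen and rejected log-ratios.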

We conducted DPO training on a dataset of 20k preference pairs produced by the above data collection approach, with all parameters set to be trainable. For each video, we sampled 16 frames, as in the SFT phase. We trained Tarsier2 for 1,000 steps in total on 64 H100 GPUs, with each GPU loading one pair per training step, resulting in a global batch size of 64. See Appendix D for more details on DPO training.
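For orientation, the stated step counts and batch sizes can be converted into approximate passes over the data. This is simple arithmetic on the figures quoted in the text (5000 iterations at batch size 64 over 150k pairs per SFT phase; 1,000 steps at batch size 64 over 20k preference pairs), and it assumes that each batch element is one training sample with no packing or resampling.

```python
def passes_over_data(steps: int, global_batch: int, dataset_size: int) -> float:
    """Number of times the dataset is traversed, assuming one sample per batch slot."""
    return steps * global_batch / dataset_size


if __name__ == "__main__":
    sft = passes_over_data(steps=5000, global_batch=64, dataset_size=150_000)
    dpo = passes_over_data(steps=1000, global_batch=64, dataset_size=20_000)
    print(f"SFT (per phase): {sft:.2f} passes")   # 320,000 samples seen
    print(f"DPO:             {dpo:.2f} passes")   # 64,000 pairs seen
```

Under these assumptions each SFT phase covers its data a little more than twice, and DPO covers the preference pairs a little more than three times.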

4 Experiments

In this section, we first evaluate the model’s performance on various video understanding benchmarks, comparing it to several baselines. We highlight Tarsier2’s advantages not only in video description but also across other video understanding tasks. We then present an ablation study to examine key components of our approach.

4.1 Quantitative Results

4.1.1 Video Captioning

We evaluate Tarsier2 on two video captioning benchmarks: DREAM-1K [105] and E.T. Bench-Captioning [67]. DREAM-1K is a detailed video description benchmark featuring dynamic and diverse videos, assessing the model’s ability to describe fine-grained actions and events. E.T. Bench-Captioning is composed of two dense video captioning tasks (DVC and SLC), requiring key event localization and summary generation for segments in long-form videos.

| Model | Live-action | Animation | Stock | YouTube | Shorts | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| Proprietary models | | | | | | |
| GPT-4V [5] | 34.8/39.2/31.3 | 27.4/31.9/24.0 | 40.7/46.7/36.1 | 33.8/40.1/29.2 | 34.8/46.1/28.0 | 34.4/40.8/29.7 |
| GPT-4o [41] | 39.8/42.1/37.8 | 35.8/39.1/33.1 | 44.0/46.6/41.7 | 35.9/41.5/31.7 | 39.9/47.9/34.2 | 39.2/43.4/35.7 |
| Gemini-1.5-Flash [102] | 34.8/36.4/33.3 | 29.2/32.5/26.5 | 39.4/39.7/39.1 | 34.3/38.6/30.9 | 35.6/42.4/30.7 | 34.8/37.9/32.1 |
| Gemini-1.5-Pro [102] | 36.4/36.4/36.4 | 30.7/31.8/29.7 | 42.2/40.7/43.8 | 34.0/36.7/31.6 | 37.0/42.4/32.7 | 36.2/37.6/34.8 |
| Open-source models (>10B) | | | | | | |
| PLLaVA-34B [114] | 29.3/34.9/25.2 | 20.9/32.0/15.6 | 35.1/42.5/29.9 | 28.9/40.8/22.3 | 25.6/41.9/18.4 | 28.2/38.4/22.3 |
| VideoLLaMA2-72B [23] | 27.3/29.3/25.6 | 19.7/21.7/18.1 | 33.9/37.0/31.3 | 27.7/33.0/23.8 | 26.5/33.1/22.1 | 27.1/30.8/24.2 |
| LLaVA-OV-72B [53] | 31.7/32.8/30.7 | 27.7/30.6/25.2 | 38.0/39.6/36.6 | 34.1/34.7/33.5 | 33.8/41.8/28.4 | 33.2/35.9/30.9 |
| LLaVA-Video-72B [127] | 33.5/36.3/31.1 | 28.6/31.7/26.1 | 39.3/41.1/37.6 | 32.8/34.7/31.1 | 35.7/42.8/30.6 | 34.0/37.3/31.3 |
| Qwen2-VL-72B [106] | 32.1/33.7/30.6 | 27.6/32.6/23.9 | 41.1/41.2/41.1 | 32.0/38.1/27.7 | 32.1/41.0/26.4 | 33.2/37.3/29.9 |
| InternVL2.5-78B [20] | 25.3/31.5/21.1 | 21.8/28.8/17.6 | 33.5/38.1/29.9 | 31.0/38.5/25.9 | 31.1/41.7/24.8 | 28.6/35.7/23.9 |
| Tarsier-34B [105] | 38.5/39.6/37.5 | 32.2/35.8/29.2 | 41.7/46.4/37.8 | 34.5/41.1/29.7 | 34.0/44.1/27.7 | 36.3/41.4/32.4 |
| Open-source models (<10B) | | | | | | |
| Video-LLaVA-7B [61] | 19.4/24.3/16.2 | 15.3/21.2/11.9 | 27.0/33.5/22.7 | 21.2/31.9/15.8 | 18.5/29.4/13.5 | 20.4/28.1/16.0 |
| VideoLLaMA2-7B [23] | 25.1/28.7/22.2 | 20.4/25.5/17.0 | 32.6/35.5/30.2 | 27.5/33.5/23.4 | 24.5/34.1/19.2 | 26.2/31.5/22.4 |
| LLaVA-OV-7B [53] | 31.2/33.2/29.3 | 26.8/29.0/25.0 | 38.1/39.1/37.1 | 30.6/32.1/29.2 | 31.4/38.3/26.6 | 31.7/34.3/29.4 |
| LLaVA-Video-7B [127] | 31.4/35.2/28.4 | 27.6/32.9/23.8 | 36.7/39.7/34.1 | 33.0/39.5/28.3 | 33.4/42.5/27.5 | 32.5/37.9/28.4 |
| Qwen2-VL-7B [106] | 27.7/32.5/24.2 | 22.2/28.0/18.4 | 37.0/36.1/38.0 | 30.7/35.5/27.0 | 29.1/37.6/23.8 | 29.6/33.9/26.3 |
| InternVL2.5-8B [20] | 26.6/32.0/22.8 | 21.3/28.9/16.9 | 32.7/37.2/29.1 | 27.9/35.4/23.0 | 28.9/39.9/22.7 | 27.6/34.7/22.9 |
| Tarsier-7B [105] | 36.6/38.5/34.8 | 29.3/34.6/25.5 | 39.6/44.7/35.5 | 33.0/39.2/28.4 | 33.6/44.6/26.9 | 34.6/40.3/30.2 |
| Tarsier2-7B | 44.4/41.9/47.3 | 39.3/39.5/39.1 | 45.7/45.4/46.0 | 36.0/38.4/33.9 | 43.7/48.9/39.4 | 42.0/42.8/41.1 |

Table 1: Evaluation results on DREAM-1K. We report F1/Precision/Recall scores for each category and for the overall dataset. For open-source models, all results are tested with their official checkpoint and inference code under recommended setting. SOTA results of comparable scale (<10B) are bolded and overall best results are underlined.

Figure 6: Human side-by-side evaluation results of Tarsier2 versus other models.

As shown in Table 1, Tarsier2-7B achieves the highest F1 score among all open-source models in every DREAM-1K category, as well as the highest overall precision and recall among them, demonstrating its ability to generate more comprehensive and less hallucinatory video descriptions. Notably, Tarsier2-7B achieved an overall F1 score of 42.0%, surpassing the strongest proprietary model, GPT-4o (39.2%). It is also the first model to exceed a 40% overall recall score, highlighting its sensitivity to dynamic actions and events.

Figure 6 further presents the human side-by-side evaluation results of Tarsier2 versus the previous SOTA, Tarsier-34B, and two strong proprietary models, GPT-4o and Gemini-1.5-Pro. We randomly sampled 250 videos (50 per category) from DREAM-1K and asked experienced annotators to compare the descriptions generated by two different models and record their preference. Each pair of descriptions was randomly shuffled so that the annotators were blind to the description sources. Compared to Tarsier-34B, Tarsier2 loses in a small share of cases (15.8%) but wins in a much larger share (42.8%). Compared to Gemini-1.5-Pro, Tarsier2 also maintains a significant advantage (45.6% vs 20.7%). Although it ties with the strongest proprietary model, GPT-4o, in 40% of cases, Tarsier2 still holds a slight advantage (8.6%), demonstrating its outstanding performance in detailed video description. For a comparison of generated descriptions from different models on DREAM-1K, see Appendix H.
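The advantage figures can be related to win, tie, and loss rates with a small calculation. The sketch below assumes that "advantage" means win rate minus loss rate; this reading is not stated explicitly, but it is consistent with the Gemini-1.5-Pro comparison (45.6% vs 20.7%) and the +24.9% quoted in the abstract. The GPT-4o win and loss rates it derives are implied values under that assumption, not numbers reported in the paper.

```python
def advantage(win_pct: float, loss_pct: float) -> float:
    """Advantage as win rate minus loss rate, in percentage points."""
    return round(win_pct - loss_pct, 1)


def win_loss_from(tie_pct: float, advantage_pct: float) -> tuple[float, float]:
    """Recover win and loss rates from a tie rate and an advantage.

    Solves win + loss = 100 - tie and win - loss = advantage.
    """
    decided = 100.0 - tie_pct
    win = (decided + advantage_pct) / 2
    return round(win, 1), round(decided - win, 1)


if __name__ == "__main__":
    print(advantage(45.6, 20.7))                             # Gemini-1.5-Pro comparison
    print(win_loss_from(tie_pct=40.0, advantage_pct=8.6))    # implied GPT-4o win/loss
```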

Table 2 shows the evaluation results of dense video captioning on E.T. Bench-Captioning. Tarsier2-7B outperforms all open-source models with comparable settings (similar model scale, fine-tuned on E.T. Instruct 164K [67]) across all metrics, except for the SLC F1 score, which is slightly lower than that of Qwen2-VL-7B (24.6% vs 25.7%). These results highlight Tarsier2’s strengths in generating fine-grained descriptions for short videos and providing coarse-grained summaries for long videos.

| Model | DVC F1 | DVC Sim | SLC F1 | SLC Sim | Avg F1 | Avg Sim |
| --- | --- | --- | --- | --- | --- | --- |
| Proprietary models | | | | | | |
| GPT-4V [5] | 16.1 | 19.4 | 21.9 | 13.5 | 19.0 | 16.4 |
| GPT-4o [41] | 46.9 | 22.3 | 23.1 | 14.9 | 35.0 | 18.6 |
| Gemini-1.5-Flash [102] | 31.6 | 14.9 | 16.5 | 13.3 | 24.1 | 14.1 |
| Gemini-1.5-Pro [102] | 24.0 | 17.5 | 5.8 | 9.8 | 14.9 | 13.7 |
| Open-source models (>10B) | | | | | | |
| PLLaVA-34B [114] | 13.3 | 10.6 | 9.7 | 11.8 | 11.5 | 11.2 |
| LLaVA-OV-72B [53] | 41.9 | 16.3 | 25.6 | 13.9 | 33.8 | 15.1 |
| LLaVA-Video-72B [127] | 37.0 | 15.7 | 20.4 | 13.5 | 28.7 | 14.6 |
| Qwen2-VL-72B [106] | 15.3 | 13.9 | 11.0 | 12.8 | 13.2 | 13.4 |
| Open-source models (≤10B) | | | | | | |
| VideoLLaMA2-7B [23] | 0.6 | 14.5 | 0.0 | 15.2 | 0.3 | 14.8 |
| Video-LLaVA-7B [61] | 28.0 | 15.0 | 0.9 | 8.3 | 14.4 | 11.7 |
| LLaVA-OV-7B [53] | 22.0 | 15.1 | 9.5 | 10.6 | 15.8 | 12.8 |
| LLaVA-Video-7B [127] | 20.6 | 14.7 | 6.5 | 13.4 | 13.6 | 14.1 |
| E.T. Chat [67] † | 38.4 | 19.7 | 24.4 | 14.6 | 31.4 | 17.1 |
| Qwen2-VL-7B [106] † | 44.3 | 25.3 | 25.7 | 15.6 | 35.0 | 20.4 |
| Tarsier-7B [105] † | 42.8 | 19.1 | 23.7 | 15.2 | 33.2 | 17.1 |
| Tarsier2-7B † | 46.5 | 28.8 | 24.6 | 16.4 | 35.5 | 22.6 |

Table 2: Evaluation results on E.T. Bench-Captioning [67]. Results marked in gray are tested on a subset. † denotes that the model is fine-tuned on E.T. Instruct 164K. All results are transcribed from the official benchmark, except for LLaVA-OV, LLaVA-Video and Qwen2-VL, which are our evaluation using the official checkpoint and inference code.

4.1.2 Short-Video Question Answering

Model MVBench[57] PerceptionTest[86] TVBench[25] TOMATO[94] Vinoground[123] TempCompass[69]
test val test test Text/Video/Group mc/yn/cm/cg
Proprietary models
GPT-4o [41] 57.5 - 39.6 37.7 54.0/38.2/24.6 71.0/73.7/80.8/70.8
Gemini-1.5-Pro [102] - - 46.5 36.1 35.8/22.6/10.2 63.9/70.3/77.5/57.9
Open-source models (>10B)
LLaVA-OV-72B [53] 59.4 66.9 45.9 28.6 48.4/35.2/21.8 67.6/72.6/78.2/52.6
LLaVA-Video-72B [127] 64.1 74.3* 50.0 28.2 52.0/35.6/20.8 69.9/73.0/80.9/54.4
Qwen2-VL-72B [106] 73.6 66.5 52.7 37.9 50.4/32.6/17.4 76.0/75.9/84.6/58.6
Tarsier-34B [105] 67.6 60.4 53.8 34.3 37.8/32.0/15.0 69.8/74.0/73.0/60.9
Open-source models (≤10B)
LLaVA-OV-7B [53] 56.7 57.1 45.6 25.5 41.6/29.4/14.6 64.8/69.7/73.8/49.9
LLaVA-Video-7B [127] 58.6 67.9* 45.6 24.9 36.8/29.0/12.8 56.3/68.7/76.8/53.0
Qwen2-VL-7B [106] 67.0 - 43.8 31.5 40.0/23.4/12.4 68.5/72.8/77.3/54.2
Tarsier-7B [105] 62.6 53.9 45.8 28.6 29.8/22.2/8.6 58.7/58.0/54.2/55.3
Previous SOTA 72.0 [20] 70.0* [72] 51.6 [124] 31.5 [106] 41.6/29.4/14.6 [52] 68.5/72.8/77.3/54.2 [106]
Tarsier2-7B 71.5 71.6* 54.7 42.0 65.8/38.0/28.8 75.3/75.1/80.6/66.6

Table 3: Evaluation results on short video question answering benchmarks. * indicates that the training set has been observed in the training data mixture.

We evaluate Tarsier2-7B on several short-video question answering benchmarks to assess its ability to comprehend and reason about visual content. As shown in Table 3, Tarsier2-7B outperforms both proprietary and open-source models across various benchmarks, achieving state-of-the-art results. Tarsier2-7B exhibits exceptional performance in MVBench [57] and PerceptionTest [86], with scores of 71.5% and 71.6%, respectively.

Furthermore, Tarsier2-7B shows significant improvements on benchmarks that emphasize temporal reasoning, such as TVBench [25], TOMATO [94], and Vinoground [123]. It achieves 54.7% on TVBench, 42.0% on TOMATO, and 65.8%/38.0%/28.8% on Vinoground’s Text/Video/Group tasks, respectively. These results surpass those of all listed open-source and proprietary models, including GPT-4o and Gemini-1.5-Pro, with the sole exception of the Vinoground Video score, where GPT-4o is marginally higher (38.2% vs 38.0%).

Finally, Tarsier2-7B also excels on the TempCompass benchmark [69], which evaluates temporal perception across ten aspects and four task formats. It achieves 75.3%/75.1%/80.6%/66.6% on TempCompass’ mc/yn/cm/cg tasks, respectively, outperforming both open-source models and larger proprietary models in most cases. This performance further underscores Tarsier2’s ability to process and interpret temporal information in video content.

4.1.3 Long-Video Question Answering

Model Video-MME[31] LongVideoBench[110] TemporalBench[12] MLVU[128] MMBench-Video[30]
w/o subs val Binary Accuracy M-Avg val
Proprietary models
GPT-4o [41] 71.9 66.7 73.2 64.6 1.87
Gemini-1.5-Pro [102] 75.0 64.0 66.4 - 1.30
Open-source models (>10B)
VILA-1.5-40B [62] 60.1 - - 56.7 1.61
LLaVA-Video-72B [127] 70.5 61.9 72.4 74.4 1.71
Qwen2-VL-72B [106] 71.2 - 70.2 - 1.70
InternVL2.5-78B [20] 72.1 63.6 - 75.7 1.97
Tarsier-34B [105] 52.3 54.2 66.7 58.2 1.46
Open-source models (≤10B)
LLaVA-Video-7B [127] 63.3 58.2 63.6 70.8 1.60
Qwen2-VL-7B [106] 63.3 55.6 62.0 - 1.44
InternVL2.5-8B [20] 64.2 60.0 - 68.9 1.68
Tarsier-7B [105] 42.2 39.8 56.9 49.3 -
Previous SOTA 64.2 [70] 60.0 [20] 63.6 [127] 70.9 [130] 1.70 [119]
Tarsier2-7B 64.5 (128f) 58.6 (128f) 65.3 (128f) 67.9 (256f) 1.82 (128f)

Table 4: Evaluation results on long-video question answering benchmarks. We list the number of frames used for each benchmark when evaluating Tarsier2.

We evaluate Tarsier2 on long-video question answering benchmarks by uniformly sampling 128 or 256 frames, depending on the video length. Table 4 compares the results with proprietary and open-source models. Although our training set contains little long-video data, Tarsier2 still achieves SOTA among models under 10 billion parameters on three benchmarks (Video-MME, TemporalBench, and MMBench-Video) and remains competitive on the others.
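Uniform frame sampling can be sketched as follows. The exact index-selection rule used in the evaluation is not specified here, so this example assumes one common choice: split the video into equal-length segments and take the frame at the centre of each. The function name and the behaviour for videos shorter than the frame budget are illustrative assumptions.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` frame indices spread evenly over a video.

    Assumption: one frame from the centre of each equal-length segment.
    Videos with fewer frames than the budget return every frame.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if total_frames <= num_samples:
        return list(range(total_frames))
    segment = total_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]


# A hypothetical 10-minute video at 30 fps, sampled with a 128-frame budget.
indices = uniform_frame_indices(total_frames=18_000, num_samples=128)
print(len(indices), indices[:3], indices[-1])
```

A segment-centre rule avoids always picking the very first and last frames, which are often fade-ins or end cards; an endpoint-inclusive rule would be an equally valid reading of "uniform".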

4.1.4 Hallucination

Model VideoHallucer [109] EventHallusion [122]
Yes/No QA Yes/No QA Desc GPT
Basic/Hallucinated/Overall Entire/Interleave/Misleading/Overall Entire/Interleave/Misleading/Overall
Proprietary models
GPT-4o [41] 75.1/74.2/53.3 65.8/90.7/92.2/84.1 34.9/54.9/83.2/56.2
Gemini-1.5-Pro [102] 83.6/42.3/37.8 70.2/77.7/96.1/80.2 38.5/40.9/80.0/49.6
Open-source models (>10B)
Qwen2-VL-72B [106] 87.1/79.4/70.2 33.3/77.7/56.4/60.0 16.5/25.4/70.2/33.6
LLaVA-OV-72B [53] 88.3/62.6/55.2 47.4/26.9/90.1/48.3 24.8/34.7/71.3/40.7
LLaVA-Video-72B [127] 88.2/73.5/64.6 57.9/11.9/96.0/45.6 32.1/35.8/75.5/44.2
InternVL2.5-78B [20] 82.5/82.5/67.8 57.9/67.9/88.2/70.2 45.0/43.0/76.8/51.6
Tarsier-34B [105] 84.8/80.0/67.7 49.1/92.7/69.6/74.8 38.5/40.4/83.2/50.1
Open-source models (≤10B)
LLaVA-OV-7B [53] 81.1/69.6/53.8 46.5/67.4/86.1/66.2 22.0/26.4/73.4/36.4
LLaVA-Video-7B [127] 82.4/70.6/56.0 61.4/48.7/96.0/64.0 27.5/32.6/75.5/41.4
Qwen2-VL-7B [106] 85.0/70.8/59.3 35.1/94.3/57.4/68.6 14.7/16.1/67.0/27.8
InternVL2.5-8B [20] 72.7/78.3/53.6 46.5/69.2/90.2/68.2 23.9/20.7/60.0/31.0
Tarsier-7B [105] 76.4/60.8/41.4 43.9/82.4/79.4/70.9 35.8/29.5/72.6/41.6
Tarsier2-7B 86.5/78.3/67.0 60.5/93.3/95.1/84.6 54.6/53.1/93.7/63.3

Table 5: Evaluation results on hallucination benchmarks.

We evaluate Tarsier2 on two video hallucination benchmarks: VideoHallucer [109] and EventHallusion [122]. The results are summarized in Table 5. On VideoHallucer, Tarsier2-7B achieves an overall score of 67.0%, outperforming all baselines of similar model scale and even proprietary models such as GPT-4o and Gemini-1.5-Pro. On the EventHallusion video question-answering task, Tarsier2-7B achieves 84.6%, surpassing GPT-4o (84.1%) and all other baselines. On the detailed description matching task, which directly assesses hallucination in video descriptions by prompting GPT-4 to answer questions based on each model’s generated description, Tarsier2-7B again performs best, surpassing GPT-4o by 7.1% in Overall score.

4.1.5 Video Grounding

Model E.T. Bench-Grounding [67]
TVGF1 EPMF1 TALF1 EVSF1 VHDF1 MeanF1
Proprietary models
GPT-4V [5] 27.0 1.8 18.0 28.6 55.1 26.1
GPT-4o [41] 40.4 4.5 20.0 17.6 56.9 27.9
Gemini-1.5-Flash [102] 43.9 5.4 27.0 5.4 60.8 28.5
Gemini-1.5-Pro [102] 43.1 6.2 33.8 7.9 47.0 27.6
Open-source models (<10B)
LITA [39] 22.2 4.6 18.0 29.7 23.9 19.7
VTG-LLM [37] 15.9 3.7 14.4 26.8 48.2 21.8
TimeChat [91] † - - - - - 24.3
E.T. Chat [67] † 38.6 10.2 30.8 25.4 62.5 33.5
Tarsier-7B [105] † 39.6 9.0 25.0 25.4 47.6 30.9
Qwen2-VL-7B [106] † 39.7 7.0 26.9 17.1 66.9 33.5
Tarsier2-7B † 38.4 11.0 31.8 19.4 66.8 35.5

Table 6: Evaluation results on E.T. Bench-Grounding. Results marked in gray are tested on a subset. † denotes that the model is fine-tuned on E.T. Instruct 164K.

We evaluate the video grounding capability of models on E.T. Bench-Grounding, which combines various grounding tasks from multiple datasets, including QVHighlights [51], Charades-STA [32], THUMOS’14 [42], and Ego4D-NLQ [35], among others. The results, shown in Table 6, indicate that Tarsier2-7B achieves the highest mean F1 score of 35.5%, outperforming all baselines and highlighting its superior temporal perception capabilities.

4.1.6 Embodied Question Answering

Model EgoTaskQA
Exact Match
Human 80.0
HCRN [50] 42.2
GF [9] 44.3
EgoVLPv2 [88] 46.3
Tarsier2 77.5
Model RoboVQA
BLEU-1/2/3/4
LLaMA-AdapterV2 [33] 27.8/16.0/10.9/8.1
LLaVA-OV-7B [53] 38.1/33.6/31.8/31.0
RoboMamba [66] 54.9/44.2/39.5/36.3
MLCD [3] 73.2/66.4/60.6/56.6
Tarsier2 77.1/67.4/61.5/56.8
Model OpenEQA
GPT-4
Human 86.8
GPT-4V [5] 55.3
Gemini-1.5-Pro [102] 44.9
MLCD [3] 48.8
Tarsier2 58.7

Table 7: Evaluation results on embodied question-answering tasks, including EgoTaskQA, RoboVQA and OpenEQA.

We evaluate Tarsier2 on embodied question answering to assess its performance in real-world robotic scenarios, using three benchmarks: EgoTaskQA [44], RoboVQA [93], and OpenEQA [77]. To align with the baselines, Tarsier2 is fine-tuned on the training sets for EgoTaskQA and RoboVQA, while for OpenEQA, it is evaluated in a zero-shot setting. The results, presented in Table 7, include exact match accuracy for EgoTaskQA, BLEU score for RoboVQA, and the correctness score evaluated by GPT-4-1106-preview [1] for OpenEQA. Tarsier2 achieves top-tier performance across all three benchmarks, outperforming both generalist and specialist models. Notably, on EgoTaskQA, its performance approaches human-level accuracy, highlighting the model’s significant potential in embodied intelligence.

4.2 Ablation Study

We conduct a comprehensive ablation study to evaluate key components at different stages of the training process. The study is based on three tasks: 1) Caption: This includes the DREAM-1K dataset, the caption generation task from TempCompass (TempCompass-cg), and the caption matching task from Vinoground (Vinoground-Text) to assess captioning performance. 2) Video QA: This covers short-video QA, measured by the average accuracy on MVBench, TVBench, and TOMATO, and long-video QA, measured by the average accuracy on Video-MME, LongVideoBench, and TemporalBench, and evaluates the model’s video understanding capabilities. 3) Hallucination: We use the average score of the two EventHallusion sub-tasks to assess hallucination in the model. The following subsections present the results for each task, with detailed results for individual datasets provided in Appendix E.
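The aggregate Video QA and Hallucination scores are plain averages over their constituent benchmarks. As an illustrative check, the sketch below recomputes them from the per-benchmark numbers listed for the pre-trained Tarsier2-7B in Table 14 (Appendix E). The dictionary layout and function name are assumptions made for this example only.

```python
from statistics import mean

# Per-benchmark scores of the pre-trained Tarsier2-7B, copied from Table 14.
pretrain_scores = {
    "short_qa": {"MVBench": 72.8, "TVBench": 53.5, "TOMATO": 39.5},
    "long_qa": {"Video-MME": 65.3, "LongVideoBench": 58.3, "TemporalBench": 68.7},
    "hallucination": {"EventHallusion-Y/N": 77.8, "EventHallusion-Desc": 49.1},
}


def category_averages(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Unweighted mean of the benchmark scores within each category."""
    return {name: mean(list(parts.values())) for name, parts in scores.items()}


averages = category_averages(pretrain_scores)
for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

Up to rounding to one decimal place, these averages correspond to the Short, Long, and Hallucination columns reported for Tarsier2-7B in Table 8.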

4.2.1 Pre-training

Model Caption Video QA Hallucination
DREAM-1K TempCompass-cg Vinoground-Text Short Long
Tarsier1-7B 34.6 55.3 29.8 45.6 46.3 56.3
Tarsier1-7B-Qwen (upgrading model) 38.4 (↑3.8) 59.3 (↑4.0) 48.6 (↑18.8) 52.4 (↑6.8) 57.6 (↑11.3) 62.1 (↑5.8)
Tarsier2-7B (upgrading model+data) 40.8 (↑6.2) 60.1 (↑4.8) 60.2 (↑30.4) 55.3 (↑9.7) 64.1 (↑17.8) 63.5 (↑7.2)

Table 8: Results of the ablation study for pre-training. Tarsier1-7B-Qwen denotes the model whose base model is upgraded to Qwen2-VL while the pre-training dataset remains the same as in Tarsier1. Tarsier2 is trained from Qwen2-VL with an expanded pre-training dataset, growing from 13 million samples in Tarsier1 to 40 million samples.

In this section, we evaluate the impact of several factors during pre-training, including the base model, the pre-training data, and the number of training steps. For the caption task, we report results after the SFT stage, which aligns the model’s responses with the desired style. For other tasks, we report results after the pre-training stage.

Compared to Tarsier1, two key improvements are made in the pre-training phase: upgrading the base model to Qwen2-VL and expanding the training dataset from 13 million to 40 million samples. Table 8 illustrates the additive contributions for each improvement, showing that both enhancements consistently and significantly boost the model’s performance in caption generation, video QA, and hallucination reduction. Specifically, these enhancements lead to accuracy improvements of 9.7%, 17.8%, and 7.2% for short-video QA, long-video QA, and hallucination tests, respectively. For video description, the F1 score on the DREAM-1K dataset improves by 6.2%.


Figure 7: Model performance against training tokens. The results at the initial step reflect the performance of Qwen2-VL-7B. (Footnote 5: For consistency across all checkpoints, we evaluate the Qwen2-VL-7B model using the same frame sampling strategy applied to other checkpoints. This may differ from the official sampling strategy in some benchmarks. For instance, the official setting of Video-MME uses 768 frames, while we sample 128 frames.)

To better understand the effect of the number of training tokens on pre-training performance, we plot the model’s performance as a function of token count during the pre-training stage, as shown in Figure 7. The results show that model performance improves as the number of training tokens increases, converging after 160 billion tokens. This suggests that a large volume of data is essential for optimal video understanding performance.

4.2.2 SFT

Model Caption Video QA Hallucination
DREAM-1K TempCompass-cg Vinoground-Text Short Long
Tarsier2-7B-SFT 40.8 60.1 60.2 56.2 63.2 71.9
w/o SFT 35.2 (↓5.6) 50.5 (↓9.6) 57.2 (↓3.0) 55.3 (↓0.9) 64.1 (↑0.9) 63.5 (↓8.4)
w/o grounding 37.4 (↓3.4) 50.2 (↓9.9) 60.6 (↑0.4) 55.9 (↓0.3) 61.9 (↓1.3) 68.6 (↓3.3)

Table 9: Ablation study of the temporal grounding dataset during the SFT phase. Tarsier2-7B-SFT refers to the model after the SFT phase. w/o SFT refers to the model after pre-training; w/o grounding refers to the model fine-tuned without grounding information.

The key factor in the SFT phase is fine-grained alignment. To investigate its impact, we conduct an ablation study, with the results presented in Table 9. When the video description data, which includes fine-grained temporal grounding information, is excluded (i.e., without grounding), model performance significantly deteriorates. Specifically, the F1 score on DREAM-1K decreases by 3.4%, accuracy on TempCompass-cg drops by 9.9%, accuracy on long-video QA falls by 1.3%, and accuracy on the hallucination test declines by 3.3%.

Furthermore, the SFT phase leads to substantial improvements, highlighting the importance of high-quality manually labeled data. It boosts the F1 score on DREAM-1K by 5.6%, accuracy on TempCompass-cg by 9.6%, accuracy on Vinoground-Text by 3.0%, and accuracy on the hallucination test by 8.4%, demonstrating the SFT phase’s role in enhancing the model’s fine-grained video understanding and mitigating hallucinations.

4.2.3 DPO

Model Caption Video QA Hallucination
DREAM-1K TempCompass-cg Vinoground-Text Short Long
Tarsier2-7B 42.0 66.6 65.8 56.1 62.8 74.0
w/o DPO 40.8 (↓1.2) 60.1 (↓6.5) 60.2 (↓5.6) 56.2 (↑0.1) 63.2 (↑0.4) 71.9 (↓2.1)
w/o NS 41.5 (↓0.5) 61.1 (↓5.5) 59.8 (↓6.0) 56.1 (↓0.0) 62.8 (↓0.0) 72.9 (↓1.1)
w/o PF 40.5 (↓1.5) 65.1 (↓1.5) 67.6 (↑1.8) 56.0 (↓0.1) 62.3 (↓0.5) 74.2 (↑0.2)

Table 10: Ablation study for DPO training phase, negative sampling (NS) and preference data filtering (PF) strategies.

We conduct ablation experiments to evaluate the DPO phase, negative sampling (NS) and preference data filtering (PF) strategies. Specifically, we test the following settings: 1) w/o DPO: SFT model without DPO training. 2) w/o NS: Preference pairs generated by sampling the same video twice, without negative sampling. 3) w/o PF: Responses from negative sampling are treated as rejections, without utilizing AutoDQ Scorer to perform preference data filtering. For a fair comparison, the training data size and hyper-parameters for the latter two settings are kept consistent with the default setting, as detailed in Appendix D.
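The preference data filtering step can be sketched as below. This is a minimal illustration, not the authors' implementation. It assumes that each candidate pair carries a scalar quality score for both responses (standing in for the AutoDQ scorer) and that a pair is kept only when the chosen response outscores the rejected one by more than a threshold δ. The default values δ = 0.3 and 20K sampled pairs follow Appendix D; the `Candidate` class and function names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    video_id: str
    chosen: str            # response from positive sampling
    rejected: str          # response from negative sampling
    chosen_score: float    # hypothetical scorer output for `chosen`
    rejected_score: float  # hypothetical scorer output for `rejected`


def filter_pairs(candidates: list[Candidate], delta: float = 0.3) -> list[Candidate]:
    """Keep pairs whose score gap exceeds delta (assumed filtering rule)."""
    return [c for c in candidates if c.chosen_score - c.rejected_score > delta]


def build_dpo_set(candidates: list[Candidate], delta: float = 0.3,
                  max_pairs: int = 20_000, seed: int = 0) -> list[Candidate]:
    """Filter candidate pairs, then randomly subsample up to `max_pairs`."""
    kept = filter_pairs(candidates, delta)
    if len(kept) <= max_pairs:
        return kept
    return random.Random(seed).sample(kept, max_pairs)


pairs = [
    Candidate("v1", "detailed caption", "corrupted caption", 0.9, 0.2),
    Candidate("v2", "caption a", "caption b", 0.5, 0.4),  # gap too small
]
print([c.video_id for c in build_dpo_set(pairs)])
```

Under this reading, the "w/o PF" setting corresponds to skipping `filter_pairs` and sampling directly from all candidates.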

As shown in Table 10, Tarsier2 benefits substantially from the DPO training phase, with significant improvements on caption tasks, especially TempCompass-cg (6.5%) and Vinoground-Text (5.6%). Without DPO, the hallucination score also drops by 2.1%, while video QA performance is largely unaffected. Among the dataset construction strategies, negative sampling plays an important role: without it, results on most tasks degrade to nearly the level of the SFT model (“w/o DPO”), and the hallucination score drops by 1.1%. Preference data filtering with the AutoDQ scorer is also important for maintaining the quality of the DPO dataset: “w/o PF” degrades more than half of the tasks, and its DREAM-1K F1 score (40.5%) is even lower than that of the SFT model (40.8%).

4.3 Video Recaptioning using Tarsier2

Model Caption Video QA Hallucination
DREAM-1K TempCompass-cg Vinoground-Text Short Long
Qwen2-VL-7B [106] 31.2 54.2 40.0 49.4 60.3 51.9
+ Original FT 35.2 (↑4.0) 49.9 (↓4.3) 39.0 (↓1.0) 46.9 (↓2.5) 55.4 (↓4.9) 43.0 (↓8.9)
+ Recaption FT 39.5 (↑8.3) 67.7 (↑13.5) 55.0 (↑15.0) 52.5 (↑3.1) 56.8 (↓3.5) 68.5 (↑16.6)

Table 11: Experimental results of recaptioning. “Recaption FT” denotes fine-tuning the model on the Tarsier2-Recap-585K dataset. “Original FT” denotes fine-tuning the model on the same videos as Tarsier2-Recap-585K but using their original labels as the target output.

In this section, we utilize Tarsier2 as a captioner to generate detailed descriptions for a diverse set of 1M videos sourced from public datasets, resulting in the recaptioning dataset Tarsier2-Recap-585K (footnote 6: Tarsier2-Recap-585K is available on HuggingFace). Details of the dataset composition are provided in Appendix F.

We fine-tune Qwen2-VL-7B [106] on Tarsier2-Recap-585K and present the evaluation results in Table 11. Fine-tuning on Tarsier2-Recap-585K significantly enhances the model’s performance on detailed video description, with improvements on DREAM-1K (+8.3%), TempCompass-cg (+13.5%), and Vinoground-Text (+15.0%). Moreover, it yields an improvement of 16.6% on the hallucination test and 3.1% on short-video QA.

In comparison, fine-tuning on the same 585K videos with their original captions improves only the DREAM-1K F1 score (+4.0%), while the other metrics decline. This indicates that the performance gains from Tarsier2-Recap-585K are primarily due to its high-quality, detailed captions rather than the additional training data volume.

Table 17 in Appendix E provides detailed benchmark results corresponding to Table 11. These findings demonstrate that Tarsier2 can generate high-quality, detailed descriptions that offer fine-grained alignment information to help LVLMs to achieve significant improvements across various tasks.

5 Conclusion

In this paper, we introduce Tarsier2, a state-of-the-art large vision-language model that outperforms existing proprietary and open-source models in generating detailed and accurate video descriptions. Furthermore, Tarsier2 sets new benchmarks across a wide range of video understanding tasks. Our ablation studies demonstrate that Tarsier2’s advancements are driven by scaling the volume and diversity of the training dataset, fine-grained temporal alignment, and DPO training.

Looking ahead, we outline several promising directions for future research. First, extending Tarsier2 to handle longer video durations by developing more efficient model architectures and expanding the training dataset. Second, enhancing real-time video processing to improve the model’s ability to analyze and describe videos as they stream. Third, exploring richer interactions between video, audio, and text to create more comprehensive and context-aware video understanding systems.

References

Appendix A Training hyper-parameters

Table 12 shows the training hyper-parameters for the pre-training, SFT-1, SFT-2, and DPO stages. We apply a layer-wise learning rate decay of 0.9 for visual encoder training [22].

Configuration Pre-training SFT-1 SFT-2 DPO
VLM init. Qwen2-VL-7B Tarsier2-Pre-train Tarsier2-SFT-1 Tarsier2-SFT-2
Optimizer name AdamW
Optimizer β1 0.9
Optimizer β2 0.999
Optimizer eps 1e-6
Learning rate 2e-5 2e-5 2e-6 1e-6
Learning rate schedule cosine
Training steps 200,000 5,000 5,000 1,000
Warm-up steps 1,000 250 250 100
Weight decay 0.01
Gradient clip 1.0
Dropout rate 0.0
Global batch size 384 64 64 64
Max pixels 460,800
Frames per video [8,128] 16 16 16
Numerical precision bfloat16

Table 12: Training hyper-parameters of Tarsier2
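To illustrate how the schedule-related entries of Table 12 interact, the sketch below implements a cosine learning-rate schedule with warm-up and a layer-wise learning-rate decay. Several details are assumptions, since the table does not specify them: the warm-up is taken to be linear from zero, the cosine decays to zero, and the layer-wise decay multiplies the rate by 0.9 for each layer further from the top of the visual encoder. The function names and the four-layer example are hypothetical.

```python
import math


def lr_at_step(step: int, base_lr: float, warmup_steps: int,
               total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warm-up followed by cosine decay (assumed schedule shape)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def layerwise_lrs(base_lr: float, num_layers: int, decay: float = 0.9) -> list[float]:
    """Per-layer rates: the top layer gets base_lr, each lower layer decay times less."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# Pre-training column of Table 12: lr 2e-5, 1,000 warm-up steps, 200,000 steps.
for step in (0, 500, 1_000, 100_000, 200_000):
    print(step, lr_at_step(step, 2e-5, 1_000, 200_000))

# Hypothetical 4-layer encoder, only to show the decay pattern.
print(layerwise_lrs(2e-5, num_layers=4))
```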

Appendix B Public datasets of pre-training stage

Table 13 presents the pre-training datasets, which collectively include approximately 20 million public samples and 20 million in-house samples. Most of the public datasets are the same as in Tarsier1, except that we additionally gathered some newly released open-source data and OCR-related data. For WebVid-10M, we used 2.9 million video-text pairs, selecting samples that are more likely to feature dynamic events. We have also incorporated some of the latest long-video understanding datasets, such as MovieStory101 [38] and LLaVA-Video-178K [127], which greatly enhances the model’s ability to understand long videos.

Video Captioning
WebVid [10] (2.9M) LSMDC [92] (109K) TGIF [59] (105K) ActivityNet [47] (38K)
Charades [97] (16K) Charades-Ego [96] (6K) YouCook2 [129] (9K) TACoS [90] (18K)
Ego4D [35] (1.1M) Spoken Moments [82] (493K) Multi-Moments [83] (997K) TREC-VTT [7] (64K)
ShareGPT-4o-video [26] (2K) MovieStory101[38] (11K) GPT4o-labeled Caption† (2.5M) Human-labeled Caption† (145K)
Film&TV Commentary† (11.5M)
Action Recognition
HMDB [49] (5.8K) COIN [101] (10K) SSV2 [34] (169K) Kinetics-700 [13] (537K)
FineAction [68] (82K) RareAct [80] (2K) 20BN-jester [79] (46K)
Video QA
CLEVRER [120] (83K) TGIF-QA [43] (72K) EgoQA [29] (5K) VideoInstruct [76] (89K)
LLaVA-Video-178K [127] (165K) M4-Instruct-video [52] (255K) GPT4o-labeled QA† (16.2K)
Grounding
DiDeMo [4] (82K) AVA [36] (28K) E.T. Instruct 164K [67] (147K) Object Tracking† (745K)
Video Self-Supervised Training
Frame Order Prediction† (825K)
Intent Recognition
Oops! [28] (15K)
Multi-Image Understanding
VIST [40] (38K) MMDU [71] (45K) M4-Instruct-image [52] (616K) Image Retrieval† (533K)
Single-Image Understanding
ShareGPT4V [15] (95K) LLaVA-1.5 [64] (643K) ShareGPT-4o-image[26] (57K) MS COCO [63] (566K)
Flicker [87] (145K) LLaVA-ReCap-CC3M [52] (2.9M) Visual Genome [48] (759K) SBU Captions [84] (860K)
GPT4o-labeled Caption† (1.13M)
Image OCR
RCTW-17 [95] (8K) LSVT [98] (430K) ReCTS [125] (20K) Art [11] (5.6K)
COCOTextV2 [103] (16K) CORD-v2 [85] (1K) HierText [73] (10K) MSRA-TD500 [118] (465)
IC03 [74] (499) SynthDoG-en [46] (100K) SynthDoG-zh [46] (100K)
Text Generation
OpenOrca [60] (995K) ShareGPT [24] (80K)

Table 13: Datasets and their sizes used in Tarsier2 pre-training. † indicates in-house datasets.

Appendix C Annotation process for SFT data

In the first stage of SFT, we annotated each video clip with detailed descriptions that included fine-grained temporal grounding. Each clip first underwent manual annotation, in which annotators described dynamic information such as character actions, events, scene transitions, and camera movements, while avoiding unnecessary static elements. Annotators were also required to map the dynamic information in their descriptions to the corresponding frame numbers. We performed quality inspections on the annotated data and returned any data that did not meet quality standards for re-annotation. We discarded any data that might involve copyright risks.

In the second stage of SFT, we utilized GPT-4o to generate a variety of instruction tuning samples based on manual annotations. We provided GPT-4o with 16 uniformly sampled frames from the video and the original manual annotations. Figure 8 shows the prompt for re-annotation in this stage.

The re-annotation prompt for diverse instruction data (SFT-2).

Character: You are an excellent video analyst. Utilizing your incredible attention to detail, you provide clear, sequential descriptions for video. You excel in identifying and conveying changes in actions, behaviors, environment, states and attributes of objects, and camera movements between video frames.

Prompt: Here are 16 frames from a video and a short video caption in Chinese. You need to process a two step tasks: First, establish a set of guiding principles to control the style of the video description. These principles should include one or more of the following aspects:
1. Specify the length constraints of the description, including the number of paragraphs and total word count.
2. Define the level of detail for human or creature appearance, non-creature appearance, and background.
3. Determine the granularity of the event information.
4. Decide on the output format, such as plain text, JSON, lists, narrative, poetry, etc.
5. Choose the output language, such as Chinese, English, Japanese, French, and so on.
6. Decide on the text style, such as fluent, concise, professional, or just using simple words and phrases.
Next, generate the corresponding video description based on these guiding principles and the input video clip, and rephrase the guiding principles into natural language as part of the output question.

Input: Origin Short Video Caption in Chinese: {Manual Labeled Chinese Caption}

Requirement: Return in JSON format: {“qustion”: xxx,“answer”: xxx}

Figure 8: The re-annotation prompt in SFT-2.

Appendix D Detail setting of DPO training

As a default setting, we leveraged the negative sampling and preference pair filtering strategies introduced in Section 3.3 to construct the DPO training set. We set top_p to 0.7 and temperature to 0.7 when running both positive sampling and negative sampling on our 150K SFT dataset. The threshold δ of preference pair filtering was set to 0.3. We finally randomly sampled 20K preference pairs for DPO training. For the “w/o NS” setting, we kept the other parameters and the process unchanged but replaced the negative sampling with an additional positive sampling. For the “w/o PF” setting, we omitted preference pair filtering and directly sampled 20K pairs from all preference pair candidates. We utilized the vanilla DPO training objective (Equation 2) and set β to 0.1. See the “DPO” column of Table 12 for all other hyper-parameters.
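For reference, the per-pair loss of the standard DPO objective can be sketched as follows. The sketch assumes that Equation 2 takes the usual form, the negative log-sigmoid of β times the difference between the policy-versus-reference log-ratios of the chosen and rejected responses. Inputs are sequence-level log-probabilities, and all argument names are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single preference pair.

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# When the policy equals the reference, the margin is zero and the loss is log 2.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Raising the chosen response's likelihood relative to the reference lowers the loss.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0))
```

With β = 0.1 the loss changes slowly with the log-ratio margin, which keeps the policy close to the SFT reference model.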

Appendix E Detailed results of individual datasets at different stages

In this section, we provide detailed results for individual datasets in our ablation study. Tables 14, 15, and 16 list the results for pre-training, SFT, and DPO, respectively. Table 17 lists the results for the recaptioning experiment. We report F1/Precision/Recall for DREAM-1K and accuracy for the other benchmarks.

Capability Benchmark Tarsier1-7B Tarsier1-7B-Qwen Tarsier2-7B
Caption DREAM-1K 34.6/30.2/40.3 38.4/40.6/36.4 40.8/42.5/39.3
TempCompass-cg 55.3 59.3 60.1
Vinoground-Text 29.8 48.6 60.2
Video QA Short MVBench 62.6 69.8 72.8
TVBench 45.8 51.0 53.5
TOMATO 28.6 36.5 39.5
Video QA Long Video-MME 42.2 58.9 65.3
LongVideoBench 39.8 52.1 58.3
TemporalBench 56.9 61.9 68.7
Hallucination EventHallusion-Y/N 70.9 75.6 77.8
EventHallusion-Desc 41.6 48.6 49.1

Table 14: Detailed results of the ablation study for pre-training. For the captioning task, results are reported after the SFT stage. For other tasks, results are reported after the pre-training stage.

Capability Benchmark pre-train Tarsier2-7B SFT w/o grounding SFT
Caption DREAM-1K 35.2/36.8/33.7 37.4/38.6/36.3 40.8/42.5/39.3
TempCompass-cg 50.5 50.2 60.1
Vinoground-Text 57.2 60.6 60.2
Video QA Short MVBench 72.8 71.9 72.5
TVBench 53.5 54.5 54.2
TOMATO 39.5 41.3 41.9
Video QA Long Video-MME 65.3 64.0 64.7
LongVideoBench 58.3 54.7 58.2
TemporalBench 68.7 66.9 66.6
Hallucination EventHallusion-Y/N 77.8 80.1 84.4
EventHallusion-Desc 49.1 56.2 59.4

Table 15: Detailed results of the ablation study for SFT.

Capability Benchmark Tarsier2-7B w/o DPO w/o NS w/o PF
Caption DREAM-1K 42.0/42.8/41.1 40.8/42.5/39.3 41.5/44.5/39.0 40.5/39.9/41.1
TempCompass-cg 66.6 60.1 62.1 65.1
Vinoground-Text 65.8 60.2 60.6 67.6
Video QA Short MVBench 71.5 72.5 72.2 71.7
TVBench 54.7 54.2 54.9 54.6
TOMATO 42.0 41.9 41.3 41.8
Video QA Long Video-MME 64.5 64.7 64.3 64.4
LongVideoBench 58.6 58.2 58.6 57.4
TemporalBench 65.3 66.6 65.4 65.2
Hallucination EventHallusion-Y/N 84.6 84.4 85.1 84.8
EventHallusion-Desc 63.3 59.4 60.7 63.5

Table 16: Detailed results of the ablation study for DPO.

Capability Benchmark Qwen2-VL-7B [106] + Original FT + Recaption FT
Caption DREAM-1K 29.6/33.9/26.3 35.2/44.8/29.0 39.5/41.7/37.6
TempCompass-cg 54.2 49.9 67.7
Vinoground-Text 40.0 39.0 55.0
Video QA Short MVBench 67.0 59.8 66.8
TVBench 43.8 47.2 51.1
TOMATO 31.5 33.6 39.5
Video QA Long Video-MME 63.3 56.1 57.0
LongVideoBench 55.6 51.4 51.9
TemporalBench 62.0 58.7 61.4
Hallucination EventHallusion-Y/N 68.6 39.6 80.7
EventHallusion-Desc 27.8 46.3 56.2

Table 17: Detailed results of the recaptioning experiment.

Appendix F Tarsier2-Recap-585K Data Composition

Table 18 lists the data composition details of Tarsier2-Recap-585K. We mainly took video caption datasets into account when picking the target datasets, together with two action recognition datasets (Kinetics-700 [13] and SSV2 [34]), which contain video clips of 5 to 10 seconds about human actions, and a special intent recognition dataset (Oops [28]) to help models learn rare actions and unexpected events. For most of the datasets, we utilized all the original video clips of the selected splits (usually train and val set), except for:

Dataset Original Label Type Split Avg Duration (s) # Sampled Clips Proportion (%)
WebVid-10M [10] Video Caption - 15.2 177,909 30.38
LSMDC [92] train/val/test 4.1 108,271 18.49
TGIF [59] train/test 12.3 94,775 16.18
Ego4D [35] - 4.1 50,000 8.54
ActivityNet [47] train/val/test 35.7 35,960 6.14
VATEX [107] train/val/test 10.0 22,435 3.83
TREC-VTT [7] train/val 6.3 14,199 2.42
Charades [97] train/test 29.8 7,985 1.36
Charades-Ego [96] train/test 30.2 6,161 1.05
Kinetics-700 [13] Action Recognition train/val/test 8.9 50,000 8.54
SSV2 [34] train/val/test 3.7 10,000 1.71
Oops [28] Intent Recognition train/val 9.8 7,948 1.36
Sum - - 1,972 hours 585,643 100.00

Table 18: Data composition of Tarsier2-Recap-585K. The “Split” column lists the original dataset partitioning, and we use bold to mark the parts which we sampled the video clips from to conduct recaptioning.
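The Proportion column can be recomputed from the sampled clip counts, which serves as a quick consistency check on the table. The sketch below copies the counts from Table 18; the dictionary and variable names are illustrative.

```python
# Number of sampled clips per source dataset, copied from Table 18.
sampled_clips = {
    "WebVid-10M": 177_909,
    "LSMDC": 108_271,
    "TGIF": 94_775,
    "Ego4D": 50_000,
    "ActivityNet": 35_960,
    "VATEX": 22_435,
    "TREC-VTT": 14_199,
    "Charades": 7_985,
    "Charades-Ego": 6_161,
    "Kinetics-700": 50_000,
    "SSV2": 10_000,
    "Oops": 7_948,
}

total_clips = sum(sampled_clips.values())

# Share of each dataset in percent, rounded to two decimals as in the table.
proportions = {
    name: round(100.0 * count / total_clips, 2)
    for name, count in sampled_clips.items()
}

print(total_clips)
for name, share in proportions.items():
    print(f"{name}: {share:.2f}%")
```

The counts sum to 585,643 clips, and each 50,000-clip dataset accounts for roughly 8.54% of the total.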


Figure 9: Qualitative comparison of our model at different stages.

Appendix G Qualitative Comparison of the SFT Process

Figure 9 illustrates a qualitative comparison of our model at different stages, where we mark the differences in the prediction results of different models. From these differences, it can be seen that introducing temporal localization information in the first SFT stage significantly reduces the problem of hallucination in the model. However, the introduction of temporal localization information may also result in certain events being subdivided into finer actions. To address this issue, the second stage of training further improved the accuracy of the model description and optimized the output style.

Appendix H DREAM-1K cases

Figures 10 to 14 display the detailed video descriptions generated by Tarsier2-7B and other models (GPT-4o, Gemini-1.5-Pro and LLaVA-Video-7B-Qwen2) for different video categories in DREAM-1K. Click the play button on the first frames to view the raw video. The correct descriptions of key objects/actions/events are marked in green, and the incorrect descriptions are marked in red.


Figure 10: Qualitative comparative analysis of various Video-MLLMs on Dream-1K dataset (Live-action Subset).


Figure 11: Qualitative comparative analysis of various Video-MLLMs on Dream-1K dataset (Animation Subset).


Figure 12: Qualitative comparative analysis of various Video-MLLMs on Dream-1K dataset (Stock Subset).


Figure 13: Qualitative comparative analysis of various Video-MLLMs on Dream-1K dataset (YouTube Subset).


Figure 14: Qualitative comparison of different Video-MLLMs on Dream-1K dataset (Shorts Subset).