Kwai Keye-VL 1.5 Technical Report

Abstract

In recent years, the development of Large Language Models (LLMs) has significantly advanced, extending their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). However, video understanding remains a challenging area due to the dynamic and information-dense nature of videos. Existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model’s context length from 8K to 128K tokens, enabling processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.

Figure 1: Benchmark Performance of Kwai Keye-VL-1.5. Keye-VL-1.5-8B establishes new state-of-the-art performance among models of similar scale, demonstrating superior results on video-centric benchmarks while maintaining competitive performance on general multimodal and reasoning tasks. Compared to Keye-VL-Preview, this version shows significant improvements across all three evaluation dimensions, validating the effectiveness of our training approach.

Contents
  1 Introduction
  2 Model Architecture
    2.1 Vision Encoder with Native-Resolution
    2.2 Visual Encoding
  3 Pre-Training
    3.1 Data Pipeline
      3.1.1 Image Caption Data
      3.1.2 OCR & VQA Data
      3.1.3 Grounding & Counting Data
      3.1.4 Interleaved Text-Image Data
      3.1.5 Video Data
    3.2 Training Recipe
  4 Post-Training
    4.1 Non-Reasoning Stage: SFT + MPO
    4.2 Keye-Reward Model
    4.3 LongCoT Cold-Start
      4.3.1 Data Construction Pipeline
      4.3.2 Model Merging with Domain Specific Experts
    4.4 Iterative General RL
      4.4.1 General RLVR Training
      4.4.2 Progressive Hint Sampling
      4.4.3 Iterative General RL & Cold-Start Enhancement
    4.5 Alignment RL
      4.5.1 Reward System Design
      4.5.2 Data Construction
  5 Training Infrastructure
  6 Evaluation
    6.1 Zero-shot Image Classification of ViT
    6.2 SlowFast Video Encoding Strategy Discussion
    6.3 Public Benchmarks
    6.4 Internal Benchmarks
    6.5 Evaluation Results
    6.6 Ablation Studies and Findings
      6.6.1 Effects of SFT, MPO, and Long CoT Cold Start
      6.6.2 Effectiveness of Expert Models and Model Merging
      6.6.3 Effectiveness of Alignment Reinforcement Learning
      6.6.4 Effect of Partial Solutions During RL Phase
      6.6.5 Impact of Rejection Sampling on SFT and RL Performance
  7 Conclusion and Discussion
  A Case Study
  B Authors (Alphabetical order)

1 Introduction

In recent years, Large Language Models (LLMs) (Grattafiori et al. (2024); Abdin et al. (2024); Team (2025a); Wang et al. (2024a)) have experienced rapid development, ushering in a new era of artificial intelligence with their powerful capabilities in understanding (FaceBook (2025); Team (2025b)), generation (Yang et al. (2025); Seed et al. (2025)), and linguistic reasoning (Guo et al. (2025a); Liu et al. (2024a)). This wave has also driven the rapid advancement of Multimodal Large Language Models (MLLMs) (OpenAI (2025); Chen et al. (2024a; b); Hurst et al. (2024); Team et al. (2025a); Feng et al. (2024); Fu et al. (2025a); Han et al. (2024); Li et al. (2023); Luo et al. (2023); Guo et al. (2025b); Team et al. (2025b); Zhang et al. (2025a)), which extend powerful language capabilities to the visual domain, enabling the execution of complex tasks such as visual question answering (Li et al. (2024); Chen et al. (2024c)), detailed image description (Luo et al. (2024a); Rang et al. (2025); Li et al. (2025a)), object localization (Bai et al. (2025); Ma et al. (2025)), and visual reasoning (OpenAI (2025); Su et al. (2025); Hu et al. (2025a)).

Despite significant progress in static image understanding, video understanding remains a major challenge. Video content is inherently more dynamic and information-dense than static images, requiring models to process temporal relationships and sequential information while managing the fundamental trade-off between temporal coverage and spatial resolution. Existing approaches typically employ uniform frame sampling under fixed resolution constraints, which leads to suboptimal performance when fine-grained visual details and temporal consistency are required for content understanding (Shen et al. (2025); Lin et al. (2023); Luo et al. (2024b); Team et al. (2025c); Bai et al. (2025)).

To address these limitations, we propose Keye-VL-1.5, an 8-billion parameter multimodal foundation model that achieves state-of-the-art performance in video understanding while maintaining robust capabilities in general vision-language tasks. Our contributions span three key areas: architectural innovations for efficient multimodal processing, progressive pre-training strategies, and comprehensive post-training methodologies.

Architecture and Slow-Fast Video Encoding: We propose a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity. Key frames with significant visual changes are processed through the Slow pathway at higher resolution, while relatively static frames are processed through the Fast pathway at lower resolution but with higher temporal coverage. This adaptive approach, guided by patch-based similarity functions, effectively addresses the trade-off between spatial detail and temporal breadth.

Progressive Pre-training with Long Context Extension: Our pre-training methodology comprises four carefully designed stages that progressively build multimodal capabilities. Beginning with cross-modal alignment and multi-task learning, we systematically extend the model’s context length from 8K to 128K tokens during the annealing phase, enabling it to process longer videos and more complex visual content. This progressive approach ensures stable training while maximizing the utilization of the extended context window to enhance video understanding capabilities. The final model fusion stage combines models trained with different data mixtures to improve robustness and reduce bias.

Post-training for Reasoning and Human Preference Alignment: Our post-training process focuses on two critical aspects: enhancing reasoning capabilities and aligning with human preferences. We develop a comprehensive pipeline with three key components. First, we design a 5-step chain-of-thought reasoning data construction pipeline to generate high-quality cold-start data. Second, we employ the GSPO algorithm for verifiable reward-based reinforcement learning training. This includes progressive prompt sampling to handle difficult samples. Specifically, for samples where the model consistently fails during multiple rollouts, we provide varying levels of hints in the prompt to improve the efficiency of the rollouts. We use the RL model to generate better SFT data, and then perform the next round of RL training based on the SFT model, continuously iterating. Finally, we conduct alignment reinforcement learning training to enhance instruction following, response formatting, and preference alignment. This systematic approach ensures that Keye-VL-1.5 achieves excellent benchmark performance while providing responses that align with human expectations and preferences.

Through evaluation on public benchmarks and rigorous internal human assessment, we validate that Keye-VL-1.5 demonstrates significant improvements compared to existing models, particularly in video understanding tasks. Our work provides practical solutions for building next-generation multimodal models capable of complex video understanding and reasoning.

Figure 2: The Kwai Keye-VL-1.5 model architecture is based on the Qwen3-8B language model and incorporates a vision encoder initialized from the open-source SigLIP. It supports SlowFast video encoding and native dynamic resolution, preserving the original aspect ratio of each image by dividing it into a sequence of 14x14 patches. A simple MLP layer then maps and merges the visual tokens. The model uses 3D RoPE for unified processing of text, image, and video information.

2 Model Architecture

Figure 2 gives a high-level overview of Keye-VL-1.5, which follows a classic MLLM architecture with three key components: a Vision Transformer (ViT), an MLP projector, and a language decoder. For the ViT component, we adopt the open-source SigLIP-400M-384-14 (https://huggingface.co/google/siglip-so400m-patch14-384) as our vision encoder to extract visual information. For the LLM component, we employ the widely used Qwen3-8B as our language decoder to provide general world knowledge and semantic understanding. For the projector, we randomly initialize its parameters and fully pre-train it in Stage 1. In the following sections, we describe our key upgrades, data pipeline, and training recipes.

2.1 Vision Encoder with Native-Resolution

In past years, many MLLM efforts have adopted well-trained fixed-resolution ViTs as their vision encoders, such as ViT-bigG (Cherti et al. (2023)) and SigLIP-400M (Zhai et al. (2023)). However, CLIP-based ViTs (Radford et al. (2021)) only handle a coarse-grained image-caption matching task during pre-training, whereas MLLMs often tackle various finer-grained generation tasks, which leaves a large gap between the two. Therefore, we expect our ViT to have the following capability: during processing, images and videos maintain their structural integrity and all details are preserved.

To this end, several pioneering MLLMs have explored native-resolution ViTs in recent years, such as Qwen2.5-VL, Seed-VL-1.5, and Kimi-VL. In Keye-VL-1.5, we also implement a native-resolution ViT that processes images at their original resolution, avoiding complex and redundant image splicing/splitting operations (e.g., MiniCPM2 (Yao et al. (2024))). Specifically, our ViT is initialized from SigLIP-400M-384-14, a fixed-resolution variant that uses absolute learnable position embeddings to inject spatial information. Starting from it, we first employ interpolation to extend the fixed-length learnable position embeddings into resolution-adaptive position embeddings, which enables basic native-resolution modeling while preserving the pretrained workflow. Afterwards, to further enhance the extrapolation capability of the positional encoding along the visual dimensions, we introduce 2D Rotary Position Embedding (RoPE) to strengthen visual information modeling. In our preliminary experiments, we observe that incorporating 2D RoPE significantly improves the model’s performance on high-resolution images. Finally, building upon the two types of position embeddings, we incorporate NaViT packing with FlashAttention to continue training our ViT across images with varying resolutions.
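The interpolation step can be illustrated with a minimal sketch. The text does not state which interpolation mode is used, so bilinear interpolation with aligned corners is an assumption here, and `interpolate_pos_embed` is a hypothetical helper name. The learned embeddings are treated as a 2D grid of vectors (for a 384-pixel input with 14-pixel patches this would be a 27x27 grid) that is resized to the patch grid of the incoming image.

```python
def interpolate_pos_embed(grid, new_h, new_w):
    """Bilinearly resize grid[h][w] (each cell an embedding vector) to new_h x new_w.

    Assumption: align-corners bilinear interpolation; the actual mode is not
    specified in the text.
    """
    old_h, old_w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    out = []
    for i in range(new_h):
        # Map the target row index back to a fractional source row.
        y = i * (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        wy = y - y0
        row = []
        for j in range(new_w):
            x = j * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            wx = x - x0
            # Blend the four neighbouring embeddings, one dimension at a time.
            vec = [
                (1 - wy) * ((1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d])
                + wy * ((1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d])
                for d in range(dim)
            ]
            row.append(vec)
        out.append(row)
    return out


# Toy example: a 2x2 grid of 1-dimensional embeddings resized to 3x3.
toy = [[[0.0], [1.0]], [[2.0], [3.0]]]
resized = interpolate_pos_embed(toy, 3, 3)
```

Interpolating the learned table keeps the pretrained embeddings usable at any patch-grid size, which matches the stated goal of preserving the pretrained workflow.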

During ViT pre-training, we optimize our native-resolution modifications with the SigLIP loss function (the text tower is also taken from SigLIP-400M-384-14). We train on data drawn from the same distribution as the downstream MLLM, totaling 500B tokens from the open-source datasets DataComp (Gadre et al. (2023)), LAION (Schuhmann et al. (2022)), CC12M (Changpinyo et al. (2021)), PD12M (Meyer et al. (2024)), and COCO (Lin et al. (2014)), together with in-house data.
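For reference, the pairwise sigmoid loss from the SigLIP paper treats every image-text pair in a batch as an independent binary classification, with matching pairs labeled +1 and all others -1. The sketch below is a plain-Python illustration of that loss, not the training code used here; the temperature `t` and bias `b` are learnable in SigLIP and are fixed to placeholder values in this example.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over a batch of L2-normalised embeddings.

    img_emb[i] and txt_emb[i] are assumed to be a matching pair.
    t and b are fixed placeholders for the learnable temperature and bias.
    """
    n = len(img_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = t * sum(x * y for x, y in zip(img_emb[i], txt_emb[j])) + b
            z = 1.0 if i == j else -1.0
            # -log(sigmoid(z * logit)) equals softplus(-z * logit).
            total += softplus(-z * logit)
    return total / n


aligned = siglip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = siglip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a much lower loss than the swapped ones, which is the signal that drives the image and text towers together.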

Figure 3: A SlowFast video encoding demonstration: the Slow pathway processes a smaller number of frames at higher resolution, while the Fast pathway handles more frames at lower resolution.

2.2 Visual Encoding

To guarantee that our language decoder can perceive enough visual signals to understand images and videos in detail, we devise different modeling strategies for them:
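For video, the Slow-Fast strategy summarized in the introduction and in Figure 3 can be illustrated with a minimal sketch. Everything concrete below is an assumption made for illustration: the patch size, the per-patch tolerance, the 0.95 similarity threshold, the choice to compare each frame against the most recent Slow frame, and the helper names `patch_similarity` and `assign_pathways`. Frames are modeled as small grayscale grids so that the example stays self-contained.

```python
from typing import List, Optional, Sequence

Frame = Sequence[Sequence[float]]  # grayscale frame, pixel values in [0, 1]


def patch_similarity(a: Frame, b: Frame, patch: int = 2, tol: float = 0.05) -> float:
    """Fraction of patch x patch blocks whose mean absolute difference is below tol."""
    rows, cols = len(a), len(a[0])
    similar = total = 0
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            diffs = [
                abs(a[i][j] - b[i][j])
                for i in range(r, min(r + patch, rows))
                for j in range(c, min(c + patch, cols))
            ]
            total += 1
            if sum(diffs) / len(diffs) < tol:
                similar += 1
    return similar / total


def assign_pathways(frames: List[Frame], threshold: float = 0.95) -> List[str]:
    """Label each frame 'slow' (high resolution) or 'fast' (low resolution)."""
    labels: List[str] = []
    last_slow: Optional[Frame] = None
    for frame in frames:
        # The first frame, and any frame that differs enough from the last
        # Slow frame, is treated as a key frame.
        if last_slow is None or patch_similarity(frame, last_slow) < threshold:
            labels.append("slow")
            last_slow = frame
        else:
            labels.append("fast")
    return labels


dark = [[0.0] * 4 for _ in range(4)]
bright = [[1.0] * 4 for _ in range(4)]
labels = assign_pathways([dark, dark, bright, bright])
```

In this toy clip the scene changes once, so only the first frame and the first frame after the change are routed to the Slow pathway; the near-identical frames go to the Fast pathway, where they can be encoded with fewer tokens.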

3 Pre-Training

In this section, we first describe the construction of the pre-training dataset, followed by an overview of the overall training pipeline and configuration.

3.1 Data Pipeline

In our data construction pipeline, we assemble a diverse, high-quality corpus exceeding 1 trillion tokens to support model training, sourced from both public datasets and proprietary in-house data. Our training data encompasses six primary categories: Image Caption, OCR & VQA, Grounding & Counting, Interleaved, Video Understanding, and Pure Text data. To ensure overall data quality, we design customized filtering mechanisms tailored to the characteristics of each data category. For large volumes of medium-quality data, we employ CLIP (Radford et al. (2021)) scores for preliminary filtering. For smaller amounts of high-quality data, we utilize open-source MLLMs as discriminators for data selection. Additionally, we conduct rigorous image-based deduplication to avoid potential data leakage between our training corpus and evaluation benchmarks (Dixit et al. (2021)). Specifically, we identify highly similar images and then remove these near-duplicates from the dataset. In the following sections, we provide detailed descriptions of each data category.
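The benchmark deduplication step can be sketched as follows. The text does not say how image similarity is measured, so this example assumes each image already has an embedding vector and uses cosine similarity with a hypothetical threshold of 0.95; `drop_benchmark_near_duplicates` is an illustrative name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def drop_benchmark_near_duplicates(train, bench, threshold=0.95):
    """Return ids of training images that are not near-duplicates of any benchmark image.

    train: list of (image_id, embedding); bench: list of embeddings.
    The threshold is an assumed value, not one given in the text.
    """
    return [
        image_id
        for image_id, emb in train
        if all(cosine(emb, ref) < threshold for ref in bench)
    ]


train_images = [("a", [1.0, 0.0]), ("b", [0.0, 1.0])]
benchmark_images = [[0.99, 0.01]]
kept = drop_benchmark_near_duplicates(train_images, benchmark_images)
```

At the scale described here an exhaustive comparison would be too slow, so a real pipeline would likely use an approximate nearest-neighbor index; the brute-force loop is only meant to show the filtering rule.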

3.1.1 Image Caption Data

The image caption task provides fundamental world knowledge by pairing images with textual descriptions, establishing a mapping between visual features and linguistic concepts. Based on large-scale caption data, our model gains the ability to perceive and comprehend a broad, rich spectrum of world knowledge, such as real-world physical principles and cultural conventions. Although many diverse Chinese and English open-source caption datasets are publicly accessible, such as LAION (Schuhmann et al. (2022)), DataComp (Gadre et al. (2023)), and Coyo (Byeon et al. (2022)), the quality of such data is often unreliable, as it typically only undergoes simple crawler-based matching.

To alleviate such data noise, we apply a strict similarity-based filtering pipeline to control data quality, e.g., scoring each raw image-caption pair with a CLIP model. In practice, we retain high-similarity image-caption pairs (e.g., CLIP score > 0.9), while the filtered low-quality open-source image data and our in-house image data are passed through a re-captioning pipeline. During re-captioning, we utilize several MLLMs (Qwen2.5-VL 72B (Bai et al. (2025)), Tarsier2 (Yuan et al. (2025)), GPT-4o (Hurst et al. (2024)), Gemini1.5-pro (Team et al. (2023)), and others) to generate synthetic captions for images of varying resolutions and categories. In our experience, re-caption data generated by different MLLMs is very helpful for fine-grained image understanding.

Further, to prevent our model from degenerating into a caption generator, which would hurt its instruction-following and complex reasoning abilities, we implement a data augmentation strategy with multiple caption/question-answering pairs to maintain our model’s general conversation and instruction capabilities:

Besides, to improve our model’s robustness and faithfulness, we proactively inject ‘trap questions’ that refer to non-existent or contradictory content. These counterfactual data encourage the model to ground its responses in the visual content rather than in textual priors.

3.1.2 OCR & VQA Data

Optical Character Recognition (OCR) and Visual Question Answering (VQA) are vital tasks that encourage our model to distinguish the details of images. With OCR capabilities, the model can accurately extract and interpret textual information within images, while the VQA task enables our model to comprehend and reason about visual content in a context-aware manner. To build our OCR and VQA capabilities, we collect a large amount of open-source data, such as LaTeX formulas, handwritten text, real-world street views, charts, rich-text documents, multi-image OCR, and so on. Since most of the open-source datasets are in English, to further enhance the model’s capability in Chinese OCR & VQA tasks, we introduce multiple techniques for synthesizing in-house Chinese data:

3.1.3 Grounding & Counting Data

Object grounding is one of the fundamental abilities of MLLMs (Bai et al. (2025); Seed et al. (2025)), which enables our model to establish a direct connection between temporal/visual information and text semantics, as shown in Table 1. For object grounding in Keye-VL-1.5, we primarily use three object localization forms: center points, bounding boxes, and polygons. Their coordinates are strictly typed as integers and normalized to the range [0, 1000) so that they are independent of image resolution. We mainly employ RefCoCo (Kazemzadeh et al. (2014)), VisualGenome (Krishna et al. (2017)), and TolokaVQA (Ustalov et al. (2023)) as our grounding data sources, and PixMo (Deitke et al. (2024)) as our counting data source. For in-house grounding data generation, we use other MLLMs (e.g., Gemini 2.5 Pro, Qwen-2.5-72B) to extract the bounding boxes of the answer areas for the corresponding document questions. To filter out incorrect, missing, or ambiguous grounding annotations, we use CLIP and Qwen-2.5-7B to score each annotation: we crop the corresponding grounding area from the image, compute its similarity with the target object text, and keep the higher-scoring points/boxes/polygons as training data.

For temporal grounding data, we construct a three-step coarse-to-fine data synthesis pipeline based on our massive short-video corpus. In the first step, we employ TEMPURA (Cheng et al. (2025a)) to split a given short video into several event clips with their temporal captions. Next, to alleviate the “repetitive collapse” issue of raw TEMPURA outputs, in which descriptions become redundant or meaningless, we apply SOTA MLLMs as a filter to identify and remove such low-quality, repetitive event clips and obtain reliable temporal grounding captions. Finally, based on those captions, we use Gemini 2.5 Pro to generate a series of logical question-answering pairs about timestamps, which strengthens our model’s understanding of temporal causality. In this way, our pipeline ensures that our model not only describes what happens in a video, but also understands and reasons about when and why.

Object center points
  Example: <|point_start|>[[x1, y1]]<|point_end|>
  Description: [x1, y1] is the center point of the queried object.
  Example: <|point_start|>[[x1, y1], [x2, y2]]<|point_end|>
  Description: Supports multiple points for a single queried object.
  Example: <|object_ref_start|>obj<|object_ref_end|><|point_start|>[[x1, y1]]<|point_end|>
  Description: [x1, y1] is the center point of ‘obj’.

Object bounding boxes
  Example: <|box_start|>[[x1, y1, x2, y2]]<|box_end|>
  Description: The coordinates [x1, y1] and [x2, y2] denote the top-left and bottom-right corners of the box of the queried object.
  Example: <|box_start|>[[x1, y1, x2, y2], [x3, y3, x4, y4]]<|box_end|>
  Description: Supports multiple boxes for a single queried object.
  Example: <|object_ref_start|>obj<|object_ref_end|><|box_start|>[[x1, y1, x2, y2]]<|box_end|>
  Description: Detects ‘obj’ and its corresponding box.
  Example: <|ocr_text_start|>text<|ocr_text_end|><|box_start|>[[x1, y1, x2, y2]]<|box_end|>
  Description: Identifies the OCR result and its corresponding box.

Object polygons
  Example: <|object_ref_start|>obj<|object_ref_end|><|polygon_start|>[[[x1, y1], [x2, y2], [x3, y3]]]<|polygon_end|>
  Description: The coordinates [x1, y1], [x2, y2], … are the polygon vertices of ‘obj’, arranged in clockwise order.
  Example: <|ocr_text_start|>text<|ocr_text_end|><|polygon_start|>[[[x1, y1], [x2, y2], [x3, y3]]]<|polygon_end|>
  Description: Supports OCR results.

Temporal caption
  Example: <|clip_time_start|>[t1, t2]<|clip_time_end|> event-caption.
  Description: [t1, t2] is the grounded time span of the corresponding event.

Table 1: Grounding Label Assembling of Keye-VL-1.5.
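As an illustration of the label format in Table 1, the sketch below converts a pixel-space bounding box into the integer [0, 1000) coordinate range and wraps it in the special tokens. It assumes the tokens are written as `<|box_start|>` and so on, that pixel coordinates are scaled by 1000 / image size and floored, and that values are clamped to 999; the rounding and clamping rules and the helper names are assumptions, since the text does not specify them.

```python
def normalize(value, size):
    """Map a pixel coordinate in [0, size] to an integer in [0, 1000).

    Assumption: floor after scaling, then clamp so the right/bottom edge maps to 999.
    """
    return min(int(value * 1000 / size), 999)


def box_label(obj, box, width, height):
    """Format one referred object and its box in the Table 1 style."""
    x1, y1, x2, y2 = box
    coords = [
        normalize(x1, width),
        normalize(y1, height),
        normalize(x2, width),
        normalize(y2, height),
    ]
    # str(coords) renders as "[a, b, c, d]"; the outer brackets form the list of boxes.
    return (
        f"<|object_ref_start|>{obj}<|object_ref_end|>"
        f"<|box_start|>[{coords}]<|box_end|>"
    )


label = box_label("cat", (64, 96, 320, 480), width=640, height=480)
```

Because the coordinates are normalized, the same label stays valid when the image is resized for native-resolution processing.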

3.1.4 Interleaved Text-Image Data

Beyond learning tasks centered on single images, we also introduce a large amount of interleaved data to enhance our language decoder’s ability to model longer multi-modal contexts and adapt to longer sequences, e.g., 128K context modeling. Beyond modeling multi-image correlations, interleaved data contributes several critical advantages in pre-training: (1) Preservation of General Knowledge: it contains a wealth of universal knowledge, ensuring that the LLM module’s core capabilities are not degraded during training; (2) Enhanced Vision-Language Alignment: by leveraging in-context learning, it helps the model better align visual and semantic signals on the language model side; (3) Improved Generalization: the diverse and interleaved nature of the data strengthens the model’s ability to reason across modalities and generalize to unseen tasks. Besides open-source interleaved data, we also build a large-scale in-house interleaved data generation pipeline. Specifically, we focus on processing two types of raw rich-text documents, academic PDF data and structured knowledge data, especially Science, Technology, Engineering, and Mathematics (STEM) data. We collect a substantial amount of academic and knowledge-based PDF/structured data, render the text content into plain text, and insert the corresponding images at their original positions within the text. In this process, we apply rigorous quality-control strategies to ensure high-quality outputs. Our pipeline includes: (1) Garbled character recognition: identifying and removing garbled characters; (2) Low-resolution/broken image filtering: ensuring image quality; (3) Text-image similarity validation: ensuring semantic alignment between interleaved images and text.
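The three quality-control steps can be sketched as a single filtering function. The definition of a garbled character (the Unicode replacement character or a non-whitespace control character), the minimum image side, and the similarity threshold are all assumptions chosen for illustration, and each image is assumed to carry a precomputed text-image similarity score.

```python
def is_garbled(ch):
    """Assumed definition: replacement character or non-whitespace control character."""
    return ch == "\ufffd" or (ord(ch) < 32 and ch not in "\n\t\r")


def filter_document(text, images, min_side=64, min_sim=0.2):
    """Clean one interleaved document.

    images: list of dicts with 'width', 'height', and 'similarity' keys, where
    'similarity' is a precomputed text-image score (hypothetical field).
    Returns (clean_text, kept_images).
    """
    # (1) Remove garbled characters from the rendered text.
    clean_text = "".join(ch for ch in text if not is_garbled(ch))
    # (2) Drop low-resolution images and (3) images that do not match the text.
    kept = [
        img
        for img in images
        if min(img["width"], img["height"]) >= min_side
        and img["similarity"] >= min_sim
    ]
    return clean_text, kept


doc_text, doc_images = filter_document(
    "ab\ufffdc",
    [
        {"width": 32, "height": 200, "similarity": 0.9},
        {"width": 256, "height": 256, "similarity": 0.1},
        {"width": 256, "height": 256, "similarity": 0.5},
    ],
)
```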

3.1.5 Video Data

For Kwai, a short-video and live-streaming service provider, video understanding is the most important capability, covering tasks such as understanding video details, generating summaries, and expressing interesting implications. To reach this goal, our video data is collected from multiple sources, including diverse open-source datasets (ShareGPT4V, Pandas, and others) and large-scale, high-quality in-house video data. Based on these videos, we apply the following key pipelines to guarantee data quality:

In addition to OCR and video captioning/QA tasks, we have designed a series of reasoning-enhanced tasks to help the model better understand contextual relationships in short videos. These include:

Figure 4: The Kwai Keye-VL-1.5 pre-training pipeline, featuring a four-stage progressive strategy: Image-Text Matching, ViT-LLM Alignment, Multi-task Pre-training, and Annealing with model merging.

3.2 Training Recipe

We employ a four-stage progressive training strategy to build a powerful multi-modal foundation model with strong vision-language alignment capabilities. The training pipeline, illustrated in Figure 4, is meticulously designed to ensure that each stage has a clear and interconnected objective.

The Vision Transformer (Dosovitskiy et al. (2020)) (ViT) is initialized with weights from the siglip-so400m-patch14-384 model and undergoes continued pre-training using the SigLIP (Zhai et al. (2023)) contrastive loss function. This stage focuses on adapting the vision encoder to our internal data distribution. We incorporate native dynamic resolution processing (akin to NaViT (Dehghani et al. (2023))), which preserves the original aspect ratio of images to the greatest extent possible. Additionally, 2D Rotary Position Embeddings (Su et al. (2024)) (RoPE) are integrated to enhance the model’s extrapolation capabilities when processing images of varying resolutions.

Stage 1: cross-modal alignment:

The language model is initialized from Qwen3-8B (Yang et al. (2025)). During this stage, the parameters of both the vision and language models are frozen. Training is focused on optimizing the projection MLP layer. With large-scale datasets, we establish a robust alignment between cross-modal features, laying the groundwork for the subsequent learning phase.

Stage 2: multi-task pre-training:

All model parameters are unfrozen for end-to-end optimization using a diverse set of multi-task training data. The data in this stage encompasses a wide range of common vision-language tasks, including Image Captioning, Optical Character Recognition (OCR), Grounding, Visual Question Answering (VQA), and interleaved image-text data. This process significantly enhances the model’s fundamental visual understanding capabilities.

Stage 3: annealing:

This stage involves an annealing phase where the model is fine-tuned on a curated set of high-quality data. The primary goal is to address the issue of insufficient exposure to high-quality samples during the large-scale, broader training of Stage 2. Through optimized learning strategies and data mixtures, we further refine the model’s nuanced understanding and capabilities.

Sequence Length Extension to 128K:

In Stage 1 and Stage 2, we limit the sequence length of each sample to 8,192 (8K) and adopt Data Parallelism to build large batch sizes efficiently. The ZeRO-2 optimization strategy is applied to reduce memory overhead. In the final annealing stage, we extend the context length of the model from 8,192 (8K) to 131,072 (128K). The RoPE base frequency on the LLM side, from which the inverse frequencies are derived, is reset from 1,000,000 to 8,000,000. The training data is concurrently enriched with high-quality long-context modalities, including long videos, long texts, and large-scale images. Additionally, we switch the optimization strategy to ZeRO-1 and adopt Context Parallelism and Pipeline Parallelism to support long-context training. Under the 128K context length, our controlled experiments show that allocating 24% of tokens to videos, 50% to images, and the remaining 26% to text strikes a good balance between visual capabilities (image and video understanding) and text capabilities.
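To make the effect of the RoPE change concrete, the sketch below computes the standard RoPE inverse frequencies, base^(-2i/d), for both base values and compares the longest rotation wavelength. The head dimension of 128 is an assumption made for the example and is not stated in the text.

```python
import math


def rope_inv_freq(base, head_dim):
    """Standard RoPE inverse frequencies: base ** (-2i / head_dim) for i in [0, head_dim / 2)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(base, head_dim):
    """Number of positions needed for the slowest-rotating pair to complete one turn."""
    return 2 * math.pi / rope_inv_freq(base, head_dim)[-1]


HEAD_DIM = 128  # assumed head dimension, for illustration only
before = longest_wavelength(1_000_000, HEAD_DIM)
after = longest_wavelength(8_000_000, HEAD_DIM)
print(f"longest wavelength: {before:.3e} -> {after:.3e} positions")
```

Raising the base slows down every rotary pair except the first, so positions that are far apart remain distinguishable across a longer sequence, which is the usual motivation for increasing the base when extending the context window.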

4 Post-Training

Figure 5: Post-Training Pipeline. The post-training process includes a non-reasoning stage and a reasoning stage. The non-reasoning stage is composed of SFT and MPO training. The reasoning stage consists of three key steps: CoT Cold Start (we build a five-step pipeline to generate a high-quality CoT cold-start dataset and apply model merging to refine model performance), General RL (we improve Keye-VL-1.5’s reasoning ability with GSPO, propose progressive hint sampling to take full advantage of hard problems, and iteratively improve the cold-start and general RL models), and Alignment RL (we improve Keye-VL-1.5’s instruction following, format adherence, preference alignment, and RAG ability with our reward system, constructing instruction-following, reasoning, and RAG data for RL training).

4.1 Non-Reasoning Stage: SFT + MPO

The SFT data candidate pool contains over 7.5 million multimodal QA samples. We employ the following construction methods to balance comprehensiveness and data quality.

The training strategy involves a dynamic learning rate. In the later phases of training, the model undergoes an annealing process at a lower learning rate. Evaluations show this annealing step contributes approximately a 1% performance improvement across both open-source and internal benchmarks.

Following SFT, the model undergoes MPO to further refine its performance. The MPO dataset includes 250k open-source samples (Wang et al. (2024b)), 150k text-only samples, and 26k human-annotated samples (Zhang et al. (2025b)). We sample multiple responses from Keye-VL-1.5 for each item in this dataset and construct pairs of high-quality and low-quality responses using the reward model scores and human annotations. Training in this stage applies the MPO algorithm to the resulting paired preference data to optimize Keye-VL-1.5’s overall performance.
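The pair construction step can be sketched as follows. This is an illustration under assumptions: each sampled response is assumed to carry a single scalar reward score, and a hypothetical minimum score gap `min_gap` is used so that only clearly separated responses form a (chosen, rejected) pair. The text does not describe the actual pairing rule.

```python
from itertools import combinations


def build_preference_pairs(samples, min_gap=1.0):
    """Build (chosen, rejected) pairs from scored responses to one prompt.

    samples: list of (response, score) tuples.
    min_gap: assumed minimum score difference for a pair to be kept.
    """
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(samples, 2):
        if abs(score_a - score_b) >= min_gap:
            # The higher-scored response becomes the chosen one.
            pairs.append((resp_a, resp_b) if score_a > score_b else (resp_b, resp_a))
    return pairs


scored = [("answer A", 4.5), ("answer B", 2.0), ("answer C", 4.0)]
pairs = build_preference_pairs(scored)
```

With the default gap, responses A and C are too close to form a pair, so only the pairs that contrast each of them with the weaker response B are kept.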

4.2 Keye-Reward Model

Recognizing the importance of reward modeling for data quality evaluation and model training, we train a reward model based on Keye-VL-preview for data filtering and reinforcement learning. We adapt the Keye-VL-preview model to the reward modeling task through an SFT+RL training process.

Data format: The model input consists of the query, response A, and response B, along with a task definition that guides the model in evaluating the quality of the two responses. Similar to Keye-VL-1.5’s mixed reasoning mode, our training data is composed of two formats: think and no_think. In the no_think mode, the model directly outputs the final judgment based on the input information. In the think mode, the model first evaluates the quality of response A and response B separately along nine predefined dimensions (such as Credibility, Correctness, Redundancy, Relevance, etc.), then generates a comprehensive evaluation. The mixed reasoning mode lets our reward model balance efficiency, accuracy, and interpretability.

SFT recipe: The SFT data includes the open-source preference datasets R1-Reward (Zhang et al. (2025c)) and MMPR (Wang et al. (2024b)), as well as manually labeled Keye-VL-preview sampling results. After SFT, we anneal on data in which the good response is shorter than the bad response, to counteract the reward model’s preference for longer responses.

RL recipe: The RL data includes preference data consisting of wrong cases from Keye-VL-preview in the SFT dataset and right cases generated by larger MLLMs, as well as data from MMPR. In this stage, we carefully filter out data with excessively large length differences between positive and negative samples, using format reward and outcome reward as training signals.
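The two length-based selections described in the SFT and RL recipes can be sketched as simple filters over (chosen, rejected) response pairs. Length is measured in characters here for simplicity (a tokenizer-based length would be the more likely choice in practice), and the maximum length ratio of 3.0 is a hypothetical threshold, since the text does not give one.

```python
def annealing_subset(pairs):
    """Keep pairs whose chosen (good) response is shorter than the rejected one."""
    return [(chosen, rejected) for chosen, rejected in pairs if len(chosen) < len(rejected)]


def drop_length_outliers(pairs, max_ratio=3.0):
    """Drop pairs whose two responses differ too much in length.

    max_ratio is an assumed threshold on longer / shorter length.
    """
    kept = []
    for chosen, rejected in pairs:
        longer = max(len(chosen), len(rejected))
        shorter = max(min(len(chosen), len(rejected)), 1)  # avoid division by zero
        if longer / shorter <= max_ratio:
            kept.append((chosen, rejected))
    return kept


pairs = [("short", "a much longer reply"), ("a long good reply", "bad"), ("ok", "no")]
anneal = annealing_subset(pairs)
rl_ready = drop_length_outliers(pairs)
```

Both filters aim at the same failure mode: if length correlates with the label, the reward model can learn to score length instead of quality.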

We use our reward model to evaluate the quality of Keye-VL’s sampling results; its scores are used both to update the training data and to provide reward signals.

4.3 LongCoT Cold-Start

After large-scale SFT and MPO, we construct high-quality Long Chain-of-Thought (LongCoT) data for cold-start reasoning training. This stage aims to enhance Keye-VL’s long CoT reasoning ability and serves as the starting point for subsequent reinforcement learning.

4.3.1 Data Construction Pipeline

To address the challenge of acquiring high-quality training data for cold-start, we propose a comprehensive five-step automated pipeline for generating LongCoT data, as illustrated in Figure 6. Our approach strategically leverages existing MLLMs to create diverse, high-quality reasoning chains while maintaining both scalability and cost-effectiveness. The pipeline systematically integrates automated generation, rigorous quality assessment, targeted human enhancement, and adaptive data utilization to ensure optimal training data quality across diverse domains and reasoning complexity levels.

Figure 6: Overview of our five-step automated LongCoT data generation pipeline. The pipeline begins with (a) sampling from data and prompt pools using MLLMs to generate thinking processes and logit information, followed by (b) quality assessment using MLLM as judge to evaluate both outcomes and reasoning processes with step-wise scoring, (c) categorization into three quality tiers (A: high quality, B: middle quality requiring human review, C: low quality to be discarded), (d) human augmentation for Category B samples and suspected redundant Category A samples, and (e) final MLLM review with dynamic quality scoring (1-5 scale) to determine optimal data utilization strategies. This comprehensive approach ensures both scalability and quality control in generating training data.
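The tiering in step (c) can be illustrated with a small routing function. The caption specifies the three tiers and what happens to each, but not the scoring rule, so the thresholds below, the use of the mean step score, and the choice to send samples with an incorrect final answer to tier C are all assumptions made for the sketch.

```python
def assign_tier(answer_correct, step_scores, high=0.8, low=0.5):
    """Assign a quality tier from judge outputs.

    answer_correct: whether the judge accepted the final answer.
    step_scores: judge scores in [0, 1] for each reasoning step.
    high / low: assumed thresholds on the mean step score.
    """
    if not answer_correct or not step_scores:
        return "C"  # assumed: wrong or empty reasoning is discarded
    mean_score = sum(step_scores) / len(step_scores)
    if mean_score >= high:
        return "A"  # high quality, used directly
    if mean_score >= low:
        return "B"  # middle quality, sent to human review
    return "C"  # low quality, discarded


tiers = [
    assign_tier(True, [0.9, 0.95, 0.85]),
    assign_tier(True, [0.6, 0.7, 0.5]),
    assign_tier(False, [0.9, 0.9]),
]
```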

Multi-Source Data Collection and Enhancement: Our data generation process begins with the systematic collection of multimodal QA data spanning multiple challenging domains. These domains include mathematical reasoning problems, STEM, OCR and document understanding tasks, visual grounding and object localization, counting, GUI scenarios, and domain-specific business applications. This comprehensive coverage ensures that our generated dataset captures the full spectrum of multimodal reasoning capabilities required for practical applications.

To enhance the complexity and diversity of the collected data, we employ proprietary MLLMs to perform sophisticated question rewriting and task merging operations. The rewriting process transforms simple, straightforward questions into more challenging variants that require deeper reasoning and multi-step problem solving. Additionally, we systematically combine related sub-tasks into comprehensive multi-task instructions, creating scenarios where models must demonstrate proficiency across multiple capabilities simultaneously. This enhancement strategy significantly increases the pedagogical value of each training sample while maintaining natural question flow and coherence.

Multi-Path Reasoning Generation with Confidence Quantification: For each enhanced QA pair, we generate multiple reasoning trajectories leveraging existing MLLMs. A pivotal component of our generation pipeline is the systematic extraction and quantification of model confidence at both the step-wise and holistic response levels. We compute granular confidence scores that capture the model’s certainty in individual reasoning steps as well as the final answer. This confidence metadata serves as a crucial signal for downstream quality assessment and sample prioritization workflows, enabling us to systematically identify the most reliable and coherent reasoning chains from the generated candidate pool. Throughout the multi-round sampling process, we strategically select samples that exhibit diverse logical pathways while maintaining correctness, thereby enriching the diversity of reasoning patterns. Simultaneously, we implement a confidence-prioritized selection strategy, systematically favoring reasoning chains with higher logit-based confidence scores to optimize training sample quality.
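As a minimal sketch of the confidence quantification described above, the snippet below derives step-wise and holistic scores from token log-probabilities. The report does not state the exact formula, so the geometric-mean token probability used here and the names `step_confidence` and `chain_confidence` are assumptions made for illustration.

```python
import math
from typing import Dict, List

def step_confidence(token_logprobs: List[float]) -> float:
    """Geometric-mean token probability of one reasoning step (assumed metric)."""
    if not token_logprobs:
        raise ValueError("a step must contain at least one token")
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def chain_confidence(steps: List[List[float]]) -> Dict[str, object]:
    """Return per-step scores plus a holistic score over all tokens of the chain."""
    per_step = [step_confidence(s) for s in steps]
    all_tokens = [lp for s in steps for lp in s]
    return {"steps": per_step, "overall": step_confidence(all_tokens)}

# Two candidate chains; each inner list holds the log-probs of one step's tokens.
chain_a = [[-0.1, -0.2], [-0.05, -0.1, -0.15]]
chain_b = [[-0.9, -1.2], [-0.4, -0.6, -0.5]]

# Confidence-prioritized selection: most confident chain first.
ranked = sorted([chain_a, chain_b],
                key=lambda c: chain_confidence(c)["overall"], reverse=True)
```

Averaging in log space keeps the score comparable across chains of different lengths, which is one plausible reason to prefer it over a raw product of probabilities.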

Comprehensive Two-Level Quality Assessment: We implement a rigorous two-level quality assessment framework using proprietary MLLMs. This dual assessment strategy operates simultaneously on both answer correctness and reasoning process validity. At the answer level, our assessment framework incorporates flexible matching patterns specifically tailored to different task types and domains. The system supports sophisticated fuzzy matching capabilities and equivalent expression recognition, accommodating variations in phrasing, mathematical notation, and unit representations. For instance, mathematical answers are evaluated considering formula equivalence and unit conversion, while text-based responses account for semantic similarity and paraphrasing.

At the reasoning level, we conduct a granular step-by-step evaluation for each reasoning chain. Every individual reasoning step undergoes scrutiny for logical consistency with preceding steps, factual accuracy against established knowledge, and relevance to the original question. This meticulous evaluation process identifies not only outright errors but also subtle issues such as logical gaps, unsupported assumptions, and irrelevant tangential reasoning. Based on the comprehensive dual assessment results, we categorize all generated samples into three distinct quality tiers.

Human-in-the-Loop Quality Enhancement: For Category B samples and potentially redundant Category A samples, we implement a systematic human-guided refinement process designed to enhance reasoning quality while preserving valuable training data. Our comprehensive human review protocol encompasses several critical enhancement dimensions:

This human-in-the-loop approach ensures that samples falling into intermediate quality categories undergo systematic improvement rather than wholesale discarding. This methodology strikes an optimal balance between data preservation and quality assurance, thereby enhancing the overall effectiveness of our reasoning dataset for downstream model training.

Dynamic Quality Scoring and Data Utilization Strategy: To optimize data utilization, we implement a comprehensive five-point quality scoring system that evaluates samples across multiple dimensions:

Based on these quality scores, we implement an adaptive data utilization strategy where higher-quality samples are used more frequently during training. Specifically, samples scoring four or five points are repeated multiple times in the training dataset to reinforce high-quality reasoning patterns, while lower-scoring samples are used sparingly to avoid reinforcing suboptimal behaviors. This strategic approach ensures that the model’s learning process is dominated by the most valuable and challenging examples while maintaining overall dataset diversity.
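The score-dependent reuse described above could be sketched as follows. The repetition counts in `REPEATS` are hypothetical, since the report only states that samples scoring four or five are repeated and lower-scoring samples are used sparingly.

```python
from typing import Dict, List, Tuple

# Hypothetical repetition counts per quality score (1-5); not taken from the report.
REPEATS: Dict[int, int] = {5: 3, 4: 2, 3: 1, 2: 1, 1: 1}

def build_training_mix(samples: List[Tuple[str, int]]) -> List[str]:
    """Expand (sample_id, score) pairs into a training list with score-based repeats."""
    mix: List[str] = []
    for sample_id, score in samples:
        mix.extend([sample_id] * REPEATS[score])
    return mix

mix = build_training_mix([("a", 5), ("b", 4), ("c", 2), ("d", 1)])
```

In this toy configuration the two high-scoring samples account for five of the seven training entries, so they dominate the mix while the lower-scoring ones remain present.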

The entire automated pipeline demonstrates remarkable efficiency and consistency, processing large volumes of input data while maintaining stringent quality standards across diverse domains and task types. The systematic integration of automated generation, rigorous quality assessment, targeted human enhancement, and adaptive utilization creates a comprehensive framework for producing high-quality training data suitable for effective multimodal model cold-start scenarios.

4.3.2 Model Merging with Domain Specific Experts

We conduct a comprehensive analysis of the LongCoT cold start model’s performance across various benchmarks using the aforementioned training data, with the objective of identifying and addressing model deficiencies prior to the RL phase. Our analysis reveals concentrated weaknesses in three primary domains: pure text processing, mathematical reasoning, and OCR. To address these limitations, we develop a systematic approach involving specialized data collection and expert model training, followed by model merging to enhance Keye-VL-1.5’s foundational capabilities.

OCR Capability Enhancement: Beyond standard OCR datasets, we address specific weaknesses in specialized recognition tasks including license plates, street signage, and official seals. Our enhancement strategy involves three key components: First, we systematically gather OCR datasets targeting identified weak areas, ensuring annotation accuracy through rigorous quality control processes. Second, we develop an automated data pipeline that utilizes images paired with verified OCR annotations to generate relevant OCR questions through other MLLMs, with original annotations serving as ground truth answers to guarantee correctness. Finally, we conduct SFT on the cold-start model using both general-purpose OCR data and our specialized weak-area datasets to create an OCR expert model.

Model Merging: We employ model merging (Li et al., 2025b; Wei et al., 2025) to integrate domain-specific expert models and the LongCoT cold start model into a general model for enhanced performance.
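The report does not specify which merging algorithm is used. Purely as an illustration, the sketch below shows linear weight interpolation, one common merging scheme; the function `merge_linear`, the coefficients, and the toy parameter name `layer.w` are all assumptions.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]

def merge_linear(models: List[Weights], coeffs: List[float]) -> Weights:
    """Weighted average of parameters that share the same names and shapes."""
    if not models or len(models) != len(coeffs):
        raise ValueError("need one coefficient per model")
    total = sum(coeffs)
    merged: Weights = {}
    for name, reference in models[0].items():
        merged[name] = [
            sum(c * m[name][i] for m, c in zip(models, coeffs)) / total
            for i in range(len(reference))
        ]
    return merged

# Toy stand-ins for the cold-start model and an OCR expert.
base = {"layer.w": [1.0, 2.0]}
ocr_expert = {"layer.w": [3.0, 4.0]}
merged = merge_linear([base, ocr_expert], [0.5, 0.5])
```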

4.4 Iterative General RL

Based on the cold-start model, we design our General RL process to further enhance Keye-VL-1.5’s reasoning ability. It applies the Group Sequence Policy Optimization (GSPO) algorithm (Zheng et al., 2025) for Reinforcement Learning with Verifiable Rewards (RLVR) training and employs a cyclical, iterative approach to jointly enhance both the RL model and the cold-start model.

4.4.1 General RLVR Training

Training data: We select data from domains including mathematics, science and technology problems, logical reasoning and puzzle problems, code, chart question answering, visual grounding, spatial relationships, and counting to construct the RLVR training set. Each data point contains a verifiable answer used for rule-based reward calculation. We run ablation experiments to analyze the impact of each domain’s data on model metrics, and then increase the sampling proportion of domains that contribute positively to performance.

Training Algorithm: Based on sequence-level importance weight, GSPO employs the following sequence-level optimization objective:

$$
\mathcal{J}_{\text{GSPO}}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\,\{y_{i}\}_{i=1}^{G}\sim\pi_{\theta_{\text{old}}}(\cdot|x)}\left[\frac{1}{G}\sum_{i=1}^{G}\min\left(s_{i}(\theta)\hat{A}_{i},\,\text{clip}\left(s_{i}(\theta),1-\epsilon,1+\epsilon\right)\hat{A}_{i}\right)\right] \tag{1}
$$

where the group-based advantage estimation is defined as:

$$
\hat{A}_{i}=\frac{r(x,y_{i})-\text{mean}\left(\{r(x,y_{i})\}_{i=1}^{G}\right)}{\text{std}\left(\{r(x,y_{i})\}_{i=1}^{G}\right)} \tag{2}
$$

and the importance ratio $s_{i}(\theta)$, based on the length-normalized sequence likelihood, is defined as:

$$
s_{i}(\theta)=\left(\frac{\pi_{\theta}(y_{i}|x)}{\pi_{\theta_{\text{old}}}(y_{i}|x)}\right)^{\frac{1}{|y_{i}|}}=\exp\left(\frac{1}{|y_{i}|}\sum_{t=1}^{|y_{i}|}\log\frac{\pi_{\theta}(y_{i,t}|x,y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}|x,y_{i,<t})}\right) \tag{3}
$$
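To make the objective concrete, here is a minimal pure-Python sketch of Equations (1) to (3) for a single prompt with a group of G sampled responses. It evaluates the objective value only (no gradients). The clipping range `clip_eps=0.2`, the small `eps` added to the standard deviation, and the use of the population standard deviation are assumptions, not values from the report.

```python
import math
from statistics import mean, pstdev
from typing import List

def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Eq. (2): rewards standardized within the group (eps avoids division by zero)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def sequence_ratio(logp_new: List[float], logp_old: List[float]) -> float:
    """Eq. (3): length-normalized sequence-level importance ratio."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

def gspo_objective(logps_new: List[List[float]], logps_old: List[List[float]],
                   rewards: List[float], clip_eps: float = 0.2) -> float:
    """Eq. (1): clipped sequence-level surrogate averaged over the group."""
    advantages = group_advantages(rewards)
    total = 0.0
    for lp_new, lp_old, adv in zip(logps_new, logps_old, advantages):
        s = sequence_ratio(lp_new, lp_old)
        s_clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(s * adv, s_clipped * adv)
    return total / len(rewards)

# Per-token log-probs of two responses under the old and the new policy.
old = [[-1.0, -2.0], [-0.5, -0.5, -1.0]]
new = [[-0.9, -1.8], [-0.6, -0.7, -1.1]]
value = gspo_objective(new, old, rewards=[1.0, 0.0])
```

When the new and old policies coincide, every ratio equals 1 and the objective reduces to the mean of the standardized advantages, which is zero.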

4.4.2 Progressive Hint Sampling

During the training process, we find that the model struggles to generate correct responses for some difficult samples, reflecting a deficiency in the model’s capabilities. To make full use of these challenging samples and enhance Keye-VL-1.5’s reasoning ability, we apply the progressive hint sampling method to improve the success rate of sampling difficult samples.

We first identify the hard cases in the RLVR dataset where Keye-VL-1.5 consistently fails across multiple attempts, then select data with reliable reference answers, sufficient difficulty, and appropriate challenge level as samples for progressive hint sampling.

Unlike the approach of partitioning hints by step, we follow the Minimal Intervention principle to design a hierarchical hint system, aiming to provide the model with the minimal information necessary to solve the problem. We divide the hints into five levels, from abstract concepts to specific reasoning steps:

For each hard case, we place the hint information after the query and progressively provide hints from low level to high level. When Keye-VL-1.5 can generate a correct response based on a particular level of hint, we consider the hint at that level to be the minimal information required to help Keye-VL-1.5 solve the hard case. The responses generated from this minimal information are then used to update the policy. In Table 9, we report the impact of different hint levels on the sampling success rate of Keye-VL-1.5 on hard cases, which demonstrates the rationality of our hierarchical hint system and the effectiveness of hints in improving the utilization efficiency of hard cases.
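The escalation procedure above can be sketched as a simple loop. The callables `generate` and `is_correct`, the prompt template, and the per-level sampling budget are hypothetical placeholders; the report only specifies that hints follow the query and are revealed from the lowest level upward.

```python
from typing import Callable, List, Optional, Tuple

def minimal_hint_sample(
    query: str,
    hints: List[str],                    # ordered from level 1 (most abstract) upward
    generate: Callable[[str], str],      # hypothetical policy sampling function
    is_correct: Callable[[str], bool],   # hypothetical check against the reference answer
    attempts_per_level: int = 4,         # assumed sampling budget per hint level
) -> Optional[Tuple[int, str]]:
    """Return (hint level, response) for the lowest level that yields a correct answer."""
    for level, hint in enumerate(hints, start=1):
        prompt = f"{query}\n\nHint: {hint}"  # hint placed after the query
        for _ in range(attempts_per_level):
            response = generate(prompt)
            if is_correct(response):
                return level, response
    return None  # still unsolved even with the most specific hint

# Toy policy that only succeeds once the hint spells out the computation.
toy_generate = lambda prompt: "42" if "6*7" in prompt else "41"
result = minimal_hint_sample(
    "What is six times seven?",
    ["Think about multiplication.", "Compute 6*7."],
    toy_generate,
    lambda response: response == "42",
)
```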

4.4.3 Iterative General RL & Cold-Start Enhancement

To improve the learning efficiency on reasoning data and break through the performance bottleneck of the SFT model, we design a multi-round iterative paradigm that collaboratively enhances both the cold-start model, which serves as the starting point for General RL, and the model after General RL.

Our iterative pipeline is as follows:

    1. Apply the cold-start model as the initial model and perform General RL training.
    2. Apply the model after General RL for rejection sampling on the cold-start dataset, and score the samples with our reward model. If a sampled result is better than the ground truth, update that data point by replacing the ground truth with the sampled result.
    3. Take the updated cold-start data to train a new cold-start model, which serves as the initial model for the next round of General RL.
    4. Take the updated cold-start model to filter the General RL dataset, selecting data with sampling accuracy between 0 and 1 for the next round of General RL training.
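The two data-update steps of this loop could look roughly like the sketch below. The dictionary layout and function names are hypothetical, and reading "between 0 and 1" as strictly between (dropping prompts that are always solved or never solved) is an interpretation.

```python
from typing import Dict, List

def update_cold_start(references: Dict[str, dict], sampled: Dict[str, dict]) -> Dict[str, dict]:
    """Replace a reference answer when the sampled response has a higher reward score."""
    updated: Dict[str, dict] = {}
    for qid, ref in references.items():
        cand = sampled.get(qid)
        better = cand is not None and cand["score"] > ref["score"]
        updated[qid] = cand if better else ref
    return updated

def select_rl_data(accuracy: Dict[str, float]) -> List[str]:
    """Keep prompts whose sampling accuracy is strictly between 0 and 1."""
    return [qid for qid, acc in accuracy.items() if 0.0 < acc < 1.0]

refs = {"q1": {"text": "ref-1", "score": 0.6}, "q2": {"text": "ref-2", "score": 0.9}}
samples = {"q1": {"text": "new-1", "score": 0.8}, "q2": {"text": "new-2", "score": 0.7}}
new_refs = update_cold_start(refs, samples)
kept = select_rl_data({"a": 0.0, "b": 0.25, "c": 1.0})
```

Prompts with accuracy 0 or 1 give identical rewards within a group, so their standardized advantages vanish and they contribute no learning signal, which is a plausible motivation for the filter.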

4.5 Alignment RL

After General RL, we perform Alignment RL to comprehensively improve Keye-VL-1.5’s performance in real-world application scenarios. We have developed a diversified task system and reward modeling framework to enhance the model’s capabilities in the following dimensions:

4.5.1 Reward System Design

The reward system we employ is composed of three main categories:

This reward system helps guide the model towards producing accurate, ethical, and contextually appropriate outputs across various tasks.

4.5.2 Data Construction

For the instruction-following task, we design 25 types of hard constraints, including “keywords inclusion,” “punctuation,” “pronunciation,” “output format,” etc., as well as 20 types of soft constraints, such as text style and semantics. We construct a query set consisting of 17k multimodal samples and 23k pure-text samples, with each query assigned 2 to 6 types of constraints as inputs. Hard and soft constraints are rewarded through rule-based rewards and generative rewards, respectively.
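As an illustration of a rule-based reward for hard constraints, the sketch below implements two checkers ("keywords inclusion" and a JSON "output format") and scores a response by the fraction of constraints it satisfies. The checker implementations and the fractional scoring are assumptions; the report does not describe how individual constraint checks are aggregated.

```python
import json
from typing import Callable, List

Check = Callable[[str], bool]

def keyword_inclusion(keywords: List[str]) -> Check:
    """Hard constraint: every keyword must appear (case-insensitive)."""
    return lambda text: all(k.lower() in text.lower() for k in keywords)

def is_json(text: str) -> bool:
    """Hard constraint: the response must parse as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def hard_constraint_reward(response: str, checks: List[Check]) -> float:
    """Fraction of satisfied constraints (assumed aggregation rule)."""
    if not checks:
        raise ValueError("at least one constraint is required")
    return sum(1 for check in checks if check(response)) / len(checks)

checks = [keyword_inclusion(["paris"]), is_json]
full = hard_constraint_reward('{"city": "Paris"}', checks)
partial = hard_constraint_reward("Paris is nice", checks)
```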

For the reasoning task, we construct 12k mathematical and logical reasoning queries, with 3 to 5 problem-solving steps designed for each query. The model is required to solve the problem following the prescribed steps. We use rule-based rewards to assess the correctness of the outcome, and generative rewards to assess whether the reasoning process follows the predefined steps.

For the RAG task, we collect a series of instances based on the latest news that require internet searches to obtain answers. We encourage the model to use search and summary behaviors during the thinking process, ultimately generating the correct answer. We use generative rewards to evaluate the effectiveness of the search behavior in resolving the query, the correctness of the summary behavior, and the consistency of the final answer. We continue to use the GSPO algorithm to optimize our model during Alignment RL.

Training Infrastructure

To train MLLMs efficiently, we optimize the training infrastructure in depth to address three major challenges: architectural heterogeneity, load imbalance, and I/O bottlenecks.

Heterogeneous Hybrid Parallel Strategy: The training bottleneck of MLLMs stems from computational imbalance caused by architectural heterogeneity. The computational characteristics and resource demands of the ViT and the LLM are vastly different, and a unified parallel strategy leads to significant resource waste. To address this, we design a heterogeneous hybrid parallel strategy: for the relatively fixed computational pattern of the ViT component, we use only data parallelism (DP) to maximize throughput, whereas for the highly parameter- and memory-intensive LLM, we adopt a hybrid parallelism strategy that combines pipeline (PP), tensor (TP), and data parallelism (DP). This refined strategy is a decisive technical prerequisite for achieving 128K ultra-long sequence training of Keye-VL-1.5.

Dynamic Load Balancing Mechanism: Multimodal data inherently leads to load imbalance, primarily because the computational load of the visual encoding phase depends on the input samples. For instance, processing a high-resolution video incurs significantly more computational cost than processing a static image. In data-parallel training, GPUs that process complex visual inputs therefore take longer, while the other GPUs finish earlier and wait idle. To address this, we pre-estimate the time complexity of each sample and then use a greedy algorithm to allocate the samples across GPUs, thereby balancing the total step duration across all GPUs and improving overall hardware utilization.
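The report does not name the specific greedy algorithm. One standard choice, shown here as a sketch, is longest-processing-time-first scheduling: samples are sorted by estimated cost and each is assigned to the currently least-loaded GPU. The cost values and the function name `balance_samples` are illustrative assumptions.

```python
import heapq
from typing import List

def balance_samples(costs: List[float], num_gpus: int) -> List[List[int]]:
    """Greedily assign sample indices to GPUs, most expensive sample first."""
    heap = [(0.0, gpu) for gpu in range(num_gpus)]  # (accumulated load, gpu id)
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(num_gpus)]
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    for idx in order:
        load, gpu = heapq.heappop(heap)  # least-loaded GPU so far
        assignment[gpu].append(idx)
        heapq.heappush(heap, (load + costs[idx], gpu))
    return assignment

# Estimated per-sample costs, e.g. one long video followed by several images.
costs = [9.0, 1.0, 1.0, 4.0, 3.0, 2.0]
plan = balance_samples(costs, num_gpus=2)
loads = [sum(costs[i] for i in gpu_samples) for gpu_samples in plan]
```

In this toy case the single expensive sample shares a GPU with one cheap sample while the remaining four go to the other GPU, so both GPUs end up with the same total estimated cost.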

Flexible and Scalable Dataloader: To fundamentally resolve I/O bottlenecks, we design a flexible and scalable dataloader that is aware of the parallel training topology. For data parallelism (DP), each process loads only a shard of the global dataset; for pipeline parallelism (PP), only the first stage (PP0) is responsible for data acquisition and preprocessing; and for tensor parallelism (TP/CP), the data is first fetched by a single process within the group and then efficiently broadcast to the other processes. Furthermore, we implement an I/O server architecture to offload CPU-intensive tasks such as video decoding from the training nodes, effectively resolving CPU bottlenecks caused by complex media processing. Finally, we implement an instance-level perfect resume mechanism, ensuring that tasks can seamlessly resume from the last successfully processed sample after an interruption, significantly improving the stability and efficiency of large-scale training.
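A minimal sketch of the topology-aware loading rules follows. The strided sharding scheme and the choice of rank 0 as the fetching process within a tensor-parallel group are assumptions; the actual broadcast would be performed by the training framework and is not shown.

```python
from typing import List

def dp_shard(num_samples: int, dp_rank: int, dp_size: int) -> List[int]:
    """Sample indices owned by one data-parallel rank (strided split is assumed)."""
    return list(range(dp_rank, num_samples, dp_size))

def loads_data(pp_rank: int, tp_rank: int) -> bool:
    """Only the first pipeline stage loads data, and only one process per
    tensor-parallel group fetches it before broadcasting to its peers."""
    return pp_rank == 0 and tp_rank == 0

shards = [dp_shard(10, rank, 4) for rank in range(4)]
```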

| Models | ImageNet-1K | ImageNet-V2 | ImageNet-A | ImageNet-R | ImageNet-S | ObjectNet |
| --- | --- | --- | --- | --- | --- | --- |
| Base (SigLIP-400M-384-14) | 83.08 | 77.34 | 82.22 | 95.78 | 74.59 | 76.99 |
| + 1D interpolation | 82.02 | 75.96 | 80.92 | 94.50 | 70.74 | 67.58 |
| + 1D interpolation + 2D RoPE | 82.65 | 76.80 | 83.26 | 95.22 | 72.59 | 78.70 |

Table 2: Comparison of ViT variants on the ImageNet benchmarks: The highest scores are marked in bold and the second highest are underlined.

Figure 7: Comparison of the SlowFast (Keye-VL-1.5-Base) and 2D convolution (Qwen-2.5 VL) video encoding strategies on VideoMME across different video lengths, with panels (a) Frames and (b) FPS. Keye-VL-1.5-Base exhibits strong visual understanding capabilities across various settings, e.g., diverse frame numbers and FPS.

Evaluation

6.1 Zero-shot Image Classification of ViT

The evaluation covers six benchmark datasets: ImageNet-1K, ImageNet-V2, ImageNet-A, ImageNet-R, ImageNet-S, and ObjectNet. The results are shown in Table 2, from which we make the following observations: (1) Compared with the base SigLIP model, our native-resolution variant with 1D interpolated position embeddings shows a slight performance degradation. A likely reason is that the interpolated 1D position encoding cannot uniquely identify the underlying 2D patch arrangement. For instance, a sequence of 196 patches may correspond to multiple distinct spatial configurations (e.g., 14×14, 7×28, or 28×7), leading to ambiguous spatial localization during feature projection. (2) With the 2D RoPE modification, our ViT can clearly perceive the shape of the image and achieves results competitive with the base SigLIP model (the best or runner-up result on every benchmark). We think the reason may be that our continued pre-training corpus shares the same distribution as our MLLM training data rather than that of the image-text matching task.
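The ambiguity in observation (1) is easy to quantify: a flat sequence length alone does not determine the patch grid. The helper below, with the illustrative name `grid_shapes`, enumerates every height-by-width grid that yields a given number of patches.

```python
from typing import List, Tuple

def grid_shapes(num_patches: int) -> List[Tuple[int, int]]:
    """All (height, width) patch grids whose product equals num_patches."""
    return [(h, num_patches // h)
            for h in range(1, num_patches + 1)
            if num_patches % h == 0]

shapes = grid_shapes(196)
```

For 196 patches this yields nine candidate grids, including 14×14, 7×28, and 28×7, all of which receive identical 1D position indices.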

6.2 SlowFast Video Encoding Strategy Discussion

In this section, to verify that our SlowFast strategy can capture fine-grained video information, we conduct a comparative analysis between Keye-VL-1.5-Base and Qwen-2.5-VL. Keye-VL-1.5-Base is a pre-trained model equipped with our SlowFast technique, while Qwen-2.5-VL employs a 2D convolution merging technique for video compression.

For a fair comparison, we evaluate both models on the VideoMME benchmark under different settings. Specifically, we test with fixed frame numbers of 32, 64, 128, and up to 768, and with FPS values from 1 to 4 in increments of 1. Unlike 2D convolution, whose token budget grows linearly with the number of frames, our SlowFast strategy adapts its token budget to the information density of each video. Combining the two factors, we show the prediction performance and the LLM-side visual token budgets across video categories (i.e., short, medium, long, and overall) in Figure 7. From it, we make the following observations:

6.3 Public Benchmarks

| Benchmark | Keye-VL-1.5 8B-Thinking | Keye-VL-Preview 8B-Thinking | Qwen2.5-VL 7B | InternVL3 8B | MiMo-VL 7B-RL 2508 | GPT-4o | Claude 3.7 Sonnet |
| --- | --- | --- | --- | --- | --- | --- | --- |
| General | | | | | | | |
| OpenCompass | 79.5 | 77.4 | 70.9 | 73.6 | 75.2 | 72.0 | 70.1 |
| MMMU-val | 71.4 | 71.4 | 58.6 | 62.7 | 69.4 | 70.7 | 69.8 |
| AI2D | 89.5 | 86.7 | 83.9 | 85.2 | 87.1 | 82.6 | 81.4 |
| MMBench | 92.0 | 92.0 | 82.2 | 82.1 | 86.8 | 86.0 | 79.7 |
| BLINK-val | 54.9 | 52.0 | 56.4 | 55.5 | 62.2 | 60.0 | 62.3 |
| ZeroBench-sub | 16.2 | 15.2 | 0.0 | 0.0 | 18.2 | - | - |
| VisuLogic | 23.1 | 25.6 | 20.0 | 26.1 | 24.5 | - | - |
| RealWorldQA | 73.5 | 67.7 | 68.2 | 70.6 | 71.0 | - | - |
| SimpleVQA | 42.9 | 41.6 | 41.4 | 35.1 | 44.9 | - | - |
| MMStar | 80.5 | 75.5 | 64.9 | 68.4 | 73.7 | - | - |
| MMVP | 80.7 | 79.0 | 78.0 | 78.3 | 81.7 | - | - |
| HallusionBench | 62.7 | 67.0 | 55.7 | 49.4 | 65.2 | - | - |
| OCRBench | 86.6 | 85.1 | 89.7 | 88.0 | 82.2 | 84.3 | 80.6 |
| Video | | | | | | | |
| Video-MME (w/o sub.) | 73.0 | 67.7 | 65.1 | 66.3 | 68.9 | 71.9 | - |
| Video-MMMU | 66.0 | 57.6 | 47.4 | 48.9 | 59.5 | - | - |
| TempCompass | 75.5 | 71.5 | 68.3 | 70.8 | - | - | - |
| LongVideoBench | 66.0 | 62.8 | 59.3 | 63.9 | 64.9 | - | - |
| MMVU | 68.3 | 66.3 | 45.5 | 39.4 | - | - | - |
| MATH | | | | | | | |
| MathVision | 46.8 | 46.0 | 26.2 | 28.8 | 48.7 | 31.2 | - |
| MathVista-MINI | 81.2 | 80.7 | 66.8 | 70.7 | 79.0 | 63.8 | - |
| MathVerse-vision | 68.7 | 59.8 | 44.9 | 32.4 | 74.8 | 49.9 | - |
| OlympiadBench | 47.5 | 54.8 | 19.4 | 25.9 | 56.4 | 25.9 | - |
| WeMath | 67.5 | 60.7 | 37.7 | 38.5 | 65.2 | 50.6 | - |
| LogicVista | 58.8 | 54.8 | 44.5 | 43.6 | 63.5 | 54.4 | - |
| DynaMath | 39.7 | 37.3 | 20.1 | 23.9 | 48.7 | 54.4 | - |

Table 3: Comparison of Keye-VL-1.5 in Thinking mode with Keye-VL-Preview and other models on diverse visual-language benchmarks: The best results among open-source models are bolded and the second-best results are underlined.

In this section, we evaluate Keye-VL-1.5 across various benchmarks. For general vision-language tasks, we select OpenCompass (Contributors, 2023), MMMU (Yue et al., 2024), AI2D (Kembhavi et al., 2016), MMBench (Liu et al., 2024b), BLINK (Fu et al., 2024), ZeroBench (Roberts et al., 2025), VisuLogic (Xu et al., 2025b), RealWorldQA (X, 2025), SimpleVQA (Cheng et al., 2025b), MMStar (Chen et al., 2024d), MMVP (Tong et al., 2024), HallusionBench (Guan et al., 2024), and OCRBench (Liu et al., 2024c). For public video tasks, we select Video-MME (Fu et al., 2025b), Video-MMMU (Hu et al., 2025b), TempCompass (Liu et al., 2024d), LongVideoBench (Wu et al., 2024), and MMVU (Zhao et al., 2025). For MATH tasks, we select MathVision (Wang et al., 2024c), MathVista-MINI (Lu et al., 2023), MathVerse-vision (Zhang et al., 2024), OlympiadBench (He et al., 2024), WeMath (Qiao et al., 2024), LogicVista (Xiao et al., 2024), and DynaMath (Zou et al., 2024).

We compare the performance of Keye-VL-1.5 in Thinking mode with Keye-VL-Preview and other state-of-the-art models of a similar scale, including Qwen2.5-VL 7B, InternVL3-8B (Zhu et al., 2025), MiMo-VL-7B-RL 2508 (Xiaomi, 2025), and proprietary models such as GPT-4o and Claude-3.7-Sonnet.

On general vision-language tasks, Keye-VL-1.5 demonstrates competitive performance across most benchmarks, often achieving SOTA or near-SOTA results and outperforming other models overall. On the large-scale general benchmarks OpenCompass, MMMU-val, and AI2D, Keye-VL-1.5 obtains scores of 79.5%, 71.4%, and 89.5%, respectively, matching or surpassing all other models. On MMBench and MMStar, Keye-VL-1.5 also achieves the best performance. In mathematical reasoning tasks, Keye-VL-1.5 significantly outperforms Qwen2.5-VL 7B and InternVL3-8B, achieving results comparable to MiMo-VL 7B-RL.

In video-centric scenarios, Keye-VL-1.5 demonstrates superior capabilities compared to other open-source models. Our evaluations indicate that an accurate understanding of video content is Keye-VL-1.5’s core advantage. On public video benchmarks, Keye-VL-1.5 significantly outperforms other models, particularly on Video-MMMU, with an absolute improvement of 6.5%.

6.4 Internal Benchmarks

Despite extensive evaluations on a wide array of public video benchmarks, these benchmarks exhibit numerous limitations that necessitate a focused effort on developing a proprietary, internal evaluation suite. The primary issues are as follows:

Therefore, we construct a rigorous internal video evaluation benchmark. The video sources include both internal and external platform content, as well as artificially constructed videos, with resolutions ranging from 360p to 1440p, effectively avoiding overlap with existing training data. The questions are categorized into several dimensions to provide comprehensive coverage: Visual Element Recognition for assessing visual element identification capabilities, Reasoning Ability for evaluating logical reasoning skills, Temporal Info Understanding for measuring temporal information comprehension, Knowledge-based QA for testing knowledge-grounded question answering, Description Ability for evaluating descriptive capabilities, Robustness for testing model stability, Creative Ability for assessing creative thinking, and Domain Expertise for evaluating specialized domain knowledge.

The scoring methodology employs comparative evaluation across multiple model results and GSB (Good, Same, Bad) preference selection. The baseline models can be either GPT-4o or Gemini 1.5 Pro. The specific evaluation approach involves two methods. First, the scoring method uses multiple models (typically 2) to generate results that are evaluated separately on a 1-5 scale. Three annotators score the answers based on the video content and reference annotation guidelines, providing both fine-grained and overall scores. Second, the GSB method involves direct comparison between two model results using Good-Same-Bad preference selection. When two answers have significantly different scores, the higher-scoring answer is preferred. When the scores are similar, the selection is based on annotation rules and subjective judgment to determine which answer is better. If no clear distinction can be made, the selection reflects whether both answers are equally good, equally poor, or equally average based on answer quality.

| Model | Average | Correctness | Completeness | Relevance | Fluency | Creativity |
| --- | --- | --- | --- | --- | --- | --- |
| Keye-VL-1.5-8B | 3.53 | 3.73 | 4.62 | 4.85 | 4.59 | 3.64 |
| MiMoVL-7B-RL-2508 | 3.40 | 3.54 | 4.63 | 4.93 | 4.82 | 3.79 |
| Performance Comparison: | | | | | | |
| vs. MiMoVL-7B-RL-2508 | +0.13 | +0.19 | -0.01 | -0.08 | -0.23 | -0.15 |
| vs. Keye-VL-Preview | +0.51 | +0.57 | +0.25 | +0.11 | -0.24 | -0.26 |

Table 4: Comprehensive capability evaluation comparison: This table presents the performance comparison between Keye-VL-1.5-8B and MiMoVL-7B-RL-2508 across multiple dimensions including correctness, completeness, relevance, fluency, and creativity. Performance differences against baseline models are also provided, with the highest scores marked in bold. Positive values indicate performance improvements, while negative values indicate performance degradation.

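| Model Version | Visual Element Recognition | Reasoning Ability | Temporal Info Understanding | Knowledge-based QA | Description Ability | Robustness | Creative Ability | Domain Expertise | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Number of Cases | 35 | 27 | 22 | 30 | 11 | 24 | 29 | 22 | 200 |
| Keye-VL-1.5-8B | 3.49 | 3.81 | 3.36 | 2.50 | 3.73 | 4.29 | 3.66 | 3.68 | 3.53 |
| MiMoVL-7B-RL-2508 | 3.49 | 3.56 | 3.18 | 2.60 | 3.91 | 3.46 | 3.66 | 3.68 | 3.40 |
| Performance Comparison: | | | | | | | | | |
| vs. MiMoVL-7B-RL-2508 | 0.00 | +0.25 | +0.18 | -0.10 | -0.18 | +0.83 | 0.00 | 0.00 | +0.13 |
| vs. Keye-VL-Preview | +0.35 | +1.00 | +0.77 | +0.27 | +0.46 | +0.41 | +0.11 | +0.91 | +0.51 |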

Table 5: Detailed capability evaluation across multiple dimensions: This table presents a comprehensive comparison of Keye-VL-1.5-8B and MiMoVL-7B-RL-2508 across eight core capabilities including visual element recognition, reasoning ability, temporal information understanding, knowledge-based QA, description ability, robustness, creative ability, and domain expertise. The evaluation is based on 200 test cases distributed across different capability categories. The highest scores are marked in bold, and performance differences are provided for comparative analysis.

6.5 Evaluation Results

Keye-VL-1.5-8B achieves significant performance improvements over previous versions: As demonstrated in Table 4, Keye-VL-1.5-8B establishes a substantial lead with an overall composite score of 3.53, representing a remarkable +0.51 improvement over Keye-VL-Preview. This advancement is particularly pronounced in correctness (+0.57) and completeness (+0.25), demonstrating the model’s enhanced ability to provide accurate and comprehensive responses. The model also shows notable gains in relevance (+0.11), indicating improved alignment between responses and user queries.

The model demonstrates competitive performance against industry benchmarks: In direct comparison with MiMoVL-7B-RL-2508, Keye-VL-1.5-8B achieves a higher overall score (3.53 vs. 3.40), establishing a +0.13 advantage in composite performance. The model particularly excels in correctness (+0.19) while maintaining competitive performance in completeness (-0.01). However, the evaluation reveals trade-offs in certain dimensions, with MiMoVL-7B-RL-2508 showing superior performance in fluency (+0.23), relevance (+0.08), and creativity (+0.15). This performance profile indicates that while our model achieves stronger factual accuracy, it faces challenges in language generation sophistication.

Detailed capability analysis reveals domain-specific strengths and optimization priorities: The fine-grained evaluation in Table 5 demonstrates Keye-VL-1.5-8B’s exceptional performance across multiple core capabilities. The model achieves decisive advantages in Reasoning Ability (3.81), Temporal Information Understanding (3.36), and Robustness (4.29), with the latter representing a substantial +0.83 lead over MiMoVL-7B-RL-2508. These results highlight the model’s particular strength in handling complex analytical tasks and maintaining consistent performance under challenging conditions. The model matches MiMoVL-7B-RL-2508 in Visual Element Recognition (3.49) and Creative Ability (3.66).

The model establishes a strong foundation in fundamental visual understanding capabilities: Keye-VL-1.5-8B’s performance demonstrates significant improvements in core visual processing tasks compared to previous iterations. The +0.35 advancement in visual element recognition and +1.00 improvement in reasoning ability over Keye-VL-Preview indicate substantial progress in fundamental perceptual and cognitive pathways. Particularly notable is the model’s +0.77 improvement in temporal information understanding, reflecting enhanced capability in processing sequential visual information and understanding dynamic relationships within video content. These foundational improvements provide a robust platform for handling complex multimodal reasoning tasks.

6.6 Ablation Studies and Findings

6.6.1 Effects of SFT, MPO, and Long CoT Cold Start

Table 6 presents a comprehensive evaluation of different training methodologies using varying quantities of high-quality data for SFT and MPO. The experimental results show that increasing the volume of SFT training data improves model performance on mathematical reasoning, logical inference, and OCR benchmarks. Notably, our carefully curated preference dataset for MPO yields additional performance improvements on most evaluated benchmarks. Long CoT cold-start training produces particularly large gains across nearly all benchmarks, most notably in mathematical reasoning tasks. These findings empirically validate the effectiveness of our proposed data processing pipeline and training methodology, demonstrating the synergistic benefits of combining high-quality supervised fine-tuning with preference optimization and strategic initialization approaches.

Table 6: Performance comparison of different training strategies across multiple benchmarks. The table shows evaluation results for SFT and MPO training with varying dataset sizes (15k and 128k samples), as well as the Long CoT Cold Start and RL approaches.

Model | OpenCompass | MMBCN | MMBEN | MMVet | AI2D | Hallusion | MathVista | MMMU | MMStar | OCR
Baselines
Qwen2.5-VL 7B | 70.56 | 82.66 | 83.28 | 65.60 | 84.39 | 55.97 | 66.60 | 56.56 | 64.60 | 87.80
MiMO-VL-7B | 75.62 | 81.50 | 83.13 | 77.52 | 83.78 | 61.95 | 80.30 | 65.22 | 70.80 | 83.10
Keye-VL-7B-Preview | 77.43 | 90.71 | 92.03 | 68.62 | 87.18 | 61.98 | 78.70 | 71.67 | 75.00 | 84.90
SFT+MPO
SFT-15k | 67.24 | 80.96 | 83.75 | 59.13 | 81.44 | 51.20 | 63.50 | 56.44 | 61.13 | 82.70
MPO-15k | 69.31 | 80.65 | 83.28 | 62.02 | 83.19 | 52.79 | 67.00 | 61.22 | 63.07 | 83.20
SFT-128k | 67.80 | 80.42 | 83.67 | 54.82 | 82.55 | 51.44 | 65.70 | 57.44 | 61.80 | 86.60
MPO-128k | 70.34 | 81.27 | 83.44 | 62.34 | 84.52 | 55.76 | 68.40 | 58.33 | 65.33 | 85.70
Long CoT Cold Start
Long CoT Cold Start | 75.32 | 88.24 | 88.93 | 62.89 | 86.04 | 61.05 | 76.40 | 68.33 | 73.20 | 86.10
RFT-SFT | 76.33 | 89.16 | 91.02 | 67.29 | 86.43 | 61.52 | 77.60 | 67.78 | 74.20 | 85.70
RL
Keye-VL-1.5-RL | 79.41 | 92.88 | 92.88 | 71.19 | 90.35 | 65.68 | 81.30 | 69.00 | 79.20 | 85.70
& Partial Solution | 80.13 | 93.27 | 93.50 | 73.67 | 89.77 | 66.12 | 82.60 | 71.67 | 80.93 | 85.10

Table 7: Performance evaluation of expert models and model merging techniques on OCR-related benchmarks. The table compares baseline models with our approach, including base model, OCR expert model, and the merged configuration.

Model | AVG | TextVQA (Test) | ChartQA (Test) | InfographicVQA (Val) | DocVQA (Val) | OCRBench (Test)
Baseline
MiMoVL-7B-RL-2508 | 81.41 | 75.57 | 70.00 | 84.93 | 94.35 | 82.20
Keye-VL-8B-Preview | 79.68 | 75.47 | 86.24 | 66.89 | 84.31 | 85.50
Ours
Base Model | 78.25 | 70.45 | 78.08 | 69.85 | 87.18 | 85.70
OCR Expert | 83.65 | 79.36 | 84.76 | 74.54 | 93.21 | 86.40
Merge OCR + Base | 84.51 | 83.40 | 84.88 | 74.26 | 93.33 | 86.70

6.6.2 Effectiveness of Expert Models and Model Merging

Table 7 demonstrates the effectiveness of our expert model approach and model merging technique, using OCR tasks as a representative case study. Our base model initially achieves an average OCR score of 78.25%, comparable to the preview version (79.68%), but exhibits notable deficiencies in specialized domains such as license plate recognition, seal/stamp identification, and street scene text extraction. To address these limitations, we develop a specialized OCR expert model trained on curated domain-specific data. The OCR expert improves on the base model across all evaluated OCR benchmarks, achieving an average score of 83.65%. Merging the base model with the OCR expert raises the average further to 84.51%. This merged configuration surpasses MiMoVL-7B-RL-2508 on average (84.51% vs. 81.41%), with particularly notable improvements in TextVQA (83.40% vs. 75.57%) and ChartQA (84.88% vs. 70.00%).

These empirical results validate the effectiveness of our proposed technical approach, demonstrating that domain-specific expert models can be successfully integrated with general-purpose base models to achieve superior performance across specialized tasks while maintaining overall model capabilities. Additionally, our experiments reveal the following findings:

Limited Training Steps: Expert models trained with more steps continue improving within their specialized domains. However, merged model performance initially increases with expert training steps, then decreases, indicating an optimal training duration.

Limited Learning Rate: Expert models achieve better performance with smaller learning rates, and the corresponding merged models also perform better.

The parameter divergence between expert and general models significantly affects merged model performance. Small divergences limit domain-specific improvements, while large divergences lead to suboptimal merged performance, creating a critical trade-off between specialization and integration.
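The report does not state the merging algorithm in this section, so the following is only a minimal sketch under an explicit assumption: the merge is modeled as linear interpolation between base and expert weights, and parameter divergence is modeled as the L2 distance between the two checkpoints. The function names, the `alpha` coefficient, and the toy checkpoints are hypothetical and chosen purely for illustration.

```python
import math

def merge_weights(base, expert, alpha=0.5):
    """Assumed merge rule: (1 - alpha) * base + alpha * expert, per parameter."""
    if base.keys() != expert.keys():
        raise ValueError("checkpoints must share parameter names")
    merged = {}
    for name, base_vals in base.items():
        expert_vals = expert[name]
        if len(base_vals) != len(expert_vals):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            (1 - alpha) * b + alpha * e for b, e in zip(base_vals, expert_vals)
        ]
    return merged

def parameter_divergence(base, expert):
    """Assumed divergence measure: L2 distance over all parameters."""
    total = 0.0
    for name, base_vals in base.items():
        total += sum((e - b) ** 2 for b, e in zip(base_vals, expert[name]))
    return math.sqrt(total)

# Toy checkpoints (hypothetical values), stored as flat lists per parameter.
base = {"layer.weight": [0.0, 1.0, 2.0], "layer.bias": [0.5]}
expert = {"layer.weight": [0.2, 1.0, 1.6], "layer.bias": [0.7]}

merged = merge_weights(base, expert, alpha=0.5)
divergence = parameter_divergence(base, expert)
print(merged)
print(f"divergence: {divergence:.4f}")
```

Under this assumption, longer expert training or a larger learning rate would tend to increase `divergence`, which is one way to read the trade-off described above: a small distance leaves little to gain from merging, while a large distance means the interpolated weights may sit far from both checkpoints.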

Table 8: Performance comparison of alignment reinforcement learning across instruction following and mathematical reasoning benchmarks. The evaluation includes both multimodal and text-only instruction following tasks, as well as comprehensive mathematical reasoning assessments. Results are presented for both Think and No-Think inference modes.

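Column groups: Instruction Following (MIA-Bench, MMIFEval, IFEval, LiveBench); Math Reasoning (WeMath, MathVerse, MathVision, LogicVista)
Model Name | Mode | MIA-Bench | MMIFEval | IFEval | LiveBench | WeMath | MathVerse | MathVision | LogicVista
Baselines
Keye-VL-8B-preview | Think | 87.60 | 56.97 | 65.80 | 59.30 | 60.76 | 59.77 | 46.22 | 54.81
Keye-VL-8B-preview | No-Think | 89.85 | 56.06 | 73.75 | 53.00 | - | - | - | -
Ours
Alignment RL | Think | 91.95 | 63.45 | 70.98 | 64.70 | 64.95 | 61.17 | 48.45 | 57.27
Alignment RL | No-Think | 91.06 | 62.87 | 78.37 | 61.70 | - | - | - | -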

6.6.3 Effectiveness of Alignment Reinforcement Learning

To validate the effectiveness of our alignment reinforcement learning approach, we conducted comprehensive evaluations starting from the Keye-VL-8B-preview baseline, focusing on instruction following capabilities and mathematical reasoning performance. Our evaluation framework encompasses both multimodal instruction following benchmarks (MIA-Bench and MMIFEval) and text-only instruction following assessments (IFEval and LiveBench). For mathematical reasoning evaluation, we selected four widely adopted benchmarks to ensure comprehensive coverage of mathematical capabilities. As demonstrated in Table 8, our alignment RL approach consistently outperforms the baseline across both inference modes. In the Think mode, substantial improvements are observed across all instruction following benchmarks, with notable gains of 4.35 points on MIA-Bench (91.95% vs. 87.60%), 6.48 points on MMIFEval (63.45% vs. 56.97%), and 5.40 points on LiveBench (64.70% vs. 59.30%). Similarly, in the No-Think mode, the model demonstrates consistent improvements, particularly achieving a 4.62-point enhancement on IFEval (78.37% vs. 73.75%). The mathematical reasoning capabilities also exhibit modest but consistent improvements across all evaluated benchmarks, with gains ranging from 1.40 points (MathVerse) to 4.19 points (WeMath). These results empirically validate that our alignment algorithm effectively enhances functional capabilities in instruction following while simultaneously strengthening general reasoning abilities. The consistent performance improvements across diverse evaluation metrics confirm the robustness and effectiveness of our alignment reinforcement learning methodology.
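As a worked check of the arithmetic behind these comparisons, the short sketch below recomputes the Think-mode point gains from the scores in Table 8. The scores are copied from the table; the mean math gain is a derived quantity that the report itself does not state, and the helper name `point_gains` is illustrative.

```python
# Think-mode scores copied from Table 8.
BASELINE_THINK = {
    "MIA-Bench": 87.60, "MMIFEval": 56.97, "IFEval": 65.80, "LiveBench": 59.30,
    "WeMath": 60.76, "MathVerse": 59.77, "MathVision": 46.22, "LogicVista": 54.81,
}
ALIGNMENT_RL_THINK = {
    "MIA-Bench": 91.95, "MMIFEval": 63.45, "IFEval": 70.98, "LiveBench": 64.70,
    "WeMath": 64.95, "MathVerse": 61.17, "MathVision": 48.45, "LogicVista": 57.27,
}
MATH_BENCHMARKS = ("WeMath", "MathVerse", "MathVision", "LogicVista")

def point_gains(after, before):
    """Per-benchmark difference in points, rounded to two decimals."""
    return {name: round(after[name] - before[name], 2) for name in before}

think_gains = point_gains(ALIGNMENT_RL_THINK, BASELINE_THINK)
math_gains = [think_gains[name] for name in MATH_BENCHMARKS]
# Derived here for illustration; not a figure reported in the table.
mean_math_gain = round(sum(math_gains) / len(math_gains), 2)

for name, gain in think_gains.items():
    print(f"{name}: +{gain:.2f}")
print(f"mean math gain: +{mean_math_gain:.2f}")
```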

Table 9: Effect of different hint levels on model performance across multiple attempts. The table compares the percentage of completely incorrect data, average score for four attempts, and standard deviation for each level of hint provided.

Hint | Percentage of Completely Incorrect Data (%) | Average Score for Four Attempts | Standard Deviation
no hint | 25.56 | 1.62 | 1.18
level 1: conceptual | 13.44 | 2.53 | 1.43
level 2: strategic | 12.25 | 2.66 | 1.41
level 3: tooling | 10.08 | 2.70 | 1.39
level 4: procedural | 8.96 | 2.87 | 1.35
level 5: solution | 0.20 | 3.96 | 0.28

6.6.4 Effect of Partial Solutions During RL Phase

To evaluate the model’s performance under different hint conditions, we use the success rate across four rollout attempts as the primary metric. Approximately 8,000 RL data samples are selected for testing under the six conditions listed in Table 9: no hint, and five hint levels ranging from conceptual (level 1) to a full solution (level 5).

As shown in Table 9, without any hints, all four attempts are incorrect for approximately 25.56% of the samples, significantly reducing the efficiency of the RL process. As the hints approach a complete solution (level 5), the share of completely incorrect samples falls steadily (from 25.56% to 0.20%) and the average score for the four attempts rises (from 1.62 to 3.96), with the standard deviation dropping to 0.28 at level 5, indicating more stable and accurate responses. Additionally, a comparison between performance in the RL phase with and without partial solutions in Table 6 shows improvements across various benchmarks, including an increase in the average score from 79.41 to 80.13 on OpenCompass and a 1.3-point improvement on MathVista, further validating the effect of partial solutions.
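The three statistics reported in Table 9 can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: each sample's score is assumed to be the number of correct rollouts out of four, the standard deviation is assumed to be the population standard deviation across samples, and the toy data is invented. The helper `minimal_hint_level` is likewise a hypothetical reading of progressive hinting, picking the weakest hint at which a sample stops being completely incorrect.

```python
import statistics

def hint_level_stats(correct_counts):
    """Summarize one hint condition from per-sample correct counts (0..4)."""
    n = len(correct_counts)
    all_wrong = sum(1 for c in correct_counts if c == 0)
    return {
        "pct_all_wrong": 100.0 * all_wrong / n,
        "mean_score": statistics.fmean(correct_counts),
        "std": statistics.pstdev(correct_counts),
    }

def minimal_hint_level(counts_by_level):
    """Return the first hint level (0 = no hint) with at least one correct
    rollout, or None if the sample fails at every level."""
    for level, count in enumerate(counts_by_level):
        if count > 0:
            return level
    return None

# Toy data (hypothetical): correct rollouts out of four for eight samples.
no_hint_counts = [0, 0, 1, 4, 2, 0, 3, 1]
stats = hint_level_stats(no_hint_counts)
print(stats)

# One sample's correct counts at no hint and hint levels 1 to 5.
print(minimal_hint_level([0, 0, 2, 3, 4, 4]))
```

Selecting the weakest sufficient hint keeps a difficult sample usable for RL while revealing as little of the solution as possible, which is consistent with the motivation given above.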

6.6.5 Impact of Rejection Sampling on SFT and RL Performance

Figure 8: Benefits of rejection sampling in the RL Phase. Starting from Keye-VL-8B-Preview, we compare the performance of direct RL and RFT-RL strategies.

In our RL iteration process, we employ rejection sampling twice. To validate the effectiveness of this approach, we conduct experiments starting from Keye-VL-8B-Preview and compare two strategies on the same RL dataset. In direct RL, Keye-VL-8B-Preview is trained with RL once. In contrast, Keye-VL-8B-Preview-RFT-RL first undergoes one round of iteration and is then trained in a second RL phase. As shown in Figure 8, this iterative strategy significantly boosts RL performance, increasing the average mathematical benchmark score from 60.37 to 62.24, with similar improvements observed across general reasoning benchmarks. Table 6 additionally compares Long CoT Cold Start with RFT-SFT, in which SFT data is rejection-sampled using an RL model and the best samples are then selected by a reward model for further SFT training. With RFT-SFT, the OpenCompass average score rises from 75.32 to 76.33, with improvements on most other benchmarks. Based on these findings, we adopt the SFT-RL-(RFT-SFT)-(RFT-RL) iterative pipeline to further enhance performance.
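The RFT-SFT data construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `generate`, `is_correct`, and `reward` are hypothetical callables standing in for the RL model, an answer checker, and the reward model, and filtering by correctness before reward-based selection is an assumption about how the rejection step works.

```python
from typing import Callable, List, Optional, Tuple

def rejection_sample(
    prompt: str,
    generate: Callable[[str, int], str],
    is_correct: Callable[[str, str], bool],
    reward: Callable[[str, str], float],
    num_samples: int = 4,
) -> Optional[str]:
    """Sample candidates, reject incorrect ones, keep the highest-reward one."""
    candidates = [generate(prompt, i) for i in range(num_samples)]
    accepted = [c for c in candidates if is_correct(prompt, c)]
    if not accepted:
        return None  # no usable response for this prompt
    return max(accepted, key=lambda c: reward(prompt, c))

def build_rft_sft_set(prompts, generate, is_correct, reward) -> List[Tuple[str, str]]:
    """Collect (prompt, best response) pairs for a further round of SFT."""
    data = []
    for prompt in prompts:
        best = rejection_sample(prompt, generate, is_correct, reward)
        if best is not None:
            data.append((prompt, best))
    return data

# Deterministic toy stand-ins (hypothetical) so the sketch runs end to end.
ANSWERS = {"2+3": "5", "7*6": "42", "10/4": "2.5"}
POOL = {
    "2+3": ["6", "5", "the answer is 5", "4"],
    "7*6": ["42", "41", "42", "13"],
    "10/4": ["2", "3", "2", "3"],
}

def toy_generate(prompt, i):
    return POOL[prompt][i]

def toy_is_correct(prompt, response):
    return response.split()[-1] == ANSWERS[prompt]

def toy_reward(prompt, response):
    return -len(response)  # toy preference for concise responses

sft_data = build_rft_sft_set(list(ANSWERS), toy_generate, toy_is_correct, toy_reward)
print(sft_data)
```

Prompts for which every candidate is rejected are simply dropped here; how such prompts are handled in the actual pipeline is not specified in this section.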

Conclusion and Discussion

In this work, we presented Keye-VL-1.5, an advanced multimodal model that significantly enhances video understanding and vision-language tasks. By employing a novel Slow-Fast video encoding strategy, we efficiently balance temporal coverage and spatial resolution. The model’s progressive pre-training, with an extended context length, enables it to handle longer videos and complex visual content, while post-training methods focused on reasoning and human preference alignment improve instruction-following and reasoning abilities. Our evaluation demonstrates that Keye-VL-1.5 advances video understanding capabilities while maintaining strong performance on general vision-language tasks.

Appendix A Case Study

Video Case I: Grounding

Input Video
Prompts: What is the time period during which the handbag appears in this video?
Model Responses: The time period during which the bag appears in the video is from 22.3 seconds to 23.8 seconds, i.e., 22.3s-23.8s.

Figure 9: In this 26-second video, the creator provides a detailed introduction to the performance, advantages, and features of the power bank. The handbag appears for only about two seconds near the end of the video. Keye-VL-1.5 accurately pinpoints this time range, with a precision of 0.1 seconds.

Video Case II: Content Understanding

Input Video
Prompts: What is the purpose of the big dog biting the little dog’s ear in this video?
Model Responses: <think> In the video, the little dog tries to reach the food on the table, and the big dog (the “dad”) lightly bites the little dog’s ear as a way of correcting the little dog’s improper behavior. This gentle bite is a common “educational” method in dog packs, where adult dogs use it to tell the puppies, “This is wrong,” and that rules must be followed. After receiving this signal, the little dog gradually sits down from the standing position and eventually lies down to comply, indicating that the big dog’s action successfully conveyed the warning of “don’t steal food from the table.”</think> <answer>The big dog bites the little dog’s ear to correct its improper behavior of trying to steal food from the table, using a gentle bite to deliver a warning and make the little dog comply.</answer>

Figure 10: In this video, the caption shows that the dog dad prevents the little dog from eating the food on the dining table by holding the little dog down. The question asks about the reason behind the big dog biting the little dog’s ear. Keye-VL-1.5 accurately correlates the video content and the caption, determining that the purpose of the big dog biting the little dog’s ear is the same as holding the little dog down – both are meant to teach the little dog not to eat the food on the table.

Video Case III: Content Description

Input Video
Prompts: Please describe the scene in the video in detail.
Model Responses: The main scene of the video is a damp corner surrounded by dense forest. In the center of the frame is an old, broken stone staircase, with cracks in the stones and the surface covered in green moss and weeds, giving it a weathered and worn appearance. Surrounding it are tall, dark trees with branches hanging with fresh green leaves. Soft, diffused light filters through the leaves, adding a layer of misty ambiance to the scene. The ground is scattered with small white flowers or fungi, interwoven with the green moss, forming a vibrant, living ground cover. In the background, the forest stretches into the distance, with trees becoming gradually blurred, adding depth and mystery to the scene. The air seems to be filled with fine mist, making the ground and the stone steps appear damp, with occasional raindrops falling, further enhancing the humid atmosphere. The overall color tone is dominated by greens and browns, presenting a serene, natural, and slightly desolate image of a primal forest.

Figure 11: This video shows a beautiful yet rare phenomenon: hail falling in the forest. Without prior knowledge, even human viewers could easily mistake the falling hail for petals or raindrops. Although Keye-VL-1.5 fails to identify the falling objects as hail, it mentions precipitation, and the rest of the content is correctly identified.

Appendix B Authors (Alphabetical order)

Core Contributors: Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Haonan Fan, Hengrui Ju, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Muhao Wei, Qiang Wang, Ruitao Wang, Sen Na, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zeyi Lu, Zhenhua Wu, Zhixin Ling, Zhuoran Yang, Ziming Li.

Contributors: Di Xu, Haixuan Gao, Hang Li, Jing Wang, Lejian Ren, Qigen Hu, Qianqian Wang, Shiyao Wang, Xinchen Luo, Yan Li, Yuhang Hu, Zixing Zhang.