Kwai Keye-VL-2.0 Technical Report
Abstract
We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-level videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. This architecture is underpinned by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that raise throughput and reduce computational overhead. Furthermore, to overcome catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively supports advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks demonstrate that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, particularly excelling in fine-grained temporal localization on TimeLens and long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.
Figure 1: Performance Comparison of Keye-VL-2.0-30B-A3B. Our model demonstrates leading capabilities against open-source models (e.g., Qwen3.5-35B-A3B, Qwen3-VL-235B-A22B) and closed-source models (Gemini-3-Flash) across fine-grained temporal localization (ActivityNet, QVHighlights, and Charades under the TimeLens framework) and extreme long-video understanding (LongVideoBench, Video-MME-v2).
Contents
- 1 Introduction
- 2 Model Architecture
- 3 Pre-Training
- 4 Post-Training
- 5 Efficient Training and Inference Infrastructure
- 6 Comprehensive Evaluation
- 7 Conclusion and Future Work
- References
- A Case Study
- B Contribution (Alphabetical order)
1 Introduction
In recent years, Large Language Models (LLMs) have evolved rapidly, integrating deeper reasoning and extending into complex multimodal domains. Recent advances, such as OpenAI GPT-5.5 (openai2026gpt55), Claude Opus 4.8 (anthropic2026claude), Gemini 3.5 Flash (deepmind2026gemini), and Qwen3.7 (qwenteam2026qwen), demonstrate substantial progress in multimodal reasoning, long-context understanding, and specialized tool execution. These models increasingly exhibit the ability to decompose complex problems and process dynamic, information-dense visual streams.
Building upon the strong foundation of our previous works, Keye-VL (kwaikeye2025vl) and Keye-VL-1.5 (yang2025kwaikeyevl15), which established robust vision-language alignment and state-of-the-art short-video comprehension, we introduce Kwai Keye-VL-2.0-30B-A3B. As we push the model’s frontier from short-form visual perception to long-horizon agentic reasoning, two critical roadblocks naturally emerge. First is the infrastructural bottleneck: scaling to extreme video contexts incurs prohibitive computational and memory costs. Second is the algorithmic dilemma: integrating complex, heterogeneous agent tasks often induces catastrophic forgetting of foundational reasoning capabilities. In this report, we detail how Keye-VL-2.0 elegantly resolves these dual challenges by introducing paradigm-shifting improvements in architecture and alignment, focusing on the following two key aspects.
Extreme Context Scaling via Multimodal DSA. A key insight from extending multimodal models to 256K contexts is that standard dense attention mechanisms inevitably lead to catastrophic KV cache expansion and computational walls, forcing models to sacrifice temporal continuity through aggressive frame subsampling. To overcome this latency and scalability limit, Keye-VL-2.0 introduces a 30-billion-parameter Mixture-of-Experts (MoE) (shazeer2017outrageously) foundation with only 3 billion active parameters, ensuring exceptional deployment efficiency. Crucially, we pioneer the application of Multimodal DeepSeek Sparse Attention (DSA) (deepseek2025v32) in the visual domain. By compressing and sparsifying video feature aggregation, DSA effectively constrains the linear growth of the KV cache. This design allows Keye-VL-2.0 to process 256K extreme long-video contexts losslessly, transforming video understanding from frame-limited perception to global-context reasoning while maintaining high inference efficiency.
Resolving Modality Conflict via Cross-Modal MOPD. Existing models often suffer from the "Multimodal Alignment Dilemma": directly injecting complex video understanding and tool-use capabilities often triggers catastrophic forgetting, degrading the model's foundational STEM, mathematical, and linguistic reasoning abilities. To circumvent the ambiguity and instability of end-to-end co-optimization, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) (mimoteam2026v2flash). Through a dynamic routing mechanism, MOPD leverages specialized teacher models to provide dense token-level feedback on student-generated trajectories across varying modalities and tasks during post-training. By supervising on-policy rollouts, MOPD effectively isolates task-specific expertise, which is subsequently distilled and merged back into the unified MoE backbone. This bidirectional alignment ensures that Keye-VL-2.0 achieves substantial gains in native agent capabilities (e.g., Code, Tool Use, Web Search) while robustly preserving its general-purpose reasoning baselines. During post-training, we further apply Context-RL (lu2026contextrl) and Video-RL, equipped with bucket advantage scaling, to stabilize long-sequence decision trees and systematically reduce visual hallucinations.
As illustrated in Figure 1, Keye-VL-2.0-30B-A3B demonstrates highly competitive performance against both open-source and closed-source state-of-the-art models. It achieves top-tier results in fine-grained temporal localization across TimeLens benchmarks (zhang2025timelens) (ActivityNet, QVHighlights, Charades), substantially outperforming leading models such as Gemini-3-Flash (deepmind2025gemini3flash) on several temporal grounding settings. Furthermore, it exhibits a non-linear capability scale-up on extreme long-context evaluations, including Video-MME-v2 (fu2026videomme2) and LongVideoBench (wu2024longvideobench), proving its robustness in processing extensive temporal information. By resolving the dual challenges of context scaling and multi-task capability conflict, we offer a powerful, efficient, and versatile multimodal foundation for the open-source community, enabling researchers and developers to explore, refine, and deploy scalable multimodal applications.
2 Model Architecture
Following the standard multimodal large language model (MLLM) paradigm, the model consists of four core components:
- Vision Encoder (ViT): inherited from Keye-VL-1.5-8B (kwaikeye2025vl) to extract visual features from images and video frames.
- Language Decoder (LLM): built on Qwen3-30B-A3B-Thinking-2507 (qwen3), providing strong general knowledge, instruction following, and reasoning capabilities.
- MLP Projector: randomly initialized and trained in Stage 0 (Section 3.1) to align visual features with the LLM representation space.
- Sparse Attention Module: a GQA-compatible DSA design that combines global MQA-based indexing with grouped GQA aggregation, enabling efficient long-context multimodal modeling.
On top of this backbone, we introduce three architectural designs for high-resolution and long-context multimodal understanding: a native-resolution vision encoder, a unified visual encoding strategy for images and videos, and DSA-based sparse attention for 256K multimodal contexts.
2.1 Native-Resolution Vision Encoder
The visual encoders of MLLMs have gradually shifted from reusing fixed-resolution backbones to native-resolution modeling. Conventional fixed-resolution ViTs are usually pre-trained for contrastive image-text matching, and their coarse-grained representations are often insufficient for detail-sensitive downstream tasks such as OCR, fine-grained recognition, document understanding, and video understanding.
Two main approaches have been explored for high-resolution inputs. Dynamic tiling methods, such as InternVL3 (internvl3) and MiniCPM-V (minicpm_v), split large images into smaller crops before feeding them into fixed-resolution encoders. While effective, tiling may disrupt global structure and introduce redundant computation. Native-resolution methods, such as NaViT (navit), Qwen3.5 (qwen3.5), K2.5 (team2026kimi), and MiMO (mimoteam2026v2flash), instead preserve the original image size and aspect ratio. Following this direction, Keye-VL-2.0-30B-A3B encodes images and videos at their native resolutions, avoiding unnecessary cropping or tiling and preserving both global structure and local details.
This design is especially important for document, OCR, chart, and video scenarios, where a small local region may determine the final answer. By keeping the original aspect ratio throughout the visual pipeline, the encoder can preserve layout relations, object geometry, and fine-grained text signals that are easily distorted by fixed-size resizing.
Adaptive Position Encoding.
The visual encoder inherits the ViT backbone from Keye-VL-1.5, which is based on SigLIP-400M-384-14 (siglip). To support variable resolutions while retaining pre-training benefits, we interpolate the fixed absolute learnable position embeddings, allowing them to scale with the input size.
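The interpolation step above can be illustrated with a minimal sketch. It assumes bilinear interpolation with corner alignment over a 2D grid of learned embeddings; the report does not state which interpolation mode is used, and the function name `interpolate_pos_embed` is hypothetical.

```python
def interpolate_pos_embed(grid, new_h, new_w):
    """Bilinearly resize a learned position-embedding grid.

    grid: list of rows; each row is a list of embedding vectors (lists of floats).
    Returns a new_h x new_w grid of embedding vectors.
    Assumption: bilinear, align-corners style resampling.
    """
    old_h, old_w = len(grid), len(grid[0])
    dim = len(grid[0][0])

    def src_coord(i, new_n, old_n):
        # Map a target index to a fractional source index.
        if new_n == 1 or old_n == 1:
            return 0.0
        return i * (old_n - 1) / (new_n - 1)

    out = []
    for y in range(new_h):
        fy = src_coord(y, new_h, old_h)
        y0 = int(fy)
        y1 = min(y0 + 1, old_h - 1)
        wy = fy - y0
        row = []
        for x in range(new_w):
            fx = src_coord(x, new_w, old_w)
            x0 = int(fx)
            x1 = min(x0 + 1, old_w - 1)
            wx = fx - x0
            # Blend the four neighbouring embeddings, dimension by dimension.
            vec = [
                (1 - wy) * ((1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d])
                + wy * ((1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d])
                for d in range(dim)
            ]
            row.append(vec)
        out.append(row)
    return out
```

Resizing a 2x2 grid to 3x3 with this sketch keeps the four corner embeddings unchanged and places their average at the center, which is the behavior needed to reuse pre-trained embeddings at a new resolution.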
2D RoPE.
On top of adaptive absolute position encodings, we introduce 2D Rotary Position Embedding (2D RoPE). This improves spatial modeling and extrapolation across visual dimensions, especially for extremely high-resolution images.
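As a rough illustration of 2D RoPE, the sketch below rotates the first half of a feature vector by the patch row index and the second half by the column index. This half-and-half split, the base of 10000, and the function names are assumptions made for illustration; the report does not specify the exact layout.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


def rope_2d(vec, row, col):
    """Row rotation on the first half of the dims, column rotation on the second.

    Assumption: len(vec) is divisible by 4 so that each half has whole pairs.
    """
    half = len(vec) // 2
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Under this formulation the query-key dot product depends only on the relative row and column offsets between two patches, which is the property that supports extrapolation to larger image grids.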
Sequence Packing.
We combine adaptive position encodings and 2D RoPE with NaViT’s Patch n’ Pack mechanism and FlashAttention. Samples with different sizes and aspect ratios can therefore be packed into a single batch without padding waste, improving training throughput under variable resolutions.
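The packing idea can be sketched as follows. This is a simplified greedy first-fit packer, not the authors' implementation; `pack_sequences`, `cu_seqlens`, and the token cap are illustrative names and values. The cumulative-boundary list mirrors the kind of offsets that variable-length attention kernels consume so that attention stays within each sample.

```python
def pack_sequences(seq_lens, max_tokens):
    """Greedy first-fit packing of variable-length patch sequences.

    seq_lens: number of visual tokens per sample.
    Returns a list of packs, each a list of sample indices.
    """
    packs, loads = [], []
    for idx, n in enumerate(seq_lens):
        if n > max_tokens:
            raise ValueError(f"sample {idx} exceeds the pack capacity")
        for p, load in enumerate(loads):
            if load + n <= max_tokens:
                packs[p].append(idx)
                loads[p] += n
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([idx])
            loads.append(n)
    return packs


def cu_seqlens(seq_lens, pack):
    """Cumulative token boundaries of the samples inside one pack."""
    bounds = [0]
    for idx in pack:
        bounds.append(bounds[-1] + seq_lens[idx])
    return bounds
```

For example, four samples with 600, 300, 500, and 100 tokens fit into two packs of at most 1000 tokens each, with no padding tokens in the first pack.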
Distribution-Aligned ViT Pre-training.
During ViT pre-training, we optimize the native-resolution architecture with the SigLIP loss and align it with the SigLIP-400M-384-14 text tower. To reduce the supervision-granularity gap between contrastive pre-training and downstream MLLM tasks, we train the visual encoder on the same data distribution as the downstream MLLM. The pre-training corpus contains 500B tokens from large-scale open-source datasets, including DataComp (datacomp), LAION (laion), CC12M (cc12m), PD12M (pd12m), and COCO (coco), together with high-quality internal data.
Overall, these upgrades allow the visual encoder to inherit the strong representation quality of SigLIP while gaining the resolution flexibility required by downstream multimodal reasoning tasks.
2.2 Unified Visual Encoding
To provide the language decoder with detailed visual signals, we use a unified dynamic-resolution encoding strategy for both images and videos.
- Dynamic-Resolution Image Encoding. Static images are encoded directly by the dynamic-resolution ViT. The number of visual tokens is allocated according to the original pixel size, avoiding lossy resizing or cropping.
- Dynamic-Resolution Video Encoding. For video inputs with varying frame rates, resolutions, and durations, each sampled frame is treated as an independent high-resolution image and encoded by the same visual encoder. To preserve temporal information, we prepend a natural-language timestamp to the visual tokens of each frame during preprocessing. This explicit timestamp injection helps the LLM perceive temporal order, causality, and absolute timing.
  Compared with designing a separate video-specific encoder, this frame-as-image formulation keeps the visual pathway simple and unified. The timestamp text provides temporal anchors in the LLM's native language space, making temporal localization and cross-frame reasoning easier to learn from instruction and RL data.
- Adaptive Video Pixel Budget. To balance information density and computation, we allocate a base pixel budget by dividing the global budget evenly across videos. In adaptive mode, the budget of each video is scaled according to duration, with thresholds of 256 s, 512 s, 1024 s, and 2048 s corresponding to scaling factors of 0.125, 0.25, 0.5, and 1.0. Videos longer than 2048 s use the full base budget. This compresses short, redundant videos more aggressively while allowing longer videos to retain more visual evidence, keeping the total token cost controllable.
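The adaptive pixel budget can be written as a small lookup. The sketch below assumes each threshold is an inclusive upper bound on duration (for example, a video of up to 256 s receives a factor of 0.125); the report does not spell out the boundary handling, and `video_pixel_budget` is a hypothetical name.

```python
def video_pixel_budget(global_budget, durations_s, adaptive=True):
    """Per-video pixel budgets from a shared global budget.

    durations_s: video durations in seconds.
    Assumption: thresholds are inclusive upper bounds on duration.
    """
    base = global_budget / len(durations_s)
    tiers = [(256, 0.125), (512, 0.25), (1024, 0.5), (2048, 1.0)]
    budgets = []
    for duration in durations_s:
        scale = 1.0  # videos longer than 2048 s keep the full base budget
        if adaptive:
            for limit, factor in tiers:
                if duration <= limit:
                    scale = factor
                    break
        budgets.append(base * scale)
    return budgets
```

With a global budget of 8000 units shared by four videos of 100 s, 300 s, 1000 s, and 5000 s, the base budget is 2000 each and the scaled budgets are 250, 500, 1000, and 2000.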
2.3 DSA for Long-Context Multimodal Modeling
Traditional full attention has quadratic complexity with respect to sequence length, making 256K multimodal contexts difficult to support. To address this bottleneck, we integrate DeepSeek Sparse Attention (DSA) (deepseek2025v32) into the decoder attention pathway. Unlike most existing DSA systems that are adapted from MLA, Keye-VL-2.0-30B-A3B integrates DSA with a GQA-based MLLM backbone, providing a practical route for long-context extension in GQA models.
2.3.1 MQA-Style Lightning Indexer
DSA contains a Lightning Indexer and a fine-grained token selection mechanism. For throughput, the indexer follows the MQA key-sharing design and computes a global index score $I_{t,s}$ between the current query token $h_{t}$ and each preceding token $h_{s}$:
$$I_{t,s}=\sum_{j=1}^{H^{I}}w_{t,j}^{I}\cdot\mathrm{ReLU}\left(q_{t,j}^{I}\cdot k_{s}^{I}\right) \tag{1}$$
Here, $H^{I}$ is the number of indexer heads, $q_{t,j}^{I}$ and $w_{t,j}^{I}$ are derived from $h_{t}$, and $k_{s}^{I}$ is the shared key derived from $h_{s}$. After obtaining the global scores, the Top-$k$ tokens form the sparse index set:
$$\Omega_{t}=\{s\mid I_{t,s}\in\mathrm{Top}\text{-}k(I_{t,:})\} \tag{2}$$
By sharing one key head across all query heads, the indexer substantially reduces both computation and memory traffic. Together with an FP8 implementation and the ReLU-based scoring function, the Lightning Indexer remains efficient even when the multimodal sequence contains hundreds of thousands of tokens.
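A toy version of the indexer score in Eq. (1) and the Top-k selection in Eq. (2) is sketched below for a single query token. It uses plain Python lists in full precision, so the FP8 arithmetic and kernel-level optimizations are not modeled; the function names are illustrative.

```python
def index_scores(q_heads, w_heads, keys):
    """Index scores of one query token against every preceding token.

    q_heads: one query vector per indexer head.
    w_heads: one scalar weight per indexer head.
    keys:    one shared indexer key per preceding token (MQA-style sharing).
    """
    scores = []
    for key in keys:
        total = 0.0
        for q, w in zip(q_heads, w_heads):
            # ReLU keeps only non-negative query-key agreement.
            total += w * max(0.0, sum(a * b for a, b in zip(q, key)))
        scores.append(total)
    return scores


def top_k_indices(scores, k):
    """Positions of the k highest scores, returned in ascending position order."""
    order = sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)
    return sorted(order[:k])
```

Because every head reads the same key per token, the key cache for the indexer holds one vector per token regardless of the number of indexer heads.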
2.3.2 GQA Sparse Aggregation
In the GQA backbone, query heads are divided into $G$ groups, and heads within each group share a group-specific KV head. We apply the same sparse index set $\Omega_{t}$ to all groups. For the $g$-th group, the sparse attention output is:
$$u_{t,g}=\mathrm{Attn}\left(h_{t,g},\{c_{s,g}\mid s\in\Omega_{t}\}\right) \tag{3}$$
The outputs of all groups are concatenated to form the final attention representation. We set $k=2048$, reducing the core attention complexity from $O(L^{2})$ to $O(Lk)$, where $L$ is the sequence length and $k\ll L$.
This global MQA-style indexing + GQA aggregation design preserves the representation structure of the GQA backbone while avoiding dense attention over the full context. As a result, the model can perform long-range spatiotemporal and semantic aggregation with much lower memory and compute cost.
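The aggregation in Eq. (3) can be sketched as softmax attention restricted to the selected positions, with the same index set reused for every KV group. For brevity the sketch uses one query head per group and scaled dot-product scoring; both are simplifying assumptions, and the function names are illustrative.

```python
import math


def sparse_group_attention(query, keys, values, selected):
    """Softmax attention for one query head over the selected token ids only."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(a * b for a, b in zip(query, keys[s])) for s in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    norm = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[s][d] for w, s in zip(weights, selected)) / norm
        for d in range(dim)
    ]


def gqa_sparse_output(group_queries, group_keys, group_values, selected):
    """Apply one shared index set to every KV group and concatenate the outputs."""
    out = []
    for q, ks, vs in zip(group_queries, group_keys, group_values):
        out.extend(sparse_group_attention(q, ks, vs, selected))
    return out
```

Each query token attends to at most $k$ entries instead of all $L$ positions. If 256K is read as 262,144 tokens, then with $k=2048$ the per-token attention work at the end of the sequence shrinks by a factor of $L/k = 128$.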
2.3.3 Dense Warm-up and Sparse Adaptation
We train DSA with a two-stage strategy. In the dense warm-up phase, the main model keeps dense GQA while most parameters are frozen. The goal is to initialize the indexer so that its global distribution can cover the attention distributions of all GQA groups. For query token $t$, we aggregate dense attention scores within each group, normalize them over visual and text tokens, and obtain the target distribution $p_{t,:,g}$. The warm-up loss is:
$$\mathcal{L}_{\mathrm{warmup}}^{I}=\sum_{t}\sum_{g=1}^{G}\mathbb{D}_{KL}\left(p_{t,:,g}\parallel\mathrm{Softmax}(I_{t,:})\right) \tag{4}$$
This stage uses approximately 2B multimodal tokens.
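For a single query token, the warm-up objective in Eq. (4) reduces to a sum of KL divergences between each group's target distribution and the softmax of the indexer scores. The sketch below assumes the targets are already normalized; the helper names and the `eps` clamp are illustrative choices, not details from the report.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def warmup_loss(group_targets, index_scores):
    """Sum over GQA groups of KL(p_g || softmax(I)) for one query token."""
    q = softmax(index_scores)
    return sum(kl_divergence(p, q) for p in group_targets)
```

The loss is zero only when every group's target equals the indexer distribution, so a single indexer is pushed toward a distribution that covers all groups at once.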
In the sparse adaptation phase, all parameters are unfrozen and training switches to sparse mode. The indexer continues to align with the main model's attention distribution, but the KL loss is computed only over the selected Top-$k$ token set
$$S_{t}=\{s\mid I_{t,s}\in\mathrm{Top}\text{-}k(I_{t,:})\} \tag{5}$$
The target distribution is truncated and renormalized over $S_{t}$:
$$\mathcal{L}_{\mathrm{sparse}}^{I}=\sum_{t}\sum_{g=1}^{G}\mathbb{D}_{KL}\left(p_{t,S_{t},g}\parallel\mathrm{Softmax}(I_{t,S_{t}})\right) \tag{6}$$
The main model is optimized with the standard next-token prediction loss. To reduce gradient interference, the indexer input is detached from the computation graph. The final objective is:
$$\mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{NTP}}+\lambda\mathcal{L}_{\mathrm{sparse}}^{I} \tag{7}$$
This two-stage training strategy avoids forcing the sparse indexer to learn from scratch under sparse supervision. Dense warm-up first aligns the indexer with the behavior of the original attention model, and sparse adaptation then teaches the full model to rely on dynamically selected evidence. In long-context evaluations, this design preserves the capabilities of the dense model while substantially reducing inference overhead.
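The truncation and renormalization used in the sparse adaptation loss (Eq. 6) can be sketched for one query token as follows. The helper names, the `eps` clamp, and the handling of a group with zero mass on the selected set are assumptions for illustration.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_loss(group_targets, index_scores, k, eps=1e-12):
    """Sum over groups of KL(p_g restricted to S_t || softmax(I restricted to S_t))."""
    order = sorted(range(len(index_scores)), key=lambda s: index_scores[s], reverse=True)
    selected = sorted(order[:k])  # the Top-k set S_t
    q = softmax([index_scores[s] for s in selected])
    loss = 0.0
    for p in group_targets:
        mass = sum(p[s] for s in selected)
        if mass <= 0.0:
            continue  # assumption: skip groups with no target mass on S_t
        p_sel = [p[s] / mass for s in selected]  # truncate, then renormalize
        loss += sum(
            pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p_sel, q) if pi > 0
        )
    return loss
```

Target mass that falls outside the selected set is discarded before renormalization, so the indexer is only asked to rank the tokens it has already selected.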
3 Pre-Training
Keye-VL-2.0-30B-A3B is pre-trained through a four-stage curriculum. Stage 0 trains only the Projector to initialize visual-language mapping. Stage 1 performs full-parameter multimodal pre-training at 32K context length. Stage 2 extends the context to 64K and injects task-oriented capabilities such as OCR, VQA, STEM, GUI, grounding, counting, coding, tool use, and search. Stage 3 further extends the context to 256K, focusing on long videos, multi-page documents, multi-document inputs, and long-range agent trajectories.
3.1 Stage 0 — Projector Initialization
Stage 0 establishes the initial connection between the visual encoder and the language model. The ViT and LLM are frozen, and only the Projector is trained to map visual features from the Keye-VL-1.5 ViT into the LLM representation space. This stage uses image-text caption data for direct semantic alignment and image-text interleaved data to expose the Projector to real document-like contexts where multiple images and text appear alternately.
Because only the Projector is updated, Stage 0 provides a low-risk initialization step before full-parameter multimodal training. It allows the LLM to receive visual features in a compatible representation space without disturbing the already learned capabilities of either the ViT or the language backbone.
3.2 Stage 1 — General Multimodal Pre-training
Stage 1 trains all parameters with a maximum sequence length of 32K on approximately 1T tokens. The goal is to establish stable vision-language alignment, image perception, video understanding, OCR recognition, and general language capability.
The training data consist of image-text captions, interleaved image-text data, interleaved video data, pure-text QA, and OCR data. For video learning, each video is split into 15-second segments, each paired with a caption and organized as an interleaved multimodal sequence. This enables the model to understand temporal continuity, scene transitions, and event evolution.
To improve caption quality from large-scale open-source corpora such as LAION, DataComp, COYO, and CC12M (laion; datacomp; coyo; cc12m), we apply two strategies: Recaption regenerates captions directly from the images using an expert captioning model to ensure high quality, while Remake builds on the original captions, correcting grammatical and expression errors without altering their semantics. OCR data cover diverse types such as stylized text, handwriting, LaTeX formulas, code, charts, and documents, giving the model an early foundation in text recognition and fundamental perception capabilities.
This diversity of data types—spanning image caption, interleaved image-text, interleaved video-text, pure-text, and OCR data—is designed to prevent the model from over-specializing in any single input format. Specifically, image caption data provide dense semantic alignment between an image and its description; interleaved image-text data expose the model to the organization of images and text within webpages and documents; interleaved video-text data introduce temporal and cross-modal structure; and OCR data establish sensitivity to visual text and structured page layouts from the outset of multimodal training.
Figure 2: The Keye-VL-2.0-30B-A3B pre-training pipeline, following a four-stage curriculum from projector initialization to 256K long-context multimodal training.
3.3 Stage 2 — Multi-Task Capability Injection
Stage 2 continues full-parameter training, extends the context length to 64K, and trains on approximately 2T tokens. Compared with Stage 1, this stage emphasizes task-oriented and verifiable supervision.
3.3.1 Image-Text Data
Advanced OCR.
We further expand the OCR data to receipts and diverse chart types, drawing on both real samples and synthetic samples generated from XML or structured templates. Augmentations such as blur, lighting variation, wrinkles, handwriting variation, and geometric distortion are applied to improve robustness for field extraction, table reading, and chart understanding.
Real samples consist of images captured in real-world settings, carrying authentic noise and naturally occurring business layouts, whereas synthetic samples offer precise field-level and structure-level annotations. Combining the two improves both robustness and supervision precision.
Math & STEM.
We introduce visual problem-solving examples spanning mathematics, physics, and chemistry, with LLM-based verification to ensure the quality of questions and answers and to remove low-quality samples. This helps the model progress from recognizing formulas and diagrams to reasoning over them. The covered scenarios include geometry diagrams, function plots, experimental apparatus, formulas, and scientific charts. The goal is to enhance the model’s generalization across diverse visual scenarios.
Caption Refinement.
High-quality caption data is retained to preserve vision-language alignment. An 8B caption expert model trained with VCap (vcap) generates detailed descriptions emphasizing attributes, spatial relations, subject binding, and long-description consistency. The resulting captions are more comprehensive, fine-grained, and of higher quality than those used in Stage 1, surpassing even those produced by larger models.
GUI.
GUI tasks involve screen captures, control metadata, and interaction semantics, covering element localization, control recognition, page understanding, operation description, and GUI-QA. GUI inputs differ substantially from natural images, featuring dense small text, icons, regular layouts, and frequent state changes. Training on dedicated GUI tasks therefore lays the foundation for subsequent vision-based clicking, navigation, and task execution.
Grounding & Counting.
For localization and counting, we synthesize instance-pasting data. Candidate objects from COCO (coco) and OpenImages (openimages) are verified by an MLLM and pasted onto backgrounds with controllable quantities and positions, producing precise bounding boxes and count annotations.
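A minimal sketch of the instance-pasting idea is shown below: object crops of known size are placed at random positions on a background canvas, and the paste locations directly yield bounding boxes and a count. The non-overlap constraint, the retry limit, and the function names are assumptions for illustration; actual image compositing and the MLLM verification step are omitted.

```python
import random


def boxes_overlap(a, b):
    """True if two (x0, y0, x1, y1) boxes share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def paste_instances(canvas_w, canvas_h, obj_sizes, seed=0, max_tries=100):
    """Place objects of the given (w, h) sizes and return boxes plus the count."""
    rng = random.Random(seed)
    boxes = []
    for w, h in obj_sizes:
        if w > canvas_w or h > canvas_h:
            continue  # object cannot fit on this background
        for _ in range(max_tries):
            x0 = rng.randint(0, canvas_w - w)
            y0 = rng.randint(0, canvas_h - h)
            box = (x0, y0, x0 + w, y0 + h)
            if not any(boxes_overlap(box, other) for other in boxes):
                boxes.append(box)
                break
    # The count label is derived from what was actually pasted.
    return {"boxes": boxes, "count": len(boxes)}
```

Because the annotation is produced by the generator itself, the count and the boxes are exact by construction, which is what makes this kind of data suitable for precise grounding and counting supervision.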
General QA, E-Commerce, and Chinese Expansion.
We supplement general VQA data for object recognition, attributes, scenes, and relations, and introduce Kuaishou e-commerce data for product understanding. To improve Chinese coverage, high-quality English task data are translated into Chinese, regenerated under original answer constraints, and filtered for consistency.
3.3.2 Pure-Text Data
Stage 2 retains pure-text training to preserve the LLM’s language, reasoning, coding, tool-use, and instruction-following capabilities. The mixture spans pure-text math and STEM reasoning, code corpora for bug fixing and competitive programming, Hermes-style tool-use trajectories (hermes), and search/RAG examples covering single-document QA, multi-document summarization, evidence localization, and multi-hop reasoning over Wiki-based knowledge graphs.
The pure-text component mitigates the degradation of the base model’s language-side capabilities during multimodal fine-tuning, while supplying transferable priors for reasoning formats, code generation, tool invocation, and evidence-based answer synthesis.
3.4 Stage 3 — Long-Context Extension
Stage 3 extends the maximum sequence length to 256K while continuing full-parameter training. Long-context and short-context samples are mixed at a ratio of 1:1 to improve ultra-long sequence modeling while preserving performance on conventional inputs.
The data cover long videos, long documents, multi-document inputs, long multi-image conversations, long code contexts, and long-range agent trajectories. The objective is not only to enlarge the context window, but also to improve retrieval, aggregation, and cross-position reasoning over ultra-long multimodal inputs.
For long videos, the model must track key events and integrate evidence across many frame segments. For long documents and multi-document inputs, it must process multi-page OCR, tables, layouts, and long textual evidence. For agent trajectories, it must maintain task state across many tool calls and interaction turns.
3.5 Data Cleaning and Production Pipeline
We build a unified data cleaning, deduplication, and production pipeline. Cleaning removes low-quality text, invalid formats, unsafe content, duplicated samples, and unreliable image-text or video-text pairs. Deduplication uses a joint Hash + CLIP strategy: Hash identifies exact or near duplicates, while CLIP similarity detects semantically similar cross-modal duplicates.
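The two-level deduplication can be sketched as a hash pass followed by an embedding-similarity pass. In the sketch below, SHA-256 stands in for the hash stage and therefore catches only exact duplicates, the embeddings are assumed to be precomputed by a CLIP-style encoder, and the 0.95 threshold is a hypothetical value; none of these specifics come from the report.

```python
import hashlib
import math


def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def deduplicate(samples, sim_threshold=0.95):
    """samples: list of (raw_bytes, embedding). Returns indices of kept samples."""
    seen_hashes = set()
    kept = []
    for i, (raw, emb) in enumerate(samples):
        digest = content_hash(raw)
        if digest in seen_hashes:
            continue  # exact duplicate caught by the hash stage
        if any(cosine(emb, samples[j][1]) >= sim_threshold for j in kept):
            continue  # semantic duplicate caught by embedding similarity
        seen_hashes.add(digest)
        kept.append(i)
    return kept
```

The pairwise similarity scan is quadratic in the number of kept samples; a production pipeline would more plausibly use an approximate nearest-neighbor index, which this sketch leaves out.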
For production efficiency, the pipeline supports caption generation, OCR construction, QA generation, translation, quality evaluation, one-click deployment, task monitoring, and checkpoint-based resumption. A dual-queue asynchronous mechanism decouples CPU-side preprocessing from GPU-side inference, improving production throughput by 3–5× in practice.
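One way to read the dual-queue mechanism is as a producer-consumer pipeline with one queue feeding preprocessed inputs to the inference worker and a second queue collecting results. The sketch below follows that interpretation using Python threads; the queue roles, the bounded queue size, and the `preprocess` and `infer` callables are assumptions, and error handling is omitted.

```python
import queue
import threading


def run_pipeline(items, preprocess, infer, maxsize=8):
    """Overlap CPU-side preprocessing with (simulated) GPU-side inference."""
    ready_q = queue.Queue(maxsize=maxsize)  # preprocessed inputs awaiting inference
    done_q = queue.Queue()                  # finished results
    stop = object()                         # sentinel marking the end of the stream

    def cpu_worker():
        for item in items:
            ready_q.put(preprocess(item))  # blocks when inference falls behind
        ready_q.put(stop)

    def gpu_worker():
        while True:
            batch = ready_q.get()
            if batch is stop:
                done_q.put(stop)
                return
            done_q.put(infer(batch))

    workers = [
        threading.Thread(target=cpu_worker, daemon=True),
        threading.Thread(target=gpu_worker, daemon=True),
    ]
    for w in workers:
        w.start()

    results = []
    while True:
        out = done_q.get()
        if out is stop:
            break
        results.append(out)
    for w in workers:
        w.join()
    return results
```

The bounded input queue gives backpressure: preprocessing runs ahead of inference by at most `maxsize` items, so neither side idles while the other has work available.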
Table 1: Video training setup across pre-training stages.
| Stage | Max Video Duration | Video Tokens |
|---|---|---|
| 1 | 15s | 24K |
| 2 | 15min | 64K |
| 3 | 2h | 180K |
3.6 Video Pre-Training Curriculum
To scale from short-video understanding to high-resolution long-video reasoning, we adopt a multi-stage video curriculum, summarized in Table 1.
Stage 1 uses short 15-second clips for video-language alignment. Stage 2 increases duration and resolution, and introduces temporal video grounding (TVG) data. Stage 3 extends the maximum duration to 2 hours, requiring the model to identify sparse but critical moments and aggregate long-range evidence. In mid-training, we keep the Stage 3 setup while adding richer tasks such as video captioning, temporal QA, entity recognition, scene understanding, causal reasoning, event ordering, and video counting.
Figure 3: An example of scene-wise dense caption. Each video is decomposed into scenes annotated with timestamps, dense captions, and a global overview.
Scene-Wise Dense Caption.
We reformulate dense video captioning as structured scene-wise descriptions with start and end timestamps. As shown in Figure 3, this formulation enhances scene-boundary perception and temporal alignment between descriptions and intervals. An expert model is trained to improve annotation efficiency.
Diverse TVG Data.
Inspired by ETBench (etbench), we construct TVG data covering Referred Action Recognition, Video Highlight Detection, Extractive Video Summarization, and Temporal Event Matching. These tasks provide complementary supervision for temporal perception and reasoning.
4 Post-Training
4.1 Supervised Fine-Tuning and Synthetic CoT
After multimodal pre-training, we conduct SFT to convert visual perception and cross-modal alignment into stable instruction-following behavior. This stage adapts multimodal capabilities to conversational instructions, activates perception, reasoning, long-context, and agentic abilities, and mitigates language degradation during multimodal fine-tuning.
The SFT corpus contains approximately 500B tokens and is organized around modality balance, capability coverage, and long-range modeling. It includes Text NLP, Video, Perception, Reasoning, Agent, and Long-context data. About 40% of the corpus is text-only data, which anchors instruction following, knowledge QA, and textual reasoning.
4.1.1 Multimodal Instruction Mixture
To improve comprehensive instruction-following capabilities in complex multimodal scenarios, we construct a multimodal instruction mixture that covers video understanding, visual perception, cross-modal reasoning, agent-oriented tasks, and long-context modeling. The goal of this mixture is not to simply increase the proportion of visual data, but to encourage complementary capability development across perception, localization, reasoning, and task execution.
Video data are mainly used to strengthen temporal perception and evidence localization. Some samples are formulated as multiple-choice question answering tasks with clue intervals. The model is required to verify candidate temporal segments during reasoning, and to output both the final answer and the supporting intervals in the format [[mm, mm], ...]. This design encourages the model to locate key evidence from continuous video content, rather than relying only on global visual impressions.
Perception data cover OCR, document understanding, chart understanding, general visual question answering, captioning, grounding, counting, and image identification. These tasks improve the model’s ability to extract, recognize, and structure fine-grained visual information. Reasoning data include K12-level exercises, STEM problems, and spatial reasoning tasks, which train the model to perform multi-step inference under both visual and textual constraints.
Agent data consist of code reasoning and tool-use trajectories, enabling the model to learn task decomposition, execution, and feedback integration. Long-context data target long documents, long videos, multi-image inputs, and long-range question answering, with the goal of improving information retention, retrieval, and cross-segment association over extended contexts.
Overall, this instruction mixture is designed to promote capability collaboration across different data sources. Video data enhance continuous scene understanding and temporal evidence localization; perception data improve fine-grained information extraction; reasoning data strengthen cross-modal multi-step inference; agent data support task decomposition and tool interaction; and long-context data improve long-range information modeling. Meanwhile, text-only instruction data are retained to preserve general instruction following and language reasoning capabilities, reducing the risk of degrading pure language performance during multimodal training.
4.1.2 Synthetic CoT
Most multimodal instruction data provide only final answers, offering limited supervision for explicit reasoning and evidence aggregation. We therefore construct Synthetic CoT data from high-quality QA pairs. A strong teacher model generates reasoning traces, which are then filtered through query-level, response-level, and process-level quality checks. For mathematical tasks, a Doubt2Clean second-pass review further cleans doubtful CoT samples across 27 datasets.
The final mixture combines long reasoning data for STEM, counting, video reasoning, and complex image-text QA with direct-answer data for general VQA, caption, OCR, and grounding. This balances explicit reasoning with concise answering and avoids unnecessary verbosity on simple perception tasks.
For continuous perception tasks, such as counting and video understanding, Synthetic CoT provides object-by-object or interval-by-interval verification chains. For example, the model may first verify local evidence in separate regions or candidate video intervals, then aggregate the evidence before producing the final answer. For STEM tasks, the reasoning trace emphasizes planning, symbolic derivation, and result verification. This gives the model learnable supervision for both ordered observation and logical reasoning.
Beyond these offline reasoning traces, we further optimize the model’s reasoning chains through strong-to-weak on-policy distillation, in which a stronger teacher supervises the student on its own on-policy rollouts.
4.2 Reinforcement Learning
4.2.1 Synthetic-Data RL
Fine-grained multimodal reasoning requires accurate visual perception, difference localization, and structured output generation. However, natural multimodal corpora rarely provide fine-grained localization signals, structured annotations, and automatically verifiable supervision simultaneously. To address this limitation, we introduce a synthetic-data RL framework based on procedurally generated difference-recognition tasks. Given two images that differ by controlled edits, the model is required to identify the changes and report them in a predefined structured format. Since the edit operations are fully known during data generation, rewards can be computed through rule-based verification without relying on an additional learned reward model.
Each sample contains an image pair $(I_{A},I_{B})$, where $I_{B}$ is obtained from $I_{A}$ through a set of controlled edits $\mathcal{E}=\{e_{1},\ldots,e_{k}\}$. We instantiate this framework in two forms. For localization-oriented perception tasks, the model predicts normalized bounding boxes for changed regions. For structured reasoning tasks, the model outputs a domain-specific operation-set DSL corresponding to the underlying edits. In our implementation, the structured tasks cover geometry, coordinate-aware geometry, chemical formulas, and physical circuit diagrams. The synthetic setup includes both positive samples with controlled edits and negative samples with no edits, enabling the model to distinguish true changes from unchanged inputs and suppress unsupported predictions. To prevent the model from relying on trivial pixel-level differencing, we introduce difference-irrelevant re-rendering perturbations that preserve the underlying edit set, including color jitter, layout perturbation, slot shuffling, semantic no-op changes, and viewpoint variation.
For perception tasks, predicted boxes $\hat{\mathcal{B}}$ are matched to ground truth $\mathcal{B}$ using Hungarian matching based on IoU. The reward is:
$$R_{\mathrm{perc}}=\lambda_{F}F_{\mathrm{soft}}+\lambda_{I}\frac{\sum_{m\in\mathcal{M}}u_{m}}{\max(|\hat{\mathcal{B}}|,|\mathcal{B}|)}-\lambda_{D}N_{\mathrm{dup}}, \tag{8}$$
where $\mathcal{M}$ denotes the matched set, $u_{m}$ is the IoU of the $m$-th matched pair, $F_{\mathrm{soft}}$ is a soft matching-based F-score, and $N_{\mathrm{dup}}$ penalizes duplicate predictions. For negative samples, empty predictions are rewarded, while unsupported predictions are penalized. For structured domains, rewards are computed by canonicalizing the predicted and ground-truth operation sets and then performing rule-based matching. For coordinate-aware tasks, we further use soft distance-based similarity to tolerate small numerical deviations. This reward design is lightweight, scalable, and independent of learned reward models, while encouraging visual grounding, structural correctness, anti-hallucination behavior, and disciplined structured output generation.
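As an illustration of how a reward of the form in Eq. (8) could be evaluated, the following minimal sketch matches boxes one-to-one and combines the three terms. Several pieces are assumptions rather than details from this report: the brute-force matcher stands in for Hungarian matching (in practice a solver such as `scipy.optimize.linear_sum_assignment` would be used), $F_{\mathrm{soft}}$ is taken to be the harmonic mean of IoU-weighted precision and recall, $|\hat{\mathcal{B}}|$ is counted after removing exact duplicates, and the weights and negative-sample rewards are placeholder values.

```python
from itertools import permutations

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) tuples."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def best_matching(pred, gt):
    """IoUs of the one-to-one matching with maximal total IoU.

    Brute force over permutations: a stand-in for Hungarian matching,
    only practical for a handful of boxes.
    """
    if not pred or not gt:
        return []
    small, large = (pred, gt) if len(pred) <= len(gt) else (gt, pred)
    best_ious, best_total = [], -1.0
    for perm in permutations(range(len(large)), len(small)):
        ious = [iou(small[i], large[j]) for i, j in enumerate(perm)]
        if sum(ious) > best_total:
            best_ious, best_total = ious, sum(ious)
    return best_ious

def perception_reward(pred, gt, lam_f=1.0, lam_i=0.5, lam_d=0.1):
    """Sketch of Eq. (8); weights and negative-sample values are assumed."""
    unique = list(dict.fromkeys(pred))   # drop exact duplicates, keep order
    n_dup = len(pred) - len(unique)
    if not gt:                           # negative sample: no edits present
        return 1.0 if not pred else -1.0
    total = sum(best_matching(unique, gt))
    precision = total / len(unique) if unique else 0.0
    recall = total / len(gt)
    denom = precision + recall
    f_soft = 2 * precision * recall / denom if denom > 0 else 0.0  # assumed form
    iou_term = total / max(len(unique), len(gt))
    return lam_f * f_soft + lam_i * iou_term - lam_d * n_dup
```

With these placeholder weights, a single exactly correct box scores 1.5, and repeating that box once lowers the score by the duplicate penalty.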
4.2.2 General RL
After SFT and distillation, General RL improves reasoning ability, answer reliability, and robustness across multimodal and text-only scenarios. Unlike preference-oriented Alignment RL, General RL focuses on tasks with verifiable ground-truth answers, including general VQA, STEM, chart understanding, mathematical reasoning, logical reasoning, and text-only QA.
Data Construction.
We collect multimodal and text-only QA data from open-source and in-house sources, including FineVision (finevision), MMK12 and MM-Eureka (mm_eureka), Thyme (thyme), mini O3 (minio3), Open-R1 (openr1), self-cognition data, and business-oriented data (kuaimod). The data pipeline includes benchmark decontamination, cross-validation with strong models and judge models, and accuracy-based filtering to remove samples already solved by the initial policy.
Benchmark decontamination removes samples that are visually or semantically too close to public evaluation instances. Cross-validation samples multiple candidate solutions and asks a judge model to select reliable positives, which are then used as reference solutions for ContextRL Reward. Accuracy-based filtering focuses RL on samples that are neither trivial nor too noisy, improving data efficiency.
Reward System.
The reward system, based on Qwen2.5-VL-72B-Instruct (qwen2_5_vl), evaluates format validity, outcome correctness, process correctness, and consistency with verified reference solutions. Additional task-specific rewards are used for self-cognition, business tasks, and structured reasoning.
Format Reward ensures that reasoning and final-answer fields can be parsed. Outcome Reward checks whether the final answer matches the ground truth across heterogeneous answer formats, including options, phrases, numerical answers, and complete solutions. Process Reward penalizes factual errors, invalid deductions, unsupported assumptions, and inconsistent intermediate steps. ContextRL Reward (lu2026contextrl) compares the sampled response with verified reference solutions to effectively reduce false positives caused by correct answers reached through flawed reasoning.
Training Algorithm.
We use Group Sequence Policy Optimization (GSPO) (zheng2025gspo). Given query $x$, the old policy $\pi_{\theta_{\mathrm{old}}}$ samples responses $\{y_{i}\}_{i=1}^{G}$. The objective is:
$$\mathcal{J}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\{y_{i}\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot|x)}\left[\frac{1}{G}\sum_{i=1}^{G}\min\left(s_{i}(\theta)\hat{A}_{i},\mathrm{clip}\left(s_{i}(\theta),1-\epsilon,1+\epsilon\right)\hat{A}_{i}\right)\right], \tag{9}$$
where
$$s_{i}(\theta)=\exp\left(\frac{1}{|y_{i}|}\sum_{t=1}^{|y_{i}|}\log\frac{\pi_{\theta}(y_{i,t}\mid x,y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid x,y_{i,<t})}\right), \tag{10}$$
and
$$\hat{A}_{i}=\frac{\mathcal{R}(x,y_{i})-\operatorname{mean}\left(\{\mathcal{R}(x,y_{j})\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{\mathcal{R}(x,y_{j})\}_{j=1}^{G}\right)+\delta}. \tag{11}$$
We further over-sample responses by enlarging the generation batch size and filter groups with zero advantage variance to improve data efficiency. This filtering keeps training focused on groups where different rollouts receive distinguishable rewards, so each update contains a stronger learning signal.
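The sequence-level ratio, the group-normalized advantage, and the zero-variance group filter can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the training implementation: the record layout (`reward`, `logp_new`, `logp_old`), the use of the population standard deviation, and the default values of `eps` and `delta` are assumptions.

```python
import math
from statistics import mean, pstdev

def sequence_ratio(logp_new, logp_old):
    """Eq. (10): length-normalized sequence-level importance ratio."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

def group_advantages(rewards, delta=1e-6):
    """Eq. (11): rewards standardized within one response group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + delta) for r in rewards]

def gspo_objective(groups, eps=0.2, delta=1e-6):
    """Eq. (9) averaged over groups, skipping zero-variance groups.

    Each group is a list of dicts with keys "reward", "logp_new" and
    "logp_old" (per-token log-probabilities); this layout is hypothetical.
    """
    values = []
    for group in groups:
        rewards = [r["reward"] for r in group]
        if pstdev(rewards) == 0.0:      # every rollout scored the same: no signal
            continue
        terms = []
        for rollout, adv in zip(group, group_advantages(rewards, delta)):
            s = sequence_ratio(rollout["logp_new"], rollout["logp_old"])
            clipped = min(max(s, 1 - eps), 1 + eps)
            terms.append(min(s * adv, clipped * adv))
        values.append(sum(terms) / len(terms))
    return sum(values) / len(values) if values else 0.0
```

A group in which every rollout receives the same reward has all-zero advantages, so dropping it before the update changes no gradient and only frees batch capacity for informative groups.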
4.2.3 Specialized RL
Specialized RL strengthens image-text capabilities including grounding, spatial understanding, mathematical reasoning, counting, and OCR. All experts start from the same general RL checkpoint and use domain-specific data and rewards. Deterministic tasks use rule-based verifiable rewards: IoU and bipartite matching for grounding, symbolic equivalence for math, exact numeric matching for counting, and normalized text matching for OCR. Open-ended spatial reasoning uses a model judge with format constraints.
The purpose of this stage is not to train separate final models, but to obtain strong domain experts that can later contribute capability-specific supervision through distillation and capability consolidation.
Grounding Expert.
For grounding tasks, the reward is designed to measure target-level localization quality while suppressing redundant predictions. We first normalize all boxes to the $[0,1000]$ coordinate range and remove duplicate predicted boxes. For multi-object grounding, predictions are matched to ground-truth boxes through Hungarian matching using IoU as the matching criterion, ensuring a one-to-one correspondence between predicted and target boxes.
Based on the matched pairs, the reward considers both the worst-case and average localization quality. The minimum IoU is used as a thresholding signal to ensure that all targets reach a basic localization quality before the sample enters the high-reward region, while the mean IoU provides continuous feedback for optimization. A duplicate-box penalty is further introduced to reduce redundant outputs and discourage box-spamming. This reward design therefore balances strict target coverage, smooth localization feedback, and output compactness.
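The report does not give the closed form of this reward, so the sketch below is only one hypothetical way to combine the three ingredients it names: a minimum-IoU gate, a mean-IoU term, and a duplicate-box penalty. The gate threshold, the down-weighting factor, the penalty weight, and the treatment of unmatched boxes as IoU 0 are all assumptions.

```python
def normalize_box(box, width, height):
    """Scale a pixel-space (x1, y1, x2, y2) box to the [0, 1000] range."""
    x1, y1, x2, y2 = box
    return (round(1000 * x1 / width), round(1000 * y1 / height),
            round(1000 * x2 / width), round(1000 * y2 / height))

def grounding_reward(matched_ious, n_pred, n_gt, n_dup,
                     min_iou_gate=0.5, low_quality_scale=0.5, dup_penalty=0.1):
    """Hypothetical gated reward built from the IoUs of matched pairs.

    Unmatched predictions or targets are padded with IoU 0 so that missing
    or extra boxes pull down both the minimum and the mean.
    """
    padded = list(matched_ious) + [0.0] * (max(n_pred, n_gt) - len(matched_ious))
    if not padded:
        return 0.0
    mean_iou = sum(padded) / len(padded)
    # The sample enters the high-reward region only if every target is
    # localized at least at the gate threshold.
    scale = 1.0 if min(padded) >= min_iou_gate else low_quality_scale
    return scale * mean_iou - dup_penalty * n_dup
```

Under this form, two targets localized at IoU 0.9 and 0.8 keep their full mean, while a pair at 0.9 and 0.2 is down-weighted because one target falls below the gate.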
Spatial Expert.
The spatial expert targets spatial-relation understanding and embodied spatial reasoning. Unlike grounding or counting, many spatial tasks do not admit a simple deterministic criterion, since the answer may involve relative position, orientation, interaction, or scene-level commonsense. We therefore use a generative model judge to evaluate whether the response satisfies the required spatial relation and scene constraint. The judge assigns a discrete correctness score in $\{-1,0,1\}$, which is combined with a format reward to encourage both spatial correctness and legal answer structure.
Math Expert.
The math expert focuses on mathematical and STEM reasoning. For problems with deterministic answers, we adopt a symbolic-equivalence reward: the model prediction and the reference answer are parsed into canonical forms, and a positive reward is assigned when they are mathematically equivalent. The reward is gated by the required output structure, so that the model is encouraged to produce valid reasoning and final-answer formats. For open-ended solutions that are difficult to canonicalize, we fall back to a generative model judge. This design emphasizes answer correctness while avoiding overfitting to a specific reasoning trace.
Counting Expert.
The counting expert is designed for visual counting tasks, where the target output is usually a discrete integer. Since the correctness criterion is unambiguous, we use an exact numeric-match reward under format gating. A positive reward is given only when the predicted number matches the ground-truth count. This provides a clean and deterministic learning signal, encouraging the model to align visual perception with precise numerical outputs.
OCR Expert.
The OCR expert targets text recognition, document reading, and image-based text understanding. We use a normalized text-match reward, where predictions and references are compared after normalizing case, whitespace, and punctuation. This reward preserves deterministic verification while allowing minor surface-form variations. It encourages the model to improve both recognition accuracy and text completeness in OCR and document-style tasks.
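A normalized text-match reward of this kind might look as follows. This is a minimal sketch under stated assumptions: "normalizing" punctuation is interpreted as removing ASCII punctuation, whitespace runs are collapsed to single spaces, and the reward is binary; the report does not specify these details.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)  # ASCII punctuation only

def normalize_text(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace runs."""
    text = text.lower().translate(_PUNCT_TABLE)
    return " ".join(text.split())

def ocr_reward(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return 1.0 if normalize_text(prediction) == normalize_text(reference) else 0.0
```

For example, `ocr_reward("Hello,  World!", "hello world")` is treated as a match, while any character-level recognition error still yields zero reward.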
4.2.4 Video RL
Video RL further optimizes temporal alignment, event tracking, and long-range information aggregation. Starting from the general RL checkpoint, we train on approximately 31K video samples with GSPO while freezing the visual encoder and vision-language projector.
The data cover temporal video grounding, temporal dense captioning (vcap), frame-level perception, video QA, temporal ordering, and event counting. TVG samples are selected from TimeIT (timeit), mainly from DiDeMo (didemo) and Charades-STA (charades_sta), and are rewarded by temporal IoU. Temporal dense captioning uses an LLM-as-Judge reward over subject recognition, action description, scene information, OCR text, temporal order, hallucination, and coverage. FrameForge synthetic videos provide rule-verifiable supervision for timestamp localization, counting, before/after reasoning, and co-occurrence reasoning. This stage improves general video benchmark performance by approximately 1 percentage point.
These tasks provide complementary video supervision. TVG directly optimizes temporal boundary localization, dense captioning encourages full-video coverage and faithful content organization, and FrameForge supplies low-noise frame-level signals that are difficult to obtain from natural videos.
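For reference, the temporal IoU used as the TVG reward can be computed as below. The function name and the `(start, end)` tuple convention in seconds are illustrative choices, not taken from the report.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0
```

A prediction of 10 s to 20 s against a ground truth of 15 s to 25 s overlaps for 5 s out of a 15 s union, so it receives a reward of one third; disjoint intervals receive zero.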
4.2.5 Agentic RL
Shared Training Protocol.
Agentic RL extends post-training from single-response scoring to multi-step environment interaction. It covers code, tool-use, and search tasks whose supervision comes from executable outcomes or verifiable environment states rather than static response labels. Across these domains, we use environment-grounded rewards, trajectory-level validation, and filtering to remove invalid, unfinished, low-information, or weakly verified rollouts. Completed trajectory groups are optimized with the GSPO objective introduced above, using outcome-based rewards at the trajectory level. To improve rollout utilization under long and variable interaction horizons, we use a shared colocated partial-rollout mechanism: unfinished trajectories are cached and resumed in later rollout steps, while completed groups are immediately consumed for GSPO updates. This shared protocol keeps the optimization objective and rollout scheduling consistent, while allowing each task family to define its own environment interface and reward evidence.
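The partial-rollout scheduling described above can be illustrated with a small simulation. Everything here is a hypothetical sketch: the trajectory dictionaries, the per-step turn `budget`, and the `step_fn` callback are illustrative names, and the real system operates on batched model inference rather than Python loops.

```python
from collections import defaultdict

def rollout_step(active, step_fn, budget):
    """Advance each trajectory by at most `budget` turns in this rollout step.

    Returns (finished, unfinished); unfinished trajectories are cached by the
    caller and passed back in at the next rollout step.
    """
    finished, unfinished = [], []
    for traj in active:
        for _ in range(budget):
            if traj["done"]:
                break
            step_fn(traj)
        (finished if traj["done"] else unfinished).append(traj)
    return finished, unfinished

def collect_complete_groups(finished_pool, group_size):
    """Split finished trajectories into full groups (ready for an update)
    and leftovers whose sibling rollouts are still running."""
    by_prompt = defaultdict(list)
    for traj in finished_pool:
        by_prompt[traj["prompt_id"]].append(traj)
    ready = {pid: ts for pid, ts in by_prompt.items() if len(ts) == group_size}
    remaining = [t for t in finished_pool if t["prompt_id"] not in ready]
    return ready, remaining
```

A group is released for the GSPO update only once all of its rollouts have finished, because the group-normalized advantage needs every reward in the group.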
Coding RL.
For Coding RL, we use both Online Judge and Software Engineering environments. Online Judge tasks evaluate isolated program synthesis through compilation and hidden test execution, with rewards determined by pass rate and execution failures such as time-limit or memory-limit errors. Software Engineering tasks evaluate repository-level issue resolution in containerized environments, where the model must inspect logs, run tests, edit files, and submit patches. Rewards are based on test-suite outcomes, regression checks, trajectory filtering, and auxiliary judge verification of the submitted patch. For repository-level tasks, candidate edits may further pass through a verification-and-integration protocol in which multiple reviewer agents inspect proposed changes, execution traces, and test outcomes, and a separate integration agent consolidates the reviews before final patch acceptance. This setup strengthens algorithmic correctness, repository navigation, iterative repair, and regression-safe code modification.
Tool Use RL.
For tool-use RL, the model interacts with multi-turn stateful environments covering more than 150 simulated API domains. Rewards are computed from valid tool invocation, argument consistency, database or environment post-conditions, and judge-based assessment of whether the final state satisfies the user intent. Unlike code execution rewards, this objective emphasizes correct state transitions, robust recovery from tool errors, and coordination with multi-turn user instructions. Randomized tool and parameter names reduce dependence on memorized API schemas and encourage general tool-calling behavior.
Search RL.
For Search RL, we train on multi-turn search tasks where the model issues retrieval actions, reads returned content, and produces a final answer after one or more retrieval rounds. These trajectories vary with the amount of query refinement, result selection, and content checking required by the task. Rewards are outcome-oriented: final-answer correctness is the primary signal, and when available we also use lightweight verification signals for intermediate search results to reduce off-track retrieval and unsupported answer synthesis. The shared partial-rollout mechanism is useful in this setting because search trajectories have variable horizons; unfinished interactions can be resumed in later rollout steps, while completed trajectory groups are used for GSPO updates.
4.2.6 Cross-Modal Multi-Teacher On-Policy Distillation
Domain-specific post-training introduces heterogeneous capabilities from pure-text, image, video, and agent tasks. Directly mixing them may cause interference, such as overly short responses after reasoning RL or excessive tool-call formatting after agent training. To consolidate these capabilities, Keye-VL-2.0-30B-A3B uses Cross-Modal Multi-Teacher On-Policy Distillation (MOPD).
MOPD maintains 13 RL-trained domain teachers. These teachers span the full capability spectrum, including safety, pure-text math, instruction following, code, visual STEM, OCR, grounding, counting, video, and tool use, among others. Each sample is routed to the teacher that best matches its modality and task type. Given prompt $x_{i}$, the student first generates an on-policy response:
$$y_{i}=(y_{i,1},\ldots,y_{i,T})\sim\pi_{\theta}(\cdot\mid x_{i}). \tag{12}$$
For state $s_{i,t}=(x_{i},y_{i,<t})$, the routed teacher $\pi_{\mathrm{T}}^{r(i)}$ provides token-level feedback. We use Segmented Prompt-Response Re-tokenization (SPRR) to process prompts and responses separately, ensuring strict alignment between teacher log probabilities and student response tokens.
For stable feedback, we define the top-$k$ overlap set:
$$\Omega_{i,t}=\mathrm{TopK}\left(\pi_{\mathrm{T}}^{r(i)}(\cdot\mid s_{i,t})\right)\cap\mathrm{TopK}\left(\pi_{\theta}(\cdot\mid s_{i,t})\right). \tag{13}$$
When $\Omega_{i,t}$ is non-empty, the raw advantage is:
$$A_{i,t}=\sum_{v\in\Omega_{i,t}}\bar{\pi}_{\theta}(v\mid s_{i,t})\left[\log\pi_{\mathrm{T}}^{r(i)}(v\mid s_{i,t})-\log\pi_{\theta}(v\mid s_{i,t})\right], \tag{14}$$
where
$$\bar{\pi}_{\theta}(v\mid s_{i,t})=\frac{\pi_{\theta}(v\mid s_{i,t})}{\sum_{u\in\Omega_{i,t}}\pi_{\theta}(u\mid s_{i,t})}. \tag{15}$$
If $\Omega_{i,t}=\emptyset$, we set $A_{i,t}=0$.
Compared with distilling over the full vocabulary, the overlap estimator focuses supervision on the local distribution region considered plausible by both teacher and student. This avoids unstable comparisons over very low-probability tokens while keeping the training on-policy, since the feedback is computed on states actually visited by the student.
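The overlap estimator of Eqs. (13) to (15) can be written out for a single token position as follows. This is a minimal sketch that represents each distribution as a dictionary from token to probability; the tie-breaking rule in `top_k` and the assumption that all probabilities in the overlap are strictly positive are choices made for the illustration.

```python
import math

def top_k(probs, k):
    """The k most probable tokens; ties are broken by token order."""
    return set(sorted(probs, key=lambda v: (-probs[v], v))[:k])

def overlap_advantage(teacher, student, k):
    """Raw advantage at one position, computed over the top-k overlap set.

    `teacher` and `student` map tokens to probabilities at the same state.
    """
    omega = top_k(teacher, k) & top_k(student, k)           # Eq. (13)
    if not omega:
        return 0.0                                          # empty overlap
    z = sum(student[v] for v in omega)                      # Eq. (15) normalizer
    return sum((student[v] / z) * (math.log(teacher[v]) - math.log(student[v]))
               for v in omega)                              # Eq. (14)
```

When teacher and student agree on the overlap set the advantage is zero, and it becomes negative when the student places its renormalized mass on tokens the teacher considers less likely.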
The student is optimized with an advantage-weighted policy-gradient objective:
$$\mathcal{L}_{\mathrm{MOPD}}=-\mathbb{E}\left[\frac{1}{|\mathcal{M}_{i}|}\sum_{t\in\mathcal{M}_{i}}\widehat{A}_{i,t}\log\pi_{\theta}(y_{i,t}\mid x_{i},y_{i,<t})\right], \tag{16}$$
where $\mathcal{M}_{i}$ is the valid response-token mask. We further apply token-category-aware advantage scaling to down-weight formatting tokens and up-weight perception or reasoning tokens. For long-form generation, repetition collapse is localized at position $\tau_{i}$ and penalized only after the collapse point:
$$\widehat{A}_{i,t}=\widetilde{A}_{i,t}-\lambda_{\mathrm{rep}}\cdot\mathbb{I}[t\geq\tau_{i}]. \tag{17}$$
Together, dynamic teacher routing, SPRR alignment, top-$k$ overlap estimation, token-category scaling, and localized repetition penalties allow MOPD to consolidate heterogeneous RL teachers without forcing all domains into a single response style.
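A per-response version of Eqs. (16) and (17) is sketched below. It assumes that the scaled advantages and the collapse position are already available; how repetition collapse is detected and how token categories are assigned are not specified here and are left out. Positions are 0-indexed, and the value of `lam_rep` is a placeholder.

```python
def penalized_advantages(scaled_advantages, collapse_pos=None, lam_rep=1.0):
    """Eq. (17): subtract lam_rep from every token at or after the collapse point.

    `collapse_pos` is None when no repetition collapse was detected.
    """
    if collapse_pos is None:
        return list(scaled_advantages)
    return [a - lam_rep if t >= collapse_pos else a
            for t, a in enumerate(scaled_advantages)]

def mopd_loss(advantages, logps, mask):
    """Eq. (16) for one response: masked mean of -A_hat * log pi_theta.

    In a real training loop the advantages would be treated as constants
    (no gradient); here everything is a plain float.
    """
    valid = [t for t, m in enumerate(mask) if m]
    if not valid:
        return 0.0
    return -sum(advantages[t] * logps[t] for t in valid) / len(valid)
```

Because the penalty applies only from the collapse point onward, tokens generated before the collapse keep their teacher-derived advantages unchanged.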
Efficient Training and Inference Infrastructure
5.1 Pre-Training Systems
5.1.1 ViT–LM Heterogeneous Parallelism and Load Balancing
To remove I/O bottlenecks from video decoding and frame sampling, we introduce ExtraIO, a horizontally scalable I/O service decoupled from training through an asynchronous pipeline. We further co-design ViT–LM heterogeneous parallelism and two-level load balancing for long-video and variable-length workloads.
ViT–LM Heterogeneous Parallelism.
The ViT and LM are co-located on the same GPU group, but each module adopts its own parallel sharding strategy. This avoids the imbalance caused by binding the ViT to LM PP0. A recompute-or-offload strategy reduces ViT activation memory to nearly zero, leaving more memory for long sequences.
Load Balancing.
Because the ViT processes image/frame-level visual tokens while the LM processes sample-level multimodal sequences, both visual-token ratios and sequence lengths fluctuate. We balance load at the multimodal-token level and the LM-sample level, equalizing compute and memory across ViT DP and LM DP/PP. This improves end-to-end training throughput by approximately 20%.
This design is particularly useful for video-heavy batches, where a few long samples can otherwise create severe bubbles and leave a large fraction of devices underutilized.
5.1.2 DSA Optimization for Variable-Length Sequences
We optimize DSA with FlashInfer (flashinfer) and TileLang (tilelang), achieving more than a $2\times$ speedup over a baseline adapted from open-source code.
Top-$k$ Memory Optimization.
Under packing, the original indexer score has shape $T\times T$, causing memory waste because samples are mutually invisible. We reduce score storage to $T\times\text{max\_seq}$ and use flashinfer.top_k_ragged_transform to compute only over valid KV regions. A chunked indexer is used as a memory-bound fallback for extreme cases.
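The saving can be made concrete with a small sketch that, for a packed batch, lists the KV span each token may attend to and compares the two score-buffer sizes. It is a pure-Python illustration of the bookkeeping and does not call FlashInfer; the function names are hypothetical.

```python
def kv_ranges(seq_lens):
    """For each packed token, the half-open KV span [start, end) it may
    attend to: causal, and restricted to its own sample."""
    ranges, offset = [], 0
    for n in seq_lens:
        for i in range(n):
            ranges.append((offset, offset + i + 1))
        offset += n
    return ranges

def score_elems(seq_lens):
    """Element counts of the dense T x T buffer and the T x max_seq buffer."""
    total = sum(seq_lens)
    return total * total, total * max(seq_lens)
```

For a pack of two samples with lengths 2 and 3, the dense buffer holds 25 scores while the reduced one holds 15; the ratio is $T/\text{max\_seq}$, so the saving grows with the number of samples packed into one sequence.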
Short-Sequence Optimization.
During long-sequence SFT, many samples are short. When the positional index satisfies $i<\text{top}k$, the indexer backward and sparse-attention kernels iterate only over causally attendable KV entries instead of scanning a fixed top-$k$ range, bringing a $1.5\times$ end-to-end speedup.
Indexer Loss.
Instead of storing full indexer and attention scores across layers, we keep only post-top-$k$ indexer scores during forward and recover attention scores inside the sparse-attention backward kernel. This reuses FlashAttention-style backward recomputation (flashattention), avoids additional compute, and releases per-layer intermediates early.
The key observation is that sparse-attention backward already recomputes the post-softmax attention matrix. By extracting and reducing the needed scores inside the backward kernel, the indexer loss can be computed without materializing layer-wise $T\times T$ intermediates.
5.2 Post-Training Systems
5.2.1 DSA Adaptation for RL
For RL training with DSA, we focus on consistency and memory. To avoid mismatch between training and inference Top-$k$ results, we use deterministic Top-$k$ computation. flashinfer.topk replaces torch.topk, achieving a 2–3$\times$ speedup while preserving determinism. For variable-length RL batches, a chunked DSA indexer partitions the score matrix along the sequence dimension, performs Top-$k$ chunk by chunk, and aggregates the results to reduce peak memory.
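The two ideas in this paragraph, deterministic selection and chunked processing, can be illustrated without any GPU library. The sketch below is an assumption-laden toy: it chunks over query rows (one possible reading of "along the sequence dimension"), obtains scores from a hypothetical `score_fn` callback, and makes Top-$k$ deterministic by breaking ties toward the lower index.

```python
def topk_row(scores, k):
    """Indices of the k largest scores; ties go to the lower index, so the
    result does not depend on reduction order."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    return order[:k]

def chunked_topk(score_fn, num_queries, k, chunk):
    """Top-k per query row, materializing at most `chunk` score rows at once.

    `score_fn(q)` returns the indexer scores of query q over its causally
    visible keys 0..q (a hypothetical interface for this sketch).
    """
    selected = []
    for start in range(0, num_queries, chunk):
        stop = min(start + chunk, num_queries)
        rows = [score_fn(q) for q in range(start, stop)]  # only this chunk in memory
        selected.extend(topk_row(row, k) for row in rows)
    return selected
```

Because each row is selected independently and ties are resolved by index, the chunked result is identical to the unchunked one for any chunk size, which is the property needed for training and inference to agree.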
5.2.2 On-Policy Distillation System
The OPD system supports heterogeneous multi-expert teacher scheduling, multimodal alignment, and Top-$k$ distillation. Samples are routed to domain-specific teachers such as Math, Grounding, OCR, and LAN-Instruct, while domains sharing weights reuse the same server instance. To align student and teacher preprocessing, the teacher side reconstructs sequences with its native processor and verifies visual-token counts, positional encodings, and multimodal-text boundaries. Top-$k$ distillation supports Overlap, Student-Only, and Teacher-on-Student modes.
Strict multimodal alignment is essential because even a small mismatch in image token count, video frame sampling, chat template, or mRoPE position can move the KL signal to the wrong response token. The system therefore treats alignment verification as part of the training pipeline rather than an offline debugging step.
5.3 Efficient Inference for GQA+DSA
For ultra-long video inference, we introduce three optimizations.
- Chunk ViT: video frames are split into chunks, processed sequentially by the ViT, and then merged, reducing peak memory without changing model outputs.
- Sparse Attention Optimization: adjacent queries often select similar Top-$k$ KV sets. We deduplicate Top-$k$ sets across adjacent queries and use an MMA Thread Layout-Aware Mask inside the attention kernel. With a 128K context and $\text{top}k=2048$, 16 adjacent queries require only about 8K effective KV tokens.
- Decode Optimization: DSA-specific decode optimizations reduce prefill cost by over $3\times$ and decode cost by over $5\times$ compared with full attention under a 128K context.
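To make the deduplication in the second item concrete: 16 queries with 2048 selections each would naively load 16 × 2048 = 32,768 KV entries, so a shared set of about 8K is roughly a fourfold reduction. The sketch below shows the bookkeeping only, with plain Python sets; it is not the kernel, and the boolean mask is merely an analogue of the in-kernel mask that restores each query's own selection.

```python
def union_kv(topk_sets):
    """Sorted union of the KV indices selected by a tile of adjacent queries."""
    merged = set()
    for selected in topk_sets:
        merged.update(selected)
    return sorted(merged)

def membership_mask(topk_sets, kv_union):
    """mask[r][c] is True iff query r selected the c-th entry of the shared set."""
    column = {kv: c for c, kv in enumerate(kv_union)}
    mask = [[False] * len(kv_union) for _ in topk_sets]
    for r, selected in enumerate(topk_sets):
        for kv in selected:
            mask[r][column[kv]] = True
    return mask
```

The shared KV entries are loaded once for the whole tile, and the mask keeps the attention result of each query identical to what it would compute from its own Top-$k$ set.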
Experiments on H800 GPUs, assuming $2 per GPU-hour, demonstrate the inference efficiency shown in Figure 4.
Figure 4: Inference cost of Keye-VL-2.0-30B-A3B. DSA-specific prefill and decode optimizations reduce the cost of ultra-long video inference relative to dense attention under the same H800 pricing assumption.
Comprehensive Evaluation
Before presenting benchmark-specific analyses, Figure 5 provides a compact overview of Keye-VL-2.0-30B-A3B across video understanding, coding, agentic tool use, mathematical and scientific reasoning, instruction following, and general vision-language benchmarks. We compare with representative open-source and closed-source baselines of similar and larger scales, including Qwen3.5 (qwen3.5), InternVL3.5 (internvl35), GPT-5-mini (openai2025gpt5developers), and Qwen3-VL (qwen3vl). The following subsections give the benchmark definitions, evaluation protocols, and detailed results behind this overall view.
Figure 5: Overall evaluation summary of Keye-VL-2.0-30B-A3B. The figure summarizes representative results across video understanding, coding, agentic tool use, mathematical and scientific reasoning, instruction following, and general vision-language benchmarks. Orange scores mark leading results in each row, and “–” indicates unavailable or not directly comparable scores. Higher is better unless otherwise specified by the corresponding benchmark; detailed benchmark descriptions and citations are provided in the subsections below.
6.1 Video Understanding
We evaluate Keye-VL-2.0-30B-A3B on three complementary aspects of video understanding: long-video comprehensive understanding, fine-grained temporal grounding, and video knowledge acquisition.
6.1.1 Benchmarks
Long-Video Comprehensive Understanding.
Video-MME-v2 (fu2026videomme2) evaluates omni-modal information aggregation, long-range temporal understanding, and complex reasoning. It reports both average accuracy and a group-based non-linear score $(N/4)^{2}$, where $N$ is the number of correct answers in a group of four correlated questions. LongVideoBench (longvideobench) evaluates long-context video-language reasoning with referring queries. MLVU (mlvu) covers multi-task long-video understanding from 3-minute to 2-hour videos. Video-MME (video_mme) evaluates short, medium, and long videos; we report the no-subtitle setting.
These benchmarks together test whether a model can retrieve relevant segments from long videos, maintain event order, integrate cross-segment evidence, and answer questions without relying on subtitle shortcuts.
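The difference between average accuracy and the group-based non-linear score of Video-MME-v2 can be seen in a short sketch. It assumes that the per-group value $(N/4)^{2}$ is averaged over groups to obtain the reported score; the averaging step and the function names are assumptions of this illustration.

```python
def accuracy(groups):
    """Plain accuracy over all questions; each group is a list of four 0/1 flags."""
    flat = [c for g in groups for c in g]
    return sum(flat) / len(flat)

def nonlinear_score(groups):
    """Mean over groups of (N/4)^2, with N the number of correct answers
    among the four correlated questions of a group."""
    return sum((sum(g) / 4) ** 2 for g in groups) / len(groups)
```

With one fully correct group and one group with two correct answers, accuracy is 0.75 but the non-linear score is 0.625, since partially correct groups are discounted quadratically. This is why the Non-Lin rows are consistently lower than the ACC rows.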
Temporal Grounding.
We use the TimeLens framework (zhang2025timelens), which re-annotates ActivityNet Captions, QVHighlights, and Charades-STA to reduce annotation noise. We report mIoU on ActivityNet-TimeLens, QVHighlights-TimeLens, and Charades-TimeLens.
The cleaned annotations make TimeLens a more reliable probe of fine-grained temporal alignment than the original legacy annotations, where noisy boundaries can distort model rankings.
Video Knowledge Acquisition.
Video-MMMU (videommmu) evaluates whether a model can learn and apply domain knowledge from instructional videos across 30 sub-disciplines and 6 major fields.
Its questions cover perception, comprehension, and adaptation levels, requiring the model to identify information in the lecture, understand the underlying concept, and apply it to new scenarios.
6.1.2 Results
Table 2: Video understanding evaluation of Keye-VL-2.0-30B-A3B against representative baselines. Bold marks the best result per row, underline marks the second-best, and “–” indicates an unavailable comparable score.
| Category | Benchmark | Keye-VL-2.0 30B-A3B | Qwen3.5 35B-A3B | InternVL3.5 241B-A28B | GPT-5-mini 2025-08-07 | Qwen3-VL 30B-A3B Thinking | Qwen3-VL 32B Thinking | Qwen3-VL 235B-A22B Thinking |
|---|---|---|---|---|---|---|---|---|
| Long-Video Comprehensive | LongVideoBench | 74.1 | 61.6 | 67.1 | – | – | – | 70.5 |
| Long-Video Comprehensive | Video-MME-v2 ACC (64 / 512 frames) | 35.3 / 42.4 | 32.6 / 28.5 | – | – | – | – | 33.3 / 36.8 |
| Long-Video Comprehensive | Video-MME-v2 Non-Lin (64 / 512 frames) | 18.5 / 24.2 | 15.6 / 12.2 | – | – | – | – | 26.3 / 28.1 |
| Long-Video Comprehensive | MLVU | 82.8 | 85.6 | 78.2 | 83.3 | 78.9 | 82.3 | 83.8 |
| Long-Video Comprehensive | Video-MME (w/o sub.) | 78.3 | 82.5 | 72.9 | 78.9 | 73.3 | 77.3 | 79.0 |
| Temporal Grounding (TimeLens) | ActivityNet-TimeLens | 58.5 | 53.2 | – | – | – | – | 52.1 |
| Temporal Grounding (TimeLens) | QVHighlights-TimeLens | 70.1 | 65.7 | – | – | – | – | 64.6 |
| Temporal Grounding (TimeLens) | Charades-TimeLens | 58.4 | 49.1 | – | – | – | – | 47.8 |
| Video Knowledge Acquisition | Video-MMMU | 80.0 | 80.4 | – | 82.5 | 75.0 | 79.0 | 80.0 |
Keye-VL-2.0 achieves the best result on LongVideoBench and Video-MME-v2 accuracy, and remains competitive on mature benchmarks such as MLVU and Video-MME. On Video-MME-v2, its strong accuracy under both 64-frame and 512-frame settings indicates that the model benefits from denser visual context without losing long-range aggregation ability. It also achieves the best mIoU on all three TimeLens subsets, validating the effectiveness of scene-wise dense captions, diverse TVG data, and tIoU-centered Video RL. In addition, for the TimeLens benchmark, the score of Qwen3-VL-235B-A22B Thinking is directly taken from the paper, where the evaluation was conducted with 2 FPS frame sampling. For Qwen3.5-35B-A3B, there is currently no official TimeLens score available. Considering that previous Qwen-series evaluations on the TVG benchmark adopted a 4 FPS setting, we evaluate the main comparison models, including Qwen3.5 35B-A3B, Gemini, and Keye-VL-2.0, using dense frame sampling at 4 FPS for a fair comparison. On Video-MMMU, it reaches 80.0, matching strong open-source baselines and approaching the closed-source GPT-5-mini.
6.2 Agentic Capability Evaluation
6.2.1 Code Agent Evaluation
We evaluate coding and software-engineering capabilities on LiveCodeBench v6 (livecodebench), OJBench (ojbench), and SWE-bench Verified (swe_bench_verified). LiveCodeBench v6 provides contamination-resistant programming evaluation, OJBench tests online-judge-style algorithmic correctness, and SWE-bench Verified evaluates repository-level issue resolution.
Table 3: Code agent evaluation of Keye-VL-2.0-30B-A3B against representative baselines. Bold marks the best result per row, underline marks the second-best, and “–” indicates an unavailable comparable score.
| Benchmark | Keye-VL-2.0 30B-A3B | Qwen3.5 35B-A3B | InternVL3.5 241B-A28B | GPT-5-mini 2025-08-07 | Qwen3-VL 30B-A3B Thinking | Qwen3-VL 32B Thinking | Qwen3-VL 235B-A22B Thinking |
|---|---|---|---|---|---|---|---|
| LiveCodeBench v6 | 64.2 | 62.8 | – | 51.5 | – | – | – |
| OJBench | 71.5 | 70.2 | – | 58.7 | – | – | – |
| SWE-bench Verified | 62.0 | 63.5 | – | 55.5 | – | – | – |
Keye-VL-2.0-30B-A3B achieves 64.2 on LiveCodeBench v6 and 71.5 on OJBench, showing strong algorithmic reasoning and execution-based self-correction. It also obtains a competitive 62.0 on SWE-bench Verified, suggesting that the model transfers part of its coding ability from isolated algorithmic problems to repository-level software engineering.
6.2.2 Tool-Use Evaluation
We evaluate function calling and multi-turn tool use on BFCL-V4 (bfcl), $\tau^{2}$-Bench (tau2bench), and VitaBench (vitabench). These benchmarks cover single-turn and multi-turn function calling, dual-control stateful interaction, and complex life-service scenarios with heterogeneous tools.
Table 4: Tool-use and function-calling evaluation of Keye-VL-2.0-30B-A3B against representative baselines. Bold marks the best result per row, underline marks the second-best, and “–” indicates an unavailable comparable score.
| Benchmark | Keye-VL-2.0 30B-A3B | Qwen3.5 35B-A3B | InternVL3.5 241B-A28B | GPT-5-mini 2025-08-07 | Qwen3-VL 30B-A3B Thinking | Qwen3-VL 32B Thinking | Qwen3-VL 235B-A22B Thinking |
|---|---|---|---|---|---|---|---|
| BFCL-V4 | 65.7 | 67.3 | – | 55.5 | – | – | – |
| $\tau^{2}$-Bench | 82.6 | 81.2 | – | 69.8 | – | – | – |
| VitaBench | 33.1 | 31.9 | – | 13.9 | – | – | – |
Keye-VL-2.0-30B-A3B achieves the best results on $\tau^{2}$-Bench and VitaBench and ranks second on BFCL-V4. These results indicate strong tool selection, parameter filling, state tracking, and recovery behavior in multi-turn environments.
6.3 General Vision-Language Evaluation
We evaluate general vision-language capabilities on perception- and reasoning-oriented benchmarks.
Perception-oriented benchmarks cover OCR and document understanding, grounding, counting, spatial understanding, and hallucination resistance. OCRBench (ocrbench) and OmniDocBench (omnidocbench) focus on text-centric visual understanding and document intelligence. RefCOCO (refcoco) evaluates referring-expression grounding, FSC-147 (fsc_147) and PixMoCount (pixmo) measure counting, EmbSpatial-Bench (embspatialbench) evaluates egocentric spatial relations, and HallusionBench (hallusionbench) diagnoses perceptual reliability.
Reasoning-oriented benchmarks cover visual mathematics, dynamic reasoning robustness, expert-level multimodal understanding, and vision-indispensable reasoning. WeMath (wemath), MathVista (mathvista), and DynaMath (dynamath) evaluate complementary aspects of visual mathematical reasoning, while MMMU (mmmu) and MMStar (mm_star) test expert-level multimodal reasoning under reduced text-only shortcut opportunities. Overall, Keye-VL-2.0-30B-A3B maintains strong general multimodal performance while showing particularly strong results in long-video understanding, temporal grounding, hallucination resistance, and visual mathematical reasoning.
Conclusion and Future Work
We presented Kwai Keye-VL-2.0-30B-A3B, an open-source 30B-class MoE multimodal foundation model with only 3B active parameters. By bringing DeepSeek Sparse Attention into a GQA-based multimodal backbone, Keye-VL-2.0 extends effective context modeling to 256K tokens and makes hour-level video understanding practical under controllable training and inference cost. Together with native-resolution visual encoding, unified image-video processing, ViT–LM heterogeneous parallelism, DSA kernels, and Chunk ViT inference, the system is designed not only to improve benchmark accuracy, but also to support deployable long-video applications.
The post-training pipeline further addresses the capability-conflict problem that emerges when perception, reasoning, long-context understanding, and agentic behaviors are optimized together. Cross-Modal Multi-Teacher On-Policy Distillation, Context-RL, Video-RL, and specialized domain RL allow heterogeneous teachers and reward signals to be consolidated into a single MoE model without sacrificing core reasoning ability. Evaluations show that Keye-VL-2.0 achieves leading performance at its scale on long-video comprehension and fine-grained temporal localization, while remaining competitive on code, tool-use, OCR, document understanding, visual mathematics, and hallucination-resistance benchmarks. Overall, the results indicate that sparse long-context modeling and carefully staged multimodal RL can be combined in a single deployable MoE system without trading away general reasoning ability.
Future work will move beyond leaderboard-oriented optimization toward deeper deployment in real business scenarios. First, we will further integrate fine-grained long-video perception and image-text understanding into core product pipelines, including generative recommendation, content ecosystem governance, and commercial targeting. In these settings, Keye-VL is expected to provide denser semantic signals for recommendation matching, content quality assessment, and fine-grained advertising labels, turning multimodal understanding into measurable product and business value. Second, we will develop Video $\times$ Agent workflows that combine precise multimodal understanding with automated orchestration. The model will evolve from passively understanding video content to actively coordinating production loops, including large-scale video retrieval, highlight segment extraction, automated editing and packaging, and marketing-copy generation for creator and commercial scenarios. Third, using Keye-VL-2.0-30B-A3B as a validated foundation, we will continue strengthening the underlying infrastructure from DSA-based compute optimization, scalable data flywheels, and Context-RL post-training toward native multimodal modeling and deeper end-to-end fusion. In this direction, benchmark gains serve as diagnostics rather than the final objective; the long-term goal is to turn long-context multimodal intelligence into reliable, scalable infrastructure for real-world applications.
References
Appendix A Case Study
The following qualitative examples show the final user-facing responses of Keye-VL-2.0 on representative text, image, video, and agentic service tasks. Since Keye-VL-2.0 uses a thinking-oriented policy by default, we omit internal reasoning traces and do not display separate thinking or answer tags in the case study.
Case I: Logical Constraint Solving
Prompt

The Protocol of the Corrupted AI. A cybersecurity analyst must override a rogue AI system by entering five unique Access Keys into five slots. Each Access Key contains exactly one letter from $\{A,B,C,D,E\}$ and one digit from $\{1,2,3,4,5\}$. No letter or digit may repeat. Five intercepted nodes describe the sequence, but exactly one node is corrupted: every statement in the corrupted node is false, while every statement in each intact node is true.

Node 1. $A$ is immediately to the left of the slot containing digit $5$. Digit $3$ is located in Slot 5.

Node 2. $C$ is paired with an even digit. The slot containing $E$ is exactly two slots to the right of the slot containing digit $4$.

Node 3. Digit $2$ is adjacent to the slot containing $D$. $B$ is neither in Slot 1 nor Slot 5.

Node 4. The sum of the digits in Slot 1 and Slot 2 equals the digit in Slot 3. The digit paired with $A$ is strictly less than the digit paired with $B$.

Node 5. $D$ is somewhere to the left of $C$. Digit $1$ is paired directly with $E$. Slot 4 contains an odd digit.

Mission. Identify the corrupted node and determine the exact letter-digit sequence for all five slots.

Model Response

Corrupted node. Node 5 is the corrupted node. All statements in Node 5 are false, while all statements in Nodes 1–4 are true.

Exact sequence.

| Slot | Letter | Digit |
|---|---|---|
| 1 | C | 4 |
| 2 | A | 1 |
| 3 | E | 5 |
| 4 | B | 2 |
| 5 | D | 3 |

Verification.

| Node | Status and evidence |
|---|---|
| Node 1 | True. $A$ in Slot 2 is immediately to the left of digit $5$ in Slot 3; digit $3$ is in Slot 5. |
| Node 2 | True. $C$ is paired with digit $4$, which is even; $E$ in Slot 3 is exactly two slots to the right of digit $4$ in Slot 1. |
| Node 3 | True. Digit $2$ in Slot 4 is adjacent to $D$ in Slot 5; $B$ is in Slot 4, so it is neither in Slot 1 nor Slot 5. |
| Node 4 | True. The Slot 1 and Slot 2 digits sum to the Slot 3 digit: $4+1=5$. The digit paired with $A$ is $1$, which is less than the digit paired with $B$, $2$. |
| Node 5 | False. $D$ is in Slot 5 and $C$ is in Slot 1, so $D$ is not left of $C$. Digit $1$ is paired with $A$, not $E$. Slot 4 contains digit $2$, which is even rather than odd. |

The assignment satisfies the uniqueness constraint for all letters and digits and leaves exactly one corrupted node.
Figure 6: Text case for logical constraint solving. Keye-VL-2.0 solves a multi-constraint symbolic puzzle by identifying the single corrupted evidence node, constructing a valid slot assignment, and verifying every statement against the final sequence.
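The response in Case I can be checked mechanically. The following is a minimal brute-force sketch, not part of the model or its evaluation: it encodes the five nodes as Boolean statements, enumerates every letter and digit permutation, and keeps the assignments for which exactly one chosen node is entirely false and the other four are entirely true. The function names are hypothetical, and "adjacent" is interpreted as "in a neighboring slot".

```python
from itertools import permutations

LETTERS = "ABCDE"
DIGITS = (1, 2, 3, 4, 5)


def node_statements(letters, digits):
    """Truth values of every statement, grouped by node.

    letters[i] and digits[i] are the letter and digit in slot i + 1.
    """
    pos = {ch: i for i, ch in enumerate(letters)}      # letter -> slot index
    dpos = {d: i for i, d in enumerate(digits)}        # digit  -> slot index
    paired = {ch: digits[i] for ch, i in pos.items()}  # letter -> its digit
    return [
        # Node 1: A immediately left of digit 5; digit 3 in Slot 5.
        [pos["A"] + 1 == dpos[5], dpos[3] == 4],
        # Node 2: C has an even digit; E is two slots right of digit 4.
        [paired["C"] % 2 == 0, pos["E"] == dpos[4] + 2],
        # Node 3: digit 2 is next to D; B is not in Slot 1 or Slot 5.
        [abs(dpos[2] - pos["D"]) == 1, pos["B"] not in (0, 4)],
        # Node 4: Slot 1 + Slot 2 digits equal Slot 3 digit; A's digit < B's.
        [digits[0] + digits[1] == digits[2], paired["A"] < paired["B"]],
        # Node 5: D left of C; E paired with 1; Slot 4 digit is odd.
        [pos["D"] < pos["C"], paired["E"] == 1, digits[3] % 2 == 1],
    ]


def solve():
    """Return all (corrupted_node, access_keys) pairs consistent with the rules."""
    solutions = []
    for letters in permutations(LETTERS):
        for digits in permutations(DIGITS):
            nodes = node_statements(letters, digits)
            for corrupted in range(5):
                consistent = all(
                    (not any(stmts)) if i == corrupted else all(stmts)
                    for i, stmts in enumerate(nodes)
                )
                if consistent:
                    keys = tuple(f"{ch}{d}" for ch, d in zip(letters, digits))
                    solutions.append((corrupted + 1, keys))
    return solutions


solutions = solve()
model_answer = (5, ("C4", "A1", "E5", "B2", "D3"))
```

The search space is only 120 × 120 assignments times five corruption hypotheses, so exhaustive enumeration is cheap; checking `model_answer in solutions` confirms that the reported sequence is consistent with the puzzle rules.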
Case II: Spatial Reasoning in an Indoor Layout
Input Image ![[Uncaptioned image]](https://arxiv.org/html/2606.10651v1/arXiv/cases/figs/room_layout_spatial_reasoning.png)

Prompt

The image is a top-down view of a room, with each piece of furniture labeled by a letter. Please answer the following questions:

1. Which direction is A (the sofa) facing?
2. Which direction is C (the TV) facing?
3. Are A (the sofa) and C (the TV) facing each other?
4. When entering the room through G (the door), what furniture is on your left-hand side and what furniture is on your right-hand side?
5. What is the relative spatial relationship between D (the floor lamp) and H (the window)?
6. If B (the coffee table) is moved next to E (the dining table), in which direction should it be moved?
7. In which corner of the room is F (the bookshelf) located? Describe its position relative to the other furniture.

Please answer based on the spatial layout shown in the image, paying close attention to orientations, directions, and relative positions.

Model Response

Directions of A and C. A (sofa) faces toward the center of the room, i.e., to the right. C (TV) faces toward the center of the room, i.e., to the left. Therefore, A and C are facing each other, forming a typical living-room seating arrangement.

Left and right when entering from G. When standing at G and entering the room, the furniture on the left-hand side is the sofa A, while the furniture on the right-hand side is the TV C.

Relative position of D and H. D (floor lamp) is to the left of H (window), namely on the west side of the window. The floor lamp is positioned left and slightly forward of the window, and both are near the north wall.

Moving B next to E. B (coffee table) is currently south of E (dining table), or below it in the top-down layout. To move B next to the dining table, it should be moved north.

Location of F. F (bookshelf) is located in the northeast corner of the room. Relative to the other furniture, it is east of the window H, north of the TV C, and northeast of the dining table E. It is positioned near the intersection of the north and east walls.

Overall, this layout shows a standard living-room configuration: the sofa and TV face each other to form a seating area, the dining table is in the middle-north section, the window and bookshelf are on the north wall, and the door is on the south wall.
Figure 7: Image case for spatial layout understanding. Given a labeled top-down indoor scene, Keye-VL-2.0 identifies object orientations, egocentric left-right relations, furniture positions, and the direction needed to move an object.
Case III: Anatomical Reasoning and Error Detection
Input Image ![[Uncaptioned image]](https://arxiv.org/html/2606.10651v1/arXiv/cases/figs/heart_circulation_diagram.png)

Prompt

This image is an anatomical diagram of the human heart and a schematic of blood circulation. Please answer the following questions:

- What are the four chambers of the heart? Please list them according to the labels in the diagram.
- What do the blue and red arrows in the diagram represent, respectively? Hint: it is related to blood oxygen content.
- After blood flows from the lungs into the heart, which chambers and blood vessels does it sequentially pass through before being delivered to the rest of the body?
- Where is the “Mitral Valve” labeled in the diagram located? Which two chambers does it connect?
- Based on the text description in the diagram, point out one scientifically incorrect statement and correct it.

Please base your answers on the labels and text information provided in the image.

Model Response

Four chambers. According to the labels in the diagram, the four chambers of the heart are Right Atrium, Right Ventricle, Left Atrium, and Left Ventricle.

Meaning of blue and red arrows. Blue arrows represent the flow of oxygen-poor blood with low oxygen content. Red arrows represent the flow of oxygen-rich blood with high oxygen content. This is consistent with the diagram text describing oxygen-poor blood returning and oxygen-rich blood flowing from the lungs.

Pathway from the lungs to the body. The sequence is: lungs → left atrium → left ventricle → aorta → entire body.

Position of the Mitral Valve. The Mitral Valve is located between the left atrium and the left ventricle. It connects the Left Atrium and the Left Ventricle.

Scientific error and correction. The path description for oxygen-poor blood states: body → SVC → right atrium → SVC → right ventricle. This is incorrect because the SVC is used twice, and it carries blood from the body to the right atrium rather than from the right atrium to the right ventricle. The corrected pathway should be: body → SVC → right atrium → right ventricle → pulmonary artery → lungs. Additionally, there is a spelling error in the diagram: “pulmary” should be “pulmonary.”
Figure 8: Image case for anatomical diagram understanding. Given a labeled heart diagram with blood-flow arrows and explanatory text, Keye-VL-2.0 identifies cardiac chambers, interprets oxygen-rich and oxygen-poor circulation, localizes the mitral valve, reconstructs the pulmonary-to-systemic pathway, and detects an incorrect textual description.
Case IV: Scene-by-Scene Historical Video Understanding
Input Video ![[Uncaptioned image]](https://arxiv.org/html/2606.10651v1/arXiv/cases/figs/history_video_preview.png)
Prompt Describe the content of the video by scene and present it in English.
Model Response

Scene 1: Opening Rural Landscape and Mythical History. The video begins with a yellow flag waving against a clear sky, framed by lush green trees. The scene transitions to a vast golden wheat field where a farmer harvests crops with a sickle. Children ride bicycles along a rural road, passing by motorbikes. Subtitles state, “If this was the capital of the Xia, for the Chinese, myth would become history, for they would have found the root of the Chinese state.”

Scene 2: Highway Journey and Historical Transition. A man drives on a highway, with the rearview mirror reflecting his face. Subtitles mention “around 1200 BC” and the search for China’s earliest historical rulers, specifically the Shang Dynasty, noted for leaving the “first Chinese writing.” The camera shows a roadside statue of a laughing Buddha and highway traffic.

Scene 3: Traditional Street and Chinese Medicine Store. The scene shifts to a bustling traditional street with pedestrians, a child blowing bubbles, and ornate architectural details. It transitions inside an old Chinese medicine store, where staff prepare herbal packages, and a foreign man navigates the crowded pharmacy. Subtitles highlight the Shang Dynasty’s discovery as “one of the most exciting stories in world archaeology” and emphasize traditional Chinese medicine’s ancient roots.

Scene 4: Wang Yirong’s Discovery of Oracle Bones. Inside the pharmacy, the foreign man recounts the story of Wang Yirong in 1899, a Beijing Hanlin Academy chancellor who, while ill with malaria, purchased medicine containing “dragon bones,” namely animal bones. Upon opening the package, he discovered inscriptions resembling ancient bronze writings, leading to the tracing of these bones to Anyang, Henan. Close-ups show him examining the inscribed bones.

Scene 5: Anyang Archaeological Excavation. Black-and-white historical photos depict the Anyang excavation site, showing massive tombs of late Shang kings with human sacrifices and rows of skulls. Subtitles note that the 1928 excavation uncovered nearly 30,000 oracle bones documenting divination activities of nine Shang kings. The sequence includes images of archaeologists at work and close-ups of inscribed bones.

Scene 6: Expert Discussion on Oracle Bone Divination. Two experts discuss the oracle bone divination process. One explains how Shang kings burned cracks in turtle shells or ox bones to communicate with ancestors, interpreting crack patterns to determine auspiciousness. They compare oracle bone characters with modern Chinese script, such as the character for “rain,” illustrating its evolution from pictographs to modern form.

Scene 7: Evolution of Chinese Script Animation. An animated sequence illustrates the transformation from prehistoric oracle bone pictographs to modern Chinese script, showing thousands of characters linking the present to the past. Subtitles emphasize how Chinese people are uniquely connected to their history and thinking through their script, surpassing other cultures globally.
Figure 9: Video case for long-form scene-level understanding. Keye-VL-2.0 summarizes a historical documentary clip by segmenting the video into coherent scenes, tracking transitions from rural landscapes and travel shots to traditional medicine, oracle-bone discovery, Anyang excavation, expert discussion, and the evolution of Chinese script.
Case V: Daily Vlog and Equestrian Lesson Understanding
Input Video ![[Uncaptioned image]](https://arxiv.org/html/2606.10651v1/arXiv/cases/figs/ks_video_preview.png)
Prompt Describe the content of the video by scene and present it in English.
Model Response

The model segments a daily vlog into ten timestamped scenes, preserving the user’s school preparation, equestrian lesson, campus classes, family interaction, and evening routine. The response below is presented as a compact timeline.

| Time | Scene summary |
|---|---|
| 00:00–00:08 | Living room preparation for school. In a bright living room at 11:39 AM, the woman shows a white drawstring bag and its contents, including a wallet, tablet, and water bottle. On-screen text introduces the first day of class at Columbia and the end of summer break. |
| 00:08–00:14 | Kitchen snack packing. She opens organized kitchen drawers, selects snack packets such as black sesame cakes, and packs them for a practical school day. |
| 00:14–00:18 | Changing shoes for a riding lesson. She switches into black chunky boots, prepares a white tote bag, and notes that she also has a first riding lesson that day. |
| 00:18–00:28 | Driving to the stable. She drives through an automatic gate, sees horses in paddocks, and enters a stable where a farrier is shoeing a horse. |
| 00:28–00:38 | Riding lesson preparation. With an instructor’s help, she puts on a helmet, mounts the horse, and stands ready for the lesson while expressing affection and trust toward horses. |
| 00:38–00:52 | Riding lesson. In an indoor arena, she practices balance and control while the instructor guides the horse around the ring. |
| 00:52–00:58 | Driving to school. After the lesson, she changes shoes in the car and buys coffee before class, noting that the timing works out smoothly. |
| 00:58–01:14 | Campus scenes and classes. She walks through a historic campus hallway with coffee, attends two classes, and reflects positively on her third and final semester at Columbia. |
| 01:14–01:24 | Family surprise with cat. Returning to the car, she finds that her husband has brought their cat Fengfeng in a carrier; she hugs the cat and feels supported by her family. |
| 01:24–01:37 | Evening health routine. Back home, she prepares vitamins, probiotics, fiber supplements, and green smoothies with her husband, ending the day with a toast and thanks to the viewer. |
Figure 10: Video case for scene-level daily vlog understanding. Keye-VL-2.0 follows a long-form personal vlog across preparation for school, snack packing, a riding lesson, campus classes, a family surprise, and an evening health routine, preserving both temporal boundaries and fine-grained lifestyle details.
Case VI: Multi-Domain Service Agent
Prompt A user asks the assistant to arrange several services for parents visiting Quanzhou on the same day. The request evolves across multiple turns: recommend and book a comfortable hotel near the user’s home, choose a spot-removing skincare product that can arrive within one hour, pay the hotel and delivery orders, recommend a Minnan-cuisine restaurant with Ginger Duck and a rich family set meal, create the restaurant order, and finally pay it.
Model Response

Personalized planning. The assistant first retrieves the user’s profile to ground the task in verified information. It identifies the home address, infers that the user’s most frequent hotel brand is Home Inn, converts the home address to coordinates, and searches for nearby hotels in Quanzhou. It recommends Home Inn (Quanzhou Tian’an Road Branch) because it is 1.2 km from the user’s home, matches the user’s historical hotel preference, provides a comfortable queen-bed room, and costs only 288 RMB within the 1000 RMB budget.

Hotel booking. After the user confirms the recommendation, the assistant creates a hotel order for one night in the comfortable queen-bed room:

| Item | Result |
|---|---|
| Hotel | Home Inn (Quanzhou Tian’an Road Branch) |
| Room | Comfortable queen-bed room |
| Date | 2024-12-30 |
| Order ID | OOd651df337d |
| Price | 288 RMB |

Delivery recommendation and timing verification. For the skincare request, the assistant searches for spot-removing skincare products, retrieves store locations, computes distances from the user’s home, and converts distances into delivery times. It recommends Yuquan Spot-Removing Serum from Yueyan Skincare Store, because the store is 412 meters away and the estimated delivery time is 28 minutes, satisfying the one-hour constraint.

Cross-domain order execution. The assistant then creates and pays the delivery order, handles the hotel payment with the correct hotel-payment tool after detecting a payment-type mismatch, and searches in-store restaurant options for Minnan cuisine with Ginger Duck. It recommends Minnan Flavor Restaurant and its Traditional Minnan Family Banquet Set (3–4 people) because the set contains Ginger Duck, is suitable for a family meal, and costs 398 RMB. After user confirmation, it creates and pays the in-store order.

Final confirmed arrangement.

| Service | Selection | Order ID | Status |
|---|---|---|---|
| Hotel | Home Inn queen-bed room, 288 RMB | OOd651df337d | Paid |
| Delivery | Yuquan Spot-Removing Serum, 258 RMB | OTad2c5d4616 | Paid; arrives at 15:58 |
| Restaurant | Minnan family banquet set, 398 RMB | OI3ddaed0869 | Paid |

Agent capability demonstrated. This case shows that Keye-VL-2.0 can maintain a multi-turn service plan, use user history for personalization, call tools across hotel, delivery, geolocation, distance estimation, and in-store ordering domains, recover from an incorrect payment-tool route, and complete all user-confirmed transactions while keeping the final response concise.
Figure 11: Agent case for multi-domain service orchestration. Keye-VL-2.0 coordinates personalization, search, geolocation, delivery-time estimation, booking, payment, recommendation, and order creation across hotel, delivery, and in-store service domains.
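The payment-tool recovery in Case VI can be pictured as a small control-flow pattern: a domain-specific payment tool rejects an order from another domain, and the caller retries with the tool that matches the order. The sketch below is purely illustrative. The tool names, the `Order` type, and the `PaymentTypeMismatch` error are hypothetical and do not describe the actual benchmark environment or its API; in the real case the model itself decides how to recover from the tool feedback.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class PaymentTypeMismatch(Exception):
    """Hypothetical error: the order belongs to a different service domain."""


@dataclass
class Order:
    order_id: str
    domain: str  # "hotel", "delivery", or "instore"
    paid: bool = False


def make_pay_tool(domain: str) -> Callable[[Order], str]:
    """Build a payment tool that only accepts orders from one domain."""
    def pay(order: Order) -> str:
        if order.domain != domain:
            raise PaymentTypeMismatch(
                f"{order.order_id} is a {order.domain} order, not {domain}"
            )
        order.paid = True
        return f"{order.order_id} paid via {domain} tool"
    return pay


PAY_TOOLS: Dict[str, Callable[[Order], str]] = {
    d: make_pay_tool(d) for d in ("hotel", "delivery", "instore")
}


def pay_with_recovery(order: Order, first_guess: str) -> List[str]:
    """Try one tool; on a type mismatch, retry with the order's own domain.

    Returns the list of tools that were attempted, in order.
    """
    attempts = [first_guess]
    try:
        PAY_TOOLS[first_guess](order)
    except PaymentTypeMismatch:
        attempts.append(order.domain)
        PAY_TOOLS[order.domain](order)
    return attempts


# The hotel order from the case, first routed to the wrong tool.
hotel = Order("OOd651df337d", "hotel")
attempts = pay_with_recovery(hotel, first_guess="delivery")
```

Here the first call fails with a mismatch, the second succeeds, and `attempts` records both routes, which mirrors the "detect mismatch, then use the correct hotel-payment tool" behavior described in the response.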
Appendix B Contribution (Alphabetical order)
Core Contributors
Bin Wen, Changyi Liu, Chengru Song, Chongling Rao, Guowang Zhang, Han Li, Haonan Fan, Hengrui Ju, Jiankang Chen, Jiapeng Chen, Jiawei Yuan, Kaixuan Yang, Kaiyu Jiang, Kun Gai, Lingzhi Zhou, Na Nie, Sen Na, Tianke Zhang, Tingting Gao, Xuanyu Zheng, Yulong Chen
Contributors
Major. Fan Yang, Haixuan Gao, Lele Yang, Mingqiao Liu, Muxi Diao, Qi Zhang, Qile Su, Wei Chen, Wentao Hong, Xingyu Lu, Yancheng Long, Yankai Yang, Yingxin Li, Yiyang Fan, Yu Xia, Yuzhe Chen, Ziliang Lai
Active. Chuan Yi, Haonan Jia, Tianming Liang, Weixin Xu, Xiaoxiao Ma, Yang Tian, Yufei Han
Supporting Contributors
Feng Han, Hang Li, Jing Wang, Jinghui Jia, Junmin Chen, Junyu Shi, Ruilin Zhang