Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

Yifan Li¹, Xin Li², Tianqin Li²,³, Wenbin He², Yu Kong¹, Liu Ren²
¹Michigan State University
²Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI)
³Carnegie Mellon University
{liyifa11, yukong}@msu.edu, {xin.li9, tianqin.li2, Wenbin.He2, liu.ren}@us.bosch.com

Abstract

Vision foundation models (VFMs) have demonstrated remarkable performance across a wide range of downstream tasks. While several VFM adapters have shown promising results by leveraging the prior knowledge of VFMs, we identify two inefficiencies in these approaches. First, the interaction between the convolutional neural network (CNN) branch and the VFM backbone forces gradients to be back-propagated through the early layers. Second, existing methods require tuning all components, adding complexity. In addition, these adapters alter VFM features, underutilizing the prior knowledge. To tackle these challenges, we propose a new approach called ViT-Split, based on a key observation: the layers of several VFMs, like DINOv2, can be divided into two components: an extractor for learning low-level features and an adapter for learning task-specific features. Leveraging this insight, we eliminate the CNN branch and introduce two heads, a task head and a prior head, to the frozen VFM. The task head is designed to learn task-specific features, mitigating the early gradient propagation issue. The prior head is used to leverage the multi-scale prior features from the frozen VFM, reducing tuning parameters and overfitting. Extensive experiments on various tasks (e.g., segmentation, detection, depth estimation, and visual question answering) validate the effectiveness and efficiency of ViT-Split. Specifically, ViT-Split reduces training time by up to 4× while achieving comparable or even better results on ADE20K, compared to other VFM adapters. Code is available at https://jackyfl.github.io/vitsplit.github.io/.

1 Introduction

Recent studies reveal that foundation models have a remarkable ability to acquire prior knowledge from large-scale datasets [93], which enhances performance on downstream tasks. For vision tasks, vision foundation models (VFMs) acquire prior knowledge from large-scale datasets through self-supervised learning [29], utilizing techniques such as masked image modeling (MIM) [35, 94, 5], contrastive learning [13, 34, 8, 28], or hybrid approaches (MIM + contrastive) [65, 2]. They also leverage vision-language alignment [69, 25] and dense prediction tasks [43, 82], among others. VFMs exhibit remarkable zero-shot and transfer learning capabilities across a variety of downstream tasks, e.g., classification, detection, segmentation, monocular depth estimation (MDE), and visual question answering (VQA).

(a) Previous VFM adapters.

(b) Ours (ViT-Split).

Figure 1: Comparison between previous VFM adapters and ours. Previous VFM adapters integrate low-level features learned by a CNN branch into a learnable VFM through an adapter. Our method exploits VFM prior knowledge with two heads: a prior head for multi-scale prior feature learning from a frozen VFM, and a task head for task-specific feature learning, initialized by the last few layers of the VFM.

To leverage prior knowledge from VFMs, previous VFM adapters such as ViT-Adapter [14] or ViT-CoMer [79] primarily adopt a two-branch architecture (see Fig. 1(a)). Such a design enables the adapter to integrate low-level features from a convolutional neural network (CNN) with global features from a vision transformer (ViT)-based VFM. While this architecture has demonstrated promising results across various downstream tasks, certain design aspects may affect training efficiency. From Fig. 1(a), we identify two main issues of inefficiency. First, the interaction between the CNN and ViT branches across multiple stages requires gradients to be back-propagated through all layers of the model during training. This results in increased computational and memory costs as the size of the VFM grows. Second, all components need to be tuned during training to achieve optimal performance. Specifically, for tasks like segmentation, a large head such as Mask2Former [16] is tuned, and its size is nearly equivalent to that of the VFM backbone.

To address the training inefficiency issue, parameter-efficient fine-tuning (PEFT) methods are proposed to reduce training parameters. These methods include prompt-tuning approaches like VPT [40], adapter-based methods like AdaptFormer [12], and low-rank weight tuning like LoRA [36] or FacT [42]. However, these methods still encounter the issue of early-layer gradient back-propagation, as learnable parameters are appended to each layer’s visual tokens (prompt tuning), or low-rank weights are inserted into the layers (adapter-based methods) or added to the original weights (low-rank weight tuning). Moreover, these PEFT methods do not incorporate low-level features as VFM adapters do, and their performance is either slightly inferior to or generally on par with traditional fine-tuning. Furthermore, despite their proven effectiveness across various tasks [65], the pretrained prior features are not fully leveraged by either PEFT methods or the VFM adapters.

Figure 2: Comparison with previous VFM adapters (ViT-Adapter [14] and ViT-CoMer [79]) on ADE20K val. The results indicate that by leveraging the potential of VFMs (DINOv2 in this task), ViT-Split can achieve competitive results compared to previous VFM adapters. Notably, ViT-Split accomplishes this with only a single linear head and a small number of trainable parameters.

To tackle the aforementioned challenges, we propose a method called ViT-Split (see Fig. 1(b)). ViT-Split is built upon the observation that the layers of a VFM like DINOv2 [65] can be divided into two components: a low-level feature extractor and a task-specific feature adapter. Consequently, an additional CNN branch for local feature extraction becomes unnecessary, allowing us to remove it to resolve the early-layer gradient propagation issue. Additionally, we propose a task-specific adapter, named “task head”, tailored for downstream tasks. This adapter is initialized from the last few layers of the VFM, further avoiding gradient propagation problems in early layers. To effectively leverage prior features learned by the VFM from large-scale datasets, we introduce an additional “prior head” that integrates multi-scale prior features instead of tuning the entire VFM. Such a head reduces the number of trainable parameters and helps mitigate overfitting in the task head (see Appendix). We also explore two layer selection strategies to identify the most relevant layer features. Experiments on the segmentation task (see Fig. 2) demonstrate that ViT-Split, using only a single linear head, achieves competitive performance compared with previous VFM adapters that use larger segmentation heads such as Mask2Former [16] or UperNet [80], while tuning fewer parameters and reducing training time (see Fig. 8).

Furthermore, ViT-Split is both adaptive and memory-efficient for multiple tasks (see Fig. 7). Previous VFM adapters require separate modules (VFM+CNN+adapter+heads) for each task, leading to high computational and memory overhead. In contrast, ViT-Split shares a pre-trained VFM backbone, requiring only a task-specific adapter and the corresponding task head to be learned. Our approach introduces a new paradigm for designing VFM adapters that are both computation- and memory-efficient across multiple tasks. In summary, the contributions of this paper are threefold:

• We observe that several VFMs, especially DINOv2, can be divided into two distinct components: an extractor for learning low-level features and an adapter for learning task-specific features.
• We propose an efficient and effective adapter ViT-Split for VFMs. Specifically, ViT-Split introduces two heads, a task head and a prior head. The task head is for learning task-specific features. The prior head is a lightweight CNN for extracting multi-scale prior features from a frozen VFM. We also explore two layer selection methods for selecting prior features from all the layers: uniform sampling and sparse gate.
• We perform extensive experiments and detailed ablations on various downstream tasks to validate the efficiency and effectiveness of our method, including segmentation, detection, MDE, and VQA.

2.1 Vision foundation models

Vision foundation models (VFMs) [3] are trained on large-scale datasets in a self-supervised, weakly-supervised, or supervised manner, making them adaptable to a wide range of downstream tasks. Benefiting from the scalability of the transformer architecture, recent ViT-based [24] VFMs demonstrate remarkable zero-shot and transfer ability across various downstream tasks. The self-supervised pretraining paradigm learns discriminative features solely from vision data at the image and pixel level, including contrastive learning (MoCo [34], SimCLR [13]), masked image modeling (BEiT [5], MAE [35], iBoT [94]) or hybrid approaches (DINOv2 [65], I-JEPA [2]). The weakly-supervised pretraining paradigm leverages text guidance, aligning visual representations with the language space, as in CLIP [69], ALIGN [39], EVA2 [25], and SigLip [89]. The supervised pretraining paradigm learns from different task labels, such as classification (DeiT [76]), segmentation (SAM [43]), and monocular depth estimation (DAM [82]).

2.2 PEFT and VFM adapters

As transformer-based foundation models continue to grow in size, including large language models [7, 91], large vision models [23, 84], and multi-modal large language models [15, 4], training efficiency becomes increasingly crucial. To address this challenge, PEFT methods have gained significant popularity in recent years.

Current PEFT approaches for vision [83] generally fall into three categories: prompt tuning, adapter tuning, and parameter tuning. Prompt tuning involves learning a small number of prompt tokens, either in the first layer (CoOp [96], CoCoOp [95]) or in every layer (VPT [40]), making it lightweight and easy to implement. Adapter tuning inserts additional blocks into a frozen model either in a sequential manner (Res-adapt [72], ST-Adapter [66]) or in parallel (AdaptFormer [12], ConvPass [41], LoSA [62]), which shows good adaptability and generalizability. Parameter tuning modifies part of the model parameters, either by adjusting the weight (LoRA [36], FacT [42]) or tuning the bias (Bitfit [88]), resulting in effective and efficient tuning.

Current VFM adapters (ViT-Adapter [14], ViT-CoMer [79]) aim to enhance full fine-tuning performance by incorporating the inductive bias of a CNN branch as a spatial prior. These adapters typically require tuning the whole backbone to achieve optimal performance, resulting in better performance than PEFT methods. The interaction between CNN and ViT features is achieved through cross-attention [14], self-attention [79], or a mixture of both [90] across several layers. By contrast, our ViT-Split keeps the entire backbone frozen, introducing two lightweight heads for separate tuning, which is efficient and effective across various tasks.

3 Method

3.1 The observation in VFMs

We observe that in some VFMs, the layers can be broadly partitioned into two groups with similar features: the earlier and later layers. First, we plot the Centered Kernel Alignment (CKA) [44] across different layers for several VFMs, as shown in Fig. 3. The results reveal that features in the earlier layers are more similar to each other, as are those in the later layers, particularly in DINOv2 [65]. We attribute this phenomenon to the “encoder-decoder” architecture intrinsic to VFMs: the earlier layers function as an encoder (feature extractor) to capture features from the visual data, while the later layers act as a decoder (task-specific adapter) that generates features for downstream tasks.
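To make the similarity measure concrete, the following is a minimal sketch of linear CKA between two feature matrices whose rows are samples and whose columns are channels. The text does not state which CKA variant or which token pooling was used for Fig. 3, so the linear form, the function names, and the toy numbers below are assumptions for illustration only.

```python
import math

def center_columns(X):
    """Subtract each column's mean so every feature has zero mean over the samples."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]

def cross_cov_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y for two (n_samples x n_features) matrices."""
    total = 0.0
    for i in range(len(X[0])):
        for j in range(len(Y[0])):
            dot = sum(X[k][i] * Y[k][j] for k in range(len(X)))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA between two feature matrices with the same number of rows."""
    Xc, Yc = center_columns(X), center_columns(Y)
    return cross_cov_sq_norm(Xc, Yc) / math.sqrt(
        cross_cov_sq_norm(Xc, Xc) * cross_cov_sq_norm(Yc, Yc)
    )

# Toy features: 4 samples, 2 channels (hypothetical numbers, not from the paper).
layer_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0], [2.0, 0.0]]
layer_b = [[2.0 * v for v in row] for row in layer_a]  # rescaled copy of layer_a
layer_c = [[-row[1], row[0]] for row in layer_a]       # layer_a rotated by 90 degrees
layer_d = [[1.0], [0.0], [0.0], [1.0]]                 # unrelated single-channel feature

print(linear_cka(layer_a, layer_b))
print(linear_cka(layer_a, layer_c))
print(linear_cka(layer_a, layer_d))
```

Linear CKA is invariant to isotropic rescaling and to orthogonal transformations of the features, which is why it is suitable for comparing layers whose channels are not aligned with one another.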

Figure 3: The CKA comparison of layer features across different VFMs, including a self-supervised method DINOv2-L [65], and three image-text alignment methods EVA2-L [25], CLIP-L [69] and SigLip-L [89]. For most of these VFMs, especially DINOv2, the features in the early and later layers show distinct similarities within their respective groups.

Figure 4: Comparison of DINOv2-S layer features across different tasks, including pretraining (org.), segmentation (seg.), and detection (det.). Notably, the segmentation and detection models are fine-tuned from the DINOv2-S. The features within the red dotted boxes across the three tasks exhibit similar patterns, emphasizing detailed representations. In the later layers, however, the features diverge, becoming more specialized for each task.

A natural question arises: what do these two groups of layers actually learn? To answer it, we visualize the features of each layer in DINOv2-S (Fig. 4) using the first channel of the visual tokens. To further explore feature differences across downstream tasks, we fine-tune the same DINOv2-S on segmentation and detection tasks by adding a linear head and a Mask R-CNN [33] head, respectively. As shown in Fig. 4, in the early layers (roughly layers 1-6), all three models exhibit similar feature patterns, focusing more on low-level features like texture and edges. This observation is also supported by [70], which demonstrates that ViT can learn low-level features through large-scale pretraining. In the later layers, by contrast, the features diverge across tasks. Specifically, for the original DINOv2 and segmentation features (rows 1 and 2), the focus shifts towards the semantic information of objects, whereas in the detection task, the feature attention gradually moves to object corners or edges (row 3, L7-L12). We attribute this phenomenon to the intrinsic characteristics of each task: DINOv2’s pretraining objective is to reconstruct missing parts of the original features, which requires semantic-level understanding, as the segmentation task does. In detection, the goal is to predict object bounding boxes, which necessitates focusing more on the corners. This phenomenon also highlights the difference between dense prediction and detection tasks.

Based on the findings, we divide layers of VFMs into two groups with similar features: a feature extractor for learning low-level features and a task-specific adapter for learning task-related features.

3.2 ViT-Split

The framework of ViT-Split is illustrated in Fig. 5, which includes three trainable components: a task head, a prior head and a fusion net. The task head, initialized with the last few layers of the VFM, is designed to learn task-specific features. The prior head integrates multi-scale prior features from the VFM, which are learned from large-scale, diverse datasets. Finally, the fusion net combines both task-specific and prior features to support various downstream tasks.

Figure 5: The framework of ViT-Split. ViT-Split introduces two splitting heads, one prior head for aggregating multi-scale prior features from VFM and a task head for learning task-specific features. These features are then combined using a fusion network, enabling effective performance across various downstream tasks.

When an input image with a shape of $H \times W$ is fed into a frozen VFM (e.g., DINOv2), $h \cdot w$ vision tokens with $D$ channels will be obtained from each layer. The vision tokens from the $(L-K_t)$-th layer are passed through a task head, which is copied from the last $K_t$ layers of the VFM, where $L$ is the number of the total layers. The task features are then reshaped to $h \times w \times D$. Meanwhile, $K_p$ layers of prior features from the frozen VFM are sampled using selection strategies, then concatenated and reshaped into a feature map of size $h \times w \times (K_p \cdot D)$. The feature map is then passed through a prior head, a two-layer CNN, resulting in a prior feature map of shape $h \times w \times D$. Finally, the task and prior feature maps are concatenated along the channel dimension and fused by a fusion net, which has a similar architecture to the prior head. The final fusion feature map is provided for different downstream heads.
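The shape bookkeeping in this paragraph can be traced with a small helper. This is only a sketch: the function name, the dictionary keys, and the example values (a patch size of 14, $D=384$ and $L=12$ as in a ViT-S/14 backbone, and $K_t=3$, $K_p=4$) are assumptions chosen for illustration, and the class token is ignored.

```python
def vit_split_shapes(H, W, patch, D, L, K_t, K_p):
    """Trace the tensor shapes described in the text (class token ignored)."""
    h, w = H // patch, W // patch          # token grid; patch size is an assumption
    return {
        "tokens_per_layer": (h * w, D),    # h*w vision tokens with D channels
        "task_head_input_layer": L - K_t,  # layer whose tokens feed the task head
        "task_map": (h, w, D),             # task features after reshaping
        "prior_concat": (h, w, K_p * D),   # K_p sampled layers concatenated
        "prior_map": (h, w, D),            # output of the prior head
        "fusion_input": (h, w, 2 * D),     # task and prior maps concatenated
        "fused_map": (h, w, D),            # output of the fusion net
    }

# Hypothetical configuration: a 518x518 input divides evenly by the patch size.
shapes = vit_split_shapes(H=518, W=518, patch=14, D=384, L=12, K_t=3, K_p=4)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

The only place where the channel dimension grows with a hyperparameter is the input to the prior head, which is why that head starts by compressing channels.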

Task Head. Based on the observation in Sec. 3.1 that early layers of VFMs are capable of learning low-level features which are similar for different tasks, we avoid fine-tuning the entire backbone by sharing these early layers. Meanwhile, to retain the prior features of the VFM, we replicate the final $K_t$ layers separately, utilizing them as a task-specific adapter for downstream tasks. The hyperparameter $K_t$ controls the adapter’s size, balancing between model capacity and training efficiency.

We observe that the benefits of increasing $K_t$ diminish, particularly for segmentation tasks, allowing us to choose a smaller $K_t$ to enhance efficiency (see hyper-parameter analysis in Appendix). Additionally, we find that a large segmentation head may be unnecessary, as the task-specific head is sufficient to capture the downstream dataset’s specific knowledge. Let the features from the $(L-K_t)$-th layer of the VFM be denoted as $f_{L-K_t}$. Consequently, the task-specific features are given by:

$$f_t = g_{\theta_t}(f_{L-K_t}), \tag{1}$$

where $g_{\theta_t}$ represents the task head. After obtaining the task feature $f_t \in \mathbb{R}^{(h \cdot w + 1) \times D}$, we drop the class token and reshape it from the sequence dimension to form a feature map $f'_t \in \mathbb{R}^{h \times w \times D}$.
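The token-to-map step can be illustrated on nested lists. The sketch assumes that the class token is stored first and that patch tokens follow in row-major order; neither detail is stated in the text, and the helper name is hypothetical.

```python
def tokens_to_map(tokens, h, w):
    """Drop the class token and reshape (h*w + 1) x D tokens into an h x w x D map."""
    if len(tokens) != h * w + 1:
        raise ValueError("expected one class token plus h*w patch tokens")
    patches = tokens[1:]  # assumption: the class token is the first token
    # Assumption: patch tokens are ordered row by row (row-major).
    return [patches[r * w:(r + 1) * w] for r in range(h)]

h, w, D = 2, 3, 4
# Token i is a D-dimensional vector filled with the value i, so positions are easy to track.
tokens = [[float(i)] * D for i in range(h * w + 1)]
fmap = tokens_to_map(tokens, h, w)
print(len(fmap), len(fmap[0]), len(fmap[0][0]))
print(fmap[0][0][0], fmap[1][2][0])
```

The same operation applies to each selected prior feature before the channel-wise concatenation.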

Prior Head. The prior features learned by VFMs have demonstrated strong performance across a range of downstream tasks [69, 65]. However, most current VFM adapters and PEFT methods modify these prior features during training. In contrast, our ViT-Split approach fully leverages the prior knowledge embedded in the multi-scale features of the VFM through a dedicated prior head. Our rationale for utilizing these prior features is to harness the knowledge learned by VFMs to enhance task-specific features while mitigating the risk of overfitting to downstream tasks.

Specifically, the architecture of the prior head is shown in Fig. 6, consisting of two CNN layers, a 1×1 convolution layer and a 3×3 deformable convolution layer. The 1×1 convolution layer is used to compress the channels of the multi-scale feature maps, providing efficiency when dealing with larger scales. Meanwhile, the deformable convolution layer [21] enhances low-level features and models geometric transformations within the feature map.

Layer Selection. How to select suitable prior features from all the VFM layers? To address this, we explore two techniques for selecting $K_p$ layers from a total of $L$ layers: uniform sampling and sparse gate. We delineate sparse gate in the Appendix. Uniform sampling involves selecting $K_p$ prior features uniformly from $L$ layers. This design is motivated by two factors: first, mitigating the high similarity between features of neighboring layers (see Fig. 3), and second, promoting greater diversity among the selected features. Specifically, the set of sampled indices, $\mathcal{S}$, is defined as follows:

$$\delta = \frac{L-b-1}{K_p-1}, \quad \mathcal{S} = \{\, b + \mathrm{round}(i \cdot \delta) \mid i = 0, \dots, K_p-1 \,\}, \tag{2}$$

where $b$ is the starting index, used to skip the first few layers, as these layers tend to contain more noise. In most experiments, we set $b=2$ or $b=3$. $\mathrm{round}$ indicates the rounding to the nearest integer, and $\delta$ represents the sampling interval.
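Eq. (2) is easy to check numerically. The sketch below assumes zero-based layer indices (the largest index produced is $L-1$) and resolves exact ties by rounding halves up, since the text only says "nearest integer"; the function name is hypothetical.

```python
import math

def uniform_layer_indices(L, K_p, b):
    """Sampled layer indices S from Eq. (2), assuming zero-based layer indexing."""
    if K_p < 2:
        raise ValueError("Eq. (2) divides by K_p - 1, so K_p must be at least 2")
    if not 0 <= b < L:
        raise ValueError("the starting index b must satisfy 0 <= b < L")
    delta = (L - b - 1) / (K_p - 1)  # sampling interval
    # floor(x + 0.5) rounds halves up; Python's built-in round() rounds halves to even.
    return [b + math.floor(i * delta + 0.5) for i in range(K_p)]

print(uniform_layer_indices(L=12, K_p=4, b=2))  # 12-layer backbone
print(uniform_layer_indices(L=24, K_p=4, b=3))  # 24-layer backbone
```

With $L=12$, $K_p=4$, $b=2$ the interval is $\delta=3$ and the selected indices are 2, 5, 8, 11, so the first index is always $b$ and the last is always the final layer.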

Figure 6: The illustration of the CNN fusion architecture. It is used to fuse multi-scale feature maps and serves as the architecture for both the prior head and fusion net. This module consists of two CNN layers: a 1×1 convolution layer followed by a 3×3 deformable convolution layer.

After obtaining the selected prior features $f^i_p \in \mathbb{R}^{(h \cdot w + 1) \times D}$, $i = 0, \dots, K_p-1$, we drop the class tokens, reshape and concatenate them to a multi-scale prior feature map $f_p \in \mathbb{R}^{h \times w \times (K_p \cdot D)}$. Finally, the aggregated prior map $f'_p \in \mathbb{R}^{h \times w \times D}$ can be denoted as:

$$f'_p = g_{\theta_p}(f_p), \tag{3}$$

where $g_{\theta_p}$ is the prior head.

Fusion net. Fusion net is utilized to fuse prior feature map $f'_p$ and the task-specific feature map $f'_t$ for different downstream tasks. This network has a similar architecture as the prior head (see Fig. 6). Let $[f'_p; f'_t] \in \mathbb{R}^{h \times w \times (2D)}$ be the concatenated feature map of $f'_p$ and $f'_t$ along the channel dimension. The rationale of using concatenation to fuse two feature maps is to preserve more information (see Tab. 6). The final fused map $f_o \in \mathbb{R}^{h \times w \times D}$ is given by:

$$f_o = g_{\theta_f}([f'_p; f'_t]), \tag{4}$$

where $g_{\theta_f}$ is the fusion net.

We then apply different transformations based on the type of downstream task. Specifically, for the segmentation task, we upsample $f_o$ by a factor of 4 using two transposed convolution layers. For the detection task, we transform $f_o$ into four scales, i.e., $4\times$, $2\times$, $1\times$ and $0.5\times$, to match the input requirements of the detection head (Mask R-CNN). For the VQA task, we reshape $f_o$ along the sequence dimension to $(h \cdot w) \times D$ for the LLM decoder.
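As a rough illustration of these task-specific transformations, the helper below lists the spatial sizes each downstream head would receive. It is a sketch under assumptions: the function name and example grid are hypothetical, channel counts after upsampling are not given in the text and are therefore omitted, and the 0.5× scale is floored for odd grid sizes.

```python
def downstream_output_shapes(h, w, D):
    """Spatial sizes handed to each downstream head, starting from an h x w x D fused map."""
    return {
        # Two transposed convolutions, 4x upsampling in total.
        "segmentation": (4 * h, 4 * w),
        # Four pyramid levels at 4x, 2x, 1x and 0.5x of the token grid.
        "detection": [(int(h * s), int(w * s)) for s in (4, 2, 1, 0.5)],
        # Flattened back into a token sequence for the LLM decoder.
        "vqa": (h * w, D),
    }

# Hypothetical 32x32 token grid with 768 channels.
outputs = downstream_output_shapes(h=32, w=32, D=768)
for task, shape in outputs.items():
    print(task, shape)
```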

4 Experiments

We conduct experiments on three tasks, semantic segmentation, object detection, and VQA, using well-established benchmarks, e.g., COCO [52], ADE20K [92], CityScapes [20], among others. We also present MDE results in the Appendix. Next, we perform ablation studies to further evaluate ViT-Split’s performance. A uniform selection strategy is applied to all experiments in this section, while results for the sparse gate are provided in the Appendix.

Table 1: Semantic segmentation results on the ADE20K val with 512×512 resolution image. “‡” represents the DINOv2 initialization. “†” denotes the use of ImageNet-22K pre-trained weight, while the default is to use ImageNet-1K pre-training.

Table 2: Comparison with previous SOTA semantic segmentation methods on ADE20K val with 896×896 resolution images. “‡” denotes initialization with DINOv2. “*” denotes an implementation that does not tune the whole backbone [65]. “MS” means multi-scale testing. “MM” indicates multi-modal pretraining.

Figure 7: Inference comparison: (a) Previous VFM adapters vs. (b) Our ViT-Split. ViT-Split is efficient during inference for multiple tasks.

4.1 Semantic segmentation

Settings. We conduct the semantic segmentation task on ADE20K [92] and Cityscapes [20], using MMSegmentation [19]. We employ AdamW [60] with a learning rate of 2e-4 and a weight decay of 1e-2. The training process uses a total batch size of 16. The learning rate for the task head is further reduced by a factor of 0.1. Unlike previous baselines, we use a simple linear head with two-layer deconvolutional blocks (×4) for segmentation, with a total of 40k iterations (50k for DINOv2-g). We provide the hyper-parameter analysis of $K_p$ and $K_t$ in the Appendix.
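One way to express the per-module learning rates is with optimizer parameter groups. The sketch below builds plain dictionaries in the per-parameter-group format accepted by `torch.optim.AdamW`; it is not the authors' MMSegmentation configuration, which is not shown in the text. The module name prefixes are hypothetical, and "reduced by a factor of 0.1" is read here as multiplying the base rate by 0.1.

```python
BASE_LR = 2e-4            # learning rate stated in the text
WEIGHT_DECAY = 1e-2       # weight decay stated in the text
TASK_HEAD_LR_SCALE = 0.1  # assumption: task-head lr = 0.1 * base lr

def build_param_groups(named_params, task_head_prefix="task_head."):
    """Split (name, parameter) pairs into two groups with different learning rates."""
    task, other = [], []
    for name, param in named_params:
        (task if name.startswith(task_head_prefix) else other).append(param)
    return [
        {"params": other, "lr": BASE_LR, "weight_decay": WEIGHT_DECAY},
        {"params": task, "lr": BASE_LR * TASK_HEAD_LR_SCALE,
         "weight_decay": WEIGHT_DECAY},
    ]

# Placeholder strings stand in for parameter tensors; the names are hypothetical.
groups = build_param_groups([
    ("task_head.blocks.0.weight", "p0"),
    ("prior_head.conv.weight", "p1"),
    ("fusion_net.conv.weight", "p2"),
])
for group in groups:
    print(group["lr"], group["params"])
```

With real modules, the same list could be passed as the first argument of `torch.optim.AdamW`, with `named_params` taken from the trainable modules only, since the frozen VFM contributes no parameters to the optimizer.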

ADE20K val with 512×512 image. As shown in Tab. 1, ViT-Split surpasses all other baselines on ADE20K with 512×512 resolution input images by fully leveraging the potential of the VFM. The results demonstrate the superiority of DINOv2 compared to ImageNet-pretrained models. Additionally, ViT-Split requires tuning only about 1/5 to 1/4 of the parameters and trains for just 1/4 of the iterations compared to previous baselines. The parameter efficiency stems from 1) the efficient adaptation architecture of ViT-Split and 2) the lightweight linear head. The fast convergence is attributed to the effective utilization of the prior knowledge embedded in VFMs. Moreover, compared to the baseline that fine-tunes the entire DINOv2, ViT-Split tunes only 1/4 to 1/2 of the parameters while achieving an average improvement of 2% across three model sizes. Since most tunable parameters come from the tuned head, which represents a small portion of the entire VFM, the overall parameter count for tuning remains low. The performance gains can be attributed to the utilization of the multi-scale prior features from the VFM.

Figure 8: Comparison of time complexity for VFM adapters on ADE20K using two different sizes of ViT: (a) ViT-S and (b) ViT-B. For a fair evaluation, we reimplemented the other adapters under the same conditions, i.e., 4×A6000 Ada, over 10,000 iterations.

ADE20K and Cityscapes val with 896×896 image. We also compare with other SOTA methods on ADE20K (Tab. 2) and Cityscapes (Tab. 3) using images of 896×896 resolution. As shown in Tab. 2, ViT-Split achieves results comparable to current SOTA methods on ADE20K val. It is worth mentioning that ViT-Split uses only a small linear head and does not rely on extra pretraining data. For a fair comparison, we benchmark against ViT-Adapter-G, which trains only the adapter and the Mask2Former head based on the DINOv2 backbone. ViT-Split not only delivers better performance but also requires half the training parameters and trains faster. Specifically, according to [65], training ViT-Adapter-G requires 16 V100 GPUs for 28 hours, whereas our ViT-Split-G takes only 8 A6000 Ada GPUs for 15.7 hours. Moreover, on the Cityscapes dataset (Tab. 3), ViT-Split outperforms ViT-Adapter with only around 1/6 of the parameters being tuned. The results suggest that a simple linear head is enough for competitive results on semantic segmentation by fully leveraging VFM prior knowledge.

Table 3: Semantic segmentation results on Cityscapes val with 896×896 resolution images. “†” indicates that the model is initialized with BEiTv2 and then pretrained on the Mapillary dataset. “‡” represents the use of DINOv2. “SS” denotes single-scale testing, and “MS” means multi-scale testing.

Method	LLM	Image Size	Sample Size (Pre)	Sample Size (Ft)	VQAv2 [27]	VizWiz [30]	LLaVA-Wild [55]	SciQA-IMG [61]	MM-Vet [85]	POPE [49] rand	POPE [49] pop	POPE [49] adv	MMB [56]
BLIP-2 [46]	Vicuna-13B	224²	129M	–	65.0	19.6	19.6	61	22.4	89.6	85.5	80.9	–
InstructBLIP [22]	Vicuna-7B	224²	129M	1.2M	–	34.5	34.5	60.5	26.2	–	–	–	36
InstructBLIP [22]	Vicuna-13B	224²	129M	1.2M	–	33.4	33.4	63.1	25.6	87.7	77	72	–
Shikra [10]	Vicuna-13B	224²	600K	5.5M	77.4∗	–	–	–	–	–	–	–	58.8
IDEFICS-9B [37]	LLaMA-7B	224²	353M	1M	50.9	35.5	35.5	–	–	–	–	–	48.2
IDEFICS-80B [37]	LLaMA-65B	224²	353M	1M	60.0	36	36.0	–	–	–	–	–	54.5
Qwen-VL [4]	Qwen-7B	448²	1.4B	50M	78.8∗	35.2	35.2	67.1	–	–	–	–	38.2
Qwen-VL-Chat [4]	Qwen-7B	448²	1.4B∗	50M	78.2∗	38.9	38.9	68.2	–	–	–	–	60.6
LLaVA-1.5 [54]	Vicuna-7B	336²	558K	665K	78.5∗	50.0∗	65.4	66.8	31.1	87.3	86.2	84.2	64.3
LLaVA-1.5 + ViT-Split	Vicuna-7B	336²	558K	665K	78.2 (-0.3)	51.7 (+1.7)	71.1 (+5.7)	70.4 (+3.6)	31.2 (+0.1)	88.5 (+1.2)	87.4 (+1.2)	86.1 (+1.9)	66.4 (+2.1)

Table 4: Comparison with different VLLM methods on VQA benchmarks. ViT-Split is integrated into the vision encoder (CLIP-L) of LLaVA-1.5 (7B), tuning the penultimate block and utilizing the prior feature from this layer. This adaptation enhances performance across most benchmarks, demonstrating the effectiveness and generalization of ViT-Split.

Time complexity analysis. As illustrated in Fig. 8, our ViT-Split achieves, on average, approximately 4× faster training speed for the small model and 3× faster for the base model compared to the other two VFM adapters. The slower training speed of the other adapters can be attributed to two factors: the early gradient backpropagation and the interaction between the CNN branch and the ViT. In contrast, our ViT-Split avoids backpropagating gradients to early layers, and reduces both the CNN branch computations and interaction overhead by fully leveraging the prior knowledge in the VFM. As shown in Fig. 7, traditional VFM adapters require training a task-specific VFM along with its corresponding adapter and head. In contrast, ViT-Split keeps the entire VFM frozen, training only a smaller adapter and the corresponding head. This design significantly reduces computational costs, making it more efficient for supporting multiple downstream tasks during inference.

4.2 Detection and Instance Segmentation

Settings. We present detection and instance segmentation results on COCO-2017 [52] in Tab. 5, using MMDetection [9]. The AdamW optimizer is employed with an initial learning rate of 1e-4 and a weight decay of 5e-2, training for 12 epochs (1× schedule). The total batch size is set to 16, and we use a Mask R-CNN [33] head for the experiments. The settings of $K_p$ and $K_t$ are given in the Appendix.

As shown in Tab. 5, our ViT-Split achieves performance comparable to the current SOTA VFM adapter, ViT-CoMer. As discussed in Sec. 3.1, the detection task may differ significantly from the original DINOv2 pretraining task, necessitating the tuning of more parameters. Despite this, ViT-Split still uses fewer parameters and trains faster (reducing training time by 42%) than ViT-CoMer, demonstrating the efficiency of our architecture.

Table 5: Object detection and instance segmentation using Mask R-CNN on COCO val2017. “†” indicates pre-training with ImageNet-22K and “‡” represents the use of DINOv2 [65], while the default setting uses ImageNet-1K pre-training.

4.3 Visual Question Answering

Settings. We also present VQA results using the popular visual large language model (VLLM) [50], LLaVA-1.5 [54]. This model comprises a CLIP-L visual encoder for encoding images, an MLP connector for projecting visual tokens into the language space, and a Vicuna-based LLM [17] for generating language tokens. In our modified LLaVA, we replace the original MLP projector with our ViT-Split. To comprehensively evaluate the effectiveness of ViT-Split, we use both academic-task-oriented benchmarks (VQAv2 [27], VizWiz [30], SciQA-IMG [61]) and instruction-following LLM benchmarks (POPE [49], MMBench [56], LLaVA-Wild [55], MM-Vet [85]). Following [54], we first pretrain our ViT-Split using 558K image-text pairs, and subsequently fine-tune both ViT-Split and the LLM with 665K mixed data pairs. For more details on the hyperparameter settings, please refer to the Appendix.

As shown in Tab. 4, our ViT-Split enhances LLaVA-1.5 performance across most benchmarks. This improvement demonstrates that ViT-Split is also applicable to other VFMs and VQA tasks. Unlike most current VLLMs that directly utilize features from the penultimate layer, ViT-Split leverages both the prior features of the vision encoder and the task-specific features, resulting in richer visual representations that improve the LLM’s learning process. Moreover, we tune only a small portion of the vision encoder’s parameters (specifically, one layer), which ensures efficiency for both training and inference. We believe that ViT-Split will offer new inspiration for VLLM design.

4.4 Ablation Study

We conduct an ablation study for each trainable component in Tab. 6 on ADE20K. The default settings are consistent with those described in Sec. 4.1.

The effectiveness of prior head. The results in Tab. 6 show that incorporating the prior head improves performance by 2.7% and 3.6% compared to the baseline that uses only the final-layer features. This suggests that the prior head effectively leverages multi-layer prior features from the VFM to enhance overall representation quality, surpassing the use of solely the final layer’s prior features. Additionally, our module enhances 2D local representations through the use of a CNN. Furthermore, the results demonstrate that the prior features extracted from the original VFM are highly valuable, achieving performance levels nearly equivalent to those obtained through full fine-tuning.

The effectiveness of task head. As shown in Tab. 6, by tuning only the task head $g_{\theta_t}$, the performance nearly matches that of fine-tuning the entire model, supporting the finding in Sec. 3.1: the last few layers can learn task-specific features and achieve performance similar to tuning the entire backbone. Furthermore, the experiments demonstrate that performance can be further enhanced when the task head is combined with prior features. We attribute this improvement to the combined benefits of task-specific and prior knowledge, with the latter helping to reduce task head overfitting.

The effectiveness of fusion head. Tab. 6 shows that using the fusion net $g_{\theta_f}$ yields a performance improvement of 1.1% for both ViT sizes. We attribute this enhancement to our CNN-based fusion module, which retains richer feature information compared to a simple addition operation. Again, the CNN component strengthens the local feature representation, contributing to improved fusion results.

Table 6: Ablation study of the prior head ($g_{\theta_p}$), task head ($g_{\theta_t}$), and fusion net ($g_{\theta_f}$) on ADE20K, conducted with two ViT sizes: small and base on ViT-Splitu. We set $K_t=3$ and $K_p=4$ for both model sizes. The baseline model (no modules used, shown without background color) uses only the frozen features from the last layer. The baseline with a gray background indicates full fine-tuning of the entire backbone. When only $g_{\theta_p}$ and $g_{\theta_t}$ are used, their features are combined via addition.

Table 7: Ablation study on the frozen layer selection strategies for our ViT-Split model on the ADE20K dataset, using three ViT sizes: small, base, and large. $K_p$ is the same for all strategies.

The effectiveness of uniform layer selection. In Tab. 7, we evaluate the effectiveness of the selection strategy for prior features. Compared to selecting features from only the last few layers, which capture mostly task-specific prior information, uniform selection allows for a more diverse set of prior features, encompassing both low-level and task-specific characteristics. This uniform selection approach becomes increasingly impactful as the backbone size grows.

The effectiveness across different VFMs. To evaluate the generality of our ViT-Split, we present results on various VFMs in Fig. 9, leveraging the excellent VFM-benchmark codebase (https://github.com/tue-mps/benchmark-vfm-ss). The experiments demonstrate that ViT-Split consistently enhances performance across both weakly-supervised VFMs (SAM and SigLip) and self-supervised VFMs (MAE). These results not only validate the effectiveness of ViT-Split on multiple VFMs but also suggest that our observations may hold for a broader range of VFMs.

Figure 9: Segmentation results ((a) mIoU) and (b) training parameters on ADE20K with different VFMs, including MAE-B [35], SAM-B [43] and SigLip-B [89]. We set $K_p=4$ and $K_t=8$ for all the VFMs.

5 Conclusion

In this paper, we introduce ViT-Split, an efficient, effective, and generalized adapter, to adapt VFMs for downstream tasks. Specifically, we introduce two heads based on a frozen VFM, a prior head for multi-scale prior feature extraction and a task head for task-specific feature adaptation. Experiments on segmentation, detection, MDE, and VQA verify the effectiveness and efficiency of our method. In the future, we aim to apply ViT-Split to more VFMs and tasks. We hope our method offers a fresh perspective for efficient and effective VFM adapter design.

Supplementary Material

Figure 10: Illustration of our proposed layer selection methods: uniform sampling (left) and sparse gate (right). Uniform sampling selects $K_p$ layers from $L$ prior features, ranging from the $b$-th to the $L$-th layer. The sparse gate, utilizing the STE technique (see Eq. 5), aggregates multiple layer features and filters out irrelevant ones.

Appendix A Training details

Table 8: Comparison of two layer selection methods on semantic segmentation. The results are obtained on Cityscapes val with 896×896 resolution images.

Table 9: Comparison of two layer selection methods on semantic segmentation. The results are obtained on ADE20K val with 512×512 resolution images.

A.1 Hyper-parameter setting

We outline the settings for several key hyperparameters of ViT-Split in Tab. 10, including weight initialization, the number of tuning layers ($K_t$), and the number of selected prior features ($K_p$). We conduct experiments across four tasks: semantic segmentation, monocular depth prediction, detection, and visual question answering (VQA).

The selection guideline of $K_t$, $K_p$ and $b$. As shown in Tab. 10, these hyperparameters vary across tasks, with their importance ranked as $K_t > K_p > b$. As shown in Fig. 11, $K_t$ is the most critical hyperparameter and is task-dependent. For dense prediction tasks (e.g., segmentation or monocular depth estimation), tuning a small fraction of the layers (around $1/6$ to $1/4$) yields good performance. For detection tasks, since the pretrained task differs significantly from detection (see Fig. 4), tuning more layers is necessary for better results. $K_p$ has a smaller impact on results compared to $K_t$, and $K_p=4$ works well in most cases. Typically, we set $b=2$ to sample prior features from both shallow and deep layers. However, for tasks like VQA, only the last-layer features are needed, as the LLM decoder benefits more from high-level features while low-level features may introduce noise.
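The uniform sampling rule described above (pick $K_p$ prior features between the $b$-th and the $L$-th layer) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `uniform_layer_indices`, the 1-indexed layer convention, the use of rounding to place the intermediate layers, and the choice to return the last layer when $K_p=1$ are all assumptions made for the example.

```python
from typing import List


def uniform_layer_indices(num_layers: int, k_p: int, b: int = 2) -> List[int]:
    """Pick k_p evenly spaced, 1-indexed layers between layer b and the last layer."""
    if not 1 <= b <= num_layers:
        raise ValueError("b must lie in [1, num_layers]")
    if not 1 <= k_p <= num_layers - b + 1:
        raise ValueError("k_p must lie in [1, num_layers - b + 1]")
    if k_p == 1:
        # Assumption: a single prior feature means the last layer (as in the VQA case).
        return [num_layers]
    step = (num_layers - b) / (k_p - 1)
    # Both endpoints (layer b and layer num_layers) are always included.
    return [round(b + i * step) for i in range(k_p)]


# A 12-layer ViT with K_p = 4 and b = 2 mixes shallow and deep layers.
print(uniform_layer_indices(12, 4, b=2))
# A 24-layer ViT with K_p = 1 keeps only the last layer.
print(uniform_layer_indices(24, 1, b=2))
```

With a start layer as small as $b=2$, the selected set spans almost the whole depth of the backbone, which matches the stated goal of sampling from both shallow and deep layers.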

Table 10: The settings of the important hyper-parameters of ViT-Split on different tasks, including semantic segmentation, monocular depth estimation, detection and instance segmentation, and visual question answering (VQA).

Table 11: Monocular depth estimation results on NYU-V2 with 416×544 resolution images. “‡” represents the use of DINOv2. Other backbones are initialized with ImageNet-1K/22K weights.

A.2 Sizes of various heads

We provide the sizes of the various heads used in ViT-Split for different tasks in Tab. 12, including segmentation (seg.), detection (det.) and monocular depth estimation (mde).

Table 12: The size of different heads used for ViT-Split.

A.3 Details of tuning the VLLM

LLaVA-1.5 employs a CLIP-based vision encoder for image encoding. We introduce a single-layer task head copied from CLIP’s original final layer (i.e., $K_t=1$) and utilize only the last-layer feature of CLIP as the input to the prior head (i.e., $K_p=1$). We replace the original MLP projector in LLaVA-1.5 with our ViT-Split for two-stage training. The training follows the same hyperparameter settings as the original LLaVA-1.5.

A.4 Architecture details of various used VFMs

We provide the architecture details of various VFMs used in the main content in Tab. 13.

Table 13: The architecture details of used VFMs.

Appendix B Layer selection

B.1 Sparse gate

Another way is to learn the sparse gate $G_{sp}\in\mathbb{R}^{L\times K_p}$ from the dataset. This method eliminates the need for carefully tuning hyperparameters to select prior features. To remove noisy features, we enforce sparsity in the gate by selecting the top $K_p$ scores and normalizing the remaining ones. However, directly optimizing $G_{sp}$ is infeasible since the sparsity operation is non-differentiable. To address this issue, we employ the Straight-Through Estimator (STE) technique, which allows for approximate gradient optimization. Specifically, let $G\in\mathbb{R}^{L\times K_p}$ be the learnable gate, which is continuous. From $G$, we obtain the sparse gate $G_{sp}$ by selecting the top $K_p$ elements in each column. We then apply STE by optimizing the gradient of $G$:

$$G_{sp} = G_{sp} + G - G_{no\_grad}. \tag{5}$$

After obtaining the sparse gate $G_{sp}\in\mathbb{R}^{L\times K_p}$, we can get the selected prior features by multiplying it with the prior feature map $\mathbf{f}_p\in\mathbb{R}^{h\times w\times L\times D}$ along the layer dimension.
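The sparse gate and the straight-through update in Eq. (5) can be illustrated with a small pure-Python sketch. Several details here are assumptions rather than the authors' implementation: the helper names are ours, the kept scores are normalized with a softmax (the text only says they are normalized), the number of kept entries `k` is left as a parameter, and the spatial dimensions $h\times w$ are dropped so that a single token with $L$ per-layer feature vectors is used. Plain Python has no autograd, so `straight_through` only reproduces the forward value of Eq. (5); in an autograd framework, $G_{no\_grad}$ would be a detached copy of $G$.

```python
import math
from typing import List

Matrix = List[List[float]]


def sparse_gate(gate: Matrix, k: int) -> Matrix:
    """Keep the top-k scores in each column of an L x K_p gate, softmax-normalize
    the kept scores, and set every other entry to zero."""
    num_layers, num_cols = len(gate), len(gate[0])
    sparse = [[0.0] * num_cols for _ in range(num_layers)]
    for j in range(num_cols):
        column = [gate[l][j] for l in range(num_layers)]
        kept = sorted(range(num_layers), key=lambda l: column[l], reverse=True)[:k]
        peak = max(column[l] for l in kept)  # subtract the max for numerical stability
        exps = {l: math.exp(column[l] - peak) for l in kept}
        total = sum(exps.values())
        for l in kept:
            sparse[l][j] = exps[l] / total
    return sparse


def straight_through(sparse: Matrix, gate: Matrix) -> Matrix:
    """Forward value of Eq. (5): G_sp + G - G_no_grad, which equals G_sp up to rounding."""
    gate_no_grad = [row[:] for row in gate]  # stands in for a detached copy of G
    return [[s + g - c for s, g, c in zip(rs, rg, rc)]
            for rs, rg, rc in zip(sparse, gate, gate_no_grad)]


def select_prior_features(sparse: Matrix, layer_feats: Matrix) -> Matrix:
    """Mix L per-layer feature vectors (L x D) into K_p selected features (K_p x D)."""
    num_layers, num_cols, dim = len(sparse), len(sparse[0]), len(layer_feats[0])
    return [[sum(sparse[l][j] * layer_feats[l][d] for l in range(num_layers))
             for d in range(dim)] for j in range(num_cols)]


gate = [[0.1, 2.0], [1.5, 0.3], [0.2, 0.1], [1.4, 1.9]]   # L = 4 layers, K_p = 2
g_sp = sparse_gate(gate, k=2)
feats = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 1.0]]  # one token, D = 2
selected = select_prior_features(straight_through(g_sp, gate), feats)
print(g_sp)
print(selected)
```

In this toy gate the third layer never ranks in the top two of any column, so its (deliberately outlying) features do not contribute to the selected prior features.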

B.2 Performance on segmentation task

We present a comparison of layer selection methods on segmentation benchmarks, including Cityscapes and ADE20K, in Tab. 8 and Tab. 9. For a fair comparison, we set the same $K_t$ for both selection methods and use $K_p=4$ for all sparse-gate-based experiments. Our results show that sparse gate selection achieves performance comparable to uniform sampling on segmentation tasks without requiring manual hyper-parameter selection. This indicates that sparse gate selection is a promising and versatile approach for reducing the number of hyper-parameters.

Appendix C Motivation of freezing the backbone

Freezing the backbone has three main motivations. ① Improved training and inference speed. Fig. 7 shows our ViT-Split achieves 2.4∼5× and 2∼6× speedups over other VFM-adapters in training and inference, respectively. Additionally, as detailed in Tab. 15, ViT-Split is 1.4∼3× faster than finetuning the entire backbone with a linear/UperNet head. ② Enhanced performance with prior features. We admit that the inference speed decreases compared with finetuning DINOv2-linear due to the extra heads (by around 30% on segmentation tasks). However, the performance can be further improved, which is also the main motivation of other VFM-adapters. Compared with these, ViT-Split achieves better training and inference efficiency. ③ Task adaptivity. ViT-Split requires storing only separate task-specific heads, rather than the entire model, making it more adaptive and memory-efficient for deployment across multiple tasks.
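The storage argument in ③ can be made concrete with a small back-of-the-envelope sketch. The function names and all parameter counts below are hypothetical placeholders chosen for illustration; they are not figures from the paper. The point is only the scaling behaviour: a shared frozen backbone is stored once, whereas full fine-tuning stores one backbone copy per task.

```python
def stored_params_full_finetune(backbone: float, head: float, num_tasks: int) -> float:
    """Each task keeps its own fine-tuned backbone copy plus its head."""
    return num_tasks * (backbone + head)


def stored_params_shared_frozen(backbone: float, task_modules: float, num_tasks: int) -> float:
    """One frozen backbone is shared; each task only adds its own heads."""
    return backbone + num_tasks * task_modules


# Hypothetical sizes in millions of parameters (illustrative, not from the paper).
BACKBONE_M, LINEAR_HEAD_M, SPLIT_HEADS_M = 300.0, 1.0, 80.0

for tasks in (1, 2, 4):
    full = stored_params_full_finetune(BACKBONE_M, LINEAR_HEAD_M, tasks)
    shared = stored_params_shared_frozen(BACKBONE_M, SPLIT_HEADS_M, tasks)
    print(f"{tasks} task(s): full fine-tuning {full:.0f}M, shared frozen backbone {shared:.0f}M")
```

Under these placeholder sizes, a single task is cheaper to store with full fine-tuning, while the shared frozen backbone becomes cheaper from two tasks onward; where the crossover falls depends entirely on the actual head sizes.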

Appendix D Explanation of the lower performance on detection task

We acknowledge that the performance difference between ViT-Split and ViT-CoMer on Mask R-CNN (Tab. 5) is relatively small. However, ViT-Split uses only 90%–95% of ViT-CoMer’s trainable parameters, already demonstrating clear advantages in training efficiency while maintaining comparable accuracy. The primary reason ViT-Split does not significantly outperform other VFM-adapters lies in the relatively weak task alignment of the prior features from DINOv2 for object detection tasks. Unlike DETR-style models, which are pre-trained with strong detection-oriented objectives, self-supervised models like DINOv2 tend to provide less directly transferable features for detection. This necessitates using more layers in the task head (i.e., a larger $K_t$), effectively making ViT-Split rely more on fine-tuning, similar to other VFM-adapters. As self-supervised models begin to offer stronger detection-aware priors, we expect ViT-Split to better leverage them and close the gap with current SOTA DETR-style models.

Appendix E More results

E.1 An apples-to-apples comparison with other VFM-adapters on segmentation

We provide an apples-to-apples comparison with the SOTA VFM-adapters in Tab. 14, i.e., ViT-CoMer [79] and ViT-Adapter [14]. All models are trained for 40K iterations on ADE20K, using a UperNet head for the baselines and a linear head for ViT-Split. For VFM-adapters, we adopt a learning rate schedule similar to that used in detection tasks, incorporating layer-wise decay with carefully tuned rates for each baseline to ensure strong performance. Results show that with DINOv2 initialization, ViT-Split consistently outperforms other VFM-adapters across different model sizes. This highlights ViT-Split’s ability to better leverage the strong prior knowledge from DINOv2 without altering the original feature representations; such alteration often results in suboptimal performance in other adapters.

Table 14: VFM-adapter comparison on ADE20K (40K iterations).

E.2 Hyper-parameter sensitivity analysis

We analyze two important hyper-parameters of ViT-Split, $K_t$ and $K_p$, in Fig. 11.

Influence of $K_t$. As shown in Fig. 11(a), the mIoU initially improves when tuning between one and three layers. This improvement is likely due to the task head previously underfitting the task. However, as more layers are tuned, overall performance begins to decline, suggesting that the task head starts to overfit. This experiment demonstrates that tuning additional layers does not necessarily guarantee better performance and can easily lead to overfitting. Therefore, we opt to tune three layers in this case.

Influence of $K_p$. As shown in Fig. 11(b), the mIoU peaks when selecting four prior layer features. Selecting too few layers may result in missing critical information, while selecting too many can introduce noise. Additionally, we observe that increasing the number of selected layers does not introduce additional training parameters, highlighting the efficiency of the prior head. As a result, we choose four prior features in this case.

Figure 11: Parameter sensitivity analysis of (a) the number of tuning layers $K_t$ and (b) the number of frozen layers $K_p$ in ViT-Split. The experiments are conducted using ViT-Split-S on ADE20K.

E.3 Visualization

E.3.1 CKA analysis of other VFMs

We also present the CKA results for MAE-L [35] and SAM-L [43] in Fig. 12. The feature representations in the early layers of these VFMs exhibit similar patterns, as do those in the later layers. Based on these findings as well as those in the main paper, we hypothesize that our observation, namely that the layers of several VFMs can be divided into two components, may hold true for self-supervised models pretrained on large-scale datasets (e.g., DINOv2 [65], MAE [35], EVA2 [25]), as well as weakly supervised ones (e.g., CLIP [69], SigLip [89], SAM [43]).
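For readers unfamiliar with CKA, the sketch below computes the linear variant of centered kernel alignment between two feature matrices in plain Python. Which CKA variant and which feature pooling the figures use is not stated here, so the linear formulation, the function names, and the toy inputs are assumptions for illustration only. Each matrix has one row per sample and one column per feature dimension; features that are constant across samples would make the denominator zero and are not handled.

```python
import math
from typing import List

Matrix = List[List[float]]


def center_columns(x: Matrix) -> Matrix:
    """Subtract the per-feature mean so every column has zero mean."""
    n = len(x)
    means = [sum(col) / n for col in zip(*x)]
    return [[v - m for v, m in zip(row, means)] for row in x]


def cross_frobenius_sq(x: Matrix, y: Matrix) -> float:
    """Squared Frobenius norm of x^T y for two matrices with the same number of rows."""
    total = 0.0
    for col_x in zip(*x):
        for col_y in zip(*y):
            total += sum(a * b for a, b in zip(col_x, col_y)) ** 2
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator


# Toy "layer features": 4 samples, 2 dimensions each.
feats_a = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
# feats_a rotated by 90 degrees and scaled by 3; linear CKA is invariant to both.
feats_b = [[-6.0, 3.0], [-3.0, 6.0], [-15.0, 9.0], [-9.0, 12.0]]
# An unrelated set of features for contrast.
feats_c = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

print(linear_cka(feats_a, feats_b))
print(linear_cka(feats_a, feats_c))
```

Evaluating this similarity for every pair of layers gives the kind of layer-by-layer matrix that CKA plots visualize, where blocks of high similarity indicate groups of layers with related representations.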

Figure 12: The CKA of SAM (a) and MAE (b). (c) Training comparison between ViT-Split-s and DINOv2-s-UperNet on ADE20K.

Figure 13: Further comparison of DINOv2-S layer features across original features, segmentation, and detection tasks. In each figure, the first, second, and third rows correspond to original, segmentation, and detection features, respectively. It can be observed that features from earlier layers exhibit similar patterns across different tasks, reflecting common low-level local features. However, features from deeper layers diverge significantly according to their specific downstream tasks.

Figure 14: Semantic segmentation and instance segmentation results based on our ViT-Split-L (left: original image, middle: semantic segmentation results, right: instance segmentation results).

E.3.2 CKA analysis of different DINOv2 sizes

We also provide the CKA visualizations of different DINOv2 sizes in Fig. 15. From these visualizations, we observe that features in the early layers are more similar across different DINOv2 sizes than those in the later layers. As mentioned earlier, the early layers serve as an encoder to capture low-level features, while the later layers act as a decoder to produce task-specific features.

Figure 15: The CKA visualizations of different sizes of DINOv2.

E.3.3 More layer feature comparison

We present additional visualizations of DINOv2 layer features across different tasks (i.e., DINOv2 pretraining, segmentation, and detection) in Fig. 13. These results demonstrate that earlier-layer features from various tasks consistently focus on detailed, low-level information. However, deeper-layer features diverge significantly between tasks. Specifically, features from both the original DINOv2 pretraining and semantic segmentation emphasize semantic-level information of particular objects, whereas detection features tend to highlight object corners and boundaries.

E.3.4 Semantic segmentation and instance segmentation results

We present semantic segmentation and instance segmentation results based on our ViT-Split-L (DINOv2 pretrained) in Fig. 14. We utilize ADE20K and COCO2017 datasets for training these two tasks, respectively, and evaluate both on the ADE20K validation dataset.

It is worth noting that both results are obtained using the same frozen DINOv2-L backbone, meaning only the task-specific adapters and heads require training. Consequently, the overall computational cost and the number of parameters are significantly reduced compared to previous VFM-adapters, while achieving competitive or superior performance. These visualizations demonstrate the strong generalization capability of ViT-Split, highlighting its versatility, effectiveness, and efficiency across multiple downstream tasks.

E.4 Training efficiency comparison

Table 15: Training time comparison on ADE20K (tuning for 10K iterations on 4×A6000 Ada GPUs). DINOv2-linear and DINOv2-UperNet are finetuned end to end.

To further illustrate the training efficiency relative to different heads on the segmentation task, we provide a training time comparison in Tab. 15. For a fair comparison, all baselines (i.e., all methods except ViT-Split) are finetuned end to end using the DINOv2 backbone with two different heads (linear and UperNet) for 10K iterations on 4×A6000 Ada GPUs.

From Tab. 15, we observe that ViT-Split reduces the training time of DINOv2-linear by approximately 42% on average while using the same linear head. This improvement in training efficiency is attributed to the task-head design, which prevents gradients from propagating to the early layers of the backbone. Compared to finetuning a VFM with a larger segmentation head (DINOv2-UperNet), ViT-Split is 2.5 times faster on average across three sizes. This highlights the large computational overhead introduced by a large segmentation head and demonstrates the efficiency of ViT-Split.
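The efficiency figures above mix two conventions: a percentage reduction in training time (42%) and a speedup factor (2.5 times faster). The short sketch below converts between the two so the numbers can be read on the same scale; the helper names are ours, and the conversion assumes both figures refer to wall-clock time for the same number of iterations.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional training-time reduction into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def reduction_from_speedup(speedup: float) -> float:
    """Convert a speedup factor (>= 1) into a fractional training-time reduction."""
    if speedup < 1.0:
        raise ValueError("speedup must be at least 1")
    return 1.0 - 1.0 / speedup


print(f"{speedup_from_reduction(0.42):.2f}x")  # 42% less time  -> prints 1.72x
print(f"{reduction_from_speedup(2.5):.0%}")    # 2.5x faster    -> prints 60%
```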

E.5 Longer training time

We extend the training schedule to explore the performance upper bound of ViT-Split, training for 160K iterations (Fig. 12 (c)). As shown in Fig. 12 (c), ViT-Split-s achieves 52.2%, improving from 51.5% at 40K iterations and surpassing DINOv2-s-UperNet (51.6%) while maintaining faster training. This demonstrates that ViT-Split can achieve better performance with longer training.

E.6 Monocular depth estimation

Settings. To further investigate the effectiveness of ViT-Split, we also provide results for monocular depth estimation (MDE) on the NYU-V2 [74] benchmark in Tab. 11. Following [51], we use the AdamW optimizer with an initial learning rate of 3e-4 and a weight decay of 1e-2. We multiply the learning rate of the task head by 0.1 during training. Moreover, a one-cycle learning rate decay schedule is used for better performance. We train ViT-Split for 384K iterations with a total batch size of 16 on 4×A6000 Ada GPUs.
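The per-module learning rates described in these settings can be expressed as optimizer parameter groups. The sketch below builds such groups from parameter names in plain Python; the function name, the `task_head.` prefix, and the example parameter names are hypothetical and do not come from the released code. The list-of-dicts layout with `params`, `lr`, and `weight_decay` keys mirrors the per-parameter options accepted by PyTorch optimizers such as `torch.optim.AdamW`, where `params` would hold tensors rather than names.

```python
from typing import Dict, List


def build_param_groups(param_names: List[str],
                       base_lr: float = 3e-4,
                       weight_decay: float = 1e-2,
                       task_head_lr_mult: float = 0.1,
                       task_head_prefix: str = "task_head.") -> List[Dict]:
    """Split parameters into two groups so the task head trains with a smaller learning rate."""
    task_head = [n for n in param_names if n.startswith(task_head_prefix)]
    others = [n for n in param_names if not n.startswith(task_head_prefix)]
    return [
        {"params": others, "lr": base_lr, "weight_decay": weight_decay},
        {"params": task_head, "lr": base_lr * task_head_lr_mult, "weight_decay": weight_decay},
    ]


# Hypothetical parameter names, for illustration only.
names = [
    "task_head.blocks.0.attn.qkv.weight",
    "prior_head.conv.weight",
    "fusion.conv.weight",
    "decode_head.linear.weight",
]
groups = build_param_groups(names)
for group in groups:
    print(f"lr={group['lr']:.1e}  params={group['params']}")
```

A smaller learning rate for the task head is a natural choice when its layers start from pretrained backbone weights, whereas the newly initialized heads can tolerate the full base rate.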

As shown in Tab. 11, our ViT-Split achieves competitive or even superior results compared to previous state-of-the-art methods, while using a minimal number of trainable parameters. Notably, ViT-Split employs only a single linear head rather than a specially designed head, highlighting the potential of our approach. Leveraging the prior knowledge embedded in vision foundation models (VFMs), we believe the size of the downstream task head (e.g., for depth prediction) can be further reduced to improve efficiency.

When compared to DINOv2-G with DPT [71], which uses the same DINOv2 initialization but a larger and more sophisticated head, our smaller ViT-Split-B version achieves similar performance with fewer parameters, demonstrating both the effectiveness and efficiency of our method. Furthermore, compared to traditional end-to-end fine-tuning approaches, ViT-Split achieves better performance by fully utilizing the prior knowledge inherent in VFMs. This also highlights the significant potential of large-scale self-supervised learning initialization over traditional supervised learning initialization.

E.7 Segmentation on Pascal Context

Settings. Apart from ADE20K and Cityscapes, we also provide results on Pascal Context [63] in Tab. 16. We use the AdamW optimizer with an initial learning rate of 1e-4 and a weight decay of 1e-2. We multiply the learning rate of the task head by 0.1 during training. We train our model for 20K iterations, and the total batch size is set to 16.

As shown in Tab. 16, our method outperforms ViT-Adapter, achieving a 2% improvement for the base model and a 0.3% improvement for the large model, using just a simple linear head and training for only 20K iterations. The results demonstrate the strength of VFMs, with our method achieving both effectiveness and efficiency by fully utilizing the prior knowledge within the VFMs.

Table 16: Semantic segmentation results on Pascal Context val with 480×480 resolution images. “†” indicates BEiT initialization and “‡” represents the use of DINOv2.

Appendix F Limitations

Currently, we have demonstrated the effectiveness of ViT-Split only on a limited set of VFMs, such as DINOv2 and CLIP, leaving its performance on a broader range of models to be explored in future work.

References

Agarwal and Arora [2023] Ashutosh Agarwal and Chetan Arora. Attention attention everywhere: Monocular depth prediction with skip attention. In WACV, pages 5861–5870, 2023.
Assran et al. [2023] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In CVPR, pages 15619–15629, 2023.
Awais et al. [2023] Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundational models defining a new era in vision: A survey and outlook. arXiv preprint arXiv:2307.13721, 2023.
Bai et al. [2023] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
Bao et al. [2022] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. In ICLR, 2022.
Bhat et al. [2021] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In CVPR, pages 4009–4018, 2021.
Brown [2020] Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, pages 9650–9660, 2021.
Chen et al. [2019] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
Chen et al. [2023a] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023a.
Chen et al. [2022a] Qiang Chen, Qiman Wu, Jian Wang, Qinghao Hu, Tao Hu, Errui Ding, Jian Cheng, and Jingdong Wang. Mixformer: Mixing features across windows and dimensions. In CVPR, pages 5249–5259, 2022a.
Chen et al. [2022b] Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: Adapting vision transformers for scalable visual recognition. In NeurIPS, pages 16664–16678, 2022b.
Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607, 2020.
Chen et al. [2023b] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. In ICLR, 2023b.
Chen et al. [2024] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, pages 24185–24198, 2024.
Cheng et al. [2022] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, pages 1290–1299, 2022.
Chiang et al. [2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.
Chu et al. [2021] Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers. In NeurIPS, pages 9355–9366, 2021.
Contributors [2020] MMSegmentation Contributors. Mmsegmentation: Openmmlab semantic segmentation toolbox and benchmark, 2020.
Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016.
Dai et al. [2017] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, pages 764–773, 2017.
Dai et al. [2023] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.
Dehghani et al. [2023] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In ICML, pages 7480–7512, 2023.
Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
Fang et al. [2023] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In CVPR, pages 19358–19369, 2023.
Fu et al. [2018] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, pages 2002–2011, 2018.
Goyal et al. [2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, pages 6904–6913, 2017.
Grill et al. [2020] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. In NeurIPS, pages 21271–21284, 2020.
Gui et al. [2024] Jie Gui, Tuo Chen, Jing Zhang, Qiong Cao, Zhenan Sun, Hao Luo, and Dacheng Tao. A survey on self-supervised learning: Algorithms, applications, and future trends. IEEE TPAMI, 2024.
Gurari et al. [2018] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, pages 3608–3617, 2018.
Han et al. [2023] Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, and Gao Huang. Flatten transformer: Vision transformer using focused linear attention. In ICCV, 2023.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, pages 2961–2969, 2017.
He et al. [2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020.
He et al. [2022] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, pages 16000–16009, 2022.
Hu et al. [2022] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2022.
IDEFICS [2023] IDEFICS. Introducing idefics: An open reproduction of state-of-the-art visual language model. https://huggingface.co/blog/idefics, 2023.
Jain et al. [2023] Jitesh Jain, Jiachen Li, Mang Tik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. Oneformer: One transformer to rule universal image segmentation. In CVPR, pages 2989–2998, 2023.
Jia et al. [2021] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916, 2021.
Jia et al. [2022] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In ECCV, pages 709–727, 2022.
Jie and Deng [2022] Shibo Jie and Zhi-Hong Deng. Convolutional bypasses are better vision transformer adapters. arXiv preprint arXiv:2207.07039, 2022.
Jie and Deng [2023] Shibo Jie and Zhi-Hong Deng. Fact: Factor-tuning for lightweight adaptation on vision transformer. In AAAI, pages 1060–1068, 2023.
Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, pages 4015–4026, 2023.
Kornblith et al. [2019] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In ICML, pages 3519–3529, 2019.
Li et al. [2023a] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In CVPR, pages 3041–3050, 2023a.
Li et al. [2023b] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742, 2023b.
Li et al. [2021] Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollar, Kaiming He, and Ross Girshick. Benchmarking detection transfer learning with vision transformers. arXiv preprint arXiv:2111.11429, 2021.
Li et al. [2022] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In ECCV, pages 280–296, 2022.
Li et al. [2023c] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023c.
Li et al. [2025] Yifan Li, Zhixin Lai, Wentao Bao, Zhen Tan, Anh Dao, Kewei Sui, Jiayi Shen, Dong Liu, Huan Liu, and Yu Kong. Visual large language models for generalized and specialized applications. arXiv preprint arXiv:2501.02765, 2025.
Li et al. [2024] Zhenyu Li, Xuyang Wang, Xianming Liu, and Junjun Jiang. Binsformer: Revisiting adaptive bins for monocular depth estimation. IEEE TIP, 2024.
Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
Liu et al. [2023] Ce Liu, Suryansh Kumar, Shuhang Gu, Radu Timofte, and Luc Van Gool. Va-depthnet: A variational approach to single image depth prediction. In ICLR, 2023.
Liu et al. [2024a] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024a.
Liu et al. [2024b] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2024b.
Liu et al. [2024c] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In ECCV, pages 216–233, 2024c.
Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021.
Liu et al. [2022a] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In CVPR, pages 12009–12019, 2022a.
Liu et al. [2022b] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, pages 11976–11986, 2022b.
Loshchilov [2017] I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
Lu et al. [2022] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS, 2022.
Mercea et al. [2024] Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, and Anurag Arnab. Time-memory-and parameter-efficient visual adaptation. In CVPR, pages 5536–5545, 2024.
Mottaghi et al. [2014] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, pages 891–898, 2014.
Ning et al. [2023] Jia Ning, Chen Li, Zheng Zhang, Chunyu Wang, Zigang Geng, Qi Dai, Kun He, and Han Hu. All in tokens: Unifying output space of visual tasks via soft token. In ICCV, pages 19900–19910, 2023.
Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2023.
Pan et al. [2022] Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hongsheng Li. St-adapter: Parameter-efficient image-to-video transfer learning. In NeurIPS, pages 26462–26477, 2022.
Patil et al. [2022] Vaishakh Patil, Christos Sakaridis, Alexander Liniger, and Luc Van Gool. P3depth: Monocular depth estimation with a piecewise planarity prior. In CVPR, pages 1610–1621, 2022.
Piccinelli et al. [2023] Luigi Piccinelli, Christos Sakaridis, and Fisher Yu. idisc: Internal discretization for monocular depth estimation. In CVPR, pages 21477–21487, 2023.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
Raghu et al. [2021] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? NeurIPS, 34:12116–12128, 2021.
Ranftl et al. [2021] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, pages 12179–12188, 2021.
Rebuffi et al. [2017] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In NeurIPS, 2017.
Shao et al. [2024] Shuwei Shao, Zhongcai Pei, Xingming Wu, Zhong Liu, Weihai Chen, and Zhengguo Li. Iebins: Iterative elastic bins for monocular depth estimation. In NeurIPS, 2024.
Silberman et al. [2012] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, pages 746–760, 2012.
Tan and Le [2019] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105–6114, 2019.
Touvron et al. [2021] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, pages 10347–10357, 2021.
Wang et al. [2022] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3):415–424, 2022.
Wang et al. [2023] Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In CVPR, pages 14408–14419, 2023.
Xia et al. [2024] Chunlong Xia, Xinliang Wang, Feng Lv, Xin Hao, and Yifeng Shi. Vit-comer: Vision transformer with convolutional multi-scale feature interaction for dense predictions. In CVPR, pages 5493–5502, 2024.
Xiao et al. [2018] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, pages 418–434, 2018.
Yang et al. [2021] Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. Focal self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641, 2021.
Yang et al. [2024] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In CVPR, pages 10371–10381, 2024.
Yu et al. [2024] Bruce XB Yu, Jianlong Chang, Haixin Wang, Lingbo Liu, Shijie Wang, Zhiyu Wang, Junfan Lin, Lingxi Xie, Haojie Li, Zhouchen Lin, et al. Visual tuning. ACM Computing Surveys, 56(12):1–38, 2024.
Yu et al. [2022] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022.
Yu et al. [2023] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
Yuan et al. [2022] Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. Neural window fully-connected crfs for monocular depth estimation. In CVPR, pages 3916–3925, 2022.
Yun et al. [2023] Guhnoo Yun, Juhan Yoo, Kijung Kim, Jeongho Lee, and Dong Hwan Kim. Spanet: Frequency-balancing token mixer using spectral pooling aggregation modulation. In ICCV, pages 6113–6124, 2023.
Zaken et al. [2022] Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In ACL, pages 1–9, 2022.
Zhai et al. [2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023.
Zhang et al. [2025] Dong Zhang, Rui Yan, Pingcheng Dong, and Kwang-Ting Cheng. Memory efficient transformer adapter for dense predictions. In ICLR, 2025.
Zhao et al. [2023] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
Zhou et al. [2019] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. IJCV, 127:302–321, 2019.
Zhou et al. [2023] Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. arXiv preprint arXiv:2302.09419, 2023.
Zhou et al. [2022a] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. Image bert pre-training with online tokenizer. In ICLR, 2022a.
Zhou et al. [2022b] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, pages 16816–16825, 2022b.
Zhou et al. [2022c] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. IJCV, 130(9):2337–2348, 2022c.
