Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

Yifan Li¹, Xin Li², Tianqin Li²,³, Wenbin He², Yu Kong¹, Liu Ren²
¹Michigan State University
²Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI)
³Carnegie Mellon University
{liyifa11, yukong}@msu.edu, {xin.li9, tianqin.li2, Wenbin.He2, liu.ren}@us.bosch.com

Abstract

Vision foundation models (VFMs) have demonstrated remarkable performance across a wide range of downstream tasks. While several VFM adapters have shown promising results by leveraging the prior knowledge of VFMs, we identify two inefficiencies in these approaches. First, the interaction between the convolutional neural network (CNN) branch and the VFM backbone triggers gradient backpropagation through the early layers. Second, existing methods require tuning all components, adding complexity. Besides, these adapters alter VFM features, underutilizing the prior knowledge. To tackle these challenges, we propose a new approach called ViT-Split, based on a key observation: the layers of several VFMs, like DINOv2, can be divided into two components: an extractor for learning low-level features and an adapter for learning task-specific features. Leveraging this insight, we eliminate the CNN branch and introduce two heads, a task head and a prior head, to the frozen VFM. The task head is designed to learn task-specific features, mitigating the early gradient propagation issue. The prior head is used to leverage the multi-scale prior features from the frozen VFM, reducing tuning parameters and overfitting. Extensive experiments on various tasks (e.g., segmentation, detection, depth estimation, and visual question answering) validate the effectiveness and efficiency of ViT-Split. Specifically, ViT-Split reduces training time by up to 4× while achieving comparable or even better results on ADE20K, compared to other VFM adapters. Code is available at https://jackyfl.github.io/vitsplit.github.io/.

1 Introduction

Recent studies reveal that foundation models have a remarkable ability to acquire prior knowledge from large-scale datasets [93], which enhances performance in downstream tasks. For vision tasks, vision foundation models (VFMs) acquire prior knowledge from large-scale datasets through self-supervised learning [29], utilizing techniques such as masked image modeling (MIM) [35, 94, 5], contrastive learning [13, 34, 8, 28], or hybrid approaches (MIM + contrastive) [65, 2]. They also leverage vision-language alignment [69, 25] and dense prediction tasks [43, 82], among others. VFMs exhibit remarkable zero-shot and transfer learning capabilities across a variety of downstream tasks, e.g., classification, detection, segmentation, monocular depth estimation (MDE), and visual question answering (VQA).

(a) Previous VFM adapters.

(b) Ours (ViT-Split).

Figure 1: Comparison between previous VFM adapters and ours. Previous VFM adapters integrate low-level features learned by a CNN branch into a learnable VFM through an adapter. Our method exploits VFM prior knowledge with two heads: a prior head for multi-scale prior feature learning from a frozen VFM, and a task head for task-specific feature learning, initialized by the last few layers of the VFM.

To leverage prior knowledge from VFMs, previous VFM adapters such as ViT-Adapter [14] or ViT-CoMer [79] primarily adopt a two-branch architecture (see Fig. 1(a)). Such a design enables the adapter to integrate low-level features from a convolutional neural network (CNN) with global features from a vision transformer (ViT)-based VFM. While this architecture has demonstrated promising results across various downstream tasks, certain design aspects may affect training efficiency. From Fig. 1(a), we identify two main issues of inefficiency. First, the interaction between the CNN and ViT branches across multiple stages requires gradients to be back-propagated through all layers of the model during training. This results in increased computational and memory costs as the size of the VFM grows. Second, all components need to be tuned during training to achieve optimal performance. Specifically, for tasks like segmentation, a large head such as Mask2Former [16] is tuned, and its size is nearly equivalent to that of the VFM backbone.

To address the training inefficiency issue, parameter-efficient fine-tuning (PEFT) methods are proposed to reduce training parameters. These methods include prompt-tuning approaches like VPT [40], adapter-based methods like AdaptFormer [12], and low-rank weight tuning like LoRA [36] or FacT [42]. However, these methods still encounter the issue of early-layer gradient back-propagation, as learnable parameters are appended to each layer’s visual tokens (prompt tuning), or low-rank weights are inserted into the layers (adapter-based methods) or added to the original weights (low-rank weight tuning). Moreover, these PEFT methods do not incorporate low-level features as VFM adapters do, and their performance is either slightly inferior to or generally on par with traditional fine-tuning. Furthermore, despite their proven effectiveness across various tasks [65], the pretrained prior features are not fully leveraged by either PEFT methods or the VFM adapters.

Figure 2: Comparison with previous VFM adapters (ViT-Adapter [14] and ViT-CoMer [79]) on ADE20K val. The results indicate that by leveraging the potential of VFMs (DINOv2 in this task), ViT-Split can achieve competitive results compared to previous VFM adapters. Notably, ViT-Split accomplishes this with only a single linear head and a small number of trainable parameters.

To tackle the aforementioned challenges, we propose a method called ViT-Split (see Fig. 1(b)). ViT-Split is built upon the observation that the layers of a VFM like DINOv2 [65] can be divided into two components: a low-level feature extractor and a task-specific feature adapter. Consequently, an additional CNN branch for local feature extraction becomes unnecessary, allowing us to remove it to resolve the early layer gradient propagation issue. Additionally, we propose a task-specific adapter, named “task head”, tailored for downstream tasks. This adapter is initialized from the last few layers of the VFM, further avoiding gradient propagation problems in early layers. To effectively leverage prior features learned by the VFM from large-scale datasets, we introduce an additional “prior head” that integrates multi-scale prior features instead of tuning the entire VFM. Such a head reduces the number of trainable parameters and helps mitigate overfitting in the task head (see Appendix). Additionally, we explore two layer selection strategies to identify the most relevant layer features. Experiments on the segmentation task (see Fig. 2) demonstrate that our ViT-Split, using only a single linear head, can achieve competitive performance compared with previous VFM adapters with larger segmentation heads like Mask2Former [16] or UperNet [80], while tuning fewer parameters and reducing training time (see Fig. 8).

Furthermore, ViT-Split is both adaptive and memory efficient for multiple tasks (see Fig. 7). Previous VFM adapters require separate modules (VFM+CNN+adapter+heads) for each task, leading to high computational and memory overhead. In contrast, ViT-Split shares a pre-trained VFM backbone, requiring only a task-specific adapter and the corresponding task head to be learned. Our approach introduces a new paradigm for designing both computation and memory efficient VFM adapters across multiple tasks. In summary, the contributions of this paper are threefold:

2.1 Vision foundation models

Vision foundation models (VFMs) [3] are trained on large-scale datasets in a self-supervised, weakly-supervised, or supervised manner, making them adaptable to a wide range of downstream tasks. Benefiting from the scalability of the transformer architecture, recent ViT-based [24] VFMs demonstrate remarkable zero-shot and transfer ability across various downstream tasks. The self-supervised pretraining paradigm learns discriminative features solely from vision data at the image and pixel level, including contrastive learning (MoCo [34], SimCLR [13]), masked image modeling (BEiT [5], MAE [35], iBOT [94]), and hybrid approaches (DINOv2 [65], I-JEPA [2]). The weakly-supervised pretraining paradigm leverages text guidance, aligning visual representations with the language space, as in CLIP [69], ALIGN [39], EVA2 [25], and SigLip [89]. The supervised pretraining paradigm learns from different task labels, such as classification (DeiT [76]), segmentation (SAM [43]), and monocular depth estimation (DAM [82]).

2.2 PEFT and VFM adapters

As the size of transformer-based foundation models continues to grow, such as large language models in language [7, 91], large vision models in vision [23, 84], and multi-modal large language model [15, 4] for multi-modal learning, training efficiency becomes increasingly crucial. To address this challenge, PEFT methods have gained significant popularity in recent years.

Current PEFT approaches for vision [83] generally fall into three categories: prompt tuning, adapter tuning, and parameter tuning. Prompt tuning involves learning a small number of prompt tokens, either in the first layer (CoOp [96], CoCoOp [95]) or in every layer (VPT [40]), making it lightweight and easy to implement. Adapter tuning inserts additional blocks into a frozen model either in a sequential manner (Res-adapt [72], ST-Adapter [66]) or in parallel (AdaptFormer [12], ConvPass [41], LoSA [62]), which shows good adaptability and generalizability. Parameter tuning modifies part of the model parameters, either by adjusting the weight (LoRA [36], FacT [42]) or tuning the bias (Bitfit [88]), resulting in effective and efficient tuning.

Current VFM adapters (ViT-Adapter [14], ViT-CoMer [79]) aim to enhance full fine-tuning performance by incorporating the inductive bias from the CNN branch with spatial prior. These adapters typically require tuning the whole backbone to achieve optimal performance, resulting in better performance than PEFT methods. The interaction between CNN and ViT features is achieved through cross-attention [14], self-attention [79] or mixed [90] across several layers. By contrast, our ViT-Split keeps the entire backbone frozen, introducing two lightweight heads for separate tuning, which is efficient and effective across various tasks.

3 Method

3.1 The observation in VFMs

We observe that in some VFMs, the layers can be broadly partitioned into two groups with similar features: the earlier and later layers. First, we plot the Centered Kernel Alignment (CKA) [44] across different layers for several VFMs, as shown in Fig. 3. The results reveal that features in the earlier layers are more similar to each other, as are those in the later layers, particularly in DINOv2 [65]. We attribute this phenomenon to the “encoder-decoder” architecture intrinsic to VFMs: the earlier layers function as an encoder (feature extractor) to capture features from the visual data, while the later layers act as a decoder (task-specific adapter) that generates features for downstream tasks.
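
As a rough illustration of how such a layer-similarity plot can be computed, the sketch below implements linear CKA between per-layer feature matrices. It is a minimal example under stated assumptions: the text does not say which CKA kernel is used, so the linear variant is assumed here, and `linear_cka` and `cka_matrix` are hypothetical helper names rather than functions from the authors' code.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices of shape (n_samples, dim)."""
    # Center every feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cka_matrix(layer_feats):
    """Pairwise linear CKA for a list of per-layer features, each (n_samples, dim)."""
    n = len(layer_feats)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            out[i, j] = out[j, i] = linear_cka(layer_feats[i], layer_feats[j])
    return out
```

Passing one flattened token matrix per layer to `cka_matrix` yields a symmetric L×L matrix with ones on the diagonal; block-like structure in that matrix would correspond to the two layer groups described above.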

Figure 3: The CKA comparison of layer features across different VFMs, including a self-supervised method DINOv2-L [65], and three image-text alignment methods EVA2-L [25], CLIP-L [69] and SigLip-L [89]. For most of these VFMs, especially DINOv2, the features in the early and later layers show distinct similarities within their respective groups.

Figure 4: Comparison of DINOv2-S layer features across different tasks, including pretraining (org.), segmentation (seg.), and detection (det.). Notably, the segmentation and detection models are fine-tuned from the DINOv2-S. The features within the red dotted boxes across the three tasks exhibit similar patterns, emphasizing detailed representations. In the later layers, however, the features diverge, becoming more specialized for each task.

A research question is raised: what do these two groups of layers actually learn? To answer this question, we visualize the features of each layer in DINOv2-S (Fig. 4) using the first channel of the visual tokens. To further explore feature differences across downstream tasks, we fine-tune the same DINOv2-S on segmentation and detection tasks by adding a linear head and a Mask R-CNN [33] head, respectively. As shown in Fig. 4, in the early layers (e.g., layers 1-6), all three models exhibit similar feature patterns, focusing more on low-level features like texture and edges. This observation is also supported by [70], which demonstrates that ViT can learn low-level features through large-scale pretraining. In the later layers, by contrast, the features diverge across tasks. Specifically, for the original DINOv2 and segmentation features (rows 1 and 2), the focus shifts towards the semantic information of objects, whereas in the detection task, the feature attention gradually moves to object corners or edges (row 3, L7-L12). We attribute this phenomenon to the intrinsic characteristics of each task: DINOv2’s pretraining objective is to reconstruct missing parts of the original features, which requires semantic-level understanding, as the segmentation task does. In detection, the goal is to predict object bounding boxes, which necessitates focusing more on the corners. This phenomenon also highlights the difference between dense prediction and detection tasks.

Based on the findings, we divide layers of VFMs into two groups with similar features: a feature extractor for learning low-level features and a task-specific adapter for learning task-related features.

3.2 ViT-Split

The framework of ViT-Split is illustrated in Fig. 5, which includes three trainable components: a task head, a prior head and a fusion net. The task head, initialized with the last few layers of the VFM, is designed to learn task-specific features. The prior head integrates multi-scale prior features from the VFM, which are learned from large-scale, diverse datasets. Finally, the fusion net combines both task-specific and prior features to support various downstream tasks.

Figure 5: The framework of ViT-Split. ViT-Split introduces two splitting heads, one prior head for aggregating multi-scale prior features from VFM and a task head for learning task-specific features. These features are then combined using a fusion network, enabling effective performance across various downstream tasks.

When an input image with a shape of $H \times W$ is fed into a frozen VFM (e.g., DINOv2), $h \cdot w$ vision tokens with $D$ channels will be obtained from each layer. The vision tokens from the $(L - K_t)$-th layer are passed through a task head, which is copied from the last $K_t$ layers of the VFM, where $L$ is the total number of layers. The task features are then reshaped to $h \times w \times D$. Meanwhile, $K_p$ layers of prior features from the frozen VFM are sampled using selection strategies, then concatenated and reshaped into a feature map of size $h \times w \times (K_p \cdot D)$. The feature map is then passed through a prior head, a two-layer CNN, resulting in a prior feature map of shape $h \times w \times D$. Finally, the task and prior feature maps are concatenated along the channel dimension and fused by a fusion net, which has a similar architecture to the prior head. The final fusion feature map is provided for different downstream heads.

Task Head. Based on the observation in Sec. 3.1 that early layers of VFMs are capable of learning low-level features which are similar for different tasks, we avoid fine-tuning the entire backbone by sharing these early layers. Meanwhile, to retain the prior features of the VFM, we replicate the final $K_t$ layers separately, utilizing them as a task-specific adapter for downstream tasks. The hyperparameter $K_t$ controls the adapter’s size, balancing between model capacity and training efficiency.

We observe that the benefits of increasing $K_t$ diminish, particularly for segmentation tasks, allowing us to choose a smaller $K_t$ to enhance efficiency (see hyper-parameter analysis in Appendix). Additionally, we find that a large segmentation head may be unnecessary, as the task-specific head is sufficient to capture the downstream dataset’s specific knowledge. Let the features from the $(L - K_t)$-th layer of the VFM be denoted as $f_{L-K_t}$. Consequently, the task-specific features are given by:

$$f_t = g_{\theta_t}(f_{L-K_t}), \tag{1}$$

where $g_{\theta_t}$ represents the task head. After obtaining the task feature $f_t \in \mathbb{R}^{(h \cdot w + 1) \times D}$, we drop the class token and reshape it from the sequence dimension to form a feature map $f'_t \in \mathbb{R}^{h \times w \times D}$.
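
The task-head computation in Eq. (1) can be sketched with toy blocks as follows. This is only an illustration under assumptions: `ToyBlock` is a hypothetical stand-in for a transformer layer, the class token is assumed to sit at index 0 of the token sequence, and nothing here reflects the authors' actual implementation.

```python
import copy

import numpy as np


class ToyBlock:
    """Hypothetical stand-in for a transformer block: a residual linear map."""

    def __init__(self, dim, rng):
        self.weight = rng.normal(scale=0.02, size=(dim, dim))

    def __call__(self, tokens):
        return tokens + tokens @ self.weight


def build_task_head(vfm_blocks, k_t):
    # Deep-copy the last K_t blocks (k_t >= 1) so that tuning the copy
    # leaves the frozen VFM weights untouched.
    return copy.deepcopy(vfm_blocks[-k_t:])


def task_features(tokens, vfm_blocks, task_head, h, w):
    k_t = len(task_head)
    # Frozen part: the first L - K_t blocks produce f_{L-K_t}.
    for block in vfm_blocks[: len(vfm_blocks) - k_t]:
        tokens = block(tokens)
    # Trainable part: f_t = g_{theta_t}(f_{L-K_t}).
    for block in task_head:
        tokens = block(tokens)
    # Drop the class token (assumed at index 0) and reshape to h x w x D.
    return tokens[1:].reshape(h, w, tokens.shape[-1])
```

Because the head starts as an exact copy of the last $K_t$ layers, its output before any training equals that of the original VFM; only the copied layers would receive gradients, so nothing needs to be back-propagated through the first $L - K_t$ layers.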

Prior Head. The prior features learned by VFMs have demonstrated strong performance across a range of downstream tasks [69, 65]. However, most current VFM adapters and PEFT methods modify these prior features during training. In contrast, our ViT-Split approach fully leverages the prior knowledge embedded in the multi-scale features of the VFM through a dedicated prior head. Our rationale for utilizing these prior features is to harness the knowledge learned by VFMs to enhance task-specific features while mitigating the risk of overfitting downstream tasks.

Specifically, the architecture of the prior head is shown in Fig. 6, consisting of two CNN layers, a 1×1 convolution layer and a 3×3 deformable convolution layer. The 1×1 convolution layer is used to compress the channels of the multi-scale feature maps, providing efficiency when dealing with larger scales. Meanwhile, the deformable convolution layer [21] enhances low-level features and models geometric transformations within the feature map.

Layer Selection. How to select suitable prior features from all the VFM layers? To address this, we explore two techniques for selecting $K_p$ layers from a total of $L$ layers: uniform sampling and sparse gate. We delineate sparse gate in the Appendix. Uniform sampling involves selecting $K_p$ prior features uniformly from $L$ layers. This design is motivated by two factors: first, mitigating the high similarity between features of neighboring layers (see Fig. 3), and second, promoting greater diversity among the selected features. Specifically, the set of sampled indices, $\mathcal{S}$, is defined as follows:

$$\delta = \frac{L - b - 1}{K_p - 1}, \quad \mathcal{S} = \{\, b + \mathrm{round}(i \cdot \delta) \mid i = 0, \ldots, K_p - 1 \,\}, \tag{2}$$

where $b$ is the starting index, used to skip the first few layers, as these layers tend to contain more noise. In most experiments, we set $b = 2$ or $b = 3$. $\mathrm{round}$ indicates rounding to the nearest integer, and $\delta$ represents the sampling interval.
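
A small sketch of the uniform sampling rule in Eq. (2) is given below. The function name `uniform_layer_indices` is hypothetical, and two details are assumptions: ties are rounded half up (the text only says "nearest integer"), and the degenerate case $K_p = 1$, for which Eq. (2) is undefined, simply returns the starting index.

```python
import math


def uniform_layer_indices(num_layers: int, k_p: int, b: int = 2) -> list:
    """Return the K_p sampled layer indices b + round(i * delta), i = 0..K_p-1."""
    if k_p == 1:
        # Assumption: Eq. (2) divides by K_p - 1, so fall back to the start index.
        return [b]
    delta = (num_layers - b - 1) / (k_p - 1)
    # floor(x + 0.5) rounds halves up; Python's built-in round() would
    # round halves to the nearest even integer instead.
    return [b + math.floor(i * delta + 0.5) for i in range(k_p)]
```

For example, with $L = 12$, $K_p = 4$ and $b = 2$ the interval is $\delta = 3$ and the selected indices are 2, 5, 8, 11. Since the last index equals $L - 1$, the formula reads most naturally with 0-based layer indices.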

Figure 6: The illustration of the CNN fusion architecture. It is used to fuse multi-scale feature maps and serves as the architecture for both the prior head and fusion net. This module consists of two CNN layers: a 1×1 convolution layer followed by a 3×3 deformable convolution layer.

After obtaining the selected prior features $f_p^i \in \mathbb{R}^{(h \cdot w + 1) \times D}$, $i = 0, \ldots, K_p - 1$, we drop the class tokens, reshape and concatenate them to a multi-scale prior feature map $f_p \in \mathbb{R}^{h \times w \times (K_p \cdot D)}$. Finally, the aggregated prior map $f'_p \in \mathbb{R}^{h \times w \times D}$ can be denoted as:

$$f'_p = g_{\theta_p}(f_p), \tag{3}$$

where $g_{\theta_p}$ is the prior head.

Fusion net. Fusion net is utilized to fuse prior feature map $f'_p$ and the task-specific feature map $f'_t$ for different downstream tasks. This network has a similar architecture as the prior head (see Fig. 6). Let $[f'_p; f'_t] \in \mathbb{R}^{h \times w \times (2D)}$ be the concatenated feature map of $f'_p$ and $f'_t$ along the channel dimension. The rationale of using concatenation to fuse two feature maps is to preserve more information (see Tab. 6). The final fused map $f_o \in \mathbb{R}^{h \times w \times D}$ is given by:

$$f_o = g_{\theta_f}([f'_p; f'_t]), \tag{4}$$

where $g_{\theta_f}$ is the fusion net.
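
To make the channel bookkeeping of Eqs. (3) and (4) concrete, the following sketch traces the tensor shapes through the prior head and fusion net. It is a simplification under stated assumptions: only the 1×1 convolution is modelled (as a per-pixel linear map over channels), the 3×3 deformable convolution is omitted entirely, the class token is assumed to be at index 0, and all function names are hypothetical.

```python
import numpy as np


def drop_cls_and_reshape(tokens, h, w):
    # (h*w + 1, D) -> (h, w, D); the class token is assumed to sit at index 0.
    return tokens[1:].reshape(h, w, tokens.shape[-1])


def conv1x1(feature_map, weight):
    # A 1x1 convolution acts as a per-pixel linear map over channels.
    return feature_map @ weight


def fuse_features(prior_tokens, task_map, w_prior, w_fusion, h, w):
    # f_p: concatenate the K_p prior maps along channels -> (h, w, K_p * D).
    f_p = np.concatenate(
        [drop_cls_and_reshape(t, h, w) for t in prior_tokens], axis=-1
    )
    # f'_p: channel compression back to D (deformable conv omitted here).
    f_p_prime = conv1x1(f_p, w_prior)
    # [f'_p; f'_t]: concatenate prior and task maps -> (h, w, 2D).
    fused_in = np.concatenate([f_p_prime, task_map], axis=-1)
    # f_o: fused map with D channels.
    return conv1x1(fused_in, w_fusion)
```

Here `w_prior` has shape $(K_p \cdot D, D)$ and `w_fusion` has shape $(2D, D)$, so both the prior map and the fused map keep the spatial size $h \times w$ and return to $D$ channels.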

We then apply different transformations based on the type of downstream task. Specifically, for the segmentation task, we upsample $f_o$ by a factor of 4 using two transposed convolution layers. For the detection task, we transform $f_o$ into four scales, i.e., 4×, 2×, 1× and 0.5×, to match the input requirements of the detection head (Mask R-CNN). For the VQA task, we reshape $f_o$ along the sequence dimension to $(h \cdot w) \times D$ for the LLM decoder.
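
The resulting output sizes per task can be tabulated with a short helper. This is a shape-only sketch: `downstream_shapes` is a hypothetical name, and flooring the 0.5× scale for odd $h$ or $w$ is an assumption, since the text does not say how such cases are handled.

```python
def downstream_shapes(h: int, w: int, d: int) -> dict:
    """Spatial (or sequence) sizes of the transformed fused map f_o per task."""
    return {
        # Segmentation: 4x upsampling via two transposed convolutions.
        "segmentation": (4 * h, 4 * w),
        # Detection: four scales (4x, 2x, 1x, 0.5x); floor is an assumption.
        "detection": [(int(h * s), int(w * s)) for s in (4, 2, 1, 0.5)],
        # VQA: flatten to a token sequence of length h*w with D channels.
        "vqa": (h * w, d),
    }
```

For a 32×32 token grid, this gives a 128×128 segmentation map, detection maps of 128, 64, 32 and 16 pixels per side, and a sequence of 1024 tokens for the LLM decoder.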

4 Experiments

We conduct experiments on three tasks, semantic segmentation, object detection, and VQA, using well-established benchmarks, e.g., COCO [52], ADE20K [92], CityScapes [20], among others. We also present MDE results in the Appendix. Next, we perform ablation studies to further evaluate ViT-Split’s performance. A uniform selection strategy is applied to all experiments in this section, while results for the sparse gate are provided in the Appendix.

Table 1: Semantic segmentation results on the ADE20K val with 512×512 resolution image. “‡” represents the DINOv2 initialization. “†” denotes the use of ImageNet-22K pre-trained weight, while the default is to use ImageNet-1K pre-training.

Table 2: Comparison with previous SOTA semantic segmentation methods on ADE20K val with 896×896 resolution image. “‡” denotes initialization with DINOv2. “*” is implemented without tuning the whole backbone [65]. “MS” means multi-scale testing. “MM” indicates multi-modal pretraining.

Figure 7: Inference comparison: (a) Previous VFM adapters vs. (b) Our ViT-Split. ViT-Split is efficient during inference for multiple tasks.

4.1 Semantic segmentation

Settings. We conduct the semantic segmentation task on ADE20K [92] and Cityscapes [20], using MMSegmentation [19]. We employ AdamW [60] with a learning rate of 2e-4 and a weight decay of 1e-2. The training process uses a total batch size of 16. The learning rate for the task head is further reduced by a factor of 0.1. Unlike previous baselines, we use a simple linear head with two-layer deconvolutional blocks (×4) for segmentation, with a total of 40k iterations (50k for DINOv2-g). We provide the hyper-parameter analysis of $K_p$ and $K_t$ in the Appendix.
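
The optimizer settings above can be expressed as two parameter groups, sketched below in plain Python. The grouping rule is an assumption made for illustration: parameters are assigned to the task head by a hypothetical `task_head.` name prefix, which is not taken from the authors' code. The returned list of dictionaries follows the per-parameter-group format that `torch.optim.AdamW` accepts.

```python
BASE_LR = 2e-4            # learning rate stated in the settings
WEIGHT_DECAY = 1e-2       # weight decay stated in the settings
TASK_HEAD_LR_MULT = 0.1   # task head learning rate is reduced by this factor


def build_param_groups(named_params, task_prefix="task_head."):
    """Split (name, param) pairs into task-head and remaining groups.

    `task_prefix` is a hypothetical naming convention, not the authors'.
    """
    task, other = [], []
    for name, param in named_params:
        (task if name.startswith(task_prefix) else other).append(param)
    return [
        {"params": other, "lr": BASE_LR, "weight_decay": WEIGHT_DECAY},
        {
            "params": task,
            "lr": BASE_LR * TASK_HEAD_LR_MULT,
            "weight_decay": WEIGHT_DECAY,
        },
    ]
```

Under this reading, the prior head, fusion net and linear segmentation head train at 2e-4, while the copied task-head layers train at 2e-5, presumably to keep them close to their pretrained initialization.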

ADE20K val with 512×512 image. As shown in Tab. 1, our ViT-Split surpasses all other baselines on ADE20K with 512×512 resolution input images by fully leveraging the potential of the VFM. The results demonstrate the superiority of DINOv2 compared to ImageNet pretrained models. Additionally, ViT-Split requires tuning only about 1/5 to 1/4 of the parameters and trains for just 1/4 of the iterations compared to previous baselines. The parameter efficiency stems from 1) the efficient adaptation architecture of ViT-Split and 2) the lightweight linear head. The fast convergence is attributed to effective utilization of the prior knowledge embedded in VFMs. Moreover, compared to fine-tuning the entire DINOv2 baseline, our ViT-Split adjusts only 1/4 to 1/2 of the parameters while achieving an average improvement of 2% across three model sizes. Since most tunable parameters come from the tuned head, which represents a small portion of the entire VFM, the overall parameter count for tuning remains low. The performance gains can be attributed to the utilization of the multi-scale prior features from the VFM.

Figure 8: Comparison of time complexity for VFM adapters on ADE20K using two different sizes of ViT: (a) ViT-S and (b) ViT-B. For a fair evaluation, we reimplemented the other adapters under the same conditions, i.e., 4×A6000 Ada, over 10,000 iterations.

ADE20K and Cityscapes val with 896×896 image. We also compare with other SOTA methods on ADE20K (Tab. 2) and Cityscapes (Tab. 3) using images of 896×896 resolution. As shown in Tab. 2, ViT-Split achieves results comparable to current SOTA methods on ADE20K val. It is worth mentioning that ViT-Split uses only a small linear head and does not rely on extra pretraining data. For a fair comparison, we benchmark against ViT-Adapter-G, which trains only the adapter and the Mask2Former head based on the DINOv2 backbone. Our ViT-Split not only delivers better performance but also requires half the training parameters and achieves faster training speed. Specifically, according to [65], training ViT-Adapter-G requires 16 V100 GPUs for 28 hours, whereas our ViT-Split-G takes only 8 A6000 Ada GPUs for 15.7 hours. Moreover, on the Cityscapes dataset (Tab. 3), our ViT-Split outperforms ViT-Adapter with only around 1/6 of the parameters being tuned. The results suggest that a simple linear head is enough for competitive results on semantic segmentation by fully leveraging VFM prior knowledge.
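
The training budgets quoted above can be put on a common footing as GPU-hours. The snippet below is a back-of-the-envelope calculation only; V100 and A6000 Ada are different hardware, so the ratio should not be read as a like-for-like speedup.

```python
def gpu_hours(num_gpus: int, hours: float) -> float:
    """Total device time: number of GPUs times wall-clock hours."""
    return num_gpus * hours


vit_adapter_g = gpu_hours(16, 28.0)  # 16 V100 GPUs for 28 hours
vit_split_g = gpu_hours(8, 15.7)     # 8 A6000 Ada GPUs for 15.7 hours
ratio = vit_adapter_g / vit_split_g  # ignores the hardware difference
```

This gives 448 GPU-hours for ViT-Adapter-G against 125.6 GPU-hours for ViT-Split-G, a ratio of roughly 3.6 in raw device time.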

Table 3: Semantic segmentation results on Cityscapes val with 896×896 resolution image. “†” indicates that the model is initialized with BEiTv2 then pretrained on the Mapillary dataset. “‡” represents the use of DINOv2. “SS” denotes single-scale testing, and “MS” means multi-scale testing.

Method | LLM | Image Size | Sample Size (Pre / Ft) | VQAv2 [27], VizWiz [30], LLaVA-Wild [55], SciQA-IMG [61], MM-Vet [85], POPE [49] (rand, pop, adv), MMB [56]
BLIP-2 [46] | Vicuna-13B | 224² | 129M / - | 65.0 19.6 19.6 61 22.4 89.6 85.5 80.9
InstructBLIP [22] | Vicuna-7B | 224² | 129M / 1.2M | 34.5 34.5 60.5 26.2 36
InstructBLIP [22] | Vicuna-13B | 224² | 129M / 1.2M | 33.4 33.4 63.1 25.6 87.7 77 72
Shikra [10] | Vicuna-13B | 224² | 600K / 5.5M | 77.4∗ 58.8
IDEFICS-9B [37] | LLaMA-7B | 224² | 353M / 1M | 50.9 35.5 35.5 48.2
IDEFICS-80B [37] | LLaMA-65B | 224² | 353M / 1M | 60.0 36 36.0 54.5
Qwen-VL [4] | Qwen-7B | 448² | 1.4B / 50M | 78.8∗ 35.2 35.2 67.1 38.2
Qwen-VL-Chat [4] | Qwen-7B | 448² | 1.4B∗ / 50M | 78.2∗ 38.9 38.9 68.2 60.6
LLaVA-1.5 [54] | Vicuna-7B | 336² | 558K / 665K | 78.5∗ 50.0∗ 65.4 66.8 31.1 87.3 86.2 84.2 64.3
LLaVA-1.5 + ViT-Split | Vicuna-7B | 336² | 558K / 665K | 78.2 (-0.3) 51.7 (+1.7) 71.1 (+5.7) 70.4 (+3.6) 31.2 (+0.1) 88.5 (+1.2) 87.4 (+1.2) 86.1 (+1.9) 66.4 (+2.1)

Table 4: Comparison with different VLLM methods on VQA benchmarks. ViT-Split is integrated into the vision encoder (CLIP-L) of LLaVA-1.5 (7B), tuning the penultimate block and using the prior feature from this layer. This adaptation enhances performance across most benchmarks, demonstrating the effectiveness and generalization of ViT-Split.

Time complexity analysis. As illustrated in Fig. 8, our ViT-Split achieves, on average, approximately 4× faster training speed for the small model and 3× faster for the base model compared to the other two VFM adapters. The slower training speed of the other adapters can be attributed to two factors: the early gradient backpropagation and the interaction between the CNN branch and the ViT. In contrast, our ViT-Split avoids backpropagating gradients to early layers, and reduces both the CNN branch computations and interaction overhead by fully leveraging the prior knowledge in the VFM. As shown in Fig. 7, traditional VFM adapters require training a task-specific VFM along with its corresponding adapter and head. In contrast, ViT-Split keeps the entire VFM frozen, training only a smaller adapter and the corresponding head. This design significantly reduces computational costs, making it more efficient for supporting multiple downstream tasks during inference.

4.2 Detection and Instance Segmentation

Settings. We present detection and instance segmentation results on COCO-2017 [52] in Tab. 5, using MMDetection [9]. The AdamW optimizer is employed with an initial learning rate of 1e-4 and a weight decay of 5e-2, training for 12 epochs (1× schedule). The total batch size is set to 16, and we use a Mask R-CNN [33] head for the experiments. The settings of $K_p$ and $K_t$ are given in the Appendix.

As shown in Tab. 5, ViT-Split achieves performance comparable to the current SOTA VFM adapter, ViT-CoMer. As discussed in Sec. 3.1, the detection task may differ significantly from the original DINOv2 pretraining task, necessitating the tuning of more parameters. Despite this, ViT-Split still uses fewer parameters and trains faster (42% less training time) than ViT-CoMer, demonstrating the efficiency of our architecture.

Table 5: Object detection and instance segmentation using Mask R-CNN on COCO val2017. “†” indicates pre-training with ImageNet-22K, “‡” represents the use of DINOv2 [65], while the default setting uses ImageNet-1K pre-training.

4.3 Visual Question Answering

Settings. We also present VQA results using the popular visual large language model (VLLM) [50], LLaVA-1.5 [54]. This model comprises a CLIP-L visual encoder for encoding images, an MLP connector for projecting visual tokens into the language space, and a Vicuna-based LLM [17] for generating language tokens. In our modified LLaVA, we replace the original MLP projector with our ViT-Split. To comprehensively evaluate the effectiveness of ViT-Split, we use both academic-task-oriented benchmarks (VQA-v2 [27], VizWiz [30], SciQA-IMG [61]) and instruction-following LLM benchmarks (POPE [49], MMBench [56], LLaVA-Wild [55], MM-Vet [85]). Following [54], we first pretrain ViT-Split using 558K image-text pairs, and subsequently fine-tune both ViT-Split and the LLM with 665K mixed data pairs. For the hyperparameter settings, please refer to the Appendix.

As shown in Tab. 4, our ViT-Split enhances LLaVA-1.5 performance across most benchmarks. This improvement demonstrates that ViT-Split is also applicable to other VFMs and VQA tasks. Unlike most current VLLMs that directly utilize features from the penultimate layer, ViT-Split leverages both the prior features of the vision encoder and the task-specific features, resulting in richer visual representations that improve the LLM’s learning process. Moreover, we tune only a small portion of the vision encoder’s parameters (specifically, one layer), which ensures efficiency for both training and inference. We believe that ViT-Split will offer new inspiration for VLLM design.

4.4 Ablation Study

We conduct an ablation study for each trainable component in Tab. 6 on ADE20K. The default settings are consistent with those described in Sec. 4.1.

The effectiveness of prior head. The results in Tab. 6 show that incorporating the prior head improves performance by 2.7% and 3.6% compared to the baseline that uses only the final-layer features. This suggests that the prior head effectively leverages multi-layer prior features from the VFM to enhance overall representation quality, surpassing the use of solely the final layer’s prior features. Additionally, our module enhances 2D local representations through the use of a CNN. Furthermore, the results demonstrate that the prior features extracted from the original VFM are highly valuable, achieving performance levels nearly equivalent to those obtained through full fine-tuning.

The effectiveness of task head. As shown in Tab. 6, by tuning only the task head $g_{\theta_t}$, the performance nearly matches that of fine-tuning the entire model, supporting the finding in Sec. 3.1: the last few layers can learn task-specific features and achieve performance similar to tuning the entire backbone. Furthermore, the experiments demonstrate that performance can be further enhanced when the task head is combined with prior features. We attribute this improvement to the combined benefits of task-specific and prior knowledge, with the latter helping to reduce task head overfitting.
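The bookkeeping behind the task head can be sketched as follows. This is a minimal illustration, not the released implementation: the function names are hypothetical, and it assumes the task head is a trainable copy of the last $K_t$ blocks while the original backbone stays frozen (as suggested by the single-layer CLIP setup in Appendix A.3).

```python
def split_layers(num_layers, k_t):
    """Return (frozen, task_head) block indices for a ViT with `num_layers` blocks.

    Assumption: the task head is a trainable copy of the last `k_t` blocks,
    while every block of the original backbone stays frozen.
    """
    if not 1 <= k_t <= num_layers:
        raise ValueError("k_t must be between 1 and num_layers")
    frozen = list(range(num_layers))                       # whole backbone, no gradients
    task_head = list(range(num_layers - k_t, num_layers))  # copied blocks, trainable
    return frozen, task_head


def backward_depth(num_layers, k_t):
    """Number of transformer blocks that gradients traverse in each setup."""
    return {"full_finetune": num_layers, "task_head_only": k_t}


# A 12-block ViT (e.g., ViT-S or ViT-B) with K_t = 3, as in Tab. 6.
frozen, head = split_layers(num_layers=12, k_t=3)
print(head)                   # [9, 10, 11]
print(backward_depth(12, 3))  # {'full_finetune': 12, 'task_head_only': 3}
```

Under this assumption, gradients only pass through the copied blocks, which is why the early layers of the backbone never need a backward pass.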

The effectiveness of fusion head. Tab. 6 shows that using the fusion net $g_{\theta_f}$ yields a performance improvement of 1.1% for two ViT sizes. We attribute this enhancement to our CNN-based fusion module, which retains richer feature information compared to a simple addition operation. Again, the CNN component strengthens the local feature representation, contributing to improved fusion results.

Table 6: Ablation study of the prior head ($g_{\theta_p}$), task head ($g_{\theta_t}$), and fusion net ($g_{\theta_f}$) on ADE20K, conducted with two ViT sizes: small and base on ViT-Splitu. We set $K_t=3$ and $K_p=4$ for both model sizes. The baseline model (no modules used, shown without background color) uses only the frozen features from the last layer. The baseline with a gray background indicates full fine-tuning of the entire backbone. When only $g_{\theta_p}$ and $g_{\theta_t}$ are used, their features are combined via addition.

Table 7: Ablation study on the frozen layer selection strategies for our ViT-Split model on the ADE20K dataset, using three ViT sizes: small, base, and large. $K_p$ is the same for all strategies.

The effectiveness of uniform layer selection. In Tab. 7, we evaluate the effectiveness of the selection strategy for prior features. Compared to selecting features from only the last few layers, which capture mostly task-specific prior information, uniform selection allows for a more diverse set of prior features, encompassing both low-level and task-specific characteristics. This uniform selection approach becomes increasingly impactful as the backbone size grows.
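The two strategies compared here can be sketched in a few lines. This is an illustrative sketch under assumptions: the helper names are hypothetical, layers are 1-indexed, $b$ is the starting layer described in the supplementary material, and the rounding rule for evenly spaced indices is our choice rather than a documented detail.

```python
def uniform_layer_indices(num_layers, k_p, b=1):
    """Pick `k_p` evenly spaced layers (1-indexed) from layer `b` to the last layer."""
    if k_p < 1 or not 1 <= b <= num_layers:
        raise ValueError("invalid arguments")
    if k_p == 1:
        return [num_layers]  # a single prior feature: use the last layer
    step = (num_layers - b) / (k_p - 1)
    return [round(b + i * step) for i in range(k_p)]


def last_layer_indices(num_layers, k_p):
    """Baseline strategy: take the last `k_p` layers only."""
    return list(range(num_layers - k_p + 1, num_layers + 1))


# 12-block ViT, K_p = 4, b = 2.
print(uniform_layer_indices(12, 4, b=2))  # [2, 5, 9, 12]
print(last_layer_indices(12, 4))          # [9, 10, 11, 12]
```

The uniform indices span shallow and deep blocks, whereas the baseline only sees the deepest ones.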

The effectiveness across different VFMs. To evaluate the generality of our ViT-Split, we present results on various VFMs in Fig. 9, leveraging the excellent VFM-benchmark codebase (https://github.com/tue-mps/benchmark-vfm-ss). The experiments demonstrate that ViT-Split consistently enhances performance across both weakly-supervised VFMs (SAM and SigLip) and self-supervised VFMs (MAE). These results not only validate the effectiveness of ViT-Split on multiple VFMs but also suggest that our observations may hold for a broader range of VFMs.

Figure 9: Segmentation results and training parameters on ADE20K with different VFMs, including MAE-B [35], SAM-B [43] and SigLip-B [89]: (a) mIoU and (b) training parameters. We set $K_p=4$ and $K_t=8$ for all the VFMs.

5 Conclusion

In this paper, we introduce ViT-Split, an efficient, effective, and generalized adapter, to adapt VFMs for downstream tasks. Specifically, we introduce two heads based on a frozen VFM, a prior head for multi-scale prior feature extraction and a task head for task-specific feature adaptation. Experiments on segmentation, detection, MDE, and VQA verify the effectiveness and efficiency of our method. In the future, we aim to apply ViT-Split to more VFMs and tasks. We hope our method offers a fresh perspective for efficient and effective VFM adapter design.

Supplementary Material

Figure 10: Illustration of our proposed layer selection methods: uniform sampling (left) and sparse gate (right). Uniform sampling selects $K_p$ layers from $L$ prior features, ranging from the $b$-th to the $L$-th layer. The sparse gate, utilizing the STE technique (see Eq. 5), aggregates multiple layer features and filters out irrelevant ones.

Appendix A Training details

Table 8: Comparison of two layer selection methods on semantic segmentation. The results are obtained on Cityscapes val with 896×896 resolution images.

Table 9: Comparison of two layer selection methods on semantic segmentation. The results are obtained on ADE20K val with 512×512 resolution images.

A.1 Hyper-parameter setting

We outline the settings for several key hyperparameters of ViT-Split in Tab. 10, including weight initialization, the number of tuning layers ($K_t$), and the number of selected prior features ($K_p$). We conduct experiments across four tasks: semantic segmentation, monocular depth prediction, detection, and visual question answering (VQA).

The selection guideline of $K_t$, $K_p$ and $b$. As shown in Tab. 10, these hyperparameters vary across tasks, with their importance ranked as $K_t > K_p > b$. As shown in Fig. 11, $K_t$ is the most critical hyperparameter and is task-dependent. For dense prediction tasks (e.g., segmentation or monocular depth estimation), tuning a small number of layers (around 1/6 to 1/4 of the layers) yields good performance. For detection tasks, since the pretraining task differs significantly from detection (see Fig. 4), tuning more layers is necessary for better results. $K_p$ has a smaller impact on results than $K_t$, and $K_p=4$ works well in most cases. Typically, we set $b=2$ to sample prior features from both shallow and deep layers. However, for tasks like VQA, only the last-layer features are needed, as the LLM decoder benefits more from high-level features while low-level features may introduce noise.

Table 10: The settings of the important hyper-parameters of ViT-Split on different tasks, including semantic segmentation, monocular depth estimation, detection and instance segmentation, and visual question answering (VQA).

Table 11: Monocular depth estimation results on NYU-V2 with 416×544 resolution images. “‡” represents the use of DINOv2. Other backbones are initialized with ImageNet-1K/22K weights.

A.2 Sizes of various heads

We provide the sizes of the various heads used in ViT-Split for different tasks in Tab. 12, including segmentation (seg.), detection (det.) and monocular depth estimation (mde).

Table 12: The size of different heads used for ViT-Split.

A.3 Details of tuning the VLLM

LLaVA-1.5 employs a CLIP-based vision encoder for image encoding. We introduce a single-layer task head copied from CLIP’s original final layer (i.e., $K_t=1$) and use only the last-layer feature of CLIP as the input to the prior head (i.e., $K_p=1$). We replace the original MLP projector in LLaVA-1.5 with our ViT-Split for two-stage training. The training follows the same hyperparameter settings as the original LLaVA-1.5.

A.4 Architecture details of various used VFMs

We provide the architecture details of various VFMs used in the main content in Tab. 13.

Table 13: The architecture details of used VFMs.

Appendix B Layer selection

B.1 Sparse gate

Another way is to learn the sparse gate $G_{sp} \in \mathbb{R}^{L \times K_p}$ from the dataset. This method eliminates the need for carefully tuning hyperparameters to select prior features. To remove noisy features, we enforce sparsity in the gate by selecting the top $K_p$ scores and normalizing the remaining ones. However, directly optimizing $G_{sp}$ is infeasible since the sparsity operation is non-differentiable. To address this issue, we employ the Straight-Through Estimator (STE) technique, which allows for approximate gradient optimization. Specifically, let $G \in \mathbb{R}^{L \times K_p}$ be the learnable gate, which is continuous. From $G$, we obtain the sparse gate $G_{sp}$ by selecting the top $K_p$ elements in each column. We then apply STE so that gradients are passed to $G$:

$$G_{sp} = G_{sp} + G - G_{no\_grad}. \qquad (5)$$

After obtaining the sparse gate $G_{sp} \in \mathbb{R}^{L \times K_p}$, we get the selected prior features by multiplying it with the prior feature map $\mathbf{f}_p \in \mathbb{R}^{h \times w \times L \times D}$ along the layer dimension.
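The forward computation of the sparse gate can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, the kept scores are normalized with a softmax (the text only says they are normalized), and the features are shown for a single spatial location ($L \times D$) rather than the full $h \times w \times L \times D$ map.

```python
import math


def sparse_gate(G, k):
    """Keep the top-`k` scores in each column of G (L x Kp) and renormalize them.

    Assumption: the kept scores are normalized with a softmax.
    """
    L, Kp = len(G), len(G[0])
    G_sp = [[0.0] * Kp for _ in range(L)]
    for j in range(Kp):
        column = [G[i][j] for i in range(L)]
        top = sorted(range(L), key=lambda i: column[i], reverse=True)[:k]
        peak = max(column[i] for i in top)
        weights = {i: math.exp(column[i] - peak) for i in top}
        total = sum(weights.values())
        for i in top:
            G_sp[i][j] = weights[i] / total
    return G_sp


def select_prior_features(f_p, G_sp):
    """Mix per-layer features f_p (L x D, one spatial location) into Kp x D features."""
    L, D, Kp = len(f_p), len(f_p[0]), len(G_sp[0])
    return [[sum(G_sp[l][j] * f_p[l][d] for l in range(L)) for d in range(D)]
            for j in range(Kp)]


# Toy example: L = 4 layers, Kp = 2 selected features, D = 3 channels.
G = [[0.1, 2.0], [1.5, -0.3], [0.2, 0.4], [1.4, 0.0]]
f_p = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
G_sp = sparse_gate(G, k=2)
selected = select_prior_features(f_p, G_sp)
# In an autograd framework, Eq. 5 would then read
#   G_sp = G_sp + G - stop_gradient(G),
# so the forward value stays equal to the sparse gate while gradients reach G.
```

The sketch only covers the forward pass; the straight-through step needs an automatic differentiation library and is therefore shown as a comment.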

B.2 Performance on segmentation task

We present a comparison of layer selection methods on segmentation benchmarks, including Cityscapes and ADE20K, in Tab. 8 and Tab. 9. For a fair comparison, we set the same $K_t$ for both selection methods and use $K_p=4$ for all sparse-gate-based experiments. Our results show that sparse gate selection achieves performance comparable to uniform sampling on segmentation tasks without requiring manual hyper-parameter selection. This indicates that sparse gate selection is a promising and versatile approach for reducing the number of hyper-parameters.

Appendix C Motivation of freezing the backbone

Freezing the backbone has three main motivations. ① Improved training and inference speed. Fig. 7 shows that ViT-Split achieves 2.4∼5× and 2∼6× speedups over other VFM-adapters in training and inference, respectively. Additionally, as detailed in Tab. 15, ViT-Split is 1.4∼3× faster than finetuning the entire backbone with a linear/UperNet head. ② Enhanced performance with prior features. We admit that the inference speed decreases compared with finetuning DINOv2-linear due to the extra heads (around 30% on segmentation tasks). However, the performance can be further improved, which is also the main motivation of other VFM-adapters. Compared with these, ViT-Split achieves better training and inference efficiency. ③ Task adaptivity. ViT-Split requires storing only separate task-specific heads, rather than the entire model, making it more adaptive and memory-efficient for deployment across multiple tasks.

Appendix D Explanation of the lower performance on detection task

We acknowledge that the performance difference between ViT-Split and ViT-CoMer on Mask R-CNN (Tab. 5) is relatively small. However, ViT-Split uses only 90%–95% of ViT-CoMer’s trainable parameters, already demonstrating clear advantages in training efficiency while maintaining comparable accuracy. The primary reason ViT-Split does not significantly outperform other VFM-adapters lies in the relatively weak task alignment of the prior features from DINOv2 for object detection tasks. Unlike DETR-style models, which are pre-trained with strong detection-oriented objectives, self-supervised models like DINOv2 tend to provide less directly transferable features for detection. This necessitates using more layers in the task head (i.e., larger $K_t$), effectively making ViT-Split rely more on fine-tuning, similar to other VFM-adapters. As self-supervised models begin to offer stronger detection-aware priors, we expect ViT-Split to better leverage them and close the gap with current SOTA DETR-style models.

Appendix E More results

E.1 An apples-to-apples comparison with other VFM-adapters on segmentation

We provide an apples-to-apples comparison with the SOTA VFM-adapters in Tab. 14, i.e., ViT-CoMer [79] and ViT-Adapter [14]. All models are trained for 40K iterations on ADE20K, using an UperNet head for the baselines and a linear head for ViT-Split. For VFM-adapters, we adopt a learning rate schedule similar to that used in detection tasks, incorporating layer-wise decay with carefully tuned rates for each baseline to ensure strong performance. Results show that with DINOv2 initialization, ViT-Split consistently outperforms other VFM-adapters across different model sizes. This highlights ViT-Split’s ability to better leverage the strong prior knowledge from DINOv2 without altering the original feature representations; such alteration often results in suboptimal performance in other adapters.

Table 14: VFM-adapter comparison on ADE20K (40K iterations).

E.2 Hyper-parameter sensitivity analysis

We analyze two important hyper-parameters of ViT-Split, $K_t$ and $K_p$, in Fig. 11.

Influence of $K_t$. As shown in Fig. 11(a), the mIoU initially improves when tuning between one and three layers. This improvement is likely due to the task head previously underfitting the task. However, as more layers are tuned, overall performance begins to decline, suggesting that the task head starts to overfit. This experiment demonstrates that tuning additional layers does not necessarily guarantee better performance and can easily lead to overfitting. Therefore, we opt to tune three layers in this case.

Influence of $K_p$. As shown in Fig. 11(b), the mIoU peaks when selecting four prior layer features. Selecting too few layers may result in missing critical information, while selecting too many can introduce noise. Additionally, we observe that increasing the number of selected layers does not add training parameters, highlighting the efficiency of the prior head. As a result, we choose four prior features in this case.

Figure 11: Parameter sensitivity analysis of $K_t$ and $K_p$ in ViT-Split: (a) tuning layers $K_t$ and (b) frozen layers $K_p$. The experiments are conducted using ViT-Split-S on ADE20K.

E.3 Visualization

E.3.1 CKA analysis of other VFMs

We also present the CKA results for MAE-L [35] and SAM-L [43] in Fig. 12. The feature representations in the early layers of these VFMs exhibit similar patterns, as do those in the later layers. Based on these findings as well as those in the main paper, we hypothesize that our observation, namely that the layers of several VFMs can be divided into two components, may hold true for self-supervised models pretrained on large-scale datasets (e.g., DINOv2 [65], MAE [35], EVA2 [25]), as well as weakly supervised ones (e.g., CLIP [69], SigLip [89], SAM [43]).

Figure 12: The CKA of SAM (a) and MAE (b). (c) Training comparison between ViT-Split-s and DINOv2-s-UperNet on ADE20K.

Figure 13: Further comparison of DINOv2-S layer features across original features, segmentation, and detection tasks. In each figure, the first, second, and third rows correspond to original, segmentation, and detection features, respectively. It can be observed that features from earlier layers exhibit similar patterns across different tasks, reflecting common low-level local features. However, features from deeper layers diverge significantly according to their specific downstream tasks.

Figure 14: Semantic segmentation and instance segmentation results based on our ViT-Split-L (left: original image, middle: semantic segmentation results, right: instance segmentation results).

E.3.2 CKA analysis of different DINOv2 sizes

We also provide the CKA visualizations of different DINOv2 sizes in Fig. 15. From these visualizations, we observe that features in the early layers are more similar across different DINOv2 sizes than those in the later layers. As mentioned earlier, the early layers serve as an encoder to capture low-level features, while the later layers act as a decoder to produce task-specific features.

Figure 15: The CKA visualizations of different sizes of DINOv2.

E.3.3 More layer feature comparison

We present additional visualizations of DINOv2 layer features across different tasks (i.e., DINOv2 pretraining, segmentation, and detection) in Fig. 13. These results demonstrate that earlier-layer features from various tasks consistently focus on detailed, low-level information. However, deeper-layer features diverge significantly between tasks. Specifically, features from both the original DINOv2 pretraining and semantic segmentation emphasize semantic-level information of particular objects, whereas detection features tend to highlight object corners and boundaries.

E.3.4 Semantic segmentation and instance segmentation results

We present semantic segmentation and instance segmentation results based on our ViT-Split-L (DINOv2 pretrained) in Fig. 14. We utilize ADE20K and COCO2017 datasets for training these two tasks, respectively, and evaluate both on the ADE20K validation dataset.

It is worth noting that both results are obtained using the same frozen DINOv2-L backbone, meaning only the task-specific adapters and heads require training. Consequently, the overall computational cost and the number of parameters are significantly reduced compared to previous VFM-adapters, while achieving competitive or superior performance. These visualizations demonstrate the strong generalization capability of ViT-Split, highlighting its versatility, effectiveness, and efficiency across multiple downstream tasks.

E.4 Training efficiency comparison

Table 15: Training time comparison on ADE20K (tuning for 10K iterations on 4×A6000 Ada). DINOv2-linear and DINOv2-UperNet are finetuned end to end.

To further illustrate the training efficiency compared with different heads on the segmentation task, we provide a training time comparison in Tab. 15. For a fair comparison, all baselines (except for ViT-Split) are finetuned using the DINOv2 backbone with two different heads (linear and UperNet) for 10K iterations on 4×A6000 Ada.

From Tab. 15, we observe that ViT-Split reduces the training time of DINOv2-linear by approximately 42% on average while keeping the same linear head. This improvement in training efficiency is attributed to the task-head design, which prevents gradients from propagating to the early layers of the backbone. Compared to finetuning a VFM with a larger segmentation head (DINOv2-UperNet), ViT-Split is on average 2.5 times faster across the three sizes. This highlights the large computational overhead introduced by a large segmentation head and demonstrates the efficiency of ViT-Split.
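Because this paragraph mixes a relative time reduction (42%) with a speedup factor (2.5×), the small helper below converts between the two. It is only an arithmetic aid with hypothetical function names, and it assumes that "2.5 times faster" means the baseline takes 2.5 times as long.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional training-time reduction into a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def reduction_from_speedup(speedup):
    """Convert a speedup factor into the fraction of training time saved."""
    if speedup < 1:
        raise ValueError("speedup must be at least 1")
    return 1.0 - 1.0 / speedup


print(round(speedup_from_reduction(0.42), 2))  # 1.72
print(round(reduction_from_speedup(2.5), 2))   # 0.6
```

Read this way, the 42% reduction over DINOv2-linear corresponds to roughly a 1.7× speedup, and the 2.5× speedup over DINOv2-UperNet to about 60% less training time.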

E.5 Longer training time

To probe the performance upper bound of ViT-Split, we extend training to 160K iterations and report the result in Fig. 12. As shown in Fig. 12 (c), ViT-Split-s achieves 52.2%, improving from 51.5% at 40K iterations and surpassing DINOv2-s-UperNet (51.6%) while maintaining faster training. This demonstrates that ViT-Split can achieve better performance when trained for longer.

E.6 Monocular depth estimation

Settings. To further investigate the effectiveness of ViT-Split, we also provide results on monocular depth estimation (MDE) on the NYU-V2 [74] benchmark in Tab. 11. Following [51], we use the AdamW optimizer with an initial learning rate of 3e-4 and a weight decay of 1e-2. We scale the learning rate of the task head by 0.1 during training. Moreover, a one-cycle learning rate decay schedule is used for better performance. We train ViT-Split for 384K iterations with a total batch size of 16 on 4×A6000 Ada GPUs.
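The learning-rate setup described above can be summarized as optimizer parameter groups. The sketch below is illustrative only: the function and group names are hypothetical, the groups are plain dictionaries rather than real tensors, and how the released code builds its optimizer is not stated in the text.

```python
def build_param_groups(base_lr, weight_decay, task_head_lr_mult=0.1):
    """Describe optimizer parameter groups as plain dicts.

    The group names are hypothetical; in practice each group would also carry
    the parameters it covers.
    """
    return [
        {"name": "task_head", "lr": base_lr * task_head_lr_mult,
         "weight_decay": weight_decay},
        {"name": "other_trainable", "lr": base_lr,
         "weight_decay": weight_decay},
    ]


# MDE setting from the text: base learning rate 3e-4, weight decay 1e-2.
for group in build_param_groups(base_lr=3e-4, weight_decay=1e-2):
    print(group["name"], group["lr"], group["weight_decay"])
```

With these values the task head is trained at one tenth of the base learning rate, while the remaining trainable modules use the base rate.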

As shown in Tab. 11, our ViT-Split achieves competitive or even superior results compared to previous state-of-the-art methods, while using a minimal number of trainable parameters. Notably, ViT-Split employs only a single linear head rather than a specially designed head, highlighting the potential of our approach. Leveraging the prior knowledge embedded in vision foundation models (VFMs), we believe the size of the downstream task head (e.g., for depth prediction) can be further reduced to improve efficiency.

When compared to DINOv2-G with DPT [71], which uses the same DINOv2 initialization but a larger and more sophisticated head, our smaller ViT-Split-B version achieves similar performance with fewer parameters, demonstrating both the effectiveness and efficiency of our method. Furthermore, compared to traditional end-to-end fine-tuning approaches, ViT-Split achieves better performance by fully utilizing the prior knowledge inherent in VFMs. This also highlights the significant potential of large-scale self-supervised learning initialization over traditional supervised learning initialization.

E.7 Segmentation on Pascal Context

Settings. Apart from ADE20K and Cityscapes, we also provide results on Pascal Context [63] in Tab. 16. We use the AdamW optimizer with an initial learning rate of 1e-4 and a weight decay of 1e-2. We scale the learning rate of the task head by 0.1 during training. We train our model for 20K iterations, and the total batch size is set to 16.

As shown in Tab. 16, our method outperforms ViT-Adapter, achieving a 2% improvement for the base model and a 0.3% improvement for the large model, using just a simple linear head and training for only 20K iterations. The results demonstrate the strength of VFMs, with our method achieving both effectiveness and efficiency by fully utilizing the prior knowledge within the VFMs.

Table 16: Semantic segmentation results on the Pascal Context val with 480×480 resolution images. “†” indicates the BEiT initialization and “†” represents the use of DINOv2.

Appendix F Limitations

Currently, we have demonstrated the effectiveness of ViT-Split only on a limited set of VFMs, such as DINOv2 and CLIP, leaving its performance on a broader range of models to be explored in future work.

References