Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions

Siyuan Yao1,2, Rui Zhu1, Ziqi Wang1, Wenqi Ren2,4,5, Yanyang Yan3, Xiaochun Cao2
1 Beijing University of Posts and Telecommunications
2 Sun Yat-sen University
3 University of Chinese Academy of Sciences
4 MoE Key Laboratory of Information Technology
5 Guangdong Key Laboratory of Information Security Technology
yaosiyuan04@gmail.com, ruizhu@bupt.edu.cn, zq_wang@bupt.edu.cn,
yanyanyang@ict.ac.cn, rwq.renwenqi@gmail.com, caoxiaochun@mail.sysu.edu.cn

Abstract

Visual object tracking has made promising progress over the past decades. Most existing approaches focus on learning target representations from well-conditioned daytime data, whereas in unconstrained real-world scenarios with adverse weather conditions, e.g. nighttime or foggy environments, the large domain shift leads to significant performance degradation. In this paper, we propose UMDATrack, which maintains high-quality target state prediction under various adverse weather conditions within a unified domain adaptation framework. Specifically, we first use a controllable scenario generator to synthesize a small number of unlabeled videos (less than 2% of the frames in the source daytime datasets) in multiple weather conditions under the guidance of different text prompts. Afterwards, we design a simple yet effective domain-customized adapter (DCA), which allows the target objects’ representation to adapt rapidly to various weather conditions without redundant model updating. Furthermore, to enhance the localization consistency between the source and target domains, we propose a target-aware confidence alignment module (TCA) based on optimal transport theory. Extensive experiments demonstrate that UMDATrack surpasses existing advanced visual trackers and achieves new state-of-the-art performance by a significant margin. Our code is available at https://github.com/Z-Z188/UMDATrack.

1 Introduction

Visual object tracking (VOT) is a fundamental task in computer vision that aims to estimate the state of an arbitrary target object in a video sequence given its initial annotation. Existing mainstream methods formulate object tracking as a target matching problem, constructing template-search pairs to learn a position-sensitive matching network for target localization. Owing to advances in deep learning architectures, VOT has achieved remarkable success in terms of accuracy and efficiency.

Figure 1: Three representative tracking pipelines under adverse weather conditions. (a) The “Track-by-Enhancement” pipeline [48]. (b) The single-domain adaptation pipeline [49]. (c) The proposed unified multi-domain adaptive tracking (UMDATrack) pipeline. UMDATrack uses a controllable scenario generator to synthesize unlabeled video frames and employs a flexible domain-customized adapter to transfer knowledge to multiple domains.

Recent advanced object trackers typically use well-conditioned daytime datasets, e.g. LaSOT [10] or TrackingNet [29], as supervision for model training. However, the performance of these SOTA trackers is unsatisfactory in real-world scenarios with adverse weather conditions (e.g. nighttime or foggy environments) due to the large domain gap. To address this issue, some efforts have introduced synthesized datasets [52, 40] or domain-adaptive discriminators [49, 55] to enhance cross-domain transferability. Despite these advances, they suffer from two potential drawbacks. First, most existing approaches are designed for a single weather condition, so their generalization ability is greatly limited when multiple target domains are present. For example, as shown in Fig. 1, the nighttime tracker UDAT [49] can predict the target state in nighttime data, but its performance drops significantly when the environment changes to foggy weather. Second, recent domain-adaptive trackers generate large amounts of target-domain samples for knowledge transfer; this sample generation process is time-consuming, and the intrinsic relationship of the target objects across multiple domains is overlooked. To handle different weather conditions in multiple target domains, existing approaches need to introduce redundant parameters to conduct feature alignment separately for each domain, which prevents efficient cross-domain interaction.

In this paper, we propose a unified multi-domain adaptive tracker termed UMDATrack, which maintains high-quality target state prediction under various adverse weather conditions. Inspired by the success of controllable text-to-image generation, we first use a text-conditioned diffusion model to synthesize unlabeled videos in multiple weather conditions under the guidance of different text prompts. Afterwards, to flexibly transfer the target objects’ representation from the source domain to multiple target domains, we freeze the backbone feature extractor and design a simple yet effective domain-customized adapter (DCA) to remedy the tracking model, allowing it to adapt rapidly to various weather conditions without redundant model updating. Furthermore, we propose a target-aware confidence alignment module (TCA) based on optimal transport theory, which enhances the localization consistency between the source and target domains by measuring the discrepancies of the localization confidence at the candidate positions. Experiments show that, by synthesizing only a small portion of videos (less than 2% of the frames in the source domain) under arbitrary weather conditions, UMDATrack surpasses existing advanced visual trackers and achieves new state-of-the-art performance on both real-world and synthesized datasets by a significant margin. To the best of our knowledge, this is the first unified multi-domain adaptation tracker in the VOT community.

In summary, the main contributions of this work can be summarized in three aspects:

• We propose a unified multi-domain adaptive tracking framework termed UMDATrack, which conducts multi-domain transfer using a text-conditioned diffusion model and maintains high-quality target state prediction under various adverse weather conditions.

• We design a simple yet effective domain-customized adapter (DCA) to remedy the tracking model, which flexibly transfers the target objects’ representation from the original daytime scenario to various weather conditions without redundant model updating.

• We propose a target-aware confidence alignment module (TCA) based on optimal transport theory to enhance the localization consistency between the source and target domains. Extensive experiments demonstrate that UMDATrack achieves superior performance to existing state-of-the-art methods.

2 Related Work

2.1 Tracking in Adverse Weather Conditions

Recently, object tracking in adverse weather conditions has attracted increasing interest due to a variety of practical applications. Classical methods employ multi-modal sensors, e.g. Visible+Depth (RGB-D) [43] and Visible+Thermal (RGB-T) [38], for target appearance modeling in complex scenarios. However, these methods need to collect a large number of labelled examples to learn the cross-modal target representation. To address this issue, some works use RGB images only and transfer the knowledge to unlabelled target domains. Existing methods [52, 48] generally perform image enhancement to unify the target object’s representation. For example, Zhang et al. [52] combine RGB images and the corresponding depth maps to synthesize foggy images. Feature alignment is then conducted on Siamese trackers [6, 46, 45] using the synthesized foggy datasets to eliminate the semantic-level domain shift. HighlightNet [11] adapts to illumination variation and excavates the potential object for low-light UAV tracking. UDAT [49] proposes a transformer-based bridging layer to transfer semantic knowledge from the daytime domain to the nighttime domain. Though effective, these trackers are designed for a single weather condition, and their generalization ability is greatly limited when multiple target domains are present.

2.2 Controllable Text-to-Image Generation

To transfer knowledge across various weather conditions, scene translation techniques have been introduced to synthesize high-quality images. Early efforts use Generative Adversarial Networks (GANs) [18] to transform images from the source domain to the target domain by modifying the image style. However, these GAN-based methods typically require training from scratch on each specific domain. Recently, advanced text-to-image (T2I) diffusion models [13, 50] have shown impressive controllability using text descriptions. GLIDE [30] trains a CLIP model in noisy image space to provide CLIP guidance for image generation and editing. DALL-E [32] employs an autoregressive transformer to combine text and image tokens, which demonstrates remarkable zero-shot translation capabilities without using large-scale training samples. ControlNet [50] treats the pretrained model as a strong backbone and finetunes a trainable copy connected through zero convolution layers, allowing users to add various spatial conditions to control the image generation. Inspired by the success of these T2I generation models, in this work we use a text-conditioned diffusion model to synthesize unlabeled videos in multiple weather conditions for target feature translation.

Figure 2: Overview of the proposed UMDATrack. It first uses a controllable scenario generator (CSG) to synthesize video frames in arbitrary adverse weather conditions. The cropped template-candidate pairs are sent into a student-teacher network, which transfers the target objects’ representation to multiple weather conditions using an encoder network with a domain-customized adapter (DCA) and a localization head with a target-aware confidence alignment module (TCA). Here we only demonstrate the daytime $\to$ foggy environment translation for simplicity.

2.3 Multi-Target Domain Adaptation

Recently, various techniques have been employed for Multi-Target Domain Adaptation (MTDA) to enhance cross-domain robustness and generalization. For example, curriculum learning and feature aggregation have been combined to align similar features and adapt models gradually to domain complexities [35]. Other approaches [23] have explored merging independently adapted models from distinct domains by combining model parameters and merging buffers. Additionally, graph matching techniques [24] have been applied to improve generalization in cross-domain object detection, with self-training methods also showing promising potential. Optimal transport theory has also been widely studied and applied across various domains. A regularized unsupervised optimal transport model [7] has been proposed to align source and target domain representations, using a transport plan that enhances cross-domain robustness. In particular, SOOD [16] uses optimal transport to ensure global layout consistency between pseudo-labels and predictions. Despite these efforts, it is still challenging to design a unified tracker that conducts MTDA in adverse weather conditions such as fog, nighttime, and rain. Our work addresses this gap by leveraging optimal transport theory to improve tracking robustness in these challenging scenarios.

3 Method

In this section, we describe the overall architecture of the proposed UMDATrack, which consists of three main components: a controllable scenario generator (CSG), an encoder network with a domain-customized adapter (DCA), and a localization head with a target-aware confidence alignment module (TCA).

3.1 Controllable Scenario Generator

As it is nontrivial to collect a large number of video sequences in adverse weather conditions, we first synthesize a small amount of training data to conduct domain knowledge transfer. Inspired by recent advances in text-to-image (T2I) techniques, we use a controllable scenario generator (CSG) for data synthesis. Let $\mathbb{V}=\{\mathbf{V}_1,\mathbf{V}_2,\cdots,\mathbf{V}_K\}$ denote the videos in the source domain and $\mathbb{V}^{\ast}=\{\mathbf{V}^{\ast}_1,\mathbf{V}^{\ast}_2,\cdots,\mathbf{V}^{\ast}_L\}$ denote the videos in the target domain, where $L\ll K$ indicates that $\mathbb{V}^{\ast}$ is significantly smaller than $\mathbb{V}$. Our goal is to randomly select videos from $\mathbb{V}$ and translate them to arbitrary weather conditions, e.g. hazy, dark, and rainy. To achieve this, we use a T2I model, i.e. Stable Diffusion-Turbo [36], to translate the scenarios using different text prompts. As shown in Fig. 3, the text prompt $c_X$, e.g. “Car in the night/haze/rain/snow”, and the video images $x\in\mathbb{V}$ in the source domain are fed into the text encoder and image encoder, respectively. We generate the output video frames $y\in\mathbb{V}^{\ast}$ in the target domain by integrating the video frame $x$ with the conditional control $c_X$ and the noise $\epsilon$ as:

$$y=G_{\mathrm{SDT}}(x,c_X,\epsilon),\quad \epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I}), \tag{1}$$

where $G_{\mathrm{SDT}}(x,c_X,\epsilon)$ denotes the Stable Diffusion-Turbo generator and $\epsilon$ is the noise map. Skip connections and Zero-Convs are used to preserve the essential structural details of the images. Benefiting from the strong transferability of the T2I model, the video frames in the target domains can be generated rapidly, within only 1-4 iteration steps, by simply changing the text prompts.
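The data-selection side of this step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`select_videos`, `build_jobs`, `translate_frame`), the prompt templates, and the way the 2% frame budget is enforced are all hypothetical and are not taken from the released code. The generator itself is left as a placeholder, since its interface is not specified here.

```python
import random

# Hypothetical prompt templates; the wording follows the example prompt in the text.
WEATHER_PROMPTS = {
    "night": "{obj} in the night",
    "haze": "{obj} in the haze",
    "rain": "{obj} in the rain",
    "snow": "{obj} in the snow",
}


def select_videos(frame_counts, budget_ratio=0.02, seed=0):
    """Randomly pick source videos while staying under the frame budget.

    frame_counts maps a video id to its number of frames. The returned
    subset holds strictly fewer than budget_ratio of all source frames.
    """
    rng = random.Random(seed)
    budget = budget_ratio * sum(frame_counts.values())
    order = list(frame_counts)
    rng.shuffle(order)
    chosen, used = [], 0
    for vid in order:
        if used + frame_counts[vid] < budget:
            chosen.append(vid)
            used += frame_counts[vid]
    return chosen


def build_jobs(chosen, object_names, seed=0):
    """Pair each selected video with one weather prompt c_X."""
    rng = random.Random(seed)
    jobs = []
    for vid in chosen:
        weather = rng.choice(sorted(WEATHER_PROMPTS))
        prompt = WEATHER_PROMPTS[weather].format(obj=object_names[vid])
        jobs.append({"video": vid, "weather": weather, "prompt": prompt})
    return jobs


def translate_frame(frame, prompt, noise):
    """Placeholder for y = G_SDT(x, c_X, eps); a real generator would go here."""
    raise NotImplementedError


# Toy usage: 1000 source videos of 100 frames each.
frame_counts = {f"v{i}": 100 for i in range(1000)}
chosen = select_videos(frame_counts)
jobs = build_jobs(chosen, {vid: "Car" for vid in chosen})
```

Fixing the random seed keeps the selected subset reproducible, which matters when the same synthesized videos are reused across weather conditions.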

Figure 3: Details of the Controllable Scenario Generation (CSG) module.

3.2 Tracking in Multiple Weather Conditions

Though CSG can generate continuous video frames in multiple weather conditions, the appearance discrepancies of target objects between the source daytime videos and the synthesized videos still limit the tracker’s generalization ability. To address this issue, we design a unified domain adaptation framework following the teacher-student pipeline, which can be flexibly deployed to various domain-customized scenarios. Specifically, we are given $N_{\mathcal{S}}$ labeled video frames $\mathcal{D}_{\mathcal{S}}=\{(\mathcal{I}_i^{\mathcal{S}},\mathbf{b}_i^{\mathcal{S}})\}_{i=1}^{N_{\mathcal{S}}}$ in the source domain and $N_{\mathcal{T}}$ unlabeled frames $\mathcal{D}_{\mathcal{T}}=\{\mathcal{I}_i^{\mathcal{T}}\}_{i=1}^{N_{\mathcal{T}}}$, where $\mathcal{I}_i^{\mathcal{S}}$ and $\mathbf{b}_i^{\mathcal{S}}$ denote the images and annotated bounding boxes in the source domain, and $\mathcal{I}_i^{\mathcal{T}}$ denotes the images in multiple target domains. We crop the paired template-search images of $\mathcal{D}_{\mathcal{S}}$ and $\mathcal{D}_{\mathcal{T}}$ and send them into the student and teacher networks, respectively. The student $\to$ teacher knowledge transfer is conducted by updating the weights of the teacher model with an exponential moving average (EMA):

$$\theta^{\mathcal{T}}\leftarrow\alpha\theta^{\mathcal{T}}+(1-\alpha)\theta^{\mathcal{S}}, \tag{2}$$

where $\theta^{\mathcal{T}}$ and $\theta^{\mathcal{S}}$ denote the learnable parameters of the teacher and student networks, and $\alpha$ is the momentum coefficient controlling the update rate of the teacher.
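As a minimal sketch of Eq. (2), the update can be written over plain Python containers. The parameter layout (a dictionary of float lists) and the default value of `alpha` are assumptions made for illustration; the text above does not state the momentum value used in practice.

```python
def ema_update(teacher, student, alpha=0.999):
    """Apply Eq. (2) in place: theta_T <- alpha * theta_T + (1 - alpha) * theta_S.

    Both arguments map parameter names to lists of floats.
    The default alpha is an illustrative assumption.
    """
    for name, s_values in student.items():
        t_values = teacher[name]
        for k, s in enumerate(s_values):
            t_values[k] = alpha * t_values[k] + (1.0 - alpha) * s
    return teacher


# Toy usage: one step moves the teacher 10% of the way towards the student.
teacher = {"w": [1.0, 0.0]}
student = {"w": [0.0, 1.0]}
ema_update(teacher, student, alpha=0.9)
```

A value of `alpha` close to 1 makes the teacher change slowly, so its pseudo labels are less sensitive to noisy student updates.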

Domain-Customized Adapter. The student-teacher training paradigm allows the tracker to gradually propagate source-domain information to the target domain. However, as the data distributions in different weather conditions vary greatly, generating large amounts of multi-domain samples is time-consuming, and conducting domain knowledge transfer separately for each domain would inevitably introduce redundant parameters. Considering this, we propose a Domain-Customized Adapter (DCA) to transfer the target object’s representation to arbitrary weather conditions efficiently.

Figure 4: Details of the Domain-Customized Adapter (DCA) module.

We present the detailed structure of DCA in Fig. 4. Formally, suppose the cropped template and search images in the source domain are $\mathbf{Z}^{\mathcal{S}}$ and $\mathbf{X}^{\mathcal{S}}$, respectively, while the image pair in the target domain is $\mathbf{Z}^{\mathcal{T}}$ and $\mathbf{X}^{\mathcal{T}}$. We first use a lightweight ResNet block to transform and reshape $\mathbf{X}^{\mathcal{T}}$ into a query $\mathbf{Q}\in\mathbb{R}^{K\times C}$. Then we initialize a Gaussian random variable and embed it as a learnable token bank $\mathbf{B}\in\mathbb{R}^{L'\times C}$ that consists of $L'$ learnable feature vectors with channel dimension $C$. The token bank $\mathbf{B}$ is further projected into key and value tokens $\mathbf{K}$ and $\mathbf{V}$, each of size $L'\times C$, by two FC layers. We compute a structural token $\mathbf{S}$ between the query and the embedded key-value tokens as follows:

$$\mathbf{S}=\mathrm{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V}, \tag{3}$$

where the structural token $\mathbf{S}\in\mathbb{R}^{K\times C}$ encodes the latent image content representation, which shares a similar contextual structure with $\mathbf{X}^{\mathcal{S}}$ in the embedding space. The structural token $\mathbf{S}$ is subsequently fed into the frozen vision transformer and concatenated with the encoded template-search tokens of the source domain images, allowing the model to rapidly find the optimal convergence checkpoints in various adverse weather conditions.
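The cross-attention in Eq. (3) can be illustrated with a small dependency-free sketch. Several choices here are assumptions rather than details from the paper: the two FC layers are modelled as bias-free linear maps `W_k` and `W_v`, $d_k$ is taken to be the channel dimension $C$ of the key tokens, and the query is supplied directly instead of being produced by a ResNet block.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def structural_token(Q, B, W_k, W_v):
    """Eq. (3): S = Softmax(Q K^T / sqrt(d_k)) V with K = B W_k and V = B W_v."""
    K = matmul(B, W_k)                      # (L', C) key tokens
    V = matmul(B, W_v)                      # (L', C) value tokens
    d_k = len(K[0])                         # assumed equal to the channel dimension C
    K_t = [list(col) for col in zip(*K)]    # (C, L')
    scores = matmul(Q, K_t)                 # (K, L') query-key similarities
    attn = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(attn, V)                  # (K, C) structural token


def random_matrix(rows, cols, rng):
    """Gaussian-initialised matrix, standing in for learnable weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy usage: 4 query positions, a bank of 5 tokens, 3 channels.
rng = random.Random(0)
Q = random_matrix(4, 3, rng)
B = random_matrix(5, 3, rng)
S = structural_token(Q, B, random_matrix(3, 3, rng), random_matrix(3, 3, rng))
```

Because each softmax row sums to one, every row of `S` is a convex combination of the value tokens, so the output always has the same shape as the query regardless of the bank size $L'$.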

3.3 Target-Aware Confidence Alignment

Since annotations are only available in the source domain, we train the tracker following a pseudo-label propagation strategy. Specifically, we send the synthesized template-search pairs into the teacher network to generate pseudo labels. These pseudo labels are fed back into the student network as supervision to update the weights of the tracking model. However, the pseudo labels may be noisy, and incorrect pseudo labels will mislead the target state prediction. To address this problem, we propose a Target-Aware Confidence Alignment (TCA) module using optimal transport (OT) theory to enhance localization consistency in both domains by measuring the discrepancies in localization confidence at the candidate positions.

To be concrete, suppose the regressed response maps of the student and teacher networks are $\mathbf{r}^{\mathcal{S}}\in\mathbb{R}^{N\times(H'\times W')}$ and $\mathbf{r}^{\mathcal{T}}\in\mathbb{R}^{N\times(H'\times W')}$, where $N$ denotes the number of image samples in a mini-batch, and $H'$ and $W'$ represent the height and width of the response maps. We construct confidence distributions $\mathbf{d}^{\mathcal{S}}\in\mathbb{R}^{N}$ and $\mathbf{d}^{\mathcal{T}}\in\mathbb{R}^{N}$ for each sample in a mini-batch as:

$$\mathbf{d}^{\mathcal{S}}=\exp(\mathbf{r}^{\mathcal{S}}_{i,\mathbf{p}_i}),\quad \mathbf{d}^{\mathcal{T}}=\exp(\mathbf{r}^{\mathcal{T}}_{i,\mathbf{p}_i}), \tag{4}$$

where, for the $i$-th sample, $\mathbf{p}_i=\underset{j=1\ldots H'\times W'}{\arg\max}\,\mathbf{r}^{\mathcal{T}}_{i,j}$ denotes the spatial index of the response map with the highest confidence score.
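A small sketch of Eq. (4) may help make the indexing concrete. It assumes that the $i$-th entries of $\mathbf{d}^{\mathcal{S}}$ and $\mathbf{d}^{\mathcal{T}}$ are obtained by reading both response maps at the teacher's peak index $\mathbf{p}_i$, and it also returns the L1-normalised vectors that appear in Eq. (8). The function name and the list-based layout are illustrative choices.

```python
import math


def confidence_distributions(r_student, r_teacher):
    """Eq. (4): read both response maps at the teacher's peak index p_i.

    r_student and r_teacher hold N flattened response maps of length H'*W'.
    Returns the peak indices and the L1-normalised d^S and d^T.
    """
    peaks = [max(range(len(row)), key=row.__getitem__) for row in r_teacher]
    d_s = [math.exp(r_student[i][p]) for i, p in enumerate(peaks)]
    d_t = [math.exp(r_teacher[i][p]) for i, p in enumerate(peaks)]
    norm_s, norm_t = sum(d_s), sum(d_t)     # L1 norms; every entry is positive
    return peaks, [v / norm_s for v in d_s], [v / norm_t for v in d_t]


# Toy usage: a mini-batch of two samples with 3-element response maps.
r_t = [[0.1, 0.9, 0.2], [0.8, 0.1, 0.1]]
r_s = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]
peaks, d_s, d_t = confidence_distributions(r_s, r_t)
```

After normalisation both vectors sum to one, so they can be treated as discrete distributions over the samples of the mini-batch.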

To construct the cost map $\mathbf{C}_{i,j}$ for the OT problem, we simultaneously consider the spatial and confidence discrepancies of each sample. Here we introduce two terms to measure the matching cost:

| 𝐂i,jConf=‖𝐫i,𝐩i𝒮−𝐫j,𝐩j𝒯‖1max1≤m,n≤N⁡‖𝐫m,𝐩m𝒮−𝐫n,𝐩n𝒯‖1,subscriptsuperscript𝐂Conf𝑖𝑗subscriptnormsubscriptsuperscript𝐫𝒮𝑖subscript𝐩𝑖subscriptsuperscript𝐫𝒯𝑗subscript𝐩𝑗1subscriptformulae-sequence1𝑚𝑛𝑁subscriptnormsubscriptsuperscript𝐫𝒮𝑚subscript𝐩𝑚subscriptsuperscript𝐫𝒯𝑛subscript𝐩𝑛1\mathbf{C}^{\textrm{Conf}}_{i,j}=\frac{\left\|\mathbf{r}^{\mathcal{S}}_{i,% \mathbf{p}_{i}}-\mathbf{r}^{\mathcal{T}}_{j,\mathbf{p}_{j}}\right\|_{1}}{\max_% {1\leq m,n\leq N}\left\|\mathbf{r}^{\mathcal{S}}_{m,\mathbf{p}_{m}}-\mathbf{r}% ^{\mathcal{T}}_{n,\mathbf{p}_{n}}\right\|_{1}},bold_C start_POSTSUPERSCRIPT Conf end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = divide start_ARG ∥ bold_r start_POSTSUPERSCRIPT caligraphic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT - bold_r start_POSTSUPERSCRIPT caligraphic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j , bold_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG roman_max start_POSTSUBSCRIPT 1 ≤ italic_m , italic_n ≤ italic_N end_POSTSUBSCRIPT ∥ bold_r start_POSTSUPERSCRIPT caligraphic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m , bold_p start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT - bold_r start_POSTSUPERSCRIPT caligraphic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , bold_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG , | (5) | | 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --- |

| 𝐂i,jPos=‖𝐩i𝒮−𝐩j𝒯‖2max1≤m,n≤N⁡‖𝐩m𝒮−𝐩n𝒯‖2,subscriptsuperscript𝐂Pos𝑖𝑗subscriptnormsubscriptsuperscript𝐩𝒮𝑖subscriptsuperscript𝐩𝒯𝑗2subscriptformulae-sequence1𝑚𝑛𝑁subscriptnormsubscriptsuperscript𝐩𝒮𝑚subscriptsuperscript𝐩𝒯𝑛2\mathbf{C}^{\textrm{Pos}}_{i,j}=\frac{\left\|\mathbf{p}^{\mathcal{S}}_{i}-% \mathbf{p}^{\mathcal{T}}_{j}\right\|_{2}}{\max_{1\leq m,n\leq N}\left\|\mathbf% {p}^{\mathcal{S}}_{m}-\mathbf{p}^{\mathcal{T}}_{n}\right\|_{2}},bold_C start_POSTSUPERSCRIPT Pos end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = divide start_ARG ∥ bold_p start_POSTSUPERSCRIPT caligraphic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - bold_p start_POSTSUPERSCRIPT caligraphic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG roman_max start_POSTSUBSCRIPT 1 ≤ italic_m , italic_n ≤ italic_N end_POSTSUBSCRIPT ∥ bold_p start_POSTSUPERSCRIPT caligraphic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT - bold_p start_POSTSUPERSCRIPT caligraphic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG , | (6) | | 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- |

$$\mathbf{C}_{i,j}=\mathbf{C}^{\textrm{Conf}}_{i,j}+\mathbf{C}^{\textrm{Pos}}_{i,j}, \tag{7}$$

where $\mathbf{C}^{\textrm{Conf}}$ and $\mathbf{C}^{\textrm{Pos}}$ represent the confidence cost and the position cost between the distributions $\mathbf{d}^{\mathcal{S}}$ and $\mathbf{d}^{\mathcal{T}}$.
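As a minimal sketch of Eqs. (6) and (7), the snippet below builds the normalized position cost from candidate positions and adds it to a confidence cost. The confidence cost is taken as a given input matrix, and the helper names and the 2x2 grid of candidate positions are illustrative assumptions rather than part of the released code.

```python
import math


def position_cost(src_pts, tgt_pts):
    """Eq. (6): pairwise Euclidean distances divided by the largest distance."""
    dist = [[math.dist(p, q) for q in tgt_pts] for p in src_pts]
    max_dist = max(max(row) for row in dist)
    if max_dist == 0.0:  # all candidates coincide, so there is no position penalty
        return [[0.0 for _ in row] for row in dist]
    return [[d / max_dist for d in row] for row in dist]


def total_cost(conf_cost, pos_cost):
    """Eq. (7): element-wise sum of the confidence and position costs."""
    return [[c + p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(conf_cost, pos_cost)]


# Hypothetical 2x2 grid of candidate positions shared by both branches.
grid = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
pos = position_cost(grid, grid)
```

Because of the normalization, every entry of the position cost lies in [0, 1], so it stays on a bounded scale when added to the confidence cost.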

Based on the discussion above, we design a position-sensitive optimal transport (PSOT) loss to measure the cost of moving the confidence distribution from $\mathbf{d}^{\mathcal{S}}$ to $\mathbf{d}^{\mathcal{T}}$, which can be defined as the OT problem's dual formulation:

$$\mathcal{L}_{p}=\left\langle\bm{\mu},\frac{\mathbf{d}^{\mathcal{T}}}{\left\|\mathbf{d}^{\mathcal{T}}\right\|_{1}}\right\rangle+\left\langle\bm{\nu},\frac{\mathbf{d}^{\mathcal{S}}}{\left\|\mathbf{d}^{\mathcal{S}}\right\|_{1}}\right\rangle, \tag{8}$$

where $\bm{\mu}$ and $\bm{\nu}$ are the solutions of the OT problem. The details can be found in Appendix B.
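The dual form in Eq. (8) can be illustrated with a small, dependency-free sketch based on entropy-regularized Sinkhorn iterations. This is not the authors' implementation: the regularization strength `eps`, the iteration count, and the mapping of the scaling vectors to $\bm{\mu}$ (target side) and $\bm{\nu}$ (source side) are assumptions inferred from the form of Eq. (8).

```python
import math


def normalize(d):
    """L1-normalize a non-negative confidence vector."""
    total = sum(d)
    return [x / total for x in d]


def sinkhorn_dual(cost, a, b, eps=0.1, n_iters=500):
    """Sinkhorn iterations; returns the dual potentials (mu, nu) and the plan."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    nu = [eps * math.log(x) for x in u]  # potential paired with the source side
    mu = [eps * math.log(x) for x in v]  # potential paired with the target side
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    return mu, nu, plan


def psot_loss(cost, d_src, d_tgt, eps=0.1):
    """Eq. (8): <mu, d_T / ||d_T||_1> + <nu, d_S / ||d_S||_1>."""
    a, b = normalize(d_src), normalize(d_tgt)
    mu, nu, plan = sinkhorn_dual(cost, a, b, eps)
    loss = sum(m_j * b_j for m_j, b_j in zip(mu, b))
    loss += sum(n_i * a_i for n_i, a_i in zip(nu, a))
    return loss, plan


# Toy example: two candidate positions with mismatched confidences.
cost = [[0.0, 1.0], [1.0, 0.0]]
loss, plan = psot_loss(cost, d_src=[3.0, 1.0], d_tgt=[1.0, 3.0])
```

With entropic regularization, the dual value equals the transport cost of the recovered plan minus `eps` times the plan's entropy, so it approaches the unregularized OT cost as `eps` shrinks.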

During training, we jointly adopt the target supervision loss and the position-sensitive optimal transport loss as a hybrid supervision loss to train the whole student-teacher model:

$$\mathcal{L}=\mathcal{L}_{t}+\lambda\mathcal{L}_{p}, \tag{9}$$

where $\lambda$ is the hyperparameter that balances the loss terms. We solve the OT problem with the fast Sinkhorn distances algorithm [8]. Similar to [47], the target supervision loss consists of the classification loss, the localization $L_{1}$ loss, and the generalized IoU (GIoU) loss:

$$\mathcal{L}_{t}=\mathcal{L}_{cls}+\beta L_{1}+\gamma L_{GIoU}. \tag{10}$$

By minimizing the target supervision loss and position-sensitive optimal transport loss, the feature representations and localization response can be effectively aligned to alleviate the domain shift.

4 Experiments

In this section, we conduct several experiments to evaluate the effectiveness of our proposed method. Our method is implemented in Python 3.10 and PyTorch 2.1.1. Our tracker is trained on 4 NVIDIA RTX 3090 GPUs. All inference speed tests are conducted on a single NVIDIA RTX 3090 GPU.

Table 1: Comparison with state-of-the-art visual trackers on synthetic datasets: GOT-10k-Foggy, DTB70-Foggy, GOT-10k-Dark, DTB70-Dark, GOT-10k-Rainy and DTB70-Rainy. The top two results are highlighted with red and blue fonts, respectively. Cross-domain trackers are listed above the double line, and generic trackers below it.

| Tracker | GOT-10k-Foggy AO | GOT-10k-Foggy SR0.50 | GOT-10k-Foggy SR0.75 | DTB70-Foggy AUC | DTB70-Foggy P | GOT-10k-Dark AO | GOT-10k-Dark SR0.50 | GOT-10k-Dark SR0.75 | DTB70-Dark AUC | DTB70-Dark P | GOT-10k-Rainy AO | GOT-10k-Rainy SR0.50 | GOT-10k-Rainy SR0.75 | DTB70-Rainy AUC | DTB70-Rainy P |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UMDATrack | 66.6 | 75.8 | 62.2 | 66.21 | 86.05 | 65.4 | 75.3 | 57.3 | 66.07 | 85.72 | 68.5 | 78.4 | 63.2 | 66.75 | 87.60 |
| DCPT [55] | 61.6 | 70.2 | 56.9 | 58.31 | 75.33 | 62.4 | 70.5 | 54.2 | 61.87 | 80.11 | 62.3 | 70.1 | 59.8 | 61.68 | 82.56 |
| UDAT-CAR [49] | 51.5 | 60.3 | 45.2 | 50.21 | 69.41 | 56.8 | 64.2 | 49.1 | 57.20 | 75.80 | 59.5 | 65.2 | 55.3 | 56.42 | 75.36 |
| SAM-DA [12] | 50.2 | 60.5 | 48.3 | 51.33 | 69.89 | 55.4 | 63.1 | 48.3 | 57.15 | 75.12 | 60.2 | 66.1 | 57.6 | 57.63 | 76.12 |
| MLKD-Track [28] | 52.3 | 62.3 | 49.1 | 52.46 | 70.32 | 53.8 | 61.6 | 46.9 | 55.21 | 73.68 | 57.3 | 64.8 | 57.1 | 56.89 | 74.12 |
| ARTrackV2 [1] | 64.8 | 73.0 | 59.9 | 62.25 | 80.15 | 63.1 | 72.8 | 53.9 | 62.87 | 80.56 | 66.2 | 75.8 | 61.2 | 63.84 | 83.32 |
| EVPTrack [37] | 63.5 | 70.7 | 56.5 | 57.96 | 75.45 | 62.7 | 71.8 | 53.9 | 63.01 | 81.12 | 65.5 | 75.2 | 60.5 | 64.03 | 84.11 |
| ODTrack [53] | 65.1 | 74.5 | 56.0 | 61.12 | 79.32 | 62.5 | 71.5 | 53.1 | 62.21 | 80.23 | 64.8 | 74.5 | 59.5 | 63.95 | 83.56 |
| HipTrack [3] | 63.3 | 72.0 | 59.6 | 60.52 | 78.22 | 62.9 | 72.4 | 53.8 | 62.48 | 80.57 | 65.6 | 75.4 | 60.2 | 63.57 | 83.36 |
| DropTrack [41] | 64.9 | 73.8 | 58.5 | 59.95 | 77.66 | 62.2 | 72.5 | 54.3 | 61.98 | 80.21 | 65.3 | 75.3 | 60.4 | 62.87 | 83.13 |
| SeqTrack [5] | 65.2 | 74.6 | 56.3 | 60.21 | 78.70 | 61.4 | 70.5 | 52.3 | 62.84 | 81.57 | 65.1 | 75.0 | 60.3 | 63.75 | 83.28 |
| AQATrack [42] | 64.9 | 72.8 | 59.7 | 57.28 | 75.61 | 61.7 | 70.6 | 52.5 | 61.17 | 79.87 | 63.4 | 72.3 | 61.8 | 63.12 | 83.55 |
| ROMTrack [4] | 63.6 | 70.9 | 56.7 | 59.05 | 76.59 | 60.8 | 71.1 | 51.7 | 60.80 | 77.95 | 62.7 | 73.4 | 60.1 | 63.21 | 83.25 |
| OSTrack [47] | 61.9 | 71.7 | 59.7 | 56.23 | 77.43 | 61.3 | 70.9 | 51.5 | 59.23 | 77.43 | 61.6 | 71.0 | 58.6 | 59.23 | 77.43 |
| AVTrack [25] | 56.9 | 63.5 | 49.5 | 52.35 | 68.09 | 55.3 | 62.3 | 46.2 | 56.66 | 72.21 | 57.5 | 63.4 | 48.1 | 60.21 | 79.53 |
| DiMP [2] | 57.6 | 64.2 | 50.4 | 53.80 | 69.50 | 56.9 | 60.4 | 44.3 | 55.20 | 72.30 | 57.9 | 63.8 | 49.2 | 57.32 | 75.21 |
| SiamRPN++ [20] | 58.4 | 64.9 | 51.2 | 55.80 | 74.70 | 56.6 | 60.8 | 45.1 | 48.80 | 70.30 | 56.2 | 61.4 | 46.8 | 51.52 | 71.96 |
| SiamRPN [21] | 51.7 | 55.6 | 32.5 | 47.40 | 67.40 | 49.2 | 53.2 | 31.4 | 43.70 | 60.30 | 50.1 | 54.6 | 35.1 | 48.25 | 68.22 |

4.1 Implementation Details

Model settings. We adopt the vanilla ViT-Base [9] model as the backbone of our tracker, similar to OSTrack [47]. The patch size is set to $16\times 16$. We adopt a lightweight FCN consisting of 4 stacked Conv-BN-ReLU layers as the prediction head for both the teacher and student branches. The template and search region are resized to $128\times 128$ and $256\times 256$ respectively, corresponding to $2^{2}$ and $4^{2}$ times the target box area.
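The input sizes above translate directly into token counts for the ViT backbone. The short calculation below is a sketch under two assumptions that the text does not spell out: images are split into non-overlapping patches, and each crop is a square whose area is the stated multiple of the target box area. The function names and the example box are hypothetical.

```python
import math

PATCH_SIZE = 16  # patch size stated in the model settings


def num_tokens(side, patch=PATCH_SIZE):
    """Number of non-overlapping patches in a square image of the given side."""
    if side % patch != 0:
        raise ValueError("image side must be divisible by the patch size")
    per_side = side // patch
    return per_side * per_side


def crop_side(box_w, box_h, area_factor):
    """Side of a square crop whose area is area_factor times the box area."""
    return math.sqrt(area_factor * box_w * box_h)


template_tokens = num_tokens(128)  # 8 x 8 patches
search_tokens = num_tokens(256)    # 16 x 16 patches

# Hypothetical 40 x 90 target box: crop sides before resizing to 128 and 256.
template_crop = crop_side(40, 90, 2 ** 2)
search_crop = crop_side(40, 90, 4 ** 2)
```

The search region contributes four times as many tokens as the template, so it dominates the per-frame cost of the backbone.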

Training Details. Our training process is divided into two stages: a backbone training stage and a domain-customized training stage. We first synthesize videos in adverse weather conditions using only the GOT-10k dataset; the synthesized datasets include GOT-10k-Dark, GOT-10k-Foggy and GOT-10k-Rainy. For backbone training, the DCA module is not introduced; we employ the target supervision loss and the position-sensitive optimal transport loss to perform domain adaptation between the teacher and student networks. Four source domain datasets, LaSOT [10], TrackingNet [29], COCO [26], and GOT-10k [17], together with the three synthetic datasets, are used to train the student model. The sampling ratio of the datasets is set to 1:1:1:1:4:4:4. The backbone training takes 250 epochs. The learning rate is $4\times 10^{-4}$ and the weight decay is $1\times 10^{-4}$. The EMA hyperparameter $\alpha$ is set to 0.99. For the domain-customized training stage, we freeze the backbone feature extractor and train the DCA module for an additional 50 epochs. Both stages optimize the model with AdamW. Note that UMDATrack does not need to repeat the backbone training stage; only the DCA module is trained for each weather condition. As a result, it takes only one and a half days to train UMDATrack for all weather conditions, which significantly improves training efficiency while maintaining strong performance.
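The teacher update with $\alpha = 0.99$ can be sketched as follows. The snippet assumes the common mean-teacher convention, in which the teacher keeps a fraction $\alpha$ of its own weights and takes the rest from the student; the dictionary-of-floats representation and the parameter name `w` are purely illustrative.

```python
def ema_update(teacher, student, alpha=0.99):
    """One EMA step: teacher <- alpha * teacher + (1 - alpha) * student."""
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}


# Hypothetical single-parameter example.
teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student)  # "w" moves 1% of the way to the student
```

With $\alpha$ close to 1, the teacher changes slowly and acts as a temporally smoothed copy of the student.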

Loss Function. In our implementation, we use focal loss [34] for foreground-background classification and employ the L1 loss and GIoU loss [33, 44] for bounding box regression. Additionally, the PSOT (Position-Sensitive Optimal Transport) loss is applied to align the distributions between the teacher and student networks. The weighting coefficients for the focal loss, L1 loss, GIoU loss, and PSOT loss are set to 1.0, 5.0, 2.0, and 10.0, respectively.
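Read against Eqs. (9) and (10), these coefficients correspond to $\beta = 5.0$, $\gamma = 2.0$ and $\lambda = 10.0$. The sketch below combines the terms for a single predicted box. The classification and PSOT terms are passed in as precomputed scalars, the boxes are assumed to be non-degenerate and in (x1, y1, x2, y2) format, and the L1 term is averaged over the four coordinates; these choices and all function names are assumptions made for illustration.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def giou(a, b):
    """Generalized IoU of two non-degenerate boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (enclose - union) / enclose


def l1_loss(pred, target):
    """Mean absolute error over the box coordinates."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def hybrid_loss(cls_loss, pred_box, gt_box, psot, beta=5.0, gamma=2.0, lam=10.0):
    """Eq. (10) for the target supervision term, then Eq. (9) for the total."""
    target_loss = cls_loss + beta * l1_loss(pred_box, gt_box)
    target_loss += gamma * (1.0 - giou(pred_box, gt_box))
    return target_loss + lam * psot
```

For a perfect box prediction, the regression terms vanish and the total reduces to the classification term plus ten times the PSOT term.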

Inference. To accelerate inference, the template feature is initialized using the first frame of each video sequence and stored for relation modeling between the template and the search region in subsequent frames.

4.2 Comparisons with State-of-the-arts

In this subsection, we comprehensively compare UMDATrack with SOTA trackers under both real-world and synthesized adverse weather conditions to demonstrate the effectiveness and efficiency of our method. Note that our method targets cross-domain tracking rather than generic tracking. Nevertheless, it achieves a significant performance improvement over current state-of-the-art generic trackers.

Specifically, for nighttime conditions, we use the real-world NAT2021-test [49] and UAVDark70 [19] datasets and two synthesized datasets, i.e., GOT-10k-Dark and DTB70-Dark. For foggy environments, we evaluate the tracking performance using the GOT-10k-Foggy and DTB70-Foggy datasets. For rainy conditions, we use the GOT-10k-Rainy and DTB70-Rainy datasets. Finally, we use the real-world AVisT [31] dataset to evaluate the tracking performance under various adverse weather conditions in natural environments.

Synthetic GOT-10k and DTB70 [22]. As shown in Table 1, UMDATrack performs strongly across all three challenging conditions (foggy, dark, and rainy) on both the synthetic GOT-10k and DTB70 datasets. Under dark conditions, UMDATrack achieves the highest AUC (66.07) and precision (85.72) on the DTB70-Dark dataset, outperforming the second-best results by 3.06 points in AUC and 4.15 points in precision. A similar trend is observed on the GOT-10k-Dark dataset, where UMDATrack leads in AO, SR0.50, and SR0.75. In foggy conditions, UMDATrack outperforms the second-best results obtained by other trackers by 3.96 points in AUC and 5.90 points in precision on the DTB70-Foggy dataset. In rainy conditions, UMDATrack also outperforms advanced SOTA trackers such as ARTrackV2 and ODTrack.
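For readers less familiar with the GOT-10k metrics reported in Table 1, the sketch below shows how average overlap (AO) and success rate (SR) are commonly computed from per-frame IoU values: AO is the mean overlap, and SR at a threshold is the fraction of frames whose overlap exceeds it. The IoU values are made up for illustration and are not taken from the experiments.

```python
def average_overlap(ious):
    """AO: mean IoU between predicted and ground-truth boxes over all frames."""
    return sum(ious) / len(ious)


def success_rate(ious, threshold):
    """SR: fraction of frames whose IoU exceeds the threshold."""
    return sum(1 for iou in ious if iou > threshold) / len(ious)


# Hypothetical per-frame IoUs for a four-frame sequence.
frame_ious = [0.9, 0.6, 0.4, 0.8]
ao = average_overlap(frame_ious)
sr_050 = success_rate(frame_ious, 0.50)
sr_075 = success_rate(frame_ious, 0.75)
```

SR0.75 is the stricter of the two thresholds, which is why it is consistently lower than SR0.50 for every tracker in Table 1.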

Table 2: Comparison with state-of-the-art visual trackers on real-world datasets: NAT2021, UAVDark70, and AVisT. The top two results are highlighted in red and blue, respectively. Cross-domain trackers are listed above the double line, and generic trackers below it.

Table 3: Comparison of inference speed, FLOPs, and model parameters across different trackers.

Results on Real-World datasets. To further verify the effectiveness of the proposed UMDATrack, we conduct experiments on real-world datasets with adverse weather conditions. As shown in Table 2, on the large-scale nighttime dataset NAT2021, UMDATrack achieves the best AUC (54.58) and precision (70.78). Specifically, in terms of AUC, it outperforms the second-best tracker, ARTrackV2 (53.13), by 1.45 points. This suggests that our proposed framework helps the model learn effectively from synthetic extreme-domain datasets. On the challenging real-world UAV tracking dataset UAVDark70, UMDATrack outperforms all other trackers, achieving an AUC 1.83 points higher and a precision 1.4 points higher than the second-best tracker. Note that most of the trackers reported in the table cannot be directly deployed on UAV systems. In contrast, UMDATrack obtains the best performance at real-time speed, showing great potential for real-world UAV tracking. Furthermore, we also test UMDATrack on the AVisT dataset, which is specifically collected for tracking in diverse scenarios with adverse visibility and includes conditions such as rain, snow, fog and camouflage. UMDATrack also obtains the leading performance in both precision and AUC on this dataset.

Inference Speed. Since UMDATrack does not introduce heavy blocks for target appearance modeling, its computational cost is limited. As shown in Table 3, we compare inference speed, MACs, and parameter counts with those of state-of-the-art trackers; UMDATrack achieves the highest inference speed with relatively low computational cost and parameter count.

4.3 Ablation Studies and Visualization

Table 4: Ablation study on the individual impact of each module (CSG, DCA, and TCA) in our model. The presence or absence of each module is marked with a check or dash, respectively. Results are reported in terms of AUC and Precision for each configuration, evaluated on the NAT2021 dataset.

| CSG | DCA | TCA | AUC (%) | Precision (%) |
| --- | --- | --- | --- | --- |
| - | - | - | 49.11 | 63.52 |
| ✓ | - | - | 50.90 | 65.38 |
| - | ✓ | - | 50.56 | 65.50 |
| ✓ | - | ✓ | 52.27 | 67.10 |
| ✓ | ✓ | - | 52.24 | 67.49 |
| ✓ | ✓ | ✓ | 54.58 | 70.78 |

Study on the components of UMDATrack. We conduct ablation experiments on the three proposed modules to verify their effectiveness. As shown in Table 4, the baseline does not introduce any of the modules and is trained only on the four source domain datasets. When the CSG module is introduced, the model achieves an AUC of 50.90% and a precision of 65.38%. Adding the TCA module improves these results, bringing the AUC to 52.27% and the precision to 67.10%. Further including the DCA module increases performance to an AUC of 54.58% and a precision of 70.78%. These results demonstrate that each module provides a clear performance gain, with the full model configuration yielding the highest scores in both metrics on the NAT2021 dataset.
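The gains discussed above can be read off Table 4 with a few lines of arithmetic. The snippet below simply restates the table values and subtracts the baseline; the data structure and function name are illustrative.

```python
# Table 4 on NAT2021: enabled modules -> (AUC %, Precision %).
TABLE4 = {
    frozenset(): (49.11, 63.52),
    frozenset({"CSG"}): (50.90, 65.38),
    frozenset({"DCA"}): (50.56, 65.50),
    frozenset({"CSG", "TCA"}): (52.27, 67.10),
    frozenset({"CSG", "DCA"}): (52.24, 67.49),
    frozenset({"CSG", "DCA", "TCA"}): (54.58, 70.78),
}


def gain_over_baseline(modules):
    """Improvement in (AUC, Precision) points relative to the no-module baseline."""
    base_auc, base_prec = TABLE4[frozenset()]
    auc, prec = TABLE4[frozenset(modules)]
    return round(auc - base_auc, 2), round(prec - base_prec, 2)


full_gain = gain_over_baseline({"CSG", "DCA", "TCA"})
csg_gain = gain_over_baseline({"CSG"})
```

The full configuration improves on the baseline by 5.47 AUC points and 7.26 precision points, compared with 1.79 and 1.86 points for CSG alone.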

Table 5: Effect of different EMA (Exponential Moving Average) update frequencies on model performance.

Table 6: Different dataset proportions used for training, with LaSOT, GOT-10k, TrackingNet, COCO, and Synthetic datasets in the specified ratios.

Study on the training hyper-parameters of UMDATrack. We conduct two ablation studies, on the update frequency of EMA and on the proportion of the training datasets. As shown in Table 5, we experiment with performing EMA after each epoch, every three epochs, every five epochs, and after each batch to transfer the student network's weights to the teacher network. The results indicate that performing EMA after each epoch yields the best results. For the dataset proportion settings, we conduct four groups of experiments as shown in Table 6, and the results indicate that group 3 achieves the best performance. Therefore, we set the training dataset proportion to 1:1:1:1:4:4:4.
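To make the 1:1:1:1:4:4:4 proportion concrete, the sketch below converts the ratios into per-dataset sampling probabilities and draws a dataset name accordingly. The use of `random.choices` and the helper names are assumptions for illustration; the actual sampler in the released code may differ.

```python
import random

# Sampling ratios: four source datasets followed by three synthetic datasets.
RATIOS = {
    "LaSOT": 1,
    "TrackingNet": 1,
    "COCO": 1,
    "GOT-10k": 1,
    "GOT-10k-Dark": 4,
    "GOT-10k-Foggy": 4,
    "GOT-10k-Rainy": 4,
}


def sampling_probs(ratios):
    """Normalize integer ratios into sampling probabilities."""
    total = sum(ratios.values())
    return {name: weight / total for name, weight in ratios.items()}


def sample_dataset(ratios, rng):
    """Draw one dataset name with probability proportional to its ratio."""
    names = list(ratios)
    weights = [ratios[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


probs = sampling_probs(RATIOS)
picked = sample_dataset(RATIOS, random.Random(0))
```

Under this setting, each synthetic dataset is sampled four times as often as each source dataset, so three quarters of the training samples come from the synthetic domains.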

Figure 5: The convergence speed of DCA. Please zoom in for details.

Figure 6: Feature visualization by t-SNE of dark, foggy, and rainy scenes compared to normal (daytime) scenes. Orange and blue indicate the source domain and the target domains, respectively. The scatter plots depict the feature distributions of the base tracker and UMDATrack across different weather conditions. The results show that UMDATrack effectively narrows the domain discrepancy in various challenging weather conditions.

Table 7: Quality comparison of synthetic datasets generated by different generators. AUC is evaluated on NAT2021 dataset.

Study on the speed of DCA convergence. We analyze how quickly the DCA reaches its optimal performance during training. As shown in Fig. 5, the DCA already obtains encouraging performance at around 50 epochs. Beyond this point, performance increases only slightly and may even decline with additional epochs. Therefore, we suggest stopping around this point as a trade-off between performance and training time.

Study on the impact of the synthetic datasets. We use SSIM [39] and LPIPS [51] to evaluate image quality, reported in the second and third columns of Table 7. Compared to other methods such as CycleGAN, UNIT, or simply using Gamma, CSG, especially with text prompts, achieves the best generation quality. Although our generator requires slightly more time to synthesize datasets, this is a trade-off between generation quality and computational time. The use of text prompts improves the quality and relevance of the generated datasets, leading to better downstream performance. As a result, the tracker achieves the best AUC.

Figure 7: Visual comparison of our approach with other competitive trackers, together with the corresponding score maps.

Visualizing Robustness in Adverse Conditions. Fig. 6 shows feature distributions using t-SNE [15], where UMDATrack better aligns the source and target domains across dark, foggy, and rainy conditions, reducing the domain discrepancy. Fig. 7 presents tracking results, with UMDATrack achieving higher accuracy and stronger robustness than other trackers in extreme scenarios.

5 Conclusion

In this paper, we propose a unified multi-domain adaptive tracker, termed UMDATrack, to predict the target state under various adverse weather conditions. We first use a controllable scenario generator to synthesize unlabeled videos in multiple weather conditions under the guidance of different text prompts. Afterwards, we propose a simple yet effective domain-customized adapter that allows the tracking model to rapidly adapt to various weather conditions without redundant model updating. Furthermore, we propose a target-aware confidence alignment module (TCA) based on optimal transport theory, which enhances the localization consistency between source and target domains by measuring the discrepancies of the localization confidence at the candidate positions. Experiments show that UMDATrack achieves new state-of-the-art performance on both real-world and synthesized datasets by a significant margin.

References