Event USKT: U-State Space Model in Knowledge Transfer for Event Cameras

Yuhui Lin1∗, Jiahao Zhang1, Siyuan Li2, Jimin Xiao1,
Ding Xu3, Wenjun Wu4, Jiaxuan Lu5†
1Xi’an Jiaotong-Liverpool University
2Dalian University of Technology
3Alibaba International Digital Business Group
4University of Illinois at Urbana-Champaign
5Shanghai Artificial Intelligence Laboratory
*First Author: Yuhui.Lin21@student.xjtlu.edu.cn
†Corresponding Author: lujiaxuan@pjlab.org.cn

Abstract

Event cameras, as an emerging imaging technology, offer distinct advantages over traditional RGB cameras, including reduced energy consumption and higher frame rates. However, the limited quantity of available event data presents a significant challenge, hindering their broader development. To alleviate this issue, we introduce a tailored U-shaped State Space Model Knowledge Transfer (USKT) framework for Event-to-RGB knowledge transfer. This framework generates inputs compatible with RGB frames, enabling event data to effectively reuse pre-trained RGB models and achieve competitive performance with minimal parameter tuning. Within the USKT architecture, we also propose a bidirectional reverse state space model. Unlike conventional bidirectional scanning mechanisms, the proposed Bidirectional Reverse State Space Model (BiR-SSM) leverages a shared weight strategy, which facilitates efficient modeling while conserving computational resources. In terms of effectiveness, integrating USKT with ResNet50 as the backbone improves model performance by 0.95%, 3.57%, and 2.9% on DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS datasets, respectively, underscoring USKT’s adaptability and effectiveness. The code will be made available upon acceptance.

1 Introduction

Figure 1: The proposed U-shaped State Space Model Knowledge Transfer (USKT) framework with the BiR-SSM module combines reconstruction and classification losses for Event-to-RGB feature adaptation, enabling the reuse of the pre-trained RGB encoder.

Event cameras represent a novel imaging technology that differs fundamentally from traditional frame-based cameras by capturing changes in brightness at the pixel level continuously, rather than capturing entire frames at regular intervals. The unique mechanism provides event cameras with exceptionally high temporal resolution and minimal latency, making them particularly well-suited for capturing fast-moving activities and handling scenes with high dynamic range [50, 20]. Compared to traditional cameras, event cameras excel in environments with significant lighting variations, while also consuming less energy, making them highly promising for applications such as autonomous driving [3], robotic navigation [47], and high-speed motion capture [21]. However, as a relatively new imaging modality, event cameras face significant challenges related to data scarcity [4, 49].

To address the challenge of limited data availability in event-based imaging, exploring knowledge transfer for event data is a promising direction worth investigating. In the broader field of knowledge transfer, methods can generally be categorized into domain-based and generative-based approaches. Domain-based methods aim to improve target domain performance by transferring knowledge from auxiliary domains [71, 45, 28, 42], while generative-based methods focus on generating synthetic data to enhance model performance [57, 61, 56].

In the field of event cameras, data is recorded only during changes in pixel brightness, resulting in event streams that are often sparse in visual content and differ significantly from the feature distributions of RGB images. Consequently, domain-based methods frequently encounter challenges related to domain mismatches, making effective knowledge transfer difficult [31]. On the other hand, generative-based models can simulate sparse event streams to generate additional synthetic RGB data, which can be leveraged to enhance model training [46, 49].

To address the scarcity of event data, we design a generative U-shaped State Space Model Knowledge Transfer (USKT) framework tailored to the characteristics of event data. Previous research has widely recognized U-shaped methods for their excellent reconstruction capabilities [17, 66]. Building upon these capabilities, we propose a generative knowledge transfer approach specifically for adapting event data to RGB features. As shown in Figure 1, our proposed method includes a Residual Down Sampling Block, a Residual Up Sampling Block, and a Bidirectional Reverse State Space Model. Specifically, the first Residual Down Sampling Block increases feature dimensionality while reducing spatial resolution, whereas the Residual Up Sampling Block enhances image restoration and preserves critical feature information, aligning the feature distribution more closely with that of RGB features.

Furthermore, since the convolutional layers in our model predominantly focus on local features during the downsampling process, we incorporate sequence modeling that captures global feature dependencies. As past Transformer-based approaches often faced significant computational resource demands [27, 51], we introduce the Bidirectional Reverse State Space Model (BiR-SSM), which performs feature propagation through bidirectional scanning. Compared to the traditional Bidirectional State Space Model (Bi-SSM) [78], our BiR-SSM employs a shared SSM layer strategy aimed at ensuring feature consistency and reducing computational overhead. Additionally, our approach simultaneously performs reconstruction and classification to improve the model’s performance in classifying event images. In summary, our research makes the following three key contributions:

• We introduce Event USKT, the first generative framework for knowledge transfer that adapts event data to pre-trained RGB models, establishing a new benchmark in this domain.
• We propose the Bidirectional Reverse State Space Model (BiR-SSM), which efficiently reduces computational overhead while ensuring effective feature adaptation.
• We present a hybrid loss function that synergistically combines reconstruction and classification objectives, significantly enhancing the performance of knowledge transfer for event image recognition.

2 Related Work

2.1 Event-based Image Recognition

Event image recognition methods predominantly include graph-based models, Spiking Neural Networks (SNNs), and attention mechanisms. Graph-based models, using vertex and edge structures along with heterogeneous graph models and voxel grids, emulate spatial and temporal relationships among events and analyze complex data patterns, as demonstrated in various studies [37, 14, 59, 64, 63, 70, 44]. SNNs excel in processing time-step sequences for event image classification and, when integrated with attention mechanisms, significantly improve object recognition in dynamic environments by managing asynchronous data and focusing on critical features [75, 18, 73, 19, 16, 69, 76, 53, 77]. Additionally, some attention-based methods have also been widely used [40, 13, 34, 30, 22].

These methods, while tailored for event cameras, fail to address data scarcity. As a result, some studies have shifted to training with RGB-based models to mitigate this issue. For instance, several approaches based on ResNet have utilized RGB information to enhance the representational capability of event data [33, 12], while other methods have employed pre-trained ViT models based on RGB to improve the handling of sparse event streams [62]. Additionally, methods that integrate RGB and event camera data have proven to noticeably enhance the performance of downstream tasks [68].

2.2 Knowledge Transfer

In knowledge transfer, most approaches focus on RGB-to-RGB transfer. Domain adaptation methods align feature distributions between source and target domains. The PMC method enhances cross-modal recognition by generating missing target domain modalities through multimodal collaboration [72]. CLDA and MAJA mitigate domain shift via adversarial learning, boosting classification accuracy, especially in unsupervised scenarios [26, 79]. The DARDR method enhances cross-domain recognition by applying cross-modal constraints to transfer RGB-D data to the RGB target domain [36].

Generative-based methods use Generative Adversarial Networks (GANs) to generate target domain samples, reducing inter-domain differences. TriGAN and MSAN generate target samples from multiple source domains, significantly enhancing classification accuracy in unlabeled target domain tasks [52, 5]. DINE achieves privacy-preserving knowledge transfer with only a black-box source model [39], while DupGAN employs a dual-GAN structure to effectively ensure feature consistency across domains [29]. Meanwhile, U-shaped methods have shown promising results in generative tasks [17, 66].

Figure 2: Overview of the USKT framework. The proposed method is based on a U-shaped network, starting by mapping event data into suitable channels for USKT input through time accumulation. Subsequently, the data dimension is increased and the size is reduced via a downsampling process. Furthermore, we design a Bidirectional Reverse State Space Model (BiR-SSM) for sequence modeling. Following this, data is restored to its original resolution through an upsampling process. Finally, a reconstruction loss is introduced to enhance classification accuracy.

In the research on knowledge transfer between Event and RGB modalities, a few studies adopt co-training approaches, where models integrate features from both modalities to enhance robustness and accuracy across various environments [60, 58, 35]. In addition, CTN [74] is a Transformer-based cross-domain adaptation method that enhances the classification performance of event data by transferring features from RGB data.

3 Method

3.1 Overview

We propose a U-shaped State Space Model Knowledge Transfer (USKT) framework that efficiently converts event data into RGB features. As shown in Figure 2, the model consists of three key components: event data processing that transforms multiple time steps into voxel information to capture dynamic data changes; a Residual Down Sampling Block that reduces sequence length for efficient feature extraction; and a Residual Up Sampling Block that reconstructs the features into the RGB domain, suitable for encoder inputs. Additionally, we introduce a Bidirectional Reverse State Space Model (BiR-SSM) to fully capture the sequential dependencies between features. Finally, we focus on the performance of the Residual Up Sampling Block’s output $X_{\text{USKT}}$ after it passes through the feature extractor and present the design of the hybrid loss function.

3.2 Event Data Processing

An event stream can be visualized as consisting of multiple events, each characterized by $(x, y, t, p)$, where $(x, y)$ represent spatial coordinates, $t$ denotes the timestamp, and $p$ indicates the polarity ($+1$ or $-1$, signifying an increase or decrease in brightness). Consequently, event data is mapped into a three-dimensional grid where $(x, y)$ serve as the spatial dimensions and the time dimension is segmented into discrete bins, effectively organizing the event data temporally. Furthermore, based on the $(x, y)$ coordinates and the calculated time bin $k$, the event polarity $p$ is accumulated in the respective voxel within the grid. Each voxel $(x, y, k)$ then holds the aggregated polarity of events within the corresponding time bin, where the aggregation method, whether summing or averaging, depends on specific use cases. Ultimately, the final result is a three-dimensional tensor that retains both spatial and temporal information from the event stream.
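The accumulation described above can be sketched in a few lines. This is a minimal illustration, not the original implementation: it assumes sum aggregation and uniform time bins over the stream's time span, and the function name `events_to_voxel_grid` and the toy events are hypothetical.

```python
def events_to_voxel_grid(events, width, height, num_bins):
    """Accumulate event polarities into a [num_bins][height][width] grid.

    events: iterable of (x, y, t, p) tuples with p in {+1, -1}.
    Assumption: sum aggregation and uniform bins over [t_min, t_max].
    """
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return grid
    t_min = min(e[2] for e in events)
    t_max = max(e[2] for e in events)
    span = t_max - t_min
    for x, y, t, p in events:
        if span == 0:
            k = 0
        else:
            # Map the timestamp to a bin index k, clamped to the last bin.
            k = min(int((t - t_min) / span * num_bins), num_bins - 1)
        grid[k][y][x] += p
    return grid


# Toy stream: two positive events at pixel (0, 0) early on,
# one negative event at (1, 0) and one positive event at (1, 1) later.
events = [(0, 0, 0.00, +1), (0, 0, 0.01, +1), (1, 0, 0.50, -1), (1, 1, 1.00, +1)]
voxels = events_to_voxel_grid(events, width=2, height=2, num_bins=2)
```

Replacing the sum with a per-voxel average would give the averaging variant mentioned in the text; the resulting tensor has the same time × height × width layout in either case.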

3.3 Generative U-SSM Knowledge Transfer

Generative-based methods have been widely applied in the field of knowledge transfer across various tasks [54, 2, 67], and U-Net-based architectures have shown promising results in generative tasks [17, 66]. Building on these advances, we propose the U-SSM Knowledge Transfer (USKT) block for event-to-RGB knowledge transfer. Specifically, we input the event data $\mathbf{X}_{\text{input}} \in \mathbb{R}^{T \times 224 \times 224}$, where $T$ represents the time steps. Using a convolutional layer, we map the input data to 12 dimensions, standardizing the time steps of the event camera. The convolution operation can be expressed as:

$$\mathbf{X}_{\text{proj}} = \mathbf{Conv}(\mathbf{X}_{\text{input}}), \tag{1}$$

U-shaped models are highly effective for knowledge transfer, primarily due to the essential roles of their downsampling and upsampling modules. Downsampling modules compress data by reducing feature sizes and increasing dimensionality [43, 24], whereas upsampling modules expand features and retain detailed information necessary for reconstruction [10, 55]. However, traditional U-shaped approaches, typically designed for RGB data, may not directly translate to event data, which primarily captures changes in brightness. The mismatch can lead to overfitting. Moreover, the inherent sparsity of event data necessitates a departure from conventional downsampling techniques; therefore, we incorporate residual connections to maintain the integrity of the original features. To address these challenges, we introduce the Residual Down Sampling Block and Residual Up Sampling Block for effective downsampling and upsampling, respectively. As illustrated in Figure 2, the proposed framework employs 4 Residual Down Sampling Blocks and 5 Residual Up Sampling Blocks.

Residual Down Sampling Block. For this block, the input feature $\mathbf{X}_{\text{proj}} \in \mathbb{R}^{D \times N \times N}$ undergoes a series of operations. First, a convolution operation is applied to extract global features, resulting in $\mathbf{X}_{\text{conv1}} \in \mathbb{R}^{F \times N \times N}$. Next, another convolution focuses on feature downsampling, producing $\mathbf{X}_{\text{conv2}} \in \mathbb{R}^{F \times N/2 \times N/2}$. Simultaneously, the original input feature is downsampled directly through a convolution layer, yielding $\mathbf{X}_{\text{res}} \in \mathbb{R}^{F \times N/2 \times N/2}$. A residual connection is then applied, resulting in $\mathbf{X}_{\text{down}} \in \mathbb{R}^{F \times N/2 \times N/2}$. The Residual Down Sampling Block preserves essential features while reducing the spatial dimensions of the data.

In our method, the input is $\mathbf{X}_{\text{proj}} \in \mathbb{R}^{12 \times 224 \times 224}$ and, after the blocks are applied sequentially, the output is $\mathbf{X}_{\text{down}} \in \mathbb{R}^{128 \times 14 \times 14}$. For the output of the final Residual Down Sampling Block, we apply an average pooling strategy to further reduce the spatial dimensions and computational complexity while preserving global information.

After the downsampling process, our model employs BiR-SSM for feature modeling, which achieves effective feature propagation with relatively low computational resources. It is detailed further in Section 3.4.

Residual Up Sampling Block. For this block, the input is $\mathbf{X}_{\text{input}} \in \mathbb{R}^{D \times N \times N}$. Initially, we employ bilinear interpolation to enlarge the input dimensions. Subsequent feature extraction is performed using a $3 \times 3$ convolutional kernel followed by a $1 \times 1$ pointwise convolution kernel. Afterwards, the result $\mathbf{X}_{\text{up}}$ is concatenated with the corresponding scale feature $\mathbf{X}_{\text{down}}$. To finalize the process, a convolutional fusion technique is applied to reduce the dimensionality.

In our method, the input of the first block is the modeling result from BiR-SSM, $\mathbf{X}_{\text{ssm}} \in \mathbb{R}^{128 \times 7 \times 7}$. Each step then produces an $\mathbf{X}_{\text{up}}$, ending with $\mathbf{X}_{\text{up}} \in \mathbb{R}^{D \times 224 \times 224}$, where $D$ represents the dimension of the input provided to the USKT. Additionally, if the final $\mathbf{X}_{\text{up}}$ does not have a dimension of 3, we apply a convolution to the final $\mathbf{X}_{\text{up}}$ to produce an output with the desired dimensions, $\mathbf{X}_{\text{USKT}} \in \mathbb{R}^{3 \times 224 \times 224}$.
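The spatial sizes quoted in this section (224 at the input, 14 after the down path, 7 at the BiR-SSM, and 224 again at the output) can be checked with a short trace. The sketch below assumes that every Residual Down Sampling Block and the average pooling halve the resolution and that every bilinear interpolation doubles it; these factors are inferred from the stated shapes rather than given explicitly, and `trace_uskt_shapes` is a hypothetical helper.

```python
def trace_uskt_shapes(size=224, num_down=4, num_up=5):
    """Return (stage, spatial size) pairs along the U-shaped path."""
    trace = [("input", size)]
    for i in range(num_down):
        size //= 2  # assumed: each down block halves height and width
        trace.append((f"down{i + 1}", size))
    size //= 2  # assumed: average pooling halves the resolution once more
    trace.append(("avg_pool", size))
    for i in range(num_up):
        size *= 2  # assumed: bilinear interpolation doubles the resolution
        trace.append((f"up{i + 1}", size))
    return trace


trace = trace_uskt_shapes()
for stage, s in trace:
    print(f"{stage:>8}: {s} x {s}")
```

Under these assumptions, four down blocks plus one pooling step account for the five halvings from 224 to 7, which is why five up blocks are needed to return to 224.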

3.4 Bidirectional Reverse State Space Model

Figure 3: The figure on the left shows the traditional Bi-SSM, while the figure on the right represents our proposed BiR-SSM.

In this section, we focus on the application of a bidirectional reverse state space model for sequence modeling, as illustrated in Figure 3. While Transformer-based methods offer substantial benefits for sequence modeling, their quadratic computational complexity often limits their performance [27, 51]. To address this, we conduct sequence modeling after the Residual Down Sampling Block, which enables efficient processing while preserving essential feature information. Notably, previous bidirectional state space models typically relied on two separate SSM layers [78], as depicted in Figure 3. We believe this design can be optimized to improve parameter efficiency without sacrificing model performance.

After the Residual Down Sampling Block, we flatten the 2D data into a 1D sequence and then employ the Bidirectional Reverse State Space Model to process the downsampled data, represented as $X_{\text{down}}$. The sequence is then processed through a linear layer, a convolutional layer, and a State Space Model (SSM) layer. To retain original information, the output from the SSM undergoes a residual connection with the original sequence. The downsampled data $X_{\text{down}}$ is represented as a set of features $\{p_1, p_2, \dots, p_n\}$, where each $p_i$ is an element of $X_{\text{down}}$, as shown in the following formula:

$$X_{\mathit{Conv}} = \mathit{Conv}(\mathit{Linear}(X_{\text{down}})), \tag{2}$$

where $X_{\mathit{Conv}}$ is obtained after passing through a linear layer and a convolutional mapping.

Next, we apply the forward $\mathit{SSM}^{+}$ to $X_{\mathit{Conv}}$ and pass its output through a SiLU function, as shown in the following formula:

$$X_{\mathit{SSM}^{+}} = \mathit{SiLU}(\mathit{SSM}^{+}(X_{\mathit{Conv}})), \tag{3}$$

where the forward $\mathit{SSM}^{+}$ modeling is applied to the features to obtain $X_{\mathit{SSM}^{+}}$.

The result of $\mathit{SSM}^{+}$ is then reversed, represented as a set of features $\{p_n, \dots, p_2, p_1\}$, where each $p_i$ is an element from $X_{\mathit{SSM}^{-}}$, to facilitate the subsequent $\mathit{SSM}^{-}$ processing. Finally, the sequence passes through $\mathit{SSM}^{-}$, as shown in the following formula:

$$X_{\mathit{SSM}^{-}} = \mathit{SiLU}(\mathit{SSM}^{-}(X_{\mathit{SSM}^{+}})), \tag{4}$$

where $\mathit{SSM}^{-}$ modeling is applied to the features.

We apply a residual connection between $X_{\mathit{SSM}^{-}}$ and $X_{\text{down}}$, producing $X$. Then, we reverse the resulting sequence to restore the original structural arrangement, $\{p_1, p_2, \dots, p_n\}$, where each $p_i$ is an element of $X_{\mathit{SSM}}$, the reversed form of $X$.

$$X = \mathit{Linear}(X_{\mathit{SSM}^{-}} + \mathit{SiLU}(\mathit{Linear}(X_{\text{down}}))), \tag{5}$$

where a residual connection is used to combine $X_{\mathit{SSM}^{-}}$ and $X_{\text{down}}$ to mitigate feature loss.
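To make the data flow of Eqs. (2) to (5) concrete, the following toy sketch runs a forward scan, reverses the sequence, runs a second scan with the same parameters (the shared-layer idea), and adds a gated residual. It is only an illustration under strong assumptions: the SSM is replaced by a scalar linear recurrence, the Linear and Conv layers are treated as identities, and the backward output is flipped back before the residual so that positions line up with $X_{\text{down}}$. That last choice is our reading of the text, not a confirmed detail of the original implementation, and all names are hypothetical.

```python
import math


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def ssm_scan(seq, a, b, c):
    """Toy scalar SSM (assumption): h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h, out = 0.0, []
    for x in seq:
        h = a * h + b * x
        out.append(c * h)
    return out


def bir_ssm(x_down, params=(0.5, 1.0, 1.0)):
    """Bidirectional reverse scan with ONE shared set of SSM parameters."""
    a, b, c = params
    x_conv = list(x_down)                                 # Eq. (2), Linear/Conv = identity here
    x_fwd = [silu(v) for v in ssm_scan(x_conv, a, b, c)]  # Eq. (3), forward scan
    x_rev = x_fwd[::-1]                                   # reverse to {p_n, ..., p_1}
    x_bwd = [silu(v) for v in ssm_scan(x_rev, a, b, c)]   # Eq. (4), same parameters reused
    restored = x_bwd[::-1]                                # back to {p_1, ..., p_n}
    gate = [silu(v) for v in x_down]                      # SiLU(Linear(X_down)), Linear = identity
    return [r + g for r, g in zip(restored, gate)]        # Eq. (5), outer Linear = identity


# The first output position reacts to an input at the last position,
# which a single forward scan alone could not achieve.
y = bir_ssm([0.0, 0.0, 1.0])
```

In this toy setting the shared design stores one parameter triple `(a, b, c)` instead of two, which mirrors the parameter saving that motivates BiR-SSM over a Bi-SSM with separate forward and backward layers.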

3.5 Reconstruction and Classification

Feature Extraction. We use ResNet [23] as the pre-trained RGB encoder for feature extraction. Unlike traditional ResNet applications, we utilize the adaptive output from USKT as the input for the encoder. After feature extraction through the encoder, we obtain the feature matrix $\mathbf{X}_{\text{res}} \in \mathbb{R}^{D \times 7 \times 7}$. Subsequently, as shown in Figure 2, a decoder maps the features back to the original space, ultimately resulting in an output $\mathbf{X}_{\text{rec}} \in \mathbb{R}^{3 \times 224 \times 224}$. Our decoder employs a deconvolution approach.

Loss Function. In our proposed method, we primarily use two types of loss functions. For classification, we apply the Focal Loss to the classification results from the linear layer, which is defined as:

$$\mathcal{L}_{\text{cls}} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t), \tag{6}$$

where $p_t$ is the predicted probability for the correct class $t$, $\alpha_t$ balances positive and negative samples, and $\gamma$ focuses on hard-to-classify samples.
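A small numeric sketch of Eq. (6) shows how the $(1 - p_t)^{\gamma}$ factor down-weights easy samples. The values $\alpha_t = 0.25$ and $\gamma = 2$ are commonly used defaults and are an assumption here, since the text does not state the settings used in the experiments.

```python
import math


def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Focal loss of Eq. (6) for one sample; p_t is the probability of the true class.

    alpha_t and gamma defaults are assumptions, not the paper's settings.
    """
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9)  # confidently correct prediction
hard = focal_loss(0.1)  # badly misclassified sample
cross_entropy = focal_loss(0.5, alpha_t=1.0, gamma=0.0)  # reduces to -log(p_t)
```

With $\gamma = 0$ and $\alpha_t = 1$ the expression reduces to the standard cross-entropy term, so $\gamma$ can be read as a knob that moves the objective away from cross-entropy and toward hard samples.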

For the reconstruction part, we use the Mean Squared Error (MSE) loss function to compare the reconstructed features $\mathbf{X}_{\text{rec}} \in \mathbb{R}^{3 \times 224 \times 224}$ and $\mathbf{X}_{\text{USKT}} \in \mathbb{R}^{3 \times 224 \times 224}$, which is defined as:

$$\mathcal{L}_{\text{rec}} = \frac{1}{N} \sum_{i=1}^{N} \left( \mathbf{X}_{\text{rec}}^{(i)} - \mathbf{X}_{\text{USKT}}^{(i)} \right)^{2}, \tag{7}$$

where $N$ is the total number of elements, and $i$ indexes the elements.

Finally, we combine the two losses with the weights $\lambda_1$ and $\lambda_2$ to compute our total loss $\mathcal{L}$, defined as:

$$\mathcal{L} = \lambda_1 \cdot \mathcal{L}_{\text{cls}} + \lambda_2 \cdot \mathcal{L}_{\text{rec}}, \tag{8}$$

where $\lambda_1$ and $\lambda_2$ are the weights for the classification loss and reconstruction loss, respectively.
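Eqs. (6) to (8) can be put together as follows. This is a sketch on flat lists rather than image tensors; the weights $\lambda_1 = \lambda_2 = 1$, the focal-loss settings, and all function names are illustrative assumptions, as the text does not report the values used.

```python
import math


def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Eq. (6); alpha_t and gamma defaults are assumptions."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def mse_loss(x_rec, x_uskt):
    """Eq. (7): mean squared difference over all N elements."""
    n = len(x_rec)
    return sum((r - u) ** 2 for r, u in zip(x_rec, x_uskt)) / n


def hybrid_loss(p_t, x_rec, x_uskt, lambda_cls=1.0, lambda_rec=1.0):
    """Eq. (8): weighted sum of classification and reconstruction losses."""
    return lambda_cls * focal_loss(p_t) + lambda_rec * mse_loss(x_rec, x_uskt)


# Toy values: three "pixels" of the reconstruction versus the USKT output.
x_rec = [0.0, 1.0, 2.0]
x_uskt = [0.0, 1.0, 4.0]
rec = mse_loss(x_rec, x_uskt)
total = hybrid_loss(0.9, x_rec, x_uskt)
```

Setting `lambda_rec=0.0` recovers a classification-only objective, which is a convenient way to ablate the reconstruction term.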

4 Experiments

4.1 Experimental Setup

Table 1: Comparison of classification accuracies on the DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS, showing the top-1 accuracy.

Dataset.

We utilize the ImageNet-1K dataset [11] for pre-training our models. In our experiments, we compare SimCLR, MoCo-v2, and MoCo-v3, all pretrained on both ImageNet-1K [11] and N-ImageNet [32]. Furthermore, we extend our knowledge transfer activities to the DVS128 Gesture [1], N-Caltech101 [48], and CIFAR-10-DVS [9] datasets to assess the generalization capabilities of our models across various domains. Additionally, we adapt the input by resizing images to a resolution of 224×224 pixels.

DVS128 Gesture [1] consists of 1,188 event streams from 29 participants, categorized into 11 gesture types, with each event stream featuring a resolution of approximately 128×128 pixels. N-Caltech101 [48] comprises a total of 8,242 images across 101 categories, with each image having a resolution of around 300×200 pixels. CIFAR-10-DVS [9] includes 10 classes, with 1,000 samples per class, totaling 10,000 samples, each at a resolution of 128×128 pixels.

Implementation.

Our model is implemented using PyTorch and trained on NVIDIA RTX 2080Ti GPUs. For all experiments, we employ the AdamW optimizer [41] and utilize a cosine scheduler. The initial learning rate is set to 0.0025, with a reduced rate of 0.000025 for fine-tuning layers.
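For reference, a cosine schedule with the two learning rates above can be sketched as below. The total number of epochs, the minimum learning rate of zero, and the absence of warm-up are assumptions made for illustration; the text specifies only the optimizer, the scheduler type, and the two initial rates.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine annealing from base_lr at step 0 down to min_lr at total_steps."""
    cos_factor = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cos_factor


BASE_LR = 0.0025        # initial learning rate stated in the text
FINETUNE_LR = 0.000025  # reduced rate for fine-tuning layers stated in the text
TOTAL_EPOCHS = 100      # hypothetical value, not given in the text

schedule = [
    (epoch,
     cosine_lr(epoch, TOTAL_EPOCHS, BASE_LR),
     cosine_lr(epoch, TOTAL_EPOCHS, FINETUNE_LR))
    for epoch in (0, 50, 100)
]
```

Both rates follow the same curve, so the fine-tuning layers stay at one hundredth of the main rate throughout training under this sketch.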

4.2 Comparison with Existing Methods

Compared to RGB-based Supervised Methods.

Among non-pretrained models, our method achieved significant improvements on the DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS datasets compared to ViT-S/16 and ResNet50. Specifically, our method outperformed ViT-S/16 by 29.17%, 33.19%, and 24.3%, and surpassed ResNet50 by 16.58%, 26.13%, and 20.1%, respectively. For the pre-trained models, our method outperformed ViT-S/16 by 17.05% on DVS128 Gesture, 3.80% on N-Caltech101, and 0.6% on CIFAR-10-DVS. These results show significant performance improvements under both pretrained and non-pretrained supervised conditions.

Compared to RGB-based Unsupervised Methods.

We use a frozen ResNet50 backbone for comparison with traditional RGB unsupervised methods. On DVS128 Gesture, our frozen model surpasses SimCLR [6] and MoCo-v3 [8] by 0.76% and 2.65%, respectively. On N-Caltech101, it surpasses SimCLR [6] and MoCo-v2 [7] by 2.25% and 4.66%, respectively, and on CIFAR-10-DVS it exceeds the same two methods by 1.6% and 2.1%. With the backbone unfrozen, our model surpasses MoCo-v2 [7] by 2.10% on DVS128 Gesture. Our model therefore compares favorably with traditional RGB unsupervised methods.

Compared to SNN methods.

Because SNNs handle event information effectively in event camera classification tasks, we also compare our method (with a frozen backbone) against SNN-based methods. On DVS128 Gesture, our model outperforms Spikformer [38] by 5.71% and performs comparably to MLF [19]. On N-Caltech101, it surpasses Spikformer [38] by 15.99%. On CIFAR-10-DVS, it outperforms Spikformer [38], MLF [19], and TEBN [16] by 8.2%, 9.68%, and 9.05%, respectively. These results show that our model matches or exceeds advanced SNN-based methods on all three datasets.

Table 2: Comparison of the performance of ResNet18, ResNet34, and ResNet50 with and without the implementation of USKT, illustrating top-1 accuracy on the DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS datasets. The table also delineates the results under both frozen and unfrozen backbone conditions.

Compared to Knowledge Transfer Methods.

We further compare our method with knowledge transfer-based approaches. Among supervised methods, our model with a frozen backbone trains only a small number of parameters, yet surpasses PKOA [25] by 1.17% on N-Caltech101 and by 0.4% on CIFAR-10-DVS. When the backbone is unfrozen, our model exceeds PKOA [25] by 3.7% on N-Caltech101 and by 2.75% on CIFAR-10-DVS, and surpasses CAF [65] by 3.21% on DVS128 Gesture.

Among unsupervised methods, our method with a frozen backbone outperforms TriGAN [52] by 1.47%, 5.87%, and 1.55% on DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS, respectively. With the backbone unfrozen, our model exceeds CTN [74] by 0.92%, 0.25%, and 1.8% on the same three datasets.

4.3 Ablation Studies

In this section, we address three key issues: Firstly, we examine the applicability of our proposed USKT Block to various sizes of ResNet models. Secondly, we assess the effectiveness of the USKT Block in enhancing model performance. Thirdly, we conduct a comparative analysis of the BiR-SSM Block.

Adaptability of USKT.

As shown in Table 2, we evaluated the performance of our proposed USKT across different sizes of ResNet on the DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS datasets to validate the applicability of USKT to various ResNet architectures.

Initially, we conducted experiments with the backbone network frozen (with only the bias parameters of ResNet unfrozen). Using ResNet18 as the backbone, the integration of USKT resulted in performance improvements of 2.94%, 2.12%, and 3.45% on the DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS datasets, respectively. With ResNet34 as the backbone, USKT enhanced the model’s performance by 2.24%, 1.8%, and 2.1% on these respective datasets. When employing ResNet50, the addition of USKT led to gains of 0.95%, 3.57%, and 2.9%.
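The "frozen" setting above, in which only the bias parameters of the backbone remain trainable, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `Param` is a hypothetical stand-in for a framework parameter, the parameter names are invented for the example, and bias terms are identified simply by a name ending in `bias`. With a PyTorch model, the pairs returned by `model.named_parameters()` could be passed to the same function.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Param:
    """Hypothetical stand-in for a framework parameter; only the flag matters here."""
    requires_grad: bool = True


def freeze_all_but_bias(named_params: Iterable[Tuple[str, Param]]) -> List[str]:
    """Freeze every parameter except bias terms; return the names left trainable."""
    trainable = []
    for name, param in named_params:
        # Assumption: bias parameters are recognizable by their name suffix.
        param.requires_grad = name.endswith("bias")
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Invented parameter names for a toy backbone.
backbone = [
    ("conv1.weight", Param()),
    ("bn1.weight", Param()),
    ("bn1.bias", Param()),
    ("fc.weight", Param()),
    ("fc.bias", Param()),
]

print(freeze_all_but_bias(backbone))  # ['bn1.bias', 'fc.bias']
```

Only the parameters left with `requires_grad` set to true would then be handed to the optimizer, which keeps the number of tuned backbone parameters small.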

Further, we evaluated the performance on N-Caltech101 with the backbone network completely unfrozen. On this dataset, adding USKT improved the model’s performance by 1.2% with ResNet18, 0.49% with ResNet34, and 2.58% with ResNet50 as the backbone.

Table 2 shows that, on N-Caltech101, our method achieves its largest improvements with the ResNet50 backbone in both the frozen and unfrozen settings (3.57% and 2.58%, respectively). This is likely attributable to ResNet50’s greater capacity to extract fine-grained information from images compared with ResNet18 and ResNet34. On DVS128 Gesture and CIFAR-10-DVS with a frozen backbone, the largest gains instead occur with ResNet18 (2.94% and 3.45%).

Effectiveness of USKT.

Table 3: Comparison of different domain-adaptive generation methods for classification accuracies, showing top-1 accuracy on the N-Caltech101.

As illustrated in Table 3, we compared USKT against convolution-based and Transformer-based alternatives to validate its effectiveness. We first substituted USKT with convolutional layers to assess the adaptive capabilities of our approach. The experiments were executed with a frozen ResNet50 backbone. Our model demonstrated improvements of 2.6% and 2.94% over the single and double convolution layer setups, respectively.

Furthermore, to rigorously assess the efficacy of our proposed BiR-SSM, we carried out comparative experiments under both frozen and unfrozen conditions of the ResNet50 backbone, where BiR-SSM was replaced with a Transformer module. Under the frozen condition, our method exceeded the performance of the Transformer-based methods by 2.14%, achieving results comparable to those of the unfrozen backbone Transformer. Remarkably, even with fewer parameters in the unfrozen state, our approach not only matched but surpassed the Transformer-based model by 2.15%.

Comparison of BiR-SSM Block.

Table 4: Comparison of different SSM layers for classification accuracies, showing top-1 accuracy on N-Caltech101.

As shown in Table 4, we freeze the ResNet50 backbone and substitute the original SSM with our BiR-SSM in various configurations. The results show that our proposed BiR-SSM outperforms the traditional Bi-SSM. This improvement is likely attributable to the better data consistency achieved through the shared SSM mechanism.
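To make the idea of a shared SSM across scan directions concrete, the toy sketch below runs one scalar linear recurrence over a sequence and over its reversal using the same parameters, then merges the two outputs. It is an illustration under strong assumptions and not the BiR-SSM implementation: the scalar recurrence, the parameter names `a`, `b`, `c`, and the merge by summation are all hypothetical. A conventional bidirectional SSM would instead keep a separate parameter set for each direction.

```python
from typing import List, Sequence


def ssm_scan(x: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Scalar linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = 0.0
    ys = []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


def shared_bidirectional_scan(x: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Scan forward and backward with the SAME (a, b, c), then sum position-wise."""
    forward = ssm_scan(x, a, b, c)
    # Reverse the input, scan, and flip the result back to the original order.
    backward = ssm_scan(list(reversed(x)), a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


# An impulse at the first position: the forward scan carries it to later positions,
# while the backward scan only sees it at the very end of the reversed sequence.
print(shared_bidirectional_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))  # [2.0, 0.5, 0.25]
```

Because both directions reuse one parameter set, the sketch needs half the SSM parameters of a two-branch design, which is the kind of saving a shared-weight strategy aims for.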

4.4 Hyperparameter Studies

This section first discusses the impact of the number of BiR-SSM layers on the model, followed by an analysis of how different values of $\lambda_{2}$ affect model performance.

Figure 4: Left: performance of ResNet50 with different numbers of SSM layers in USKT. Right: performance of ResNet50 with different values of $\lambda_{2}$. Both panels show top-1 accuracy on DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS.

Comparison of Different Number of BiR-SSM Layers.

As demonstrated in Figure 4, employing a single BiR-SSM layer yields the highest accuracy, surpassing the configurations with no BiR-SSM layer or with multiple BiR-SSM layers. In our setup, we used ResNet50 as the backbone with the main network components frozen. With one BiR-SSM layer, our method achieved an accuracy of 88.82% on N-Caltech101, 88.62% on DVS128 Gesture, and 76.75% on CIFAR-10-DVS. We hypothesize that the absence of any BiR-SSM layer causes the adaptive domain to focus predominantly on local information, thereby neglecting global context. Conversely, incorporating more than one BiR-SSM layer can lead to overfitting or an excessive emphasis on the classification task, potentially compromising the model’s performance.

Comparison of Different $\lambda_{2}$ Settings.

As depicted in Figure 4, varying the parameter $\lambda_{2}$ significantly influences the performance of our model. In our evaluations, we employed a frozen ResNet50 architecture as the backbone, and set $\lambda_{1}$ to 1. The results indicate optimal performance when $\lambda_{2}$ is set to 0.05, with the model achieving an accuracy of 88.82% on the N-Caltech101 dataset. Similar efficacy is observed on the DVS128 Gesture with an accuracy of 88.62%, and a noteworthy performance of 76.75% on CIFAR-10-DVS. We hypothesize that a $\lambda_{2}$ value below 0.05 potentially leads the model to prioritize the classification task, possibly at the expense of generalization capabilities. Conversely, a $\lambda_{2}$ value above 0.05 seems to excessively focus the model on the reconstruction task, which detrimentally impacts classification accuracy.
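The trade-off discussed above can be illustrated with a small sketch of a weighted two-term objective. It assumes the total loss is a weighted sum in which the first weight scales the classification loss and the second scales the reconstruction loss, which is how the discussion above reads. The function name `total_loss` and the loss values are hypothetical and chosen only for illustration.

```python
def total_loss(cls_loss: float, rec_loss: float,
               lambda1: float = 1.0, lambda2: float = 0.05) -> float:
    """Weighted sum of a classification loss and a reconstruction loss."""
    return lambda1 * cls_loss + lambda2 * rec_loss


# Hypothetical per-batch loss values, not taken from the paper.
cls_loss, rec_loss = 0.8, 4.0

for lambda2 in (0.01, 0.05, 0.1):
    loss = total_loss(cls_loss, rec_loss, lambda2=lambda2)
    rec_share = lambda2 * rec_loss / loss
    print(f"lambda2={lambda2}: total={loss:.3f}, reconstruction share={rec_share:.1%}")
```

Raising the second weight increases the share of the objective, and hence of the gradient signal, that comes from reconstruction, which matches the reasoning that large values pull the model away from classification.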

5 Conclusion

In this paper, we introduce the USKT framework to tackle the challenge of limited event data in event-based imaging by facilitating effective Event-to-RGB knowledge transfer. The framework allows event data to leverage pre-trained RGB models with minimal tuning, achieving robust performance. Our BiR-SSM component, with its shared weight strategy, further enhances computational efficiency. Experimental results across multiple datasets demonstrate USKT’s adaptability and effectiveness in advancing event-based imaging.

References

Amir et al. [2017] Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7243–7252, 2017.
Bai et al. [2019] Wenjun Bai, Changqin Quan, and Zhi-Wei Luo. Adaptive generative initialization in transfer learning. Computer and Information Science 17, pages 63–74, 2019.
Brebion et al. [2021] Vincent Brebion, Julien Moreau, and Franck Davoine. Real-time optical flow for vehicular perception with low-and high-resolution event cameras. IEEE Transactions on Intelligent Transportation Systems, 23(9):15066–15078, 2021.
Cadena et al. [2021] Pablo Rodrigo Gantier Cadena, Yeqiang Qian, Chunxiang Wang, and Ming Yang. Spade-e2vid: Spatially-adaptive denormalization for event-based video reconstruction. IEEE Transactions on Image Processing, 30:2488–2500, 2021.
Chen et al. [2020a] Chaoqi Chen, Weiping Xie, Yi Wen, Yue Huang, and Xinghao Ding. Multiple-source domain adaptation with generative adversarial nets. Knowledge-Based Systems, 199:105962, 2020a.
Chen et al. [2020b] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607, 2020b.
Chen et al. [2020c] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv Preprint arXiv:2003.04297, 2020c.
Chen et al. [2021] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649, 2021.
Cheng et al. [2020] Wensheng Cheng, Hao Luo, Wen Yang, Lei Yu, and Wei Li. Structure-aware network for lane marker extraction with dynamic vision sensor. arXiv Preprint arXiv:2008.06204, 2020.
Dai et al. [2021] Yutong Dai, Hao Lu, and Chunhua Shen. Learning affinity-aware upsampling for deep image matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6841–6850, 2021.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
Deng et al. [2020] Yongjian Deng, Youfu Li, and Hao Chen. Amae: Adaptive motion-agnostic encoder for event-based object classification. IEEE Robotics and Automation Letters, 5(3):4596–4603, 2020.
Deng et al. [2021] Yongjian Deng, Hao Chen, and Youfu Li. Mvf-net: A multi-view fusion network for event-based object classification. IEEE Transactions on Circuits and Systems for Video Technology, 32(12):8275–8284, 2021.
Deng et al. [2022] Yongjian Deng, Hao Chen, Hai Liu, and Youfu Li. A voxel graph cnn for object classification with event cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1172–1181, 2022.
Dosovitskiy [2020] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv Preprint arXiv:2010.11929, 2020.
Duan et al. [2022] Chaoteng Duan, Jianhao Ding, Shiyan Chen, Zhaofei Yu, and Tiejun Huang. Temporal effective batch normalization in spiking neural networks, 2022.
Esser et al. [2018] Patrick Esser, Ekaterina Sutter, and Björn Ommer. A variational u-net for conditional appearance and shape generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8857–8866, 2018.
Fang et al. [2021] Wei Fang, Zhaofei Yu, Yanqi Chen, Timothee Masquelier, Tiejun Huang, and Yonghong Tian. Incorporating learnable membrane time constant to enhance learning of spiking neural networks, 2021.
Feng et al. [2023] Lang Feng, Qianhui Liu, Huajin Tang, De Ma, and Gang Pan. Multi-level firing with spiking ds-resnet: Enabling better and deeper directly-trained spiking neural networks, 2023.
Gallego et al. [2020] Guillermo Gallego, Tobi Delbrück, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J Davison, Jörg Conradt, Kostas Daniilidis, et al. Event-based vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):154–180, 2020.
Gao et al. [2023] Yue Gao, Jiaxuan Lu, Siqi Li, Nan Ma, Shaoyi Du, Yipeng Li, and Qionghai Dai. Action recognition and benchmark using event cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
Gao et al. [2024] Yue Gao, Jiaxuan Lu, Siqi Li, Yipeng Li, and Shaoyi Du. Hypergraph-based multi-view action recognition using event cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
He and Wang [2023] Lianlian He and Ming Wang. Slicesamp: A promising downsampling alternative for retaining information in a neural network. Applied Sciences, 13(21):11657, 2023.
He et al. [2019] Tong He, Chunhua Shen, Zhi Tian, Dong Gong, Changming Sun, and Youliang Yan. Knowledge adaptation for efficient semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 578–587, 2019.
He et al. [2020] Zhihai He, Bo Yang, Chaoxian Chen, Qilin Mu, and Zesong Li. Clda: An adversarial unsupervised domain adaptation method with classifier-level adaptation. Multimedia Tools and Applications, 79:33973–33991, 2020.
He et al. [2023] Ziwei He, Meng Yang, Minwei Feng, Jingcheng Yin, Xinbing Wang, Jingwen Leng, and Zhouhan Lin. Fourier transformer: Fast long range modeling by removing sequence redundancy with fft operator. arXiv Preprint arXiv:2305.15099, 2023.
Hu et al. [2018a] Guangneng Hu, Yu Zhang, and Qiang Yang. Conet: Collaborative cross networks for cross-domain recommendation. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 667–676, 2018a.
Hu et al. [2018b] Lanqing Hu, Meina Kan, Shiguang Shan, and Xilin Chen. Duplex generative adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1498–1507, 2018b.
Jia et al. [2023] Zexi Jia, Kaichao You, Weihua He, Yang Tian, Yongxiang Feng, Yaoyuan Wang, Xu Jia, Yihang Lou, Jingyi Zhang, Guoqi Li, et al. Event-based semantic segmentation with posterior attention. IEEE Transactions on Image Processing, 32:1829–1842, 2023.
Kang and Kang [2023] Daehyun Kang and Dongwoo Kang. Event camera-based pupil localization: Facilitating training with event-style translation of rgb faces. IEEE Access, 2023.
Kim et al. [2021] Junho Kim, Jaehyeok Bae, Gangin Park, Dongsu Zhang, and Young Min Kim. N-imagenet: Towards robust, fine-grained object recognition with event cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2146–2156, 2021.
Klenk et al. [2024] Simon Klenk, David Bonello, Lukas Koestler, Nikita Araslanov, and Daniel Cremers. Masked event modeling: Self-supervised pretraining for event cameras. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2378–2388, 2024.
Li and Liu [2023] Lin Li and Yang Liu. Multi-dimensional attention spiking transformer for event-based image classification. pages 359–362, 2023.
Li et al. [2024] Lei Li, Alexander Linger, Mario Millhaeusler, Vagia Tsiminaki, Yuanyou Li, and Dengxin Dai. Object-centric cross-modal feature distillation for event-based object detection. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 15440–15447, 2024.
Li et al. [2017] Xiao Li, Min Fang, Ju-Jie Zhang, and Jinqiao Wu. Domain adaptation from rgb-d to rgb images. Signal Processing, 131:27–35, 2017.
Li et al. [2021] Yijin Li, Han Zhou, Bangbang Yang, Ye Zhang, Zhaopeng Cui, Hujun Bao, and Guofeng Zhang. Graph-based asynchronous event processing for rapid object recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 934–943, 2021.
Li et al. [2022] Yudong Li, Yunlin Lei, and Xu Yang. Spikeformer: A novel architecture for training high-performance low-latency spiking neural network. arXiv Preprint arXiv:2211.10686, 2022.
Liang et al. [2022] Jian Liang, Dapeng Hu, Jiashi Feng, and Ran He. Dine: Domain adaptation from single and multiple black-box predictors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8003–8013, 2022.
Liang et al. [2021] Zichen Liang, Guang Chen, Zhijun Li, Peigen Liu, and Alois Knoll. Event-based object detection with lightweight spatial attention mechanism. In 2021 6th IEEE International Conference on Advanced Robotics and Mechatronics (ICARM), pages 498–503, 2021.
Loshchilov [2017] I Loshchilov. Decoupled weight decay regularization. arXiv Preprint arXiv:1711.05101, 2017.
Lu et al. [2024] Jiaxuan Lu, Fang Yan, Xiaofan Zhang, Yue Gao, and Shaoting Zhang. Pathotune: Adapting visual foundation model to pathological specialists. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 395–406, 2024.
Lu et al. [2023a] Wei Lu, Si-Bao Chen, Jin Tang, Chris HQ Ding, and Bin Luo. A robust feature downsampling module for remote-sensing visual tasks. IEEE Transactions on Geoscience and Remote Sensing, 61:1–12, 2023a.
Lu et al. [2023b] Yunfan Lu, Zipeng Wang, Minjie Liu, Hongjian Wang, and Lin Wang. Learning spatial-temporal implicit neural representations for event-guided video super-resolution. pages 1557–1567, 2023b.
Moreno et al. [2012] Orly Moreno, Bracha Shapira, Lior Rokach, and Guy Shani. Talmud: Transfer learning for multiple domains. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 425–434, 2012.
Mostafavi et al. [2021] Mohammad Mostafavi, Yeongwoo Nam, Jonghyun Choi, and Kuk-Jin Yoon. E2sri: Learning to super-resolve intensity images from events. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6890–6909, 2021.
Mueggler et al. [2018] Elias Mueggler, Guillermo Gallego, Henri Rebecq, and Davide Scaramuzza. Continuous-time visual-inertial odometry for event cameras. IEEE Transactions on Robotics, 34(6):1425–1440, 2018.
Orchard et al. [2015] Garrick Orchard, Ajinkya Jayawant, Gregory K Cohen, and Nitish Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in Neuroscience, 9:437, 2015.
Pan et al. [2020] Liyuan Pan, Richard Hartley, Cedric Scheerlinck, Miaomiao Liu, Xin Yu, and Yuchao Dai. High frame rate video reconstruction based on an event camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2519–2533, 2020.
Rebecq et al. [2019] Henri Rebecq, René Ranftl, Vladlen Koltun, and Davide Scaramuzza. High speed and high dynamic range video with an event camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(6):1964–1980, 2019.
Ren et al. [2023] Lei Ren, Haiteng Wang, and Gao Huang. Dlformer: A dynamic length transformer-based network for efficient feature representation in remaining useful life prediction. IEEE Transactions on Neural Networks and Learning Systems, 2023.
Roy et al. [2021] Subhankar Roy, Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe, and Elisa Ricci. Trigan: Image-to-image translation for multi-source domain adaptation. Machine Vision and Applications, 32:1–12, 2021.
Sironi et al. [2018] Amos Sironi, Manuele Brambilla, Nicolas Bourdis, Xavier Lagorce, and Ryad Benosman. Hats: Histograms of averaged time surfaces for robust event-based object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1731–1740, 2018.
Sohn et al. [2023] Kihyuk Sohn, Huiwen Chang, José Lezama, Luisa Polania, Han Zhang, Yuan Hao, Irfan Essa, and Lu Jiang. Visual prompt tuning for generative transfer learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19840–19851, 2023.
Song et al. [2022] Zhaoyang Song, Xiaoqiang Zhao, Yongyong Hui, Hongmei Jiang, et al. Inverted n-type lightweight network based on back projection for image super-resolution reconstruction. 2022.
Tan et al. [2020] Hongchen Tan, Xiuping Liu, Meng Liu, Baocai Yin, and Xin Li. Kt-gan: Knowledge-transfer generative adversarial network for text-to-image synthesis. IEEE Transactions on Image Processing, 30:1275–1290, 2020.
Tian et al. [2021] Kun Tian, Chenghao Zhang, Ying Wang, Shiming Xiang, and Chunhong Pan. Knowledge mining and transferring for domain adaptive object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9133–9142, 2021.
Tomy et al. [2022] Abhishek Tomy, Anshul Paigwar, Khushdeep S Mann, Alessandro Renzaglia, and Christian Laugier. Fusing event-based and rgb camera for robust object detection in adverse conditions. In 2022 International Conference on Robotics and Automation (ICRA), pages 933–939, 2022.
Wang et al. [2023] Ruilin Wang, Li Wang, and Yingbo He. A time-related voxel representation method for event camera. pages 553–557, 2023.
Wang et al. [2024] Xiao Wang, Shiao Wang, Chuanming Tang, Lin Zhu, Bo Jiang, Yonghong Tian, and Jin Tang. Event stream-based visual object tracking: A high-resolution benchmark dataset and a novel baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19248–19257, 2024.
Wang et al. [2020] Yaxing Wang, Abel Gonzalez-Garcia, David Berga, Luis Herranz, Fahad Shahbaz Khan, and Joost van de Weijer. Minegan: Effective knowledge transfer from gans to target domains with few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9332–9341, 2020.
Wang et al. [2022] Zuowen Wang, Yuhuang Hu, and Shih-Chii Liu. Exploiting spatial sparsity for event cameras with visual transformers. In 2022 IEEE International Conference on Image Processing (ICIP), pages 411–415, 2022.
Wu et al. [2020] Jinjian Wu, Chuanwei Ma, Xiaojie Yu, and Guangming Shi. Denoising of event-based sensors with spatial-temporal correlation. pages 4437–4441, 2020.
Xie et al. [2022a] Bochen Xie, Yongjian Deng, Zhanpeng Shao, Hai Liu, and Youfu Li. Vmv-gcn: Volumetric multi-view based graph cnn for event stream classification. IEEE Robotics and Automation Letters, 7(2):1976–1983, 2022a.
Xie et al. [2022b] Binhui Xie, Shuang Li, Fangrui Lv, Chi Harold Liu, Guoren Wang, and Dapeng Wu. A collaborative alignment framework of transferable knowledge extraction for unsupervised domain adaptation. IEEE Transactions on Knowledge and Data Engineering, 35(7):6518–6533, 2022b.
Xie et al. [2023] Zhihui Xie, Min Fu, and Xuefeng Liu. Electrical fittings inspection based on improved unet with generative adversarial network and attention mechanism. In 2023 8th International Conference on Image, Vision and Computing (ICIVC), pages 776–782, 2023.
Yamaguchi et al. [2022] Shin’ya Yamaguchi, Sekitoshi Kanai, Atsutoshi Kumagai, Daiki Chijiwa, and Hisashi Kashima. Transfer learning with pre-trained conditional generative models. arXiv Preprint arXiv:2204.12833, 2022.
Yang et al. [2023] Yan Yang, Liyuan Pan, and Liu Liu. Event camera data pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10699–10709, 2023.
Yao et al. [2021] Man Yao, Huanhuan Gao, Guangshe Zhao, Dingheng Wang, Yihan Lin, Zhaoxu Yang, and Guoqi Li. Temporal-wise attention spiking neural networks for event streams classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10221–10230, 2021.
Yuan et al. [2023] Chengguo Yuan, Yu Jin, Zongzhen Wu, Fanting Wei, Yangzirui Wang, Lan Chen, and Xiao Wang. Learning bottleneck transformer for event image-voxel feature fusion based classification. pages 3–15, 2023.
Zhang et al. [2021a] Hongwei Zhang, Xiangwei Kong, and Yujia Zhang. Selective knowledge transfer for cross-domain collaborative recommendation. IEEE Access, 9:48039–48051, 2021a.
Zhang et al. [2021b] Weichen Zhang, Dong Xu, Jing Zhang, and Wanli Ouyang. Progressive modality cooperation for multi-modality domain adaptation. IEEE Transactions on Image Processing, 30:3293–3306, 2021b.
Zhao et al. [2022a] Dongcheng Zhao, Yi Zeng, and Yang Li. Backeisnn: A deep spiking neural network with adaptive self-feedback and balanced excitatory–inhibitory neurons. Neural Networks, 154:68–77, 2022a.
Zhao et al. [2022b] Junwei Zhao, Shiliang Zhang, and Tiejun Huang. Transformer-based domain adaptation for event data classification. pages 4673–4677, 2022b.
Zheng et al. [2020] Hanle Zheng, Yujie Wu, Lei Deng, Yifan Hu, and Guoqi Li. Going deeper with directly-trained larger spiking neural networks, 2020.
Zhou et al. [2023] Qian Zhou, Peng Zheng, and Xiaohu Li. A bio-inspired hierarchical spiking neural network with biological synaptic plasticity for event camera object recognition. Journal of Biomedical Engineering (Sheng Wu Yi Xue Gong Cheng Xue Za Zhi), 40(4):692–699, 2023.
Zhou et al. [2022] Zhaokun Zhou, Yuesheng Zhu, Chao He, Yaowei Wang, Shuicheng Yan, Yonghong Tian, and Li Yuan. Spikformer: When spiking neural network meets transformer, 2022.
Zhu et al. [2024] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv Preprint arXiv:2401.09417, 2024.
Zuo et al. [2021] Yukun Zuo, Hantao Yao, Liansheng Zhuang, and Changsheng Xu. Margin-based adversarial joint alignment domain adaptation. IEEE Transactions on Circuits and Systems for Video Technology, 32(4):2057–2067, 2021.
