Event USKT: U-State Space Model in Knowledge Transfer for Event Cameras

Yuhui Lin1*, Jiahao Zhang1, Siyuan Li2, Jimin Xiao1,
Ding Xu3, Wenjun Wu4, Jiaxuan Lu5†
1Xi’an Jiaotong-Liverpool University
2Dalian University of Technology
3Alibaba International Digital Business Group
4University of Illinois at Urbana-Champaign
5Shanghai Artificial Intelligence Laboratory
*First Author: Yuhui.Lin21@student.xjtlu.edu.cn
†Corresponding Author: lujiaxuan@pjlab.org.cn

Abstract

Event cameras, as an emerging imaging technology, offer distinct advantages over traditional RGB cameras, including reduced energy consumption and higher frame rates. However, the limited quantity of available event data presents a significant challenge, hindering their broader development. To alleviate this issue, we introduce a tailored U-shaped State Space Model Knowledge Transfer (USKT) framework for Event-to-RGB knowledge transfer. This framework generates inputs compatible with RGB frames, enabling event data to effectively reuse pre-trained RGB models and achieve competitive performance with minimal parameter tuning. Within the USKT architecture, we also propose a bidirectional reverse state space model. Unlike conventional bidirectional scanning mechanisms, the proposed Bidirectional Reverse State Space Model (BiR-SSM) leverages a shared weight strategy, which facilitates efficient modeling while conserving computational resources. In terms of effectiveness, integrating USKT with ResNet50 as the backbone improves model performance by 0.95%, 3.57%, and 2.9% on DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS datasets, respectively, underscoring USKT’s adaptability and effectiveness. The code will be made available upon acceptance.

1 Introduction

Figure 1: The proposed U-shaped State Space Model Knowledge Transfer (USKT) framework with the BiR-SSM module combines reconstruction and classification losses for Event-to-RGB feature adaptation, enabling the reuse of the pre-trained RGB encoder.

Event cameras represent a novel imaging technology that differs fundamentally from traditional frame-based cameras by capturing changes in brightness at the pixel level continuously, rather than capturing entire frames at regular intervals. The unique mechanism provides event cameras with exceptionally high temporal resolution and minimal latency, making them particularly well-suited for capturing fast-moving activities and handling scenes with high dynamic range [50, 20]. Compared to traditional cameras, event cameras excel in environments with significant lighting variations, while also consuming less energy, making them highly promising for applications such as autonomous driving [3], robotic navigation [47], and high-speed motion capture [21]. However, as a relatively new imaging modality, event cameras face significant challenges related to data scarcity [4, 49].

To address the challenge of limited data availability in event-based imaging, exploring knowledge transfer for event data is a promising direction worth investigating. In the broader field of knowledge transfer, methods can generally be categorized into domain-based and generative-based approaches. Domain-based methods aim to improve target domain performance by transferring knowledge from auxiliary domains [71, 45, 28, 42], while generative-based methods focus on generating synthetic data to enhance model performance [57, 61, 56].

In the field of event cameras, data is recorded only during changes in pixel brightness, resulting in event streams that are often sparse in visual content and differ significantly from the feature distributions of RGB images. Consequently, domain-based methods frequently encounter challenges related to domain mismatches, making effective knowledge transfer difficult [31]. On the other hand, generative-based models can simulate sparse event streams to generate additional synthetic RGB data, which can be leveraged to enhance model training [46, 49].

To address the scarcity of event data, we design a generative U-shaped State Space Model Knowledge Transfer (USKT) framework tailored to the characteristics of event data. Previous research has widely recognized U-shaped methods for their excellent reconstruction capabilities [17, 66]. Building upon these capabilities, we propose a generative knowledge transfer approach specifically for adapting event data to RGB features. As shown in Figure 1, our proposed method includes a Residual Down Sampling Block, a Residual Up Sampling Block, and a Bidirectional Reverse State Space Model. Specifically, the first Residual Down Sampling Block increases feature dimensionality while reducing spatial resolution, whereas the Residual Up Sampling Block enhances image restoration and preserves critical feature information, aligning the feature distribution more closely with that of RGB features.

Furthermore, since the convolutional layers in our model predominantly focus on local features during the downsampling process, we incorporate sequence modeling that captures global feature dependencies. As past Transformer-based approaches often faced significant computational resource demands [27, 51], we introduce the Bidirectional Reverse State Space Model (BiR-SSM), which performs feature propagation through bidirectional scanning. Compared to the traditional Bidirectional State Space Model (Bi-SSM) [78], our BiR-SSM employs a shared SSM layer strategy aimed at ensuring feature consistency and reducing computational overhead. Additionally, our approach simultaneously performs reconstruction and classification to improve the model’s performance in classifying event images. In summary, our research makes the following three key contributions:

2.1 Event-based Image Recognition

Event image recognition methods predominantly include graph-based models, Spiking Neural Networks (SNNs), and attention mechanisms. Graph-based models, using vertex and edge structures along with heterogeneous graph models and voxel grids, emulate spatial and temporal relationships among events and analyze complex data patterns, as demonstrated in various studies [37, 14, 59, 64, 63, 70, 44]. SNNs excel in processing time-step sequences for event image classification and, when integrated with attention mechanisms, significantly improve object recognition in dynamic environments by managing asynchronous data and focusing on critical features [75, 18, 73, 19, 16, 69, 76, 53, 77]. Additionally, some attention-based methods have also been widely used [40, 13, 34, 30, 22].

These methods, while tailored for event cameras, fail to address data scarcity. As a result, some studies have shifted to training with RGB-based models to mitigate this issue. For instance, several approaches based on ResNet have utilized RGB information to enhance the representational capability of event data [33, 12], while other methods have employed pre-trained ViT models based on RGB to improve the handling of sparse event streams [62]. Additionally, methods that integrate RGB and event camera data have proven to noticeably enhance the performance of downstream tasks [68].

2.2 Knowledge Transfer

In knowledge transfer, most approaches focus on RGB-to-RGB transfer. Domain adaptation methods align feature distributions between source and target domains. The PMC method enhances cross-modal recognition by generating missing target domain modalities through multimodal collaboration [72]. CLDA and MAJA mitigate domain shift via adversarial learning, boosting classification accuracy, especially in unsupervised scenarios [26, 79]. The DARDR method enhances cross-domain recognition by applying cross-modal constraints to transfer RGB-D data to the RGB target domain [36].

Generative-based methods use Generative Adversarial Networks (GANs) to generate target domain samples, reducing inter-domain differences. TriGAN and MSAN generate target samples from multiple source domains, significantly enhancing classification accuracy in unlabeled target domain tasks [52, 5]. DINE achieves privacy-preserving knowledge transfer with only a black-box source model [39], while DupGAN employs a dual-GAN structure to effectively ensure feature consistency across domains [29]. Meanwhile, U-shaped methods have shown promising results in generative tasks [17, 66].

Figure 2: Overview of the USKT framework. The proposed method is based on a U-shaped network, starting by mapping event data into suitable channels for USKT input through time accumulation. Subsequently, the data dimension is increased and the size is reduced via a downsampling process. Furthermore, we design a Bidirectional Reverse State Space Model (BiR-SSM) for sequence modeling. Following this, data is restored to its original resolution through an upsampling process. Finally, a reconstruction loss is introduced to enhance classification accuracy.

In the research on knowledge transfer between Event and RGB modalities, a few studies adopt co-training approaches, where models integrate features from both modalities to enhance robustness and accuracy across various environments [60, 58, 35]. In addition, CTN [74] is a Transformer-based cross-domain adaptation method that enhances the classification performance of event data by transferring features from RGB data.

3 Method

3.1 Overview

We propose a U-shaped State Space Model Knowledge Transfer (USKT) framework that efficiently converts event data into RGB features. As shown in Fig. 2, the model consists of three key components: event data processing that transforms multiple time steps into voxel information to capture dynamic data changes; a Residual Down Sampling Block that reduces sequence length for efficient feature extraction; and a Residual Up Sampling Block that reconstructs the features into the RGB domain, suitable as encoder input. Additionally, we introduce a Bidirectional Reverse State Space Model (BiR-SSM) to fully capture the sequential dependencies between features. Finally, we examine how the Residual Up Sampling Block’s output $X_{\text{USKT}}$ performs after it passes through the feature extractor and present the design of the hybrid loss function.

3.2 Event Data Processing

An event stream can be visualized as consisting of multiple events, each characterized by $(x, y, t, p)$, where $(x, y)$ represent spatial coordinates, $t$ denotes the timestamp, and $p$ indicates the polarity ($+1$ or $-1$, signifying an increase or decrease in brightness). Consequently, event data is mapped into a three-dimensional grid where $(x, y)$ serve as the spatial dimensions and the time dimension is segmented into discrete bins, effectively organizing the event data temporally. Furthermore, based on the $(x, y)$ coordinates and the calculated time bin $k$, the event polarity $p$ is accumulated in the respective voxel within the grid. Each voxel $(x, y, k)$ then holds the aggregated polarity of events within the corresponding time bin, where the aggregation method, whether summing or averaging, depends on specific use cases. Ultimately, the final result is a three-dimensional tensor that retains both spatial and temporal information from the event stream.
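The accumulation described above can be illustrated with a minimal sketch. It assumes uniform time bins between the earliest and latest timestamp and aggregation by summing; the function name `events_to_voxel_grid` and its parameters are hypothetical and not taken from the authors' code.

```python
def events_to_voxel_grid(events, num_bins, height, width):
    """Sum event polarities into a grid indexed as grid[k][y][x].

    `events` is a list of (x, y, t, p) tuples with p in {+1, -1}.
    Uniform time binning is an assumption made for this sketch.
    """
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return grid
    t_min = min(e[2] for e in events)
    t_max = max(e[2] for e in events)
    span = (t_max - t_min) or 1.0  # guard against identical timestamps
    for x, y, t, p in events:
        # Map the timestamp to a bin index k in [0, num_bins - 1].
        k = min(int((t - t_min) / span * num_bins), num_bins - 1)
        grid[k][y][x] += p
    return grid


# Three events on a 2x2 sensor, split into two time bins.
example_events = [(0, 0, 0.0, 1), (0, 0, 0.1, 1), (1, 1, 1.0, -1)]
example_grid = events_to_voxel_grid(example_events, num_bins=2, height=2, width=2)
```

Averaging instead of summing would only require dividing each voxel by the number of events it received.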

3.3 Generative U-SSM Knowledge Transfer

Generative-based methods have been widely applied in the field of knowledge transfer across various tasks [54, 2, 67], and U-Net-based architectures have shown promising results in generative tasks [17, 66]. Building on these advances, we propose the U-SSM Knowledge Transfer (USKT) block for event-to-RGB knowledge transfer. Specifically, we input the event data $\mathbf{X}_{\text{input}} \in \mathbb{R}^{T \times 224 \times 224}$, where $T$ represents the time steps. Using a convolutional layer, we map the input data to 12 dimensions, standardizing the time steps of the event camera. The convolution operation can be expressed as:

$$\mathbf{X}_{\text{proj}} = \mathbf{Conv}(\mathbf{X}_{\text{input}}), \tag{1}$$

U-shaped models are highly effective for knowledge transfer, primarily due to the essential roles of their downsampling and upsampling modules. Downsampling modules compress data by reducing feature sizes and increasing dimensionality [43, 24], whereas upsampling modules expand features and retain detailed information necessary for reconstruction [10, 55]. However, traditional U-shaped approaches, typically designed for RGB data, may not directly translate to event data, which primarily captures changes in brightness. The mismatch can lead to overfitting. Moreover, the inherent sparsity of event data necessitates a departure from conventional downsampling techniques; therefore, we incorporate residual connections to maintain the integrity of the original features. To address these challenges, we introduce the Residual Down Sampling Block and Residual Up Sampling Block for effective downsampling and upsampling, respectively. As illustrated in Figure 2, the proposed framework employs 4 Residual Down Sampling Blocks and 5 Residual Up Sampling Blocks.

Residual Down Sampling Block. For this block, the input feature $\mathbf{X}_{\text{proj}} \in \mathbb{R}^{D \times N \times N}$ undergoes a series of operations. First, a convolution operation is applied to extract global features, resulting in $\mathbf{X}_{\text{conv1}} \in \mathbb{R}^{F \times N \times N}$. Next, another convolution focuses on feature downsampling, producing $\mathbf{X}_{\text{conv2}} \in \mathbb{R}^{F \times N/2 \times N/2}$. Simultaneously, the original input feature is downsampled directly through a convolution layer, yielding $\mathbf{X}_{\text{res}} \in \mathbb{R}^{F \times N/2 \times N/2}$. A residual connection is then applied, resulting in $\mathbf{X}_{\text{down}} \in \mathbb{R}^{F \times N/2 \times N/2}$. The Residual Down Sampling Block preserves essential features while reducing the spatial dimensions of the data.

In our method, the input is $\mathbf{X}_{\text{proj}} \in \mathbb{R}^{12 \times 224 \times 224}$ and, after the blocks are applied sequentially, the output is $\mathbf{X}_{\text{down}} \in \mathbb{R}^{128 \times 14 \times 14}$. For the output of the final Residual Down Sampling Block, we apply an average pooling strategy to further reduce the spatial dimensions and computational complexity while preserving global information.

After the downsampling process, our model employs BiR-SSM for feature modeling, which achieves effective feature propagation with relatively low computational resources. It is detailed further in Section 3.4.

Residual Up Sampling Block. For this block, the input is $\mathbf{X}_{\text{input}} \in \mathbb{R}^{D \times N \times N}$. Initially, we employ bilinear interpolation to enlarge the input dimensions. Subsequent feature extraction is performed using a $3 \times 3$ convolutional kernel followed by a $1 \times 1$ pointwise convolution kernel. Afterwards, $\mathbf{X}_{\text{up}}$ is concatenated with the corresponding scale feature $\mathbf{X}_{\text{down}}$. To finalize the process, a convolutional fusion technique is applied to reduce the dimensionality.

In our method, the input to the first block is the modeling result from BiR-SSM, $\mathbf{X}_{\text{ssm}} \in \mathbb{R}^{128 \times 7 \times 7}$. The blocks are applied step by step, and the result is $\mathbf{X}_{\text{up}} \in \mathbb{R}^{D \times 224 \times 224}$, where $D$ represents the dimension of the input provided to the USKT. Additionally, if the final $\mathbf{X}_{\text{up}}$ does not have a dimension of 3, we apply a convolution to the final $\mathbf{X}_{\text{up}}$ to produce an output with the desired dimensions, $\mathbf{X}_{\text{USKT}} \in \mathbb{R}^{3 \times 224 \times 224}$.
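The spatial sizes quoted in this section can be checked with a short trace. This is only a sketch: it assumes every Residual Down Sampling Block halves the resolution, the average pooling halves it once more (inferred from the stated 14 to 7 reduction), and every Residual Up Sampling Block doubles it. The function name `trace_spatial_sizes` is hypothetical, and channel widths are left out because the text only gives the first (12) and last (128) values.

```python
def trace_spatial_sizes(size=224, num_down=4, num_up=5):
    """Return the side length of the feature map after each stage."""
    sizes = [size]
    for _ in range(num_down):
        size //= 2  # one Residual Down Sampling Block (assumed stride 2)
        sizes.append(size)
    size //= 2  # average pooling before BiR-SSM (assumed factor 2)
    sizes.append(size)
    for _ in range(num_up):
        size *= 2  # one Residual Up Sampling Block (assumed 2x bilinear upsampling)
        sizes.append(size)
    return sizes


stage_sizes = trace_spatial_sizes()
```

Under these assumptions the trace passes through 14 after the fourth down block and 7 after pooling, and five up blocks return it to 224, which matches the block counts given for Figure 2.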

3.4 Bidirectional Reverse State Space Model

Figure 3: The figure on the left shows the traditional Bi-SSM, while the figure on the right represents our proposed BiR-SSM.

In this section, we focus on the application of a bidirectional reverse state space model for sequence modeling, as illustrated in Figure 3. While Transformer-based methods offer substantial benefits for sequence modeling, their quadratic computational complexity often limits their performance [27, 51]. To address this, we conduct sequence modeling after the Residual Down Sampling Block, which enables efficient processing while preserving essential feature information. Notably, previous bidirectional state space models typically relied on two separate SSM layers [78], as depicted in Figure 3. We believe this design can be optimized to improve parameter efficiency without sacrificing model performance.

After the Residual Down Sampling Block, we flatten the 2D data into a 1D sequence and then employ the Bidirectional Reverse State Space Model to process the downsampled data, represented as $X_{\text{down}}$. The sequence is then processed through a linear layer, a convolutional layer, and a State Space Model (SSM) layer. To retain original information, the output from the SSM undergoes a residual connection with the original sequence. The downsampled data $X_{\text{down}}$ is represented as a set of features $\{p_1, p_2, \dots, p_n\}$, where each $p_i$ is an element of $X_{\text{down}}$, as shown in the following formula:

$$X_{Conv} = \mathit{Conv}(\mathit{Linear}(X_{\text{down}})), \tag{2}$$

where $X_{Conv}$ is obtained after passing through a linear layer and a convolutional mapping.

Next, $X_{Conv}$ is passed through the forward scan $\mathit{SSM}^{+}$, and a SiLU function is applied to its output, as shown in the following formula:

$$X_{SSM^{+}} = \mathit{SiLU}(\mathit{SSM}^{+}(X_{Conv})), \tag{3}$$

where the forward $\mathit{SSM}^{+}$ modeling is applied to the features to obtain $X_{SSM^{+}}$.

The result of $\mathit{SSM}^{+}$ is then reversed into the order $\{p_n, \dots, p_2, p_1\}$, where each $p_i$ is an element of $X_{SSM^{+}}$, to facilitate the subsequent $\mathit{SSM}^{-}$ processing. Finally, the reversed sequence passes through $\mathit{SSM}^{-}$, as shown in the following formula:

$$X_{SSM^{-}} = \mathit{SiLU}(\mathit{SSM}^{-}(X_{SSM^{+}})), \tag{4}$$

where $\mathit{SSM}^{-}$ modeling is applied to the reversed features.

We apply a residual connection between $X_{SSM^{-}}$ and $X_{\text{down}}$, producing $X$. Then, we reverse the resulting sequence to restore the original structural arrangement, $\{p_1, p_2, \dots, p_n\}$, where each $p_i$ is an element of $X_{SSM}$, the reversed form of $X$.

$$X = \mathit{Linear}(X_{SSM^{-}} + \mathit{SiLU}(\mathit{Linear}(X_{\text{down}}))), \tag{5}$$

where a residual connection is used to combine $X_{SSM^{-}}$ and $X_{\text{down}}$ to mitigate feature loss.
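To make the scan, reverse, scan order concrete, the following toy sketch replaces the SSM layer with a scalar linear recurrence $h_t = a\,h_{t-1} + b\,x_t$ and reuses the same parameters $(a, b)$ for both directions, in the spirit of the shared-layer strategy. It covers only the steps of Eqs. (3) and (4) plus the final reversal; the linear and convolutional layers of Eq. (2) and the residual of Eq. (5) are omitted. The recurrence, the parameter values, and the names `scan` and `bir_scan` are illustrative assumptions rather than the authors' implementation.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def scan(seq, a=0.5, b=1.0):
    """Toy causal recurrence h_t = a * h_{t-1} + b * x_t, standing in for an SSM layer."""
    h, out = 0.0, []
    for x in seq:
        h = a * h + b * x
        out.append(h)
    return out


def bir_scan(seq, a=0.5, b=1.0):
    """Forward scan, reverse, scan again with the SAME parameters, reverse back."""
    forward = [silu(v) for v in scan(seq, a, b)]          # forward pass, cf. Eq. (3)
    reversed_seq = forward[::-1]                          # {p_n, ..., p_1}
    backward = [silu(v) for v in scan(reversed_seq, a, b)]  # reverse pass, cf. Eq. (4)
    return backward[::-1]                                 # restore {p_1, ..., p_n}


impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
one_way = scan(impulse)
two_way = bir_scan(impulse)
```

With an impulse in the middle of the sequence, the single forward scan leaves the earlier positions at zero, whereas the two-pass version lets the impulse influence every position while storing only one set of recurrence parameters.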

3.5 Reconstruction and Classification

Feature Extraction. We use ResNet [23] as the pre-trained RGB encoder for feature extraction. Unlike traditional ResNet applications, we utilize the adaptive output from USKT as the input to the encoder. Feature extraction through the encoder yields the feature matrix $\mathbf{X}_{\text{res}} \in \mathbb{R}^{D \times 7 \times 7}$. Subsequently, as shown in Fig. 2, a decoder maps the features back to the original space, ultimately resulting in an output $\mathbf{X}_{\text{rec}} \in \mathbb{R}^{3 \times 224 \times 224}$. Our decoder employs a deconvolution approach.

Loss Function. Our method primarily uses two types of loss functions. For classification, we apply the Focal Loss to the classification results from the linear layer, defined as:

$$\mathcal{L}_{\text{cls}} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t), \tag{6}$$

where $p_t$ is the predicted probability for the correct class $t$, $\alpha_t$ balances positive and negative samples, and $\gamma$ focuses the loss on hard-to-classify samples.
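Equation (6) can be evaluated for a single sample as in the sketch below. The default values $\alpha_t = 0.25$ and $\gamma = 2$ are common choices from the focal loss literature and are assumptions here, since this section does not state the values used; the clamp `eps` is likewise only a numerical safeguard added for the example.

```python
import math


def focal_loss(p_t, alpha_t=0.25, gamma=2.0, eps=1e-12):
    """Focal loss of Eq. (6) for one sample with true-class probability p_t."""
    p_t = min(max(p_t, eps), 1.0)  # keep log() finite
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9)   # confident, correct prediction
hard = focal_loss(0.1)   # poorly classified sample
plain_ce = focal_loss(0.5, alpha_t=1.0, gamma=0.0)  # reduces to -log(p_t)
```

Setting $\gamma = 0$ and $\alpha_t = 1$ recovers the ordinary cross-entropy term, while larger $\gamma$ shrinks the contribution of samples that are already classified well.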

For the reconstruction part, we use the Mean Squared Error (MSE) loss function to compare the reconstructed features $\mathbf{X}_{\text{rec}} \in \mathbb{R}^{3 \times 224 \times 224}$ and $\mathbf{X}_{\text{USKT}} \in \mathbb{R}^{3 \times 224 \times 224}$, defined as:

$$\mathcal{L}_{\text{rec}} = \frac{1}{N} \sum_{i=1}^{N} \left( \mathbf{X}_{\text{rec}}^{(i)} - \mathbf{X}_{\text{USKT}}^{(i)} \right)^{2}, \tag{7}$$

where $N$ is the total number of elements, and $i$ indexes the elements.

Finally, we weight the two losses by $\lambda_1$ and $\lambda_2$ to compute our total loss $\mathcal{L}$, defined as:

$$\mathcal{L} = \lambda_1 \cdot \mathcal{L}_{\text{cls}} + \lambda_2 \cdot \mathcal{L}_{\text{rec}}, \tag{8}$$

where $\lambda_1$ and $\lambda_2$ are the weights for the classification loss and reconstruction loss, respectively.
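The reconstruction loss of Eq. (7) and the weighted sum of Eq. (8) amount to the following arithmetic, shown here on flattened lists of values. The helper names and the example weights are hypothetical; the values of $\lambda_1$ and $\lambda_2$ are not given in this section.

```python
def mse_loss(x_rec, x_uskt):
    """Mean squared error of Eq. (7) over two equally sized flat lists."""
    if len(x_rec) != len(x_uskt) or not x_rec:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(x_rec, x_uskt)) / len(x_rec)


def total_loss(l_cls, l_rec, lambda1=1.0, lambda2=1.0):
    """Weighted sum of Eq. (8); the default weights are placeholders."""
    return lambda1 * l_cls + lambda2 * l_rec


l_rec = mse_loss([0.0, 2.0], [0.0, 0.0])                 # (0 + 4) / 2
l_total = total_loss(1.0, l_rec, lambda1=1.0, lambda2=0.5)
```

In this toy case the reconstruction term is 2.0 and, with the placeholder weights, contributes half of the total.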

4 Experiments

4.1 Experimental Setup

Table 1: Comparison of classification accuracies on the DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS, showing the top-1 accuracy.

Dataset.

We utilize the ImageNet-1K dataset [11] for pre-training our models. In our experiments, we compare SimCLR, MoCo-v2, and MoCo-v3, all pretrained on both ImageNet-1K [11] and N-ImageNet [32]. Furthermore, we extend our knowledge transfer activities to the DVS128 Gesture [1], N-Caltech101 [48], and CIFAR-10-DVS [9] datasets to assess the generalization capabilities of our models across various domains. Additionally, we adapt the input by resizing images to a resolution of 224×224 pixels.

DVS128 Gesture [1] consists of 1,188 event streams from 29 participants, categorized into 11 gesture types, with each event stream featuring a resolution of approximately 128×128 pixels. N-Caltech101 [48] comprises a total of 8,242 images across 101 categories, with each image having a resolution of around 300×200 pixels. CIFAR-10-DVS [9] includes 10 classes, with 1,000 samples per class, totaling 10,000 samples, each at a resolution of 128×128 pixels.

Implementation.

Our model is implemented using PyTorch and trained on NVIDIA RTX 2080Ti GPUs. For all experiments, we employ the AdamW optimizer [41] and utilize a cosine scheduler. The initial learning rate is set to 0.0025, with a reduced rate of 0.000025 for fine-tuning layers.
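For reference, a cosine schedule starting from the stated initial learning rate can be written as below. This is a sketch of the standard cosine annealing rule only; the total number of steps, the minimum learning rate of zero, and the absence of warm-up are assumptions, as the text does not specify them.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.0025, min_lr=0.0):
    """Cosine annealing from base_lr at step 0 down to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


lr_start = cosine_lr(0, 100)
lr_mid = cosine_lr(50, 100)
lr_end = cosine_lr(100, 100)
```

The same function applies to the fine-tuning layers by passing `base_lr=0.000025`.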

4.2 Comparison with Existing Methods

Compared to RGB-based Supervised Methods.

Among non-pretrained models, our method achieved significant improvements on the DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS datasets compared to ViT-S/16 and ResNet50. Specifically, our method outperformed ViT-S/16 by 29.17%, 33.19%, and 24.3%, and surpassed ResNet50 by 16.58%, 26.13%, and 20.1%, respectively. Among pre-trained models, our method outperformed ViT-S/16 by 17.05% on DVS128 Gesture, 3.80% on N-Caltech101, and 0.6% on CIFAR-10-DVS. These results show significant performance improvements under both pretrained and non-pretrained supervised conditions.

Compared to RGB-based Unsupervised Methods.

We used a frozen ResNet50 backbone to compare with traditional RGB unsupervised methods. On the DVS128 Gesture, our frozen model can outperform many unsupervised models, surpassing SimCLR [6] and MoCo-v3 [8] by 0.76% and 2.65%, respectively. On the N-Caltech101, our model also outperforms many unsupervised models, surpassing SimCLR [6] and MoCo-v2 [7] by 2.25% and 4.66%, respectively. On the CIFAR-10-DVS, our model also surpasses many unsupervised models, exceeding SimCLR [6] and MoCo-v2 [7] by 1.6% and 2.1%, respectively. In the unfrozen condition, on the DVS128 Gesture, our model can surpass MoCo-v2 [7] by 2.10%. Therefore, compared to traditional RGB unsupervised methods, our model demonstrates significant advantages.

Compared to SNN methods.

Because SNNs handle event information effectively in event camera classification tasks, we also compared our method (with a frozen backbone) against SNN-based methods. On DVS128 Gesture, our model outperformed Spikformer [38] by 5.71% and reached a level comparable to MLF [19]. On N-Caltech101, our model surpassed Spikformer [38] by 15.99%. On CIFAR-10-DVS, it outperformed Spikformer [38], MLF [19], and TEBN [16] by 8.2%, 9.68%, and 9.05%, respectively. These results demonstrate that our model performs strongly relative to advanced SNN-based methods on event classification tasks.

Table 2: Comparison of the performance of ResNet18, ResNet34, and ResNet50 with and without the implementation of USKT, illustrating top-1 accuracy on the DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS datasets. The table also delineates the results under both frozen and unfrozen backbone conditions.

Compared to Knowledge Transfer Methods.

We primarily demonstrate the superiority of our method by comparing it with knowledge transfer-based approaches. Among supervised methods, our model with a frozen backbone trains only a small number of parameters, yet surpasses PKOA [25] by 1.17% on N-Caltech101 and by 0.4% on CIFAR-10-DVS. When the backbone is unfrozen, our model further exceeds PKOA [25] by 3.7% on N-Caltech101 and by 2.75% on CIFAR-10-DVS, and surpasses CAF [65] by 3.21% on DVS128 Gesture.

Among unsupervised methods, our method with a frozen backbone outperforms TriGAN [52] by 1.47%, 5.87%, and 1.55% on DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS, respectively. Unfreezing the backbone allows our model to further exceed CTN [74] by 0.92%, 0.25%, and 1.8% on DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS, respectively.

4.3 Ablation Studies

In this section, we address three key issues: Firstly, we examine the applicability of our proposed USKT Block to various sizes of ResNet models. Secondly, we assess the effectiveness of the USKT Block in enhancing model performance. Thirdly, we conduct a comparative analysis of the BiR-SSM Block.

Adaptability of USKT.

As shown in Table 2, we evaluated the performance of our proposed USKT across different sizes of ResNet on the DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS datasets to validate the applicability of USKT to various ResNet architectures.

Initially, we conducted experiments with the backbone network frozen (with only the bias parameters of ResNet unfrozen). Using ResNet18 as the backbone, the integration of USKT resulted in performance improvements of 2.94%, 2.12%, and 3.45% on the DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS datasets, respectively. With ResNet34 as the backbone, USKT enhanced the model’s performance by 2.24%, 1.8%, and 2.1% on these respective datasets. When employing ResNet50, the addition of USKT led to gains of 0.95%, 3.57%, and 2.9%.
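The frozen setting described above (all backbone parameters frozen except the biases) can be sketched as follows. This is an illustrative helper, not the authors' code: the function name and the name-based rule (a parameter counts as a bias if its name ends in "bias") are assumptions. The helper accepts any iterable of (name, parameter) pairs whose parameters expose a `requires_grad` attribute, which is the shape of what `model.named_parameters()` yields in PyTorch; stand-in objects are used here so the sketch runs without a deep-learning library.

```python
from types import SimpleNamespace

def freeze_all_but_bias(named_params):
    """Leave only bias parameters trainable.

    `named_params` is an iterable of (name, param) pairs, where `param`
    has a `requires_grad` attribute. Returns the names left trainable.
    """
    trainable = []
    for name, param in named_params:
        # Assumption: bias parameters are identified by their name suffix.
        param.requires_grad = name.endswith("bias")
        if param.requires_grad:
            trainable.append(name)
    return trainable

# Stand-in parameters with hypothetical names, for demonstration only.
demo_params = [
    ("conv1.weight", SimpleNamespace(requires_grad=True)),
    ("bn1.weight", SimpleNamespace(requires_grad=True)),
    ("bn1.bias", SimpleNamespace(requires_grad=True)),
    ("fc.bias", SimpleNamespace(requires_grad=True)),
]
trainable_names = freeze_all_but_bias(demo_params)

if __name__ == "__main__":
    print("trainable:", trainable_names)
```

Only a small fraction of the backbone's parameters remains trainable under such a rule, which is consistent with the low trainable-parameter count noted for the frozen configuration.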

We further evaluated performance on N-Caltech101 with the backbone network completely unfrozen. On this dataset, adding USKT improved the model’s performance by 1.2% with ResNet18, 0.49% with ResNet34, and 2.58% with ResNet50 as the backbone.

Table 2 illustrates that, on N-Caltech101, our method achieves the most substantial improvements with the ResNet50 backbone, irrespective of the network’s state (frozen or unfrozen). This is likely attributable to ResNet50’s enhanced capability to extract richer fine-grained information from images compared to the ResNet18 and ResNet34 models.

Effectiveness of USKT.

Table 3: Comparison of different domain-adaptive generation methods for classification accuracies, showing top-1 accuracy on the N-Caltech101.

As illustrated in Table 3, we conducted comparative evaluations between convolution and Transformer-based methods to validate the effectiveness of our proposed USKT. Initially, we substituted USKT with convolutional layers to assess the adaptive capabilities of our approach. The experiments were executed with a frozen ResNet50 backbone. Our model demonstrated improvements of 2.6% and 2.94% over single and double convolution layer setups, respectively.

Furthermore, to rigorously assess the efficacy of our proposed BiR-SSM, we carried out comparative experiments under both frozen and unfrozen conditions of the ResNet50 backbone, where BiR-SSM was replaced with a Transformer module. Under the frozen condition, our method exceeded the performance of the Transformer-based methods by 2.14%, achieving results comparable to those of the unfrozen backbone Transformer. Remarkably, even with fewer parameters in the unfrozen state, our approach not only matched but surpassed the Transformer-based model by 2.15%.

Comparison of BiR-SSM Block.

Table 4: Comparison of different SSM layers for classification accuracies, showing top-1 accuracy on N-Caltech101.

As shown in Table 4, we froze the ResNet50 backbone and substituted the original SSM with our BiR-SSM in various configurations. The results show that our proposed BiR-SSM outperforms the traditional Bi-SSM. This enhancement is likely attributable to the improved data consistency achieved through the shared SSM mechanism that we implemented.
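To make the idea of a shared SSM across scan directions concrete, the following toy sketch contrasts a bidirectional scan with separate per-direction parameters against one that reuses a single parameter set for both directions. It is a deliberately simplified illustration and not the paper's BiR-SSM: the scalar, time-invariant recurrence, the summation used to merge the two directions, and all function names are assumptions.

```python
def ssm_scan(xs, a, b, c):
    """Scalar linear SSM: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t, with h_0 = 0."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

def bidirectional_scan(xs, fwd_params, bwd_params):
    """Scan the sequence in both directions and sum the aligned outputs."""
    fwd = ssm_scan(xs, *fwd_params)
    # Scan the reversed sequence, then flip back to the original order.
    bwd = ssm_scan(xs[::-1], *bwd_params)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]

def shared_bidirectional_scan(xs, params):
    """Both directions reuse the same (a, b, c): half the SSM parameters."""
    return bidirectional_scan(xs, params, params)

if __name__ == "__main__":
    xs = [1.0, 0.0, 0.0]
    params = (0.5, 1.0, 1.0)  # hypothetical values for (a, b, c)
    print(shared_bidirectional_scan(xs, params))
```

In this toy setting, sharing parameters makes the two directions treat the sequence identically, so reversing the input simply reverses the output. This is one simple sense in which a shared SSM processes both scan orders consistently.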

4.4 Hyperparameter Studies

This section first discusses the impact of different numbers of BiR-SSM layers on the model, followed by an analysis of how different values of $\lambda_2$ affect model performance.

Figure 4: Left: performance of ResNet50 with different numbers of SSM layers in USKT. Right: performance of ResNet50 with different values of $\lambda_2$. Both panels show top-1 accuracy on DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS.

Comparison of Different Number of BiR-SSM Layers.

As demonstrated in Figure 4, employing a single BiR-SSM layer yields the highest accuracy, surpassing the configurations where no BiR-SSM layers or multiple BiR-SSM layers are used. In our setup, we utilized ResNet50 as the backbone with the main network components frozen. Specifically, with one BiR-SSM layer, our method achieved an accuracy of 88.82% on the N-Caltech101 dataset. Similarly, this configuration attained accuracies of 88.62% on DVS128 Gesture and 76.75% on CIFAR-10-DVS. We hypothesize that the absence of any BiR-SSM layers causes the adaptive domain to predominantly focus on local information, thereby neglecting global context. Conversely, incorporating more than one BiR-SSM layer can lead to overfitting or an excessive emphasis on the classification task, potentially compromising the model’s performance.

Comparison of Different $\lambda_2$ Settings.

As depicted in Figure 4, varying the parameter $\lambda_2$ significantly influences the performance of our model. In our evaluations, we employed a frozen ResNet50 architecture as the backbone and set $\lambda_1$ to 1. The results indicate optimal performance when $\lambda_2$ is set to 0.05, with the model achieving an accuracy of 88.82% on the N-Caltech101 dataset. Similar efficacy is observed on DVS128 Gesture, with an accuracy of 88.62%, and on CIFAR-10-DVS, with an accuracy of 76.75%. We hypothesize that a $\lambda_2$ value below 0.05 leads the model to prioritize the classification task, possibly at the expense of generalization capabilities. Conversely, a $\lambda_2$ value above 0.05 seems to focus the model excessively on the reconstruction task, which detrimentally impacts classification accuracy.
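The trade-off discussed above can be illustrated with a small sketch of a weighted objective. It assumes the total loss is a weighted sum in which $\lambda_1$ scales the classification loss and $\lambda_2$ scales the reconstruction loss, as the discussion of the two tasks suggests; the function names and the example loss values are hypothetical.

```python
def total_loss(cls_loss, rec_loss, lambda1=1.0, lambda2=0.05):
    """Weighted sum of classification and reconstruction losses (assumed form)."""
    return lambda1 * cls_loss + lambda2 * rec_loss

def reconstruction_share(cls_loss, rec_loss, lambda1=1.0, lambda2=0.05):
    """Fraction of the total objective contributed by the reconstruction term."""
    total = total_loss(cls_loss, rec_loss, lambda1, lambda2)
    return (lambda2 * rec_loss) / total

if __name__ == "__main__":
    # Hypothetical loss magnitudes, chosen only to show the trend.
    cls_loss, rec_loss = 2.0, 4.0
    for lam2 in (0.01, 0.05, 0.5):
        share = reconstruction_share(cls_loss, rec_loss, lambda2=lam2)
        print(f"lambda2={lam2}: reconstruction share = {share:.3f}")
```

With $\lambda_1$ fixed at 1, raising $\lambda_2$ increases the share of the objective taken by reconstruction, which matches the qualitative behavior hypothesized in the text.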

5 Conclusion

In this paper, we introduce the USKT framework to tackle the challenge of limited event data in event-based imaging by facilitating effective Event-to-RGB knowledge transfer. The framework allows event data to leverage pre-trained RGB models with minimal tuning, achieving robust performance. Our BiR-SSM component, with its shared weight strategy, further enhances computational efficiency. Experimental results across multiple datasets demonstrate USKT’s adaptability and effectiveness in advancing event-based imaging.

References