Chen Chen | University of Central Florida

Papers by Chen Chen

Research paper thumbnail of Infrared and visible image fusion via detail preserving adversarial learning

Information Fusion, 2019

Targets can be detected easily against the background of infrared images due to their highly discriminative thermal radiation, while visible images contain textural details with high spatial resolution that benefit target recognition. Fused images with abundant detail information and effective target areas are therefore desirable. In this paper, we propose an end-to-end model for infrared and visible image fusion based on detail-preserving adversarial learning. It overcomes the limitations of the manual and complicated design of activity-level measurement and fusion rules in traditional fusion methods. Considering the specific information of infrared and visible images, we design two loss functions, a detail loss and a target edge-enhancement loss, to improve the quality of detail information and sharpen the edges of infrared targets under a generative adversarial network framework. Our approach enables the fused image to simultaneously retain the thermal radiation and sharpened target boundaries of the infrared image and the abundant textural details of the visible image. Experiments conducted on publicly available datasets demonstrate the superiority of our strategy over state-of-the-art methods in both objective metrics and visual impression. In particular, our results resemble enhanced infrared images with clearly highlighted, edge-sharpened targets as well as abundant detail information.
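
The loss definitions themselves are not reproduced in this listing. As a rough, non-authoritative illustration, a detail loss of the kind described can be built from image gradients, encouraging the fused output to match the visible image's texture; the sketch below is PyTorch-style, and the function names and the L1 penalty are assumptions rather than the paper's formulation.

```python
import torch.nn.functional as F

def image_gradients(x):
    # Finite-difference gradients along height and width (x: B x 1 x H x W).
    dy = x[:, :, 1:, :] - x[:, :, :-1, :]
    dx = x[:, :, :, 1:] - x[:, :, :, :-1]
    return dx, dy

def detail_loss(fused, visible):
    # Encourage the fused image to reproduce the visible image's texture gradients.
    fdx, fdy = image_gradients(fused)
    vdx, vdy = image_gradients(visible)
    return F.l1_loss(fdx, vdx) + F.l1_loss(fdy, vdy)
```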

Research paper thumbnail of SmokeNet: Satellite Smoke Scene Detection Using Convolutional Neural Network with Spatial and Channel-Wise Attention

Remote Sensing, 2019

A variety of environmental analysis applications have been advanced by the use of satellite remote sensing. Smoke detection based on satellite imagery is imperative for wildfire detection and monitoring. However, commonly used smoke detection methods mainly focus on discriminating smoke from a few specific classes, which limits their applicability across regions containing diverse classes. To this end, in this paper, we present a new large-scale satellite imagery smoke detection benchmark based on Moderate Resolution Imaging Spectroradiometer (MODIS) data, namely USTC_SmokeRS, consisting of 6225 satellite images from six classes (i.e., cloud, dust, haze, land, seaside, and smoke) and covering various regions of the world. To build a baseline for smoke detection in satellite imagery, we evaluate several state-of-the-art deep learning-based image classification models. Moreover, we propose a new convolutional neural network (CNN) model, SmokeNet, which incorporates spatial and channel-wise attention in the CNN to enhance feature representation for scene classification. Experimental results using different proportions (16%, 32%, 48%, and 64%) of training images reveal that our model outperforms other approaches in both accuracy and Kappa coefficient. Specifically, the proposed SmokeNet model trained with 64% of the training images achieves the best accuracy of 92.75% and a Kappa coefficient of 0.9130. The model trained with 16% of the training images still improves classification accuracy and Kappa coefficient by at least 4.99% and 0.06, respectively, over the state-of-the-art models.
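
As orientation for the attention mechanism mentioned above, here is a minimal PyTorch sketch of combined channel-wise (squeeze-and-excitation style) and spatial attention; the module name, reduction ratio, and layer choices are assumptions, not SmokeNet's actual design.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    # Channel reweighting followed by spatial reweighting (illustrative only).
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_fc(x)    # reweight channels
        x = x * self.spatial_conv(x)  # reweight spatial positions
        return x

feats = torch.randn(2, 32, 64, 64)
out = ChannelSpatialAttention(32)(feats)  # same shape, attention-weighted
```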

Research paper thumbnail of One-two-one networks for compression artifacts reduction in remote sensing

Compression artifacts reduction (CAR) is a challenging problem in the field of remote sensing. Most recent deep learning based methods have demonstrated superior performance over previous hand-crafted methods. In this paper, we propose an end-to-end one-two-one (OTO) network that combines different deep models, i.e., summation and difference models, to solve the CAR problem. In particular, the difference model, motivated by the Laplacian pyramid, is designed to obtain the high-frequency information, while the summation model aggregates the low-frequency information. We provide an in-depth investigation into our OTO architecture based on the Taylor expansion, which shows that these two kinds of information can be fused in a nonlinear scheme to gain more capacity for handling complicated image compression artifacts, especially the blocking effect in compression. Extensive experiments demonstrate the superior performance of OTO networks compared with state-of-the-art methods on remote sensing datasets and other benchmark datasets. The source code will be made available.
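
As a small illustration of the Laplacian-pyramid intuition behind the summation/difference split, the sketch below separates an image into low- and high-frequency parts; the Gaussian decomposition is an assumed stand-in, not the OTO architecture itself.

```python
import numpy as np
from scipy import ndimage

def split_frequencies(img, sigma=2.0):
    # A smoothed copy carries the low frequencies; the residual carries the high ones.
    low = ndimage.gaussian_filter(img, sigma)
    high = img - low
    return low, high  # summation model targets `low`, difference model `high`

img = np.random.rand(64, 64)
low, high = split_frequencies(img)
assert np.allclose(low + high, img)  # the split is exactly invertible
```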

Research paper thumbnail of Gabor Convolutional Networks

Steerable properties dominate the design of traditional filters, e.g., Gabor filters, and endow features with the capability of dealing with spatial transformations. However, such excellent properties have not been well explored in popular deep convolutional neural networks (DCNNs). In this paper, we propose a new deep model, termed Gabor Convolutional Networks (GCNs or Gabor CNNs), which incorporates Gabor filters into DCNNs to enhance the resistance of deep learned features to orientation and scale changes. By only manipulating the basic element of DCNNs, i.e., the convolution operator, based on Gabor filters, GCNs can be easily implemented and are compatible with any popular deep learning architecture. Experimental results demonstrate the strong capability of our algorithm in recognizing objects where scale and rotation changes occur frequently. The proposed GCNs have much fewer learnable network parameters and are thus easier to train with an end-to-end pipeline. The source code will be made available.
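
A minimal sketch of the basic operation described above: modulating a learned convolution kernel with Gabor filters at several orientations, yielding orientation-specific copies of the same weights. The helper name and parameter values are illustrative assumptions.

```python
import numpy as np

def gabor_kernel(size=5, theta=0.0, sigma=2.0, lam=4.0, gamma=0.5):
    # Real part of a Gabor filter at orientation theta.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + gamma**2 * yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lam)

# Modulate one learned 5x5 kernel with four orientations of the Gabor filter.
learned = np.random.randn(5, 5)
bank = [learned * gabor_kernel(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
```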

Research paper thumbnail of Robust 3D Action Recognition through Sampling Local Appearances and Global Distributions

IEEE Transactions on Multimedia, 2017

3D action recognition has broad applications in human-computer interaction and intelligent surveillance. However, recognizing similar actions remains challenging, since previous work fails to capture motion and shape cues effectively from noisy depth data. In this paper, we propose a novel two-layer Bag-of-Visual-Words (BoVW) model, which suppresses noise disturbances and jointly encodes both motion and shape cues. First, background clutter is removed by a background modeling method designed for depth data. Then, motion and shape cues are jointly used to generate robust and distinctive spatial-temporal interest points (STIPs): motion-based STIPs and shape-based STIPs. In the first layer of our model, a multi-scale 3D local steering kernel (M3DLSK) descriptor is proposed to describe the local appearance of cuboids around motion-based STIPs. In the second layer, a spatial-temporal vector (STV) descriptor is proposed to describe the spatial-temporal distributions of shape-based STIPs. Using the BoVW model, motion and shape cues are combined to form a fused action representation. Our model performs favorably compared with common STIP detection and description methods. Thorough experiments verify that our model is effective in distinguishing similar actions and robust to background clutter, partial occlusions, and pepper noise.
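
For context, a minimal BoVW encoding sketch with scikit-learn: quantize local descriptors against a learned codebook and histogram the assignments. Descriptor dimensions and codebook size are placeholders; the paper applies this per layer before fusing the representations.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 32))           # stand-ins for M3DLSK descriptors
codebook = KMeans(n_clusters=16, n_init=10).fit(descriptors)

def bovw_histogram(desc, codebook):
    # Assign each descriptor to its nearest visual word, then normalize counts.
    words = codebook.predict(desc)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

video_desc = rng.normal(size=(80, 32))             # descriptors from one video
print(bovw_histogram(video_desc, codebook))
```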

Research paper thumbnail of An End-to-end 3D Convolutional Neural Network for Action Detection and Segmentation in Videos

Deep learning has been demonstrated to achieve excellent results for image classification and object detection. However, the impact of deep learning on video analysis (e.g., action detection and recognition) has not been as significant, due to the complexity of video data and the lack of annotations. In addition, training deep neural networks on large-scale video datasets is extremely computationally expensive. Previous convolutional neural network (CNN) based video action detection approaches usually consist of two major steps: frame-level action proposal generation and association of proposals across frames. Also, most of these methods employ a two-stream CNN framework to handle spatial and temporal features separately. In this paper, we propose an end-to-end 3D CNN for action detection and segmentation in videos. The proposed architecture is a unified deep network that is able to recognize and localize actions based on 3D convolution features. A video is first divided into equal-length clips, and for each clip a set of tube proposals is generated based on 3D CNN features. Finally, the tube proposals of different clips are linked together, and spatio-temporal action detection is performed using these linked video proposals. This top-down action detection approach explicitly relies on a set of good tube proposals to perform well, and training the bounding-box regression usually requires a large number of annotated samples. To remedy this, we further extend the 3D CNN to an encoder-decoder structure and formulate the localization problem as action segmentation. The foreground regions (i.e., action regions) of each frame are segmented first, and the segmented foreground maps are then used to generate the bounding boxes. This bottom-up approach effectively avoids tube proposal generation by leveraging the pixel-wise annotations of segmentation. The segmentation framework can also be readily applied to the general problem of video object segmentation. Extensive experiments on several video datasets demonstrate the superior performance of our approach for action detection and video object segmentation compared with state-of-the-art methods.
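
A small sketch of the bottom-up step described last: deriving per-frame bounding boxes from a segmented foreground mask via connected components. SciPy's labeling stands in here for whatever procedure the paper actually uses.

```python
import numpy as np
from scipy import ndimage

def boxes_from_mask(mask):
    # Label connected foreground regions, then take each region's extent as a box.
    labeled, n = ndimage.label(mask)
    boxes = []
    for region in ndimage.find_objects(labeled):
        ys, xs = region
        boxes.append((xs.start, ys.start, xs.stop - 1, ys.stop - 1))  # x1, y1, x2, y2
    return boxes

mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:7] = True
print(boxes_from_mask(mask))  # [(3, 2, 6, 4)]
```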

Research paper thumbnail of Latent Constrained Correlation Filter

IEEE Transactions on Image Processing

Correlation filters are special classifiers designed for shift-invariant object recognition, which are robust to pattern distortions. The recent literature shows that combining a set of sub-filters, each trained on a single image or a small group of images, obtains the best performance. The idea is equivalent to estimating a variable distribution based on data sampling (bagging), which can be interpreted as finding solutions (variable distribution approximation) directly from the sampled data space. However, this methodology fails to account for the variations existing in the data. In this paper, we introduce an intermediate step, solution sampling, after the data sampling step to form a subspace in which an optimal solution can be estimated. More specifically, we propose a new method, named latent constrained correlation filters (LCCF), by mapping the correlation filters to a given latent subspace, and develop a new learning framework in the latent subspace that embeds distribution-related constraints into the original problem. To solve the optimization problem, we introduce a subspace-based alternating direction method of multipliers (SADMM), which is proven to converge to the saddle point. Our approach is successfully applied to three different tasks, including eye localization, car detection, and object tracking. Extensive experiments demonstrate that LCCF outperforms state-of-the-art methods.
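
The abstract does not spell out the SADMM updates. For orientation, the generic ADMM iteration that a subspace-based variant would specialize has the standard textbook form below (not taken from the paper): for the problem of minimizing f(x) + g(z) subject to Ax + Bz = c, with augmented Lagrangian

```latex
L_{\rho}(x, z, y) = f(x) + g(z) + y^{\top}(Ax + Bz - c) + \tfrac{\rho}{2}\,\|Ax + Bz - c\|_{2}^{2},

\begin{aligned}
x^{k+1} &= \arg\min_{x}\; L_{\rho}(x, z^{k}, y^{k}), \\
z^{k+1} &= \arg\min_{z}\; L_{\rho}(x^{k+1}, z, y^{k}), \\
y^{k+1} &= y^{k} + \rho\,(A x^{k+1} + B z^{k+1} - c).
\end{aligned}
```

Per the abstract, SADMM carries these alternating updates out in the learned latent subspace.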

Research paper thumbnail of Person Reidentification via Discrepancy Matrix and Matrix Metric

Person reidentification (re-id), as an important task in video surveillance and forensics applications, has been widely studied. Previous research efforts toward solving the person re-id problem have primarily focused on constructing robust vector descriptions by exploiting appearance characteristics, or on learning a discriminative distance metric from labeled vectors. Based on the cognition and identification process of humans, we propose a new pattern that transforms the feature description from a characteristic vector to a discrepancy matrix. In particular, to better identify a person, it converts the distance metric from a vector metric to a matrix metric, which consists of an intradiscrepancy projection part and an interdiscrepancy projection part. We introduce a consistency term and a discriminative term to form the objective function. To solve it efficiently, we use a simple gradient-descent method within an alternating optimization process with respect to the two projections. Experimental results on public datasets demonstrate the effectiveness of the proposed pattern compared with state-of-the-art approaches.

Research paper thumbnail of Multiple features learning for ship classification in optical imagery

Sea-surface vessel/ship classification is a challenging problem with enormous implications for the world's global supply chain and for militaries. The problem is similar to other well-studied problems in object recognition, such as face recognition. However, it is more complex, since ships' appearance is easily affected by external factors such as lighting or weather conditions, viewing geometry, and sea state. The large within-class variations in some vessels also make ship classification more complicated and challenging. In this paper, we propose an effective multiple features learning (MFL) framework for ship classification, which contains three types of features: Gabor-based multi-scale completed local binary patterns (MS-CLBP), patch-based MS-CLBP combined with the Fisher vector, and a combination of bag of visual words (BOVW) and spatial pyramid matching (SPM). After multiple feature learning, feature-level fusion and decision-level fusion are both investigated for final classification. In the proposed framework, a typical support vector machine (SVM) classifier is employed to provide posterior-probability estimation. Experimental results on remote sensing ship image datasets demonstrate that the proposed approach shows a consistent performance improvement compared with state-of-the-art methods.
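
A toy sketch of the decision-level fusion option: one probabilistic SVM per feature type, with posteriors averaged across classifiers. The feature dimensions and random data are placeholders; the paper also investigates feature-level fusion.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
features = [rng.normal(size=(40, 16)) for _ in range(3)]  # stand-ins for the three feature types
labels = rng.integers(0, 2, size=40)

# One probabilistic SVM per feature type, then average the posterior estimates.
clfs = [SVC(probability=True).fit(f, labels) for f in features]
posterior = np.mean([c.predict_proba(f) for c, f in zip(clfs, features)], axis=0)
pred = posterior.argmax(axis=1)
```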

Research paper thumbnail of Manifold Constrained Low-Rank Decomposition

Low-rank decomposition (LRD) is a state-of-the-art method for visual data reconstruction and modelling. However, it becomes a very challenging problem when the image data contain significant occlusion, noise, illumination variation, and misalignment from rotation or viewpoint changes. We leverage the specific structure of the data in order to improve the performance of LRD when the data are not ideal. To this end, we propose a new framework that embeds manifold priors into LRD. To implement the framework, we design an alternating direction method of multipliers (ADMM) scheme that efficiently integrates the manifold constraints during the optimization process. The proposed approach is successfully used to calculate low-rank models from face images, handwritten digits, and planar surface images. The results show a consistent increase in performance compared with the state-of-the-art over a wide range of realistic image misalignments and corruptions.
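
For reference, a common way to write a manifold-regularized low-rank decomposition, which the framework described above resembles (the paper's exact objective may differ):

```latex
\min_{A,E}\; \|A\|_{*} + \lambda \|E\|_{1} + \gamma\, \operatorname{tr}\!\left(A L A^{\top}\right)
\quad \text{s.t.} \quad D = A + E
```

Here D is the data matrix, A its low-rank component, E a sparse error term, and L a graph Laplacian encoding the manifold prior; an ADMM scheme then alternates over A, E, and the Lagrange multiplier.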

Research paper thumbnail of Action Recognition Using 3D Histograms of Texture and a Multi-Class Boosting Classifier

Human action recognition is an important yet challenging task. This paper presents a low-cost descriptor called 3D Histograms of Texture (3DHoTs) to extract discriminant features from a sequence of depth maps. 3DHoTs are derived by projecting depth frames onto three orthogonal Cartesian planes, i.e., the frontal, side, and top planes, and thus compactly characterize the salient information of a specific action, on which texture features are calculated to represent the action. Besides this fast feature descriptor, a new multi-class boosting classifier (MBC) is also proposed to efficiently exploit different kinds of features in a unified framework for action classification. Compared with existing boosting frameworks, we add a new multi-class constraint to the objective function, which helps to maintain a better margin distribution by maximizing the margin mean while still minimizing the margin variance. Experiments on the MSRAction3D, MSRGesture3D, MSRActivity3D, and UTD-MHAD datasets demonstrate that the proposed system combining 3DHoTs and MBC is superior to the state-of-the-art.
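
A minimal sketch of the projection step: render a side-plane view of each depth frame and accumulate frame-to-frame differences per plane. The depth binning and parameters are assumptions; texture features would then be computed on the resulting maps.

```python
import numpy as np

def side_view(depth, depth_bins=64, d_max=1.0):
    # Map each pixel (y, x) with depth d onto the (y, depth) side plane.
    h, w = depth.shape
    img = np.zeros((h, depth_bins))
    d_idx = np.minimum((depth / d_max * (depth_bins - 1)).astype(int), depth_bins - 1)
    for y in range(h):
        img[y, d_idx[y]] = 1.0
    return img

def motion_map(frames):
    # Accumulate absolute frame-to-frame differences of a projected view.
    return np.abs(np.diff(np.stack(frames), axis=0)).sum(axis=0)

seq = np.random.rand(10, 32, 32)                # T x H x W depth frames
front = motion_map(seq)                         # frontal plane: H x W
side = motion_map([side_view(f) for f in seq])  # side plane: H x depth_bins
```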

Research paper thumbnail of ACTION RECOGNITION WITH GRADIENT BOUNDARY CONVOLUTIONAL NETWORK

Deep learning features for video action recognition are usually learned from RGB/gray images, image gradients, and optical flows. A single modality of input data can describe one characteristic of the human action, such as appearance structure or motion information. In this paper, we propose a highly efficient gradient boundary convolutional network (ConvNet) to simultaneously learn spatio-temporal features from the single modality of gradient boundaries. Gradient boundaries represent both the local spatial structure and the motion information of an action video. They also contain less background noise than RGB/gray images and image gradients. Extensive experiments are conducted on two popular and challenging action benchmarks, the UCF101 and HMDB51 action datasets. The proposed deep gradient boundary feature achieves competitive performance on both benchmarks.
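
A small sketch of computing gradient boundaries as temporal differences of per-frame spatial gradients, one plausible reading of the descriptor; the paper's exact definition may differ.

```python
import numpy as np

def gradient_boundaries(frames):
    # Per-frame spatial gradients, then differences over time: moving structure
    # survives while static background largely cancels out.
    gy, gx = np.gradient(frames.astype(float), axis=(1, 2))
    return np.diff(gx, axis=0), np.diff(gy, axis=0)

video = np.random.rand(16, 64, 64)   # T x H x W grayscale clip
bx, by = gradient_boundaries(video)  # candidate inputs to the ConvNet
```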

Research paper thumbnail of Fusing Local and Global Features for High-Resolution Scene Classification

In this paper, a fused global saliency-based multiscale multiresolution multistructure local binary pattern (salM³LBP) feature and local codebookless model (CLM) feature is proposed for high-resolution image scene classification. First, two different but complementary types of descriptors (pixel intensities and differences) are developed to extract global features, characterizing the dominant spatial features in a multiscale, multiresolution, and multistructure manner. The micro/macrostructure information and rotation invariance are guaranteed in the global feature extraction process. For dense local feature extraction, CLM is utilized to model the local enrichment scale-invariant feature transform descriptor, and dimension reduction is conducted via joint low-rank learning with a support vector machine. Finally, a fused feature representation of salM³LBP and CLM is presented as the scene descriptor to train a kernel-based extreme learning machine for scene classification. The proposed approach is extensively evaluated on three challenging benchmark scene datasets (the 21-class land-use scene, the 19-class satellite scene, and a newly available 30-class aerial scene), and the experimental results show that it leads to superior classification performance compared with state-of-the-art classification methods.
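
For context, here is the basic 8-neighbor LBP primitive that the multiscale, multiresolution, multistructure variant builds on (illustrative implementation, not the paper's code):

```python
import numpy as np

def lbp_codes(img):
    # Compare each interior pixel with its 8 neighbors and pack the results into a byte.
    c = img[1:-1, 1:-1]
    neighbors = [img[:-2, :-2], img[:-2, 1:-1], img[:-2, 2:],
                 img[1:-1, 2:], img[2:, 2:], img[2:, 1:-1],
                 img[2:, :-2], img[1:-1, :-2]]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, n in enumerate(neighbors):
        code += (n >= c).astype(np.uint8) << bit
    return code

print(lbp_codes(np.random.rand(8, 8)).shape)  # (6, 6)
```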

Research paper thumbnail of 3D ACTION RECOGNITION USING MULTI-TEMPORAL SKELETON VISUALIZATION

Action recognition using depth sequences plays an important role in many fields, e.g., intelligent surveillance and content-based video retrieval, and real applications require robust and accurate action recognition methods. In this paper, we propose a skeleton visualization method that efficiently encodes the spatial-temporal information of skeleton joints into a set of color images. These images serve as inputs to convolutional neural networks, which extract more discriminative deep features. To enhance the ability of deep features to capture global relationships, we extend the color images to a multi-temporal version. Additionally, to mitigate the effect of viewpoint changes, a spatial transform method is adopted as a preprocessing step. Extensive experiments on the NTU RGB+D dataset and the ICME 2017 challenge show that our method can accurately distinguish similar actions and is robust to view variations.
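
A minimal sketch of the encoding idea: joints as rows, frames as columns, normalized coordinates as RGB. This is one plausible mapping; the paper's exact encoding and its multi-temporal extension may differ.

```python
import numpy as np

def skeleton_to_image(joints):
    # joints: T x J x 3. Normalize each coordinate to [0, 1], then lay out as J x T x 3.
    lo = joints.min(axis=(0, 1), keepdims=True)
    hi = joints.max(axis=(0, 1), keepdims=True)
    norm = (joints - lo) / (hi - lo + 1e-8)
    return (norm.transpose(1, 0, 2) * 255).astype(np.uint8)

seq = np.random.rand(100, 25, 3)  # 100 frames, 25 joints (e.g., NTU RGB+D)
img = skeleton_to_image(seq)      # a J x T color image to feed a CNN
```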

Research paper thumbnail of Cross-View Image Matching for Geo-localization in Urban Environments

In this paper, we address the problem of cross-view image geo-localization. Specifically, we aim to estimate the GPS location of a query street view image by finding the matching images in a reference database of geo-tagged bird's eye view images, or vice versa. To this end, we present a new framework for cross-view image geo-localization that takes advantage of the tremendous success of deep convolutional neural networks (CNNs) in image classification and object detection. First, we employ the Faster R-CNN [16] to detect buildings in the query and reference images. Next, for each building in the query image, we retrieve the k nearest neighbors from the reference buildings using a Siamese network trained on both positive matching image pairs and negative pairs. To find the correct nearest neighbor for each query building, we develop an efficient multiple nearest neighbors matching method based on dominant sets. We evaluate the proposed framework on a new dataset that consists of pairs of street view and bird's eye view images. Experimental results show that the proposed method achieves better geo-localization accuracy than other approaches and is able to generalize to images at unseen locations.
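
A toy sketch of the retrieval step, assuming each building crop has already been embedded by the trained Siamese network; the embedding dimensions and data are placeholders.

```python
import numpy as np

def knn_retrieve(query_emb, ref_embs, k=5):
    # Euclidean distances between every query and reference embedding,
    # then the indices of the k nearest references per query.
    d = np.linalg.norm(ref_embs[None, :, :] - query_emb[:, None, :], axis=2)
    return np.argsort(d, axis=1)[:, :k]

queries = np.random.rand(3, 128)   # 3 detected buildings in the query image
refs = np.random.rand(1000, 128)   # reference database embeddings
print(knn_retrieve(queries, refs))
```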

Research paper thumbnail of Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos

Deep learning has been demonstrated to achieve excellent results for image classification and object detection. However, the impact of deep learning on video analysis (e.g., action detection and recognition) has been limited due to the complexity of video data and the lack of annotations. Previous convolutional neural network (CNN) based video action detection approaches usually consist of two major steps: frame-level action proposal generation and association of proposals across frames. Also, most of these methods employ a two-stream CNN framework to handle spatial and temporal features separately. In this paper, we propose an end-to-end deep network called Tube Convolutional Neural Network (T-CNN) for action detection in videos. The proposed architecture is a unified deep network that is able to recognize and localize actions based on 3D convolution features. A video is first divided into equal-length clips, and for each clip a set of tube proposals is generated based on 3D convolutional network (ConvNet) features. Finally, the tube proposals of different clips are linked together using network flow, and spatio-temporal action detection is performed using these linked video proposals. Extensive experiments on several video datasets demonstrate the superior performance of T-CNN for classifying and localizing actions in both trimmed and untrimmed videos compared with state-of-the-art methods.
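
The paper links proposals with network flow; as a simplified stand-in, the greedy sketch below extends a tube clip by clip using actionness score plus box overlap at the clip boundary. The scoring weight and data layout are assumptions.

```python
import numpy as np

def iou(a, b):
    # a, b: (x1, y1, x2, y2) boxes at a clip boundary frame.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def link_tubes(clips, w=1.0):
    # Greedily extend the tube with the proposal maximizing score + w * overlap.
    tube = [max(clips[0], key=lambda p: p["score"])]
    for proposals in clips[1:]:
        prev = tube[-1]
        tube.append(max(proposals,
                        key=lambda p: p["score"] + w * iou(prev["box"], p["box"])))
    return tube

clips = [[{"box": (10, 10, 50, 80), "score": 0.9}],
         [{"box": (12, 11, 52, 82), "score": 0.8},
          {"box": (60, 5, 90, 40), "score": 0.85}]]
print(link_tubes(clips))  # picks the overlapping proposal in the second clip
```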

Research paper thumbnail of LOW-RESOLUTION PEDESTRIAN DETECTION VIA A NOVEL RESOLUTION-SCORE DISCRIMINATIVE SURFACE

Pedestrian detection, as an important task in video surveillance and forensics applications, has been widely studied. However, its performance remains unsatisfactory, especially under low-resolution conditions. In realistic scenarios, the size of pedestrians in images is often small, which makes detection challenging. To solve this problem, this paper proposes a novel resolution-score discriminative surface method that investigates how detection scores vary across different pedestrian and non-pedestrian image resolutions. The discriminative surface consists of a series of positive and negative resolution-score lines, each of which is a connected line depicting how a window's detection score varies with image resolution. On this basis, the resolution-score discriminative surface can classify a resolution-score line as pedestrian or not according to whether it lies in the positive or the negative region. Experimental results on two public datasets and one campus surveillance dataset demonstrate the effectiveness of the proposed method.
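
A minimal sketch of building one resolution-score line: score the same window at several downsampled resolutions. The scorer here is a dummy standing in for a trained detector, and the scales are placeholders.

```python
import numpy as np

def resolution_score_line(window, scorer, scales=(1.0, 0.75, 0.5, 0.25)):
    # Nearest-neighbor downsample the window at each scale and record its score.
    line = []
    for s in scales:
        h, w = window.shape
        ys = (np.arange(max(1, int(h * s))) / s).astype(int)
        xs = (np.arange(max(1, int(w * s))) / s).astype(int)
        line.append(scorer(window[np.ix_(ys, xs)]))
    return np.array(line)

scorer = lambda patch: float(patch.mean())  # dummy stand-in for a real detector
print(resolution_score_line(np.random.rand(64, 32), scorer))
```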

Research paper thumbnail of Real-Time Continuous Action Detection and Recognition Using Depth Images and Inertial Signals

This paper presents an approach to detect and recognize actions of interest in real time from a continuous stream of data captured simultaneously by a Kinect depth camera and a wearable inertial sensor. Actions of interest are considered to appear continuously and in a random order among actions of non-interest. Skeleton data from depth images are first used to separate actions of interest from actions of non-interest based on pause and motion segments. Inertial signals from a wearable inertial sensor are then used to improve the recognition outcome. A dataset consisting of simultaneous depth and inertial data for the smart TV actions of interest, occurring continuously and in a random order among actions of non-interest, is studied and made publicly available. The results obtained indicate the effectiveness of the developed approach in coping with actions that are performed realistically in a continuous manner. Keywords: real-time continuous action detection; action recognition from continuous data streams; simultaneous utilization of depth images and inertial signals for action recognition.
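
A small sketch of pause/motion segmentation on a skeleton stream via a motion-energy threshold; the threshold and minimum segment length are assumed values, not the paper's.

```python
import numpy as np

def motion_segments(joints, thresh=0.02, min_len=5):
    # Per-frame motion energy: total joint displacement between consecutive frames.
    energy = np.linalg.norm(np.diff(joints, axis=0), axis=2).sum(axis=1)
    moving = energy > thresh
    segments, start = [], None
    for t, m in enumerate(moving):
        if m and start is None:
            start = t                      # motion segment begins
        elif not m and start is not None:
            if t - start >= min_len:
                segments.append((start, t))  # pause ends the segment
            start = None
    if start is not None and len(moving) - start >= min_len:
        segments.append((start, len(moving)))
    return segments

stream = np.random.rand(200, 20, 3) * 0.01  # T x J x 3 skeleton stream
print(motion_segments(stream))
```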
