Shikha Dubey - Academia.edu

Papers by Shikha Dubey

3D Convolutional with Attention for Action Recognition

arXiv (Cornell University), Jun 5, 2022

Speech Enhancement using Adaptive Mean Median Deviation and EMD Technique

2019 IEEE International Conference on Signals and Systems (ICSigSys)

During the acquisition of the speech signal by a non-contact Speech Sensor (SS), the signal is degraded by severe colored noise that is non-linear and non-uniform in nature. Therefore, in this study, a new speech enhancement approach is proposed to suppress this noise in the acquired speech signal. The technique is based on adaptive thresholding, using the Mean Median Deviation (MMD) to determine adaptive threshold points, combined with the Empirical Mode Decomposition (EMD) method. The algorithm has been validated on simulated data, and its results have been compared with other existing enhancement algorithms. Spectrograms and Signal-to-Noise Ratio (SNR) comparisons are used to analyze the quality of the enhanced speech signal. The results demonstrate that the proposed algorithm offers better speech enhancement than existing speech enhancement techniques.
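
As a rough illustration of the EMD-domain thresholding idea described above, the NumPy sketch below shrinks each intrinsic mode function (IMF) with an adaptive threshold and sums the result back into a time-domain signal. The IMFs are assumed to be precomputed by any EMD implementation; the specific MMD formula used here (mean absolute deviation from the median), the soft-thresholding operator, and the choice to threshold only the first IMF are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def mmd_threshold(imf: np.ndarray) -> float:
    """Adaptive threshold for one IMF.

    Stand-in for the paper's Mean Median Deviation (MMD): here taken as
    the mean absolute deviation of the IMF samples from their median.
    """
    return float(np.mean(np.abs(imf - np.median(imf))))

def soft_threshold(x: np.ndarray, t: float) -> np.ndarray:
    """Standard soft-thresholding (shrinkage) operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def enhance_from_imfs(imfs: np.ndarray, n_noisy: int = 1) -> np.ndarray:
    """Reconstruct a denoised signal from its EMD decomposition.

    imfs: (n_imfs, n_samples) array, e.g. produced by an EMD library.
    Only the first `n_noisy` (high-frequency, noise-dominated) IMFs are
    thresholded; the remaining IMFs are kept intact, and everything is
    summed back into a time-domain signal.
    """
    cleaned = [soft_threshold(imf, mmd_threshold(imf)) if i < n_noisy else imf
               for i, imf in enumerate(imfs)]
    return np.sum(cleaned, axis=0)

if __name__ == "__main__":
    # Toy demo: a 220 Hz tone plus slowly drifting (coloured-ish) noise.
    # The two components stand in for IMFs; a real pipeline would obtain
    # them with an EMD implementation.
    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 8000)
    clean = np.sin(2 * np.pi * 220 * t)
    noise = np.cumsum(rng.normal(size=t.size)) / 50
    fake_imfs = np.stack([noise, clean])
    enhanced = enhance_from_imfs(fake_imfs)

    def snr(ref, est):
        return 10 * np.log10(np.sum(ref**2) / np.sum((est - ref) ** 2))

    print(f"input SNR:  {snr(clean, clean + noise):.1f} dB")
    print(f"output SNR: {snr(clean, enhanced):.1f} dB")
```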

Image Captioning using Multiple Transformers for Self-Attention Mechanism

Cornell University - arXiv, Feb 14, 2021

Real-time image captioning with adequate precision is the main challenge of this research field. The present work, Multiple Transformers for Self-Attention Mechanism (MTSM), utilizes multiple transformers to address these problems. The proposed algorithm, MTSM, acquires region proposals using a transformer detector (DETR). It then realizes the self-attention mechanism by passing these region proposals and their visual and geometrical features through another transformer, learning the objects' local and global interconnections. Qualitative and quantitative results of the proposed algorithm, MTSM, are reported on the MSCOCO dataset.
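
The toy PyTorch sketch below illustrates the general two-transformer arrangement described above: a second transformer self-attends over region features concatenated with their box geometry, and a decoder emits caption tokens. The inputs are random placeholders standing in for detector outputs; the `RegionCaptioner` class, the concatenation-based fusion, and all dimensions are illustrative assumptions, not the authors' MTSM implementation.

```python
import torch
import torch.nn as nn

class RegionCaptioner(nn.Module):
    """Toy captioner over detector region proposals."""

    def __init__(self, feat_dim=256, geo_dim=4, d_model=256, vocab=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim + geo_dim, d_model)  # fuse visual + geometric
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=3)
        self.tok_emb = nn.Embedding(vocab, d_model)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, region_feats, boxes, caption_tokens):
        # region_feats: (B, R, feat_dim); boxes: (B, R, 4) normalised geometry
        regions = self.proj(torch.cat([region_feats, boxes], dim=-1))
        memory = self.encoder(regions)            # object interconnections
        tgt = self.tok_emb(caption_tokens)        # (B, T, d_model)
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=tgt.device),
                            diagonal=1)           # autoregressive mask
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                  # (B, T, vocab)

if __name__ == "__main__":
    model = RegionCaptioner()
    feats, boxes = torch.randn(2, 36, 256), torch.rand(2, 36, 4)
    tokens = torch.randint(0, 10000, (2, 12))
    print(model(feats, boxes, tokens).shape)  # torch.Size([2, 12, 10000])
```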

Anomalous Event Recognition in Videos Based on Joint Learning of Motion and Appearance with Multiple Ranking Measures

Applied Sciences, 2021

Given the scarcity of annotated datasets, learning the context-dependency of anomalous events and mitigating false alarms are key challenges in anomalous activity detection. We propose a framework, Deep-network with Multiple Ranking Measures (DMRMs), which addresses context-dependency using a joint learning technique for motion and appearance features. In DMRMs, spatio-temporal features are extracted from a video using a 3D residual network (ResNet), and deep motion features are extracted by integrating motion flow maps' information with the 3D ResNet. The extracted features are then fused for joint learning. The fused features are passed through a deep neural network for deep multiple instance learning (DMIL) to learn context-dependency in a weakly supervised manner using the proposed multiple ranking measures (MRMs). These MRMs consider multiple measures of false alarms, and the network is trained with both normal and anomalous event…
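
A minimal sketch of the appearance/motion fusion and per-segment scoring described above, assuming precomputed 3D-ResNet appearance features and flow-based motion features (here replaced by random placeholders). The concatenation-based late fusion, the layer sizes, and the dropout rate are assumptions for illustration, not the DMRMs architecture.

```python
import torch
import torch.nn as nn

class FusionScorer(nn.Module):
    """Toy anomaly scorer over fused appearance + motion segment features."""

    def __init__(self, feat_dim=2048):
        super().__init__()
        # Small MLP mapping each fused segment feature to a score in [0, 1].
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.ReLU(), nn.Dropout(0.6),
            nn.Linear(512, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, appearance, motion):
        # appearance, motion: (segments, feat_dim) for one video (bag)
        fused = torch.cat([appearance, motion], dim=-1)   # simple late fusion
        return self.head(fused).squeeze(-1)               # (segments,) anomaly scores

if __name__ == "__main__":
    scorer = FusionScorer()
    app, mot = torch.randn(32, 2048), torch.randn(32, 2048)
    print(scorer(app, mot).shape)  # torch.Size([32])
```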

3D ResNet with Ranking Loss Function for Abnormal Activity Detection in Videos

2019 International Conference on Control, Automation and Information Sciences (ICCAIS), 2019

Abnormal activity detection is one of the most challenging tasks in the field of computer vision. This study is motivated by recent state-of-the-art work on abnormal activity detection, which utilizes both abnormal and normal videos to learn abnormalities through multiple instance learning with only video-level labels. In the absence of temporal annotations, such a model is prone to false alarms when detecting abnormalities. For this reason, this paper focuses on minimizing the false alarm rate while performing abnormal activity detection. The need to mitigate these false alarms, together with recent advances in 3D deep neural networks for video action recognition, motivates the use of a 3D ResNet in our proposed method to extract spatio-temporal features from the videos. Using these features and deep multiple instance learning along with the proposed ranking loss, our model learns to predict an abnormality score at the video-segment level. The proposed method, 3D deep Multiple Instance Learning with ResNet (MILR), together with the newly proposed ranking loss function, achieves the best performance on the UCF-Crime benchmark dataset compared to other state-of-the-art methods.
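
For context, the sketch below implements the widely used MIL ranking formulation that this line of work builds on: a hinge ranking between the top-scoring segment of an anomalous video and the top-scoring segment of a normal video, with temporal-smoothness and sparsity regularizers. The paper proposes its own ranking loss, which this sketch does not reproduce; the coefficients here are illustrative defaults.

```python
import torch

def mil_ranking_loss(anom_scores, norm_scores, lam1=8e-5, lam2=8e-5):
    """MIL ranking loss over one anomalous and one normal video (bag).

    anom_scores, norm_scores: per-segment scores in [0, 1], shape (S,).
    The hinge term pushes the top anomalous segment above the top normal
    segment; smoothness and sparsity terms regularise the anomalous bag.
    """
    hinge = torch.relu(1.0 - anom_scores.max() + norm_scores.max())
    smooth = ((anom_scores[1:] - anom_scores[:-1]) ** 2).sum()
    sparse = anom_scores.sum()
    return hinge + lam1 * smooth + lam2 * sparse

if __name__ == "__main__":
    a = torch.rand(32, requires_grad=True)   # segment scores, anomalous video
    n = torch.rand(32, requires_grad=True)   # segment scores, normal video
    loss = mil_ranking_loss(a, n)
    loss.backward()
    print(float(loss))
```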

Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning

Cornell University - arXiv, Sep 16, 2021

Automatic transcription of scene understanding in images and videos is a step towards artificial general intelligence. Image captioning is the task of describing meaningful information in an image using computer vision techniques. Automated image captioning techniques use an encoder-decoder architecture, where the encoder extracts features from an image and the decoder generates a transcript. In this work, we investigate two unexplored ideas for image captioning using transformers: first, enforcing the use of objects' relevance in the surrounding environment; second, learning an explicit association between labels and language constructs. We propose the Label-Attention Transformer with Geometrically Coherent Objects (LATGeO). The proposed technique acquires a proposal of geometrically coherent objects using a deep neural network (DNN) and generates captions by investigating their relationships using a label-attention module. Object coherence is defined using the localized ratio of the geometrical properties of the proposals. The label-attention module associates the extracted objects' classes with the available dictionary using self-attention layers. The experimental results show that objects' relevance in their surroundings, and the binding of their visual features with their geometrically localized ratios and associated labels, help in defining meaningful captions. The proposed framework is tested on the MSCOCO dataset, and a thorough evaluation yielding overall better quantitative scores demonstrates its superiority.
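
The "localized ratio of geometrical properties" suggests pairwise geometric relations between object proposals. The sketch below computes one common form of such relation features (relative center offsets and log size ratios between boxes); the exact definition used in LATGeO may differ, so treat this as an illustrative assumption rather than the paper's formulation.

```python
import torch

def geometric_relation(boxes: torch.Tensor) -> torch.Tensor:
    """Pairwise geometric features between object proposals.

    boxes: (N, 4) as (cx, cy, w, h), normalised to the image size.
    Returns an (N, N, 4) tensor of relative offsets and log size ratios.
    """
    cx, cy, w, h = boxes.unbind(-1)
    dx = (cx[:, None] - cx[None, :]) / (w[:, None] + 1e-6)   # relative x offset
    dy = (cy[:, None] - cy[None, :]) / (h[:, None] + 1e-6)   # relative y offset
    dw = torch.log(w[:, None] / (w[None, :] + 1e-6) + 1e-6)  # width ratio
    dh = torch.log(h[:, None] / (h[None, :] + 1e-6) + 1e-6)  # height ratio
    return torch.stack([dx, dy, dw, dh], dim=-1)

if __name__ == "__main__":
    proposals = torch.rand(5, 4)
    print(geometric_relation(proposals).shape)  # torch.Size([5, 5, 4])
```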

Improving Small Objects Detection using Transformer

General artificial intelligence is a trade-off between the inductive bias of an algorithm and its out-of-distribution generalization performance. The conspicuous impact of inductive bias is the continuing trend of improved predictions in various computer vision problems, such as object detection. Although a recently introduced transformer-based object detection technique (DETR) shows results competitive with conventional and modern object detection models, its accuracy deteriorates when detecting small-sized objects (in perspective). This study examines the inductive bias of DETR and proposes a normalized inductive bias for object detection using a transformer (SOF-DETR). It uses a lazy fusion of features to sustain deep contextual information about objects present in the image. Features from multiple subsequent deep layers are fused by element-wise summation and input to a transformer network, whose object queries learn the long- and short-distance spatial associations in the…
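
A minimal sketch of the element-wise fusion step described above: a deeper, lower-resolution feature map is projected and upsampled to match a shallower one, the two are summed, and the result is flattened into a token sequence for a transformer encoder. The channel sizes, the two-stage setup, and the `LazyFusion` name are illustrative assumptions, not the SOF-DETR implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LazyFusion(nn.Module):
    """Element-wise fusion of two backbone stages before a transformer."""

    def __init__(self, c_shallow=512, c_deep=2048, d_model=256):
        super().__init__()
        self.p_shallow = nn.Conv2d(c_shallow, d_model, kernel_size=1)
        self.p_deep = nn.Conv2d(c_deep, d_model, kernel_size=1)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, shallow, deep):
        # shallow: (B, c_shallow, H, W); deep: (B, c_deep, H/2, W/2)
        s = self.p_shallow(shallow)
        d = F.interpolate(self.p_deep(deep), size=s.shape[-2:], mode="nearest")
        fused = s + d                                # element-wise summation
        tokens = fused.flatten(2).transpose(1, 2)    # (B, H*W, d_model)
        return self.encoder(tokens)

if __name__ == "__main__":
    m = LazyFusion()
    out = m(torch.randn(1, 512, 32, 32), torch.randn(1, 2048, 16, 16))
    print(out.shape)  # torch.Size([1, 1024, 256])
```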
