A selective spatio-temporal interest point detector for human action recognition in complex scenes
Related papers
Selective spatio-temporal interest points
Computer Vision and Image Understanding
Recent progress in the field of human action recognition points towards the use of Spatio-Temporal Interest Points (STIPs) for local descriptor-based recognition strategies. In this paper, we present a novel approach for robust and selective STIP detection, by applying surround suppression combined with local and temporal constraints. This new method is significantly different from existing STIP detection techniques and improves the performance by detecting more repeatable, stable and distinctive STIPs for human actors, while suppressing unwanted background STIPs. For action representation we use a bag-of-video words (BoV) model of local N-jet features to build a vocabulary of visual words. To this end, we introduce a novel vocabulary building strategy by combining spatial pyramid and vocabulary compression techniques, resulting in improved performance and efficiency. Action class specific Support Vector Machine (SVM) classifiers are trained for categorization of human actions. A comprehensive set of experiments on popular benchmark datasets (KTH and Weizmann), more challenging datasets of complex scenes with background clutter and camera motion (CVC and CMU), movie and YouTube video clips (Hollywood 2 and YouTube), and complex scenes with multiple actors (MSR I and Multi-KTH), validates our approach and shows state-of-the-art performance. Due to the unavailability of ground truth action annotation data for the Multi-KTH dataset, we introduce an actor-specific spatio-temporal clustering of STIPs to address the problem of automatic action annotation of multiple simultaneous actors. Additionally, we perform cross-data action recognition by training on source datasets (KTH and Weizmann) and testing on completely different and more challenging target datasets (CVC, CMU, MSR I and Multi-KTH). This demonstrates the robustness of our proposed approach in realistic scenarios, using separate training and test datasets, which in general has been a shortcoming in the performance evaluation of human action recognition techniques.
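The recognition pipeline described above quantizes local descriptors into a visual-word vocabulary, encodes each video as a histogram, and trains class-specific SVMs. Below is a minimal sketch of that generic BoV-plus-SVM pipeline using scikit-learn; the paper's STIP detection, surround suppression, N-jet descriptors, spatial pyramid and vocabulary compression steps are not reproduced, and all array shapes and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_vocabulary(descriptor_list, k=200, seed=0):
    """Cluster all local descriptors (one array per video) into k visual words."""
    all_desc = np.vstack(descriptor_list)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(all_desc)

def encode_video(descriptors, vocab):
    """Hard-assign descriptors to words and return an L1-normalized histogram."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_class_specific_svms(histograms, labels):
    """One binary SVM per action class, mirroring class-specific classifiers."""
    classifiers = {}
    for cls in np.unique(labels):
        clf = SVC(kernel="rbf")
        clf.fit(histograms, (labels == cls).astype(int))
        classifiers[cls] = clf
    return classifiers

# Example with random stand-ins for local descriptors (e.g., N-jet features):
rng = np.random.default_rng(0)
videos = [rng.normal(size=(rng.integers(50, 120), 34)) for _ in range(20)]
y = rng.integers(0, 3, size=len(videos))          # 3 hypothetical action classes
vocab = build_vocabulary(videos, k=50)
X = np.array([encode_video(d, vocab) for d in videos])
svms = train_class_specific_svms(X, y)
scores = {c: clf.decision_function(X[:1])[0] for c, clf in svms.items()}
print(max(scores, key=scores.get))                # predicted class for first video
```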
Evaluating a bag-of-visual features approach using spatio-temporal features for action recognition
Computers & Electrical Engineering, 2018
Human action recognition (HAR) has emerged as a core research domain for video understanding and analysis, thus attracting many researchers. Although significant results have been achieved in simple scenarios, HAR is still a challenging task due to issues associated with view independence, occlusion and inter-class variation observed in realistic scenarios. In previous research efforts, the classical bag of visual words approach along with its variations has been widely used. In this paper, we propose a Dynamic Spatio-Temporal Bag of Expressions (D-STBoE) model for human action recognition without compromising the strengths of the classical bag of visual words approach. Expressions are formed based on the density of a spatio-temporal cube of a visual word. To handle inter-class variation, we use class-specific visual word representation for visual expression generation. In contrast to the Bag of Expressions (BoE) model, the formation of visual expressions is based on the density of spatio-temporal cubes built around each visual word, as constructing neighborhoods with a fixed number of neighbors could include non-relevant information, making a visual expression less discriminative in scenarios with occlusion and changing viewpoints. Thus, the proposed approach makes the model more robust to the occlusion and changing viewpoint challenges present in realistic scenarios. Furthermore, we train a multi-class Support Vector Machine (SVM) for classifying bags of expressions into action classes. Comprehensive experiments on four publicly available datasets (KTH, UCF Sports, UCF11 and UCF50) show that the proposed model outperforms existing state-of-the-art human action recognition methods in terms of accuracy, achieving 99.21%, 98.60%, 96.94% and 94.10%, respectively.
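The central step above is forming expressions from the density of a spatio-temporal cube around each visual-word occurrence rather than from a fixed number of nearest neighbours. The following sketch shows one plausible reading of that density computation; the cube radii, the point layout (x, y, t, word id) and the name `cube_density` are assumptions, not the authors' code.

```python
import numpy as np

def cube_density(points, idx, rx=20, ry=20, rt=10):
    """Collect word occurrences inside a spatio-temporal cube centred on point idx.

    points: array of shape (N, 4) with columns (x, y, t, word_id).
    """
    centre = points[idx, :3]
    diff = np.abs(points[:, :3] - centre)
    inside = (diff[:, 0] <= rx) & (diff[:, 1] <= ry) & (diff[:, 2] <= rt)
    inside[idx] = False                       # exclude the centre point itself
    return points[inside, 3].astype(int)      # word ids found inside the cube

# Toy example: 200 interest points with random positions and word labels
rng = np.random.default_rng(1)
pts = np.column_stack([rng.uniform(0, 320, 200),
                       rng.uniform(0, 240, 200),
                       rng.uniform(0, 100, 200),
                       rng.integers(0, 50, 200)])
neighbour_words = cube_density(pts, idx=0)
print(len(neighbour_words), "co-occurring words in the cube around point 0")
```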
An evaluation of bags-of-words and spatio-temporal shapes for action recognition
2011 IEEE Workshop on Applications of Computer Vision (WACV), 2011
Bags-of-Visual-Words (BoW) and Spatio-Temporal Shapes (STS) are two very popular approaches for action recognition from video. The former (BoW) is an unstructured global representation of videos which is built using a large set of local features. The latter (STS) uses a single feature located on a region of interest (where the actor is) in the video. Despite the popularity of these methods, no comparison between them has been done. Also, given that BoW and STS differ intrinsically in terms of context inclusion and globality/locality of operation, an appropriate evaluation framework has to be designed carefully. This paper compares these two approaches using four different datasets with varied degrees of space-time specificity of the actions and varied relevance of the contextual background. We use the same local feature extraction method and the same classifier for both approaches. Further to BoW and STS, we also evaluate novel variations of BoW constrained in time or space. We observe that the STS approach leads to better results in all datasets whose background is of little relevance to action classification.
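The controlled comparison described above keeps the features and classifier fixed and only swaps the representation (global BoW histogram versus a single ROI-centred feature). A minimal sketch of such a comparison with scikit-learn follows; the matrices `X_bow` and `X_sts` are random placeholders for the two encodings, not features from the paper's pipeline.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n_videos = 120
X_bow = rng.random((n_videos, 100))            # global BoW histogram per video
X_sts = rng.random((n_videos, 64))             # single ROI-centred feature per video
y = rng.integers(0, 4, size=n_videos)          # 4 hypothetical action classes

clf = SVC(kernel="linear")                     # identical classifier for both
for name, X in [("BoW", X_bow), ("STS", X_sts)]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean 5-fold accuracy = {acc:.3f}")
```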
Spatio-temporal action localization and detection for human action recognition in big dataset
Journal of Visual Communication and Image Representation, 2016
Human action recognition is still attracting the computer vision research community due to its various applications. However, despite the variety of methods proposed to solve this problem, some issues still need to be addressed. In this paper, we present a human action detection and recognition process for large datasets based on interest point trajectories. In order to detect moving humans in moving fields of view, spatio-temporal action detection is performed based on optical flow and dense speeded-up robust features (SURF). Then, a video description based on a fusion process that combines motion, trajectory and visual descriptors is proposed. Features within each bounding box are extracted by exploiting the bag-of-words approach. Finally, a support vector machine is employed to classify the detected actions. Experimental results on the challenging UCF101, KTH and HMDB51 benchmark datasets reveal that the proposed technique achieves better performance compared to some of the existing state-of-the-art action recognition approaches.
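The detection step above localizes moving humans in moving fields of view from optical flow combined with dense SURF. The sketch below illustrates only the flow-thresholding part with OpenCV (SURF itself requires an opencv-contrib non-free build, so it is omitted); the magnitude threshold and the frame names in the usage comment are assumptions.

```python
import cv2
import numpy as np

def moving_region(prev_gray, curr_gray, mag_thresh=2.0):
    """Return a bounding box (x, y, w, h) around pixels with large optical flow."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)
    mask = (mag > mag_thresh).astype(np.uint8)
    pts = cv2.findNonZero(mask)
    if pts is None:                            # no significant motion in this pair
        return None
    return cv2.boundingRect(pts)

# Usage with two consecutive grayscale frames (names chosen here for illustration):
# box = moving_region(frame_prev, frame_curr)
# if box is not None:
#     x, y, w, h = box   # region in which SURF/BoW features would be extracted
```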
Spatio-temporal SURF for Human Action Recognition
Lecture Notes in Computer Science, 2013
In this paper, we propose a new spatio-temporal descriptor called ST-SURF, based on a novel combination of the speeded-up robust feature (SURF) and optical flow. The Hessian detector is employed to find all interest points. To reduce the computation time, we propose a new methodology for segmenting videos into Frame Packets (FPs), based on interest point trajectory tracking. We consider only the descriptors of moving interest points to generate a robust and discriminative codebook based on K-means clustering. We use a standard bag-of-visual-words SVM approach for action recognition. For evaluation, experiments are carried out on the KTH and UCF Sports datasets. The designed ST-SURF shows promising results: on the KTH dataset, the proposed method achieves an accuracy of 88.2%, which is equivalent to the state of the art, and on the more realistic UCF Sports dataset, our method surpasses the best reported results for space-time descriptors with the Hessian detector, reaching 80.7%.
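A key idea above is keeping only descriptors of moving interest points before building the K-means codebook. The sketch below approximates that filtering with OpenCV's KLT tracker and ORB descriptors (the paper uses the Hessian detector and SURF, which needs an opencv-contrib non-free build); the displacement threshold, codebook size and `frame_pairs` variable are assumptions.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def moving_point_descriptors(prev_gray, curr_gray, min_disp=1.5):
    """Track corners between two frames and describe only the ones that moved."""
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=300,
                                 qualityLevel=0.01, minDistance=7)
    if p0 is None:
        return np.empty((0, 32), dtype=np.uint8)
    p1, st, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None)
    disp = np.linalg.norm((p1 - p0).reshape(-1, 2), axis=1)
    keep = (st.ravel() == 1) & (disp > min_disp)
    kps = [cv2.KeyPoint(float(x), float(y), 7) for x, y in p1.reshape(-1, 2)[keep]]
    orb = cv2.ORB_create()
    _, desc = orb.compute(curr_gray, kps)
    return desc if desc is not None else np.empty((0, 32), dtype=np.uint8)

# Codebook from the pooled moving-point descriptors of many frame pairs:
# all_desc = np.vstack([moving_point_descriptors(a, b) for a, b in frame_pairs])
# codebook = KMeans(n_clusters=200, n_init=10).fit(all_desc.astype(float))
```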
Improving bag-of-features action recognition with non-local cues
Procedings of the British Machine Vision Conference 2010, 2010
Local space-time features have recently shown promising results within the Bag-of-Features (BoF) approach to action recognition in video. Pure local features and descriptors, however, provide only limited discriminative power, implying ambiguity among features and sub-optimal classification performance. In this work, we propose to disambiguate local space-time features and to improve action recognition by integrating additional non-local cues with the BoF representation. For this purpose, we decompose video into region classes and augment local features with corresponding region-class labels. In particular, we investigate unsupervised and supervised video segmentation using (i) motion-based foreground segmentation, (ii) person detection, (iii) static action detection and (iv) object detection. While such segmentation methods might be imperfect, they provide complementary region-level information to local features. We demonstrate how this information can be integrated with BoF representations in a kernel-combination framework. We evaluate our method on the recent and challenging Hollywood-2 action dataset and demonstrate significant improvements.
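The integration step above combines region-augmented BoF representations in a kernel-combination framework. Below is a minimal sketch of combining per-channel chi-square kernels into one precomputed SVM kernel with scikit-learn; the equal-weight averaging, channel contents and gamma value are assumptions rather than the paper's exact scheme.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def combined_kernel(channels_a, channels_b, gamma=0.5):
    """Average chi-square kernels over feature channels (e.g., per region class)."""
    kernels = [chi2_kernel(Xa, Xb, gamma=gamma)
               for Xa, Xb in zip(channels_a, channels_b)]
    return np.mean(kernels, axis=0)

# Toy data: two channels of BoF histograms (e.g., 'person' and 'background' regions)
rng = np.random.default_rng(3)
train = [rng.random((60, 100)), rng.random((60, 100))]
test = [rng.random((10, 100)), rng.random((10, 100))]
y_train = rng.integers(0, 2, size=60)

K_train = combined_kernel(train, train)
K_test = combined_kernel(test, train)
clf = SVC(kernel="precomputed").fit(K_train, y_train)
print(clf.predict(K_test))
```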
Human Action Recognition Based on Bag-of-Words
Iraqi Journal of Science
Human action recognition has gained popularity because of its wide applicability, such as in patient monitoring systems, surveillance systems, and a wide diversity of systems involving interactions between people and electronic devices, including human-computer interfaces. The proposed method includes sequential stages of object segmentation, feature extraction, action detection and then action recognition. Achieving effective recognition of human actions from different features of unconstrained videos is a challenging task due to camera motion, cluttered backgrounds, occlusions, the complexity of human movements, and the variety with which the same action is performed by distinct subjects. Thus, the proposed method overcomes such problems by fusing features to develop a powerful human action descriptor. This descriptor is modified to create a visual word vocabulary (or codebook) which yields a Bag-of-Words representation. The True Positive Rate (TPR) and False Positive Rate (FPR) ...
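Since the abstract breaks off while introducing the True Positive Rate and False Positive Rate, here is a small, generic illustration of how those two rates are computed from a binary confusion matrix; the example labels are made up and unrelated to the paper's data.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]      # 1 = action present, 0 = absent
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)                    # True Positive Rate (recall)
fpr = fp / (fp + tn)                    # False Positive Rate
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```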
Action Recognition Using Spatial-Temporal Context
2010 20th International Conference on Pattern Recognition, 2010
The spatial-temporal local features and the bag of words representation have been widely used in the action recognition field. However, this framework usually neglects the internal spatial-temporal relations between video-words, resulting in ambiguity in the action recognition task, especially for videos "in the wild". In this paper, we solve this problem by utilizing the volumetric context around a video-word. Here, a local histogram of the video-word distribution is calculated, which is referred to as the "context" and further clustered into contextual words.
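The "context" described above is a local histogram of video-word labels collected in a volume around each video-word, which is then clustered into contextual words. A small sketch of that two-stage quantization is given below; the volume radii, vocabulary size and number of contextual words are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def context_histograms(points, n_words, rx=30, ry=30, rt=15):
    """For each point (x, y, t, word_id), histogram the word labels of its
    spatio-temporal neighbours; these histograms are the 'context' features."""
    hists = np.zeros((len(points), n_words))
    for i, (x, y, t, _) in enumerate(points):
        diff = np.abs(points[:, :3] - np.array([x, y, t]))
        inside = (diff[:, 0] <= rx) & (diff[:, 1] <= ry) & (diff[:, 2] <= rt)
        inside[i] = False
        hists[i] = np.bincount(points[inside, 3].astype(int), minlength=n_words)
    return hists

rng = np.random.default_rng(4)
pts = np.column_stack([rng.uniform(0, 320, 300), rng.uniform(0, 240, 300),
                       rng.uniform(0, 120, 300), rng.integers(0, 50, 300)])
ctx = context_histograms(pts, n_words=50)
contextual_words = KMeans(n_clusters=20, n_init=10).fit_predict(ctx)
print(contextual_words[:10])            # contextual-word label per interest point
```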
Spatio-temporal action localization for human action recognition in large dataset
Video Surveillance and Transportation Imaging Applications 2015, 2015
Human action recognition has drawn much attention in the field of video analysis. In this paper, we develop a human action detection and recognition process based on the tracking of interest point (IP) trajectories. A pre-processing step that performs spatio-temporal action detection is proposed. This step uses optical flow along with dense speeded-up robust features (SURF) in order to detect and track moving humans in moving fields of view. The video description step is based on a fusion process that combines displacement and spatio-temporal descriptors. Experiments are carried out on the large UCF-101 dataset. Experimental results reveal that the proposed techniques achieve better performance compared to many existing state-of-the-art action recognition approaches.
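The description step above fuses displacement and spatio-temporal descriptors into a single video representation. The sketch below shows one common way such fusion can be done, per-block L2 normalization followed by concatenation; the block names and dimensions are placeholders, not the paper's exact descriptor layout.

```python
import numpy as np

def fuse_descriptors(blocks, eps=1e-8):
    """L2-normalize each descriptor block, then concatenate into one vector."""
    normed = [b / (np.linalg.norm(b) + eps) for b in blocks]
    return np.concatenate(normed)

# Hypothetical per-trajectory blocks: displacement, appearance-like, motion-like
rng = np.random.default_rng(5)
displacement = rng.random(30)
appearance = rng.random(96)
motion = rng.random(108)
fused = fuse_descriptors([displacement, appearance, motion])
print(fused.shape)                      # (234,) fused descriptor fed to BoW/SVM
```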
Sift Accordion: A Space-Time Descriptor Applied To Human Action Recognition
2011
Recognizing human actions from videos is an active field of research in computer vision and pattern recognition. Human activity recognition has many potential applications such as video surveillance, human machine interaction, sports video retrieval and robot navigation. Currently, local descriptors and bag of visual words models achieve state-of-the-art performance for human action recognition. The main challenge in feature description is how to efficiently represent local motion information. Most previous works focus on extending 2D local descriptors to 3D ones in order to describe local information around every interest point. In this paper, we propose a new spatio-temporal descriptor based on a space-time description of moving points. Our description is based on an Accordion representation of video, which is well suited to recognizing human actions from 2D local descriptors without the need for 3D extensions. We use the bag of words approach to represent videos. We quantify...
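The Accordion idea above rearranges a moving point's temporal neighbourhood into a 2D image so that plain 2D descriptors can capture motion without 3D extensions. The sketch below approximates this by concatenating patches tracked along a point's trajectory into one wide image and describing it with 2D SIFT; the patch size, trajectory handling and the choice of SIFT are assumptions about how such a representation could be built, not the paper's exact construction.

```python
import cv2
import numpy as np

def accordion_patch(frames, trajectory, half=8):
    """Stack the patch around a tracked point from each frame side by side."""
    strips = []
    for frame, (x, y) in zip(frames, trajectory):
        x, y = int(round(x)), int(round(y))
        patch = frame[max(y - half, 0):y + half, max(x - half, 0):x + half]
        if patch.shape == (2 * half, 2 * half):   # skip patches cut off at borders
            strips.append(patch)
    return np.hstack(strips) if strips else None

def describe_accordion(accordion):
    """Apply an ordinary 2D descriptor (SIFT) to the space-time accordion image."""
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(accordion, None)
    return desc

# Usage (grayscale uint8 frames and a tracked (x, y) trajectory are assumed):
# acc = accordion_patch(frames, trajectory)
# if acc is not None:
#     descriptors = describe_accordion(acc)
```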