Action recognition using bag of features extracted from a beam of trajectories

Evaluating a bag-of-visual features approach using spatio-temporal features for action recognition

Computers & Electrical Engineering, 2018

Human action recognition (HAR) has emerged as a core research domain for video understanding and analysis, thus attracting many researchers. Although significant results have been achieved in simple scenarios, HAR is still a challenging task due to issues associated with view independence, occlusion and inter-class variation observed in realistic scenarios. In previous research efforts, the classical bag of visual words approach along with its variations has been widely used. In this paper, we propose a Dynamic Spatio-Temporal Bag of Expressions (D-STBoE) model for human action recognition without compromising the strengths of the classical bag of visual words approach. Expressions are formed based on the density of a spatio-temporal cube of a visual word. To handle inter-class variation, we use class-specific visual word representations for visual expression generation. In contrast to the Bag of Expressions (BoE) model, the formation of visual expressions is based on the density of spatio-temporal cubes built around each visual word, since constructing neighborhoods with a fixed number of neighbors can include irrelevant information, making a visual expression less discriminative in scenarios with occlusion and changing viewpoints. The proposed approach thus makes the model more robust to the occlusion and viewpoint-change challenges present in realistic scenarios. Furthermore, we train a multi-class Support Vector Machine (SVM) to classify bags of expressions into action classes. Comprehensive experiments on four publicly available datasets (KTH, UCF Sports, UCF11 and UCF50) show that the proposed model outperforms existing state-of-the-art human action recognition methods, achieving accuracies of 99.21%, 98.60%, 96.94% and 94.10%, respectively.
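The D-STBoE construction itself is not spelled out here, but it extends the classical bag-of-visual-words baseline, which can be sketched in a few lines. In the sketch below, the function names, codebook size and SVM parameters are illustrative assumptions, not taken from the paper: local spatio-temporal descriptors are clustered into a vocabulary with k-means, each video becomes a normalized word histogram, and a multi-class linear SVM is trained on the histograms.

```python
# Minimal bag-of-visual-words baseline for action recognition (illustrative
# sketch; descriptor extraction is stubbed out -- the paper builds
# expressions on top of a pipeline of this shape).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def build_codebook(descriptor_sets, k=1000, seed=0):
    """Cluster all local descriptors into k visual words."""
    all_desc = np.vstack(descriptor_sets)
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(all_desc)

def encode(codebook, descriptors):
    """Encode one video as an L1-normalized histogram of visual words."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_bovw_svm(descriptor_sets, labels, k=1000):
    """descriptor_sets: list of (n_i, d) arrays, one per training video."""
    codebook = build_codebook(descriptor_sets, k)
    X = np.array([encode(codebook, d) for d in descriptor_sets])
    clf = LinearSVC(C=100.0).fit(X, labels)  # one-vs-rest multi-class SVM
    return codebook, clf
```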

Action recognition via local descriptors and holistic features

2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2009

In this paper we propose a unified action recognition framework fusing local descriptors and holistic features. The motivation is that local descriptors and holistic features emphasize different aspects of actions and suit different types of action databases. The proposed unified framework is based on frame differencing, bag-of-words and feature fusion. We extract two kinds of local descriptors, i.e. 2D and 3D SIFT feature descriptors, both based on 2D SIFT interest points. We apply Zernike moments to extract two kinds of holistic features: one is based on single frames and the other on the motion energy image. We perform action recognition experiments on the KTH and Weizmann databases using Support Vector Machines. We apply the leave-one-out and pseudo leave-N-out setups and compare our proposed approach with state-of-the-art results. Experiments show that our proposed approach is effective. Compared with other approaches, ours is more robust, more versatile, easier to compute and simpler to understand.
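The fusion step itself reduces to concatenating independently normalized feature blocks before classification. A minimal sketch of that idea follows; the per-block weights and names are my assumptions, not the paper's scheme.

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    return v / (np.linalg.norm(v) + eps)

def fuse_features(bow_hist, holistic_feats, w_local=1.0, w_holistic=1.0):
    """Early fusion: normalize each block separately, then concatenate,
    so neither representation dominates purely by scale or dimensionality."""
    return np.concatenate([w_local * l2_normalize(bow_hist),
                           w_holistic * l2_normalize(holistic_feats)])
```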

Action recognition by dense trajectories

CVPR 2011, 2011

Feature trajectories have been shown to be efficient for representing videos. Typically, they are extracted using the KLT tracker or by matching SIFT descriptors between frames. However, both the quality and the quantity of these trajectories are often insufficient. Inspired by the recent success of dense sampling in image classification, we propose an approach that describes videos by dense trajectories. We sample points densely in each frame and track them based on displacement information from a dense optical flow field. Given a state-of-the-art optical flow algorithm, our trajectories are robust to fast irregular motions as well as shot boundaries. Additionally, dense trajectories cover the motion information in videos well.
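The core tracking step can be sketched with OpenCV's dense Farneback flow. Note the assumptions: the original work uses a different flow algorithm and median-filters the flow field, and the grid spacing, trajectory length and function names below are illustrative only.

```python
import cv2
import numpy as np

def track_dense_points(frames, step=5, traj_len=15):
    """Seed points on a regular grid and advect them with dense optical flow.
    frames: list of BGR images. Returns finished trajectories as (x, y) lists."""
    h, w = frames[0].shape[:2]
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    trajs = [[(float(x), float(y))] for x, y in zip(xs.ravel(), ys.ravel())]
    done = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        alive = []
        for t in trajs:
            x, y = t[-1]
            dx, dy = flow[int(y), int(x)]       # displacement at current point
            x, y = x + dx, y + dy
            if 0 <= x < w and 0 <= y < h:       # points leaving the frame are dropped
                t.append((x, y))
                (done if len(t) > traj_len else alive).append(t)
        trajs = alive                            # (no reseeding in this sketch)
        prev = gray
    return done
```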

Trajectory feature fusion for human action recognition

This paper addresses the problem of human action detection/recognition by investigating interest point (IP) trajectory cues and by reducing undesirable small camera motion. We first detect Speeded-Up Robust Features (SURF) to segment the video into frame volumes (FV) that contain small actions. This segmentation relies on IP trajectory tracking. Then, for each FV, we extract the optical flow of every detected SURF point. Finally, a parametrization of the optical flow yields displacement segments. These features are concatenated into a trajectory feature that describes the trajectory of an IP across a FV. We reduce the impact of camera motion by considering only moving IPs beyond a minimum motion angle and by using motion boundary histograms (MBH). Feature-fusion-based action recognition is performed to generate a robust and discriminative codebook using K-means clustering. We employ a bag-of-visual-words Support Vector Machine (SVM) approach for the learning/testing step. Through an extensive experimental evaluation carried out on the challenging UCF Sports dataset, we show the efficiency of the proposed method by achieving 83.5% accuracy.
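MBH, used here to suppress camera motion, is computed from the spatial derivatives of each optical-flow component, so constant (camera-induced) flow cancels out. A minimal sketch follows; the bin count and the whole-field (rather than per-cell) histogram are my simplifying assumptions.

```python
import numpy as np

def mbh_histogram(flow_component, n_bins=8):
    """Orientation histogram of the gradient of one flow component (MBHx or MBHy).
    Constant flow has zero gradient, so uniform camera motion contributes nothing."""
    gy, gx = np.gradient(flow_component)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (hist.sum() + 1e-8)

def mbh_descriptor(flow):
    """flow: (h, w, 2) dense optical flow field -> concatenated MBHx and MBHy."""
    return np.concatenate([mbh_histogram(flow[..., 0]),
                           mbh_histogram(flow[..., 1])])
```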

Exploiting the Kinematic of the Trajectories of the Local Descriptors to Improve Human Action Recognition

Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2016

This paper presents a video representation that exploits the properties of the trajectories of local descriptors in human action videos. We use the spatio-temporal information carried by the trajectories to extract kinematic properties: the tangent vector, normal vector, bi-normal vector and curvature. The results show that the proposed method yields results comparable to those of state-of-the-art methods, while outperforming the compared methods in terms of time complexity.
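Treating a trajectory as a space-time curve (x(t), y(t), t), the listed kinematic properties follow from finite-difference derivatives via the Frenet-Serret frame. A sketch of the tangent and curvature computation is below; the discretization is my assumption and the paper's exact formulation may differ.

```python
import numpy as np

def trajectory_kinematics(points):
    """points: (n, 3) array of (x, y, t) samples along one trajectory.
    Returns unit tangents and curvature at each sample."""
    d1 = np.gradient(points, axis=0)        # first derivative  (velocity)
    d2 = np.gradient(d1, axis=0)            # second derivative (acceleration)
    speed = np.linalg.norm(d1, axis=1, keepdims=True)
    tangent = d1 / np.maximum(speed, 1e-8)  # unit tangent vector T
    # curvature = |r' x r''| / |r'|^3 for a 3D space curve;
    # bi-normal B = (r' x r'')/|r' x r''| and normal N = B x T complete the frame
    cross = np.cross(d1, d2)
    curvature = np.linalg.norm(cross, axis=1) / np.maximum(speed[:, 0] ** 3, 1e-8)
    return tangent, curvature
```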

An evaluation of bags-of-words and spatio-temporal shapes for action recognition

2011 IEEE Workshop on Applications of Computer Vision (WACV), 2011

Bags-of-Visual-Words (BoW) and Spatio-Temporal Shapes (STS) are two very popular approaches for action recognition from video. The former (BoW) is an unstructured global representation of videos that is built using a large set of local features. The latter (STS) uses a single feature located on a region of interest (where the actor is) in the video. Despite the popularity of these methods, no comparison between them has been done. Moreover, given that BoW and STS differ intrinsically in terms of context inclusion and globality/locality of operation, an appropriate evaluation framework has to be designed carefully. This paper compares the two approaches using four different datasets with varying degrees of space-time specificity of the actions and varying relevance of the contextual background. We use the same local feature extraction method and the same classifier for both approaches. Beyond BoW and STS, we also evaluate novel variations of BoW constrained in time or space. We observe that the STS approach leads to better results on all datasets whose background is of little relevance to action classification.
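A time-constrained BoW variant of the kind evaluated here can be sketched by partitioning a video into temporal segments and concatenating per-segment histograms; the segment count and names below are my assumptions.

```python
import numpy as np

def temporal_bow(words, frame_ids, n_frames, n_words, n_segments=3):
    """Concatenate one visual-word histogram per temporal segment, so the
    representation keeps coarse ordering information that plain BoW discards.
    words: (n,) visual-word ids; frame_ids: (n,) frame index of each feature."""
    hists = []
    for s in range(n_segments):
        lo, hi = s * n_frames / n_segments, (s + 1) * n_frames / n_segments
        mask = (frame_ids >= lo) & (frame_ids < hi)
        h = np.bincount(words[mask], minlength=n_words).astype(float)
        hists.append(h / max(h.sum(), 1.0))
    return np.concatenate(hists)
```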

Human Action Recognition using Ensemble of Shape, Texture and Motion features

2018

Even though many approaches have been proposed for Human Action Recognition, challenges like illumination variation, occlusion, camera view and background clutter keep this topic open for further research. Devising a robust descriptor for representing an action that gives good classification accuracy is a demanding task. In this work, a new feature descriptor is introduced, named the 'Spatio-Temporal Shape-Texture-Motion' (STSTM) descriptor. The STSTM descriptor uses a hybrid approach combining local and global features. Salient points are extracted using the Spatio-Temporal Interest Points (STIP) algorithm and are further encoded using the Discrete Wavelet Transform (DWT). The DWT coefficients thus extracted represent the local motion information of the object. Shape and texture features are extracted using the Histogram of Oriented Gradients (HOG) and Local Binary Pattern (LBP) algorithms, respectively. To achieve dimensionality reduction, Principal Component Analysis is applied separately to t...
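The shape/texture half of this pipeline maps directly onto standard library calls; the following is a minimal sketch with scikit-image and scikit-learn, where the parameter values are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern
from sklearn.decomposition import PCA

def shape_texture_features(gray_frame):
    """HOG captures shape (gradient structure), LBP captures texture."""
    shape = hog(gray_frame, orientations=9,
                pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    lbp = local_binary_pattern(gray_frame, P=8, R=1, method='uniform')
    # uniform LBP with P=8 yields codes 0..9, hence 10 histogram bins
    texture, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    return shape, texture

# Dimensionality reduction is applied per feature type before fusion, e.g.:
# shape_reduced = PCA(n_components=64).fit_transform(shape_matrix)
```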

Dense Trajectories and Motion Boundary Descriptors for Action Recognition

International Journal of Computer Vision, 2013

This paper introduces a video representation based on dense trajectories and motion boundary descriptors. Trajectories capture the local motion information of the video. A dense representation guarantees good coverage of foreground motion as well as of the surrounding context. A state-of-the-art optical flow algorithm enables a robust and efficient extraction of dense trajectories. As descriptors we extract features aligned with the trajectories to characterize shape (point coordinates), appearance (histograms of oriented gradients) and motion (histograms of optical flow). Additionally, we introduce a descriptor based on motion boundary histograms (MBH), which rely on differential optical flow. The MBH descriptor is shown to consistently outperform other state-of-the-art descriptors, in particular on real-world videos that contain a significant amount of camera motion. We evaluate our video representation in the context of action classification on nine datasets, namely KTH, YouTube, Hollywood2, UCF Sports, IXMAS, UIUC, Olympic Sports, UCF50 and HMDB51. On all datasets our approach outperforms current state-of-the-art results.
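The trajectory-shape descriptor mentioned above (point coordinates) is, in the published formulation, the sequence of frame-to-frame displacements normalized by the sum of their magnitudes; a minimal sketch (the function name is mine):

```python
import numpy as np

def trajectory_shape(points, eps=1e-8):
    """points: (L+1, 2) array of (x, y) positions of one tracked point.
    Returns the flattened, magnitude-normalized displacement sequence."""
    disp = np.diff(points, axis=0)                   # (L, 2) displacements
    norm = np.linalg.norm(disp, axis=1).sum() + eps  # total path length
    return (disp / norm).ravel()                     # scale-invariant shape
```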

Human Action Recognition Based on Spatio-temporal Features

Lecture Notes in Computer Science, 2009

This paper studies the technique of human action recognition using spatio-temporal features. We concentrate on the motion and shape patterns produced by different actions. The motion patterns generated by the actions are captured by optical flow, while the shape information is obtained from Viola-Jones features. Spatial features comprise motion and shape information from a single frame; spatio-temporal descriptor patterns are then formed to improve accuracy over spatial features alone. AdaBoost learns and classifies the descriptor patterns. We report the accuracy of our system on the standard Weizmann dataset.
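The final learning step is standard boosting over descriptor patterns. A minimal sketch with scikit-learn is below; the original work predates this library, so the weak-learner choice and round count are assumptions that only mirror the idea.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def train_adaboost(X, y, n_rounds=200):
    """X: (n_samples, n_features) spatio-temporal descriptor patterns;
    y: action labels. Boosts decision stumps, as in classic AdaBoost."""
    weak = DecisionTreeClassifier(max_depth=1)
    return AdaBoostClassifier(estimator=weak, n_estimators=n_rounds).fit(X, y)
```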

Recognizing Human Actions: A Local SVM Approach

2004

Local space-time features capture local events in video and can be adapted to the size, frequency and velocity of moving patterns. In this paper we demonstrate how such features can be used for recognizing complex motion patterns. We construct video representations in terms of local space-time features and integrate such representations with SVM classification schemes for recognition. For the purpose of evaluation we introduce a new video database containing 2391 sequences of six human actions performed by 25 people in four different scenarios. The presented results of action recognition justify the proposed method and demonstrate its advantage over related approaches to action recognition.
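Histogram-based video representations of this kind are commonly paired with an SVM using a chi-squared kernel. The sketch below shows that common pairing only as an illustration; the paper's actual local kernels differ, and the function names and parameters are my assumptions.

```python
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_chi2_svm(X_train, y_train, gamma=0.5):
    """Precompute the chi-squared kernel over (non-negative) histogram
    features and train an SVM on the resulting Gram matrix."""
    K = chi2_kernel(X_train, gamma=gamma)
    return SVC(kernel='precomputed', C=10.0).fit(K, y_train)

def predict_chi2_svm(clf, X_train, X_test, gamma=0.5):
    # test kernel is computed against the training set
    return clf.predict(chi2_kernel(X_test, X_train, gamma=gamma))
```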