Author manuscript, published in "CVPR - IEEE Conference on Computer Vision and Pattern Recognition (2013)": Better exploiting motion for better action recognition
Related papers
Better Exploiting Motion for Better Action Recognition
2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013
Several recent works on action recognition have attested to the importance of explicitly integrating motion characteristics into the video description. This paper establishes that adequately decomposing visual motion into dominant and residual motions, both in the extraction of the space-time trajectories and in the computation of descriptors, significantly improves action recognition algorithms. We then design a new motion descriptor, the DCS descriptor, based on differential motion scalar quantities: divergence, curl and shear features. It captures additional information on the local motion patterns, enhancing the results. Finally, applying the VLAD coding technique recently proposed in image retrieval provides a substantial improvement for action recognition. Our three contributions are complementary and together outperform all reported results by a significant margin on three challenging datasets, namely Hollywood 2, HMDB51 and Olympic Sports.
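The divergence, curl, and shear quantities behind the DCS descriptor are first-order differentials of the optical flow. Below is a minimal numpy sketch of how they can be computed per pixel with finite differences; the paper's exact discretization, cell layout, and histogram encoding are not reproduced here:

```python
import numpy as np

def kinematic_features(u, v):
    """Per-pixel divergence, curl, and shear magnitude of a flow field.

    u, v : 2-D arrays holding the horizontal and vertical flow components.
    A sketch of the scalar quantities behind the DCS descriptor; the
    paper's exact discretization and histogram layout may differ.
    """
    # np.gradient returns derivatives along (rows = y, cols = x)
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)

    div = du_dx + dv_dy                    # local expansion / contraction
    curl = dv_dx - du_dy                   # local rotation
    # two hyperbolic terms combined into a shear magnitude
    shear = np.hypot(du_dx - dv_dy, dv_dx + du_dy)
    return div, curl, shear
```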
Improved Motion Description for Action Classification
Frontiers in ICT, 2016
Even though the importance of explicitly integrating motion characteristics into video descriptions has been demonstrated by several recent papers on action classification, our current work concludes that adequately decomposing visual motion into dominant and residual motions, i.e., camera and scene motion, significantly improves action recognition algorithms. This holds true both for the extraction of the space-time trajectories and for the computation of descriptors. We designed a new motion descriptor, the DCS descriptor, which is based on differential motion scalar quantities (divergence, curl, and shear features) and captures additional information on local motion patterns, enhancing the results. Finally, applying the recent VLAD coding technique proposed in image retrieval provides a substantial improvement for action recognition. These findings are complementary to each other, and together they outperformed all previously reported results by a significant margin on three challenging datasets: Hollywood 2, HMDB51, and Olympic Sports, as reported in Jain et al. (2013). These results were further improved by Oneata et al. (2013), Wang and Schmid (2013), and Zhu et al. (2013) through the use of the Fisher vector encoding. We therefore also employ the Fisher vector in this paper, and we further enhance our approach by combining trajectories from both optical flow and compensated flow. We also provide additional details of the DCS descriptor, including a visualization. To extend the evaluation, we added UCF101, a novel dataset with 101 action classes.
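The dominant/residual split can be illustrated by fitting a global parametric motion model to the flow field and subtracting it. The sketch below uses a plain least-squares affine fit as a simplified stand-in for the robust parametric estimation used in the paper, which additionally down-weights outlier pixels:

```python
import numpy as np

def residual_flow(u, v):
    """Split a flow field into dominant (camera) and residual (scene) motion.

    Fits a 6-parameter affine motion model to the whole field by least
    squares and subtracts it. A simplified stand-in for the robust
    parametric estimation used in the paper.
    """
    h, w = u.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # affine model per component: u(x, y) = a0 + a1*x + a2*y (same for v)
    X = np.stack([np.ones(h * w), xs.ravel(), ys.ravel()], axis=1)
    coef_u, *_ = np.linalg.lstsq(X, u.ravel(), rcond=None)
    coef_v, *_ = np.linalg.lstsq(X, v.ravel(), rcond=None)
    dominant_u = (X @ coef_u).reshape(h, w)
    dominant_v = (X @ coef_v).reshape(h, w)
    return u - dominant_u, v - dominant_v   # compensated (residual) flow
```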
Dense Trajectories and Motion Boundary Descriptors for Action Recognition
International Journal of Computer Vision, 2013
This paper introduces a video representation based on dense trajectories and motion boundary descriptors. Trajectories capture the local motion information of the video. A dense representation guarantees good coverage of foreground motion as well as of the surrounding context. A state-of-the-art optical flow algorithm enables a robust and efficient extraction of dense trajectories. As descriptors, we extract features aligned with the trajectories to characterize shape (point coordinates), appearance (histograms of oriented gradients) and motion (histograms of optical flow). Additionally, we introduce a descriptor based on motion boundary histograms (MBH), which rely on differential optical flow. The MBH descriptor is shown to consistently outperform other state-of-the-art descriptors, in particular on real-world videos that contain a significant amount of camera motion. We evaluate our video representation in the context of action classification on nine datasets, namely KTH, YouTube, Hollywood2, UCF sports, IXMAS, UIUC, Olympic Sports, UCF50 and HMDB51. On all datasets our approach outperforms the current state of the art.
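MBH builds orientation histograms on the spatial derivatives of each flow component, so constant camera motion cancels and motion boundaries dominate. A sketch for one flow component over one cell, with an illustrative bin count; real MBH aggregates such histograms over a space-time grid along each trajectory:

```python
import numpy as np

def mbh_histogram(flow_component, n_bins=8):
    """Orientation histogram of the differential of one flow component.

    Gradients of the flow cancel constant (camera) motion and respond
    at motion boundaries. Bin count is illustrative, not the paper's.
    """
    gy, gx = np.gradient(flow_component)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (hist.sum() + 1e-9)
```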
Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2016
This paper presents a video representation that exploits the properties of the trajectories of local descriptors in human action videos. We use the spatio-temporal information carried by the trajectories to extract kinematic properties: the tangent vector, normal vector, bi-normal vector and curvature. The results show that the proposed method is comparable to state-of-the-art methods, while outperforming them in terms of time complexity.
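The kinematic properties named here form the discrete Frenet frame of a trajectory. A sketch under the assumption that trajectories are sampled as (x, y, t) points; the paper's sampling and normalization may differ:

```python
import numpy as np

def frenet_features(traj):
    """Tangent, normal, binormal, and curvature along a trajectory.

    traj : (N, 3) array of (x, y, t) samples. A plain discrete Frenet
    frame with curvature kappa = |r' x r''| / |r'|^3.
    """
    d1 = np.gradient(traj, axis=0)            # first derivative
    d2 = np.gradient(d1, axis=0)              # second derivative
    tangent = d1 / (np.linalg.norm(d1, axis=1, keepdims=True) + 1e-9)
    cross = np.cross(d1, d2)
    binormal = cross / (np.linalg.norm(cross, axis=1, keepdims=True) + 1e-9)
    normal = np.cross(binormal, tangent)
    curvature = (np.linalg.norm(cross, axis=1)
                 / (np.linalg.norm(d1, axis=1) ** 3 + 1e-9))
    return tangent, normal, binormal, curvature
```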
Action Recognition from a Small Number of Frames
In this paper, we present an efficient system for action recognition from very short sequences. For action recognition, the appearance and/or motion information of an action is typically analyzed over a large number of frames, which is often not available if very fast actions (e.g., in sports analysis) have to be analyzed. To overcome this limitation, we propose a method that uses a single-frame representation of actions based on appearance and motion information. In particular, we estimate Histograms of Oriented Gradients (HOGs) for the current frame as well as for the optical flow field. The descriptors thus obtained are then efficiently represented by the coefficients of a Non-negative Matrix Factorization (NMF). Actions are classified using a one-vs-all Support Vector Machine. Since the flow can be estimated from two frames, only two consecutive frames are required for the action analysis at evaluation time. Both the optical flow and the HOGs can be computed very efficiently. In the experiments, we compare the proposed approach to state-of-the-art methods and show that it yields competitive results. In addition, we demonstrate action recognition on beach-volleyball sequences.
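The pipeline (HOG descriptors of frame and flow, NMF coefficients, one-vs-all SVM) can be sketched with scikit-learn. The arrays below are random stand-ins for real HOG features and labels, and all dimensions are illustrative, not the paper's configuration:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.svm import LinearSVC

# Stand-in data: each row is the concatenated appearance-HOG and
# flow-HOG descriptor of one frame pair; labels are action classes.
X_train = np.random.rand(200, 512)            # non-negative, as NMF needs
y_train = np.random.randint(0, 5, 200)

nmf = NMF(n_components=32, init='nndsvda', max_iter=500)
H_train = nmf.fit_transform(X_train)          # non-negative coefficients

clf = LinearSVC()                             # one-vs-rest by default
clf.fit(H_train, y_train)

# At test time only two frames are needed: compute the two HOGs,
# project onto the learned NMF basis, and classify.
X_test = np.random.rand(3, 512)
pred = clf.predict(nmf.transform(X_test))
```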
Action recognition using interest points capturing differential motion information
2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016
Human action recognition has been a challenging task in computer vision because of intra-class variability. State-of-the-art methods have shown good performance for constrained videos but have failed to achieve good results for complex scenes. Reasons for this failure include treating the spatial and temporal dimensions without distinction, as well as not capturing temporal information during feature extraction or video representation. To address these problems, we propose principled changes to an action recognition framework that is based on video interest point (IP) detection, with capturing differential motion as the central theme. First, we propose to detect points with a high curl of the optical flow, which captures relative motion boundaries in a frame. We track these points to form dense trajectories. Second, we discard points on the trajectories that do not represent a change in motion of the same object, yielding temporally localized IPs. Third, we propose a video representation based on the spatio-temporal arrangement of IPs with respect to their neighboring IPs. The proposed approach yields a compact and information-dense representation without using any local descriptor around the detected IPs. It significantly outperforms state-of-the-art methods on the UCF YouTube dataset, which has complex action classes, as well as on the KTH dataset, which has simple action classes.
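The first step, detecting points at relative motion boundaries, amounts to selecting locations where the curl of the optical flow is large. A minimal sketch of that detection step alone; tracking, trajectory pruning, and the IP-arrangement representation are not shown, and `top_k` is an illustrative parameter:

```python
import numpy as np

def high_curl_points(u, v, top_k=200):
    """Pick candidate interest points where |curl| of the flow is largest.

    u, v : 2-D flow components. Returns (x, y) coordinates of the
    top_k strongest curl responses (relative-motion boundaries).
    """
    du_dy, _ = np.gradient(u)
    _, dv_dx = np.gradient(v)
    curl = np.abs(dv_dx - du_dy)
    flat = np.argsort(curl.ravel())[-top_k:]   # indices of top responses
    ys, xs = np.unravel_index(flat, curl.shape)
    return np.stack([xs, ys], axis=1)
```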
Action recognition by matching clustered trajectories of motion vectors
2013
A framework for action representation and recognition based on the description of an action by time series of optical flow motion features is presented. In the learning step, the motion curves representing each action are clustered using Gaussian mixture modeling (GMM). In the recognition step, the optical flow curves of a probe sequence are also clustered using a GMM, and the probe curves are matched to the learned curves using a non-metric similarity function based on the longest common subsequence, which is robust to noise and provides an intuitive notion of similarity between trajectories. Finally, the probe sequence is categorized as the learned action with the maximum similarity using a nearest neighbor classification scheme. Experimental results on common action databases demonstrate the effectiveness of the proposed method.
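A common formulation of the longest-common-subsequence similarity for real-valued curves matches points that fall within a distance threshold. Below is a sketch of that matching, paired with a scikit-learn GMM over stand-in motion curves; the threshold, dimensions, and data are illustrative, not the paper's:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def lcss(a, b, eps=0.5):
    """Longest-common-subsequence similarity for real-valued curves.

    a, b : (N, d) and (M, d) arrays. Points match when closer than eps.
    Returns the match count normalized by the shorter length, a common
    LCSS variant; the paper's exact thresholds may differ.
    """
    n, m = len(a), len(b)
    dp = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if np.linalg.norm(a[i - 1] - b[j - 1]) < eps:
                dp[i, j] = dp[i - 1, j - 1] + 1
            else:
                dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
    return dp[n, m] / min(n, m)

# Cluster stand-in training curves (rows = flattened motion curves)
# with a GMM, then score a probe curve against the cluster means.
curves = np.random.rand(100, 30)
gmm = GaussianMixture(n_components=8, covariance_type='diag').fit(curves)
probe = np.random.rand(30)
scores = [lcss(probe.reshape(-1, 1), c.reshape(-1, 1)) for c in gmm.means_]
```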
Action Recognition in Broadcast Tennis Video
2006
Motion analysis in broadcast sports video is a challenging problem, especially for player action recognition, due to the low resolution of players in the frames. This paper presents a novel approach to recognize the basic player actions in broadcast tennis video, where the player is only about 30 pixels tall. Two research challenges, motion representation and action recognition, are addressed. A new motion descriptor, a group of histograms based on optical flow, is proposed for motion representation. The optical flow here is treated as a spatial pattern of noisy measurements instead of precise pixel displacements. To recognize the action performed by the player, a support vector machine is employed to train the classifier, with the concatenation of histograms forming the input features. The experimental results demonstrate that our method is promising.
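Treating flow as a noisy spatial pattern leads to a grid of magnitude-weighted orientation histograms over the player window, concatenated into the SVM input. A sketch with illustrative grid and bin settings (not the paper's):

```python
import numpy as np

def flow_histograms(u, v, grid=(2, 2), n_bins=8):
    """Group of orientation histograms over a small player window.

    Each grid cell votes its flow orientations, weighted by magnitude,
    into a histogram; the concatenation is the motion descriptor.
    """
    h, w = u.shape
    feats = []
    for gy in range(grid[0]):
        for gx in range(grid[1]):
            cu = u[gy * h // grid[0]:(gy + 1) * h // grid[0],
                   gx * w // grid[1]:(gx + 1) * w // grid[1]]
            cv = v[gy * h // grid[0]:(gy + 1) * h // grid[0],
                   gx * w // grid[1]:(gx + 1) * w // grid[1]]
            ang = np.arctan2(cv, cu) % (2 * np.pi)
            bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
            hist = np.bincount(bins.ravel(),
                               weights=np.hypot(cu, cv).ravel(),
                               minlength=n_bins)
            feats.append(hist / (hist.sum() + 1e-9))
    return np.concatenate(feats)
```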
IET Computer Vision, 2017
This work presents a spatio-temporal motion descriptor that is computed from a spatially-constrained decomposition and applied to online classification and recognition of human activities. The method starts by computing a multi-scale dense optical flow that provides instantaneous velocity information for every pixel without explicit spatial regularization. Potential human actions are detected at each frame as spatially consistent moving regions and marked as Regions of Interest (RoIs). Each of these RoIs is then sequentially partitioned to obtain a spatial representation of small overlapped subregions of different sizes. Each of these region parts is characterized by a set of flow orientation histograms. A particular RoI is then described along time by a set of recursively calculated statistics that collect information from the temporal history of the orientation histograms to form the action descriptor. At any time, the whole descriptor can be extracted and labelled by a previously trained support vector machine. The method was evaluated on three public datasets: (1) the VISOR dataset, used both for global classification of short sequences containing individual actions, a task for which the method reached an average accuracy of 95% (sequence rate), and for recognition of multiple actions in long sequences, achieving an average per-frame accuracy of 92.3%; (2) the KTH dataset, used for global classification of activities; and (3) the UT datasets, used for evaluating the recognition task, obtaining an average accuracy of 80% (frame rate).
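One way to realize "recursively calculated statistics" of orientation histograms is an exponential moving average that can be read out at any frame, which is what makes the descriptor usable online. This is an assumption-laden sketch; the paper's statistics and subregion layout are richer than a single running mean:

```python
import numpy as np

class RunningHistogramStats:
    """Recursively updated mean of per-frame orientation histograms.

    An exponential moving average stands in for the paper's recursive
    statistics; alpha is an illustrative forgetting factor.
    """
    def __init__(self, n_bins=8, alpha=0.1):
        self.mean = np.zeros(n_bins)
        self.alpha = alpha

    def update(self, hist):
        # blend the new frame's histogram into the running descriptor
        self.mean = (1 - self.alpha) * self.mean + self.alpha * hist
        return self.mean   # readable (and classifiable) at any frame
```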
Action recognition by dense trajectories
CVPR, 2011
Feature trajectories have been shown to be efficient for representing videos. Typically, they are extracted using the KLT tracker or by matching SIFT descriptors between frames. However, the quality as well as the quantity of these trajectories is often not sufficient. Inspired by the recent success of dense sampling in image classification, we propose an approach that describes videos by dense trajectories. We sample dense points in each frame and track them based on displacement information from a dense optical flow field. Given a state-of-the-art optical flow algorithm, our trajectories are robust to fast irregular motions as well as shot boundaries. Additionally, dense trajectories cover the motion information in videos well.
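The tracking step moves each sampled point by the flow at its position, with the flow median-filtered to suppress noise, as in the dense trajectories work. A minimal sketch of one tracking step; the kernel size is illustrative:

```python
import numpy as np
from scipy.ndimage import median_filter

def step_trajectories(points, u, v, ksize=3):
    """Advance tracked points one frame using a dense flow field.

    points : (N, 2) float array of (x, y) positions.
    The flow is median-filtered before sampling, which makes the
    tracking robust to local flow noise.
    """
    u_s = median_filter(u, size=ksize)
    v_s = median_filter(v, size=ksize)
    xi = np.clip(np.round(points[:, 0]).astype(int), 0, u.shape[1] - 1)
    yi = np.clip(np.round(points[:, 1]).astype(int), 0, u.shape[0] - 1)
    return points + np.stack([u_s[yi, xi], v_s[yi, xi]], axis=1)
```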