A STATISTICAL MODEL FOR ANNOTATING VIDEOS WITH HUMAN ACTIONS
Related papers
An enhanced method for human action recognition
This paper presents a fast and simple method for human action recognition. The proposed technique relies on detecting interest points using SIFT (scale invariant feature transform) from each frame of the video. A fine-tuning step is used to limit the number of interest points according to the amount of detail. Then the popular Bag of Video Words approach is applied with a new normalization technique, which remarkably improves the results. Finally, a multi-class linear Support Vector Machine (SVM) is utilized for classification. Experiments were conducted on the KTH and Weizmann datasets. The results demonstrate that our approach outperforms most existing methods, achieving accuracies of 97.89% on KTH and 96.66% on Weizmann.
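The pipeline named in this abstract (per-frame SIFT interest points, a Bag-of-Video-Words histogram, and a multi-class linear SVM) can be sketched roughly as below. The 400-word codebook size, the keypoint cap, and the L2 normalisation are illustrative assumptions, not the authors' exact settings.

```python
# Hedged sketch: SIFT descriptors per frame -> k-means codebook ->
# normalised BoW histogram per clip -> linear SVM.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def sift_descriptors(video_path, max_kp_per_frame=100):
    """Collect SIFT descriptors from every frame of one video."""
    sift = cv2.SIFT_create(nfeatures=max_kp_per_frame)  # cap keypoints per frame
    cap = cv2.VideoCapture(video_path)
    descs = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        _, d = sift.detectAndCompute(gray, None)
        if d is not None:
            descs.append(d)
    cap.release()
    return np.vstack(descs) if descs else np.empty((0, 128))

def bow_histogram(descs, codebook):
    """Quantise descriptors against the codebook and L2-normalise the histogram."""
    words = codebook.predict(descs)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-8)

# --- training (train_videos / train_labels are assumed to exist) ---
# all_descs = np.vstack([sift_descriptors(v) for v in train_videos])
# codebook = KMeans(n_clusters=400, n_init=4).fit(all_descs)
# X = np.array([bow_histogram(sift_descriptors(v), codebook) for v in train_videos])
# clf = LinearSVC().fit(X, train_labels)
```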
Automatic Human Action Recognition from Video Using Hidden Markov Model
2015 IEEE 18th International Conference on Computational Science and Engineering, 2015
Posture classification is a key process for evaluating human behavior. Computer vision techniques can play a vital role in automating the overall process; however, occlusions, cluttered environments and illumination changes can make the task difficult. Using multiple cameras and warping known object appearance into the occluded view can solve the occlusion problem. In this paper, we present an automatic human detection and action recognition system using a Hidden Markov Model and Bag of Words. Background subtraction is performed using a Gaussian mixture model. The algorithm is able to perform robust detection in cluttered environments and under severe occlusion. The novelty of this work is the dataset used: a private dataset created for this research at the University of Minho. The experimental results are promising.
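A rough sketch of the two stages this abstract describes is given below: Gaussian-mixture background subtraction to isolate the moving person, then per-class HMMs over frame-level features. The feature vector is reduced here to a toy silhouette area/centroid descriptor; the paper's actual Bag-of-Words features, HMM sizes and thresholds are assumptions for illustration.

```python
import cv2
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumed available (pip install hmmlearn)

def silhouette_features(video_path):
    """Per-frame foreground features from MOG2 background subtraction."""
    mog = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16)
    cap = cv2.VideoCapture(video_path)
    feats = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = mog.apply(frame)
        ys, xs = np.nonzero(mask > 200)          # foreground pixels
        if len(xs) == 0:
            feats.append([0.0, 0.0, 0.0])
        else:
            area = len(xs) / mask.size
            feats.append([area, xs.mean() / mask.shape[1], ys.mean() / mask.shape[0]])
    cap.release()
    return np.array(feats)

# One HMM per action class; classify a test clip by the highest log-likelihood.
# train_clips_by_class = {"walk": [...], "wave": [...]}   # assumed to exist
# models = {}
# for action, clips in train_clips_by_class.items():
#     seqs = [silhouette_features(c) for c in clips]
#     X = np.vstack(seqs); lengths = [len(s) for s in seqs]
#     models[action] = GaussianHMM(n_components=5).fit(X, lengths)
# predict = lambda clip: max(models, key=lambda a: models[a].score(silhouette_features(clip)))
```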
Human Action Recognition Based on Bag-of-Words
Iraqi Journal of Science
Human action recognition has gained popularity because of its wide applicability, such as in patient monitoring systems, surveillance systems, and a wide diversity of systems that involve interactions between people and electronic devices, including human-computer interfaces. The proposed method includes sequential stages of object segmentation, feature extraction, action detection and then action recognition. Obtaining effective recognition of human actions using different features from unconstrained videos is a challenging task due to camera motion, cluttered backgrounds, occlusions, the complexity of human movements, and the variety of ways the same action is performed by distinct subjects. Thus, the proposed method overcomes such problems by using feature fusion to develop a powerful human action descriptor. This descriptor is modified to create a visual word vocabulary (or codebook), which yields a Bag-of-Words representation. The True Positive Rate (TPR) and False Positive Rate (FPR) ...
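One common way to realise the "fusion of features" idea mentioned above is to build a separate visual-word histogram per feature type and concatenate the normalised histograms into one descriptor; the sketch below assumes two modalities (e.g. shape and motion descriptors) and per-modality codebooks, which are illustrative choices rather than the paper's exact design.

```python
import numpy as np
from sklearn.cluster import KMeans

def fused_descriptor(shape_descs, motion_descs, shape_codebook, motion_codebook):
    """Quantise each feature type against its own codebook, then fuse."""
    hists = []
    for descs, cb in ((shape_descs, shape_codebook), (motion_descs, motion_codebook)):
        h = np.bincount(cb.predict(descs), minlength=cb.n_clusters).astype(float)
        hists.append(h / (h.sum() + 1e-8))       # L1-normalise each modality
    return np.concatenate(hists)                  # fused Bag-of-Words descriptor
```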
Machine Recognition of Human Activities: A Survey
IEEE Transactions on Circuits and Systems for Video Technology, 2000
The past decade has witnessed a rapid proliferation of video cameras in all walks of life and has resulted in a tremendous explosion of video content. Several applications such as content-based video annotation and retrieval, highlight extraction and video summarization require recognition of the activities occurring in the video. The analysis of human activities in videos is an area with increasingly important consequences from security and surveillance to entertainment and personal archiving. Several challenges at various levels of processing (robustness against errors in low-level processing, view- and rate-invariant representations at mid-level processing, and semantic representation of human activities at higher-level processing) make this problem hard to solve. In this review paper, we present a comprehensive survey of efforts in the past couple of decades to address the problems of representation, recognition, and learning of human activities from video and related applications. We discuss the problem at two major levels of complexity: 1) "actions" and 2) "activities." "Actions" are characterized by simple motion patterns typically executed by a single human. "Activities" are more complex and involve coordinated actions among a small number of humans. We will discuss several approaches and classify them according to their ability to handle varying degrees of complexity as interpreted above. We begin with a discussion of approaches to model the simplest of action classes, known as atomic or primitive actions, that do not require sophisticated dynamical modeling. Then, methods to model actions with more complex dynamics are discussed. The discussion then leads naturally to methods for higher-level representation of complex activities.
Evaluating a bag-of-visual features approach using spatio-temporal features for action recognition
Computers & Electrical Engineering, 2018
Human action recognition (HAR) has emerged as a core research domain for video understanding and analysis, thus attracting many researchers. Although significant results have been achieved in simple scenarios, HAR is still a challenging task due to issues associated with view independence, occlusion and inter-class variation observed in realistic scenarios. In previous research efforts, the classical bag of visual words approach along with its variations has been widely used. In this paper, we propose a Dynamic Spatio-Temporal Bag of Expressions (D-STBoE) model for human action recognition without compromising the strengths of the classical bag of visual words approach. Expressions are formed based on the density of a spatio-temporal cube of a visual word. To handle inter-class variation, we use class-specific visual word representation for visual expression generation. In contrast to the Bag of Expressions (BoE) model, the formation of visual expressions is based on the density of spatio-temporal cubes built around each visual word, as constructing neighborhoods with a fixed number of neighbors could include non-relevant information, making a visual expression less discriminative in scenarios with occlusion and changing viewpoints. Thus, the proposed approach makes the model more robust to the occlusion and viewpoint-change challenges present in realistic scenarios. Furthermore, we train a multi-class Support Vector Machine (SVM) for classifying bags of expressions into action classes. Comprehensive experiments on four publicly available datasets (KTH, UCF Sports, UCF11 and UCF50) show that the proposed model outperforms existing state-of-the-art human action recognition methods in terms of accuracy, achieving 99.21%, 98.60%, 96.94% and 94.10%, respectively.
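The expression-formation step can be pictured as below: for each interest point, count which visual words fall inside a spatio-temporal cube around it, and pool those co-occurrence counts into a clip-level Bag of Expressions. The cube size, vocabulary size, and the exact pooling are assumptions for illustration, not the paper's parameters.

```python
import numpy as np

def bag_of_expressions(points, words, vocab_size, dx=20, dy=20, dt=10):
    """points: (N, 3) array of (x, y, t); words: (N,) visual-word indices."""
    points = np.asarray(points, dtype=float)
    words = np.asarray(words)
    pooled = np.zeros(vocab_size * vocab_size)
    for i, (x, y, t) in enumerate(points):
        inside = (np.abs(points[:, 0] - x) <= dx) & \
                 (np.abs(points[:, 1] - y) <= dy) & \
                 (np.abs(points[:, 2] - t) <= dt)
        cooc = np.bincount(words[inside], minlength=vocab_size)
        # expression = (centre word, co-occurring word) pairs inside the cube
        pooled[words[i] * vocab_size: (words[i] + 1) * vocab_size] += cooc
    return pooled / (pooled.sum() + 1e-8)
```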
Automatic annotation of human actions in video
2009 IEEE 12th International Conference on Computer Vision, 2009
This paper addresses the problem of automatic temporal annotation of realistic human actions in video using minimal manual supervision. To this end we consider two associated problems: (a) weakly-supervised learning of action models from readily available annotations, and (b) temporal localization of human actions in test videos. To avoid the prohibitive cost of manual annotation for training, we use movie scripts as a means of weak supervision. Scripts, however, provide only implicit, noisy, and imprecise information about the type and location of actions in video. We address this problem with a kernel-based discriminative clustering algorithm that locates actions in the weakly-labeled training data. Using the obtained action samples, we train temporal action detectors and apply them to locate actions in the raw video data. Our experiments demonstrate that the proposed method for weakly-supervised learning of action models leads to significant improvement in action detection. We present detection results for three action classes in four feature length movies with challenging and realistic video data.
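The temporal-localization step mentioned in this abstract amounts to sliding a window over the video and scoring each window with a trained per-action detector; a minimal sketch is below, where `window_features` (window-level BoW features) and the trained classifier `clf` are assumed to exist, and the window size, stride and threshold are illustrative.

```python
import numpy as np

def detect_action(num_frames, window_features, clf, win=60, stride=15, thresh=0.0):
    """Return (start_frame, end_frame, score) for windows the detector fires on."""
    detections = []
    for start in range(0, max(1, num_frames - win), stride):
        feat = window_features(start, start + win)      # e.g. BoW over that span
        score = clf.decision_function([feat])[0]
        if score > thresh:
            detections.append((start, start + win, float(score)))
    return detections
```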
Multi action recognition using machine learning
International journal of health sciences
One of the most exciting and useful computer vision research topics is automated human activity identification. Most existing research, including traditional approaches and classic neural networks, disregards the appearance and motion patterns that matter in video sequences. This paper outlines a system for detecting, recognising and summarising diverse human actions. The multiple-action detection method extracts the silhouettes of human bodies and then uses motion detection and tracking to create a unique sequence for each person. Each of the recovered sequences is then separated into shots, based on the similarity between each pair of frames, so that each shot shows homogeneous activity. The activity is identified by examining the histogram of oriented gradients (HOG) of the frames in each shot's Temporal Difference Map (TDMap), comparing the generated HOG ...
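The shot-level cue this abstract describes can be sketched as follows: accumulate absolute frame differences into a Temporal Difference Map (TDMap) and describe it with a histogram of oriented gradients. The HOG parameters and the normalisation step are assumptions, not values taken from the paper.

```python
import cv2
import numpy as np
from skimage.feature import hog  # assumed available (pip install scikit-image)

def tdmap_hog(frames):
    """frames: list of grayscale frames (2-D uint8 arrays) from one shot."""
    tdmap = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        tdmap += cv2.absdiff(curr, prev).astype(np.float32)   # accumulate motion
    tdmap = cv2.normalize(tdmap, None, 0, 1, cv2.NORM_MINMAX)  # scale to [0, 1]
    return hog(tdmap, orientations=9, pixels_per_cell=(16, 16), cells_per_block=(2, 2))
```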
Action Recognition using High-Level Action Units
Vision-based human action recognition is the process of labelling image sequences with action labels. In this project, a model is developed for human activity detection using high-level action units to represent human activity. The training phase learns the model for the action units and the action classifiers; the testing phase uses the learned model for action prediction. Three components are used to classify activities: a new spatio-temporal descriptor, statistics of the context-aware descriptors, and suppression of noise in the action units. Human activities are represented by a set of intermediary concepts called action units, which are automatically learned from the training data. At the low level, we propose a locally weighted word context descriptor to improve the traditional interest-point-based representation. The proposed descriptor incorporates neighborhood details effectively. At the high level, we introduce GNMF-based action units to bridge the semantic gap in activity representation. Moreover, we propose a new joint l2,1-norm based sparse model for action unit selection in a discriminative manner. Extensive experiments have been carried out to validate our claims and have confirmed our intuition that the action-unit-based representation is critical for modeling complex activities from videos.
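Learning mid-level "action units" by matrix factorisation can be pictured with the sketch below: the video-by-visual-word matrix is factorised so that each video is re-encoded over a small set of learned parts. Plain scikit-learn NMF is used here as a stand-in for the paper's graph-regularised NMF (GNMF), and the number of units is an assumption.

```python
import numpy as np
from sklearn.decomposition import NMF

def learn_action_units(bow_matrix, n_units=30):
    """bow_matrix: (n_videos, vocab_size) non-negative BoW histograms."""
    nmf = NMF(n_components=n_units, init="nndsvda", max_iter=500)
    unit_codes = nmf.fit_transform(bow_matrix)   # per-video activation of each unit
    units = nmf.components_                      # each row = one action unit over words
    return unit_codes, units, nmf
```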
Human Action Classification Using N-Grams Visual Vocabulary
Lecture Notes in Computer Science, 2014
Human action classification is an important task in computer vision. The Bag-of-Words model is a representation method widely used in action classification techniques. In this work we propose an approach based on a mid-level feature representation for human action description. First, an optimal vocabulary is created without fixing the number of visual words in advance, which is a known problem of the K-means method. We introduce a graph-based video representation using the relationships between interest points, in order to take the spatial and temporal layout into account. Finally, a second visual vocabulary based on n-grams is used for classification. This combines the representational power of graphs with the efficiency of the bag-of-words representation. The representation method was tested on the KTH dataset using STIP and MoSIFT descriptors and a multi-class SVM with a chi-square kernel. The experimental results show that our approach using the STIP descriptor outperforms the best results in the state of the art, while results using the MoSIFT descriptor are comparable to them.
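The classification stage named here, an SVM with a chi-square kernel over (n-gram) visual-word histograms, can be sketched with a precomputed kernel matrix as below; the gamma and C values are assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_chi2_svm(X_train, y_train, gamma=1.0):
    """X_train: (n_samples, n_bins) non-negative histograms."""
    K = chi2_kernel(X_train, gamma=gamma)                 # train-vs-train kernel
    return SVC(kernel="precomputed", C=10.0).fit(K, y_train)

def predict_chi2_svm(clf, X_train, X_test, gamma=1.0):
    K_test = chi2_kernel(X_test, X_train, gamma=gamma)    # rows: test, cols: train
    return clf.predict(K_test)
```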
Proceedings of the 10th International Conference on Computer Vision Theory and Applications, 2015
Classifying web videos using a Bag of Words (BoW) representation has received increased attention due to its computational simplicity and good performance. The increasing number of categories, including actions with high confusion, and the addition of significant contextual information have led most authors to focus their efforts on the combination of descriptors. In this field, we propose to use the multikernel Support Vector Machine (SVM) with a contrasted selection of kernels. It is widely accepted that using descriptors that give different kinds of information tends to increase performance. To this end, our approach introduces contextual information, i.e. objects directly related to the performed action, by pre-selecting a set of points belonging to objects to calculate the codebook. In order to know whether a point is part of an object, the objects are first tracked by matching consecutive frames, and the object bounding box is calculated and labeled. We code the action videos using a BoW representation with the object codewords and introduce them to the SVM as an additional kernel. Experiments have been carried out on two action databases, KTH and HMDB; the results show a significant improvement with respect to other similar approaches.
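The multikernel idea of combining a motion-descriptor kernel with an object/context kernel can be sketched with a simple weighted sum of kernels fed to a precomputed-kernel SVM; equal weights and the chi-square kernel choice are assumptions, as the paper selects and weights kernels more carefully.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_multikernel_svm(X_motion, X_objects, y, w_motion=0.5, w_objects=0.5):
    """X_motion / X_objects: non-negative BoW histograms over the same videos."""
    K = w_motion * chi2_kernel(X_motion) + w_objects * chi2_kernel(X_objects)
    return SVC(kernel="precomputed").fit(K, y)
```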