Context-Aware Modeling and Recognition of Activities in Video
Related papers
Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification
Computer Vision – ECCV 2010, 2010
Much recent research in human activity recognition has focused on the problem of recognizing simple repetitive actions (walking, running, waving) and punctual actions (sitting up, opening a door, hugging). However, many interesting human activities are characterized by a complex temporal composition of simple actions. Automatic recognition of such complex actions can benefit from a good understanding of their temporal structure. We present in this paper a framework for modeling motion by exploiting the temporal structure of human activities. In our framework, we represent activities as temporal compositions of motion segments. We train a discriminative model that encodes a temporal decomposition of video sequences, and appearance models for each motion segment. In recognition, a query video is matched to the model according to the learned appearances and motion segment decomposition. Classification is made based on the quality of matching between the motion segment classifiers and the temporal segments in the query sequence. To validate our approach, we introduce a new dataset of complex Olympic Sports activities. We show that our algorithm performs better than other state-of-the-art methods.
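As a rough illustration of classifying by the quality of matching between motion segment classifiers and temporal segments, the sketch below lets each segment classifier pick its best-responding temporal position and sums the results. The function names, the anchor positions, and the quadratic displacement penalty are assumptions made for illustration; the abstract does not specify how the match quality is computed.

```python
def match_score(responses, anchors, penalty=0.1):
    """Sum, over motion segments, of the best penalized classifier response.

    responses[k][t]: hypothetical score of segment classifier k at position t.
    anchors[k]: hypothetical preferred temporal position of segment k.
    penalty: assumed weight of a quadratic displacement cost.
    """
    total = 0.0
    placements = []
    for scores, anchor in zip(responses, anchors):
        # Pick the position with the highest response after the penalty.
        best_t = max(range(len(scores)),
                     key=lambda t: scores[t] - penalty * (t - anchor) ** 2)
        total += scores[best_t] - penalty * (best_t - anchor) ** 2
        placements.append(best_t)
    return total, placements


def classify(per_class_responses, per_class_anchors, penalty=0.1):
    """Return the class whose segment model matches the query video best."""
    return max(
        per_class_responses,
        key=lambda c: match_score(per_class_responses[c],
                                  per_class_anchors[c], penalty)[0],
    )
```

Here `match_score` returns both the total score and the chosen position of each segment, so the placements can be inspected as a temporal decomposition of the query video.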
Discriminative Hierarchical Modeling of Spatio-temporally Composable Human Activities
2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014
This paper proposes a framework for recognizing complex human activities in videos. Our method describes human activities in a hierarchical discriminative model that operates at three semantic levels. At the lower level, body poses are encoded in a representative but discriminative pose dictionary. At the intermediate level, encoded poses span a space where simple human actions are composed. At the highest level, our model captures temporal and spatial compositions of actions into complex human activities. Our human activity classifier simultaneously models which body parts are relevant to the action of interest as well as their appearance and composition using a discriminative approach. By formulating model learning in a max-margin framework, our approach achieves powerful multiclass discrimination while providing useful annotations at the intermediate semantic level. We show how our hierarchical compositional model provides natural handling of occlusions. To evaluate the effectiveness of our proposed framework, we introduce a new dataset of composed human activities. We provide empirical evidence that our method achieves state-of-the-art activity classification performance on several benchmark datasets.
Human-like Relational Models for Activity Recognition in Video
2021
Video activity recognition by deep neural networks is impressive for many classes. However, it falls short of human performance, especially for activities that are challenging to discriminate. Humans differentiate these complex activities by recognising critical spatiotemporal relations among explicitly recognised objects and parts, for example, an object entering the aperture of a container. Deep neural networks can struggle to learn such critical relationships effectively. Therefore we propose a more human-like approach to activity recognition, which interprets a video in sequential temporal phases and extracts specific relationships among objects and hands in those phases. Random forest classifiers are learnt from these extracted relationships. We apply the method to a challenging subset of the something-something dataset [9] and achieve more robust performance than neural network baselines on challenging activities.
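To make the idea of per-phase relational features concrete, the following minimal sketch computes one relation (an object's center lying inside a container's box) over equal-length temporal phases. The box format, the equal split into three phases, and both function names are assumptions for illustration; the abstract does not state which relations or phase boundaries are used. The resulting feature vector is the kind of input a random forest classifier could be trained on.

```python
def center_inside(obj_box, container_box):
    """True if the center of obj_box lies within container_box.

    Boxes are (x1, y1, x2, y2) tuples, an assumed format.
    """
    cx = (obj_box[0] + obj_box[2]) / 2
    cy = (obj_box[1] + obj_box[3]) / 2
    x1, y1, x2, y2 = container_box
    return x1 <= cx <= x2 and y1 <= cy <= y2


def phase_features(obj_track, container_track, n_phases=3):
    """Fraction of frames in each phase where the object is inside the container.

    The video is split into n_phases equal parts (an assumed phase definition).
    """
    n = len(obj_track)
    feats = []
    for k in range(n_phases):
        start, end = k * n // n_phases, (k + 1) * n // n_phases
        frames = range(start, end)
        hits = sum(center_inside(obj_track[t], container_track[t]) for t in frames)
        feats.append(hits / len(frames) if len(frames) else 0.0)
    return feats
```

A feature vector that rises from 0 in the first phase to 1 in the last would be consistent with an object entering a container over the course of the clip.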
Context-Aware Activity Modeling using Hierarchical Conditional Random Fields
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015
In this paper, rather than modeling activities in videos individually, we jointly model and recognize related activities in a scene using both motion and context features. This is motivated by the observation that activities related in space and time rarely occur independently and can serve as context for each other. We propose a two-layer conditional random field model that represents the action segments and activities in a hierarchical manner. The model allows the integration of both motion and various context features at different levels and automatically learns the statistics that capture the patterns of the features. With weakly labeled training data, the learning problem is formulated as a max-margin problem and is solved by an iterative algorithm. Rather than generating activity labels for individual activities, our model simultaneously predicts an optimum structural label for the related activities in the scene. We show promising results on the UCLA Office Dataset and VIRAT Ground Dataset that demonstrate the benefit of hierarchical modeling of related activities using both motion and context features.
Information Forensics and Security, IEEE Transactions on, 2013
Surveillance videos in unconstrained environments typically consist of long duration sequences of activities which occur at different spatio-temporal locations and can involve multiple people acting simultaneously. Often, the activities have contextual relationships with one another. Although context has been studied in the past for the purpose of activity recognition, the use of context in recognition of activities in such challenging environments is relatively unexplored. In this paper, we propose a novel method for capturing the spatio-temporal context between activities in a Markov random field. The structure of the MRF is improvised upon during test time and not predefined, unlike many approaches that model the contextual relationships between activities. Given a collection of videos and a set of weak classifiers for individual activities, the spatio-temporal relationships between activities are represented as probabilistic edge weights in the MRF. This model provides a generic...
Temporal segmentation and assignment of successive actions in a long-term video
Pattern Recognition Letters, 2013
Temporal segmentation of successive actions in a long-term video sequence has been a long-standing problem in computer vision. In this paper, we exploit a novel learning-based framework. Given a video sequence, the proposed selection algorithm selects only a few characteristic frames, the likelihood of these frames under the trained models is then calculated in a pair-wise way, and finally the segmentation is obtained as the model sequence that maximizes the likelihood. The average accuracy on the IXMAS dataset reached 80.5% at the frame level, using only 16.5% of all frames, with a computation time of 1.57 s per video, where a video has 1160 frames on average.
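Finding a maximum-likelihood model sequence can be sketched as a Viterbi-style dynamic program. The version below is a minimal sketch under stated assumptions, not the paper's algorithm: it takes hypothetical per-frame log-likelihoods (not the pair-wise likelihoods of the abstract) and adds an assumed constant penalty for switching between action models.

```python
def segment(frame_loglik, switch_penalty=1.0):
    """Return the model index per frame that maximizes the total score.

    frame_loglik[t][m]: hypothetical log-likelihood of frame t under model m.
    switch_penalty: assumed cost paid whenever consecutive frames change model.
    """
    n_models = len(frame_loglik[0])
    best = list(frame_loglik[0])  # best score of any path ending in each model
    back = []                     # back-pointers, one list per frame after the first
    for t in range(1, len(frame_loglik)):
        new_best, pointers = [], []
        for m in range(n_models):
            # Best predecessor model, accounting for the switch penalty.
            prev = max(range(n_models),
                       key=lambda p: best[p] - (switch_penalty if p != m else 0.0))
            score = best[prev] - (switch_penalty if prev != m else 0.0)
            new_best.append(score + frame_loglik[t][m])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the best final model.
    label = max(range(n_models), key=lambda m: best[m])
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    path.reverse()
    return path
```

Contiguous runs of the same index in the returned path are the action segments; a larger `switch_penalty` yields fewer, longer segments.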
Learning Hierarchical Models of Complex Daily Activities from Annotated Videos
2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018
Effective recognition of complex long-term activities is becoming an increasingly important task in artificial intelligence. In this paper, we propose a novel approach for building models of complex long-term activities. First, we automatically learn the hierarchical structure of activities by learning the 'parent-child' relations of activity components from a video, using the variability in annotations acquired from multiple annotators. This variability allows for extracting the inherent hierarchical structure of the activity in a video. We consolidate hierarchical structures of the same activity from different videos into a unified stochastic grammar describing the overall activity. We then describe an inference mechanism to interpret new instances of activities. We use three datasets of daily activity videos, each annotated by multiple annotators, to demonstrate the effectiveness of our system.
Coupling video segmentation and action recognition
IEEE Winter Conference on Applications of Computer Vision, 2014
Recently a lot of progress has been made in the field of video segmentation. The question then arises whether and how these results can be exploited for another video processing challenge, action recognition. In this paper we show that a good segmentation is actually very important for recognition. We propose and evaluate several ways to integrate and combine the two tasks: i) recognition using a standard, bottom-up segmentation, ii) using a top-down segmentation geared towards actions, iii) using a segmentation based on inter-video similarities (co-segmentation), and iv) tight integration of recognition and segmentation via iterative learning. Our results clearly show that, on the one hand, the two tasks are interdependent and therefore an iterative optimization of the two makes sense and gives better results. On the other hand, comparable results can also be obtained with two separate steps when the feature space is mapped with a non-linear kernel.
2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011
Modelling human activities as temporal sequences of their constituent actions has been the object of much research effort in recent years. However, most of this work concentrates on tasks where the action vocabulary is relatively small and/or each activity can be performed in a limited number of ways. In this work, we propose a novel and robust framework for analysing prolonged activities arising in tasks which can be effectively achieved in a variety of ways, which we name mid-term activities. We show that we are able to efficiently analyse and recognise such activities and also detect potential errors in their execution. To achieve this, we introduce an activity classification method which we name the Key Action Discovery system. We demonstrate that this method combined with temporal modelling of activities' constituent actions with the aid of hierarchical graphical models offers higher classification accuracy compared to current activity identification schemes.