Unsupervised discovery of action hierarchies in large collections of activity videos
Related papers
Unsupervised view and rate invariant clustering of video sequences
Computer Vision and Image Understanding, 2009
Videos play an ever-increasing role in our everyday lives, with applications ranging from news and entertainment to scientific research, security, and surveillance. Coupled with the decreasing cost of cameras and storage media, this has resulted in people producing more video content than ever before, which necessitates the development of efficient indexing and retrieval algorithms for video data. Most state-of-the-art techniques index videos according to global scene content such as color, texture, and brightness. In this paper, we discuss the problem of activity-based indexing of videos. To address the problem, we first describe activities as a cascade of dynamical systems, which significantly enhances the expressive power of the model while retaining many of the computational advantages of using dynamical models. Second, we derive methods to incorporate view and rate invariance into these models, so that similar actions are clustered together irrespective of the viewpoint or the rate of execution of the activity. We also derive algorithms to learn the model parameters from a video stream and demonstrate how a single video sequence may be partitioned into clusters, where each cluster represents an activity. Experimental results on five different databases show that the clusters found by the algorithm correspond to semantically meaningful activities.
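To make the dynamical-systems idea concrete, here is a minimal, hypothetical sketch (not the paper's actual algorithm; the function names, the segment length, and the cluster count are illustrative assumptions): it fits a first-order linear model to fixed-length feature segments and clusters the fitted transition matrices. The paper's cascade model, view/rate invariance, and system-theoretic distances are all omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_ar1(segment):
    # Least-squares fit of x_{t+1} ~= A x_t over one segment;
    # segment is a (T, d) array of per-frame feature vectors.
    X, Y = segment[:-1], segment[1:]
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return A.T  # (d, d) transition matrix

def cluster_segments(features, seg_len=30, n_clusters=5):
    # Split the feature stream into fixed-length segments, fit an
    # AR(1) model to each, and k-means the flattened transition
    # matrices -- a crude stand-in for the paper's invariant distances.
    T = (len(features) // seg_len) * seg_len
    segments = features[:T].reshape(-1, seg_len, features.shape[1])
    params = np.stack([fit_ar1(s).ravel() for s in segments])
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(params)
```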
2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011
Modelling human activities as temporal sequences of their constituent actions has been the object of much research effort in recent years. However, most of this work concentrates on tasks where the action vocabulary is relatively small and/or each activity can be performed in a limited number of ways. In this work, we propose a novel and robust framework for analysing prolonged activities arising in tasks which can be effectively achieved in a variety of ways, which we name mid-term activities. We show that we are able to efficiently analyse and recognise such activities and also detect potential errors in their execution. To achieve this, we introduce an activity classification method which we name the Key Action Discovery system. We demonstrate that this method, combined with temporal modelling of activities' constituent actions with the aid of hierarchical graphical models, offers higher classification accuracy than current activity identification schemes.
Mining Visual Actions from Movies
2012
A graph-based approach for detecting common actions in motion capture data and videos
Pattern Recognition, 2018
We present a novel solution to the problem of detecting common actions in time series of motion capture data and videos. Given two action sequences, our method discovers all pairs of common subsequences, i.e., subsequences that represent the same or a similar action. This is achieved in a completely unsupervised manner, i.e., without any prior knowledge of the type of actions, their number, or their duration. These common subsequences (commonalities) may be located anywhere in the original sequences, may differ in duration, and may be performed under different conditions, e.g., by a different actor. The proposed method performs a very efficient graph-based search on the matrix of pairwise distances between frames of the two sequences. This search is guided by an objective function that captures the trade-off between the similarity of the common subsequences and their lengths. The proposed method has been evaluated quantitatively on challenging datasets and in comparison to state-of-the-art approaches. The obtained results demonstrate that it outperforms the state-of-the-art methods both in the quality of the obtained solutions and in computational performance.
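As a rough illustration of searching the pairwise-distance matrix for commonalities, the toy sketch below (function name, threshold, and minimum run length are made-up assumptions) reports maximal low-distance diagonal runs, i.e., stretches where the two sequences evolve similarly at the same rate. The actual method's graph search also handles rate differences and optimizes the similarity/length trade-off, which this version does not.

```python
import numpy as np
from scipy.spatial.distance import cdist

def common_subsequences(seq_a, seq_b, thresh=0.5, min_len=10):
    # seq_a: (Ta, d), seq_b: (Tb, d) arrays of per-frame descriptors.
    D = cdist(seq_a, seq_b) < thresh               # similarity mask
    Ta, Tb = D.shape
    matches = []
    for off in range(-Ta + 1, Tb):                 # every diagonal
        diag = np.append(np.diagonal(D, offset=off), False)
        run_start = None
        for t, ok in enumerate(diag):
            if ok and run_start is None:
                run_start = t
            elif not ok and run_start is not None:
                if t - run_start >= min_len:
                    ia = run_start + max(-off, 0)  # start frame in seq_a
                    ib = run_start + max(off, 0)   # start frame in seq_b
                    matches.append(((ia, ia + t - run_start),
                                    (ib, ib + t - run_start)))
                run_start = None
    return matches  # list of ((a_start, a_end), (b_start, b_end)) pairs
```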
Unsupervised and Explainable Assessment of Video Similarity
2019
We propose a novel unsupervised method that assesses the similarity of two videos on the basis of the estimated relatedness of the objects and their behavior, and provides arguments supporting this assessment. A video is represented as a complete undirected action graph that encapsulates information on the types of objects and the way they (inter)act. The similarity of a pair of videos is estimated based on the bipartite Graph Edit Distance (GED) of the corresponding action graphs. As a consequence, on top of estimating a quantitative measure of video similarity, our method establishes spatiotemporal correspondences between objects across videos if these objects are semantically related, if/when they interact similarly, or both. We consider this an important step towards explainable assessment of video and action similarity. The proposed method is evaluated on a publicly available dataset on the tasks of activity classification and ranking and is shown to compare favorably to state-of-the-art approaches.
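For intuition, here is a minimal sketch of the standard bipartite approximation of GED (the substitution costs and constant deletion/insertion costs are illustrative assumptions, not the paper's actual costs): build the usual block cost matrix over real and dummy nodes and solve the node assignment with the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bipartite_ged(sub_costs, del_cost=1.0, ins_cost=1.0):
    # sub_costs[i, j]: cost of substituting node i of graph A with
    # node j of graph B (e.g., object-type + behaviour dissimilarity).
    n, m = sub_costs.shape
    FORBID = 1e9  # large finite cost for disallowed pairings
    big = np.full((n + m, n + m), FORBID)
    big[:n, :m] = sub_costs                  # substitutions
    np.fill_diagonal(big[:n, m:], del_cost)  # delete nodes of A
    np.fill_diagonal(big[n:, :m], ins_cost)  # insert nodes of B
    big[n:, m:] = 0.0                        # dummy-to-dummy, free
    rows, cols = linear_sum_assignment(big)
    dist = big[rows, cols].sum()
    # Pairs with i < n and j < m are cross-video object
    # correspondences of the kind the abstract refers to.
    mapping = [(i, j) for i, j in zip(rows, cols) if i < n and j < m]
    return dist, mapping
```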
Sorting Atomic Activities for Discovering Spatio-temporal Patterns in Dynamic Scenes
We present a novel non-object-centric approach for discovering activity patterns in dynamic scenes, building on previous work on video scene understanding. We first compute simple visual cues and identify elementary activities. Then we divide the video into clips, compute clip histograms, and cluster them to discover spatio-temporal patterns. We adopt a recently proposed clustering algorithm that uses the Earth Mover's Distance (EMD) as its objective function, so that the similarity among elementary activities is taken into account. This paper presents three crucial improvements over previous work: (i) we consider a variant of EMD with a robust ground distance; (ii) clips are represented with circular histograms, and an optimal bin order, reflecting the atomic activities' similarity, is automatically computed; (iii) the temporal dynamics of elementary activities are considered when clustering clips. Experimental results on publicly available datasets show that our method compares favorably with state-of-the-art approaches.
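To illustrate one ingredient of this pipeline, here is a tiny sketch of EMD between normalized circular clip histograms with unit ground distance between adjacent bins (the paper's robust ground distance and optimal bin ordering are omitted). It uses the known identity that 1D transport cost on a circle equals the L1 norm of the CDF difference after subtracting its median.

```python
import numpy as np

def circular_emd(h1, h2):
    # EMD between two normalized histograms on a circle, assuming
    # unit ground distance between adjacent bins.
    d = np.cumsum(np.asarray(h1, float) - np.asarray(h2, float))
    return float(np.abs(d - np.median(d)).sum())
```

The resulting pairwise distances could then feed any distance-based clustering of clips, e.g., k-medoids over the EMD matrix.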
Guest Editorial: Analysis and Retrieval of Events/Actions and Workflows in Video Streams
Multimedia Tools and Applications, 2016
Cognitive video supervision and event analysis in video sequences is a critical task in many multimedia applications. Methods, tools, and algorithms that aim to detect and recognize high-level concepts and their respective spatiotemporal and causal relations, in order to identify semantic video activities, actions, and procedures, have been the focus of the research community over recent years. This research area has a strong impact on many real-life applications, such as service quality assurance, compliance with designed procedures in industrial plants, surveillance of people-dense areas (e.g., theme parks, critical public infrastructures), crisis management in public service areas (e.g., train stations, airports), security (detection of abnormal behaviors in surveillance videos), and semantic characterization and annotation of video streams in various domains (e.g., broadcast or user-generated videos). For instance, the dynamic capture of situational awareness concerning crowds in specific mass gathering venues and its intelligent enablement into emergency management information …
Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words
International Journal of Computer Vision, 2008
We present a novel unsupervised learning method for human action categories. A video sequence is represented as a collection of spatial-temporal words by extracting space-time interest points. The algorithm automatically learns the probability distributions of the spatial-temporal words and the intermediate topics corresponding to human action categories. This is achieved by using latent topic models such as the probabilistic Latent Semantic Analysis (pLSA) model and Latent Dirichlet Allocation (LDA). Thanks to the probabilistic models, our approach can handle noisy feature points arising from dynamic backgrounds and moving cameras. Given a novel video sequence, the algorithm can categorize and localize the human action(s) contained in the video. We test our algorithm on three challenging datasets: the KTH human motion dataset, the Weizmann human action dataset, and a recent dataset of figure skating actions. Our results reflect the promise of such a simple approach. In addition, our algorithm can recognize and localize multiple actions in long and complex video sequences containing multiple motions.
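A bare-bones version of this bag-of-words plus latent-topic pipeline can be written with scikit-learn's LDA, as sketched below; codebook construction (e.g., k-means over interest-point descriptors) and action localization are omitted, and the count matrix here is random stand-in data.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# word_counts: (n_videos, vocab_size) bag-of-words matrix, where each
# "word" is a quantized space-time interest point descriptor.
word_counts = np.random.randint(0, 5, size=(200, 500))  # stand-in data

lda = LatentDirichletAllocation(n_components=6, random_state=0)
theta = lda.fit_transform(word_counts)  # per-video topic mixtures
labels = theta.argmax(axis=1)           # topic = putative action category
```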
From Videos to Verbs: Mining Videos for Events using a Cascade of Dynamical Systems
2000
Clustering video sequences in order to infer and extract events from a single video stream is an extremely important problem and has significant potential in video indexing, surveillance, activity discovery, and event recognition. Clustering a video sequence into events requires one to simultaneously recognize event boundaries (event-consistent subsequences) and cluster these event subsequences. In order …
Action Categorization from Video Sequences
2000
This article presents a framework for extracting relevant qualitative chunks from a video sequence. The notion of qualitative descriptors, used to perform the qualitative extraction, is first described. A grouping algorithm operates on the qualitative descriptions to generate a real-time qualitative segmentation of the image flow. Then, simple pattern recognition methods are used to extract abstract descriptions of basic actions such as "push", "take", or "pull". The method proposed here provides an unsupervised learning technique to generate abstract descriptions of actions from a video sequence.