Unsupervised learning of event classes from video

Relational graph mining for learning events from video

2010

In this work, we represent complex video activities as one large activity graph and propose a constraint-based graph mining technique to discover a partonomy of classes of subgraphs corresponding to event classes. Events are defined as subgraphs of the activity graph that represent what we regard as interesting interactions: those in which all objects are actively engaged and which occur frequently in the activity graph. Subgraphs with these two properties are mined using a level-wise algorithm and then partitioned into equivalence classes, which we regard as event classes. Moreover, a taxonomy of these event classes emerges naturally from the level-wise mining procedure. Experimental results in an aircraft turnaround apron scenario show that the proposed technique has considerable potential for characterizing and mining events from video.
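
The level-wise step is essentially an Apriori-style search over subgraph patterns, pruned by the anti-monotonicity of frequency. A minimal sketch under simplifying assumptions (candidate subgraphs reduced to sets of labelled edges, support counted over interaction windows; all names here are illustrative, not from the paper):

```python
# Apriori-style level-wise pattern mining, simplified: patterns are sets of
# labelled edges, and support is the number of interaction windows that
# contain a pattern. Names (levelwise_mine, min_support) are illustrative.
from itertools import combinations
from collections import Counter

def levelwise_mine(interactions, min_support, max_size=3):
    """interactions: list of frozensets of labelled edges, one per observed
    interaction window. Returns the frequent patterns found at each level."""
    # Level 1: single edges that occur in enough windows.
    counts = Counter(e for window in interactions for e in window)
    frequent = {frozenset([e]) for e, c in counts.items() if c >= min_support}
    levels = [frequent]
    for k in range(2, max_size + 1):
        # Candidate generation: join (k-1)-patterns that share k-2 edges.
        # Anti-monotone support lets us build only from frequent patterns.
        candidates = {a | b for a, b in combinations(levels[-1], 2)
                      if len(a | b) == k}
        frequent = {c for c in candidates
                    if sum(c <= w for w in interactions) >= min_support}
        if not frequent:
            break
        levels.append(frequent)
    return levels
```

A real subgraph miner would additionally handle graph isomorphism when counting support; the sketch sidesteps this by treating patterns as flat edge sets.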

Probabilistic relational learning of event models from video

This paper investigates the application of an inductive logic programming system, allied with Markov Logic Networks (MLNs), to the task of learning event models from large video datasets. A learning-from-interpretations setting is used to learn event models efficiently; these models define the structure of an MLN. The network parameters are obtained through discriminative learning, and probabilistic inference is used to query the MLN for event recognition.
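
For intuition, an MLN defines a log-linear distribution over possible worlds: each ground formula contributes its weight when satisfied, so a world's probability is proportional to exp of the weighted count of satisfied formulas. A self-contained toy sketch of that scoring and an exact (exponential) query; the formula representation is assumed for illustration, not taken from the paper:

```python
# Toy MLN inference: P(world) is proportional to exp(sum of weights of
# satisfied ground formulas). Formulas are modelled as (weight, test) pairs,
# an assumption made for this sketch.
import math
from itertools import product

def world_score(world, weighted_formulas):
    """world: dict atom -> bool; weighted_formulas: list of (weight, fn),
    where fn(world) -> bool evaluates one ground formula."""
    return math.exp(sum(w for w, f in weighted_formulas if f(world)))

def query(atom, atoms, weighted_formulas):
    """P(atom=True) by brute-force enumeration over all truth assignments."""
    z = num = 0.0
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        s = world_score(world, weighted_formulas)
        z += s
        if world[atom]:
            num += s
    return num / z
```

Practical MLN engines replace the enumeration with approximate inference such as MC-SAT or Gibbs sampling; the sketch only shows the distribution being queried.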

Learning Relational Event Models from Video

Journal of Artificial Intelligence Research

Event models obtained automatically from video can be used in applications ranging from abnormal event detection to content-based video retrieval. When multiple agents are involved in the events, characterizing them naturally suggests encoding interactions as relations. Learning event models from this kind of relational spatio-temporal data using relational learning techniques such as Inductive Logic Programming (ILP) holds promise, but such techniques have not been successfully applied to the very large datasets that result from video data. In this paper, we present REMIND (Relational Event Model INDuction), a novel framework for supervised relational learning of event models from large video datasets using ILP. Efficiency is achieved through the learning-from-interpretations setting and a typing system that exploits the type hierarchy of objects in a domain. The use of types also helps prevent over-generalization. Furthermore, we present a type-refining operator and prove that it is optimal.
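
To see how typing can prune the ILP search, consider a toy type hierarchy: a candidate literal is admitted into a clause only if the types required at its argument positions are consistent with the types already inferred for the shared variables. The hierarchy, predicate signatures, and helper names below are invented for illustration:

```python
# Toy type hierarchy and compatibility check for pruning candidate literals
# during clause refinement. All types and signatures here are invented.
SUBTYPE = {            # child -> parent
    "loader": "vehicle", "bridge": "vehicle",
    "vehicle": "object", "plane": "object",
}

def is_subtype(t, ancestor):
    while t is not None:
        if t == ancestor:
            return True
        t = SUBTYPE.get(t)
    return False

def compatible(var_types, literal_sig, literal_args):
    """var_types: current variable -> inferred type; literal_sig: required
    type per argument position. A candidate literal survives only if every
    shared variable's type is a subtype of the required one."""
    return all(is_subtype(var_types.get(v, sig), sig)
               for v, sig in zip(literal_args, literal_sig))
```

Refining a variable to a subtype (e.g. vehicle to loader) is this sketch's analogue of a type-refining step: it specializes a clause without changing its literal structure.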

From Videos to Verbs: Mining Videos for Events using a Cascade of Dynamical Systems

2007

Clustering video sequences in order to infer and extract events from a single video stream is an extremely important problem and has significant potential in video indexing, surveillance, activity discovery and event recognition. Clustering a video sequence into events requires one to simultaneously recognize event boundaries (event-consistent subsequences) and cluster these event subsequences. In order …
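
A hedged sketch of the two coupled subproblems the abstract names: a one-step linear predictor stands in for the paper's dynamical systems, a spike in prediction error marks an event boundary, and the resulting segments would then be clustered by their fitted models. The window size and threshold are illustrative:

```python
# Boundary detection via model-fit residuals: fit x_{t+1} ~ A x_t on a
# sliding window by least squares; a residual spike suggests the dynamics
# changed, i.e. an event boundary. Parameters are illustrative.
import numpy as np

def segment(X, thresh=2.0, win=10):
    """X: (T, d) per-frame feature sequence. Returns boundary indices."""
    bounds = [0]
    for t in range(win, len(X) - 1, win):
        W = X[t - win:t]
        A, *_ = np.linalg.lstsq(W[:-1], W[1:], rcond=None)  # local dynamics
        resid = np.linalg.norm(X[t + 1] - X[t] @ A)
        if resid > thresh:
            bounds.append(t)
    return bounds + [len(X)]
```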

Unsupervised discovery of action hierarchies in large collections of activity videos

2007

Given a large collection of videos containing activities, we investigate the problem of organizing it in an unsupervised fashion into a hierarchy based on the similarity of actions embedded in the videos. We use spatio-temporal volumes of filtered motion vectors to compute appearance-invariant action similarity measures efficiently, and use these similarity measures in hierarchical agglomerative clustering to organize videos into a hierarchy such that neighboring nodes contain similar actions. This naturally leads to a simple automatic scheme for selecting videos of representative actions (exemplars) from the database and for efficiently indexing the whole database. We compute a performance metric on the hierarchical structure to evaluate the goodness of the estimated hierarchy, and show that this metric has potential for predicting the clustering performance of various joining criteria used in building hierarchies. Our results show that perceptually meaningful hierarchies can be constructed based on action similarities with minimal user supervision, while providing favorable clustering and retrieval performance.
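
Assuming the pairwise action similarities are already computed, the hierarchy construction itself is standard agglomerative clustering; a minimal sketch using SciPy, where the similarity-to-distance conversion and the joining criterion are illustrative choices:

```python
# Build an action hierarchy from a precomputed similarity matrix; the
# paper's filtered-motion-vector similarity is abstracted away here.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def build_hierarchy(S, join="average"):
    """S: (n, n) symmetric similarity matrix with values in [0, 1]."""
    D = 1.0 - S                         # similarity -> distance
    np.fill_diagonal(D, 0.0)            # squareform expects a zero diagonal
    Z = linkage(squareform(D, checks=False), method=join)
    return Z                            # (n-1, 4) merge tree

# e.g. cut the tree into at most 5 groups of similar actions:
# labels = fcluster(Z, t=5, criterion="maxclust")
```

Comparing different `method` values (single, complete, average) is the sketch's analogue of the joining criteria whose clustering performance the paper's metric tries to predict.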

Unsupervised Mining of Statistical Temporal Structures in Video

2003

In this paper, we present algorithms for unsupervised mining of structures in video using multiscale statistical models. Video structures are repetitive segments in a video stream with consistent statistical characteristics. Such structures can often be interpreted in relation to distinctive semantics, particularly in structured domains like sports. While much work in the literature explores the link between the observations and the semantics using supervised learning, we propose unsupervised structure mining algorithms that aim at alleviating the burden of labelling and training, as well as providing a scalable solution for generalizing video indexing techniques to heterogeneous content collections such as surveillance and consumer videos. Existing unsupervised video structuring work primarily uses clustering techniques, while the rich statistical characteristics in the temporal dimension at different granularities remain unexplored. Automatically identifying structures from an unknown domain poses significant challenges when domain knowledge is not explicitly present to assist algorithm design, model selection, and feature selection. In this work, we model multi-level statistical structures with hierarchical hidden Markov models based on a multi-level Markov dependency assumption. The parameters of the model are efficiently estimated using the EM algorithm; we have also developed a model structure learning algorithm that uses stochastic sampling techniques to find the optimal model structure, and a feature selection algorithm that automatically finds compact relevant feature sets using hybrid wrapper-filter methods. When tested on sports videos, the unsupervised learning scheme achieves very promising results: (1) the automatically selected feature sets for soccer and baseball videos match those selected manually with domain knowledge; (2) the system automatically discovers high-level structures that match the semantic events in the video; (3) the system achieves even slightly better accuracy in detecting semantic events in unlabelled soccer videos than a competing supervised approach designed and trained with domain knowledge.
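
The paper's models are hierarchical HMMs trained with EM plus stochastic structure search; as a hedged stand-in, this sketch fits flat Gaussian HMMs with EM via hmmlearn and picks the state count with a rough BIC score. The feature matrix X (frames x features) and the candidate state counts are assumed:

```python
# Model selection over flat Gaussian HMMs (a simplification of the paper's
# hierarchical HMMs): fit each candidate with EM, keep the lowest BIC.
import numpy as np
from hmmlearn import hmm

def fit_best_hmm(X, state_counts=(2, 3, 4, 5)):
    best, best_bic = None, np.inf
    for k in state_counts:
        m = hmm.GaussianHMM(n_components=k, covariance_type="diag",
                            n_iter=50, random_state=0)
        m.fit(X)
        # Rough parameter count: transitions + diagonal-Gaussian emissions.
        n_params = k * k + 2 * k * X.shape[1]
        bic = -2 * m.score(X) + n_params * np.log(len(X))
        if bic < best_bic:
            best, best_bic = m, bic
    return best
```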

Event model learning from complex videos using ILP

Proceedings of the 2010 …, 2010

Learning event models from videos has applications ranging from abnormal event detection to content-based video retrieval. Relational learning techniques such as Inductive Logic Programming (ILP) hold promise for building such models, but have not been successfully applied to the very large datasets which result from video data. In this paper we present a novel supervised learning framework to learn event models from large video datasets (∼2.5 million frames) using ILP. Efficiency is achieved via the learning-from-interpretations setting and the use of a typing system. This allows learning to take place in a reasonable time frame with reduced false positives. The experimental results on video data from an airport apron, where events such as Loading, Unloading and Jet-Bridge Parking are learned, suggest that the techniques are suitable for real-world scenarios.
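
The efficiency claim rests on the learning-from-interpretations setting: each clip becomes an independent interpretation (a small set of ground facts), so candidate clauses are tested clip by clip instead of against one global database. A toy sketch, with every predicate and fact invented for illustration:

```python
# Learning from interpretations: one labelled clip = one self-contained
# interpretation. All predicates and constants below are invented examples.
clip_17 = {
    ("enters", "loader1", "zone_A", 3),
    ("attached", "loader1", "plane1", 9),
}

def covers(clause_test, interpretations):
    """clause_test: fn(facts) -> bool for one interpretation. Returns the
    fraction of interpretations the candidate clause covers, tested locally
    per clip rather than via global entailment."""
    return sum(clause_test(facts) for facts in interpretations) / len(interpretations)

# e.g. a candidate clause "some loader attaches to a plane":
attaches = lambda facts: any(f[0] == "attached" for f in facts)
# compare covers(attaches, positive_clips) against covers(attaches, negative_clips)
```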

Understanding Video Events: A Survey of Methods for Automatic Interpretation of Semantic Occurrences in Video

IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2009

Understanding video events, the translation of low-level content in video sequences into high-level semantic concepts, is a research topic that has received much interest in recent years. Important applications of this work include smart surveillance systems, semantic video database indexing, and interactive systems. The technology can be applied to several video domains, including airport terminals, parking lots, traffic, subway stations, aerial surveillance, and sign-language data. In this work we survey the two main components of the event understanding process: Abstraction and Event modeling. Abstraction is the process of molding the data into informative units to be used as input to the event model. Event modeling is devoted to describing events of interest formally and enabling recognition of these events as they occur in the video sequence. Event modeling can be further decomposed into the categories of Pattern Recognition Methods, State Event Models, and Semantic Event Models. In this survey we discuss this proposed taxonomy of the literature, offer a unifying terminology, and discuss popular abstraction schemes (e.g. Motion History Images) and event modeling formalisms (e.g. Hidden Markov Models) and their use in video event understanding, using extensive examples from the literature. Finally, we consider the application domain of video event understanding in light of the proposed taxonomy, and propose future directions for research in this field.
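
Motion History Images, cited above as a popular abstraction scheme, are easy to make concrete: each pixel is set to a maximum value when motion is detected there and decays linearly otherwise, so brightness encodes recency of movement. A minimal numpy sketch, where the frame-differencing motion test, tau, and threshold are illustrative choices:

```python
# Motion History Image update: pixels with detected motion are stamped with
# tau; all others decay by one per frame (floored at zero).
import numpy as np

def update_mhi(mhi, prev_frame, frame, tau=30, thresh=25):
    """mhi: float image; prev_frame, frame: uint8 grayscale, same shape."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    motion = diff > thresh              # crude frame-differencing detector
    return np.where(motion, float(tau), np.maximum(mhi - 1.0, 0.0))

# usage: mhi = np.zeros(frame_shape, np.float32), then fold update_mhi over
# consecutive frames; the resulting image summarizes recent motion.
```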

Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words

International Journal of Computer Vision, 2008

We present a novel unsupervised learning method for human action categories. A video sequence is represented as a collection of spatial-temporal words by extracting space-time interest points. The algorithm automatically learns the probability distributions of the spatial-temporal words and the intermediate topics corresponding to human action categories. This is achieved by using latent topic models such as the probabilistic Latent Semantic Analysis (pLSA) model and Latent Dirichlet Allocation (LDA). Owing to the use of probabilistic models, our approach can handle noisy feature points arising from dynamic backgrounds and moving cameras. Given a novel video sequence, the algorithm can categorize and localize the human action(s) contained in the video. We test our algorithm on three challenging datasets: the KTH human motion dataset, the Weizmann human action dataset, and a recent dataset of figure skating actions. Our results reflect the promise of such a simple approach. In addition, our algorithm can recognize and localize multiple actions in long and complex video sequences containing multiple motions.
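
The bag-of-spatial-temporal-words pipeline is straightforward to sketch: quantize space-time interest-point descriptors into a codebook, count words per video, and fit a topic model. Here scikit-learn's LDA stands in for the paper's pLSA/LDA implementation, descriptor extraction is assumed already done, and the codebook and topic sizes are illustrative:

```python
# Bag-of-spatial-temporal-words + topic model: k-means codebook over
# space-time descriptors, per-video word counts, LDA action topics.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

def action_topics(descriptors_per_video, n_words=500, n_actions=6):
    """descriptors_per_video: list of (n_i, d) arrays of space-time
    interest-point descriptors, one array per video."""
    codebook = KMeans(n_clusters=n_words, n_init=4).fit(
        np.vstack(descriptors_per_video))
    counts = np.stack([
        np.bincount(codebook.predict(d), minlength=n_words)
        for d in descriptors_per_video])
    lda = LatentDirichletAllocation(n_components=n_actions).fit(counts)
    return lda.transform(counts)        # per-video action-topic mixtures
```

Because the topic model is probabilistic, noisy words contribute little mass to any action topic, which is the robustness property the abstract attributes to pLSA/LDA.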

Video Activity Extraction and Reporting with Incremental Unsupervised Learning

2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance, 2010

This work presents a new method for activity extraction and reporting from video based on the aggregation of fuzzy relations. Trajectory clustering is first employed, mainly to discover the points of entry and exit of mobile objects appearing in the scene. In a second step, proximity relations between the resulting clusters of detected mobile objects and contextual elements of the scene are modeled with fuzzy relations, which can then be aggregated using standard soft-computing algebra. A clustering algorithm based on the transitive closure of the fuzzy relations builds the structure of the scene and characterizes its ongoing activities. Discovered activity zones can be reported as activity maps at different granularities through analysis of the transitive closure matrix. Taking advantage of the properties of soft relations, activity zones and related activities can be labeled in a more human-like language. We present results obtained on real videos of apron monitoring at Toulouse airport in France.
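
The transitive closure computation the abstract relies on is compact to write down: for a fuzzy relation R with values in [0, 1], iterate the max-min composition until a fixpoint, then threshold the closure to read off zones. A minimal numpy sketch, with the alpha-cut clustering step only hinted at in the trailing comment:

```python
# Max-min transitive closure of a fuzzy relation: repeat R <- max(R, R o R),
# where (R o R)[i, k] = max_j min(R[i, j], R[j, k]), until nothing changes.
import numpy as np

def maxmin_closure(R):
    """R: (n, n) fuzzy relation with values in [0, 1]."""
    while True:
        comp = np.max(np.minimum(R[:, :, None], R[None, :, :]), axis=1)
        nxt = np.maximum(R, comp)
        if np.allclose(nxt, R):
            return nxt
        R = nxt

# Activity zones at granularity alpha: connected components of the boolean
# relation maxmin_closure(R) >= alpha; varying alpha varies the map detail.
```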