Unsupervised and Explainable Assessment of Video Similarity

Unsupervised discovery of action hierarchies in large collections of activity videos

2007

Given a large collection of videos containing activities, we investigate the problem of organizing it in an unsupervised fashion into a hierarchy based on the similarity of actions embedded in the videos. We use spatio-temporal volumes of filtered motion vectors to compute appearance-invariant action similarity measures efficiently, and we use these similarity measures in hierarchical agglomerative clustering to organize videos into a hierarchy such that neighboring nodes contain similar actions. This naturally leads to a simple automatic scheme for selecting videos of representative actions (exemplars) from the database and for efficiently indexing the whole database. We compute a performance metric on the hierarchical structure to evaluate the goodness of the estimated hierarchy, and show that this metric has potential for predicting the clustering performance of various joining criteria used in building hierarchies. Our results show that perceptually meaningful hierarchies can be constructed based on action similarities with minimal user supervision, while providing favorable clustering and retrieval performance.
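The clustering step described above can be illustrated with a minimal sketch of hierarchical agglomerative clustering over a precomputed distance matrix. This is not the authors' implementation: the function name `agglomerative`, the toy matrix `dist`, and the three joining criteria shown (single, complete, average linkage) are assumptions chosen for illustration, and the abstract does not say which criteria the paper compares.

```python
from itertools import combinations

def agglomerative(dist, linkage="average"):
    """Merge clusters bottom-up and return the merges as (members_a, members_b, distance)."""
    # Joining criterion: how to reduce all pairwise distances between two clusters.
    reduce_fn = {
        "single": min,
        "complete": max,
        "average": lambda values: sum(values) / len(values),
    }[linkage]
    clusters = [(i,) for i in range(len(dist))]  # start with one cluster per video
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = reduce_fn([dist[i][j] for i in a for j in b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((a, b, d))
        # Replace the two closest clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
    return merges

# Hypothetical action distances between four videos: 0 and 1 are alike, 2 and 3 are alike.
dist = [
    [0, 1, 8, 9],
    [1, 0, 7, 8],
    [8, 7, 0, 2],
    [9, 8, 2, 0],
]
merges = agglomerative(dist, "average")
```

The returned merge list encodes the hierarchy from leaves to root. Changing `linkage` changes only the height at which the two groups join in this toy example, which is the kind of difference a hierarchy-quality metric would need to be sensitive to.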

A graph-based approach for detecting common actions in motion capture data and videos

Pattern Recognition, 2018

We present a novel solution to the problem of detecting common actions in time series of motion capture data and videos. Given two action sequences, our method discovers all pairs of common subsequences, i.e., subsequences that represent the same or similar action. This is achieved in a completely unsupervised manner, i.e., without any prior knowledge of the type of actions, their number, or their duration. These common subsequences (commonalities) may be located anywhere in the original sequences, may differ in duration, and may be performed under different conditions, e.g., by a different actor. The proposed method performs a very efficient graph-based search on the matrix of pairwise distances between the frames of the two sequences. This search is supported by an objective function that captures the trade-off between the similarity of the common subsequences and their lengths. The proposed method has been evaluated quantitatively on challenging datasets and in comparison to state-of-the-art approaches. The obtained results demonstrate that the proposed method outperforms the state-of-the-art methods both in the quality of the obtained solutions and in computational performance.
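To make the "matrix of pairwise distances of frames" concrete, the sketch below builds such a matrix and then finds the longest diagonal run of close frames. The second step is a deliberately simple stand-in, not the paper's graph-based search or its objective function: it assumes both subsequences have the same duration, which the paper's method does not require. The names `distance_matrix`, `longest_diagonal_match`, and `threshold`, and the toy feature vectors, are hypothetical.

```python
import math

def distance_matrix(seq_a, seq_b):
    """Euclidean distance between every frame of seq_a and every frame of seq_b."""
    return [[math.dist(fa, fb) for fb in seq_b] for fa in seq_a]

def longest_diagonal_match(D, threshold):
    """Return (start_a, start_b, length) of the longest diagonal run with D[i][j] <= threshold."""
    rows, cols = len(D), len(D[0])
    run = [[0] * (cols + 1) for _ in range(rows + 1)]  # run lengths ending at each cell
    best = (0, 0, 0)
    for i in range(rows):
        for j in range(cols):
            if D[i][j] <= threshold:
                length = run[i][j] + 1
                run[i + 1][j + 1] = length
                if length > best[2]:
                    best = (i - length + 1, j - length + 1, length)
    return best

# Toy per-frame feature vectors; frames 2..4 of seq_a resemble frames 0..2 of seq_b.
seq_a = [(0, 0), (5, 5), (1, 0), (2, 0), (3, 0)]
seq_b = [(1, 0.1), (2, 0.1), (3, 0.1), (9, 9)]
D = distance_matrix(seq_a, seq_b)
match = longest_diagonal_match(D, threshold=0.2)
```

A fixed threshold favors neither length nor similarity explicitly; an objective function like the one the abstract mentions would score each candidate pair on both.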

Estimating Human Actions Affinities Across Views

Proceedings of the 10th International Conference on Computer Vision Theory and Applications, 2015

This paper deals with the problem of estimating the affinity level between different types of human actions observed from different viewpoints. We analyse simple repetitive upper-body human actions with the goal of producing a view-invariant model from simple motion cues inspired by studies of human perception. We adopt a simple descriptor that summarizes the evolution of the spatio-temporal curvature of the trajectories, which we use to evaluate the similarity between pairs of actions through multi-level matching. We experimentally verified the presence of semantic connections between actions across views, inferring a relation graph that shows such affinities.
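As a rough illustration of spatio-temporal curvature, the sketch below treats a 2D point trajectory sampled at unit time steps as the 3D curve (x(t), y(t), t) and applies the textbook curvature formula |r' × r''| / |r'|³ with finite differences. This is an assumption about what the cue looks like, not the paper's descriptor; the function name `spatiotemporal_curvature` and the sample trajectories are hypothetical.

```python
import math

def spatiotemporal_curvature(points):
    """Curvature of the curve (x(t), y(t), t) at each interior sample, assuming unit time steps."""
    curvature = []
    for k in range(1, len(points) - 1):
        (x0, y0), (x1, y1), (x2, y2) = points[k - 1], points[k], points[k + 1]
        dx, dy = (x2 - x0) / 2, (y2 - y0) / 2          # central first differences
        ddx, ddy = x2 - 2 * x1 + x0, y2 - 2 * y1 + y0  # second differences
        # r' = (dx, dy, 1) and r'' = (ddx, ddy, 0), so r' x r'' = (-ddy, ddx, dx*ddy - dy*ddx).
        cross = math.sqrt(ddy ** 2 + ddx ** 2 + (dx * ddy - dy * ddx) ** 2)
        speed = math.sqrt(dx ** 2 + dy ** 2 + 1.0)
        curvature.append(cross / speed ** 3)
    return curvature

# Constant-velocity straight line: no curvature anywhere.
straight = spatiotemporal_curvature([(t, 2 * t) for t in range(5)])

# Trajectory that turns a corner at its third sample.
corner = spatiotemporal_curvature([(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)])
```

Because time is one of the coordinates, a change of speed along a straight path also produces a nonzero value, so peaks mark changes in either direction or speed.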

GAIDON et al.: MINING VISUAL ACTIONS FROM MOVIES

2012


Unsupervised Human Action Detection by Action Matching

We propose a new task of unsupervised action detection by action matching. Given two long videos, the objective is to temporally detect all pairs of matching video segments. A pair of video segments is matched if the segments share the same human action. The task is category independent (it does not matter what action is being performed), and no supervision is used to discover such video segments. Unsupervised action detection by action matching allows us to align videos in a meaningful manner. As such, it can be used to discover new action categories or as an action proposal technique within, say, an action detection pipeline. Moreover, it is a useful pre-processing step for generating video highlights, e.g., from sports videos. We present an effective and efficient method for unsupervised action detection. We use an unsupervised temporal encoding method and exploit the temporal consistency in human actions to obtain candidate action segments. We evaluate our method on this challenging task using three activity recognition benchmarks, namely, the MPII Cooking activities dataset, the THUMOS15 action detection benchmark, and a new dataset called the IKEA dataset. On the MPII Cooking dataset we detect action segments with a precision of 21.6% and a recall of 11.7% over 946 long video pairs and over 5000 ground truth action segments. Similarly, on the THUMOS dataset we obtain 18.4% precision and 25.1% recall over 5094 ground truth action segment pairs.

Coaction discovery: segmentation of common actions across multiple videos

2012

We introduce a new problem called coaction discovery: the task of discovering and segmenting the common actions (coactions) between videos that may contain several actions. This paper presents an approach for coaction discovery. The key idea of our approach is to compute an action proposal map for each video based jointly on dynamic object-motion and static appearance semantics, and to cluster each video, without supervision, into atomic action clips called actoms. Subsequently, we use a temporally coherent discriminative clustering framework for extracting the coactions. We apply our coaction discovery approach to two datasets and demonstrate convincing performance, superior to that of three baseline methods.

Graphing the Future: Activity and Next Active Object Prediction using Graph-based Activity Representations

2022

We present a novel approach for the visual prediction of human-object interactions in videos. Rather than forecasting the human and object motion or the future hand-object contact points, we aim at predicting (a) the class of the ongoing human-object interaction and (b) the class(es) of the next active object(s) (NAOs), i.e., the object(s) that will be involved in the interaction in the near future, as well as the time at which the interaction will occur. Graph matching relies on the efficient Graph Edit Distance (GED) method. The experimental evaluation of the proposed approach was conducted using two well-established video datasets that contain human-object interactions, namely the MSR Daily Activities and the CAD120. High prediction accuracy was obtained for both action prediction and NAO forecasting.

Benchmarking qualitative spatial calculi for video activity analysis

2011

This paper presents a general way of addressing problems in video activity understanding using graph-based relational learning. Video activities are described using relational spatio-temporal graphs that represent qualitative spatio-temporal relations between interacting objects. A wide range of spatio-temporal relations are introduced as being well suited for describing video activities. Then, a formulation is proposed in which standard problems in video activity understanding, such as event detection, are naturally mapped to problems in graph-based relational learning. Experiments on video understanding tasks, for a video dataset consisting of common outdoor verbs, validate the significance of the proposed approach.

Co-recognition of Actions in Video Pairs

2010 20th International Conference on Pattern Recognition, 2010

In this paper, we present a method that recognizes single or multiple common actions between a pair of video sequences. We establish an energy function that evaluates geometric and photometric consistency, and solve the action recognition problem by optimizing the energy function. The proposed stochastic inference algorithm, based on the Monte Carlo method, explores the video pair starting from the local spatiotemporal interest point matches to find the common actions. Our algorithm works in an unsupervised way, without prior knowledge about the type and the number of common actions. Experiments show that our algorithm produces promising results on single and multiple action recognition.

Discovering Underlying Similarities in Video

2008

The concept of interrogating similarities within a data set has a long history in fields ranging from medicinal chemistry to image analysis. We define a descriptor as an entropic measure of similarity for an image and the neighborhood of images surrounding it. Differential changes in descriptor values imply differential changes in the structure underlying the images. For example, at the location of a zero crossing in the descriptor values, the corresponding image is a watershed image clearly sitting between two dissimilar groups of images. This paper describes a fast algorithm for image sequence clustering based on the above concept. Developed initially for an adaptive system for capture, analysis, and storage of lengthy dynamic visual processes, the algorithm uncovers underlying spatio-temporal structures without a priori information or segmentation. The algorithm estimates the average amount of information each image, in an ordered set, conveys about the structure underlying the ordered set. Such signatures enable capture of relevant and salient time periods, directly leading to reductions in the cost of follow-up analysis and storage. As a part of the video capture system, the above characterization may provide predictive feedback to an adaptive capture subsystem controlling temporal sampling, frame rate, and exposure. Details of the algorithm, examples of its application to quantification of biological motion, and video identification and recognition are presented. Prior to the workshop, an efficient implementation will be posted as a web service to generate characterization of unknown videos online.