A Model for Interpreting Social Interactions in Local Image Regions

View dependencies in the visual recognition of social interactions

Frontiers in Psychology, 2013

Recognizing social interactions, e.g., two people shaking hands, is important for obtaining information about other people and the surrounding social environment. Despite the visual complexity of social interactions, humans typically have little difficulty recognizing them visually. What is the visual representation of social interactions, and which bodily visual cues support this remarkable human ability? Viewpoint-dependent representations are considered to be at the heart of the visual recognition of many visual stimuli, including objects and biological motion patterns. Here we addressed the question of whether complex social actions acted out between pairs of people, e.g., hugging, are represented in a similar manner. To this end, we created 3-D models from motion-captured actions acted out by two people, e.g., hugging. These 3-D models allowed us to present the same action from different viewpoints. Participants' task was to discriminate a target action from distractor actions in a one-interval forced-choice (1IFC) task. We measured recognition performance in terms of reaction times (RT) and sensitivity (d'). For each tested action we found one view that led to superior recognition performance compared to other views. This finding demonstrates view-dependent effects in visual recognition, in line with the idea of a view-dependent representation of social interactions. Subsequently, we examined the degree to which joint velocities predict recognition performance, in order to identify candidate visual cues underlying the recognition of social interactions. We found that the velocities of the arms, both feet, and hips correlated with recognition performance. Keywords: visual recognition, view dependent, social interactions, visual cues, action observation
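Since recognition performance above is reported as d', a minimal sketch of how sensitivity is typically computed from hits and false alarms in a forced-choice design may help. The counts, the log-linear correction, and the function name below are illustrative assumptions, not details taken from the paper.

```python
# Sketch: computing d-prime from hit and false-alarm counts, as used to
# quantify recognition performance in a 1IFC task. Hypothetical numbers;
# scipy's inverse normal CDF (ppf) supplies the z-transform.
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate)."""
    # Log-linear correction avoids infinite z-scores at rates of 0 or 1.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Example: 45 hits / 5 misses on target trials, 12 false alarms /
# 38 correct rejections on distractor trials.
print(d_prime(45, 5, 12, 38))   # ~1.93, i.e. good discriminability
```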

Full interpretation of minimal images

Cognition, 2018

The goal of this work is to model the process of 'full interpretation' of object images, i.e., the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object into the interpretation of multiple reduced but interpretable local regions. In such reduced regions, interpretation is simpler, since the number of semantic components is small and the variability of possible configurations is low. We model the interpretation process by identifying primitive components and relations that play a useful role in local interpretation by humans. To identify the components and relations used in the interpretation process, we consider the interpretation of 'minimal configurations': reduced local regions that are minimal in the sense that further reduction renders them unrecognizable and uninterpretable. We show that such minimal interpretable images have useful properties, which we use to identify informative features and relations for full interpretation. We describe our interpretation model and show results of detailed interpretations of minimal configurations, produced automatically by the model. Finally, we discuss possible extensions and implications of full interpretation for difficult visual tasks, such as recognizing social interactions, which are beyond the scope of current models of visual recognition.
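The notion of a minimal configuration suggests a simple greedy search: keep applying reductions while the patch remains recognizable, and stop when every further reduction fails. The sketch below is schematic and makes several assumptions: `is_recognizable` stands in for the human recognition test the authors rely on, and the particular corner crops and resolution drop are placeholder reductions.

```python
# Greedy search for a "minimal configuration" on a numpy image patch:
# apply any reduction that keeps the patch recognizable; a patch is
# minimal once all candidate reductions make it unrecognizable.

def reductions(patch):
    """Candidate reductions: four corner crops plus a resolution drop."""
    h, w = patch.shape[:2]
    yield patch[: int(0.8 * h), : int(0.8 * w)]   # drop bottom/right margin
    yield patch[: int(0.8 * h), int(0.2 * w):]    # drop bottom/left margin
    yield patch[int(0.2 * h):, : int(0.8 * w)]    # drop top/right margin
    yield patch[int(0.2 * h):, int(0.2 * w):]     # drop top/left margin
    yield patch[::2, ::2]                         # halve the resolution

def find_minimal_configuration(patch, is_recognizable):
    """Greedily reduce until every further reduction is unrecognizable."""
    while True:
        for reduced in reductions(patch):
            if is_recognizable(reduced):
                patch = reduced
                break
        else:
            return patch  # minimal: all reductions failed
```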

No-Frills Human-Object Interaction Detection: Factorization, Appearance and Layout Encodings, and Training Techniques

ArXiv, 2018

We show that with an appropriate factorization, and encodings of layout and appearance constructed from the outputs of pretrained object detectors, a relatively simple model outperforms more sophisticated approaches on human-object interaction detection. Our model includes factors for detection scores, human and object appearance, and coarse (box-pair configuration) and optionally fine-grained (human pose) layout. We also develop training techniques that improve learning efficiency by: (i) eliminating a train-inference mismatch; (ii) rejecting easy negatives during mini-batch training; and (iii) constructing training mini-batches with a negative-to-positive ratio two orders of magnitude larger than in existing approaches. We conduct a thorough ablation study on the challenging HICO-Det dataset to understand the importance of the different factors and training techniques.
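As an illustration of the factorization idea (a sketch under assumed interfaces, not the authors' code), the interaction score for a human-object box pair can be written as a product of the pretrained detector's scores and independent per-cue factors:

```python
# Sketch of a factorized HOI score: detection factor times sigmoid factors
# from human appearance, object appearance, and layout. Dimensions and
# branch names are hypothetical.
import torch
import torch.nn as nn

class FactoredHOIScore(nn.Module):
    def __init__(self, appearance_dim, layout_dim, num_classes):
        super().__init__()
        self.human_app = nn.Linear(appearance_dim, num_classes)
        self.object_app = nn.Linear(appearance_dim, num_classes)
        self.layout = nn.Linear(layout_dim, num_classes)

    def forward(self, det_human, det_object, f_human, f_object, f_layout):
        # Detection factor: product of the pretrained detector's scores.
        det = det_human * det_object                    # (B, 1)
        # Each cue contributes an independent sigmoid factor per class.
        p = torch.sigmoid(self.human_app(f_human))
        p = p * torch.sigmoid(self.object_app(f_object))
        p = p * torch.sigmoid(self.layout(f_layout))
        return det * p                                  # (B, num_classes)
```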

Discriminative key-component models for interaction detection and recognition

Computer Vision and Image Understanding, 2015

Not all frames are equal: selecting a subset of discriminative frames from a video can improve performance at detecting and recognizing human interactions. In this paper we present models for categorizing a video into one of a number of predefined interactions, or for detecting these interactions in a long video sequence. The models represent an interaction by a set of key temporal moments and the spatial structures they entail; for instance, two people approaching each other, then extending their hands, before engaging in a "handshaking" interaction. Learning the model parameters requires only weak supervision in the form of an overall label for the interaction. Experimental results on the UT-Interaction and VIRAT datasets verify the efficacy of these structured models for human interactions.
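The scoring idea behind such key-component models can be illustrated with a small dynamic program: the score of a video is the best sum of K per-frame component scores, with the chosen key frames kept in temporal order. The sketch below is an illustration under assumed inputs; `frame_scores` stands in for the learned per-component matching scores of the actual model.

```python
# frame_scores[t][k] = how well frame t matches the k-th key component.
# Returns the best total score over ordered choices of K key frames.

def best_key_frame_score(frame_scores, K):
    T = len(frame_scores)
    NEG = float("-inf")
    dp = [NEG] * K  # dp[k] = best score using components 0..k so far
    for t in range(T):
        # Update in reverse so component k only extends choices made
        # at strictly earlier frames (enforces temporal order).
        for k in reversed(range(K)):
            prev = 0.0 if k == 0 else dp[k - 1]
            if prev > NEG:
                dp[k] = max(dp[k], prev + frame_scores[t][k])
    return dp[K - 1]

scores = [[1.0, 0.0], [0.2, 0.5], [0.0, 2.0]]  # T=3 frames, K=2 components
print(best_key_frame_score(scores, K=2))        # 3.0: frame 0, then frame 2
```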

Exploiting visual search theory to infer social interactions

Multimedia Content and Mobile Devices, 2013

In this paper we propose a new method to infer human social interactions using techniques typically adopted in the literature for visual search and information retrieval. The main information we use to discriminate among different types of interactions comes from proxemic cues acquired by a tracker, which we use to distinguish between intentional and casual interactions. The proxemic information is derived from two metrics: the current distance between subjects, and the O-space synergy between subjects. The values of both metrics are taken at every time step over a temporal sliding window and processed in the Discrete Fourier Transform (DFT) domain. The features are then merged into a single array and clustered using the K-means algorithm. The clusters are reorganized over a second, larger temporal window into a Bag of Words framework, so as to build the feature vector that feeds the SVM classifier.
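The pipeline reads almost directly as code. The sketch below is a schematic reconstruction under assumed shapes and parameters: the window length, step, vocabulary size, and SVM kernel are illustrative choices, and `distance` and `synergy` stand in for the tracker-derived proxemic measurements.

```python
# Sliding-window DFT features over two proxemic signals, quantized into
# "words" by K-means, pooled into bag-of-words histograms, fed to an SVM.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def dft_window_features(distance, synergy, win=32, step=8):
    feats = []
    for s in range(0, len(distance) - win + 1, step):
        d = np.abs(np.fft.rfft(distance[s:s + win]))  # DFT magnitudes
        o = np.abs(np.fft.rfft(synergy[s:s + win]))
        feats.append(np.concatenate([d, o]))          # merged feature array
    return np.array(feats)

def bow_histogram(window_feats, kmeans):
    """One histogram of cluster 'words' per sequence (larger window)."""
    words = kmeans.predict(window_feats)
    return np.bincount(words, minlength=kmeans.n_clusters).astype(float)

# Usage sketch: train_feats is a list of per-sequence window-feature arrays.
# kmeans = KMeans(n_clusters=20).fit(np.vstack(train_feats))
# X = np.array([bow_histogram(f, kmeans) for f in train_feats])
# clf = SVC(kernel="rbf").fit(X, labels)
```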

Spatially Conditioned Graphs for Detecting Human–Object Interactions

2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021

We address the problem of detecting human-object interactions in images using graph neural networks. Unlike conventional methods, where nodes send scaled but otherwise identical messages to each of their neighbours, we propose to condition the messages between pairs of nodes on their spatial relationships, resulting in different messages going to different neighbours of the same node. To this end, we explore various ways of applying spatial conditioning under a multi-branch structure. Through extensive experimentation we demonstrate the advantages of spatial conditioning for the computation of the adjacency structure, the messages, and the refined graph features. In particular, we show empirically that as the quality of the bounding boxes increases, coarse appearance features contribute relatively less to the disambiguation of interactions than the spatial information does. Our method achieves an mAP of 31.33% on HICO-DET and 54.2% on V-COCO, significantly outperforming the state of the art on fine-tuned detections.
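The core idea, conditioning each message on the pair's spatial relationship so that neighbours of the same node receive different messages, can be sketched as follows. This is an illustration of the mechanism under assumed tensor shapes, not the authors' multi-branch architecture.

```python
# A message from each source node is modulated per neighbour by a gate
# computed from the pair's spatial encoding (e.g., box-pair geometry).
import torch
import torch.nn as nn

class SpatiallyConditionedMessage(nn.Module):
    def __init__(self, node_dim, spatial_dim):
        super().__init__()
        self.msg = nn.Linear(node_dim, node_dim)
        # The spatial gate makes messages neighbour-specific.
        self.gate = nn.Sequential(
            nn.Linear(spatial_dim, node_dim), nn.Sigmoid()
        )

    def forward(self, x_src, pairwise_spatial):
        # x_src: (N, node_dim); pairwise_spatial: (N, M, spatial_dim)
        m = self.msg(x_src).unsqueeze(1)          # (N, 1, node_dim)
        return m * self.gate(pairwise_spatial)    # (N, M, node_dim)
```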

No-Frills Human-Object Interaction Detection: Factorization, Layout Encodings, and Training Techniques

2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019

We show that, for human-object interaction detection, a relatively simple factorized model with appearance and layout encodings constructed from pre-trained object detectors outperforms more sophisticated approaches. Our model includes factors for detection scores, human and object appearance, and coarse (box-pair configuration) and optionally fine-grained (human pose) layout. We also develop training techniques that improve learning efficiency by: (1) eliminating a train-inference mismatch; (2) rejecting easy negatives during mini-batch training; and (3) using a negative-to-positive ratio two orders of magnitude larger than in existing approaches. We conduct a thorough ablation study to understand the importance of the different factors and training techniques using the challenging HICO-Det dataset [4].
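Training technique (3) can be sketched as a mini-batch construction step. The ratio, the easiness test, and the data layout below are assumptions for illustration, not the paper's settings.

```python
# Sketch: build a mini-batch with a large negative-to-positive ratio,
# rejecting easy negatives first. The 1000:1 ratio and the score-based
# easiness test are hypothetical placeholders.
import random

def build_minibatch(positives, negatives, neg_pos_ratio=1000,
                    is_easy=lambda pair: pair["det_score"] < 0.05):
    # Drop easy negatives, e.g., pairs the detector already scores near zero.
    hard_negs = [n for n in negatives if not is_easy(n)]
    k = min(len(hard_negs), neg_pos_ratio * len(positives))
    return positives + random.sample(hard_negs, k)
```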

Interpretation of group behaviour in visually mediated interaction

Proceedings 15th International Conference on Pattern Recognition. ICPR-2000, 2000

While full computer understanding of dynamic visual scenes containing several people may currently be unattainable, we propose a computationally efficient approach to determining areas of interest in such scenes. We present methods for modelling and interpreting multi-person human behaviour in real time in order to control video cameras for visually mediated interaction.