Improving Visual Relation Detection using Depth Maps

2.5D Visual Relationship Detection

ArXiv, 2021

Visual 2.5D perception involves understanding the semantics and geometry of a scene through reasoning about object relationships with respect to the viewer in an environment. However, existing works in visual recognition primarily focus on the semantics. To bridge this gap, we study 2.5D visual relationship detection (2.5VRD), in which the goal is to jointly detect objects and predict their relative depth and occlusion relationships. Unlike general VRD, 2.5VRD is egocentric, using the camera’s viewpoint as a common reference for all 2.5D relationships. Unlike depth estimation, 2.5VRD is object-centric and does not focus only on depth. To enable progress on this task, we create a new dataset consisting of 220K human-annotated 2.5D relationships among 512K objects from 11K images. We analyze this dataset and conduct extensive experiments including benchmarking multiple state-of-the-art VRD models on this task. Our results show that existing models largely rely on semantic cues and simple...
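
One way to see how 2.5VRD differs from plain depth estimation is that the output is a set of pairwise, viewer-relative relations rather than a dense depth map. The sketch below derives such relations from per-object depths; the thresholding scheme, function name, and relation labels are illustrative assumptions, not the dataset's actual annotation protocol or any of the benchmarked models.

```python
# Hypothetical sketch: deriving egocentric relative-depth relations for object
# pairs from per-object median depths. Names, margin, and labels are
# illustrative assumptions only.
import itertools
import numpy as np

def relative_depth_relations(boxes, depth_map, margin=0.05):
    """boxes: list of (x1, y1, x2, y2); depth_map: HxW array of depth values
    measured from the camera. Returns a dict of pairwise relations."""
    med = []
    for (x1, y1, x2, y2) in boxes:
        patch = depth_map[int(y1):int(y2), int(x1):int(x2)]
        med.append(np.median(patch))
    relations = {}
    for i, j in itertools.combinations(range(len(boxes)), 2):
        if med[i] < med[j] * (1 - margin):
            relations[(i, j)] = "closer"     # object i is nearer the camera
        elif med[i] > med[j] * (1 + margin):
            relations[(i, j)] = "farther"
        else:
            relations[(i, j)] = "same-depth"
    return relations
```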

Relationship Detection Based on Object Semantic Inference and Attention Mechanisms

Proceedings of the 2019 on International Conference on Multimedia Retrieval

Detecting relations among objects is a crucial task for image understanding. However, each relationship involves a different pair of objects, and different object pairs express diverse interactions. This makes detecting relationships from visual features alone a challenging task. In this paper, we propose a simple yet effective relationship detection model based on object semantic inference and attention mechanisms. Our model is trained to detect relation triples such as <horse, carry, bag>. To overcome the high diversity of visual appearances, the semantic inference module and the visual features are combined to complement each other. We also introduce two different attention mechanisms, for object feature refinement and phrase feature refinement. To derive a more detailed and comprehensive representation for each object, the object feature refinement module refines the representation of each object by querying over all the other objects in the image. The phrase feature refinement module makes the phrase feature more effective and automatically focuses on relevant parts, improving the visual relationship detection task. We validate our model on the Visual Genome Relationship dataset. Our proposed model achieves competitive results compared to the state-of-the-art method MOTIFNET.
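
As a rough illustration of the object feature refinement idea (each object updated by querying all the other objects in the image), the following is a minimal self-attention sketch; the module layout, dimensions, and residual/normalization choices are assumptions rather than the authors' implementation.

```python
# Minimal sketch of attention-based object feature refinement: every object
# representation attends over all detected objects in the same image.
import torch
import torch.nn as nn

class ObjectFeatureRefiner(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_feats):
        # obj_feats: (batch, num_objects, dim)
        refined, _ = self.attn(obj_feats, obj_feats, obj_feats)
        return self.norm(obj_feats + refined)  # residual connection

refiner = ObjectFeatureRefiner()
objects = torch.randn(2, 10, 512)      # 2 images, 10 detected objects each
print(refiner(objects).shape)          # torch.Size([2, 10, 512])
```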

RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory

ArXiv, 2021

Visual relationship recognition (VRR) is a fundamental scene understanding task. The structure that VRR provides is essential to improving AI interpretability in downstream tasks such as image captioning and visual question answering. Several recent studies showed that the long-tail problem in VRR is even more critical than in object recognition due to its compositional complexity and structure. To overcome this limitation, we propose a novel transformer-based framework, dubbed RelTransformer, which performs relationship prediction using rich semantic features from multiple image levels. We assume that more abundant contextual features can generate more accurate and discriminative relationships, which is useful when sufficient training data are lacking. The key feature of our model is its ability to aggregate features from three different levels (local context, scene, and dataset-level) to compositionally predict the visual relationship. We evaluate our model on the visual ge...
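
The following is a loose sketch of what aggregating local-context, scene, and dataset-level (memory) features for relation prediction could look like; the concatenation-based fusion, the learned memory, and all dimensions are assumptions, not RelTransformer's actual architecture.

```python
# Illustrative multi-level relation head: fuses a pair-level feature, a global
# scene feature, and a read from a learned dataset-level memory.
import torch
import torch.nn as nn

class MultiLevelRelationHead(nn.Module):
    def __init__(self, dim=512, num_predicates=50, memory_slots=100):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(memory_slots, dim))  # dataset-level memory
        self.mem_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.classifier = nn.Linear(3 * dim, num_predicates)

    def forward(self, pair_feat, scene_feat):
        # pair_feat: (batch, dim)  local context of the subject-object pair
        # scene_feat: (batch, dim) global scene feature
        mem = self.memory.unsqueeze(0).expand(pair_feat.size(0), -1, -1)
        mem_read, _ = self.mem_attn(pair_feat.unsqueeze(1), mem, mem)
        fused = torch.cat([pair_feat, scene_feat, mem_read.squeeze(1)], dim=-1)
        return self.classifier(fused)  # predicate logits
```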

Large-Scale Visual Relationship Understanding

Proceedings of the AAAI Conference on Artificial Intelligence

Large-scale visual understanding is challenging, as it requires a model to handle the widely spread and imbalanced distribution of 〈subject, relation, object〉 triples. In real-world scenarios with large numbers of objects and relations, some are seen very commonly while others are barely seen. We develop a new relationship detection model that embeds objects and relations into two vector spaces where both discriminative capability and semantic affinity are preserved. We learn a visual and a semantic module that map features from the two modalities into a shared space, where matched pairs of features have to discriminate against unmatched pairs, but also maintain close distances to semantically similar ones. Benefiting from that, our model can achieve superior performance even when the visual entity categories scale up to more than 80,000, with an extremely skewed class distribution. We demonstrate the efficacy of our model on a large and imbalanced benchmark based on Visual Genome that...
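
A generic two-branch embedding with a batch-wise matching loss illustrates the idea of mapping visual and semantic features into a shared space where matched pairs discriminate against unmatched ones; the softmax-over-similarities objective below is a stand-in, not the paper's exact loss.

```python
# Sketch of a shared visual-semantic embedding: matched (visual, semantic)
# pairs should score higher than all unmatched pairs within the batch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, visual_dim=2048, semantic_dim=300, shared_dim=512):
        super().__init__()
        self.visual = nn.Linear(visual_dim, shared_dim)
        self.semantic = nn.Linear(semantic_dim, shared_dim)

    def forward(self, v, s):
        # v: (batch, visual_dim) region features; s: (batch, semantic_dim) word embeddings
        v = F.normalize(self.visual(v), dim=-1)
        s = F.normalize(self.semantic(s), dim=-1)
        return v, s

def matching_loss(v, s, temperature=0.1):
    logits = v @ s.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, targets)     # diagonal entries are the matched pairs
```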

From Saturation to Zero-Shot Visual Relationship Detection Using Local Context

2020

Visual relationship detection has been motivated by the “insufficiency of objects to describe rich visual knowledge”. However, we find that training and testing on current popular datasets may not support such statements; most approaches can be outperformed by a naive image-agnostic baseline that fuses language and spatial features. We visualize the errors of numerous existing detectors and discover that most of them are caused by the coexistence and penalization of antagonizing predicates that could describe the same interaction. Such annotations hurt the dataset’s causality, and models tend to overfit the dataset biases, resulting in a saturation of accuracy at artificially low levels. We construct a simple architecture and explore the effect of using language on generalization. Then, we introduce adaptive local-context-aware classifiers that are built on the fly based on the objects’ categories. To improve context awareness, we mine and learn predicate synonyms, i.e. different pr...
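
For reference, an image-agnostic baseline of the kind described above can be as simple as a classifier over word embeddings and box geometry that never sees the pixels; the particular geometric features and layer sizes here are assumptions.

```python
# Sketch of a language + spatial baseline: predicts the predicate from the
# subject/object word embeddings and their box geometry only.
import torch
import torch.nn as nn

def box_geometry(sub_box, obj_box):
    # Boxes given as (x, y, w, h); normalized offsets plus log size ratios.
    sx, sy, sw, sh = sub_box[:, 0], sub_box[:, 1], sub_box[:, 2], sub_box[:, 3]
    ox, oy, ow, oh = obj_box[:, 0], obj_box[:, 1], obj_box[:, 2], obj_box[:, 3]
    return torch.stack([(sx - ox) / ow, (sy - oy) / oh,
                        torch.log(sw / ow), torch.log(sh / oh)], dim=-1)

class LanguageSpatialBaseline(nn.Module):
    def __init__(self, embed_dim=300, num_predicates=50):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim + 4, 256), nn.ReLU(),
            nn.Linear(256, num_predicates))

    def forward(self, sub_emb, obj_emb, sub_box, obj_box):
        feat = torch.cat([sub_emb, obj_emb, box_geometry(sub_box, obj_box)], dim=-1)
        return self.mlp(feat)  # predicate logits, no image features used
```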

Visual Relationship Detection using Scene Graphs: A Survey

ArXiv, 2020

Understanding a scene by decoding the visual relationships depicted in an image has been a long-studied problem. While recent advances in deep learning and the use of deep neural networks have achieved near-human accuracy on many tasks, there still exists a substantial gap between human and machine-level performance on various visual relationship detection tasks. Building on earlier tasks like object recognition, segmentation, and captioning, which focused on relatively coarse image understanding, newer tasks have been introduced recently to deal with a finer level of image understanding. A scene graph is one such technique to better represent a scene and the various relationships present in it. With its wide range of applications in tasks like Visual Question Answering, Semantic Image Retrieval, and Image Generation, among many others, it has proved to be a useful tool for deeper and better visual relationship understanding. In this paper, we present a de...

Exploring Long Tail Visual Relationship Recognition with Large Vocabulary

2020

Several approaches have been proposed in the recent literature to alleviate the long-tail problem, mainly in object classification tasks. In this paper, we make the first large-scale study concerning the task of Long-Tail Visual Relationship Recognition (LTVRR). LTVRR aims at improving the learning of structured visual relationships that come from the long tail (e.g., “rabbit grazing on grass”). In this setup, the subject, relation, and object classes each follow a long-tail distribution. To begin our study and establish a future benchmark for the community, we introduce two LTVRR-related benchmarks, dubbed VG8K-LT and GQA-LT, built upon the widely used Visual Genome and GQA datasets. We use these benchmarks to study the performance of several state-of-the-art long-tail models on the LTVRR setup. Lastly, we propose a visiolinguistic hubless (VilHub) loss and a Mixup augmentation technique adapted to the LTVRR setup, dubbed RelMix. Both VilHub and RelMix can be easily integrated on top of existi...
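
Since the excerpt does not spell out RelMix, the snippet below shows only a generic mixup applied to relation-level features and predicate labels, as a purely illustrative stand-in for how such an augmentation might be adapted to relationship samples.

```python
# Generic mixup over relation samples: convexly combine features and one-hot
# predicate labels from randomly paired examples in the batch.
import numpy as np
import torch

def mixup_relations(feats, labels, num_classes, alpha=0.2):
    # feats: (batch, dim) subject-predicate-object features
    # labels: (batch,) predicate class indices
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(feats.size(0))
    mixed_feats = lam * feats + (1 - lam) * feats[perm]
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    mixed_labels = lam * one_hot + (1 - lam) * one_hot[perm]
    return mixed_feats, mixed_labels
```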

Scenes and Surroundings: Scene Graph Generation using Relation Transformer

2021

The identification of objects in an image, together with their mutual relationships as a scene graph, can lead to a deep understanding of image content. Despite recent advances in deep learning, the detection and labeling of visual object relationships remain a challenging task. In this work, we propose a novel local-context-aware relation transformer architecture that also exploits complex global object-to-object and object-to-edge interactions. Our hierarchical multi-head attention-based approach efficiently captures dependencies between objects and predicts contextual relationships. Compared to state-of-the-art approaches, we achieve an overall mean improvement of 4.85% and set a new benchmark across all scene graph generation tasks on the Visual Genome dataset.
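
As a loose sketch of the object-to-object and object-to-edge interactions mentioned above, the block below lets edge (relation) features attend to refined node (object) features with multi-head attention; the layer layout and dimensions are assumptions, not the proposed architecture.

```python
# Sketch of one attention block over a scene graph's nodes and edges:
# objects first attend to objects, then edges attend to the refined objects.
import torch
import torch.nn as nn

class EdgeNodeAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.node_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.edge_to_node = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_n = nn.LayerNorm(dim)
        self.norm_e = nn.LayerNorm(dim)

    def forward(self, nodes, edges):
        # nodes: (batch, num_objects, dim); edges: (batch, num_pairs, dim)
        n, _ = self.node_self(nodes, nodes, nodes)          # object-to-object
        nodes = self.norm_n(nodes + n)
        e, _ = self.edge_to_node(edges, nodes, nodes)       # object-to-edge
        edges = self.norm_e(edges + e)
        return nodes, edges
```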

Evaluating the progress of deep learning for visual relational concepts

Journal of Vision, 2021

Convolutional Neural Networks (CNNs) have become the state-of-the-art method for image classification over the last ten years. Despite achieving superhuman classification accuracy on many popular datasets, they often perform much worse on more abstract image classification tasks. We show that these difficult tasks are linked to relational concepts from cognitive psychology and that, despite progress over the last few years, such relational reasoning tasks remain difficult for current neural network architectures. We review deep learning research that is linked to relational concept learning, even if it was not originally presented from this angle. Reviewing the current literature, we argue that some form of attention will be an important component of future systems for solving relational tasks. In addition, we point out the shortcomings of currently used datasets and recommend steps to make future datasets more relevant for testing systems on relational reasoning.