RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory

RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition

2021

generalizes well on many VRR benchmarks. Our model outperforms the best-performing models on two large-scale long-tail VRR benchmarks, VG8K-LT (+2.0% overall acc) and GQA-LT (+26.0% overall acc), both having a highly skewed distribution towards the tail. It also achieves strong results on the VG200 relation detection task. Our code is available at https://github.com/Vision-CAIR/RelTransformer.

Large-Scale Visual Relationship Understanding

Proceedings of the AAAI Conference on Artificial Intelligence

Large scale visual understanding is challenging, as it requires a model to handle the widely-spread and imbalanced distribution of 〈subject, relation, object〉 triples. In real-world scenarios with large numbers of objects and relations, some are seen very commonly while others are barely seen. We develop a new relationship detection model that embeds objects and relations into two vector spaces where both discriminative capability and semantic affinity are preserved. We learn a visual and a semantic module that map features from the two modalities into a shared space, where matched pairs of features have to discriminate against those unmatched, but also maintain close distances to semantically similar ones. Benefiting from that, our model can achieve superior performance even when the visual entity categories scale up to more than 80,000, with extremely skewed class distribution. We demonstrate the efficacy of our model on a large and imbalanced benchmark based on Visual Genome that...
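As a rough illustration of the shared visual-semantic embedding objective this abstract describes, the sketch below projects visual features and class word embeddings into a common space and contrasts matched pairs against unmatched ones; the projection heads, dimensions, and temperature are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Maps visual features and class word embeddings into one shared space.

    Hypothetical sketch: layer sizes and projection heads are assumptions,
    not the paper's actual architecture.
    """
    def __init__(self, visual_dim=2048, word_dim=300, embed_dim=512):
        super().__init__()
        self.visual_proj = nn.Sequential(
            nn.Linear(visual_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )
        self.semantic_proj = nn.Linear(word_dim, embed_dim)

    def forward(self, visual_feats, class_word_embs):
        v = F.normalize(self.visual_proj(visual_feats), dim=-1)       # (B, D)
        c = F.normalize(self.semantic_proj(class_word_embs), dim=-1)  # (C, D)
        return v @ c.t()                                              # (B, C) similarity logits

def matching_loss(logits, labels, temperature=0.1):
    # Matched pairs must score higher than unmatched ones (softmax contrast);
    # semantically similar classes remain close because their word embeddings are.
    return F.cross_entropy(logits / temperature, labels)
```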

Contrastive Visual and Language Translational Embeddings for Visual Relationship Detection

2022

Visual relationship detection aims to understand real-world interactions between object pairs by detecting visual relation triples written in the form of (subject, predicate, object). Previous work has explored the use of contrastive learning to generate joint visual and language embeddings that aid the detection of both seen and unseen visual relation triples. However, these contrastive approaches often learned the mapping functions implicitly and did not fully consider the underlying structure of visual relation triples, limiting the models’ use cases and their ability to generalize to unseen compositions. This ongoing work aims to construct joint visual and language embedding models that can capture such hierarchical structure between objects and predicates by explicitly imposing structural loss constraints. In this short paper, we propose VLTransE, a novel embedding model that applies translational loss in conjunction with the visual-language contrastive loss to learn transferab...
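A minimal sketch of the translational component suggested by the abstract (TransE-style: subject + predicate should land near object); the margin and in-batch negative sampling are assumptions, and VLTransE additionally combines such a term with a visual-language contrastive loss.

```python
import torch
import torch.nn.functional as F

def translational_loss(subj_emb, pred_emb, obj_emb, margin=1.0):
    """Margin-based translational objective over embedded triples.

    Sketch of the translational term only; negative sampling here simply
    corrupts the object by shuffling within the batch, which is an assumption.
    """
    # Positive distance: || s + p - o ||
    pos = (subj_emb + pred_emb - obj_emb).norm(p=2, dim=-1)
    # Negative distance with a corrupted object.
    perm = torch.randperm(obj_emb.size(0), device=obj_emb.device)
    neg = (subj_emb + pred_emb - obj_emb[perm]).norm(p=2, dim=-1)
    return F.relu(margin + pos - neg).mean()
```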

Exploring Long Tail Visual Relationship Recognition with Large Vocabulary

2020

Several approaches have been proposed in recent literature to alleviate the long-tail problem, mainly in object classification tasks. In this paper, we make the first large-scale study concerning the task of Long-Tail Visual Relationship Recognition (LTVRR). LTVRR aims at improving the learning of structured visual relationships that come from the long tail (e.g., “rabbit grazing on grass”). In this setup, the subject, relation, and object classes each follow a long-tail distribution. To begin our study and establish a benchmark for the community, we introduce two LTVRR-related benchmarks, dubbed VG8K-LT and GQA-LT, built upon the widely used Visual Genome and GQA datasets. We use these benchmarks to study the performance of several state-of-the-art long-tail models on the LTVRR setup. Lastly, we propose a visiolinguistic hubless (VilHub) loss and a Mixup augmentation technique adapted to the LTVRR setup, dubbed RelMix. Both VilHub and RelMix can be easily integrated on top of existi...
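Since RelMix is described here only as a mixup-style augmentation adapted to the LTVRR setup, the sketch below shows plain feature-level mixup as a point of reference; the triple-aware details of RelMix itself are not reproduced and the hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def mixup_relation_features(feats, labels, num_classes, alpha=0.2):
    """Plain feature-level mixup, shown only to illustrate the general idea;
    RelMix adapts mixup to subject/relation/object triples and the long-tail
    setting, so the details here are an assumption, not the paper's method.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(feats.size(0), device=feats.device)
    mixed_feats = lam * feats + (1 - lam) * feats[perm]
    y = F.one_hot(labels, num_classes).float()
    mixed_labels = lam * y + (1 - lam) * y[perm]
    return mixed_feats, mixed_labels
```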

From Saturation to Zero-Shot Visual Relationship Detection Using Local Context

2020

Visual relationship detection has been motivated by the “insufficiency of objects to describe rich visual knowledge”. However, we find that training and testing on current popular datasets may not support such statements; most approaches can be outperformed by a naive image-agnostic baseline that fuses language and spatial features. We visualize the errors of numerous existing detectors, to discover that most of them are caused by the coexistence and penalization of antagonizing predicates that could describe the same interaction. Such annotations hurt the dataset’s causality and models tend to overfit the dataset biases, resulting in a saturation of accuracy to artificially low levels. We construct a simple architecture and explore the effect of using language on generalization. Then, we introduce adaptive local-context-aware classifiers, that are built on-the-fly based on the objects’ categories. To improve context awareness, we mine and learn predicate synonyms, i.e. different pr...
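The "adaptive local-context-aware classifiers built on-the-fly based on the objects' categories" could, for instance, be realized by generating predicate classifier weights from subject/object category embeddings, as in the hedged sketch below; the hypernetwork layout and dimensions are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AdaptivePredicateClassifier(nn.Module):
    """Builds a predicate classifier on the fly from the subject/object categories.

    Hypothetical illustration of context-conditioned classification; layer
    sizes and the conditioning scheme are assumptions.
    """
    def __init__(self, num_classes, num_predicates, word_dim=300, feat_dim=1024):
        super().__init__()
        self.class_emb = nn.Embedding(num_classes, word_dim)
        # Hypernetwork: (subject, object) embeddings -> per-predicate weight vectors.
        self.weight_gen = nn.Linear(2 * word_dim, num_predicates * feat_dim)
        self.num_predicates = num_predicates
        self.feat_dim = feat_dim

    def forward(self, pair_feats, subj_ids, obj_ids):
        ctx = torch.cat([self.class_emb(subj_ids), self.class_emb(obj_ids)], dim=-1)
        w = self.weight_gen(ctx).view(-1, self.num_predicates, self.feat_dim)
        # Per-sample logits: dot product between pair features and generated weights.
        return torch.einsum('bd,bpd->bp', pair_feats, w)
```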

Relationship Detection Based on Object Semantic Inference and Attention Mechanisms

Proceedings of the 2019 on International Conference on Multimedia Retrieval

Detecting relations among objects is a crucial task for image understanding. However, each relationship involves a different object pair combination, and different object pair combinations express diverse interactions. This makes recognizing relationships from visual features alone a challenging task. In this paper, we propose a simple yet effective relationship detection model based on object semantic inference and attention mechanisms. Our model is trained to detect relation triples such as <horse, carry, bag>. To overcome the high diversity of visual appearances, the semantic inference module and the visual features are combined to complement each other. We also introduce two different attention mechanisms for object feature refinement and phrase feature refinement. To derive a more detailed and comprehensive representation for each object, the object feature refinement module refines the representation of each object by querying over all the other objects in the image. The phrase feature refinement module is proposed to make the phrase feature more effective and to automatically focus on relevant parts, improving visual relationship detection. We validate our model on the Visual Genome Relationship dataset, where it achieves competitive results compared to the state-of-the-art method MOTIFNET.
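A minimal sketch of the object feature refinement step described above, implemented here as standard multi-head self-attention over all objects in an image; the head count, dimensions, and residual/normalization layout are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class ObjectFeatureRefinement(nn.Module):
    """Refines each object's representation by attending over all other objects."""

    def __init__(self, feat_dim=1024, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, obj_feats):
        # obj_feats: (batch, num_objects, feat_dim)
        refined, _ = self.attn(obj_feats, obj_feats, obj_feats)
        return self.norm(obj_feats + refined)  # residual connection
```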

Scenes and Surroundings: Scene Graph Generation using Relation Transformer

2021

The identification of objects in an image, together with their mutual relationships as a scene graph, can lead to a deep understanding of image content. Despite recent advances in deep learning, the detection and labeling of visual object relationships remain a challenging task. In this work, a novel local-context-aware relation transformer architecture is proposed that also exploits complex global object-to-object and object-to-edge interactions. Our hierarchical multi-head attention-based approach efficiently captures dependencies between objects and predicts contextual relationships. In comparison to state-of-the-art approaches, we achieve an overall mean improvement of 4.85% and set a new benchmark across all scene graph generation tasks on the Visual Genome dataset.
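The object-to-edge interactions mentioned above can be sketched as cross-attention from edge (relation) features to object (node) features, as below; this is an illustration with assumed dimensions and layer layout, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ObjectEdgeInteraction(nn.Module):
    """Lets edge (relation) features attend to object (node) features."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, edge_feats, node_feats):
        # edge_feats: (B, num_edges, dim), node_feats: (B, num_objects, dim)
        attended, _ = self.cross_attn(edge_feats, node_feats, node_feats)
        x = self.norm1(edge_feats + attended)
        return self.norm2(x + self.ffn(x))
```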

Image Semantic Relation Generation

arXiv (Cornell University), 2022

Scene graphs provide structured semantic understanding beyond images. For downstream tasks such as image retrieval, visual question answering, visual relationship detection, and even autonomous vehicle technology, scene graphs can not only distil complex image information but also correct the bias of visual models using semantic-level relations, which has broad application prospects. However, the heavy labour cost of constructing graph annotations may hinder the application of PSG in practical scenarios. Inspired by the observation that people usually identify the subject and object first and then determine the relationship between them, we propose to decouple the scene graph generation task into two sub-tasks: 1) an image segmentation task that picks out the qualified objects, and 2) a restricted auto-regressive text generation task that generates the relation between given objects. In this work, we therefore introduce image semantic relation generation (ISRG), a simple but effective image-to-text model, which achieves 31 points on the OpenPSG dataset and outperforms strong baselines by 16 points (ResNet-50) and 5 points (CLIP), respectively.
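The "restricted auto-regressive text generation" step can be illustrated by masking the decoder's vocabulary so that only relation tokens can be produced at each step; the sketch below is a hypothetical decoding helper, not ISRG's actual constraint scheme.

```python
import torch

def restricted_decode_step(logits, allowed_token_ids):
    """Greedy next-token selection restricted to a relation vocabulary.

    logits: (batch, vocab_size) scores for the next token.
    allowed_token_ids: 1-D tensor of token ids permitted at this step.
    Hypothetical illustration of constrained decoding.
    """
    mask = torch.full_like(logits, float('-inf'))
    mask[:, allowed_token_ids] = 0.0       # leave allowed tokens untouched
    return (logits + mask).argmax(dim=-1)  # all other tokens are masked out
```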

Long Tail Visual Relationship Recognition with Hubless Regularized Relmix

arXiv: Computer Vision and Pattern Recognition, 2020

Several approaches have been proposed in recent literature to alleviate the long-tail problem, mostly in the object classification task. We propose to study the task of Long-Tail Visual Relationship Recognition (LTVRR), which aims at generalizing on the structured long-tail distribution of visual relationships (e.g., "rabbit grazing on grass"). In this setup, the subject, relation, and object classes individually follow a long-tail distribution. We first introduce two large-scale long-tail visual relationship recognition benchmarks to study this task, dubbed VG8K-LT (5330 objects, 2000 relationships) and GQA-LT (1703 objects, 310 relations). VG8K-LT and GQA-LT are built upon the widely used Visual Genome and GQA datasets. In contrast to existing benchmarks, some classes appear at a very low frequency (1-14 examples). We use these benchmarks to study the performance of several state-of-the-art long-tail models on the LTVRR setup. We developed a visiolinguistic hubless (ViLHub)...

Evaluating the progress of deep learning for visual relational concepts

Journal of Vision, 2021

Convolutional Neural Networks (CNNs) have become the state-of-the-art method for image classification in the last ten years. Despite the fact that they achieve superhuman classification accuracy on many popular datasets, they often perform much worse on more abstract image classification tasks. We will show that these difficult tasks are linked to relational concepts from cognitive psychology and that, despite progress over the last few years, such relational reasoning tasks still remain difficult for current neural network architectures. We will review deep learning research that is linked to relational concept learning, even if it was not originally presented from this angle. Reviewing the current literature, we will argue that some form of attention will be an important component of future systems to solve relational tasks. In addition, we will point out the shortcomings of currently used datasets, and we will recommend steps to make future datasets more relevant for testing systems on relational reasoning.