VQA-LOL: Visual Question Answering under the Lens of Logic

IQ-VQA: Intelligent Visual Question Answering

2020

Even though there has been tremendous progress in the field of Visual Question Answering, models today still tend to be inconsistent and brittle. To this end, we propose a model-independent cyclic framework which increases the consistency and robustness of any VQA architecture. We train our models to answer the original question, generate an implication based on the answer, and then also learn to answer the generated implication correctly. As part of the cyclic framework, we propose a novel implication generator which can generate implied questions from any question-answer pair. As a baseline for future work on consistency, we provide a new human-annotated VQA-Implications dataset. The dataset consists of ~30k questions containing implications of three types - Logical Equivalence, Necessary Condition and Mutual Exclusion - made from the VQA v2.0 validation dataset. We show that our framework improves consistency of VQA models by ~15% on the rule-based dataset, ~7% on VQA-Implications data...
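A minimal sketch, assuming a PyTorch setup, of how a cyclic objective of this kind could be wired together: answer the question, generate an implication from the predicted answer, then answer the implication and penalise inconsistency. The module names, sizes, and loss weighting below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToyVQA(nn.Module):
    """Stand-in VQA model: fuses image and question vectors, predicts an answer."""
    def __init__(self, dim=256, num_answers=1000):
        super().__init__()
        self.fuse = nn.Linear(dim * 2, dim)
        self.classify = nn.Linear(dim, num_answers)

    def forward(self, img_feat, q_feat):
        return self.classify(torch.relu(self.fuse(torch.cat([img_feat, q_feat], dim=-1))))

class ToyImplicationGenerator(nn.Module):
    """Stand-in generator: maps a (question, answer) pair to an implied-question embedding."""
    def __init__(self, dim=256, num_answers=1000):
        super().__init__()
        self.answer_embed = nn.Embedding(num_answers, dim)
        self.generate = nn.Linear(dim * 2, dim)

    def forward(self, q_feat, answer_idx):
        a_feat = self.answer_embed(answer_idx)
        return torch.tanh(self.generate(torch.cat([q_feat, a_feat], dim=-1)))

vqa, gen = ToyVQA(), ToyImplicationGenerator()
img_feat = torch.randn(8, 256)                 # precomputed image features (batch of 8)
q_feat = torch.randn(8, 256)                   # encoded original questions
answer_gt = torch.randint(0, 1000, (8,))
implication_gt = torch.randint(0, 1000, (8,))  # e.g. yes/no label for the implied question

# 1) answer the original question
logits = vqa(img_feat, q_feat)
loss_vqa = nn.functional.cross_entropy(logits, answer_gt)

# 2) generate an implied question from the predicted answer
imp_q_feat = gen(q_feat, logits.argmax(dim=-1))

# 3) answer the implied question and penalise inconsistency
imp_logits = vqa(img_feat, imp_q_feat)
loss = loss_vqa + 0.5 * nn.functional.cross_entropy(imp_logits, implication_gt)  # 0.5 is an assumed weight
loss.backward()
```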

A Literature Survey on Image Linguistic Visual Question Answering

IRJET, 2022

VQA is a task in which, given text-based questions about an image, the system must infer the answer to each question by picking from multiple choices. Many of the VQA systems that have recently been developed contain attention or memory components that facilitate reasoning. This paper aims to develop a model that achieves higher performance than the current state-of-the-art solutions. It also questions the value of these common practices and aims to develop a simple alternative. We will explore the different existing models and develop a custom model to overcome the shortcomings of the existing solutions. We will benchmark different models on the Visual7W dataset.

'Just because you are right, doesn't mean I am wrong': Overcoming a Bottleneck in the Development and Evaluation of Open-Ended Visual Question Answering (VQA) Tasks

2021

GQA (Hudson and Manning, 2019) is a dataset for real-world visual reasoning and compositional question answering. We found that many answers predicted by the best vision-language models on the GQA dataset do not match the ground-truth answer but are still semantically meaningful and correct in the given context. In fact, this is the case with most existing visual question answering (VQA) datasets, which assume only one ground-truth answer for each question. We propose Alternative Answer Sets (AAS) of ground-truth answers to address this limitation, created automatically using off-the-shelf NLP tools. We introduce a semantic metric based on AAS and modify top VQA solvers to support multiple plausible answers for a question. We implement this approach on the GQA dataset and show the performance improvements.
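A minimal sketch of an Alternative-Answer-Set style metric, assuming the AAS is available as a dict mapping each ground-truth answer to a set of acceptable alternatives; the example sets below are invented for illustration, not the paper's data.

```python
def aas_accuracy(predictions, ground_truths, aas):
    """Count a prediction as correct if it matches the ground truth
    or any answer in that ground truth's alternative answer set."""
    correct = 0
    for pred, gt in zip(predictions, ground_truths):
        acceptable = {gt} | aas.get(gt, set())
        if pred in acceptable:
            correct += 1
    return correct / len(predictions)

aas = {"couch": {"sofa", "loveseat"}, "cup": {"mug"}}
preds = ["sofa", "mug", "dog"]
gts = ["couch", "cup", "cat"]
print(aas_accuracy(preds, gts, aas))  # 2/3 ~= 0.67
```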

Coarse-to-Fine Reasoning for Visual Question Answering

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Bridging the semantic gap between image and question is an important step to improve the accuracy of the Visual Question Answering (VQA) task. However, most of the existing VQA methods focus on attention mechanisms or visual relations for reasoning the answer, while the features at different semantic levels are not fully utilized. In this paper, we present a new reasoning framework to fill the gap between visual features and semantic clues in the VQA task. Our method first extracts the features and predicates from the image and question. We then propose a new reasoning framework to effectively jointly learn these features and predicates in a coarse-to-fine manner. Extensive experimental results on three large-scale VQA datasets show that our proposed approach achieves superior accuracy compared with other state-of-the-art methods. Furthermore, our reasoning framework also provides an explainable way to understand the decision of the deep neural network when predicting the answer. Our source code and trained models are available at https://github.com/aioz-ai/CRF_VQA
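A minimal sketch of a coarse-to-fine fusion step, assuming two levels of visual features (a coarse image-level vector and a finer region-level grid) are combined with the question in sequence; this only illustrates the general idea of staged reasoning, not the authors' architecture.

```python
import torch
import torch.nn as nn

dim = 512
q = torch.randn(1, dim)          # question feature
coarse = torch.randn(1, dim)     # coarse, image-level feature
fine = torch.randn(1, 36, dim)   # fine, region-level features

# stage 1: coarse reasoning produces a context vector
stage1 = nn.Linear(dim * 2, dim)
ctx = torch.relu(stage1(torch.cat([q, coarse], dim=-1)))

# stage 2: the context guides attention over the fine-grained regions
attn = torch.softmax((fine * ctx.unsqueeze(1)).sum(-1), dim=-1)  # (1, 36)
refined = (attn.unsqueeze(-1) * fine).sum(1)                     # (1, dim)

answer_logits = nn.Linear(dim * 2, 1000)(torch.cat([ctx, refined], dim=-1))
print(answer_logits.shape)  # torch.Size([1, 1000])
```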

Question-Agnostic Attention for Visual Question Answering

2020 25th International Conference on Pattern Recognition (ICPR), 2021

Visual Question Answering (VQA) models employ attention mechanisms to discover image locations that are most relevant for answering a specific question. For this purpose, several multimodal fusion strategies have been proposed, ranging from relatively simple operations (e.g. linear sum) to more complex ones (e.g. Block [1]). The resulting multimodal representations define an intermediate feature space for capturing the interplay between visual and semantic features that is helpful in selectively focusing on image content. In this paper, we propose a question-agnostic attention mechanism that is complementary to the existing question-dependent attention mechanisms. Our proposed model parses object instances to obtain an ‘object map’ and applies this map on the visual features to generate Question-Agnostic Attention (QAA) features. In contrast to question-dependent attention approaches that are learned end-to-end, the proposed QAA does not involve question-specific training, and can b...
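A minimal sketch of a question-agnostic attention step in the spirit described above: a binary object map (e.g. produced by an instance-segmentation model) is applied to a grid of visual features so that pooling focuses on object regions. The shapes and the pooling choice are illustrative assumptions.

```python
import torch

visual_feats = torch.randn(1, 2048, 14, 14)             # CNN feature grid: (batch, channels, H, W)
object_map = (torch.rand(1, 1, 14, 14) > 0.5).float()   # 1 where an object instance was parsed

# mask the grid so only object locations contribute, then average-pool over them
masked = visual_feats * object_map
qaa_feature = masked.sum(dim=(2, 3)) / object_map.sum(dim=(2, 3)).clamp(min=1.0)
print(qaa_feature.shape)  # torch.Size([1, 2048]) -- question-agnostic image descriptor
```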

Question-Guided Hybrid Convolution for Visual Question Answering

Computer Vision – ECCV 2018, 2018

In this paper, we propose a novel Question-Guided Hybrid Convolution (QGHC) network for Visual Question Answering (VQA). Most state-of-the-art VQA methods fuse the high-level textual and visual features from the neural network and abandon the visual spatial information when learning multi-modal features. To address these problems, question-guided kernels generated from the input question are designed to convolve with visual features for capturing the textual and visual relationship in the early stage. The question-guided convolution tightly couples the textual and visual information but also introduces more parameters when learning kernels. We apply group convolution, which consists of question-independent kernels and question-dependent kernels, to reduce the parameter size and alleviate over-fitting. The hybrid convolution can generate discriminative multi-modal features with fewer parameters. The proposed approach is also complementary to existing bilinear pooling fusion and attention-based VQA methods. By integrating with them, our method can further boost performance. Experiments on VQA datasets validate the effectiveness of QGHC.
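A minimal sketch of question-guided convolution: a linear layer predicts convolution kernels from the question embedding, and those kernels are applied to the visual feature map alongside ordinary question-independent kernels, as a simple stand-in for the grouped design described above. All sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

q_dim, in_ch, out_ch, k = 300, 64, 32, 3

q_feat = torch.randn(1, q_dim)           # encoded question
visual = torch.randn(1, in_ch, 14, 14)   # visual feature map

# question-dependent branch: kernels generated from the question
kernel_predictor = nn.Linear(q_dim, out_ch * in_ch * k * k)
q_kernels = kernel_predictor(q_feat).view(out_ch, in_ch, k, k)
q_guided = F.conv2d(visual, q_kernels, padding=1)

# question-independent branch: ordinary learned kernels
static_out = nn.Conv2d(in_ch, out_ch, k, padding=1)(visual)

# concatenate the two groups into one multi-modal feature map
fused = torch.cat([q_guided, static_out], dim=1)
print(fused.shape)  # torch.Size([1, 64, 14, 14])
```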

Visual Question Answering using Convolutional Neural Networks

Turkish Journal of Computer and Mathematics Education (TURCOMAT)

The ability of a computer system to understand its surroundings and elements and to think like a human being to process the information has always been a major point of focus in the field of Computer Science. One of the ways to achieve this artificial intelligence is Visual Question Answering. Visual Question Answering (VQA) is a trained system which can answer questions associated with a given image in natural language. VQA is a generalized system which can be used in any image-based scenario with adequate training on the relevant data. This is achieved with the help of neural networks, particularly the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN). In this study, we compare different approaches to VQA, among which we explore a CNN-based model. With the continued progress in the field of Computer Vision and question answering systems, Visual Question Answering is becoming the essential system which can handle multiple scenarios with their re...
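A minimal sketch of the classic CNN + RNN VQA pipeline this kind of study refers to: a CNN encodes the image, an LSTM encodes the question, the two are fused and classified over a fixed answer vocabulary. The layer sizes and the fusion choice (element-wise product) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size=10000, num_answers=1000, dim=512):
        super().__init__()
        self.img_proj = nn.Linear(2048, dim)             # project precomputed CNN features
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, dim, batch_first=True)  # question encoder
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, img_feat, question_tokens):
        img = torch.tanh(self.img_proj(img_feat))
        _, (h, _) = self.lstm(self.embed(question_tokens))
        return self.classifier(img * h[-1])              # element-wise fusion, then classify

model = SimpleVQA()
img_feat = torch.randn(4, 2048)               # e.g. pooled ResNet features
question = torch.randint(0, 10000, (4, 12))   # padded token ids
print(model(img_feat, question).shape)        # torch.Size([4, 1000])
```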

Dual Recurrent Attention Units for Visual Question Answering

2018

Visual Question Answering (VQA) requires AI models to comprehend data in two domains, vision and text. Current state-of-the-art models use learned attention mechanisms to extract relevant information from the input domains to answer a certain question. Thus, robust attention mechanisms are essential for powerful VQA models. In this paper, we propose a recurrent attention mechanism and show its benefits compared to the traditional convolutional approach. We perform two ablation studies to evaluate recurrent attention. First, we introduce a baseline VQA model with visual attention and test the performance difference between convolutional and recurrent attention on the VQA 2.0 dataset. Secondly, we design an architecture for VQA which utilizes dual (textual and visual) Recurrent Attention Units (RAUs). Using this model, we show the effect of all possible combinations of recurrent and convolutional dual attention. Our single model outperforms the first-place winner of the VQA 2016 challenge and, to the best of our knowledge, is the second-best performing single model on the VQA 1.0 dataset. Furthermore, our model noticeably improves upon the winner of the VQA 2017 challenge. Moreover, we experiment with replacing attention mechanisms in state-of-the-art models with our RAUs and show increased performance.
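A minimal sketch of a recurrent attention unit in the spirit described above: a GRU cell walks over the image regions, conditioned on the question, and emits one attention score per region; the scores then weight the region features. The exact recurrence and sizes here are illustrative assumptions, not the paper's RAU.

```python
import torch
import torch.nn as nn

class RecurrentAttention(nn.Module):
    def __init__(self, region_dim=2048, q_dim=512, hidden=512):
        super().__init__()
        self.cell = nn.GRUCell(region_dim + q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, regions, q_feat):
        # regions: (batch, num_regions, region_dim), q_feat: (batch, q_dim)
        b, n, _ = regions.shape
        h = regions.new_zeros(b, self.cell.hidden_size)
        scores = []
        for i in range(n):  # recur over regions instead of scoring them independently
            h = self.cell(torch.cat([regions[:, i], q_feat], dim=-1), h)
            scores.append(self.score(h))
        attn = torch.softmax(torch.cat(scores, dim=1), dim=1)   # (batch, num_regions)
        return (attn.unsqueeze(-1) * regions).sum(dim=1)        # attended visual feature

rau = RecurrentAttention()
out = rau(torch.randn(2, 36, 2048), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 2048])
```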

Non-monotonic Logical Reasoning and Deep Learning for Explainable Visual Question Answering

Proceedings of the 6th International Conference on Human-Agent Interaction, 2018

State-of-the-art visual question answering (VQA) methods rely heavily on deep network architectures. These methods require a large labeled dataset for training, which is not available in many domains. Also, it is difficult to explain the working of deep networks learned from such datasets. Towards addressing these limitations, this paper describes an architecture inspired by research in cognitive systems that integrates commonsense logical reasoning with deep learning algorithms. In the context of answering explanatory questions about scenes and the underlying classification problems, the architecture uses deep networks for processing images and for generating answers to queries. Between these deep networks, it embeds components for non-monotonic logical reasoning with incomplete commonsense domain knowledge and for decision tree induction. Experimental results show that this architecture outperforms an architecture based only on deep networks when the training dataset is small, provides comparable performance on larger datasets, and provides intuitive answers to explanatory questions.
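A minimal sketch of one piece of a pipeline like the one described above: inducing a decision tree over attributes extracted by a vision network so that answers to explanatory questions can be traced to readable rules. The attributes, labels, and tree settings are invented for illustration, and the paper's non-monotonic reasoning component (Answer Set Programming) is not reproduced here.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# each row: binary attributes predicted by the vision network, e.g. [is_red, is_round, on_table]
attributes = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 0]]
labels = ["apple", "book", "ball", "unknown"]  # scene-level classifications

# induce a shallow, human-readable tree over the extracted attributes
tree = DecisionTreeClassifier(max_depth=3).fit(attributes, labels)
print(export_text(tree, feature_names=["is_red", "is_round", "on_table"]))
```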