Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Show and tell: A neural image caption generator

2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU score improvements on Flickr30k, from 55 to 66, and on SBU, from 19 to 27.
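
As a rough illustration of this encoder-decoder formulation (not the authors' code), the sketch below pairs a stand-in torchvision CNN with an LSTM decoder and trains it with cross-entropy, i.e. maximizing the log-likelihood of the target caption. The backbone, feature sizes, and token handling are assumptions made for the example.

```python
# Illustrative sketch: a CNN encoder feeding an LSTM decoder trained to
# maximize the likelihood of the target caption (teacher forcing).
import torch
import torch.nn as nn
import torchvision.models as models

class ShowAndTellSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50(weights="IMAGENET1K_V1")  # stand-in image encoder
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier
        self.img_proj = nn.Linear(2048, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image feature is fed as the first "token" of the sequence.
        feats = self.encoder(images).flatten(1)          # (B, 2048)
        img_token = self.img_proj(feats).unsqueeze(1)    # (B, 1, E)
        word_emb = self.embed(captions[:, :-1])          # previous words as input
        inputs = torch.cat([img_token, word_emb], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                          # logits over the vocabulary

# Training objective: cross-entropy = negative log-likelihood of the caption, e.g.
# loss = nn.CrossEntropyLoss(ignore_index=pad_id)(logits.flatten(0, 1), captions.flatten())
```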

Image Caption Generator Using Attention Based Neural Networks

International Journal for Research in Applied Science and Engineering Technology, 2023

Image caption generation is a method used to create sentences that describe the scene depicted in a given image. The process includes identifying objects within the image, carrying out various operations, and identifying the most important features of the image. Once the system has identified this information, it generates the most relevant and concise description of the image, which is both grammatically and semantically correct. With the progress in deep-learning techniques, algorithms are able to generate text in the form of natural sentences that can effectively describe an image. However, replicating the natural human ability to comprehend image content and produce descriptive text is a difficult task for machines. The uses of image captioning are vast and of great significance, as it involves creating succinct captions utilizing a variety of techniques such as Natural Language Processing (NLP), Computer Vision (CV), and Deep Learning (DL). The current study presents a system that employs an attention mechanism, in addition to an encoder and a decoder, to generate captions. It utilizes a pretrained CNN, Inception V3, to extract features from the image and an RNN, GRU, to produce a relevant caption. The attention mechanism used in this model is Bahdanau attention, and the Flickr-8K dataset is utilized for training the model. The results demonstrate the model's capability to understand images and generate text in a reasonable manner.
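
The additive (Bahdanau) attention the study relies on can be sketched as follows; the module, shapes, and names here are illustrative assumptions rather than the authors' implementation, with generic encoder regions standing in for Inception V3 feature-map locations.

```python
# Minimal sketch of Bahdanau (additive) attention over CNN feature maps,
# assuming encoder outputs of shape (B, L, D) and a GRU decoder hidden
# state of shape (B, H). All sizes and names are illustrative.
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.W_feat = nn.Linear(feat_dim, attn_dim)
        self.W_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        # features: (B, L, D), hidden: (B, H)
        scores = self.v(torch.tanh(
            self.W_feat(features) + self.W_hidden(hidden).unsqueeze(1)))  # (B, L, 1)
        alpha = torch.softmax(scores, dim=1)        # attention weights over regions
        context = (alpha * features).sum(dim=1)     # (B, D) weighted sum of regions
        return context, alpha.squeeze(-1)

# At each decoding step, `context` is typically concatenated with the previous
# word embedding and fed to the GRU that predicts the next caption word.
```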

Image captioning model using attention and object features to mimic human image understanding

Journal of Big Data, 2022

Image captioning spans the fields of computer vision and natural language processing. The image captioning task generalizes object detection, where the descriptions are a single word. Recently, most research on image captioning has focused on deep learning techniques, especially Encoder-Decoder models with Convolutional Neural Network (CNN) feature extraction. However, few works have tried using object detection features to increase the quality of the generated captions. This paper presents an attention-based Encoder-Decoder deep architecture that makes use of convolutional features extracted from a CNN model pre-trained on ImageNet (Xception), together with object features extracted from the YOLOv4 model, pre-trained on MS COCO. This paper also introduces a new positional encoding scheme for object features, the “importance factor”. Our model was tested on the MS COCO and Flickr30k datasets, and its performance is compared to that of similar works. Our new feature extraction...
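
A hedged sketch of the general idea of fusing grid CNN features with detector object features is given below. The appended scalar is only a hypothetical stand-in for a per-object importance signal; the paper's actual "importance factor" encoding is not reproduced here.

```python
# Hedged sketch of fusing grid CNN features with detected-object features.
# The scalar below (confidence * relative box area) is a hypothetical
# placeholder, not the paper's "importance factor" definition.
import torch

def fuse_features(grid_feats, obj_feats, obj_boxes, obj_scores, img_hw):
    """
    grid_feats: (B, L, D)   flattened Xception-style feature-map regions
    obj_feats:  (B, K, D)   per-object embeddings projected to the same width D
    obj_boxes:  (B, K, 4)   (x1, y1, x2, y2) detector boxes in pixels
    obj_scores: (B, K)      detector confidences
    img_hw:     (H, W)      image size used to normalize box area
    """
    h, w = img_hw
    area = ((obj_boxes[..., 2] - obj_boxes[..., 0]) *
            (obj_boxes[..., 3] - obj_boxes[..., 1])) / float(h * w)
    importance = (obj_scores * area).unsqueeze(-1)     # (B, K, 1) hypothetical factor
    obj_feats = obj_feats * (1.0 + importance)         # scale objects by importance
    return torch.cat([grid_feats, obj_feats], dim=1)   # (B, L + K, D) for attention
```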

A Sequential Guiding Network with Attention for Image Captioning

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019

The recent advances of deep learning in both computer vision (CV) and natural language processing (NLP) provide us with a new way of understanding semantics, by which we can deal with more challenging tasks such as automatic description generation from natural images. In this challenge, the encoder-decoder framework has achieved promising performance when a convolutional neural network (CNN) is used as the image encoder and a recurrent neural network (RNN) as the decoder. In this paper, we introduce a sequential guiding network that guides the decoder during word generation. The new model is an extension of the encoder-decoder framework with attention that has an additional guiding long short-term memory (LSTM) and can be trained in an end-to-end manner using image/description pairs. We validate our approach by conducting extensive experiments on a benchmark dataset, i.e., MS COCO Captions. The proposed model achieves significant improvement compared to other state-of-the-art deep learning models.
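
One plausible, purely illustrative reading of adding a guiding LSTM next to an attention decoder is sketched below; the actual wiring used in the paper may differ, and all layer names and sizes are assumptions.

```python
# Rough sketch: a guiding LSTM runs alongside the decoder and its output is
# fed as an extra input at each word-generation step. This is one plausible
# interpretation for illustration only.
import torch
import torch.nn as nn

class GuidedDecoderStep(nn.Module):
    def __init__(self, embed_dim, feat_dim, hidden_dim):
        super().__init__()
        self.guide = nn.LSTMCell(feat_dim, hidden_dim)                    # guiding LSTM
        self.decoder = nn.LSTMCell(embed_dim + feat_dim + hidden_dim, hidden_dim)

    def forward(self, word_emb, context, guide_state, dec_state):
        # The guiding LSTM consumes the attention context and emits a guidance vector.
        g_h, g_c = self.guide(context, guide_state)
        dec_in = torch.cat([word_emb, context, g_h], dim=-1)
        d_h, d_c = self.decoder(dec_in, dec_state)
        return d_h, (g_h, g_c), (d_h, d_c)   # decoder output plus updated states
```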

Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction

Journal of Big Data

Automatic caption generation with attention mechanisms aims at generating more descriptive captions containing coarser-to-finer semantic contents in the image. In this work, we use an encoder-decoder framework employing a Wavelet transform based Convolutional Neural Network (WCNN) with two-level discrete wavelet decomposition for extracting visual feature maps that highlight the spatial, spectral and semantic details in the image. The Visual Attention Prediction Network (VAPN) computes both channel and spatial attention for obtaining visually attentive features. In addition, local features are also taken into account by considering the contextual spatial relationship between the different objects. The probability of the appropriate word prediction is achieved by combining the aforementioned architecture with a Long Short Term Memory (LSTM) decoder network. Experiments are conducted on three benchmark datasets (Flickr8K, Flickr30K and MSCOCO), and the evaluation result...
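
The two-level discrete wavelet front end can be illustrated with PyWavelets as below. This is only a sketch of the decomposition step, not of the paper's WCNN or VAPN modules, and the wavelet choice (Haar) is an assumption.

```python
# Illustrative sketch of a two-level 2-D discrete wavelet decomposition as a
# front end for visual feature extraction, using PyWavelets on a grayscale image.
import numpy as np
import pywt

def two_level_dwt_features(gray_image: np.ndarray):
    # Level-2 decomposition: a coarse approximation plus two sets of detail bands.
    cA2, (cH2, cV2, cD2), (cH1, cV1, cD1) = pywt.wavedec2(gray_image, "haar", level=2)
    # The approximation carries low-frequency structure; the detail bands carry
    # edges and texture that a CNN branch could consume alongside it.
    return cA2, [cH2, cV2, cD2], [cH1, cV1, cD1]

subbands = two_level_dwt_features(np.random.rand(224, 224))  # dummy 224x224 input
```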

A Deep Neural Framework for Image Caption Generation Using GRU-Based Attention Mechanism

ArXiv, 2022

Image captioning is a fast-growing research field at the intersection of computer vision and natural language processing that involves creating text explanations for images. This study aims to develop a system that uses a pre-trained convolutional neural network (CNN) to extract features from an image, integrates the features with an attention mechanism, and creates captions using a recurrent neural network (RNN). To encode an image into a feature vector of graphical attributes, we employed multiple pre-trained convolutional neural networks. Following that, a GRU language model was chosen as the decoder to construct the descriptive sentence. To increase performance, we merged the Bahdanau attention model with the GRU to allow learning to be focused on a specific portion of the image. On the MSCOCO dataset, the experimental results show competitive performance against state-of-the-art approaches.
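
A minimal sketch of a GRU decoding step that consumes an attention context vector is shown below; the layer sizes and names are assumptions made for illustration, not the study's code.

```python
# Sketch of one GRU decoding step that consumes an attention context vector
# (e.g. produced by additive/Bahdanau attention over CNN features).
import torch
import torch.nn as nn

class AttnGRUDecoderStep(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRUCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, context, hidden):
        # prev_word: (B,) token ids, context: (B, feat_dim), hidden: (B, hidden_dim)
        x = torch.cat([self.embed(prev_word), context], dim=-1)
        hidden = self.gru(x, hidden)
        return self.out(hidden), hidden   # next-word logits and updated state
```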

End-to-End Attention-based Image Captioning

ArXiv, 2021

In this paper, we address the problem of image captioning specifically for molecular translation, where the result is a predicted chemical notation in InChI format for a given molecular structure. Current approaches mainly follow rule-based or CNN+RNN methodologies. However, they seem to underperform on noisy images and images with a small number of distinguishable features. To overcome this, we propose an end-to-end transformer model. When compared to attention-based techniques, our proposed model performs better on molecular datasets.
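
A hedged sketch of an end-to-end transformer that maps an image to a token sequence (as in InChI prediction) is given below. The patch embedding, layer sizes, and the omission of positional encodings are simplifications, not the authors' configuration.

```python
# Hedged sketch of an image-to-sequence transformer: patch tokens from the
# image go through a transformer encoder, and a transformer decoder emits the
# output string autoregressively. Positional encodings are omitted for brevity.
import torch
import torch.nn as nn

class Image2SeqTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        # 16x16 patches of a single-channel 256x256 image -> 256 patch tokens.
        self.patch_embed = nn.Conv2d(1, d_model, kernel_size=16, stride=16)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_tokens):
        # images: (B, 1, 256, 256), tgt_tokens: (B, T) shifted target token ids
        mem = self.encoder(self.patch_embed(images).flatten(2).transpose(1, 2))
        tgt = self.tok_embed(tgt_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        return self.out(self.decoder(tgt, mem, tgt_mask=mask))  # (B, T, vocab)
```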

Attention-Based Deep Learning Model for Image Captioning: A Comparative Study

International Journal of Image, Graphics and Signal Processing, 2019

Image captioning is the task of generating a description from an image. It is one part of computer vision and image processing within artificial intelligence (AI), and it also serves as a bridge between the vision process and natural language processing. In image captioning, there are two parts: sentence-based generation and single-word generation. Deep learning has become the main driver of many new applications and is also much more accessible in terms of the learning curve. Applying deep learning models to image captioning can enhance description accuracy. Attention mechanisms are an upward trend in deep learning models for image caption generation. This paper presents a comparative study of attention-based deep learning models for image captioning. It analyzes their performance, advantages, and weaknesses, and also discusses the datasets used for image captioning and the evaluation metrics used to test accuracy.

Deep learning for image captioning: an encoder-decoder architecture with soft attention

2019

Automatic image captioning, the task of automatically producing a natural-language description for an image, has the potential to assist those with visual impairments by explaining images using text-to-speech systems. However, accurate image captioning is a challenging task that requires integrating and pushing further the latest improvements at the intersection of the computer vision and natural language processing fields. This work aims at building an advanced model based on neural networks and deep learning for the automated generation of image captions.

Human Attention in Image Captioning: Dataset and Analysis

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

In this work, we present a novel dataset consisting of eye movements and verbal descriptions recorded synchronously over images. Using this data, we study the differences in human attention during free-viewing and image captioning tasks. We look into the relationship between human attention and language constructs during perception and sentence articulation. We also analyse attention deployment mechanisms in the top-down soft attention approach that is argued to mimic human attention in captioning tasks, and investigate whether visual saliency can help image captioning. Our study reveals that (1) human attention behaviour differs in free-viewing and image description tasks; humans tend to fixate on a greater variety of regions under the latter task, (2) there is a strong relationship between described objects and attended objects (97% of the described objects are being attended), (3) a convolutional neural network as feature encoder accounts for human-attended regions during image captioning to a great extent (around 78%), (4) the soft-attention mechanism differs from human attention, both spatially and temporally, and there is low correlation between caption scores and attention consistency scores, which indicates a large gap between humans and machines with regard to top-down attention, and (5) by integrating the soft attention model with image saliency, we can significantly improve the model's performance on the Flickr30k and MSCOCO benchmarks. The dataset can be found at: https://github.com/SenHe/Human-Attention-in-Image-Captioning.