IRJET- Image Captioning using Attention Mechanism with ResNet, VGG and Inception Models

Attention-Based Deep Learning Model for Image Captioning: A Comparative Study

International Journal of Image, Graphics and Signal Processing, 2019

Image captioning is the task of generating a textual description of an image. It is one part of computer vision and image processing within artificial intelligence (AI), and it also serves as a bridge between visual understanding and natural language processing. Caption generation takes two forms: sentence-based generation and single-word generation. Deep learning has become the main driver of many new applications and has also become much more accessible in terms of the learning curve. Applying deep learning models to image captioning can improve description accuracy, and attention mechanisms are the rising trend in deep learning models for image caption generation. This paper presents a comparative study of attention-based deep learning models for image captioning, analyzing their performance, advantages, and weaknesses. It also discusses the datasets used for image captioning and the evaluation metrics used to test accuracy.
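Since the study discusses evaluation metrics, the following is a minimal sketch of how one such metric, BLEU, can be computed for a generated caption. It assumes the nltk package; the reference and candidate captions are invented for illustration and are not from any of the surveyed models.

```python
# Minimal caption-evaluation sketch with BLEU (one of the metrics such
# surveys typically report). Requires the nltk package; captions are
# illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog runs across the grassy field".split(),
    "a brown dog is running on the grass".split(),
]
candidate = "a dog is running on the grass".split()

smooth = SmoothingFunction().method1  # avoids zero scores for short captions
bleu1 = sentence_bleu(references, candidate, weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu4 = sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")
```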

Image Caption Generator Using Attention Based Neural Networks

International Journal for Research in Applied Science and Engineering Technology, 2023

Image caption generation is a method used to create sentences that describe the scene depicted in a given image. The process includes identifying objects within the image, carrying out various operations, and identifying the most important features of the image. Once the system has identified this information, it generates the most relevant and concise description of the image, which is both grammatically and semantically correct. With the progress in deep learning techniques, algorithms are able to generate text in the form of natural sentences that can effectively describe an image. However, replicating the natural human ability to comprehend image content and produce descriptive text is a difficult task for machines. The uses of image captioning are vast and of great significance, as it involves creating succinct captions using a variety of techniques from Natural Language Processing (NLP), Computer Vision (CV), and Deep Learning (DL). The current study presents a system that employs an attention mechanism, in addition to an encoder and a decoder, to generate captions. It utilizes a pretrained CNN, Inception V3, to extract features from the image and an RNN, a GRU, to produce a relevant caption. The attention mechanism used in this model is Bahdanau attention, and the Flickr8k dataset is used for training. The results demonstrate the model's capability to understand images and generate text in a reasonable manner.
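The following is a minimal sketch of the Bahdanau (additive) attention described above, written in PyTorch rather than the authors' own code; the layer sizes and module names are illustrative assumptions.

```python
# Bahdanau (additive) attention over image regions, as used between a
# CNN encoder and a GRU decoder. Sizes are illustrative.
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    def __init__(self, feature_dim, hidden_dim, attn_dim):
        super().__init__()
        self.W_feat = nn.Linear(feature_dim, attn_dim)   # projects image region features
        self.W_hidden = nn.Linear(hidden_dim, attn_dim)  # projects the decoder state
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        # features: (batch, num_regions, feature_dim), e.g. an InceptionV3 feature grid
        # hidden:   (batch, hidden_dim), previous GRU hidden state
        scores = self.v(torch.tanh(self.W_feat(features) + self.W_hidden(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)      # attention weights over regions
        context = (alpha * features).sum(dim=1)   # weighted sum fed to the decoder
        return context, alpha
```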

IRJET- Image Caption Generation System using Neural Network with Attention Mechanism

IRJET, 2020

Automatically generating captions for an image is a task closely related to scene understanding: it addresses the computer vision challenges of determining which objects are in an image and of capturing and expressing the relationships between them in natural language. It is also used in Content-Based Image Retrieval (CBIR) and can help visually impaired people understand their surroundings. This work presents a model based on a deep recurrent neural network, combined with computer vision techniques, that generates natural sentences describing an image. The model is divided into two parts, an Encoder and a Decoder, and the dataset used is Flickr8k. A Convolutional Neural Network (CNN) is used as the encoder to extract features from images, and a Long Short-Term Memory (LSTM) network is used as the decoder to generate the words describing the image. An attention mechanism is applied at the same time to focus on the details of each portion of the image and produce a more descriptive caption. Beam search is then used to construct the optimal sentence from these words. Finally, the generated sentence is converted to audio, which is found to help visually impaired people. Thus, our system provides the user with a descriptive caption for the given input image.
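Below is a minimal sketch of beam-search decoding of the kind mentioned above. The `step` function, token ids, and beam width are assumptions for illustration, not the authors' implementation; `step` is expected to return a next-word probability dictionary and an updated decoder state.

```python
# Beam-search decoding sketch for caption generation.
import math

def beam_search(step, start_id, end_id, beam_width=3, max_len=20):
    # Each beam is (log_prob, token_sequence, decoder_state).
    beams = [(0.0, [start_id], None)]
    for _ in range(max_len):
        candidates = []
        for log_p, seq, state in beams:
            if seq[-1] == end_id:                 # finished captions carry over unchanged
                candidates.append((log_p, seq, state))
                continue
            probs, new_state = step(seq[-1], state)   # next-word distribution {token: prob}
            for tok, p in probs.items():
                candidates.append((log_p + math.log(p), seq + [tok], new_state))
        # keep the beam_width highest-scoring partial captions
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return max(beams, key=lambda c: c[0])[1]
```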

Image captioning model using attention and object features to mimic human image understanding

Journal of Big Data, 2022

Image captioning spans the fields of computer vision and natural language processing. The image captioning task generalizes object detection, where the descriptions are instead a single word. Recently, most research on image captioning has focused on deep learning techniques, especially Encoder-Decoder models with Convolutional Neural Network (CNN) feature extraction. However, few works have tried using object detection features to increase the quality of the generated captions. This paper presents an attention-based, Encoder-Decoder deep architecture that makes use of convolutional features extracted from a CNN model pre-trained on ImageNet (Xception), together with object features extracted from the YOLOv4 model, pre-trained on MS COCO. This paper also introduces a new positional encoding scheme for object features, the “importance factor”. Our model was tested on the MS COCO and Flickr30k datasets, and its performance is compared to that of similar works. Our new feature extraction...
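The following is a hedged sketch of one way to fuse grid features from a CNN backbone with object-level features from a detector, as described above. The tensors and the simple confidence-based weighting are illustrative assumptions and do not reproduce the paper's exact "importance factor" formula.

```python
# Fusing CNN grid features with detector object features (illustrative).
import torch

def fuse_features(grid_feats, obj_feats, obj_scores):
    # grid_feats: (num_regions, d)  flattened CNN feature map (e.g. Xception)
    # obj_feats:  (num_objects, d)  pooled features of detected objects
    # obj_scores: (num_objects,)    detector confidences in [0, 1]
    weighted_objs = obj_feats * obj_scores.unsqueeze(1)   # emphasize confident objects
    return torch.cat([grid_feats, weighted_objs], dim=0)  # regions the decoder attends over

grid = torch.randn(100, 512)   # e.g. a 10x10 grid of 512-d features (illustrative)
objs = torch.randn(5, 512)     # 5 detected objects (illustrative)
scores = torch.rand(5)
fused = fuse_features(grid, objs, scores)
print(fused.shape)             # torch.Size([105, 512])
```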

Human Attention in Image Captioning: Dataset and Analysis

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

In this work, we present a novel dataset consisting of eye movements and verbal descriptions recorded synchronously over images. Using this data, we study the differences in human attention during free-viewing and image captioning tasks. We look into the relationship between human attention and language constructs during perception and sentence articulation. We also analyse attention deployment mechanisms in the top-down soft attention approach that is argued to mimic human attention in captioning tasks, and investigate whether visual saliency can help image captioning. Our study reveals that (1) human attention behaviour differs between free-viewing and image description tasks, with humans fixating on a greater variety of regions under the latter task; (2) there is a strong relationship between described objects and attended objects (97% of the described objects are attended); (3) a convolutional neural network as feature encoder accounts for human-attended regions during image captioning to a great extent (around 78%); (4) the soft-attention mechanism differs from human attention, both spatially and temporally, and there is low correlation between caption scores and attention consistency scores, indicating a large gap between humans and machines with regard to top-down attention; and (5) by integrating the soft attention model with image saliency, we can significantly improve the model's performance on the Flickr30k and MSCOCO benchmarks. The dataset can be found at: https://github.com/SenHe/Human-Attention-in-Image-Captioning.
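As a small sketch of the kind of integration reported in point (5), soft-attention weights can be modulated by a saliency prior and renormalised. The combination rule below is an illustrative assumption, not the paper's exact method.

```python
# Modulating soft-attention weights with a visual-saliency prior (illustrative).
import torch

def saliency_modulated_attention(attn_weights, saliency):
    # attn_weights: (batch, num_regions) soft-attention weights (sum to 1 per image)
    # saliency:     (batch, num_regions) saliency scores per region
    combined = attn_weights * saliency
    return combined / combined.sum(dim=1, keepdim=True)  # renormalise to a distribution
```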

Efficient Image Captioning Based on Vision Transformer Models

Computers, Materials & Continua

Image captioning is an emerging field in machine learning. It refers to the ability to automatically generate a syntactically and semantically meaningful sentence that describes the content of an image. Image captioning requires a complex machine learning process as it involves two sub-models: a vision sub-model for extracting object features and a language sub-model that uses the extracted features to generate meaningful captions. Attention-based vision transformer models have recently had a great impact on the vision field. In this paper, we studied the effect of using vision transformers on the image captioning process by evaluating four different vision transformer models as the vision sub-model of the image captioning pipeline. The first vision transformer used is DINO (self-distillation with no labels). The second is PVT (Pyramid Vision Transformer), a vision transformer that does not use convolutional layers. The third is XCIT (Cross-Covariance Image Transformer), which changes the self-attention operation by focusing on the feature dimension instead of the token dimension. The last is SWIN (Shifted Windows), a vision transformer which, unlike the other transformers, uses shifted windows to split the image. For a deeper evaluation, the four vision transformers were tested in different versions and configurations: DINO with five different backbones, PVT in two versions (PVT_v1 and PVT_v2), one XCIT model, and the SWIN transformer. The results show that the SWIN transformer is the most effective of these models within the proposed image captioning pipeline.
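The following is a hedged sketch of how a vision-transformer backbone can be swapped in as the vision sub-model, using the timm library; timm and the specific Swin checkpoint name below are assumptions for illustration and may differ from the exact models used in the paper.

```python
# Using a vision-transformer backbone as a feature extractor via timm (illustrative).
import timm
import torch

backbone = timm.create_model("swin_base_patch4_window7_224", pretrained=True, num_classes=0)
backbone.eval()                              # num_classes=0 -> pooled features, no classifier head

image = torch.randn(1, 3, 224, 224)          # a preprocessed image batch
with torch.no_grad():
    features = backbone(image)               # pooled feature vector passed to the language sub-model
print(features.shape)
```

Swapping to another backbone (e.g. a PVT variant) only requires changing the model name passed to `timm.create_model`, which is what makes this kind of comparative evaluation straightforward.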

Attention Based Image Caption Generation (ABICG) using Encoder-Decoder Architecture

2023 5th International Conference on Smart Systems and Inventive Technology (ICSSIT), 2023

Image captioning is used to generate sentences describing the scene captured in an image. Its applications are vast, although it is a tedious task for a machine to learn what a human is naturally capable of. The model must be built in such a way that, when it reads the scene, it recognizes it and reproduces to-the-point captions or descriptions. The generated descriptions must be semantically and syntactically accurate. The availability of Artificial Intelligence (AI) and Machine Learning algorithms, viz. Natural Language Processing (NLP), Deep Learning (DL), etc., makes the task easier. In this paper, the Bahdanau attention mechanism is used together with an Encoder-Decoder architecture to generate captions for an image. A pre-trained Convolutional Neural Network (CNN), the InceptionV3 architecture, is used to extract image features, and a Recurrent Neural Network (RNN), the Gated Recurrent Unit (GRU), is used to generate the captions. The model is trained on the Flickr8k dataset and improves accuracy by around 10% over the present state of the art.
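Building on the BahdanauAttention sketch given earlier in this listing, the following is a minimal sketch of one decoding step of an attention-based GRU caption decoder of the kind described above; all module names and sizes are illustrative assumptions, not the authors' code.

```python
# One step of an attention-based GRU caption decoder (illustrative).
# Assumes the BahdanauAttention class from the earlier sketch is in scope.
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, vocab_size, embed_dim, feature_dim, hidden_dim, attn_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = BahdanauAttention(feature_dim, hidden_dim, attn_dim)
        self.gru = nn.GRUCell(embed_dim + feature_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, hidden, features):
        # prev_word: (batch,) previous word ids; features: (batch, regions, feature_dim)
        context, _ = self.attention(features, hidden)         # attend over image regions
        gru_in = torch.cat([self.embed(prev_word), context], dim=1)
        hidden = self.gru(gru_in, hidden)
        return self.out(hidden), hidden                        # logits for the next word
```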

Panoptic Segmentation-Based Attention for Image Captioning

Applied Sciences

Image captioning is the task of generating textual descriptions of images. In order to obtain a better image representation, attention mechanisms have been widely adopted in image captioning. However, in existing models with detection-based attention, the rectangular attention regions are not fine-grained, as they contain irrelevant regions (e.g., background or overlapped regions) around the object, making the model generate inaccurate captions. To address this issue, we propose panoptic segmentation-based attention that performs attention at a mask-level (i.e., the shape of the main part of an instance). Our approach extracts feature vectors from the corresponding segmentation regions, which is more fine-grained than current attention mechanisms. Moreover, in order to process features of different classes independently, we propose a dual-attention module which is generic and can be applied to other frameworks. Experimental results showed that our model could recognize the overlappe...
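Below is a hedged sketch of pooling convolutional features over panoptic segmentation masks, illustrating the mask-level (rather than rectangular) attention regions the paper proposes; the average pooling used here is an illustrative assumption.

```python
# Pooling image features over panoptic instance masks (illustrative).
import torch

def mask_pooled_features(feature_map, masks):
    # feature_map: (d, H, W)               convolutional feature map of the image
    # masks:       (num_instances, H, W)   binary masks from panoptic segmentation
    d = feature_map.shape[0]
    flat = feature_map.reshape(d, -1)                     # (d, H*W)
    flat_masks = masks.reshape(masks.shape[0], -1).float()
    areas = flat_masks.sum(dim=1, keepdim=True).clamp(min=1.0)
    # average the features inside each instance mask -> one vector per instance
    return (flat_masks @ flat.T) / areas                  # (num_instances, d)
```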

IRJET- A Survey of Image Captioning Models

IRJET, 2021

Image caption generation has been a challenging problem for a long time. Numerous attempts have been made at the difficult task of image captioning, which includes the complexities of both computer vision and natural language processing. Deep learning models have the capability to perform the intricate task of image captioning. In this survey paper, we aim to give a complete review of the various image captioning techniques that have been implemented to date. We discuss the structure of the various models, their performance, advantages, and limitations. The different datasets and evaluation metrics that are frequently used in image captioning models are also discussed.

IRJET- Automated Image Captioning using CNN and RNN

IRJET, 2021

With the evolution of technology, image captioning has become a very important component of nearly every industry concerned with information abstraction. Interpreting such information by a machine can be very complex and time-consuming. For a machine to comprehend the context and surrounding details of an image, it needs a good understanding of the description projected from the image. Many deep learning techniques have not followed conventional methods but are changing the way a machine understands and interprets images, chiefly through the use of captions and a well-defined vocabulary linked to images. Advances in technology and the ease of computing over large amounts of data have made it possible to easily apply deep learning in many projects on a personal computer. A solution requires both that the content of the image be understood and translated into meaning in terms of words, and that the words string together comprehensibly. It combines computer vision using deep learning with natural language processing, and marks a genuinely challenging problem in broader artificial intelligence.
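As a closing illustration of the CNN-encoder / RNN-decoder pipeline this abstract describes, the following is a minimal sketch that extracts image features with a pretrained ResNet from torchvision and greedily decodes a caption with an LSTM. The backbone choice, vocabulary size, token ids, and layer sizes are all illustrative assumptions.

```python
# Minimal CNN encoder + RNN decoder captioning sketch (illustrative, untrained decoder).
import torch
import torch.nn as nn
from torchvision import models

cnn = models.resnet50(weights="IMAGENET1K_V1")
cnn.fc = nn.Identity()                       # keep the 2048-d pooled feature vector
cnn.eval()

embed = nn.Embedding(5000, 256)              # hypothetical vocabulary of 5000 words
decoder = nn.LSTMCell(input_size=2048 + 256, hidden_size=512)
out = nn.Linear(512, 5000)

image = torch.randn(1, 3, 224, 224)          # a preprocessed input image
with torch.no_grad():
    feat = cnn(image)                        # (1, 2048) image representation
    word = torch.tensor([1])                 # hypothetical <start> token id
    h = c = torch.zeros(1, 512)
    caption = []
    for _ in range(15):                      # greedy decoding, one word per step
        h, c = decoder(torch.cat([feat, embed(word)], dim=1), (h, c))
        word = out(h).argmax(dim=1)
        caption.append(word.item())
```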