On combining image features and word embeddings for image captioning

A Benchmark for Feature-injection Architectures in Image Captioning

European Journal of Science and Technology, 2021

Describing an image with a grammatically and semantically correct sentence, known as image captioning, has improved significantly with recent advances in the computer vision (CV) and natural language processing (NLP) communities. The integration of these fields has led to the development of feature-injection architectures, which define how extracted features are used in captioning. In this paper, a benchmark of feature-injection architectures that utilize CV and NLP techniques is reported for encoder-decoder based captioning. In the benchmark, an Inception-v3 convolutional neural network extracts image features in the encoder, while the feature-injection architectures, namely init-inject, pre-inject, par-inject and merge, are applied with a multi-layer gated recurrent unit (GRU) to generate captions in the decoder. The architectures have been evaluated extensively on the MSCOCO dataset across eight performance metrics. It is concluded that the init-inject architecture with a 3-layer GRU outperforms the other architectures in terms of captioning accuracy.
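To make the init-inject pattern concrete, the following is a minimal sketch in Keras (an assumed framework, not necessarily the authors' implementation) in which pooled Inception-v3 features initialise the GRU hidden state and the decoder then consumes only word embeddings. The vocabulary size, caption length, and hidden sizes are placeholders, and a single GRU layer is shown for brevity even though the paper's best configuration stacks three.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10000   # assumption: placeholder vocabulary size
MAX_LEN = 30         # assumption: maximum caption length
EMBED_DIM = 256      # assumption: word embedding size
UNITS = 512          # GRU hidden size; must match the projected image features

# Image branch: pooled Inception-v3 features (2048-d) projected to the GRU state size.
image_features = layers.Input(shape=(2048,), name="inception_v3_features")
init_state = layers.Dense(UNITS, activation="relu")(image_features)

# Text branch: the partial caption as token ids.
caption_in = layers.Input(shape=(MAX_LEN,), name="caption_tokens")
embeddings = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)

# Init-inject: the image vector becomes the initial hidden state of the GRU,
# so the recurrent decoder sees only word embeddings at each time step.
gru_out = layers.GRU(UNITS)(embeddings, initial_state=init_state)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(gru_out)

model = Model(inputs=[image_features, caption_in], outputs=next_word)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()
```

By contrast, a pre-inject variant would feed the projected image vector as the first token of the embedded sequence, par-inject would concatenate it to every word embedding, and merge would combine it with the GRU output only just before the softmax classifier.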

A Study on Visual Understanding Image Captioning using Different Word Embeddings and CNN-Based Feature Extractions

Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, 2022

Image captioning is a task that provides a description of an image in natural language. It can be used for a variety of applications, such as image indexing and virtual assistants. In this research, we compared the performance of three different word embeddings, namely GloVe, Word2Vec and FastText, and six CNN-based feature extraction architectures, namely Inception V3, InceptionResNet V2, ResNet152 V2, EfficientNet B3 V1, EfficientNet B7 V1, and NASNetLarge, each combined with an LSTM decoder to perform image captioning. We used ten different household objects (bed, cell phone, chair, couch, oven, potted plant, refrigerator, sink, table, and tv) obtained from the MSCOCO dataset to develop the model. We then created five new captions in Bahasa Indonesia for the selected images. The captions may contain details about the name, location, color, size, and characteristics of an object and its surrounding area. In our 18 experiment...
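A minimal sketch of how one such combination could be wired together is shown below, assuming Keras, a merge-style fusion of the two modalities, and placeholder sizes; the `embedding_matrix` stands in for a lookup table built offline from the chosen pretrained embedding (GloVe, Word2Vec or FastText), and none of this is presented as the authors' exact implementation.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 8000    # assumption: placeholder vocabulary size
MAX_LEN = 25         # assumption: maximum caption length
EMBED_DIM = 300      # typical GloVe / FastText dimensionality
FEATURE_DIM = 2048   # e.g. pooled Inception V3 or ResNet152 V2 features
UNITS = 256          # decoder hidden size

# Assumption: in practice this matrix is filled by looking up each vocabulary
# word in the pretrained embedding; random values are used here only so the
# sketch runs stand-alone.
embedding_matrix = np.random.rand(VOCAB_SIZE, EMBED_DIM).astype("float32")

# Image branch: precomputed CNN features projected to the decoder size.
image_features = layers.Input(shape=(FEATURE_DIM,), name="cnn_features")
img = layers.Dropout(0.5)(image_features)
img = layers.Dense(UNITS, activation="relu")(img)

# Text branch: frozen pretrained embeddings feeding an LSTM.
caption_in = layers.Input(shape=(MAX_LEN,), name="caption_tokens")
emb = layers.Embedding(
    VOCAB_SIZE, EMBED_DIM,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False, mask_zero=True)(caption_in)
txt = layers.LSTM(UNITS)(emb)

# Merge the two modalities and predict the next word of the caption.
merged = layers.add([img, txt])
merged = layers.Dense(UNITS, activation="relu")(merged)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

model = Model(inputs=[image_features, caption_in], outputs=next_word)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```

Swapping the CNN backbone or the embedding then only changes `FEATURE_DIM`, the precomputed feature vectors, and the contents of `embedding_matrix`, which is what makes a grid comparison of this kind straightforward to run.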