Bangla language textual image description by hybrid neural network model

Hybrid deep neural network for Bangla automated image descriptor

International Journal of Advances in Intelligent Informatics, 2020

Automated image-to-text generation is a computationally challenging computer vision task that requires sufficient comprehension of both the syntactic and semantic content of an image to generate a meaningful description. Until recently, it had been studied only to a limited extent due to the lack of visual-descriptor datasets and of functional models able to capture the intrinsic complexities of image features. In this study, a novel dataset, Bangla Natural Language Image to Text (BNLIT), was constructed by generating Bangla textual descriptors from visual input, incorporating 100 annotated classes. A deep neural network-based image captioning model was proposed to generate image descriptions. The model employs a Convolutional Neural Network (CNN) to classify the whole dataset, while a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) captures the sequential semantic representation of text-based sentences and generates pertinent descriptions based on the modular complexities of an image. When tested on the new dataset, the model achieves a substantial improvement on the image-to-text semantic retrieval task. For this task, we implemented a hybrid image captioning model that achieved strong results on the newly constructed dataset, a task not previously addressed from a Bangladeshi perspective. In brief, the model provides benchmark precision in characteristic Bangla syntax reconstruction, and a comprehensive numerical analysis of the model's performance on the dataset is presented.
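
The CNN-plus-RNN/LSTM pipeline described above follows the common "merge" captioning recipe. As a rough, non-authoritative sketch of that recipe (not the paper's actual code), the Keras snippet below combines a pre-extracted CNN feature vector with an LSTM-encoded partial caption to predict the next word; the feature dimension, vocabulary size, and caption length are placeholder assumptions.

```python
# Minimal CNN-feature + LSTM caption model (merge architecture), Keras.
# Hypothetical sizes: 2048-d CNN features, 5000-word vocab, 30-token captions.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

vocab_size, max_len, feat_dim = 5000, 30, 2048

# Image branch: a pre-extracted CNN feature vector projected to 256-d.
img_in = Input(shape=(feat_dim,))
img_vec = Dense(256, activation='relu')(Dropout(0.5)(img_in))

# Text branch: partial caption -> embedding -> LSTM state.
txt_in = Input(shape=(max_len,))
txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_vec = LSTM(256)(Dropout(0.5)(txt_emb))

# Merge both modalities and predict the next word of the caption.
merged = add([img_vec, txt_vec])
out = Dense(vocab_size, activation='softmax')(Dense(256, activation='relu')(merged))

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
```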

TextMage: The Automated Bangla Caption Generator Based On Deep Learning

Neural Networks and Deep Learning have seen an upsurge of research in the past decade due to improved results. Generating text from a given image is a crucial task that requires combining two fields, computer vision and natural language processing, in order to understand an image and represent it using natural language. However, existing works have all been done on a particular lingual domain and on the same set of data. This leads to systems that perform poorly on images belonging to specific locales' geographical contexts. TextMage is a system capable of understanding visual scenes that belong to the Bangladeshi geographical context and using that knowledge to represent what it understands in Bengali. Hence, we have trained a model on our previously developed and published dataset named BanglaLekhaImageCaptions. This dataset contains 9,154 images along with two annotations for each image. In order to assess performance, the prop...

Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network

2021

Automatic Image Captioning is the ongoing effort of creating syntactically valid and accurate textual descriptions of an image in natural language with context. The encoder-decoder structures used throughout existing Bengali Image Captioning (BIC) research utilize abstract image feature vectors as the encoder's input. We propose a novel transformer-based architecture with an attention mechanism and a pre-trained ResNet-101 image encoder for feature extraction. Experiments demonstrate that the language decoder in our technique captures fine-grained information in the caption and, when paired with image features, produces accurate and diverse captions on the BanglaLekhaImageCaptions dataset. Our approach outperforms all existing Bengali Image Captioning work and sets a new benchmark by scoring 0.694 on BLEU-1, 0.630 on BLEU-2, 0.582 on BLEU-3, and 0.337 on METEOR.
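
As an illustrative sketch of a ResNet-101-encoder/Transformer-decoder captioner of the kind described here, one might write the PyTorch module below. The dimensions, vocabulary size, and layer counts are assumptions rather than the paper's configuration, and positional encodings are omitted for brevity.

```python
# Sketch of a ResNet-101 encoder feeding a Transformer decoder (PyTorch).
# Sizes are illustrative; positional encodings are omitted for brevity.
import torch
import torch.nn as nn
import torchvision

class CaptionTransformer(nn.Module):
    def __init__(self, vocab_size=8000, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")
        # Drop the avgpool/fc head to keep the 7x7x2048 spatial feature grid.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(2048, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        feats = self.encoder(images)              # (B, 2048, 7, 7)
        feats = feats.flatten(2).transpose(1, 2)  # (B, 49, 2048) region grid
        memory = self.proj(feats)                 # (B, 49, d_model)
        tgt = self.embed(tokens)                  # (B, T, d_model)
        # Causal mask so each position only attends to earlier caption tokens.
        T = tokens.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)                   # next-token logits
```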

Automatic Bangla Image Captioning Based on Transformer Model in Deep Learning

International Journal of Advanced Computer Science and Applications (IJACSA), 2023

Image captioning has become a crucial aspect of contemporary artificial intelligence because it brings together two central parts of the AI field: Computer Vision and Natural Language Processing. Bangla currently stands as the seventh most widely spoken language globally, and as a result Bangla image captioning has attracted significant research attention. Many established datasets exist in English, but there is no standard dataset in Bangla. For our research, we used the BAN-Cap dataset, which contains 8,091 images with 40,455 sentences. Many effective encoder-decoder and visual-attention approaches have been used for image captioning, where a CNN serves as the encoder and an RNN as the decoder. In this study, however, we propose a transformer-based image captioning model with different pre-trained image feature extraction models, namely ResNet50, InceptionV3, and VGG16, on the BAN-Cap dataset, evaluate its efficiency and accuracy using several performance metrics, including BLEU, METEOR, ROUGE, and CIDEr, and identify the drawbacks of other models.
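
Since this study compares ResNet50, InceptionV3, and VGG16 as feature extractors, a minimal sketch of how such backbones can be swapped behind one common interface is shown below (Keras applications, illustrative only; per-backbone preprocessing and the transformer decoder are omitted, and none of the sizes come from the paper).

```python
# Extract pooled image features with interchangeable pre-trained backbones
# (ResNet50, InceptionV3, VGG16), as a basis for comparing encoders.
import numpy as np
from tensorflow.keras.applications import ResNet50, InceptionV3, VGG16

backbones = {
    "resnet50":    ResNet50(weights="imagenet", include_top=False, pooling="avg"),
    "inceptionv3": InceptionV3(weights="imagenet", include_top=False, pooling="avg"),
    "vgg16":       VGG16(weights="imagenet", include_top=False, pooling="avg"),
}

def extract(name, batch):
    """batch: float array already resized/preprocessed for the chosen backbone."""
    return backbones[name].predict(batch, verbose=0)

dummy = np.random.rand(2, 224, 224, 3).astype("float32")
for name in backbones:
    # InceptionV3 normally expects 299x299 input; with include_top=False it
    # accepts other sizes, so 224x224 is fine for this illustration.
    print(name, extract(name, dummy).shape)   # e.g. (2, 2048) or (2, 512)
```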

ImageToText: Image Caption Generation Using Hybrid Recurrent Neural Network

Final-year thesis (preprint)

Generating a natural language description from images is an important problem at the intersection of computer vision, natural language processing, artificial intelligence, and image processing. Drawing on many recent works in the deep learning field, we introduce a hybrid RNN model that generates text from a given input image. We present a learning model that generates natural language descriptions of images. The model exploits the connections between natural language and visual data by producing text-line-based content from a given image. Our Hybrid Recurrent Neural Network model is based on the combination of Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Bi-directional Recurrent Neural Network (BRNN) models. We used three benchmark datasets, Flickr8K, Flickr30K, and MS COCO, to train our model and observed an accuracy improvement compared with state-of-the-art work. A new Bangla dataset, named BNLIT (Bangla Natural Language Image to Text), was also created to generate Bangla captions from input images. This dataset contains 8,700 images, all taken from a Bangladeshi perspective. Our hybrid model learns from this new set of data and annotations that reflect the Bangladeshi geographical context.
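
BNLIT pairs Bangladeshi-context images with Bangla captions. The snippet below only illustrates reading UTF-8 Bangla captions and pairing them with image files; the file layout shown is purely hypothetical, since the actual BNLIT distribution format is not specified here.

```python
# Hypothetical loader for an image/Bangla-caption pairing file.
# Assumed format (not the real BNLIT layout): one "image_name<TAB>caption"
# pair per line, captions stored as UTF-8 Bangla text.
import os
from collections import defaultdict

def load_captions(caption_file, image_dir):
    captions = defaultdict(list)
    with open(caption_file, encoding="utf-8") as f:
        for line in f:
            name, caption = line.rstrip("\n").split("\t", 1)
            path = os.path.join(image_dir, name)
            if os.path.exists(path):          # keep only resolvable images
                captions[path].append(caption)
    return captions

pairs = load_captions("bnlit_captions.tsv", "bnlit_images/")
print(len(pairs), "images with captions loaded")
```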

CapNet: An Encoder-Decoder based Neural Network Model for Automatic Bangla Image Caption Generation

International Journal of Advanced Computer Science and Applications

Automatic caption generation from images has become an active research topic in the fields of Computer Vision (CV) and Natural Language Processing (NLP). Machine-generated image captions play a vital role for visually impaired people: converting a caption to speech gives them a better understanding of their surroundings. Though a significant amount of research has been conducted on automatic caption generation in other languages, far too little effort has been devoted to Bangla image caption generation. In this paper, we propose an encoder-decoder based model which takes an image as input and generates the corresponding Bangla caption as output. The encoder network consists of a pretrained image feature extractor called ResNet-50, while the decoder network consists of Bidirectional LSTMs for caption generation. The model has been trained and evaluated using a Bangla image captioning dataset named BanglaLekhaImageCaptions. The proposed model achieved a training accuracy of 91% and BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 0.81, 0.67, 0.57, and 0.51, respectively. Moreover, a comparative study of different pretrained feature extractors, such as VGG-16 and Xception, is presented. Finally, the proposed model has been deployed on an embedded device to analyse inference time and power consumption.
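
The BLEU-1 through BLEU-4 figures quoted above are corpus-level n-gram precision scores. As a sketch of how such scores are typically computed with NLTK's corpus_bleu (the tokenized reference and candidate captions here are invented placeholders, not data from the paper):

```python
# Corpus-level BLEU-1..4 with NLTK; reference/candidate captions are dummies.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each candidate caption has a list of tokenized reference captions.
references = [[["a", "dog", "runs", "in", "the", "field"]],
              [["two", "people", "ride", "a", "boat"]]]
candidates = [["a", "dog", "runs", "in", "a", "field"],
              ["two", "people", "are", "on", "a", "boat"]]

smooth = SmoothingFunction().method1          # avoids zero scores on short texts
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)            # uniform weights over 1..n-grams
    score = corpus_bleu(references, candidates, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```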

An Efficient Deep Learning based Hybrid Model for Image Caption Generation

International Journal of Advanced Computer Science and Applications

In recent years, with the increasing use of different social media platforms, image captioning has come to play a major role in automatically describing a whole image as a natural-language sentence. Image captioning plays a significant role in a computer-based society. It is the process of automatically generating a natural-language textual description of an image using artificial intelligence techniques. Computer vision and natural language processing are the key aspects of such an image processing system. A Convolutional Neural Network (CNN) belongs to computer vision and is used for object detection and feature extraction, while Natural Language Processing (NLP) techniques help generate the textual caption of the image. Generating a suitable image description by machine is a challenging task, as it depends on detecting objects, their locations, and their semantic relationships in a human-understandable language such as English. In this paper, our aim is to develop an encoder-decoder based hybrid image captioning approach using VGG16, ResNet50, and YOLO. VGG16 and ResNet50 are pre-trained feature extraction models trained on millions of images, while YOLO is used for real-time object detection. The approach first extracts image features using VGG16, ResNet50, and YOLO and concatenates the results into a single representation. Finally, LSTM and BiGRU are used to generate the textual description of the image. The proposed model is evaluated using BLEU, METEOR, and ROUGE scores.
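
The core idea of this hybrid encoder is to concatenate features from several backbones into one vector before decoding. A minimal Keras illustration of concatenating pooled VGG16 and ResNet50 features is given below; the YOLO branch is omitted because there is no single standard Keras YOLO API, and all sizes are illustrative rather than taken from the paper.

```python
# Concatenate pooled VGG16 and ResNet50 features into one fused vector.
import numpy as np
from tensorflow.keras.applications import VGG16, ResNet50
from tensorflow.keras.applications.vgg16 import preprocess_input as vgg_pre
from tensorflow.keras.applications.resnet50 import preprocess_input as res_pre

vgg = VGG16(weights="imagenet", include_top=False, pooling="avg")        # 512-d
resnet = ResNet50(weights="imagenet", include_top=False, pooling="avg")  # 2048-d

def fused_features(batch):
    """batch: float32 array of shape (N, 224, 224, 3), RGB in [0, 255]."""
    v = vgg.predict(vgg_pre(batch.copy()), verbose=0)
    r = resnet.predict(res_pre(batch.copy()), verbose=0)
    return np.concatenate([v, r], axis=1)      # (N, 2560) fused feature

dummy = np.random.rand(2, 224, 224, 3).astype("float32") * 255.0
print(fused_features(dummy).shape)             # (2, 2560)
```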

Natural language description of images using hybrid recurrent neural network

Institute of Advanced Engineering and Science (IAES), 2019

We present a learning model that generates natural language descriptions of images. The model exploits the connections between natural language and visual data by producing text-line-based content from a given image. Our Hybrid Recurrent Neural Network model is based on the combination of Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Bi-directional Recurrent Neural Network (BRNN) models. We conducted experiments on three benchmark datasets: Flickr8K, Flickr30K, and MS COCO. Our hybrid model uses the LSTM to encode text lines or sentences independently of object location and the BRNN for word representation, which reduces computational complexity without compromising the accuracy of the descriptor. The model produces better accuracy in retrieving natural-language descriptions on these datasets.
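
Producing a description at inference time is typically a word-by-word decoding loop on top of a trained captioner such as those described above. The sketch below shows plain greedy decoding for a Keras-style merge model; `model`, the tokenizer mappings, and the <start>/<end> tokens are hypothetical stand-ins, not artifacts of this paper.

```python
# Greedy word-by-word caption decoding for a merge-style captioner.
# `model`, `word_to_id`, `id_to_word`, and the <start>/<end> tokens are
# hypothetical stand-ins for a trained model and its fitted tokenizer.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_caption(model, photo_feat, word_to_id, id_to_word, max_len=30):
    tokens = [word_to_id["<start>"]]
    for _ in range(max_len):
        seq = pad_sequences([tokens], maxlen=max_len)           # (1, max_len)
        probs = model.predict([photo_feat[None, :], seq], verbose=0)[0]
        next_id = int(np.argmax(probs))                         # pick best word
        tokens.append(next_id)
        if id_to_word.get(next_id) == "<end>":
            break
    words = [id_to_word[i] for i in tokens[1:] if id_to_word.get(i) != "<end>"]
    return " ".join(words)
```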

Natural Language Processing with Optimal Deep Learning-Enabled Intelligent Image Captioning System

Computers, Materials & Continua, 2023

The recent developments in Multimedia Internet of Things (MIoT) devices, empowered with Natural Language Processing (NLP) models, seem to be a promising future for smart devices. NLP plays an important role in industrial applications such as speech understanding, emotion detection, home automation, and so on. If an image needs to be captioned, then the objects in that image, their actions and connections, and any salient feature that remains under-projected or missing from the image should be identified. The aim of the image captioning process is to generate a caption for an image. In the next step, the image should be provided with one of the most significant and detailed descriptions that is syntactically as well as semantically correct. In this scenario, a computer vision model is used to identify the objects, and NLP approaches are followed to describe the image. The current study develops a Natural Language Processing with Optimal Deep Learning Enabled Intelligent Image Captioning System (NLPODL-IICS). The aim of the presented NLPODL-IICS model is to produce a proper description for an input image. To attain this, the proposed NLPODL-IICS follows two stages, encoding and decoding. Initially, on the encoding side, the proposed NLPODL-IICS model makes use of Hunger Games Search (HGS) with a Neural Architecture Search Network (NASNet) model. This model represents the input data appropriately by encoding it into a vector of predefined length. During the decoding phase, the Chimp Optimization Algorithm (COA) with a deeper Long Short-Term Memory (LSTM) approach is followed to concatenate the description sentences produced by the method. The application of the HGS and COA algorithms helps accomplish proper parameter tuning for the NASNet and LSTM models, respectively. The proposed NLPODL-IICS model was experimentally validated with the help of two benchmark datasets. A widespread comparative analysis confirmed the superior performance of the NLPODL-IICS model over other models.

Oboyob: A sequential-semantic Bengali image captioning engine

Journal of Intelligent & Fuzzy Systems, 2019

Understanding context and generating a textual description from an input image is an active and challenging research topic in computer vision and natural language processing. In the case of the Bengali language, however, the problem is still largely unexplored. In this paper, we address a standard approach to Bengali image caption generation through subsampling a machine-translated dataset. We then use several pre-processing techniques with state-of-the-art CNN-LSTM architecture-based models. The experiments are conducted on the standard Flickr-8K dataset, with several modifications applied to adapt it to the Bengali language. The subsampled training-caption dataset is computed for both Bengali and English, and 16 distinct models are developed in the entire training process. The trained models for both languages are analyzed with respect to several caption evaluation metrics. Further, we establish a baseline performance in Bengali image captioning, defining the limitation of current word embedding approaches compared to internal local embedding.
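
Since this paper probes the limits of standard word-embedding pipelines for Bengali, the snippet below shows the usual first step: fitting a word-level tokenizer on UTF-8 Bengali captions and turning them into padded index sequences for an Embedding layer. The example captions are invented for illustration and are not drawn from the paper's data.

```python
# Word-level tokenization of Bengali captions for an Embedding layer.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = [
    "একটি ছেলে মাঠে ফুটবল খেলছে",      # "a boy is playing football in a field"
    "দুইজন মানুষ নৌকায় বসে আছে",        # "two people are sitting in a boat"
]

tokenizer = Tokenizer(oov_token="<unk>")     # whitespace split keeps Bengali words intact
tokenizer.fit_on_texts(captions)
seqs = tokenizer.texts_to_sequences(captions)
padded = pad_sequences(seqs, maxlen=10, padding="post")

print("vocabulary size:", len(tokenizer.word_index) + 1)
print(padded)
```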