Exploiting the Relationship Between Visual and Textual Features in Social Networks for Image Classification with Zero-Shot Deep Learning

Zero-shot Text Classification via Knowledge Graph Embedding for Social Media Data

IEEE Internet of Things Journal, 2021

The idea of ‘citizen sensing’ and ‘humans as sensors’ is crucial for the social Internet of Things, an integral part of cyber-physical-social systems (CPSS). Social media data, which can be easily collected from the social world, has become a valuable resource for research in many different disciplines, e.g. crisis/disaster assessment, social event detection, or the recent COVID-19 analysis. Useful information, or knowledge derived from social data, could better serve the public if it could be processed and analyzed in more efficient and reliable ways. Advances in deep neural networks have significantly improved the performance of many social media analysis tasks. However, deep learning models typically require a large amount of labeled data for model training, while most CPSS data is not labeled, making it impractical to build effective learning models using traditional approaches. In addition, the current state-of-the-art, pre-trained Natural Language Processing (NLP) models do not make ...

Enhancing Multi-Label Image Classification with Integrated Visual-Textual Data Analysis: A Comparative Study of CLIP and Hybrid Neural Network Architectures

University of Sydney, 2024

This study examines multi-label image classification challenges within a Kaggle competition context, employing two distinct neural network architectures: the CLIP model, utilizing a transformer-based approach for integrating visual and textual data, and a hybrid model that combines GoogLeNet with LSTM to process image features alongside sequential text. The dataset, characterized by significant class imbalance, is addressed through sophisticated preprocessing techniques, the implementation of Focal Loss, and strategic data resampling to enhance model training. Experimental results demonstrate the superiority of the CLIP model in terms of precision and validation accuracy. This model's success can be attributed to its ability to generalize effectively from diverse and extensive pre-training on internet-sourced data, which is essential given the multimodal nature and imbalance present in the dataset. The CLIP model's dual encoding capability efficiently handles both visual and textual inputs, providing a robust framework for understanding complex data interactions. Additionally, the research highlights the critical role of precise model selection, advanced data preprocessing, and the optimization of loss functions and optimizers in improving performance in multi-label classification tasks. By leveraging textual information to refine visual data classification, the study advances computational tools and strategies, ensuring higher predictive accuracy and robustness in handling real-world data complexities.
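
The Focal Loss mentioned above is the kind of imbalance-aware objective that can be outlined in a few lines. A minimal sketch follows, assuming PyTorch and a sigmoid-based multi-label setup; the alpha and gamma values are illustrative, not the study's actual hyperparameters.

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Multi-label focal loss: down-weights well-classified labels so that
    rare, hard labels dominate the gradient (alpha/gamma are illustrative)."""
    probs = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = probs * targets + (1.0 - probs) * (1.0 - targets)      # prob assigned to the true label value
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)  # class-balancing weight
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()

# toy usage: 4 samples, 5 labels
logits = torch.randn(4, 5)
targets = torch.randint(0, 2, (4, 5)).float()
print(sigmoid_focal_loss(logits, targets))
```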

Visual Sentiment Analysis Using Deep Learning Models with Social Media Data

Applied Sciences, 2022

Analyzing the sentiments of people from social media content through text, speech, and images is becoming vital in a variety of applications. Many existing research studies on sentiment analysis rely on textual data, yet alongside text, social media users increasingly share photographs and videos. Compared to text, images are said to convey sentiment more expressively, so there is a clear need for a sentiment analysis model based on images from social media. In our work, we employed different transfer learning models, including the VGG-19, ResNet50V2, and DenseNet-121 models, to perform sentiment analysis based on images. They were fine-tuned by freezing and unfreezing some of the layers, and their performance was boosted by applying regularization techniques. We used the Twitter-based images available in the Crowdflower dataset, which contains URLs of images with their sentiment polarities. Our work also presents a comparative analysis of these pre-trained mode...
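
The freeze/unfreeze fine-tuning this study describes can be outlined briefly. The sketch below assumes a Keras/TensorFlow setup with ResNet50V2 and a three-class sentiment head; the layer counts, dropout rate, and learning rates are assumptions, not the paper's reported settings.

```python
import tensorflow as tf

# Load an ImageNet-pretrained backbone and freeze it so only the new head trains at first.
base = tf.keras.applications.ResNet50V2(weights="imagenet", include_top=False,
                                        input_shape=(224, 224, 3))
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),                     # regularization, as the abstract mentions
    tf.keras.layers.Dense(3, activation="softmax"),   # positive / neutral / negative (assumed)
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Later, unfreeze the top of the backbone and continue training with a lower learning rate.
base.trainable = True
for layer in base.layers[:-30]:      # how many layers to unfreeze is an assumption
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
```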

Deep Neural Networks in Fully Connected CRF for Image Labeling with Social Network Metadata

2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 2019

We propose a novel method for predicting image labels by fusing image content descriptors with the social media context of each image. An image uploaded to a social media site such as Flickr often has meaningful, associated information, such as comments and other images the user has uploaded, that is complementary to pixel content and helpful in predicting labels. Prediction challenges such as ImageNet [3] and MSCOCO [16] use only pixels, while other methods make predictions purely from social media context [18]. Our method is based on a novel fully connected Conditional Random Field (CRF) framework, where each node is an image, and consists of two deep Convolutional Neural Networks (CNN) and one Recurrent Neural Network (RNN) that model both textual and visual node/image information. The edge weights of the CRF graph represent textual similarity and link-based metadata such as user sets and image groups. We model the CRF as an RNN for both learning and inference, and incorporate the weighted ranking loss and cross entropy loss into the CRF parameter optimization to handle the training data imbalance issue. Our proposed approach is evaluated on the MIR-9K dataset and experimentally outperforms current state-of-the-art approaches.
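
The abstract's combination of a weighted ranking loss with cross entropy can be illustrated in simplified form. The sketch below, assuming PyTorch, pairs binary cross entropy with an unweighted pairwise hinge ranking term; it is not the paper's CRF formulation, and the margin and mixing weight are illustrative.

```python
import torch
import torch.nn.functional as F

def ranking_loss(scores, targets, margin=1.0):
    """Pairwise ranking term: every positive label should outscore every
    negative label by at least `margin` (a simplified, unweighted variant)."""
    pos = scores.unsqueeze(2)                                       # (B, L, 1)
    neg = scores.unsqueeze(1)                                       # (B, 1, L)
    pair_mask = targets.unsqueeze(2) * (1 - targets).unsqueeze(1)   # valid (positive, negative) pairs
    hinge = F.relu(margin - (pos - neg))
    return (hinge * pair_mask).sum() / pair_mask.sum().clamp(min=1)

def combined_loss(scores, targets, lam=0.5):
    """Cross entropy plus a ranking term, mixed by an illustrative weight."""
    bce = F.binary_cross_entropy_with_logits(scores, targets)
    return bce + lam * ranking_loss(scores, targets)

scores = torch.randn(4, 10)                        # per-image label scores
targets = torch.randint(0, 2, (4, 10)).float()
print(combined_loss(scores, targets))
```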

Zero-shot Text Classification With Generative Language Models

2019

This work investigates the use of natural language to enable zero-shot model adaptation to new tasks. We use text and metadata from social commenting platforms as a source for a simple pretraining task. We then provide the language model with natural language descriptions of classification tasks as input and train it to generate the correct answer in natural language via a language modeling objective. This allows the model to generalize to new classification tasks without the need for multiple multitask classification heads. We show the zero-shot performance of these generative language models, trained with weak supervision, on six benchmark text classification datasets from the torchtext library. Despite no access to training data, we achieve up to a 45% absolute improvement in classification accuracy over random or majority class baselines. These results show that natural language can serve as simple and powerful descriptors for task adaptation. We believe this points the way to n...
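
The core mechanism, scoring candidate answers as natural-language continuations of a task description, can be sketched compactly. The example below assumes GPT-2 via the Hugging Face transformers library and a hypothetical sentiment task; the paper's actual model, pretraining corpus, and prompt format differ.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def zero_shot_classify(text, labels,
                       task="Classify the sentiment of the text as positive or negative."):
    """Score each candidate label as a continuation of the prompt and return the most likely one."""
    prompt = f"{task}\nText: {text}\nAnswer:"
    prompt_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    best_label, best_score = None, float("-inf")
    for label in labels:
        label_ids = tokenizer(" " + label, return_tensors="pt")["input_ids"]
        ids = torch.cat([prompt_ids, label_ids], dim=1)
        with torch.no_grad():
            logits = model(ids).logits
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # row t predicts token t+1
        n = prompt_ids.shape[1]
        score = log_probs[n - 1:].gather(1, label_ids[0].unsqueeze(1)).sum().item()
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(zero_shot_classify("The movie was a complete waste of time.", ["positive", "negative"]))
```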

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

ArXiv, 2021

Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g. CLIP, DALL-E) have recently surged in popularity, showing remarkable capability to perform zero- or few-shot learning and transfer even in the absence of per-sample labels on target image data. Despite this trend, to date there have been no publicly available datasets of sufficient scale for training such models from scratch. To address this issue, in a community effort we build and release to the public LAION-400M, a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow efficient similarity search.
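
CLIP filtering of candidate image-text pairs amounts to keeping pairs whose image and caption embeddings are sufficiently similar. A minimal sketch follows, assuming the Hugging Face CLIP ViT-B/32 checkpoint; the 0.3 similarity cut-off here is illustrative rather than taken from this abstract.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.3) -> bool:
    """Keep an image-text pair only if its CLIP cosine similarity exceeds the threshold."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item() > threshold

# example: keep_pair(Image.open("cat.jpg"), "a photo of a cat")
```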

Visual Sentiment Analysis for Social Images Using Transfer Learning Approach

A visual sentiment analysis framework can predict the sentiment of an image by analyzing its contents. Nowadays, people are uploading millions of images to social networks such as Twitter, Facebook, Google Plus, and Flickr. These images play a crucial part in expressing users' emotions in online social networks. As a result, image sentiment analysis has become important in the area of online multimedia big data research. Several research works focus on analyzing the sentiment of textual content. However, little investigation has been done to develop models that can predict the sentiment of visual content. In this paper, we propose a novel visual sentiment analysis framework using a transfer learning approach to predict sentiment. We use hyper-parameters learned from a very deep convolutional neural network to initialize our network model to prevent overfitting. We conduct extensive experiments on a Twitter image dataset and show that our model achieves better performance than the current state of the art.

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

2021

Images represent a commonly used form of visual communication among people. Nevertheless, image classification can be a challenging task when dealing with ambiguous or uncommon images that need more context to be correctly annotated. Metadata accompanying images on social media are an ideal source of additional information for retrieving suitable neighborhoods that ease the image annotation task. To this end, we blend visual features extracted from neighbors and their metadata to jointly leverage context and visual cues. Our models use multiple semantic embeddings to achieve the dual objective of being robust to vocabulary changes between train and test sets and decoupling the architecture from the low-level metadata representation. Convolutional and recurrent neural networks (CNNs-RNNs) are jointly adopted to infer similarity among neighbors and query images. We perform comprehensive experiments on the NUS-WIDE dataset showing that our models outperform state-of-the-art architectures ba...
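
One way to picture the blending of a query image's visual features with neighbor metadata is a small fusion module. The sketch below, assuming PyTorch, combines each neighbor's CNN features with an averaged tag embedding, summarizes the neighborhood with a GRU, and concatenates the summary with the query's features; all dimensions and the 81-label head (matching NUS-WIDE's concept count) are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class NeighborFusion(nn.Module):
    """Toy sketch: fuse a query image's CNN features with a GRU summary of its
    neighbors' visual features and averaged tag embeddings, then predict labels."""
    def __init__(self, img_dim=2048, tag_dim=300, hidden=512, n_labels=81):
        super().__init__()
        self.neighbor_proj = nn.Linear(img_dim + tag_dim, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden + img_dim, n_labels)

    def forward(self, query_feat, neighbor_feats, neighbor_tag_embs):
        # query_feat: (B, img_dim); neighbor_feats: (B, K, img_dim); neighbor_tag_embs: (B, K, tag_dim)
        x = torch.cat([neighbor_feats, neighbor_tag_embs], dim=-1)
        x = torch.relu(self.neighbor_proj(x))
        _, h = self.rnn(x)                          # h: (1, B, hidden) neighborhood summary
        fused = torch.cat([h.squeeze(0), query_feat], dim=-1)
        return self.head(fused)                     # raw multi-label logits

model = NeighborFusion()
logits = model(torch.randn(2, 2048), torch.randn(2, 5, 2048), torch.randn(2, 5, 300))
print(logits.shape)   # torch.Size([2, 81])
```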

From Image to Text in Sentiment Analysis via Regression and Deep Learning

Proceedings - Natural Language Processing in a Deep Learning World

Images and text represent types of content which are used together for conveying user emotions in online social networks. These contents are usually associated with a sentiment category. In this paper, we investigate an approach for mapping images to text for three types of sentiment categories: positive, neutral and negative. The mapping from images to text is performed using a Kernel Ridge Regression model. We considered two types of image features: i) RGB pixel-value features, and ii) features extracted with a deep learning approach. The experimental evaluation was performed on a Twitter data set containing both text and images and the sentiments associated with these. The experimental results show a difference in performance for different sentiment categories; in particular, the proposed mapping performs better for the positive sentiment category than for the neutral and negative ones. Furthermore, the experimental results show that the more complex deep learning features perform better than the RGB pixel-value features for all sentiment categories and for larger training sets.
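
The image-to-text mapping via Kernel Ridge Regression can be sketched with scikit-learn. The example below uses random stand-ins for the image features and paired text embeddings; the kernel choice and regularization values are illustrative, not those reported in the paper.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins: deep image features and text embeddings for the same tweets;
# real features would come from the dataset described above.
rng = np.random.default_rng(0)
X_img = rng.normal(size=(200, 512))     # image features (e.g. CNN activations)
Y_txt = rng.normal(size=(200, 300))     # paired text embeddings

# Map image features into the text-embedding space with an RBF-kernel ridge model.
krr = KernelRidge(kernel="rbf", alpha=1.0, gamma=1e-3)
krr.fit(X_img, Y_txt)

# A new image is "translated" to text space, then matched to the closest known text,
# whose sentiment label would then be transferred to the image.
pred = krr.predict(X_img[:1])
nearest = cosine_similarity(pred, Y_txt).argmax()
print("closest text index:", nearest)
```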

Classification of Instagram photos: topic modelling vs transfer learning

Proceedings of the 12th Hellenic Conference on Artificial Intelligence

The existence of pre-trained deep learning models for image classification, such as those trained on the well-known ResNet-50 architecture, allows for easy application of transfer learning to several domains, including image retrieval. Recently, we proposed topic modelling for the retrieval of Instagram photos based on the associated hashtags. In this paper we compare content-based image classification, based on transfer learning, with classification based on topic modelling of Instagram hashtags for a set of 24 different concepts. The comparison was performed on a set of 1944 Instagram photos, 81 per concept. Despite the excellent performance of the pre-trained deep learning models, it appears that text-based retrieval, as performed by the topic models of Instagram hashtags, still performs better.
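
The hashtag topic-modelling side of the comparison can be sketched with a standard LDA pipeline. The example below, assuming scikit-learn, treats each photo's hashtags as a document; the toy corpus and two-topic setting are illustrative, whereas the study works with 24 concepts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each "document" is the hashtag list of one Instagram photo (toy examples).
docs = [
    "sunset beach sea summer waves",
    "gym fitness workout health lifting",
    "pasta dinner foodie italian cooking",
    "sunrise ocean sand vacation beach",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Two topics keep the toy example readable; 24 would mirror the study's concept set.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)
print(doc_topics.round(2))   # per-photo topic distribution used for retrieval/classification
```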