Tushar Karayil | TU Kaiserslautern

Papers by Tushar Karayil

The Focus-Aspect-Polarity Model for Predicting Subjective Noun Attributes in Images

arXiv (Cornell University), Oct 15, 2018

Subjective visual interpretation is a challenging yet important topic in computer vision. Many approaches reduce this problem to the prediction of adjective or attribute labels from images. However, most of these do not take attribute semantics into account, or only process the image in a holistic manner. Furthermore, there is a lack of relevant datasets with fine-grained subjective labels. In this paper, we propose the Focus-Aspect-Polarity model to structure the process of capturing subjectivity in image processing, and introduce a novel dataset following this way of modeling. We run experiments on this dataset to compare several deep learning methods and find that incorporating context information based on tensor multiplication in several cases outperforms the default way of information fusion (concatenation).

The Focus–Aspect–Value model for predicting subjective visual attributes

International Journal of Multimedia Information Retrieval, Jan 2, 2020

Predicting subjective visual interpretation is important for several prominent tasks in computer vision, including multimedia retrieval. Many approaches reduce this problem to the prediction of adjective or attribute labels from images while neglecting attribute semantics and only processing the image in a holistic manner. Furthermore, there is a lack of relevant datasets with fine-grained subjective labels and sufficient scale for machine learning. In this paper, we explain the Focus-Aspect-Value (FAV) model to break down the process of subjective image interpretation into three steps and describe a dataset following this way of modeling. We train and evaluate several deep learning methods on this dataset, and extend the experiments of the paper originally introducing FAV by adding a new evaluation metric, improving the concatenation approach and adding Multiplicative Fusion as another method. In our experiments, Tensor Fusion is among the best performing methods across all measures and outperforms the default way of information fusion (concatenation). In addition, we find that the way of combining information in neural networks not only affects prediction performance but can drastically change other properties of the model as well.
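
As a rough illustration of the fusion strategies this abstract contrasts, the sketch below combines an image feature vector with a context vector by concatenation, element-wise multiplication, and an outer product. This is a minimal NumPy mock-up with invented vector sizes, not the paper's architecture, where these operations sit inside deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal(8)  # image feature vector (e.g. from a CNN)
ctx = rng.standard_normal(8)  # context vector (e.g. encoding the aspect)

# 1) Concatenation (the default): features sit side by side and any
#    interaction must be learned by later layers.
concat = np.concatenate([img, ctx])           # shape (16,)

# 2) Multiplicative Fusion: element-wise product; each image feature is
#    gated by the corresponding context feature.
mult = img * ctx                              # shape (8,)

# 3) Tensor Fusion: outer product; every pairwise interaction between the
#    two vectors appears explicitly before being flattened.
tensor = np.outer(img, ctx).reshape(-1)       # shape (64,)

print(concat.shape, mult.shape, tensor.shape)  # (16,) (8,) (64,)
```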

AttrLostGAN: Attribute Controlled Image Synthesis from Reconfigurable Layout and Style

arXiv (Cornell University), Mar 25, 2021

Affective Image Captioning: Extraction and Semantic Arrangement of Image Information with Deep Neural Networks

Parts of the research and material (including figures, tables and algorithms) in this thesis have already been published in (or accepted in):

The Focus-Aspect-Value Model for Explainable Prediction of Subjective Visual Interpretation

Proceedings of the 2019 on International Conference on Multimedia Retrieval, 2019

Subjective visual interpretation is a challenging yet important topic in computer vision. Many approaches reduce this problem to the prediction of adjective or attribute labels from images. However, most of these do not take attribute semantics into account, or only process the image in a holistic manner. Furthermore, there is a lack of relevant datasets with fine-grained subjective labels. In this paper, we propose the Focus-Aspect-Value (FAV) model to structure the process of capturing subjectivity in image processing, and introduce a novel dataset following this way of modeling. We run experiments on this dataset to compare several deep learning methods and find that incorporating context information based on tensor multiplication outperforms the default way of information fusion (concatenation).

Approach for Printed Devanagari Script Recognition

…consonant conjuncts and consonant-vowel combinations take different forms based on their position in the word. We also introduce a new database, Deva-DB, of Devanagari script (free of cost) to aid research towards a robust Devanagari OCR system. On this database, the LSTM-based OCRopus system yields error rates ranging from 1.2% to 9.0% depending upon the complexity of the training and test data. A comparison with the open-source Tesseract system is also presented for the same database.
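
The quoted error rates read like standard character error rates. As a hedged sketch (the exact evaluation protocol for Deva-DB is not stated here), this is the usual recipe: Levenshtein edit distance between recognized and ground-truth text, normalized by the ground-truth length.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_error_rate(recognized: str, truth: str) -> float:
    return edit_distance(recognized, truth) / max(len(truth), 1)

# one wrong character out of eight -> 12.5% character error rate
print(char_error_rate("देवनागरि", "देवनागरी"))  # 0.125
```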

Fusion Strategies for Learning User Embeddings with Neural Networks

2019 International Joint Conference on Neural Networks (IJCNN), 2019

Growing amounts of online user data motivate the need for automated processing techniques. In the case of user ratings, one interesting option is to use neural networks for learning to predict ratings given an item and a user. While training for prediction, such an approach at the same time learns to map each user to a vector, a so-called user embedding. Such embeddings can, for example, be valuable for estimating user similarity. However, there are various ways in which item and user information can be combined in neural networks, and it is unclear how the way of combining affects the resulting embeddings. In this paper, we run an experiment on movie ratings data, where we analyze the effect of several fusion strategies in neural networks on embedding quality. For evaluating embedding quality, we propose a novel measure, Pair-Distance Correlation, which quantifies the condition that similar users should have similar embedding vectors. We find that the fusion strategy affects results in terms of both prediction performance and embedding quality. Surprisingly, we find that prediction performance does not necessarily reflect embedding quality. This suggests that if embeddings are of interest, the common tendency to select models based on their prediction ability should be reconsidered.
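
One plausible reading of Pair-Distance Correlation, matching the condition the abstract states (similar users should have similar embedding vectors), is the correlation between pairwise distances in embedding space and pairwise distances in a reference space such as the raw rating vectors. The paper's exact definition may differ; all names and sizes below are illustrative.

```python
import numpy as np
from itertools import combinations

def pair_distance_correlation(embeddings, reference):
    """Pearson correlation between pairwise distances in two spaces."""
    pairs = list(combinations(range(len(embeddings)), 2))
    d_emb = np.array([np.linalg.norm(embeddings[i] - embeddings[j])
                      for i, j in pairs])
    d_ref = np.array([np.linalg.norm(reference[i] - reference[j])
                      for i, j in pairs])
    return float(np.corrcoef(d_emb, d_ref)[0, 1])

rng = np.random.default_rng(1)
ratings = rng.integers(1, 6, size=(20, 30)).astype(float)  # 20 users x 30 items
embeddings = ratings @ rng.standard_normal((30, 8))        # stand-in "embedding"
print(pair_distance_correlation(embeddings, ratings))      # high for this linear map
```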

Generating Affective Captions using Concept And Syntax Transition Networks

Proceedings of the 24th ACM international conference on Multimedia, 2016

The area of image captioning, i.e. the automatic generation of short textual descriptions of images, has experienced much progress recently. However, image captioning approaches often focus only on describing the content of the image without any emotional or sentimental dimension, which is common in human captions. This paper presents an approach for image captioning designed specifically to incorporate emotions and feelings into the caption generation process. The presented approach consists of a Deep Convolutional Neural Network (CNN) for detecting Adjective Noun Pairs in the image and a graphical network architecture called the "Concept And Syntax Transition (CAST)" network for generating sentences from these detected concepts.
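
To give a feel for the two-stage idea the abstract describes (detect Adjective Noun Pairs, then generate a sentence by moving through a transition structure), here is a deliberately tiny toy, not the paper's CAST network: the graph, weights, and vocabulary are all invented for illustration.

```python
import random

detected_anp = ("happy", "dog")  # pretend output of the CNN ANP detector

# current token -> list of (next token, transition weight); all invented
transitions = {
    "<start>": [("a", 1.0)],
    "a": [(detected_anp[0], 0.7), (detected_anp[1], 0.3)],
    detected_anp[0]: [(detected_anp[1], 1.0)],
    detected_anp[1]: [("playing", 0.5), ("<end>", 0.5)],
    "playing": [("outside", 1.0)],
    "outside": [("<end>", 1.0)],
}

def walk(graph, max_len=10):
    """Sample a caption by walking the transition graph from <start>."""
    token, words = "<start>", []
    while token != "<end>" and len(words) < max_len:
        options, weights = zip(*graph[token])
        token = random.choices(options, weights=weights)[0]
        if token != "<end>":
            words.append(token)
    return " ".join(words)

random.seed(3)
print(walk(transitions))  # e.g. "a happy dog playing outside"
```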

Introducing Concept And Syntax Transition Networks for Image Captioning

Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, 2016

The area of image captioning, i.e. the automatic generation of short textual descriptions of images, has experienced much progress recently. However, image captioning approaches often focus only on describing the content of the image without any emotional or sentimental dimension, which is common in human captions. This paper presents an approach for image captioning designed specifically to incorporate emotions and feelings into the caption generation process. The presented approach consists of a Deep Convolutional Neural Network (CNN) for detecting Adjective Noun Pairs in the image and a novel graphical network architecture called the "Concept And Syntax Transition (CAST)" network for generating sentences from these detected concepts.

Conditional GANs for Image Captioning with Sentiments

Artificial Neural Networks and Machine Learning – ICANN 2019: Text and Time Series, 2019

The area of automatic image captioning has witnessed much progress recently. However, generating captions with sentiment, which is a common dimension in human-generated captions, still remains a challenge. This work presents a generative approach that combines sentiment (positive/negative) and variation for caption generation. The presented approach consists of a Generative Adversarial Network which takes as input an image and a binary vector indicating the sentiment of the caption to be generated. We evaluate our model quantitatively on the state-of-the-art image caption dataset and qualitatively using a crowdsourcing platform. Our results, along with human evaluation, show that we competitively succeed in the task of creating variations and sentiment in image captions.
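
The conditioning scheme described here (an image plus a binary sentiment vector fed to the generator) could be wired up roughly as in the PyTorch sketch below. Layer sizes, the decoder, and all names are assumptions for illustration; the paper's actual generator and the adversarial training loop are not reproduced.

```python
import torch
import torch.nn as nn

class SentimentCaptionGenerator(nn.Module):
    """Toy generator conditioned on image features and a sentiment bit."""
    def __init__(self, img_dim=2048, noise_dim=100, hidden=512, vocab=10000):
        super().__init__()
        # image features + 2-dim one-hot sentiment (pos/neg) + noise
        self.fuse = nn.Linear(img_dim + 2 + noise_dim, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.to_vocab = nn.Linear(hidden, vocab)

    def forward(self, img_feat, sentiment, noise, seq_len=16):
        z = torch.cat([img_feat, sentiment, noise], dim=-1)
        h = torch.tanh(self.fuse(z))                  # fused conditioning
        steps = h.unsqueeze(1).repeat(1, seq_len, 1)  # feed at every step
        out, _ = self.decoder(steps)
        return self.to_vocab(out)                     # per-step word logits

g = SentimentCaptionGenerator()
img = torch.randn(4, 2048)             # e.g. CNN image features
sent = torch.tensor([[1.0, 0.0]] * 4)  # "positive" sentiment for the batch
logits = g(img, sent, torch.randn(4, 100))
print(logits.shape)                    # torch.Size([4, 16, 10000])
```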

Image Captioning in the Wild

Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes, 2017

Automatic image captioning is a well-known problem in the field of artificial intelligence. Solving it well also requires understanding how people caption images naturally (when not instructed by a set of rules telling them to do so in a certain way). This dimension of the problem is rarely discussed. To understand this aspect, we performed a crowdsourcing study on specific subsets of the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M), where annotators evaluated captions with respect to subjectivity, visibility, appeal and intent. We use the resulting data to systematically characterize the variations in image captions that appear "in the wild". We publish our findings here along with the annotated dataset.
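
An annotated dataset along these four dimensions might be summarized as in the sketch below; the field names and scores are hypothetical, since the released dataset's actual schema is not given here.

```python
from statistics import mean

# two toy annotation records; real records would come from the dataset
annotations = [
    {"caption": "sunset over the pier", "subjectivity": 2,
     "visibility": 5, "appeal": 4, "intent": 3},
    {"caption": "my favourite place on earth", "subjectivity": 5,
     "visibility": 2, "appeal": 4, "intent": 5},
]

for dim in ("subjectivity", "visibility", "appeal", "intent"):
    print(dim, mean(a[dim] for a in annotations))
```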
