Designing Multi-Modal Embedding Fusion-Based Recommender
Related papers
Multi-modal Embedding Fusion-based Recommender
arXiv, 2020
Recommendation systems have recently gained popularity globally, with primary use cases in online interaction systems and a significant focus on e-commerce platforms. We have developed a machine learning-based recommendation platform, which can be easily applied to almost any domain of items and/or actions. Contrary to existing recommendation systems, our platform natively supports multiple types of interaction data together with multiple modalities of metadata. This is achieved through multi-modal fusion of various data representations. We deployed the platform in multiple e-commerce stores of different kinds, e.g., food and beverages, shoes, fashion items, and telecom operators. Here, we present our system, its flexibility and performance. We also show benchmark results on open datasets that significantly outperform state-of-the-art prior work.
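As a concrete illustration of the kind of multi-modal fusion of data representations this abstract describes, below is a minimal sketch of late fusion by normalized concatenation. The encoder functions, item names, and dimensions are hypothetical stand-ins, not the paper's actual models:

```python
import numpy as np

# Hypothetical per-modality encoders: each returns a fixed-size vector for an item.
def encode_text(description: str, dim: int = 8) -> np.ndarray:
    # Stand-in for a real text encoder (e.g., a sentence embedding model).
    rng = np.random.default_rng(abs(hash(description)) % (2**32))
    return rng.normal(size=dim)

def encode_image(image_id: str, dim: int = 8) -> np.ndarray:
    # Stand-in for a real image encoder (e.g., a CNN feature extractor).
    rng = np.random.default_rng(abs(hash(image_id)) % (2**32))
    return rng.normal(size=dim)

def fuse(modality_vectors: list[np.ndarray]) -> np.ndarray:
    # Simple late fusion: L2-normalize each modality, then concatenate,
    # so no single modality dominates by scale.
    normed = [v / (np.linalg.norm(v) + 1e-9) for v in modality_vectors]
    return np.concatenate(normed)

item_vec = fuse([encode_text("red running shoes"), encode_image("img_123")])
print(item_vec.shape)  # (16,) -- one joint vector per item
```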
Multi-Modal Recommendation System with Auxiliary Information
arXiv, 2022
Context-aware recommendation systems improve upon classical recommender systems by incorporating a user's behaviour into the modelling. Research into context-aware recommendation systems has previously considered only the sequential ordering of items as contextual information. However, there is a wealth of unexploited multimodal information available in auxiliary knowledge related to items. This study extends the existing research by evaluating a multi-modal recommendation system that exploits the inclusion of comprehensive auxiliary knowledge related to an item. We extract vector representations (embeddings) from unstructured and structured data using data2vec. The fused embeddings are then used to train several state-of-the-art transformer architectures for sequential user-item representations. The analysis of the experimental results shows a statistically significant improvement in prediction accuracy, which confirms the effectiveness of including auxiliary information in a context-aware recommendation system. We report a 4% and 11% increase in the NDCG score for long and short user sequence datasets, respectively.
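The pipeline this abstract outlines (pre-extracted auxiliary embeddings fused with item IDs, then fed to a transformer for next-item prediction) could be sketched as follows. The summation-based fusion, dimensions, and all names are illustrative assumptions, with random vectors standing in for real data2vec features:

```python
import torch
import torch.nn as nn

class SequentialRecommender(nn.Module):
    """Toy next-item model over fused item embeddings (placeholder for data2vec features)."""
    def __init__(self, n_items: int, id_dim: int = 32, aux_dim: int = 16):
        super().__init__()
        self.id_emb = nn.Embedding(n_items, id_dim)
        # Project pre-extracted auxiliary features (text/structured) into the model space.
        self.aux_proj = nn.Linear(aux_dim, id_dim)
        layer = nn.TransformerEncoderLayer(d_model=id_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(id_dim, n_items)

    def forward(self, item_ids, aux_feats):
        # Fuse by summation: ID embedding + projected auxiliary embedding.
        x = self.id_emb(item_ids) + self.aux_proj(aux_feats)
        h = self.encoder(x)        # contextualize the sequence
        return self.out(h[:, -1])  # score the next item from the last position

model = SequentialRecommender(n_items=1000)
ids = torch.randint(0, 1000, (2, 5))  # batch of 2 sequences, length 5
aux = torch.randn(2, 5, 16)           # per-item auxiliary embeddings
print(model(ids, aux).shape)          # torch.Size([2, 1000])
```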
Multi-Modal Recommender Systems: Hands-On Exploration
Fifteenth ACM Conference on Recommender Systems
Recommender systems typically learn from user-item preference data such as ratings and clicks. This information is sparse in nature, i.e., observed user-item preferences often represent less than 5% of possible interactions. One promising direction to alleviate data sparsity is to leverage auxiliary information that may encode additional clues on how users consume items. Examples of such data (referred to as modalities) are social networks, items' descriptive text, and product images. The objective of this tutorial is to offer a comprehensive review of recent advances to represent, transform and incorporate the different modalities into recommendation models. Moreover, through practical hands-on sessions, we consider cross-model/modality comparisons to investigate the importance of different methods and modalities. The hands-on exercises are conducted with Cornac (https://cornac.preferred.ai), a comparative framework for multimodal recommender systems. The materials are made available at https://preferred.ai/recsys21-tutorial/.
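A minimal Cornac experiment in the spirit of the tutorial might look like the sketch below. The specific models, hyperparameters, and metrics are illustrative choices, not the tutorial's actual materials; multimodal models such as VBPR additionally take an item modality (e.g., cornac.data.ImageModality) through the evaluation method:

```python
import cornac
from cornac.eval_methods import RatioSplit
from cornac.models import MF, BPR
from cornac.metrics import Recall, NDCG

# Load a small built-in dataset of (user, item, rating) triples.
feedback = cornac.datasets.movielens.load_feedback(variant="100K")
split = RatioSplit(data=feedback, test_size=0.2, rating_threshold=4.0, seed=123)

# Compare two classical baselines side by side on ranking metrics.
cornac.Experiment(
    eval_method=split,
    models=[MF(k=10, max_iter=25, seed=123), BPR(k=10, max_iter=200, seed=123)],
    metrics=[Recall(k=10), NDCG(k=10)],
).run()
```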
Specializing Joint Representations for the task of Product Recommendation
Proceedings of the 2nd Workshop on Deep Learning for Recommender Systems - DLRS 2017, 2017
We propose a unified product embedded representation that is optimized for the task of retrieval-based product recommendation. To this end, we introduce a new way to fuse modality-specific product embeddings into a joint product embedding, in order to leverage both product content information, such as textual descriptions and images, and product collaborative filtering signal. By introducing the fusion step at the very end of our architecture, we are able to train each modality separately, allowing us to keep a modular architecture that is preferable in real-world recommendation deployments. We analyze our performance on normal and hard recommendation setups such as cold-start and cross-category recommendations and achieve good performance on a large product shopping dataset.
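The late-fusion idea described here, each modality trained separately and combined only at the very end, can be sketched as follows. The embedding tables, dimensions, and fusion weights are hypothetical:

```python
import numpy as np

# Assume three independently trained embedding tables (hypothetical shapes):
# collaborative-filtering, text, and image vectors for the same catalog.
n_items = 100
cf_emb = np.random.randn(n_items, 32)
text_emb = np.random.randn(n_items, 64)
img_emb = np.random.randn(n_items, 64)

def joint_embedding(weights=(1.0, 0.5, 0.5)):
    # Late fusion: each modality is trained on its own, then combined at the
    # end with scalar weights, keeping the pipeline modular (a modality can
    # be retrained or swapped without touching the others).
    parts = []
    for w, emb in zip(weights, (cf_emb, text_emb, img_emb)):
        normed = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-9)
        parts.append(w * normed)
    return np.concatenate(parts, axis=1)

joint = joint_embedding()
# Retrieval: nearest neighbours in the joint space.
query = joint[0]
scores = joint @ query
print(np.argsort(-scores)[:5])  # top-5 most similar items to item 0
```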
arXiv, 2017
The main idea of this paper is to represent shopping items through vectors, because these vectors act as the base for building embeddings for customers and shopping carts. These vectors are also input to the mathematical models that act either as a recommendation engine or help in targeting potential customers. We have used exponential family embeddings as the tool to construct two basic vectors - product embeddings and context vectors. Using the basic vectors, we build combined embeddings, trip embeddings and customer embeddings. Combined embeddings mix linguistic properties of product names with their shopping patterns. The customer embeddings establish an understanding of the buying pattern of customers in a group and help in building customer profiles. For example, a customer profile can represent customers frequently buying pet-food. Identifying such profiles can help us bring out offers and discounts. Similarly, trip embeddings are used to build trip profiles. People happen ...
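A toy version of building trip and customer embeddings on top of product vectors might look like the sketch below. The products, vectors, and the simple mean-pooling are illustrative assumptions, not the paper's exponential family embeddings:

```python
import numpy as np

# Hypothetical product embeddings (stand-ins for learned vectors).
products = {"dog_food": np.array([0.9, 0.1]), "cat_food": np.array([0.8, 0.2]),
            "shampoo": np.array([0.1, 0.9])}

def trip_embedding(basket):
    # A shopping trip is summarized as the mean of its product vectors.
    return np.mean([products[p] for p in basket], axis=0)

def customer_embedding(trips):
    # A customer is summarized as the mean over their trip embeddings.
    return np.mean([trip_embedding(t) for t in trips], axis=0)

cust = customer_embedding([["dog_food", "cat_food"], ["dog_food", "shampoo"]])
# A profile centroid (e.g., "pet-food buyers") can then be matched by cosine similarity.
pet_profile = np.array([1.0, 0.0])
sim = cust @ pet_profile / (np.linalg.norm(cust) * np.linalg.norm(pet_profile))
print(round(float(sim), 3))
```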
Collectively Embedding Multi-Relational Data for Predicting User Preferences
2015
Matrix factorization has found incredible success and widespread application as a collaborative filtering based approach to recommendations. Unfortunately, incorporating additional sources of evidence, especially ones that are incomplete and noisy, is quite difficult to achieve in such models, yet it is often crucial for obtaining further gains in accuracy. For example, additional information about businesses from reviews, categories, and attributes should be leveraged for predicting user preferences, even though this information is often inaccurate and partially observed. Instead of creating customized methods that are specific to each type of evidence, in this paper we present a generic approach to factorization of relational data that collectively models all the relations in the database. By learning a set of embeddings that are shared across all the relations, the model is able to incorporate observed information from all the relations, while also predicting all the relations of interest. Our evaluation on multiple Amazon and Yelp datasets demonstrates effective utilization of additional information for held-out preference prediction; further, we obtain accurate models even for cold-start businesses and products for which we do not observe any ratings or reviews. We also illustrate the capability of the model in imputing missing information and jointly visualizing words, categories, and attribute factors.
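A minimal sketch of collective factorization with shared embeddings follows, assuming two toy relations (user-business ratings and business-word occurrences) and a plain squared-error objective rather than the paper's actual model:

```python
import torch

n_users, n_biz, n_words, dim = 50, 40, 30, 8
ratings = torch.randint(0, 2, (n_users, n_biz)).float()    # user-business relation
biz_words = torch.randint(0, 2, (n_biz, n_words)).float()  # business-word relation

# Shared business embeddings B tie the two relations together.
U = torch.randn(n_users, dim, requires_grad=True)
B = torch.randn(n_biz, dim, requires_grad=True)
W = torch.randn(n_words, dim, requires_grad=True)
opt = torch.optim.Adam([U, B, W], lr=0.05)

for step in range(200):
    opt.zero_grad()
    # Jointly reconstruct both relations: evidence from reviews (biz_words)
    # shapes B, which in turn improves preference prediction (ratings).
    loss = ((U @ B.T - ratings) ** 2).mean() + ((B @ W.T - biz_words) ** 2).mean()
    loss.backward()
    opt.step()

# Cold start: a business with no ratings still gets an embedding from its
# words, so U @ B.T yields preference scores for it anyway.
```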
Multimodal data fusion framework based on autoencoders for top-N recommender systems
Applied Intelligence, 2019
In this paper, we present a novel multimodal framework for video recommendation based on deep learning. Unlike most common solutions, we formulate video recommendations by exploiting two data modalities simultaneously, namely: (i) the visual (i.e., image sequence) and (ii) the textual modalities, which in conjunction with the audio stream constitute the elementary data of a video document. More specifically, our framework first describes textual data using the bag-of-words and TF-IDF models, fusing those features with deep convolutional descriptors extracted from the visual data. As a result, we obtain a multimodal descriptor for each video document, from which we construct a low-dimensional sparse representation using autoencoders. To perform the recommendation task, we extend a sparse linear method with side information (SSLIM) by taking into account the sparse representations of video descriptors previously computed. By doing this, we are able to produce a ranking of the top-N most relevant videos for the user. Note that our framework is flexible, i.e., one may use other types of modalities, autoencoders, and fusion architectures. Experimental results obtained on three real datasets (MovieLens-1M, MovieLens-10M and Vine), containing 3,320, 8,400 and 18,576 videos, respectively, show that our framework can improve recommendation results by up to 60.6% when compared to a single-modality recommendation model, and by up to 31% when compared to the state-of-the-art methods used as baselines in our study, demonstrating the effectiveness of our framework and highlighting the usefulness of multimodal information in recommender systems.
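The autoencoder compression step could be sketched as below; the descriptor dimensions, network sizes, and training loop are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FusionAutoencoder(nn.Module):
    """Compress a fused (text TF-IDF + visual CNN) descriptor to a low-dim code."""
    def __init__(self, in_dim: int, code_dim: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, code_dim), nn.ReLU())
        self.dec = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                 nn.Linear(128, in_dim))

    def forward(self, x):
        code = self.enc(x)  # low-dimensional item representation
        return self.dec(code), code

# Hypothetical fused descriptors: e.g., 500-dim TF-IDF + 512-dim CNN features.
x = torch.randn(64, 1012)
model = FusionAutoencoder(in_dim=1012)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    recon, code = model(x)
    loss = ((recon - x) ** 2).mean()  # reconstruction objective
    loss.backward()
    opt.step()
# `code` would then feed the SLIM-style ranking model as side information.
```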
Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018
Recommender systems (RSs) have been the most important technology for increasing business at Taobao, the largest online consumer-to-consumer (C2C) platform in China. There are three major challenges facing RSs in Taobao: scalability, sparsity and cold start. In this paper, we present our technical solutions to address these three challenges. The methods are based on a well-known graph embedding framework. We first construct an item graph from users' behavior history, and learn the embeddings of all items in the graph. The item embeddings are employed to compute pairwise similarities between all items, which are then used in the recommendation process. To alleviate the sparsity and cold start problems, side information is incorporated into the graph embedding framework. We propose two aggregation methods to integrate the embeddings of items and the corresponding side information. Results from offline experiments show that methods incorporating side information are superior to those that do not. Further, we describe the platform upon which the embedding methods are deployed and the workflow used to process the billion-scale data in Taobao. Using A/B tests, we show that the online Click-Through-Rates (CTRs) improve over the collaborative filtering based methods previously widely used in Taobao, further demonstrating the effectiveness and feasibility of our proposed methods in Taobao's live production environment.
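A toy version of the side-information aggregation, in the spirit of the paper's weighted variant, might look like this; the embeddings, weights, and dimensions are hypothetical:

```python
import numpy as np

def aggregate_item(item_emb, side_embs, logits):
    # Weighted-average aggregation: the item embedding and each
    # side-information embedding (brand, category, ...) are combined with
    # softmax-normalized, learnable per-item weights.
    embs = np.stack([item_emb] + side_embs)    # (1 + n_side, dim)
    w = np.exp(logits) / np.exp(logits).sum()  # softmax over sources
    return w @ embs                            # (dim,)

dim = 16
item = np.random.randn(dim)
brand = np.random.randn(dim)
cat = np.random.randn(dim)
logits = np.array([1.0, 0.3, 0.3])  # hypothetical learned weights
h = aggregate_item(item, [brand, cat], logits)
# A cold-start item with no behavior data still gets a usable vector
# from its brand/category embeddings alone.
print(h.shape)  # (16,)
```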
arXiv, 2016
Classifying products into categories precisely and efficiently is a major challenge in modern e-commerce. The high traffic of new products uploaded daily and the dynamic nature of the categories raise the need for machine learning models that can reduce the cost and time of human editors. In this paper, we propose a decision-level fusion approach for multi-modal product classification using text and image inputs. We train input-specific state-of-the-art deep neural networks for each input source, show the potential of forging them together into a multi-modal architecture, and train a novel policy network that learns to choose between them. Finally, we demonstrate that our multi-modal network improves top-1 accuracy over both single-input networks on a real-world large-scale product classification dataset that we collected from Walmart.com. While we focus on the image-text fusion that characterizes e-commerce domains, our algorithms can be easily applied to other modalities such as audio, video, physical sensors, etc.
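The decision-level fusion idea, a policy network choosing between the text and image classifiers, can be sketched as follows. The architecture and the soft (rather than hard) selection are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PolicyFusion(nn.Module):
    """Decision-level fusion: a small policy net picks which expert to trust."""
    def __init__(self, n_classes: int):
        super().__init__()
        # The policy sees both experts' class probabilities and outputs
        # a preference for the text expert vs. the image expert.
        self.policy = nn.Sequential(nn.Linear(2 * n_classes, 64), nn.ReLU(),
                                    nn.Linear(64, 2))

    def forward(self, text_probs, image_probs):
        logits = self.policy(torch.cat([text_probs, image_probs], dim=-1))
        choice = torch.softmax(logits, dim=-1)
        # Soft selection between the two single-modality predictions.
        return choice[:, :1] * text_probs + choice[:, 1:] * image_probs

n_classes = 10
fuser = PolicyFusion(n_classes)
text_probs = torch.softmax(torch.randn(4, n_classes), dim=-1)   # text net output
image_probs = torch.softmax(torch.randn(4, n_classes), dim=-1)  # image net output
print(fuser(text_probs, image_probs).shape)  # torch.Size([4, 10])
```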
Adaptive Multi-Modalities Fusion in Sequential Recommendation Systems
Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
In sequential recommendation, multi-modal information (e.g., text or image) can provide a more comprehensive view of an item's profile. The optimal stage (early or late) to fuse modality features into item representations is still debated. We propose a graph-based approach (named MMSR) to fuse modality features in an adaptive order, enabling each modality to prioritize either its inherent sequential nature or its interplay with other modalities. MMSR represents each user's history as a graph, where the modality features of each item in a user's history sequence are denoted by cross-linked nodes. The edges between homogeneous nodes represent intra-modality sequential relationships, and the ones between heterogeneous nodes represent inter-modality interdependence relationships. During graph propagation, MMSR incorporates dual attention, differentiating homogeneous and heterogeneous neighbors. To adaptively assign nodes with distinct fusion orders, MMSR allows each node's representation to be asynchronously updated through an update gate. In scenarios where modalities exhibit stronger sequential relationships, the update gate prioritizes updates among homogeneous nodes. Conversely, when the interdependent relationships between modalities are more pronounced, the update gate prioritizes updates among heterogeneous nodes. Consequently, MMSR establishes a fusion order that spans a spectrum from early to late modality fusion. In experiments across six datasets, MMSR consistently outperforms state-of-the-art models, and our graph propagation methods surpass other graph neural networks. Additionally, MMSR naturally manages missing modalities.
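The asynchronous update gate at the heart of this approach could be sketched in isolation like this; the gating form and dimensions are assumptions for illustration, not MMSR's actual implementation:

```python
import torch
import torch.nn as nn

class GatedNodeUpdate(nn.Module):
    """Toy update gate balancing homogeneous vs. heterogeneous messages."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 1)

    def forward(self, h, m_homo, m_hetero):
        # h: current node state; m_homo: aggregated same-modality (sequential)
        # neighbors; m_hetero: aggregated cross-modality neighbors.
        z = torch.sigmoid(self.gate(torch.cat([h, m_homo, m_hetero], dim=-1)))
        # z -> 1 favors the sequential signal; z -> 0 favors cross-modal
        # fusion, so each node effectively chooses its own fusion order
        # along the early-to-late spectrum.
        return h + z * m_homo + (1 - z) * m_hetero

dim = 8
update = GatedNodeUpdate(dim)
h, m_homo, m_hetero = torch.randn(3, 5, dim).unbind(0)  # 5 nodes
print(update(h, m_homo, m_hetero).shape)  # torch.Size([5, 8])
```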