Romain Beaumont | CNRS - LIMSI
Papers by Romain Beaumont
arXiv (Cornell University), Dec 14, 2022
ABSTRACT. Semantic Web knowledge bases are generally represented as RDF triples forming a graph. They are queried through a language such as SPARQL, which non-expert users do not master and which requires knowledge of the base's schema. This is why natural-language query systems are currently being developed. The problem then arises of building queries automatically, which must account for the lexical distance between the words of the question and the relations of the knowledge base. In this article, we propose a new question analysis method that operates by graph transformation, is based on very general constraints on query structure, and resolves semantic ambiguities as late as possible, by querying the base. We also propose a relation identification method based on WordNet. We obtain...
Cornell University - arXiv, Oct 15, 2022
Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on the expensive accurate labels used in standard unimodal supervised vision learning. The resulting models showed strong text-guided image generation and transfer to downstream tasks, while performing remarkably well at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and Imagen have made further improvements. Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available to the broader research community. To address this problem and democratize research on large-scale multi-modal models, we present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English-language text. We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further experiments enabled by an openly available dataset of this scale. Additionally, we provide several nearest neighbor indices, an improved web interface for dataset exploration and subset generation, and detection scores for watermark, NSFW, and toxic content.
ArXiv, 2021
Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g. CLIP, DALL-E) have recently surged in popularity, showing a remarkable capability to perform zero- or few-shot learning and transfer even in the absence of per-sample labels on target image data. Despite this trend, to date there have been no publicly available datasets of sufficient scale for training such models from scratch. To address this issue, in a community effort we build and release to the public LAION-400M, a dataset of 400 million CLIP-filtered image-text pairs, their CLIP embeddings, and kNN indices that allow efficient similarity search.
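As a rough illustration of what "CLIP-filtered" pairs and kNN similarity search mean in the abstract above, here is a minimal pure-Python sketch. It assumes the image and text embeddings are already computed; the toy 2-D vectors, the function names and the 0.3 threshold are hypothetical choices made for the example, and the brute-force search stands in for a real kNN index. It is not the dataset's actual pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_pairs(pairs, threshold=0.3):
    """Keep (image_emb, text_emb, caption) triples whose embeddings agree.

    The threshold value is an assumption made for this sketch.
    """
    return [p for p in pairs if cosine(p[0], p[1]) >= threshold]

def knn(query, embeddings, k=1):
    """Brute-force k nearest neighbours by cosine similarity (indices, best first)."""
    ranked = sorted(range(len(embeddings)),
                    key=lambda i: cosine(query, embeddings[i]),
                    reverse=True)
    return ranked[:k]

# Toy data: made-up 2-D "embeddings", not real CLIP outputs.
pairs = [
    ([1.0, 0.0], [0.9, 0.1], "a photo of a cat"),
    ([0.0, 1.0], [1.0, 0.0], "unrelated caption"),
    ([0.6, 0.8], [0.5, 0.9], "a dog on a beach"),
]
kept = filter_pairs(pairs)
captions = [caption for _, _, caption in kept]
neighbours = knn([1.0, 0.1], [image_emb for image_emb, _, _ in kept], k=1)
print(captions, neighbours)
```

The second pair is dropped because its image and text vectors are orthogonal, and the query lands closest to the first remaining image.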
The Semantic Web
The correct identification of the link between an entity mention in a text and a known entity in a large knowledge base is important in information retrieval or information extraction. The general approach for this task is to generate, for a given mention, a set of candidate entities from the base and, in a second step, determine which is the best one. This paper proposes a novel method for the second step which is based on the joint learning of embeddings for the words in the text and the entities in the knowledge base. By learning these embeddings in the same space we arrive at a more conceptually grounded model that can be used for candidate selection based on the surrounding context. The relative improvement of this approach is experimentally validated on a recent benchmark corpus from the TAC-EDL 2015 evaluation campaign.
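The candidate selection step described above can be pictured with a small sketch. It assumes word and entity embeddings already live in one shared space, averages the context word vectors, and picks the candidate entity with the highest cosine similarity. The averaging, the cosine scoring, the toy vectors and all names below are assumptions for illustration only; the paper's actual scoring function is not given in this abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def select_candidate(context_words, candidates, word_emb, entity_emb):
    """Return the candidate entity closest to the averaged context vector.

    Returns None when no context word has an embedding.
    """
    known = [word_emb[w] for w in context_words if w in word_emb]
    if not known:
        return None
    context = mean_vector(known)
    return max(candidates, key=lambda e: cosine(context, entity_emb[e]))

# Hypothetical 2-D embeddings sharing one space (made up for the example).
word_emb = {"river": [0.9, 0.1], "water": [0.8, 0.2], "loan": [0.1, 0.9]}
entity_emb = {"Bank_(geography)": [1.0, 0.0], "Bank_(finance)": [0.0, 1.0]}

best = select_candidate(["river", "water", "the"], list(entity_emb),
                        word_emb, entity_emb)
print(best)
```

Because words and entities share one space, the context of the mention can be compared directly with each candidate, with no separate feature engineering step.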
Current recommendation approaches help online merchants predict, for each visiting user, which subset of their existing products is the most relevant. However, besides matching users with existing products, merchants are also interested in understanding their users' underlying preferences. This could help them produce or acquire better-matching products in the future. We argue that existing recommendation models cannot directly be used to predict the optimal combination of features that will make new products better serve the needs of the target audience. To tackle this, we turn to generative models, which allow us to explicitly learn distributions over product feature combinations in both text and visual space. We develop WARHOL, a product generation and recommendation architecture that takes as input past user shopping activity and generates relevant textual and visual descriptions of novel products. We show that WARHOL can approach the performance of st...
Lecture Notes in Computer Science, 2017
The correct identification of the link between an entity mention in a text and a known entity in a large knowledge base is important in information retrieval or information extraction. The general approach for this task is to generate, for a given mention, a set of candidate entities from the base and, in a second step, determine which is the best one. This paper proposes a novel method for the second step which is based on the joint learning of embeddings for the words in the text and the entities in the knowledge base. By learning these embeddings in the same space we arrive at a more conceptually grounded model that can be used for candidate selection based on the surrounding context. The relative improvement of this approach is experimentally validated on a recent benchmark corpus from the TAC-EDL 2015 evaluation campaign.
For our participation in QALD-5, we developed a system for answering questions on a knowledge base. We proposed an unsupervised method for the semantic analysis of questions that generates queries, based on graph transformations, in two steps. The first step is independent of the knowledge base schema and makes use of very general constraints on the query structure that allow us to maintain semantic ambiguities in different graphs. Ambiguities are then resolved globally at the final step, when querying the knowledge base.
Semantic Web knowledge bases are generally represented as RDF triples forming a graph. They are queried through a language such as SPARQL, which non-expert users do not master and which requires knowledge of the base's schema. This is why natural-language query systems are currently being developed. The problem then arises of building queries automatically, which must account for the lexical distance between the words of the question and the relations of the knowledge base. In this article, we propose a new question analysis method that operates by graph transformation, is based on very general constraints on query structure, and resolves semantic ambiguities as late as possible, by querying the base. We also propose a relation identification method based on WordNet. We obtain good results for relation identification and promising results for the overall system, evaluated on a QALD-3 task.