Manuel Faysse
PhD Candidate
Paris, France
Hey! I am Manu, a final-year PhD student working on LLM and information retrieval research, but curious about (way too) many other things!
I am nearing the end of my academic post-training phase as a PhD student at CentraleSupélec (with Pierre Colombo), and most recently worked under the distilled supervision of Hervé Jégou at Meta FAIR Paris. My research centers on practical applications of large language models, with a focus on Visual Document Retrieval (ColPali, ViDoRe) and LLM pretraining (CroissantLLM, Long Context Modeling at Meta), as well as multimodality, automatic evaluation, model memorization, confidence estimation, and contextualization techniques for neural information retrieval.
My work has been published in top international venues (ICLR, ICML, EMNLP, TMLR, COLM), has been featured in the press (MIT Tech Review, Nature Magazine, Usine Digitale, etc.), has led to many invited talks (Meta, Amazon, IBM, Naver, LlamaIndex, etc.), and has been listed as a top AI innovation of 2024 (State of AI, Tech Radar). Importantly to me, my work is widely used across the industry, in early-stage startups, established large tech companies, and government agencies alike.
My PhD is funded through the French CIFRE program in collaboration with Illuin Technology, where, before joining Meta, I held a Staff Research Scientist position and spent a share of my time advising and supporting various R&D efforts in the LLM and Vision LLM space. Don’t hesitate to reach out on X.
selected publications
2025

Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings
2025
A limitation of modern document retrieval embedding methods is that they typically encode passages (chunks) from the same documents independently, often overlooking crucial contextual information from the rest of the document that could greatly improve individual chunk representations. In this work, we introduce ConTEB (Context-aware Text Embedding Benchmark), a benchmark designed to evaluate retrieval models on their ability to leverage document-wide context. Our results show that state-of-the-art embedding models struggle in retrieval scenarios where context is required. To address this limitation, we propose InSeNT (In-sequence Negative Training), a novel contrastive post-training approach which, combined with late chunking pooling, enhances contextual representation learning while preserving computational efficiency. Our method significantly improves retrieval quality on ConTEB without sacrificing base model performance. We further find that chunks embedded with our method are more robust to suboptimal chunking strategies and larger retrieval corpus sizes.
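For readers curious what the late chunking pooling mentioned above looks like in practice, here is a minimal sketch: the whole document is encoded once so every token embedding sees document-wide context, and chunk vectors are pooled afterwards over each chunk's token span. NumPy arrays stand in for a real long-context encoder, and `late_chunk_pool` and the mean-pooling choice are illustrative, not the paper's exact implementation.

```python
import numpy as np

def late_chunk_pool(token_embs: np.ndarray, chunk_bounds: list) -> np.ndarray:
    """Mean-pool token embeddings within each (start, end) chunk span.

    Because the tokens were encoded over the full document first, each
    pooled chunk vector implicitly carries context from the whole document,
    unlike embedding each chunk in isolation.
    """
    return np.stack([token_embs[start:end].mean(axis=0)
                     for start, end in chunk_bounds])

# Toy example: 6 contextualized "token" embeddings (dim 2) for a document
# split into two chunks of 3 tokens each.
token_embs = np.arange(12, dtype=float).reshape(6, 2)
chunk_vecs = late_chunk_pool(token_embs, [(0, 3), (3, 6)])
```

The key design point is the order of operations: encode first, split second, so chunking no longer destroys cross-chunk context.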
2024

ColPali: Efficient Document Retrieval with Vision Language Models
Manuel Faysse, Hugues Sibille, Tony Wu, and 4 more authors
2024
Documents are visually rich structures that convey information through text, as well as tables, figures, page layouts, or fonts. While modern document retrieval systems exhibit strong performance on query-to-text matching, they struggle to exploit visual cues efficiently, hindering their performance on practical document retrieval applications such as Retrieval Augmented Generation. To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieving tasks spanning multiple domains, languages, and settings. The inherent shortcomings of modern systems motivate the introduction of a new retrieval model architecture, ColPali, which leverages the document understanding capabilities of recent Vision Language Models to produce high-quality contextualized embeddings solely from images of document pages. Combined with a late interaction matching mechanism, ColPali largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable.
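The late interaction matching used here is the ColBERT-style MaxSim scheme: each query token embedding is matched against every page-patch embedding, keeping only its best match, and the per-token maxima are summed into a page score. A rough sketch with random NumPy vectors standing in for real query-token and image-patch embeddings (the function name and toy dimensions are illustrative):

```python
import numpy as np

def late_interaction_score(query_embs: np.ndarray, page_embs: np.ndarray) -> float:
    """MaxSim late interaction: for each query token embedding, take the
    maximum similarity over all page patch embeddings, then sum over
    query tokens. Shapes: (n_query_tokens, d) and (n_patches, d)."""
    sims = query_embs @ page_embs.T          # (n_query_tokens, n_patches)
    return float(sims.max(axis=1).sum())     # best patch per token, summed

# Toy ranking example: score one query against 5 candidate pages.
rng = np.random.default_rng(0)
query = rng.normal(size=(2, 4))              # 2 query token vectors, dim 4
pages = [rng.normal(size=(3, 4)) for _ in range(5)]  # 3 patch vectors each
scores = [late_interaction_score(query, p) for p in pages]
best = int(np.argmax(scores))                # index of best-matching page
```

Because page embeddings can be precomputed and only this cheap matrix operation runs at query time, the scheme stays fast while preserving token-level matching.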
CroissantLLM: A Truly Bilingual French-English Language Model
Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, and 13 more authors
2024
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks covering various orthogonal aspects of model performance in the French language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models and strong translation models. We evaluate our model through the FMTI framework, and validate 81% of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in language models.
2023

Revisiting Instruction Fine-tuned Model Evaluation to Guide Industrial Applications
Manuel Faysse, Gautier Viaud, Céline Hudelot, and 1 more author
In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023
Instruction Fine-Tuning (IFT) is a powerful paradigm that strengthens the zero-shot capabilities of Large Language Models (LLMs), but in doing so induces new evaluation metric requirements. We show LLM-based metrics to be well adapted to these requirements, and leverage them to conduct an investigation of task-specialization strategies, quantifying the trade-offs that emerge in practical industrial settings. Our findings offer practitioners actionable insights for real-world IFT model deployment.