A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets. Python 2.1k 238
NoMIRACL: A multilingual hallucination evaluation dataset to evaluate LLM robustness in RAG against first-stage retrieval errors on 18 languages. Python 27 4
Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. Python 2k 502
State-of-the-Art Text Embeddings Python 18.5k 2.8k
Repository for Multilingual Generation, RAG evaluations, and surrogate judge training for Arena RAG leaderboard (NAACL'25) Python 10 2