Multilingual datasets
🍻 Multilingual BEIR datasets
Ever since the release of the BEIR benchmark, we have been expanding the set of publicly available datasets in different languages. We focus on monolingual evaluation, i.e. queries and documents are in the same language.
We convert the following existing datasets into the BEIR format and host them publicly on our platform:
- Mr. TyDi is a multilingual benchmark dataset built on TyDi-QA, covering eleven typologically diverse languages.
- mMARCO is a multilingual version of the MS MARCO passage ranking dataset across 14 languages.
- GermanQuAD is a German dataset constructed similarly to the English SQuAD dataset, using the German Wikipedia.
- ViHealthQA is a Vietnamese dataset built from questions that users asked on health websites, paired with answers from highly qualified experts.
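As a rough illustration of what the BEIR format looks like on disk, the sketch below reads a dataset folder using only the Python standard library. It assumes the usual BEIR layout: `corpus.jsonl`, `queries.jsonl`, and a `qrels/<split>.tsv` file with a header row. The function name `load_beir_folder` is hypothetical and not part of the BEIR package, which ships its own loader (`GenericDataLoader`).

```python
import csv
import json
from pathlib import Path


def load_beir_folder(data_dir, split="test"):
    """Read a BEIR-format folder into (corpus, queries, qrels) dictionaries.

    Hypothetical helper for illustration; assumes the standard BEIR layout.
    """
    data_dir = Path(data_dir)

    # corpus.jsonl: one JSON object per line with "_id", "title" and "text".
    corpus = {}
    with open(data_dir / "corpus.jsonl", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                doc = json.loads(line)
                corpus[doc["_id"]] = {
                    "title": doc.get("title", ""),
                    "text": doc["text"],
                }

    # queries.jsonl: one JSON object per line with "_id" and "text".
    queries = {}
    with open(data_dir / "queries.jsonl", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                query = json.loads(line)
                queries[query["_id"]] = query["text"]

    # qrels/<split>.tsv: a header row, then query-id, corpus-id, score.
    qrels = {}
    qrels_path = data_dir / "qrels" / f"{split}.tsv"
    with open(qrels_path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row
        for row in reader:
            if not row:
                continue
            query_id, corpus_id, score = row
            qrels.setdefault(query_id, {})[corpus_id] = int(score)

    # Keep only the queries that have relevance judgments in this split.
    queries = {qid: text for qid, text in queries.items() if qid in qrels}
    return corpus, queries, qrels
```

Calling `load_beir_folder("path/to/germanquad", split="test")` on an unzipped dataset would return three dictionaries keyed by document and query IDs; the path here is only a placeholder.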
🍻 Multilingual Datasets
| Language | Dataset | Website | BEIR-Name | Type | Queries | Corpus | Rel D/Q | Download | md5 |
|---|---|---|---|---|---|---|---|---|---|
| Vietnamese | ViHealthQA | Homepage | vihealthqa | train, dev, test | 2,013 | 10K | 2.5 | Link | 7685a360a837934624a8018d826c383f |
| German | GermanQuAD | Homepage | germanquad | test | 2,044 | 2.80M | 1.0 | Link | 95a581c3162d10915a418609bcce851b |
| Arabic | Mr. TyDi | Homepage | mrtydi/arabic | train, dev, test | 1,081 | 2.1M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Bengali | Mr. TyDi | Homepage | mrtydi/bengali | train, dev, test | 111 | 304K | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Finnish | Mr. TyDi | Homepage | mrtydi/finnish | train, dev, test | 1,254 | 1.9M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Indonesian | Mr. TyDi | Homepage | mrtydi/indonesian | train, dev, test | 829 | 1.47M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Japanese | Mr. TyDi | Homepage | mrtydi/japanese | train, dev, test | 720 | 7M | 1.3 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Korean | Mr. TyDi | Homepage | mrtydi/korean | train, dev, test | 421 | 1.5M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Russian | Mr. TyDi | Homepage | mrtydi/russian | train, dev, test | 995 | 9.6M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Swahili | Mr. TyDi | Homepage | mrtydi/swahili | train, dev, test | 670 | 136K | 1.1 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Telugu | Mr. TyDi | Homepage | mrtydi/telugu | train, dev, test | 646 | 548K | 1.0 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Thai | Mr. TyDi | Homepage | mrtydi/thai | train, dev, test | 1,190 | 568K | 1.1 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
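The "Rel D/Q" column is not defined on this page. Assuming it means the average number of relevant documents per query, a figure like it could be computed from a qrels mapping roughly as follows. The function name and the "score greater than zero counts as relevant" rule are assumptions made for this sketch.

```python
def avg_relevant_docs_per_query(qrels):
    """Average count of relevant documents per judged query.

    `qrels` maps query-id -> {corpus-id: score}. A document is treated as
    relevant when its score is greater than zero (an assumption).
    """
    if not qrels:
        return 0.0
    total_relevant = sum(
        sum(1 for score in docs.values() if score > 0)
        for docs in qrels.values()
    )
    return total_relevant / len(qrels)


# Toy example: two queries with 2 and 1 relevant documents respectively.
example_qrels = {
    "q1": {"d1": 1, "d2": 1, "d9": 0},
    "q2": {"d3": 1},
}
print(avg_relevant_docs_per_query(example_qrels))  # 1.5
```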
🍻 Translated (Multilingual) Datasets
| Language | Dataset | Website | BEIR-Name | Type | Queries | Corpus | Rel D/Q | Download | md5 |
|---|---|---|---|---|---|---|---|---|---|
| Spanish | mMARCO | Homepage | mmarco/spanish | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
| French | mMARCO | Homepage | mmarco/french | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
| Portuguese | mMARCO | Homepage | mmarco/portuguese | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
| Italian | mMARCO | Homepage | mmarco/italian | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
| Indonesian | mMARCO | Homepage | mmarco/indonesian | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
| German | mMARCO | Homepage | mmarco/german | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
| Russian | mMARCO | Homepage | mmarco/russian | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
| Chinese | mMARCO | Homepage | mmarco/chinese | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
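The md5 column in the tables above can be used to check that a download arrived intact. Here is a minimal sketch using only `hashlib` from the standard library. The helper names and the file name in the usage comment are hypothetical, and the sketch assumes the listed checksum refers to the downloaded archive file itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Compare a file's MD5 digest with the checksum listed in the table."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Usage (hypothetical file name; checksum taken from the GermanQuAD row):
# verify_download("germanquad.zip", "95a581c3162d10915a418609bcce851b")
```

MD5 is adequate here for detecting corrupted or truncated downloads, but it is not a safeguard against deliberate tampering.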