HuggingFaceFW/finewiki · Datasets at Hugging Face

FineWiki

This is an updated and better-extracted version of the wikimedia/Wikipedia dataset originally released in 2023. We carefully parsed Wikipedia HTML dumps from August 2025, covering 325 languages.

This dataset:

Visualize and Compare

You can explore the dataset, compare it to wikimedia/Wikipedia and preview the live Wikipedia pages on our space.

Available subsets

| Subset | Name | Size | Pages |
|---|---|---|---|
| en | English | 35.1 GB | 6,614,655 |
| de | German | 13.1 GB | 2,713,646 |
| fr | French | 12.1 GB | 2,566,183 |
| ru | Russian | 10.7 GB | 1,817,813 |
| ja | Japanese | 9.9 GB | 1,354,269 |
| es | Spanish | 8.5 GB | 1,948,965 |
| it | Italian | 7.4 GB | 1,799,759 |
| uk | Ukrainian | 5.4 GB | 1,239,253 |
| zh | Chinese (written vernacular Chinese) | 5.1 GB | 1,295,955 |
| pl | Polish | 4.4 GB | 1,543,918 |
| ceb | Cebuano | 4.4 GB | 5,647,436 |
| pt | Portuguese | 4.3 GB | 1,135,383 |
| nl | Dutch | 3.5 GB | 2,072,865 |
| ca | Catalan | 3.5 GB | 962,290 |
| ar | Arabic | 3.4 GB | 1,230,456 |
| sv | Swedish | 2.9 GB | 2,470,063 |
| cs | Czech | 2.2 GB | 534,563 |
| fa | Persian | 2.2 GB | 1,021,336 |
| vi | Vietnamese | 2.1 GB | 1,279,087 |
| hu | Hungarian | 2.1 GB | 515,004 |
| ko | Korean | 2.0 GB | 582,035 |
| he | Hebrew | 2.0 GB | 372,053 |
| sr | Serbian | 2.0 GB | 664,345 |
| id | Indonesian | 1.8 GB | 723,099 |
| tr | Turkish | 1.6 GB | 629,762 |
| fi | Finnish | 1.5 GB | 572,900 |
| no | Norwegian (Bokmål) | 1.3 GB | 620,802 |
| el | Greek | 1.2 GB | 242,517 |
| hy | Armenian | 1.2 GB | 309,820 |
| ro | Romanian | 1.2 GB | 493,462 |
| ... | | | |
| Total | | 184.7 GB | 61,550,610 |

A detailed list is available here.

How to download and use 🌐 FineWiki

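See the table above for the subset code of the language you want to download. The examples below address a subset as its code followed by wiki (for example ptwiki, enwiki, eswiki).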

We currently do not provide smaller sample versions, but by setting limit or using streaming=True you can easily fetch a sample of the data. If there is interest from the community we might upload smaller sampled versions later on.

Using 🏭 datatrove

from datatrove.pipeline.readers import ParquetReader

# limit determines how many documents will be streamed (remove for all)
# this will fetch the Portuguese data
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000) 
for document in data_reader():
    # do something with document
    print(document)

###############################    
# OR for a processing pipeline:
###############################

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

pipeline_exec = LocalPipelineExecutor(
    pipeline=[
        ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000),
        LambdaFilter(lambda doc: "hugging" in doc.text),
        JsonlWriter("some-output-path")
    ],
    tasks=10
)
pipeline_exec.run()

Using huggingface_hub

from huggingface_hub import snapshot_download

# download only the English subset
folder = snapshot_download(
    "HuggingFaceFW/finewiki",
    repo_type="dataset",
    local_dir="./finewiki/",
    allow_patterns=["data/enwiki/*"],
)

Using datasets

from datasets import load_dataset
# get Spanish data
fw = load_dataset("HuggingFaceFW/finewiki", name="eswiki", split="train", streaming=True)
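
To fetch a small sample without downloading a whole subset, the streaming dataset can be cut off after a fixed number of pages. The following is a minimal sketch under stated assumptions: `open_subset` and `sample_pages` are hypothetical helper names, not part of any library, and `sample_pages` works on any iterable of records.

```python
from itertools import islice


def open_subset(wikiname):
    """Open one FineWiki subset (e.g. "eswiki") as a streaming dataset."""
    # Imported lazily so the sampling helper below also works without `datasets`.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/finewiki", name=wikiname, split="train", streaming=True
    )


def sample_pages(stream, n):
    """Return the first n records of an iterable as a list (hypothetical helper)."""
    return list(islice(stream, n))
```

With network access and the `datasets` package installed, `sample_pages(open_subset("eswiki"), 5)` should return the first five Spanish pages as dictionaries.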

Dataset Structure

Data Instances

Example from the English subset (values truncated for readability):

{
  "text": "# 10th Tank Corps\nThe 10th Tank Corps was a tank corps of the Red Army, formed twice.\n\n## First Formation\nIn May–June 1938, ...",
  "id": "enwiki/32552979",
  "wikiname": "enwiki",
  "page_id": 32552979,
  "title": "10th Tank Corps",
  "url": "https://en.wikipedia.org/wiki/10th_Tank_Corps",
  "date_modified": "2023-07-26T12:32:03Z",
  "in_language": "en",
  "wikidata_id": "Q12061605",
  "bytes_html": 115017,
  "wikitext": "{{short description|Tank corps of the Soviet military}}\n\n{{Infobox military unit...",
  "version": 1167219203,
  "infoboxes": "[{\"title\": \"10th Tank Corps\", \"data\": {\"Active\": \"...\"}}]",
  "has_math": false
}
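
In the example above, `infoboxes` is stored as a JSON-encoded string and `id` combines the wiki name with the page ID. The sketch below decodes both. It assumes every record follows the shape of this single example (a JSON list of objects with `title` and `data` keys); `decode_infoboxes` and `split_id` are hypothetical helper names.

```python
import json


def decode_infoboxes(record):
    """Parse the JSON-encoded `infoboxes` string of a record into Python objects."""
    # Assumption: a missing or empty value means "no infoboxes".
    raw = record.get("infoboxes") or "[]"
    return json.loads(raw)


def split_id(record_id):
    """Split an id such as "enwiki/32552979" into (wikiname, page_id)."""
    wikiname, _, page_id = record_id.partition("/")
    return wikiname, int(page_id)


# Values copied from the truncated example instance above.
record = {
    "id": "enwiki/32552979",
    "title": "10th Tank Corps",
    "infoboxes": '[{"title": "10th Tank Corps", "data": {"Active": "..."}}]',
}

infoboxes = decode_infoboxes(record)
print(infoboxes[0]["title"])   # 10th Tank Corps
print(split_id(record["id"]))  # ('enwiki', 32552979)
```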

Data Fields

Data Processing

The full pipeline processing code is available here. It runs on datatrove. While we tried to offer robust support for most language variants of Wikipedia, the lack of standardization at the HTML level means that for some subsets the extraction might be sub-optimal. If this is the case for the languages you are interested in, we recommend adapting our code to address your specific concerns.

Downloading

We used the Wikimedia Enterprise HTML dump API (https://api.enterprise.wikimedia.com/v2/snapshots) and downloaded main-namespace (NS0) snapshots for the different language versions of Wikipedia. We intentionally relied on pre-rendered HTML over the more commonly used wikitext/markdown dumps: wikitext often encodes templates and formatting as parser functions/macros, which makes large sections of wiki pages harder to reconstruct faithfully, whereas the Enterprise HTML already expands those structures. Snapshots from August 2025 were used. We record rich per-page attributes (IDs, titles, URLs, language, version, timestamps, Wikidata IDs) as part of the metadata.

Extraction

We heavily adapted mwparserfromhtml to parse the HTML content into a clean, structured text representation that preserves meaningful formatting. Redirect and disambiguation pages are removed (via redirect markers in wikitext/HTML and disambiguation signals, including Wikidata IDs and page-props). Reference-like sections, which contain little natural article prose (e.g., “References”, “Notes”, “External links”, localized per language), are excluded using a curated heading list and structural cues (reference list containers), so citations and notes are dropped without harming the main body. Visual and navigation boilerplate (ToC, navboxes, messageboxes, authority control, categories) is filtered out, while infoboxes are extracted into the metadata as structured key-value data that can be useful for knowledge search applications. We additionally strive to keep math content (marking pages that contain it with a has_math flag) as well as tables, where much of Wikipedia's knowledge is contained.
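
As an illustration of heading-based section exclusion, here is a minimal sketch. It is not the pipeline's implementation, which operates on HTML and also uses structural cues. It works on heading-delimited text in the style of the `text` field (`#`, `##`, ...), and `EXCLUDED_HEADINGS` is a hypothetical, English-only stand-in for the curated, localized heading list.

```python
import re

# Hypothetical, English-only stand-in for the curated, localized heading list.
EXCLUDED_HEADINGS = {"references", "notes", "external links"}

HEADING_RE = re.compile(r"^(#+)\s+(.*)$")


def drop_sections(text, excluded=EXCLUDED_HEADINGS):
    """Remove sections whose heading is in `excluded`, including their subsections."""
    kept = []
    skip_level = None  # heading level of the section currently being dropped
    for line in text.split("\n"):
        match = HEADING_RE.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2).strip().lower()
            if skip_level is not None and level > skip_level:
                continue  # subsection of a dropped section
            # A heading at the same or a higher level ends the dropped section.
            skip_level = level if title in excluded else None
            if skip_level is not None:
                continue  # the excluded heading itself
        elif skip_level is not None:
            continue  # body line of a dropped section
        kept.append(line)
    return "\n".join(kept)
```

Tracking the heading level means a dropped "References" section also takes its nested subsections with it, while the next sibling section is kept.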

Filtering

One common issue with low-resource language Wikipedias is the high prevalence of content from other languages, particularly English (often from articles or boilerplate pages copied over from the English Wikipedia). To ensure language quality and consistency, we apply language- and script-aware checks tailored to each wiki. Pages are kept only if their predicted writing system matches the expected scripts for that language. For non-English wikis, pages that are predominantly English above a confidence threshold are removed to reduce cross-language leakage. We also drop ultra-short pages without infoboxes to avoid low-signal content.
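
To make the idea of a script check concrete, the sketch below estimates which share of a page's letters belongs to the expected writing system, using only Unicode character names. This is an assumption-laden illustration, not the pipeline's method: the function names, the name-prefix heuristic, and the `min_share=0.5` threshold are all hypothetical, and the separate English-detection step is not covered.

```python
import unicodedata


def script_share(text, expected_scripts):
    """Fraction of alphabetic characters whose Unicode name starts with an expected script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1
        for ch in letters
        # e.g. "CYRILLIC CAPITAL LETTER PE" -> "CYRILLIC"
        if unicodedata.name(ch, "").split(" ")[0] in expected_scripts
    )
    return hits / len(letters)


def keep_page(text, expected_scripts, min_share=0.5):
    """Keep a page only if enough of its letters are in the expected script(s)."""
    # min_share is an arbitrary illustrative threshold, not a value from the pipeline.
    return script_share(text, expected_scripts) >= min_share
```

For example, `keep_page("Hello world", {"CYRILLIC"})` rejects a Latin-script page on a wiki whose expected script is Cyrillic.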

Licensing Information

This dataset contains text from Wikipedia, licensed under Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) and also available under GFDL. See Wikipedia’s licensing and Terms of Use: https://dumps.wikimedia.org/legal.html

Our processed release is an adaptation of that text and is licensed under CC BY-SA 4.0.

Citation Information

@dataset{penedo2025finewiki,
  author    = {Guilherme Penedo},
  title     = {FineWiki},
  year      = {2025},
  publisher = {Hugging Face Datasets},
  url       = {https://huggingface.co/datasets/HuggingFaceFW/finewiki},
  urldate   = {2025-10-20},
  note      = {Source: Wikimedia Enterprise Snapshot API (https://api.enterprise.wikimedia.com/v2/snapshots). Text licensed under CC BY-SA 4.0 with attribution to Wikipedia contributors.}
}
