HuggingFaceFW/finewiki · Datasets at Hugging Face (original) (raw)

This is an updated and better extracted version of the wikimedia/Wikipedia dataset originally released in 2023. We carefully parsed Wikipedia HTML dumps from August of 2025 covering 325 languages.

This dataset:

fully renders templates as it was extracted from HTML and not markdown dumps
removes redirects, disambiguation pages, and other pages that are not main articles
includes detailed metadata such as page ID, title, last modified date, Wikidata ID, version and markdown version of the text
preserves elements and formatting such as headings, lists, code/pre blocks, tables and math content (notably, wikimedia/Wikipedia removes all tables and math content)
excludes most of the "References", "See also", "Notes", "External links", and similar citations/notes sections across all languages
besides keeping all math content, pages containing math are flagged with a has_math metadata attribute
extracts infoboxes (the summary high-level information boxes on the right of some wikipedia pages) in a structured format into the metadata, for RAG and other uses
only keeps pages whose script (writing alphabet) matches the expected list for that language
for non-English wikis, any page fully or mostly in English is removed (common issue for Language Identifiers/classifiers training)

Visualize and Compare

You can explore the dataset, compare it to wikimedia/Wikipedia and preview the live Wikipedia pages on our space.

Available subsets

Subset	Name	Size	Pages
en	English	35.1 GB	6,614,655
de	German	13.1 GB	2,713,646
fr	French	12.1 GB	2,566,183
ru	Russian	10.7 GB	1,817,813
ja	Japanese	9.9 GB	1,354,269
es	Spanish	8.5 GB	1,948,965
it	Italian	7.4 GB	1,799,759
uk	Ukrainian	5.4 GB	1,239,253
zh	Chinese (written vernacular Chinese)	5.1 GB	1,295,955
pl	Polish	4.4 GB	1,543,918
ceb	Cebuano	4.4 GB	5,647,436
pt	Portuguese	4.3 GB	1,135,383
nl	Dutch	3.5 GB	2,072,865
ca	Catalan	3.5 GB	962,290
ar	Arabic	3.4 GB	1,230,456
sv	Swedish	2.9 GB	2,470,063
cs	Czech	2.2 GB	534,563
fa	Persian	2.2 GB	1,021,336
vi	Vietnamese	2.1 GB	1,279,087
hu	Hungarian	2.1 GB	515,004
ko	Korean	2.0 GB	582,035
he	Hebrew	2.0 GB	372,053
sr	Serbian	2.0 GB	664,345
id	Indonesian	1.8 GB	723,099
tr	Turkish	1.6 GB	629,762
fi	Finnish	1.5 GB	572,900
no	Norwegian (Bokmål)	1.3 GB	620,802
el	Greek	1.2 GB	242,517
hy	Armenian	1.2 GB	309,820
ro	Romanian	1.2 GB	493,462
...
Total	184.7 GB	61,550,610

A detailed list is available here.

How to download and use 🌐 FineWiki

See the table above for the subset of the language you want to download.

We currently do not provide smaller sample versions, but by setting limit or using streaming=True you can easily fetch a sample of the data. If there is interest from the community we might upload smaller sampled versions later on.

Using 🏭 datatrove

from datatrove.pipeline.readers import ParquetReader

# limit determines how many documents will be streamed (remove for all)
# this will fetch the Portuguese data
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000) 
for document in data_reader():
    # do something with document
    print(document)

###############################    
# OR for a processing pipeline:
###############################

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

pipeline_exec = LocalPipelineExecutor(
    pipeline=[
        ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000),
        LambdaFilter(lambda doc: "hugging" in doc.text),
        JsonlWriter("some-output-path")
    ],
    tasks=10
)
pipeline_exec.run()

Using `huggingface_hub`

from huggingface_hub import snapshot_download

folder = snapshot_download(
    "HuggingFaceFW/finewiki",
    repo_type="dataset",
    local_dir="./finewiki/",
    # download the English subset
    allow_patterns=["data/enwiki/*"],
)

Using `datasets`

from datasets import load_dataset
# get Spanish data
fw = load_dataset("HuggingFaceFW/finewiki", name="eswiki", split="train", streaming=True)
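
Because no smaller sample subsets are published, one way to get a sample is to cut a streamed iterator off after a fixed number of records. The following is a minimal sketch, not part of the dataset tooling: `take_sample` and `stream_spanish` are hypothetical helper names, and the stand-in records only imitate the shape of real ones so that the example runs offline.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def take_sample(records: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return at most the first n records from any (possibly streaming) iterable."""
    return list(islice(records, n))


def stream_spanish() -> Iterator[Dict[str, Any]]:
    """Lazily stream the Spanish subset (needs the `datasets` package and network access)."""
    # Imported here so that take_sample stays usable without `datasets` installed.
    from datasets import load_dataset

    return iter(
        load_dataset("HuggingFaceFW/finewiki", name="eswiki", split="train", streaming=True)
    )


# Offline demonstration with stand-in records (hypothetical values).
fake_stream = ({"id": f"eswiki/{i}", "has_math": i % 2 == 0} for i in range(1000))
sample = take_sample(fake_stream, 3)
print([record["id"] for record in sample])

# With network access, the same helper applies to the real stream:
# sample = take_sample(stream_spanish(), 100)
```

Only the requested number of records is pulled from the iterator, so the rest of the subset is never downloaded.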

Dataset Structure

Data Instances

Example from the English subset (values truncated for readability):

{
  "text": "# 10th Tank Corps\nThe 10th Tank Corps was a tank corps of the Red Army, formed twice.\n\n## First Formation\nIn May–June 1938, ...",
  "id": "enwiki/32552979",
  "wikiname": "enwiki",
  "page_id": 32552979,
  "title": "10th Tank Corps",
  "url": "https://en.wikipedia.org/wiki/10th_Tank_Corps",
  "date_modified": "2023-07-26T12:32:03Z",
  "in_language": "en",
  "wikidata_id": "Q12061605",
  "bytes_html": 115017,
  "wikitext": "{{short description|Tank corps of the Soviet military}}\n\n{{Infobox military unit...",
  "version": 1167219203,
  "infoboxes": "[{\"title\": \"10th Tank Corps\", \"data\": {\"Active\": \"...\"}}]",
  "has_math": false
}

Data Fields

text (string): cleaned, structured article text preserving headings, lists, code/pre blocks, tables and math. Has some markdown formatting (headings, tables, lists)
id (string): dataset‑unique identifier; typically <wikiname>/<page_id>
wikiname (string): wiki project name, e.g., enwiki, ptwiki
page_id (int): MediaWiki page identifier
title (string): article title
url (string): canonical article URL
date_modified (string): ISO‑8601 timestamp of the last page revision
in_language (string): article language code (e.g., en, pt)
wikidata_id (string|null): Wikidata QID associated with the page
bytes_html (int): size in bytes of the original HTML body
wikitext (string): original wikitext when available
version (int|string): revision/version identifier of the page
infoboxes (string): JSON‑encoded array of extracted infobox objects with title and key‑value data
has_math (bool): whether math content was detected on the page
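
Several fields are stored as strings and need decoding before use. The sketch below shows one way to do that, using values copied from the English example above; the variable names are illustrative, and the `"..."` placeholder is kept exactly as it appears in the truncated example.

```python
import json
from datetime import datetime, timezone

# Values taken from the truncated English example record.
record = {
    "id": "enwiki/32552979",
    "date_modified": "2023-07-26T12:32:03Z",
    "infoboxes": '[{"title": "10th Tank Corps", "data": {"Active": "..."}}]',
    "has_math": False,
}

# `infoboxes` is a JSON-encoded array, so decode it into Python objects first.
infoboxes = json.loads(record["infoboxes"])
for box in infoboxes:
    print(box["title"], list(box["data"].keys()))

# `date_modified` is an ISO-8601 UTC timestamp with a trailing "Z".
# strptime is used because datetime.fromisoformat only accepts "Z" from Python 3.11 on.
modified = datetime.strptime(record["date_modified"], "%Y-%m-%dT%H:%M:%SZ").replace(
    tzinfo=timezone.utc
)

# `id` typically has the form <wikiname>/<page_id>.
wikiname, page_id_text = record["id"].split("/", 1)
page_id = int(page_id_text)
print(wikiname, page_id, modified.isoformat())
```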

Data Processing

The full pipeline processing code is available here. It runs on datatrove. While we tried to offer robust support for most language variants of Wikipedia, the lack of standardization at the HTML level means that for some subsets the extraction might be sub-optimal. If this is the case for the languages you are interested in, we recommend adapting our code to address your specific concerns.

Downloading

We used the Wikimedia Enterprise HTML dump API (https://api.enterprise.wikimedia.com/v2/snapshots) and downloaded main-namespace (NS0) snapshots for the different language versions of Wikipedia. We intentionally relied on pre-rendered HTML over the more commonly used wikitext/markdown dumps: wikitext often encodes templates and formatting as parser functions/macros, which makes large sections of wiki pages harder to reconstruct faithfully, whereas the Enterprise HTML already expands those structures. Snapshots from August of 2025 were used. We record rich per-page attributes (IDs, titles, URLs, language, version, timestamps, Wikidata IDs) as part of the metadata.

Extraction

We heavily adapted mwparserfromhtml to parse the HTML content into a clean, structured text representation that preserves meaningful formatting. Redirect and disambiguation pages are removed reliably (via redirect markers in wikitext/HTML and disambiguation signals, including Wikidata IDs and page-props). Reference-like sections filled with unnatural, non-article content (e.g., “References”, “Notes”, “External links”, localized per language) are excluded using a curated heading list and structural cues (reference list containers), so citations/notes are dropped without harming the main body. Visual/navigation boilerplate (ToC, navboxes, messageboxes, authority control, categories) is filtered out, while infoboxes are carefully extracted into the metadata as structured key-value data that can be useful for knowledge search applications. We additionally strive to keep math content (and mark pages containing it with a has_math flag) as well as tables, where much of the Wikipedia knowledge is contained.
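
The heading-list idea can be illustrated on already-extracted, markdown-style text. This is a simplified analogue and not the pipeline's code: the real extraction works on HTML with a curated, localized heading list and structural cues, whereas the sketch below assumes a small English-only list (`EXCLUDED_HEADINGS`) and a hypothetical helper `drop_reference_sections`.

```python
import re
from typing import List, Optional

# Illustrative English-only list (an assumption); the pipeline uses a curated,
# per-language list.
EXCLUDED_HEADINGS = {"references", "notes", "see also", "external links"}

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def drop_reference_sections(text: str) -> str:
    """Remove sections whose heading is excluded, together with their subsections."""
    kept: List[str] = []
    skip_level: Optional[int] = None  # heading depth of the section being skipped
    for line in text.split("\n"):
        match = HEADING_RE.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2).strip().lower()
            if skip_level is not None and level > skip_level:
                continue  # deeper heading inside a skipped section
            # A heading at the same or a shallower depth ends the skipped section.
            skip_level = level if title in EXCLUDED_HEADINGS else None
            if skip_level is not None:
                continue  # drop the excluded heading itself
        elif skip_level is not None:
            continue  # body line of a skipped section
        kept.append(line)
    return "\n".join(kept)


example = (
    "# Title\nIntro.\n\n## History\nBody.\n\n"
    "## References\n1. Foo\n\n### Sub\nbar\n\n## Legacy\nMore."
)
cleaned = drop_reference_sections(example)
print(cleaned)
```

One limitation of this sketch is that a line starting with `#` inside a code block would be mistaken for a heading, which is one reason to filter at the HTML level as the pipeline does.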

Filtering

One common issue with low-resource language Wikipedias is the large prevalence of content from other languages, particularly English (often from articles or boilerplate pages copied over from the English Wikipedia). To ensure language quality and consistency, we apply language- and script-aware checks tailored to each wiki. Pages are kept only if their predicted writing system matches the expected scripts for that language. For non-English wikis, pages that are predominantly English above a confidence threshold are removed to reduce cross-language leakage. We also drop ultra-short pages without infoboxes to avoid low-signal content.
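
A script check of this kind can be approximated with the standard library. The sketch below is an assumption-laden illustration, not the method used in the pipeline: it guesses the script from the first word of each letter's Unicode character name, and `dominant_script` and `keep_page` are hypothetical helpers. The English-content check is not shown because it needs a language identification model.

```python
import unicodedata
from collections import Counter
from typing import Set


def dominant_script(text: str) -> str:
    """Guess the dominant script from Unicode character names (rough heuristic)."""
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "CYRILLIC CAPITAL LETTER PE" -> "CYRILLIC"
                counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def keep_page(text: str, expected_scripts: Set[str]) -> bool:
    """Keep a page only if its dominant script is expected for the wiki's language."""
    return dominant_script(text) in expected_scripts


# Hypothetical expectation for a Russian wiki: Cyrillic only.
print(keep_page("Привет, мир", {"CYRILLIC"}))
print(keep_page("Hello, world", {"CYRILLIC"}))
```

Languages written in several scripts would need more than one entry in `expected_scripts`, which is why the check takes a set rather than a single value.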

Licensing Information

This dataset contains text from Wikipedia, licensed under Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) and also available under GFDL. See Wikipedia’s licensing and Terms of Use: https://dumps.wikimedia.org/legal.html

Our processed release is an adaptation of that text and is licensed under CC BY-SA 4.0.

Citation Information

@dataset{penedo2025finewiki,
  author    = {Guilherme Penedo},
  title     = {FineWiki},
  year      = {2025},
  publisher = {Hugging Face Datasets},
  url       = {https://huggingface.co/datasets/HuggingFaceFW/finewiki},
  urldate   = {2025-10-20},
  note      = {Source: Wikimedia Enterprise Snapshot API (https://api.enterprise.wikimedia.com/v2/snapshots). Text licensed under CC BY-SA 4.0 with attribution to Wikipedia contributors.}
}
