Research:Data - Meta

From Meta, a Wikimedia project coordination wiki

There is a great deal of publicly-available, open-licensed data about Wikimedia projects. This page is intended to help community members, developers, and researchers who are interested in analyzing raw data learn what data and infrastructure is available.

If you have any questions, you might find the answer in the Frequently Asked Questions about Data. If you still have questions, you can email your question to the Analytics mailing list (more information).

If you wish to browse pre-computed metrics and dashboards, see statistics.

If this publicly available data isn't sufficient, you can look at the page on private data access to see what non-public data exists and how you can gain access.

See also inspirational example uses.

Also consider searching for datasets at Zenodo, Figshare, Dimensions.ai, Google Dataset Search, Academic Torrents, DataHub (historical), or Hugging Face (see also a curated "Wikimedia Datasets" list on Hugging Face).

Data Dumps (details)

Homepage, Download

Dumps of all WMF projects for backup, offline use, research, etc.

APIs (details)

Recent changes stream (details)

Homepage

Wikimedia broadcasts every change to every Wikimedia wiki using Server Sent Events over HTTP.

Analytics Dumps (details)

WikiStats (details)

Homepage

Reports based on data dumps and server log files.

DBpedia (details)

DBpedia extracts structured data from Wikipedia. It allows users to run complex queries and link Wikipedia data to other data sets.

DataHub and Figshare (details)

Differential privacy (details)

Differential privacy homepage

A collection of differentially-private datasets, released daily, weekly, or monthly.

The table below is a quick reference of data sources organized by data domain. For a more detailed overview of Wikimedia data domains and how to access data in each domain, use the links in the table or see Research:Data introduction.

Data domain | Data source | Access method
Content | MediaWiki REST API | API
Content | MediaWiki Action API:Parse (HTML) | API
Content | MediaWiki Action API:Revisions (wikitext) | API
Content | Wikidata:REST_API | API
Content | Wikimedia Enterprise APIs (require separate accounts, free access may have limits) | API
Content – structured data | Wikidata:REST_API | API
Content – structured data | Wikidata SPARQL query service | API
Content – structured data | Commons SPARQL query service | API
Content – structured data | DBpedia SPARQL endpoint | API
Contributions / edits | MediaWiki Action API: Revisions | API
Contributions / edits | MediaWiki Action API: Allrevisions | API
Contributions / edits | Wikimedia Analytics API: Edits data | API
Contributions / edits | MediaWiki Event Streams | API
Contributions / edits | Wikimedia Enterprise APIs (require separate accounts, free access may have limits) | API
Contributors / editors | Wikimedia Analytics API: Editors by country | API
Contributors / editors | MediaWiki Action API: Users | API
Contributors / editors | MediaWiki Action API: Usercontribs | API
Traffic | Wikimedia Analytics API: Pageviews | API
Traffic | Wikimedia Analytics API: Unique devices | API
Traffic | Wikimedia Analytics API: Mediarequests | API
Contributions / edits | Wikistats | Dashboard
Contributions / edits | XTools | Dashboard
Contributions / edits | Bitergia: technical community metrics | Dashboard
Contributors / editors | Wikistats | Dashboard
Contributors / editors | XTools | Dashboard
Contributors / editors | Bitergia: technical community metrics | Dashboard
Traffic | Devices | Dashboard
Traffic | Wikistats | Dashboard
Traffic | Readers:Pageviews and Unique Devices | Dashboard
Traffic | Pageviews Tool | Dashboard
Traffic | WikiNav | Dashboard
Content | Wikitext | Download
Content | Static HTML and Enterprise HTML (use mwparserfromhtml) | Download
Content | Knowledge gaps | Download
Content – structured data | Commons image depicts | Download
Content – structured data | Wikidata dumps (JSON, RDF, XML) | Download
Content – structured data | DBpedia.org | Download
Contributions / edits | Mediawiki_history | Download
Contributions / edits | geoeditors | Download
Contributions / edits | Differential privacy: Geoeditors | Download
Traffic | Clickstream | Download
Traffic | Pageview hourly | Download
Traffic | Unique devices | Download
Traffic | Mediacounts | Download
Traffic | Differential privacy pageviews | Download
Content | Text | MediaWiki database tables
Contributions / edits | Revision_table | MediaWiki database tables
Contributors / editors | Mediawiki_history | MediaWiki database tables
Contributors / editors | geoeditors | MediaWiki database tables
Contributors / editors | Differential privacy: Geoeditors | MediaWiki database tables
Contributors / editors | actor | MediaWiki database tables
Contributors / editors | user | MediaWiki database tables
Contributors / editors | user_groups | MediaWiki database tables
Contributors / editors | user_former_groups | MediaWiki database tables
Contributors / editors | user_properties | MediaWiki database tables
Contributors / editors | globaluser | MediaWiki database tables
Contributors / editors | user_groups | MediaWiki database tables
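
As an illustration of the "Wikimedia Analytics API: Pageviews" row in the table, the following is a minimal sketch that builds a request URL for the per-article pageviews route. The URL layout reflects the publicly documented route as best understood here; the helper name `pageviews_url` and the example article are hypothetical, and no request is actually sent.

```python
from urllib.parse import quote

# Base of the per-article pageviews route (assumed from the public API docs).
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageviews_url(project, article, start, end,
                  access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews URL; start and end are YYYYMMDD strings."""
    # Titles use underscores instead of spaces and must be percent-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([BASE, project, access, agent, title, granularity, start, end])

if __name__ == "__main__":
    print(pageviews_url("en.wikipedia.org", "Albert Einstein", "20240101", "20240131"))
```

When fetching such a URL, send a descriptive User-Agent header that identifies your tool and a way to contact you.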

WMF releases data dumps of Wikipedia, Wikidata, and all WMF projects on a regular basis, as well as dumps of other Wikimedia-related data such as search indices and short URL mappings.

See a more comprehensive list of what is available for download.

Dumps.wikimedia.org also offers various other database dumps and datasets.

Dumps from the last year are available for download (dumps.wikimedia.org/enwiki/ for English Wikipedia, dumps.wikimedia.org/dewiki/ for German Wikipedia, etc.). Download mirrors offer an alternative to the download page.

Due to large file sizes, using a download tool is recommended.

There are also archives. Many older dumps can also be found at the Internet Archive.

XML dumps are in the wrapper format described at Export format (schema). Files are compressed in gzip (.gz), bzip2/lbzip2 (.bz2) and .7z formats.
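
Because the XML dumps are large, they are usually processed as a stream rather than loaded into memory. The following is a minimal sketch of that approach using only the Python standard library. The tiny `SAMPLE` document is made up for the demonstration, the helper names are hypothetical, and the namespace version in a real dump may differ, which is why the code strips namespaces instead of hard-coding one.

```python
import bz2
import io
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(fileobj):
    """Yield (title, text) for each <page> while keeping memory use flat."""
    for _, elem in ET.iterparse(fileobj, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, None
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                # With several revisions per page, the last one wins.
                text = node.text or ""
        yield title, text
        elem.clear()  # release the subtree that was just processed

# Made-up stand-in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page>
    <title>Example</title>
    <revision><id>1</id><text>Hello [[world]]</text></revision>
  </page>
</mediawiki>"""

def demo():
    """Compress the sample in memory and read it back as a .bz2 stream."""
    compressed = bz2.compress(SAMPLE)
    with bz2.open(io.BytesIO(compressed), "rb") as fh:
        return list(iter_pages(fh))

if __name__ == "__main__":
    print(demo())
```

For a downloaded dump, pass the local path of the .bz2 file to `bz2.open` instead of the in-memory buffer.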

SQL dumps are provided as dumps of entire tables, using mysqldump.

Some older dumps exist in various formats.

How to and examples

See examples of importing dumps in a MySQL database with step-by-step instructions.

Some tools are listed on the following pages, but these tools are mostly outdated and non-functional:

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages.

The MediaWiki API provides direct, high-level access to the data contained in MediaWiki databases. Client programs can log in to a wiki, get data, and post changes automatically by making HTTP requests.

To query the database, send an HTTP GET request to the desired endpoint (for example, https://en.wikipedia.org/w/api.php for English Wikipedia), setting the action parameter to query and defining the query details in the URL.
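
The request described above can be sketched as follows. This example only constructs the URL; it does not send it. The choice of `prop=revisions` and the `rvprop` fields is one illustrative query among many, and the helper name `revisions_query_url` is hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "https://en.wikipedia.org/w/api.php"

def revisions_query_url(title, limit=5):
    """Build an action=query URL listing recent revisions of one page."""
    params = {
        "action": "query",          # the query module, as described above
        "prop": "revisions",        # ask for revision metadata
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    return ENDPOINT + "?" + urlencode(params)

if __name__ == "__main__":
    print(revisions_query_url("Main Page"))
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the API Sandbox builds equivalent requests interactively.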

How to and examples

To try out the API interactively on English Wikipedia, use the API Sandbox.

To use the API, your application or client might need to log in.

Before you start, learn about the API etiquette.

Researchers may be granted special access rights on a case-by-case basis.

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL).

The Wiki Replicas (part of WMCS wikitech:Portal:Data Services) host sanitized versions of Wikimedia production MediaWiki databases.

Users of various Wikimedia Cloud Services products can access the Wiki Replicas databases, which host sanitized copies of the databases of all Wikimedia projects, including Commons.

Explore the database schema of the MediaWiki software.

See the Wiki Replicas page on Wikitech on how to access the Wiki Replicas.
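
To give a feel for the kind of SQL one might run against the replicas, here is a minimal sketch that counts edits per contributor since a given timestamp. It runs against a tiny in-memory SQLite stand-in, not the real replicas: the table and column names (`revision`, `actor`, `rev_actor`, `rev_timestamp`, `actor_name`) follow the MediaWiki schema, but the rows are made up, and connecting to the actual service requires the setup described on Wikitech.

```python
import sqlite3

# Edits per contributor since a cutoff. MediaWiki stores timestamps as
# YYYYMMDDHHMMSS strings, so string comparison orders them correctly.
# Note: a MySQL/MariaDB client typically uses %s placeholders instead of ?.
QUERY = """
SELECT actor_name, COUNT(*) AS edits
FROM revision
JOIN actor ON actor_id = rev_actor
WHERE rev_timestamp >= ?
GROUP BY actor_name
ORDER BY edits DESC, actor_name
"""

def demo():
    """Run QUERY against a made-up, in-memory stand-in database."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, actor_name TEXT);
        CREATE TABLE revision (
            rev_id INTEGER PRIMARY KEY,
            rev_actor INTEGER,
            rev_timestamp TEXT
        );
        INSERT INTO actor VALUES (1, 'Alice'), (2, 'Bob');
        INSERT INTO revision VALUES
            (10, 1, '20240101000000'),
            (11, 1, '20240102000000'),
            (12, 2, '20240103000000'),
            (13, 2, '20231231000000');
    """)
    return db.execute(QUERY, ("20240101000000",)).fetchall()

if __name__ == "__main__":
    print(demo())
```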

See wikitech:Help:Cloud Services introduction#Communication and support

Recent changes stream

See EventStreams to subscribe to Recent changes on all Wikimedia wikis. This broadcasts edits and other changes as they happen.
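
Since the stream is delivered as Server-Sent Events, a consumer mainly needs to collect `data:` lines and decode each event's JSON payload when a blank line arrives. The following is a minimal sketch of that parsing step. The `SAMPLE` lines are made up, the helper name `iter_sse_events` is hypothetical, and the stream URL is given as commonly documented; the code does not open a network connection.

```python
import json

# Public recent changes stream (as commonly documented; not contacted here).
STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

def iter_sse_events(lines):
    """Yield decoded JSON payloads from an iterable of SSE text lines."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]  # SSE allows one optional space after the colon
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line terminates one event.
            yield json.loads("\n".join(data_parts))
            data_parts = []
        # Other fields (event:, id:, ':' comments) are ignored in this sketch.

# Made-up lines imitating one event from the stream.
SAMPLE = [
    "event: message\n",
    'id: [{"topic":"example","offset":1}]\n',
    'data: {"wiki": "enwiki", "type": "edit", "title": "Example"}\n',
    "\n",
]

if __name__ == "__main__":
    for change in iter_sse_events(SAMPLE):
        print(change["wiki"], change["type"], change["title"])
```

In practice, `lines` would come from a streaming HTTP response to `STREAM_URL`, read line by line with whichever HTTP client you prefer.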

See wikitech:Event Platform/EventStreams/Powered By

Analytics Datasets on dumps.wikimedia.org offers stable and continuous datasets about web request statistics (including page views, mediacounts, unique devices), page revision history, data by country, and Wikidata QRanks.

Pageview statistics

Pageview statistics are one example. Each request for a page reaches one of Wikimedia's Varnish caching hosts. The project name and the title of the requested page are logged and aggregated hourly.

Files starting with "project" contain the total hits per project per hour.

Per-country pageviews data is also available, sanitized for privacy reasons. See this announcement post (June 2023).

See the README for details on the format.
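
As a rough illustration of working with the hourly files, the sketch below parses one line into named fields. It assumes the four space-separated fields commonly described for these files (domain code, page title, view count, response size); check the README before relying on this layout. The sample lines and the names `PageviewRow` and `parse_pageview_line` are made up for the example.

```python
from collections import namedtuple

PageviewRow = namedtuple("PageviewRow", "domain_code page_title views response_bytes")

def parse_pageview_line(line):
    """Parse one line of an hourly pageview file (assumed 4-field layout)."""
    # Split the two numeric fields from the right, then the domain from the left.
    head, views, size = line.rstrip("\n").rsplit(" ", 2)
    domain, title = head.split(" ", 1)
    return PageviewRow(domain, title, int(views), int(size))

def total_views(lines, domain_code):
    """Sum the views for one domain code across an iterable of lines."""
    rows = (parse_pageview_line(line) for line in lines)
    return sum(row.views for row in rows if row.domain_code == domain_code)

if __name__ == "__main__":
    sample = ["en Main_Page 42 0\n", "en Example 8 0\n", "de Beispiel 5 0\n"]
    print(parse_pageview_line(sample[0]))
    print(total_views(sample, "en"))
```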

You can interactively browse the page view statistics at https://pageviews.toolforge.org. More documentation on the Pageviews Analysis tool is available.

The Wikipedia clickstream dataset contains counts of (referrer, resource) pairs extracted from the request logs of Wikipedia.
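
A typical use of this dataset is to find where readers of a given article came from. The following is a minimal sketch that assumes the tab-separated layout commonly documented for the clickstream files (referrer, resource, link type, count). The sample rows and the helper name `top_referrers` are made up for illustration.

```python
import csv
import io
from collections import Counter

def top_referrers(tsv_file, target, k=3):
    """Return the k most frequent referrers for one target article."""
    counts = Counter()
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed lines
        prev, curr, _link_type, n = row
        if curr == target:
            counts[prev] += int(n)
    return counts.most_common(k)

# Made-up rows in the assumed (prev, curr, type, n) layout.
SAMPLE = (
    "other-search\tExample\texternal\t120\n"
    "Main_Page\tExample\tlink\t30\n"
    "Main_Page\tOther\tlink\t7\n"
)

if __name__ == "__main__":
    print(top_referrers(io.StringIO(SAMPLE), "Example"))
```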

The public "Geoeditors" dataset contains information about the monthly number of active editors from a particular country on a particular Wikipedia language edition (bucketed and redacted for privacy reasons). For some earlier years, similar data is available at [1]/[2], see also Edits by project and country of origin.

Additional datasets (mostly irregular or discontinued ones) are published at https://analytics.wikimedia.org/datasets/. These include Caching research data and AS Performance Report.

Wikistats is an informal but widely recognized name for a set of reports which provide monthly trend information for all Wikimedia projects and wikis.

Wikistats provides many dashboards that display trends in reading, contributing, and content, broken down by project.

Data is presented as charts with the option to download the underlying data.

For more details on Wikistats, see wikitech:Data Platform/Systems/Wikistats 2.

DBpedia.org is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data.
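
To make "sophisticated queries" concrete, here is a minimal sketch that prepares a SPARQL request for the public DBpedia endpoint. The query itself (films by one director) is only an illustration, the helper name `sparql_url` is hypothetical, and the `query` and `format` parameters are the ones the endpoint is generally understood to accept. The code builds the URL but does not send it.

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: up to five films directed by Stanley Kubrick.
EXAMPLE_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Stanley_Kubrick .
} LIMIT 5
"""

def sparql_url(query, endpoint=SPARQL_ENDPOINT):
    """Encode a SPARQL query as a GET URL requesting JSON results."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)

if __name__ == "__main__":
    print(sparql_url(EXAMPLE_QUERY))
```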

The English version of the DBpedia knowledge base describes millions of things, and the majority of items are classified in a consistent ontology (persons, places, creative works like music albums, films and video games, organizations like companies and educational institutions, species, diseases, etc.). Localized versions of DBpedia in more than a hundred languages describe millions of things.

The data set also features:

The Wikimedia organization on the Open Knowledge Foundation's DataHub was established by the Wikimedia Foundation around 2013, and contains a collection of datasets about Wikipedia and other projects which mostly date from around 2013-2016.

Wikivoyage also maintains data on its own DataHub:

Differential privacy

The WMF privacy engineering team uses differential privacy to release data that would otherwise be too sensitive to release. This data currently only includes pageview statistics; in the future, it will include statistics about editors, CentralNotice impressions and views, search, and more.
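
For readers unfamiliar with the idea, the following is a generic textbook sketch of one differential privacy technique, the Laplace mechanism, applied to a single count. It is not a description of the WMF pipeline, whose mechanisms and parameters are not stated on this page; the function names, the epsilon values, and the example count are all assumptions made for illustration.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return a noisy count: Laplace noise with scale sensitivity / epsilon.

    A smaller epsilon means stronger privacy and therefore more noise.
    """
    return true_count + laplace_noise(sensitivity / epsilon, rng)

if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    for eps in (0.1, 1.0, 10.0):
        print(eps, round(private_count(1000, eps, rng), 2))
```

The noisy values scatter around the true count, which is why released differentially-private counts can differ slightly from the underlying figures.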

Differentially-private data is currently available in static TSV form at https://analytics.wikimedia.org/published/datasets/. Work to make this data available via API is ongoing.

Differentially-private data and code are available under a Creative Commons Zero license.