Common Crawl News Dataset

The news dataset includes articles from news sites all over the world. WARC files are released on a daily basis. The news crawl started in 2016; please see the news dataset announcement for further information.

The source code of the news crawler is available on our GitHub account. Please report issues there and share your suggestions for improvements with us.

News Dataset WARC File Location

The WARC file names of the news dataset follow the pattern:

crawl-data/CC-NEWS/yyyy/mm/CC-NEWS-yyyymmddHHMMSS-nnnnn.warc.gz

with

yyyy	year
mm	month (01..12)
dd	day of month (01..31)
HH	hour (00..23)
MM	minute (00..59)
SS	second (00..59)
nnnnn	serial WARC file number. The serial number is reset when the crawl process is resumed.

The timestamp (yyyymmddHHMMSS) indicates the time the first record in the WARC file was created.
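
The naming pattern above can be taken apart with a regular expression. The following is a minimal sketch rather than an official tool: the function name `parse_warc_path`, the `NewsWarcName` type and the example path are made up for illustration. This page does not state a time zone for the timestamp, so the sketch returns a naive `datetime`.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Regular expression mirroring the documented pattern:
# crawl-data/CC-NEWS/yyyy/mm/CC-NEWS-yyyymmddHHMMSS-nnnnn.warc.gz
WARC_NAME = re.compile(
    r"crawl-data/CC-NEWS/(?P<year>\d{4})/(?P<month>\d{2})/"
    r"CC-NEWS-(?P<timestamp>\d{14})-(?P<serial>\d{5})\.warc\.gz$"
)


class NewsWarcName(NamedTuple):
    """Hypothetical container for the parts of a news WARC file name."""
    timestamp: datetime  # creation time of the first record (naive, no tz)
    serial: int          # serial WARC file number


def parse_warc_path(path: str) -> NewsWarcName:
    """Extract timestamp and serial number from a news WARC path."""
    match = WARC_NAME.search(path)
    if match is None:
        raise ValueError(f"not a CC-NEWS WARC path: {path!r}")
    # strptime also rejects impossible values such as month 13 or hour 24.
    timestamp = datetime.strptime(match["timestamp"], "%Y%m%d%H%M%S")
    return NewsWarcName(timestamp, int(match["serial"]))


if __name__ == "__main__":
    # Made-up example path that only follows the documented pattern.
    example = "crawl-data/CC-NEWS/2024/05/CC-NEWS-20240501000123-00042.warc.gz"
    print(parse_warc_path(example))
```

Because the serial number is reset whenever the crawl process is resumed, sorting by the timestamp is likely a safer way to order files than sorting by the serial number alone.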

We provide WARC file listings by month. The path listings are found at

s3://commoncrawl/crawl-data/CC-NEWS/yyyy/mm/warc.paths.gz

or, over HTTPS, at

https://data.commoncrawl.org/crawl-data/CC-NEWS/yyyy/mm/warc.paths.gz
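
As an illustration of how the monthly listings might be used, the sketch below builds the HTTPS listing URL for a given month and turns a downloaded listing into full WARC URLs. It assumes that a listing is gzip-compressed text with one path per line and that each path is relative to `https://data.commoncrawl.org/`, in the same way as the file name pattern above. The function names are hypothetical; please check the Get Started page for the recommended access methods.

```python
import gzip
import urllib.request
from typing import List

# Base URL taken from the HTTPS listing location given above.
BASE_URL = "https://data.commoncrawl.org/"


def listing_url(year: int, month: int) -> str:
    """Build the HTTPS URL of the WARC path listing for one month."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return f"{BASE_URL}crawl-data/CC-NEWS/{year:04d}/{month:02d}/warc.paths.gz"


def parse_listing(gz_bytes: bytes) -> List[str]:
    """Decompress a listing and return its non-empty lines.

    Assumption: the listing holds one WARC path per line.
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def warc_urls(gz_bytes: bytes) -> List[str]:
    """Turn the paths of a listing into full HTTPS URLs.

    Assumption: paths are relative to BASE_URL.
    """
    return [BASE_URL + path for path in parse_listing(gz_bytes)]


def fetch_listing(year: int, month: int) -> bytes:
    """Download the raw (still compressed) listing for one month."""
    with urllib.request.urlopen(listing_url(year, month), timeout=30) as resp:
        return resp.read()


if __name__ == "__main__":
    # Requires network access; prints the first few WARC URLs of a month.
    for url in warc_urls(fetch_listing(2024, 5))[:3]:
        print(url)
```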

To access the data, please see our Get Started page.

News Dataset Size By Year

For every year (linked) we provide an overview by month including links to the WARC file listings.

Year	Num. WARC files	Total WARC Size, Compressed (TiB)
2026	n/a	n/a
2025	5988	5.839
2024	6224	6.072
2023	8318	8.102
2022	7956	7.754
2021	6605	6.435
2020	5395	5.263
2019	3536	3.449
2018	2613	2.548
2017	1583	1.504
2016	207	0.151
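
To estimate download volume before fetching anything, the figures in the table can be combined into totals and an average compressed size per WARC file. The sketch below copies the rows from the table, skips the 2026 row because its values are not available, and uses hypothetical names (`ROWS`, `avg_gib_per_file`). With these numbers the average works out to roughly 1 GiB per file for most of the years listed.

```python
# (year, number of WARC files, total compressed size in TiB),
# copied from the table above; 2026 is omitted because it is "n/a".
ROWS = [
    (2025, 5988, 5.839),
    (2024, 6224, 6.072),
    (2023, 8318, 8.102),
    (2022, 7956, 7.754),
    (2021, 6605, 6.435),
    (2020, 5395, 5.263),
    (2019, 3536, 3.449),
    (2018, 2613, 2.548),
    (2017, 1583, 1.504),
    (2016, 207, 0.151),
]


def avg_gib_per_file(num_files: int, total_tib: float) -> float:
    """Average compressed size per WARC file in GiB (1 TiB = 1024 GiB)."""
    return total_tib * 1024 / num_files


total_files = sum(num_files for _, num_files, _ in ROWS)
total_tib = sum(size for _, _, size in ROWS)

if __name__ == "__main__":
    for year, num_files, size in ROWS:
        print(f"{year}: {avg_gib_per_file(num_files, size):.3f} GiB per file")
    print(f"all years: {total_files} files, {total_tib:.3f} TiB")
```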