Common Crawl News Dataset

The news dataset includes articles from news sites all over the world. WARC files are released on a daily basis. The news crawl started in 2016; please see the news dataset announcement for further information.

The source code of the news crawler is available on our GitHub account. Please report issues there and share your suggestions for improvements with us.

News Dataset WARC File Location

The WARC file names of the news dataset follow the pattern:

crawl-data/CC-NEWS/yyyy/mm/CC-NEWS-yyyymmddHHMMSS-nnnnn.warc.gz

with

yyyy   year
mm     month (01..12)
dd     day of month (01..31)
HH     hour (00..23)
MM     minute (00..59)
SS     second (00..59)
nnnnn  serial WARC file number. The serial number is reset when the crawl process is resumed.

The timestamp (yyyymmddHHMMSS) indicates the time the first record in the WARC file was created.
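As an illustration of the naming pattern, the following minimal sketch splits a WARC path into its timestamp and serial number. The function and class names are hypothetical, the example path is made up to match the pattern, and the timestamp is returned without a time zone because none is stated here.

```python
import re
from datetime import datetime
from typing import NamedTuple


class NewsWarcName(NamedTuple):
    """Fields recovered from a CC-NEWS WARC path (hypothetical helper type)."""
    timestamp: datetime  # creation time of the first record in the file
    serial: int          # serial WARC file number (nnnnn)


# Regex derived from the documented pattern; treating the serial number as
# exactly five digits is an assumption of this sketch.
_NAME_RE = re.compile(
    r"crawl-data/CC-NEWS/\d{4}/\d{2}/CC-NEWS-(\d{14})-(\d{5})\.warc\.gz$"
)


def parse_news_warc_path(path: str) -> NewsWarcName:
    """Parse a path such as crawl-data/CC-NEWS/yyyy/mm/CC-NEWS-...warc.gz."""
    match = _NAME_RE.search(path)
    if match is None:
        raise ValueError(f"not a CC-NEWS WARC path: {path!r}")
    stamp, serial = match.groups()
    # Naive datetime: the time zone is not specified in the text above.
    timestamp = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return NewsWarcName(timestamp=timestamp, serial=int(serial))


# Made-up path that follows the pattern, for demonstration only.
example = "crawl-data/CC-NEWS/2024/03/CC-NEWS-20240315142507-00042.warc.gz"
parsed = parse_news_warc_path(example)
print(parsed.timestamp.isoformat(), parsed.serial)
```

Because the serial number is reset whenever the crawl process resumes, sorting by timestamp first and serial number second is likely a safer ordering than sorting by serial number alone.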

We provide WARC file listings by month. The path listings are found at

s3://commoncrawl/crawl-data/CC-NEWS/yyyy/mm/warc.paths.gz

or, over HTTPS, at

https://data.commoncrawl.org/crawl-data/CC-NEWS/yyyy/mm/warc.paths.gz
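A minimal sketch of how a monthly listing might be retrieved over HTTPS is shown here. The function names are hypothetical, and the sketch assumes the listing is a gzip-compressed text file with one WARC path per line, which the `.paths.gz` suffix suggests but this page does not state.

```python
import gzip
import urllib.request

# Base of the HTTPS location given above.
BASE_URL = "https://data.commoncrawl.org/"


def news_paths_url(year: int, month: int) -> str:
    """Build the URL of the WARC path listing for one month."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return f"{BASE_URL}crawl-data/CC-NEWS/{year:04d}/{month:02d}/warc.paths.gz"


def fetch_news_warc_paths(year: int, month: int) -> list[str]:
    """Download and decompress one monthly listing (requires network access).

    Assumes gzip-compressed UTF-8 text with one path per line.
    """
    with urllib.request.urlopen(news_paths_url(year, month)) as response:
        payload = response.read()
    return gzip.decompress(payload).decode("utf-8").splitlines()


print(news_paths_url(2024, 3))
```

Under the same assumption, each returned path could be appended to the base URL to form the download URL of an individual WARC file.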

To access the data, please see our Get Started page.

News Dataset Size By Year

For every year (linked) we provide an overview by month including links to the WARC file listings.

Year  Num. WARC files  Total WARC size, compressed (TiB)
2026  n/a              n/a
2025  5988             5.839
2024  6224             6.072
2023  8318             8.102
2022  7956             7.754
2021  6605             6.435
2020  5395             5.263
2019  3536             3.449
2018  2613             2.548
2017  1583             1.504
2016  207              0.151