Common Crawl News Dataset
The news dataset includes articles from news sites all over the world. WARC files are released on a daily basis. The news crawl was started in 2016; please see the news dataset announcement for further information.
The source code of the news crawler is available on our GitHub account. Please report issues there and share your suggestions for improvements with us.
News Dataset WARC File Location
The WARC file names of the news dataset follow the pattern:
crawl-data/CC-NEWS/yyyy/mm/CC-NEWS-yyyymmddHHMMSS-nnnnn.warc.gz
with
- yyyy: year
- mm: month (01..12)
- dd: day of month (01..31)
- HH: hour (00..23)
- MM: minute (00..59)
- SS: second (00..59)
- nnnnn: serial WARC file number. The serial number is reset when the crawl process is resumed.
The timestamp (yyyymmddHHMMSS) indicates the time the first record in the WARC file was created.
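The naming pattern can be checked mechanically. The following is a minimal sketch, assuming Python and a regular expression derived from the pattern above; the function name, the group names, and the sample path are illustrative and not part of the dataset documentation. The section does not state the timezone of the timestamp, so the parsed value is left timezone-naive.

```python
import re
from datetime import datetime

# Regular expression derived from the documented pattern:
# crawl-data/CC-NEWS/yyyy/mm/CC-NEWS-yyyymmddHHMMSS-nnnnn.warc.gz
CC_NEWS_PATH_RE = re.compile(
    r"crawl-data/CC-NEWS/(?P<year>\d{4})/(?P<month>\d{2})/"
    r"CC-NEWS-(?P<timestamp>\d{14})-(?P<serial>\d{5})\.warc\.gz$"
)


def parse_news_warc_path(path):
    """Return (timestamp, serial) for a CC-NEWS WARC path (hypothetical helper)."""
    match = CC_NEWS_PATH_RE.search(path)
    if match is None:
        raise ValueError(f"not a CC-NEWS WARC path: {path!r}")
    # strptime also rejects impossible values such as month 13 or hour 24.
    timestamp = datetime.strptime(match.group("timestamp"), "%Y%m%d%H%M%S")
    return timestamp, int(match.group("serial"))


# Made-up file name that merely follows the pattern.
sample = "crawl-data/CC-NEWS/2024/05/CC-NEWS-20240501003245-00123.warc.gz"
timestamp, serial = parse_news_warc_path(sample)
print(timestamp.isoformat(), serial)
```

Because the serial number is reset whenever the crawl process is resumed, sorting by the timestamp is likely a more reliable way to order files than sorting by the serial number.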
We provide WARC file listings by month. The path listings are found at
s3://commoncrawl/crawl-data/CC-NEWS/yyyy/mm/warc.paths.gz
or, via HTTPS, at
https://data.commoncrawl.org/crawl-data/CC-NEWS/yyyy/mm/warc.paths.gz
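As a rough sketch of how the monthly listings might be used, the Python snippet below builds both locations for a given month and decodes a listing that has already been downloaded. It assumes the listing is a gzip-compressed text file with one WARC path per line; the function names are hypothetical, and the sample listing is constructed in memory rather than fetched.

```python
import gzip


def listing_locations(year, month):
    """Return the (S3, HTTPS) locations of the path listing for one month."""
    key = f"crawl-data/CC-NEWS/{year:04d}/{month:02d}/warc.paths.gz"
    return f"s3://commoncrawl/{key}", f"https://data.commoncrawl.org/{key}"


def read_paths(gz_bytes):
    """Decode a downloaded listing, assuming one WARC path per line."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


s3_url, https_url = listing_locations(2024, 5)
print(s3_url)
print(https_url)

# In-memory stand-in for a downloaded listing; the file name is made up.
fake_listing = gzip.compress(
    b"crawl-data/CC-NEWS/2024/05/CC-NEWS-20240501003245-00123.warc.gz\n"
)
print(read_paths(fake_listing))
```

To work with a real listing, the bytes could be obtained with something like `urllib.request.urlopen(https_url).read()` and passed to `read_paths`; each returned path is then appended to `https://data.commoncrawl.org/` or `s3://commoncrawl/`, mirroring the prefixes used for the listings themselves.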
For information on accessing the data, please see our Get Started page.
News Dataset Size By Year
For every year (linked) we provide an overview by month including links to the WARC file listings.
| Year | Num. WARC files | Total WARC Size, Compressed (TiB) |
|---|---|---|
| 2026 | n/a | n/a |
| 2025 | 5988 | 5.839 |
| 2024 | 6224 | 6.072 |
| 2023 | 8318 | 8.102 |
| 2022 | 7956 | 7.754 |
| 2021 | 6605 | 6.435 |
| 2020 | 5395 | 5.263 |
| 2019 | 3536 | 3.449 |
| 2018 | 2613 | 2.548 |
| 2017 | 1583 | 1.504 |
| 2016 | 207 | 0.151 |
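For planning downloads, it can help to turn these totals into an average size per WARC file. The Python sketch below copies the figures from the table (2026 is omitted because its values are n/a) and converts TiB to GiB using 1 TiB = 1024 GiB; the variable and function names are illustrative.

```python
# (year, number of WARC files, total compressed size in TiB), copied from the table.
SIZE_BY_YEAR = [
    (2025, 5988, 5.839),
    (2024, 6224, 6.072),
    (2023, 8318, 8.102),
    (2022, 7956, 7.754),
    (2021, 6605, 6.435),
    (2020, 5395, 5.263),
    (2019, 3536, 3.449),
    (2018, 2613, 2.548),
    (2017, 1583, 1.504),
    (2016, 207, 0.151),
]


def mean_warc_size_gib(num_files, total_tib):
    """Average compressed size of one WARC file in GiB."""
    return total_tib * 1024 / num_files


for year, num_files, total_tib in SIZE_BY_YEAR:
    print(year, round(mean_warc_size_gib(num_files, total_tib), 3))
```

On these figures, the files from 2017 onward average a little under 1 GiB compressed, while the 2016 files are smaller on average.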