
PG-19 Language Modelling Benchmark

This repository contains the PG-19 language modeling benchmark. It includes a set of books, extracted from the Project Gutenberg books library [1], that were published before 1919. It also contains metadata of book titles and publication dates.

Full dataset download link

PG-19 is over double the size of the Billion Word benchmark [2] and contains documents that are 20X longer, on average, than the WikiText long-range language modelling benchmark [3].

Books are partitioned into a train, validation, and test set. Book metadata is stored in metadata.csv, which contains (book_id, short_book_title, publication_date).
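As a minimal sketch of reading the metadata, the snippet below parses rows into named tuples. It assumes that metadata.csv has no header row and that the first three columns appear in the order listed above; neither assumption is confirmed here, so check the file before relying on it. The names `BookMeta` and `read_metadata` and the sample rows are hypothetical.

```python
import csv
import io
from typing import List, NamedTuple


class BookMeta(NamedTuple):
    book_id: str
    short_book_title: str
    publication_date: str


def read_metadata(handle) -> List[BookMeta]:
    """Parse rows of (book_id, short_book_title, publication_date).

    Assumes no header row; any columns beyond the first three are ignored.
    """
    books = []
    for row in csv.reader(handle):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        books.append(BookMeta(row[0], row[1], row[2]))
    return books


# Hypothetical rows for illustration only, not taken from the real file.
sample = io.StringIO('101,Example Title One,1850\n102,"Example, Title Two",1901\n')
books = read_metadata(sample)
```

With the real file, the same function could be called as `read_metadata(open("metadata.csv", newline="", encoding="utf-8"))`, assuming the file sits in the working directory.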

Unlike prior benchmarks, we do not constrain the vocabulary size (i.e. by mapping rare words to an UNK token) but instead release the data as an open-vocabulary benchmark. The only processing applied to the text is the removal of boilerplate license text and the mapping of offensive discriminatory words, as specified by Ofcom [4], to placeholder tokens. Users are free to model the data at the character level, at the subword level, or via any mechanism that can model an arbitrary string of text.
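One possible open-vocabulary scheme is to model raw UTF-8 bytes, which needs no UNK token because every string maps to ids in the range 0 to 255. The sketch below is only an illustration of that idea, not the method used by the benchmark authors; the function names and the sample string are hypothetical.

```python
from typing import List


def encode_bytes(text: str) -> List[int]:
    """Map any string to integer ids in [0, 255] via its UTF-8 bytes."""
    return list(text.encode("utf-8"))


def decode_bytes(ids: List[int]) -> str:
    """Invert encode_bytes, recovering the original string."""
    return bytes(ids).decode("utf-8")


# Characters outside ASCII expand to several bytes, so the id sequence
# is longer than the string but the mapping stays lossless.
sample = "Kočiský read the æther."
ids = encode_bytes(sample)
```

The trade-off is sequence length: byte-level sequences are several times longer than word-level ones, which matters for a benchmark built around very long documents.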

To compare models, we propose to continue measuring word-level perplexity: compute the total negative log-likelihood of the dataset (via any chosen subword vocabulary or character-based scheme), divide it by the number of word-level tokens specified in the dataset statistics table below, and exponentiate.
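The normalization can be sketched as follows, using the standard definition of perplexity as the exponential of the average negative log-likelihood per word-level token. The sketch assumes the model's total negative log-likelihood has been summed in nats over whatever subword or character units it uses; the function and constant names are hypothetical, and the token counts are copied from the dataset statistics table.

```python
import math

# Word-level token counts from the dataset statistics table.
NUM_WORD_TOKENS = {
    "train": 1_973_136_207,
    "validation": 3_007_061,
    "test": 6_966_499,
}


def word_level_perplexity(total_nll_nats: float, split: str) -> float:
    """Return exp(total negative log-likelihood / number of word tokens).

    total_nll_nats is summed over the model's own tokens (subwords,
    characters, bytes), but the divisor is always the word-level count,
    so results are comparable across tokenization schemes.
    """
    return math.exp(total_nll_nats / NUM_WORD_TOKENS[split])


def nats_from_bits(total_bits: float) -> float:
    """Convert a total measured in bits (log base 2) to nats."""
    return total_bits * math.log(2)


# Sanity check: a model that spends ln(100) nats per word on average
# has a word-level perplexity of 100.
example = word_level_perplexity(NUM_WORD_TOKENS["test"] * math.log(100), "test")
```

Dividing by the model's own token count instead of the word-level count would give a per-subword or per-character perplexity, which is not comparable between models with different vocabularies.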

One could use this dataset for benchmarking long-range language models, or use it to pre-train for other natural language processing tasks which require long-range reasoning, such as LAMBADA [5] or NarrativeQA [6]. We would not recommend using this dataset to train a general-purpose language model, e.g. for applications to a production-system dialogue agent, due to the dated linguistic style of old texts and the inherent biases present in historical writing.

Dataset Statistics

| | Train | Validation | Test |
| ----------- | ------------- | --------- | --------- |
| Books | 28,602 | 50 | 100 |
| Num. Tokens | 1,973,136,207 | 3,007,061 | 6,966,499 |

Bibtex

@article{raecompressive2019,
author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
          Hillier, Chloe and Lillicrap, Timothy P},
title = {Compressive Transformers for Long-Range Sequence Modelling},
journal = {arXiv preprint},
url = {https://arxiv.org/abs/1911.05507},
year = {2019},
}

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property	value
name	The PG-19 Language Modeling Benchmark
alternateName	PG-19
url	https://github.com/deepmind/pg19
sameAs	https://github.com/deepmind/pg19
description	This repository contains the PG-19 dataset. It includes a set of books extracted from the Project Gutenberg books project (https://www.gutenberg.org), that were published before 1919. It also contains metadata of book titles and publication dates.
provider	DeepMind (sameAs: https://en.wikipedia.org/wiki/DeepMind)
license	Apache License, Version 2.0 (url: https://www.apache.org/licenses/LICENSE-2.0.html)
citation	https://identifiers.org/arxiv:1911.05507

Contact

If you have any questions, please contact Jack Rae.

References

[1] https://www.gutenberg.org
[2] Chelba et al. "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling" (2013)
[3] Merity et al. "Pointer Sentinel Mixture Models" (2016)
[4] Ofcom offensive language guide
[5] Paperno et al. "The LAMBADA dataset: Word prediction requiring a broad discourse context" (2016)
[6] Kočiský et al. "The NarrativeQA Reading Comprehension Challenge" (2018)