GitHub - allenai/peS2o: Pretraining Efficiently on S2ORC!

peS2o logo. It's a picture of a mortar and pestle with documents flying in.

Pretraining Efficiently on S2ORC!

The peS2o dataset is a collection of ~40M open-access academic papers, cleaned, filtered, and formatted for pre-training of language models. It is derived from the Semantic Scholar Open Research Corpus (Lo et al., 2020), or S2ORC.

peS2o is available on the Huggingface Hub!

from datasets import load_dataset

dataset = load_dataset("allenai/peS2o", "v2", split="train")

We release multiple versions of peS2o, each with different processing and a different knowledge cutoff date. We recommend using the latest version available.

If you use this dataset, please cite:

@techreport{peS2o,
    author = {Luca Soldaini and Kyle Lo},
    year = 2023,
    title = {{peS2o (Pretraining Efficiently on S2ORC) Dataset}},
    institution = {{Allen Institute for AI}},
    note = {ODC-By, \url{https://github.com/allenai/pes2o}}
}

Document Format

Each document in the dataset is a dictionary with the following fields:


peS2o V1

Key Facts

Processing

Processing differs slightly depending on whether a document was derived from the full-text corpus (s2orc) or the title-and-abstract corpus (s2ag).
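The two subcorpora can be told apart at the record level. As a minimal sketch: this assumes each record carries a source field naming its subcorpus, which is illustrative here rather than confirmed by the repo.

```python
def processing_pipeline(doc: dict) -> str:
    # Hypothetical routing helper: assumes a "source" field whose value
    # starts with the subcorpus name (e.g. "s2orc" or "s2ag"). The field
    # name and values are assumptions for illustration.
    source = doc.get("source", "")
    if source.startswith("s2orc"):
        return "full-text processing"
    if source.startswith("s2ag"):
        return "title-and-abstract processing"
    return "unknown source"
```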

S2ORC-derived documents

Unfiltered, S2ORC contains 11.3M papers and 46.9B whitespace-separated tokens as of 2023-01-03. To derive peS2o v1, we impose the following constraints:

The train set contains papers published before 2022-12-01; the validation set contains papers published from 2022-12-01 through 2023-01-03.
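The date-based split rule above can be sketched as follows; the exact boundary handling (which side 2022-12-01 itself falls on) is an assumption.

```python
from datetime import date

TRAIN_CUTOFF = date(2022, 12, 1)
VALID_END = date(2023, 1, 3)

def assign_split(published: date) -> str:
    # Papers before the cutoff go to train; papers from the cutoff
    # through the knowledge cutoff date go to validation. Boundary
    # placement of the cutoff day itself is assumed here.
    if published < TRAIN_CUTOFF:
        return "train"
    if published <= VALID_END:
        return "valid"
    return "excluded"
```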

S2AG-derived documents

The S2AG corpus contains titles and abstracts of papers in Semantic Scholar. Unfiltered, the corpus contains 91.1M papers and 15.5B whitespace-separated tokens as of 2023-01-03. To derive peS2o v1, we impose the following constraints:

Statistics

| Dataset | Split | # Documents | # Words |
|---------|-------|------------:|--------:|
| s2orc | train | 8,242,162 | 36,088,195,908 |
| s2orc | valid | 51,323 | 255,139,074 |
| s2ag | train | 59,382,301 | 11,009,123,378 |
| s2ag | valid | 111,228 | 24,398,512 |
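The "# Words" column counts whitespace-separated tokens, as described in the processing sections. A minimal sketch of that counting convention:

```python
def count_words(text: str) -> int:
    # Whitespace-separated token count, matching the "# Words" convention;
    # str.split() with no argument collapses runs of whitespace.
    return len(text.split())
```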

peS2o V2

Key Facts

Processing

peS2o V2 is largely the same as V1, but it adds heuristics for s2ag documents aimed at filtering out OCR errors from abstracts.

First, we check whether the abstract was obtained from Semantic Scholar sources that are likely to contain OCR'ed content. For any abstract derived from those sources, we count how often the text contains subsequences matching `\b([A-Za-z]\s)([a-z]\s)*[A-Za-z]\b`, i.e., individual alphabetic characters separated by a space. This heuristic matches cases such as `A b stra ct` (2 matching subsequences), where the OCR parser inserted erroneous spaces. Any abstract with more than 4 matching subsequences is removed.
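The heuristic above might look like the following in Python's re module; the function name and threshold parameter are illustrative, and only the regex and the "more than 4" rule come from the description.

```python
import re

# Runs of single letters separated by single whitespace characters,
# as quoted in the processing description above.
OCR_SPACING = re.compile(r"\b([A-Za-z]\s)([a-z]\s)*[A-Za-z]\b")

def is_ocr_garbled(abstract: str, max_matches: int = 4) -> bool:
    # Flag the abstract for removal when more than `max_matches`
    # space-separated single-letter runs are found. findall returns
    # one group tuple per non-overlapping match, so len() counts runs.
    return len(OCR_SPACING.findall(abstract)) > max_matches
```

Note that each contiguous run of spaced letters counts as one subsequence, so an abstract needs several separate runs, not just several stray letters, to be dropped.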

Statistics

| Dataset | Split | # Documents | # Words |
|---------|-------|------------:|--------:|
| s2orc | train | 8,242,162 | 36,088,195,908 |
| s2orc | valid | 51,323 | 255,139,074 |
| s2ag | train | 30,569,017 | 5,920,099,207 |
| s2ag | valid | 109,709 | 24,029,459 |