Builder classes

Builders

🤗 Datasets relies on two main classes during the dataset building process: DatasetBuilder and BuilderConfig.

class datasets.DatasetBuilder

( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None **config_kwargs )

Abstract base class for all datasets.

DatasetBuilder has 3 key methods:

- DatasetBuilder.info: documents the dataset, including feature names, types, shapes, version, splits, citation, etc.
- DatasetBuilder.download_and_prepare(): downloads the source data and writes it to disk.
- DatasetBuilder.as_dataset(): generates a Dataset.

Some DatasetBuilders expose multiple variants of the dataset by defining a BuilderConfig subclass and accepting a config object (or name) on construction. Configurable datasets expose a pre-defined set of configurations in DatasetBuilder.builder_configs().

as_dataset

( split: typing.Optional[datasets.splits.Split] = None run_post_process = True verification_mode: typing.Union[datasets.utils.info_utils.VerificationMode, str, NoneType] = None in_memory = False )

Return a Dataset for the specified split.

Example:

>>> from datasets import load_dataset_builder
>>> builder = load_dataset_builder('cornell-movie-review-data/rotten_tomatoes')
>>> builder.download_and_prepare()
>>> ds = builder.as_dataset(split='train')
>>> ds
Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})
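
If no split is passed, as_dataset returns every available split at once. A short sketch continuing the example above (assuming the prepared rotten_tomatoes builder, whose splits are train, validation, and test):

>>> dsets = builder.as_dataset()  # no split: a DatasetDict with one Dataset per split
>>> list(dsets.keys())
['train', 'validation', 'test']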

download_and_prepare

( output_dir: typing.Optional[str] = None download_config: typing.Optional[datasets.download.download_config.DownloadConfig] = None download_mode: typing.Union[datasets.download.download_manager.DownloadMode, str, NoneType] = None verification_mode: typing.Union[datasets.utils.info_utils.VerificationMode, str, NoneType] = None dl_manager: typing.Optional[datasets.download.download_manager.DownloadManager] = None base_path: typing.Optional[str] = None file_format: str = 'arrow' max_shard_size: typing.Union[str, int, NoneType] = None num_proc: typing.Optional[int] = None storage_options: typing.Optional[dict] = None **download_and_prepare_kwargs )

Downloads and prepares dataset for reading.

Example:

Download and prepare the dataset as Arrow files that can be loaded as a Dataset using builder.as_dataset():

from datasets import load_dataset_builder
builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
builder.download_and_prepare()

Download and prepare the dataset as sharded Parquet files locally:

from datasets import load_dataset_builder
builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
builder.download_and_prepare("./output_dir", file_format="parquet")

Download and prepare the dataset as sharded Parquet files in cloud storage:

from datasets import load_dataset_builder
storage_options = {"key": aws_access_key_id, "secret": aws_secret_access_key}
builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
builder.download_and_prepare("s3://my-bucket/my_rotten_tomatoes", storage_options=storage_options, file_format="parquet")

get_all_exported_dataset_infos

Return a dictionary mapping each config name to its exported DatasetInfo. Empty dict if it doesn't exist.

Example:

>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder('vivos')
>>> ds_builder.get_all_exported_dataset_infos()
{'default': DatasetInfo(description='', citation='', homepage='', license='', features={'speaker_id': Value(dtype='string', id=None), 'path': Value(dtype='string', id=None), 'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None), 'sentence': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, builder_name=None, dataset_name=None, config_name='default', version=None, splits={'train': SplitInfo(name='train', num_bytes=1722002133, num_examples=11660, shard_lengths=None, dataset_name=None), 'test': SplitInfo(name='test', num_bytes=86120227, num_examples=760, shard_lengths=None, dataset_name=None)}, download_checksums=None, download_size=1475540500, post_processing_size=None, dataset_size=1808122360, size_in_bytes=None)}

get_exported_dataset_info

Return the exported DatasetInfo for the builder's config. Empty DatasetInfo if it doesn't exist.

Example:

>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder('cornell-movie-review-data/rotten_tomatoes')
>>> ds_builder.get_exported_dataset_info()
DatasetInfo(description='', citation='', homepage='', license='', features={'speaker_id': Value(dtype='string', id=None), 'path': Value(dtype='string', id=None), 'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None), 'sentence': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, builder_name=None, dataset_name=None, config_name='default', version=None, splits={'train': SplitInfo(name='train', num_bytes=1722002133, num_examples=11660, shard_lengths=None, dataset_name=None), 'test': SplitInfo(name='test', num_bytes=86120227, num_examples=760, shard_lengths=None, dataset_name=None)}, download_checksums=None, download_size=1475540500, post_processing_size=None, dataset_size=1808122360, size_in_bytes=None)

get_imported_module_dir

Return the path of the module of this class or subclass.

class datasets.GeneratorBasedBuilder

( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None **config_kwargs )

Base class for datasets with data generation based on dict generators.

GeneratorBasedBuilder is a convenience class that abstracts away much of the data writing and reading of DatasetBuilder. It expects subclasses to define their splits in _split_generators and to implement generators of feature dictionaries (one per example) in _generate_examples. See the method docstrings for details.
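
For illustration, a minimal sketch of a GeneratorBasedBuilder subclass; the class name, URL, and CSV columns below are hypothetical:

import csv

import datasets


class MyDataset(datasets.GeneratorBasedBuilder):
    """Illustrative builder reading a hypothetical CSV file with 'text' and 'label' columns."""

    VERSION = datasets.Version("1.0.0")

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"text": datasets.Value("string"), "label": datasets.Value("int64")}
            )
        )

    def _split_generators(self, dl_manager):
        # Placeholder URL; dl_manager downloads (or, in streaming mode, prepares) the data.
        path = dl_manager.download_and_extract("https://example.com/data.csv")
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": path})]

    def _generate_examples(self, filepath):
        # Yield (key, example) pairs; keys must be unique within a split.
        with open(filepath, encoding="utf-8") as f:
            for idx, row in enumerate(csv.DictReader(f)):
                yield idx, {"text": row["text"], "label": int(row["label"])}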

class datasets.ArrowBasedBuilder

( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None **config_kwargs )

Base class for datasets with data generation based on Arrow loading functions (CSV/JSON/Parquet).
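
For illustration, a minimal sketch of an ArrowBasedBuilder subclass that yields whole pyarrow tables; the class name and URL are placeholders:

import pyarrow.csv as pac

import datasets


class MyArrowDataset(datasets.ArrowBasedBuilder):
    """Illustrative builder that yields pyarrow Tables instead of single examples."""

    def _info(self):
        # Features can also be inferred from the yielded Arrow tables.
        return datasets.DatasetInfo()

    def _split_generators(self, dl_manager):
        # Placeholder URL; in practice this points at the real data files.
        path = dl_manager.download("https://example.com/data.csv")
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": path})]

    def _generate_tables(self, filepath):
        # Yield (key, pyarrow.Table) pairs; tables are written to Arrow/Parquet directly.
        yield 0, pac.read_csv(filepath)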

class datasets.BuilderConfig

( name: str = 'default' version: typing.Union[str, datasets.utils.version.Version, NoneType] = 0.0.0 data_dir: typing.Optional[str] = None data_files: typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None description: typing.Optional[str] = None )

Base class for DatasetBuilder data configuration.

DatasetBuilder subclasses with data configuration options should subclass BuilderConfig and add their own properties.
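
A minimal sketch of such a custom configuration, assuming a hypothetical language option:

import dataclasses

import datasets


@dataclasses.dataclass
class MyDatasetConfig(datasets.BuilderConfig):
    """Illustrative config adding a custom 'language' option."""

    language: str = "en"


# A builder then declares its configurations via class attributes, e.g.:
#
# class MyDataset(datasets.GeneratorBasedBuilder):
#     BUILDER_CONFIG_CLASS = MyDatasetConfig
#     BUILDER_CONFIGS = [
#         MyDatasetConfig(name="en", language="en"),
#         MyDatasetConfig(name="fr", language="fr"),
#     ]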

create_config_id

( config_kwargs: dict custom_features: typing.Optional[datasets.features.features.Features] = None )

The config id is used to build the cache directory. By default it is equal to the config name. However, the config name alone is not a unique identifier for the dataset being generated, since it doesn't take into account:

- the config kwargs that can be used to overwrite attributes
- the custom features used to write the dataset
- the data_files for json/text/csv/pandas datasets

Therefore the config id is just the config name with an optional suffix based on these.

Download

class datasets.DownloadManager

( dataset_name: typing.Optional[str] = None data_dir: typing.Optional[str] = None download_config: typing.Optional[datasets.download.download_config.DownloadConfig] = None base_path: typing.Optional[str] = None record_checksums = True )

download

( url_or_urls ) → str or list or dict

Returns str or list or dict: the downloaded paths matching the given input url_or_urls.

Download given URL(s).

By default, only one process is used for download. Pass customized download_config.num_proc to change this behavior.

Example:

downloaded_files = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz')
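
download also maps over nested lists and dicts of URLs and returns the downloaded paths in the same structure. A sketch with placeholder URLs:

downloaded = dl_manager.download({
    "train": "https://example.com/train.csv",
    "test": ["https://example.com/test-0.csv", "https://example.com/test-1.csv"],
})
# `downloaded` has the same structure, with each URL replaced by a local cached path, e.g.
# {"train": "/path/in/cache/...", "test": ["/path/in/cache/...", "/path/in/cache/..."]}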

download_and_extract

( url_or_urls ) → extracted_path(s)

Returns extracted_path(s): str, extracted paths of given URL(s).

Download and extract given url_or_urls.

Is roughly equivalent to:

extracted_paths = dl_manager.extract(dl_manager.download(url_or_urls))

extract

( path_or_paths ) → extracted_path(s)

Returns extracted_path(s): str, the extracted paths matching the given input path_or_paths.

Extract given path(s).

Example:

downloaded_files = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz')
extracted_files = dl_manager.extract(downloaded_files)

iter_archive

( path_or_buf: typing.Union[str, _io.BufferedReader] ) → tuple[str, io.BufferedReader]

Yields tuple[str, io.BufferedReader]: pairs of (path within the archive, file object).

Iterate over files within an archive.

Example:

archive = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz')
files = dl_manager.iter_archive(archive)
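
The generator yields (path inside the archive, file object) pairs, so the archive never has to be extracted to disk. A sketch continuing the example above (the ".pos" filter is just for illustration):

for path_within_archive, file_obj in files:
    # file_obj is a binary file-like object
    if path_within_archive.endswith(".pos"):
        lines = file_obj.read().splitlines()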

iter_files

( paths: typing.Union[str, list[str]] ) → str

Iterate over file paths.

Example:

files = dl_manager.download_and_extract('https://huggingface.co/datasets/beans/resolve/main/data/train.zip')
files = dl_manager.iter_files(files)
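
Each yielded item is a plain file path. A sketch continuing the example above (the ".jpg" filter is just for illustration):

for file_path in files:
    if file_path.endswith(".jpg"):
        pass  # e.g. open and decode the image here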

class datasets.StreamingDownloadManager

( dataset_name: typing.Optional[str] = None data_dir: typing.Optional[str] = None download_config: typing.Optional[datasets.download.download_config.DownloadConfig] = None base_path: typing.Optional[str] = None )

Download manager that uses the "::" separator to navigate through (possibly remote) compressed archives. Contrary to the regular DownloadManager, the download and extract methods don't actually download or extract data; instead, they return the path or URL that can be opened using the xopen function, which extends the built-in open function to stream data from remote files.
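
For example (a sketch; the exact chained URL format is an implementation detail):

url = dl_manager.download_and_extract("https://huggingface.co/datasets/beans/resolve/main/data/train.zip")
# Nothing has been downloaded or extracted at this point; `url` is a chained URL along the lines of
# "zip://::https://huggingface.co/datasets/beans/resolve/main/data/train.zip", and the data is only
# streamed when files are actually opened, e.g. while iterating with iter_files.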

download

( url_or_urls ) → url(s)

Returns url(s) (str or list or dict): URL(s) to stream data from matching the given input url_or_urls.

Normalize URL(s) of files to stream data from. This is the lazy version of DownloadManager.download for streaming.

Example:

downloaded_files = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz')

download_and_extract

( url_or_urls ) → url(s)

Returns url(s) (str or list or dict): URL(s) to stream data from matching the given input url_or_urls.

Prepare given url_or_urls for streaming (add extraction protocol).

This is the lazy version of DownloadManager.download_and_extract for streaming.

Is equivalent to:

urls = dl_manager.extract(dl_manager.download(url_or_urls))

extract

( url_or_urls ) → url(s)

Returns url(s) (str or list or dict): URL(s) to stream data from matching the given input url_or_urls.

Add extraction protocol for given url(s) for streaming.

This is the lazy version of DownloadManager.extract for streaming.

Example:

downloaded_files = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz')
extracted_files = dl_manager.extract(downloaded_files)

iter_archive

( urlpath_or_buf: typing.Union[str, _io.BufferedReader] ) → tuple[str, io.BufferedReader]

Yields tuple[str, io.BufferedReader]: pairs of (path within the archive, file object).

Iterate over files within an archive.

Example:

archive = dl_manager.download('https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz')
files = dl_manager.iter_archive(archive)

iter_files

( urlpaths: typing.Union[str, list[str]] ) → str

Iterate over files.

Example:

files = dl_manager.download_and_extract('https://huggingface.co/datasets/beans/resolve/main/data/train.zip')
files = dl_manager.iter_files(files)

class datasets.DownloadConfig

( cache_dir: typing.Union[str, pathlib.Path, NoneType] = None force_download: bool = False resume_download: bool = False local_files_only: bool = False proxies: typing.Optional[dict] = None user_agent: typing.Optional[str] = None extract_compressed_file: bool = False force_extract: bool = False delete_extracted: bool = False extract_on_the_fly: bool = False use_etag: bool = True num_proc: typing.Optional[int] = None max_retries: int = 1 token: typing.Union[str, bool, NoneType] = None storage_options: dict = <factory> download_desc: typing.Optional[str] = None disable_tqdm: bool = False )

Configuration for our cached path manager.

class datasets.DownloadMode

( value names = None module = None qualname = None type = None start = 1 )

Enum for how to treat pre-existing downloads and data.

The default mode is REUSE_DATASET_IF_EXISTS, which will reuse both raw downloads and the prepared dataset if they exist.

The generation modes:

| Mode | Downloads | Dataset |
|---|---|---|
| REUSE_DATASET_IF_EXISTS (default) | Reuse | Reuse |
| REUSE_CACHE_IF_EXISTS | Reuse | Fresh |
| FORCE_REDOWNLOAD | Fresh | Fresh |
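
For example, the mode can be passed to load_dataset (or download_and_prepare) either as the enum or as its string value:

from datasets import DownloadMode, load_dataset

ds = load_dataset(
    "cornell-movie-review-data/rotten_tomatoes",
    download_mode=DownloadMode.FORCE_REDOWNLOAD,  # or the string "force_redownload"
)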

Verification

class datasets.VerificationMode

( value names = None module = None qualname = None type = None start = 1 )

Enum that specifies which verification checks to run.

The default mode is BASIC_CHECKS, which will perform only rudimentary checks to avoid slowdowns when generating/downloading a dataset for the first time.

The verification modes:

| Mode | Verification checks |
|---|---|
| ALL_CHECKS | Split checks, uniqueness of the keys yielded in case of the GeneratorBasedBuilder, and the validity (number of files, checksums, etc.) of downloaded files |
| BASIC_CHECKS (default) | Same as ALL_CHECKS but without checking downloaded files |
| NO_CHECKS | None |
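
For example, the mode can be passed to load_dataset (or download_and_prepare) either as the enum or as its string value:

from datasets import VerificationMode, load_dataset

ds = load_dataset(
    "cornell-movie-review-data/rotten_tomatoes",
    verification_mode=VerificationMode.ALL_CHECKS,  # or the string "all_checks"
)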

Splits

class datasets.SplitGenerator

( name: str gen_kwargs: dict = <factory> )

Defines the split information for the generator.

This should be used as the returned value of GeneratorBasedBuilder._split_generators. See GeneratorBasedBuilder._split_generators for more info and an example of usage.

Example:

datasets.SplitGenerator(
    name=datasets.Split.TRAIN,
    gen_kwargs={"split_key": "train", "files": dl_manager.download_and_extract(url)},
)

class datasets.Split

Enum for dataset splits.

Datasets are typically split into different subsets to be used at various stages of training and evaluation.

All splits, including compositions, inherit from datasets.SplitBase.

See the guide on splits for more information.

Example:

datasets.SplitGenerator(
    name=datasets.Split.TRAIN,
    gen_kwargs={"split_key": "train", "files": dl_manager.download_and_extract(url)},
),
datasets.SplitGenerator(
    name=datasets.Split.VALIDATION,
    gen_kwargs={"split_key": "validation", "files": dl_manager.download_and_extract(url)},
),
datasets.SplitGenerator(
    name=datasets.Split.TEST,
    gen_kwargs={"split_key": "test", "files": dl_manager.download_and_extract(url)},
)

class datasets.NamedSplit

Descriptor corresponding to a named split (train, test, …).

Example:

Each descriptor can be composed with others using addition or slicing:

split = datasets.Split.TRAIN.subsplit(datasets.percent[0:25]) + datasets.Split.TEST

The resulting split will correspond to 25% of the train split merged with 100% of the test split.

A split cannot be added twice, so the following will fail:

split = (
    datasets.Split.TRAIN.subsplit(datasets.percent[:25]) +
    datasets.Split.TRAIN.subsplit(datasets.percent[75:])
)
split = datasets.Split.TEST + datasets.Split.ALL

Slices can be applied only once, so the following are valid:

split = (
    datasets.Split.TRAIN.subsplit(datasets.percent[:25]) +
    datasets.Split.TEST.subsplit(datasets.percent[:50])
)
split = (datasets.Split.TRAIN + datasets.Split.TEST).subsplit(datasets.percent[:50])

But this is not valid:

train = datasets.Split.TRAIN
test = datasets.Split.TEST
split = train.subsplit(datasets.percent[:25]).subsplit(datasets.percent[:25])
split = (train.subsplit(datasets.percent[:25]) + test).subsplit(datasets.percent[:50])

class datasets.NamedSplitAll

Split corresponding to the union of all defined dataset splits.

class datasets.ReadInstruction

( split_name rounding = None from_ = None to = None unit = None )

Reading instruction for a dataset.

Examples:

ds = datasets.load_dataset('mnist', split='test[:33%]')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec('test[:33%]'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction('test', to=33, unit='%'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction('test', from_=0, to=33, unit='%'))

ds = datasets.load_dataset('mnist', split='test[:33%]+train[1:-1]')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec('test[:33%]+train[1:-1]'))
ds = datasets.load_dataset('mnist', split=(
    datasets.ReadInstruction('test', to=33, unit='%') +
    datasets.ReadInstruction('train', from_=1, to=-1, unit='abs')))

ds = datasets.load_dataset('mnist', split='test:33%')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec('test:33%'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction('test', from_=0, to=33, unit='%', rounding="pct1_dropremainder"))

tests = datasets.load_dataset(
    'mnist',
    [datasets.ReadInstruction('train', from_=k, to=k+10, unit='%') for k in range(0, 100, 10)])
trains = datasets.load_dataset(
    'mnist',
    [datasets.ReadInstruction('train', to=k, unit='%') + datasets.ReadInstruction('train', from_=k+10, unit='%') for k in range(0, 100, 10)])

from_spec

( spec )

Creates a ReadInstruction instance out of a string spec.

Examples:

- test: test split.
- test + validation: test split + validation split.
- test[10:]: test split, minus its first 10 records.
- test[:10%]: first 10% records of test split.
- test:20%: first 20% of the records, rounded with the pct1_dropremainder rounding.
- test[:-5%]+train[40%:60%]: first 95% of test + middle 20% of train.

to_absolute

( name2len )

Translate instruction into a list of absolute instructions.

Those absolute instructions are then to be added together.

Version

class datasets.Version

( version_str: str description: typing.Optional[str] = None major: typing.Union[str, int, NoneType] = None minor: typing.Union[str, int, NoneType] = None patch: typing.Union[str, int, NoneType] = None )

Dataset version MAJOR.MINOR.PATCH.

Example:

VERSION = datasets.Version("1.0.0")
