Loading methods

Methods for listing and loading datasets:

Datasets

datasets.load_dataset


( path: str name: typing.Optional[str] = None data_dir: typing.Optional[str] = None data_files: typing.Union[str, collections.abc.Sequence[str], collections.abc.Mapping[str, typing.Union[str, collections.abc.Sequence[str]]], NoneType] = None split: typing.Union[str, datasets.splits.Split, list[str], list[datasets.splits.Split], NoneType] = None cache_dir: typing.Optional[str] = None features: typing.Optional[datasets.features.features.Features] = None download_config: typing.Optional[datasets.download.download_config.DownloadConfig] = None download_mode: typing.Union[datasets.download.download_manager.DownloadMode, str, NoneType] = None verification_mode: typing.Union[datasets.utils.info_utils.VerificationMode, str, NoneType] = None keep_in_memory: typing.Optional[bool] = None save_infos: bool = False revision: typing.Union[datasets.utils.version.Version, str, NoneType] = None token: typing.Union[bool, str, NoneType] = None streaming: bool = False num_proc: typing.Optional[int] = None storage_options: typing.Optional[dict] = None **config_kwargs ) → Dataset or DatasetDict

Returns a Dataset or DatasetDict, or an IterableDataset or IterableDatasetDict if streaming=True.

Load a dataset from the Hugging Face Hub, or a local dataset.

You can find the list of datasets on the Hub or with huggingface_hub.list_datasets.

A dataset is a directory that contains some data files in generic formats (JSON, CSV, Parquet, etc.) and possibly in a generic structure (Webdataset, ImageFolder, AudioFolder, VideoFolder, etc.)

This function does the following under the hood:

  1. Load a dataset builder:
    • Find the most common data format in the dataset and pick its associated builder (JSON, CSV, Parquet, Webdataset, ImageFolder, AudioFolder, etc.)
    • Find which file goes into which split (e.g. train/test) based on file and directory names or on the YAML configuration
    • It is also possible to specify data_files manually, and which dataset builder to use (e.g. “parquet”).
  2. Run the dataset builder:
    In the general case:
    • Download the data files from the dataset if they are not already available locally or cached.
    • Process and cache the dataset as typed Arrow tables.
      Arrow tables are arbitrarily long, typed tables which can store nested objects and be mapped to numpy/pandas/python generic types. They can be accessed directly from disk, loaded in RAM, or even streamed over the web.
    In the streaming case:
    • Don’t download or cache anything. Instead, the dataset is lazily loaded and streamed on-the-fly when iterating over it.
  3. Return a dataset built from the requested splits in split (default: all).

Example:

Load a dataset from the Hugging Face Hub:

from datasets import load_dataset
ds = load_dataset('cornell-movie-review-data/rotten_tomatoes', split='train')

from datasets import load_dataset
ds = load_dataset('nyu-mll/glue', 'sst2', split='train')

data_files = {'train': 'train.csv', 'test': 'test.csv'}
ds = load_dataset('namespace/your_dataset_name', data_files=data_files)

ds = load_dataset('namespace/your_dataset_name', data_dir='folder_name')

Load a dataset from a Storage Bucket on the Hugging Face Hub:

from datasets import load_dataset
ds = load_dataset('buckets/username/bucket_name/rotten_tomatoes', split='train')

Load a local dataset:

from datasets import load_dataset
ds = load_dataset('csv', data_files='path/to/local/my_dataset.csv')

from datasets import load_dataset
ds = load_dataset('json', data_files='path/to/local/my_dataset.json')

Load an IterableDataset:

from datasets import load_dataset
ds = load_dataset('cornell-movie-review-data/rotten_tomatoes', split='train', streaming=True)

Load an image dataset with the ImageFolder dataset builder:

from datasets import load_dataset
ds = load_dataset('imagefolder', data_dir='/path/to/images', split='train')

datasets.load_from_disk


( dataset_path: typing.Union[str, bytes, os.PathLike] keep_in_memory: typing.Optional[bool] = None storage_options: typing.Optional[dict] = None ) → Dataset or DatasetDict


Loads a dataset that was previously saved using save_to_disk() from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.

Example:

from datasets import load_from_disk
ds = load_from_disk('path/to/dataset/directory')
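
The storage_options argument lets load_from_disk read from a remote filesystem through fsspec. A minimal sketch, assuming the dataset was saved to an S3 bucket (bucket name and credentials are illustrative; requires the s3fs package):

from datasets import load_from_disk
storage_options = {"key": "<aws_access_key_id>", "secret": "<aws_secret_access_key>"}  # illustrative credentials
ds = load_from_disk("s3://my-bucket/rotten_tomatoes_train", storage_options=storage_options)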

datasets.load_dataset_builder


( path: str name: typing.Optional[str] = None data_dir: typing.Optional[str] = None data_files: typing.Union[str, collections.abc.Sequence[str], collections.abc.Mapping[str, typing.Union[str, collections.abc.Sequence[str]]], NoneType] = None cache_dir: typing.Optional[str] = None features: typing.Optional[datasets.features.features.Features] = None download_config: typing.Optional[datasets.download.download_config.DownloadConfig] = None download_mode: typing.Union[datasets.download.download_manager.DownloadMode, str, NoneType] = None revision: typing.Union[datasets.utils.version.Version, str, NoneType] = None token: typing.Union[bool, str, NoneType] = None storage_options: typing.Optional[dict] = None **config_kwargs )


Load a dataset builder, which can be used to inspect general information about a dataset (cache directory, config, dataset info, features, etc.) without downloading the dataset itself.

You can find the list of datasets on the Hub or with huggingface_hub.list_datasets.

A dataset is a directory that contains some data files in generic formats (JSON, CSV, Parquet, etc.) and possibly in a generic structure (Webdataset, ImageFolder, AudioFolder, VideoFolder, etc.)

Example:

from datasets import load_dataset_builder
ds_builder = load_dataset_builder('cornell-movie-review-data/rotten_tomatoes')
ds_builder.info.features
{'label': ClassLabel(names=['neg', 'pos']), 'text': Value('string')}
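
A builder obtained this way can also materialize the dataset using the standard DatasetBuilder methods download_and_prepare() and as_dataset():

from datasets import load_dataset_builder
ds_builder = load_dataset_builder('cornell-movie-review-data/rotten_tomatoes')
ds_builder.download_and_prepare()  # download the data files and cache them as Arrow tables
ds = ds_builder.as_dataset(split='train')  # build a Dataset from the cached files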

datasets.get_dataset_config_names


( path: str revision: typing.Union[datasets.utils.version.Version, str, NoneType] = None download_config: typing.Optional[datasets.download.download_config.DownloadConfig] = None download_mode: typing.Union[datasets.download.download_manager.DownloadMode, str, NoneType] = None data_files: typing.Union[str, list, dict, NoneType] = None **download_kwargs )


Get the list of available config names for a particular dataset.

Example:

from datasets import get_dataset_config_names
get_dataset_config_names("nyu-mll/glue")
['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax']

datasets.get_dataset_infos


( path: str data_files: typing.Union[str, list, dict, NoneType] = None download_config: typing.Optional[datasets.download.download_config.DownloadConfig] = None download_mode: typing.Union[datasets.download.download_manager.DownloadMode, str, NoneType] = None revision: typing.Union[datasets.utils.version.Version, str, NoneType] = None token: typing.Union[bool, str, NoneType] = None **config_kwargs )


Get the meta information about a dataset, returned as a dict mapping config name to DatasetInfoDict.

Example:

from datasets import get_dataset_infos
get_dataset_infos('cornell-movie-review-data/rotten_tomatoes')
{'default': DatasetInfo(description="Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews...", ...), ...}

datasets.get_dataset_split_names


( path: str config_name: typing.Optional[str] = None data_files: typing.Union[str, collections.abc.Sequence[str], collections.abc.Mapping[str, typing.Union[str, collections.abc.Sequence[str]]], NoneType] = None download_config: typing.Optional[datasets.download.download_config.DownloadConfig] = None download_mode: typing.Union[datasets.download.download_manager.DownloadMode, str, NoneType] = None revision: typing.Union[datasets.utils.version.Version, str, NoneType] = None token: typing.Union[bool, str, NoneType] = None **config_kwargs )


Get the list of available splits for a particular config and dataset.

Example:

from datasets import get_dataset_split_names
get_dataset_split_names('cornell-movie-review-data/rotten_tomatoes')
['train', 'validation', 'test']

From files

Configurations used to load data files. They are used when loading local files or a dataset repository.

You can pass arguments to load_dataset to configure data loading. For example, you can specify the sep parameter to configure the CsvConfig that is used to load the data:

load_dataset("csv", data_dir="path/to/data/dir", sep="\t")

Text

class datasets.packaged_modules.text.TextConfig


( name: str = 'default' version: typing.Union[datasets.utils.version.Version, str, NoneType] = 0.0.0 data_dir: typing.Optional[str] = None data_files: typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None description: typing.Optional[str] = None features: typing.Optional[datasets.features.features.Features] = None encoding: str = 'utf-8' encoding_errors: typing.Optional[str] = None chunksize: int = 10485760 keep_linebreaks: bool = False sample_by: typing.Literal['line', 'paragraph', 'document'] = 'line' )


BuilderConfig for text files.
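
For example, the sample_by parameter controls whether each example is a line, a paragraph, or a whole document; a minimal sketch with an illustrative file path:

from datasets import load_dataset
ds = load_dataset("text", data_files="path/to/corpus.txt", sample_by="paragraph", keep_linebreaks=True)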

class datasets.packaged_modules.text.Text


( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None config_id: typing.Optional[str] = None **config_kwargs )

CSV

class datasets.packaged_modules.csv.CsvConfig


( name: str = 'default' version: typing.Union[datasets.utils.version.Version, str, NoneType] = 0.0.0 data_dir: typing.Optional[str] = None data_files: typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None description: typing.Optional[str] = None sep: str = ',' delimiter: typing.Optional[str] = None header: typing.Union[int, list[int], str, NoneType] = 'infer' names: typing.Optional[list[str]] = None column_names: typing.Optional[list[str]] = None index_col: typing.Union[int, str, list[int], list[str], NoneType] = None usecols: typing.Union[list[int], list[str], NoneType] = None prefix: typing.Optional[str] = None mangle_dupe_cols: bool = True engine: typing.Optional[typing.Literal['c', 'python', 'pyarrow']] = None converters: dict = None true_values: typing.Optional[list] = None false_values: typing.Optional[list] = None skipinitialspace: bool = False skiprows: typing.Union[int, list[int], NoneType] = None nrows: typing.Optional[int] = None na_values: typing.Union[str, list[str], NoneType] = None keep_default_na: bool = True na_filter: bool = True verbose: bool = False skip_blank_lines: bool = True thousands: typing.Optional[str] = None decimal: str = '.' lineterminator: typing.Optional[str] = None quotechar: str = '"' quoting: int = 0 escapechar: typing.Optional[str] = None comment: typing.Optional[str] = None encoding: typing.Optional[str] = None dialect: typing.Optional[str] = None error_bad_lines: bool = True warn_bad_lines: bool = True skipfooter: int = 0 doublequote: bool = True memory_map: bool = False float_precision: typing.Optional[str] = None chunksize: int = 10000 features: typing.Optional[datasets.features.features.Features] = None encoding_errors: typing.Optional[str] = 'strict' on_bad_lines: typing.Literal['error', 'warn', 'skip'] = 'error' date_format: typing.Optional[str] = None )

BuilderConfig for CSV.
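
These options are forwarded by load_dataset; for example, to read a tab-separated file that has no header row (file path and column names are illustrative):

from datasets import load_dataset
ds = load_dataset("csv", data_files="path/to/data.tsv", sep="\t", column_names=["text", "label"])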

class datasets.packaged_modules.csv.Csv


( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None config_id: typing.Optional[str] = None **config_kwargs )

JSON

class datasets.packaged_modules.json.JsonConfig


( name: str = 'default' version: typing.Union[datasets.utils.version.Version, str, NoneType] = 0.0.0 data_dir: typing.Optional[str] = None data_files: typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None description: typing.Optional[str] = None features: typing.Optional[datasets.features.features.Features] = None encoding: str = 'utf-8' encoding_errors: typing.Optional[str] = None field: typing.Optional[str] = None use_threads: bool = True block_size: typing.Optional[int] = None chunksize: int = 10485760 newlines_in_values: typing.Optional[bool] = None on_mixed_types: typing.Optional[typing.Literal['use_json']] = 'use_json' )

BuilderConfig for JSON.
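
For example, the field parameter selects the key that holds the list of examples in a nested JSON file (file path and field name are illustrative):

from datasets import load_dataset
ds = load_dataset("json", data_files="path/to/file.json", field="data")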

class datasets.Json


( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None config_id: typing.Optional[str] = None **config_kwargs )

XML

class datasets.packaged_modules.xml.XmlConfig


( name: str = 'default' version: typing.Union[datasets.utils.version.Version, str, NoneType] = 0.0.0 data_dir: typing.Optional[str] = None data_files: typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None description: typing.Optional[str] = None features: typing.Optional[datasets.features.features.Features] = None encoding: str = 'utf-8' encoding_errors: typing.Optional[str] = None )

BuilderConfig for xml files.
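
A minimal sketch of loading XML files with this builder (the file path is illustrative):

from datasets import load_dataset
ds = load_dataset("xml", data_files="path/to/file.xml", encoding="utf-8")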

class datasets.packaged_modules.xml.Xml


( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None config_id: typing.Optional[str] = None **config_kwargs )

Parquet

class datasets.packaged_modules.parquet.ParquetConfig


( name: str = 'default' version: typing.Union[datasets.utils.version.Version, str, NoneType] = 0.0.0 data_dir: typing.Optional[str] = None data_files: typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None description: typing.Optional[str] = None batch_size: typing.Optional[int] = None columns: typing.Optional[list[str]] = None features: typing.Optional[datasets.features.features.Features] = None filters: typing.Union[pyarrow._compute.Expression, list[tuple], list[list[tuple]], NoneType] = None fragment_scan_options: typing.Optional[pyarrow._dataset_parquet.ParquetFragmentScanOptions] = None on_bad_files: typing.Literal['error', 'warn', 'skip'] = 'error' )


BuilderConfig for Parquet.

Example:

Load a subset of columns:

ds = load_dataset(parquet_dataset_id, columns=["col_0", "col_1"])

Stream and efficiently filter data, possibly skipping entire files or row groups:

filters = [("col_0", "==", 0)]
ds = load_dataset(parquet_dataset_id, streaming=True, filters=filters)

When streaming, increase the minimum request size from 32MiB (default) to 128MiB and enable prefetching:

import pyarrow
import pyarrow.dataset
fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
    cache_options=pyarrow.CacheOptions(
        prefetch_limit=1,
        range_size_limit=128 << 20,
    ),
)
ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)

class datasets.packaged_modules.parquet.Parquet


( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None config_id: typing.Optional[str] = None **config_kwargs )

Arrow

class datasets.packaged_modules.arrow.ArrowConfig


( name: str = 'default' version: typing.Union[datasets.utils.version.Version, str, NoneType] = 0.0.0 data_dir: typing.Optional[str] = None data_files: typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None description: typing.Optional[str] = None features: typing.Optional[datasets.features.features.Features] = None )

BuilderConfig for Arrow.
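
A minimal sketch of loading Arrow files directly (the file path is illustrative):

from datasets import load_dataset
ds = load_dataset("arrow", data_files="path/to/data.arrow", split="train")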

class datasets.packaged_modules.arrow.Arrow


( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None config_id: typing.Optional[str] = None **config_kwargs )

SQL

class datasets.packaged_modules.sql.SqlConfig


( name: str = 'default' version: typing.Union[datasets.utils.version.Version, str, NoneType] = 0.0.0 data_dir: typing.Optional[str] = None data_files: typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None description: typing.Optional[str] = None sql: typing.Union[str, ForwardRef('sqlalchemy.sql.Selectable')] = None con: typing.Union[str, ForwardRef('sqlalchemy.engine.Connection'), ForwardRef('sqlalchemy.engine.Engine'), ForwardRef('sqlite3.Connection')] = None index_col: typing.Union[str, list[str], NoneType] = None coerce_float: bool = True params: typing.Union[list, tuple, dict, NoneType] = None parse_dates: typing.Union[list, dict, NoneType] = None columns: typing.Optional[list[str]] = None chunksize: typing.Optional[int] = 10000 features: typing.Optional[datasets.features.features.Features] = None )

BuilderConfig for SQL.
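
The SQL builder is typically used through Dataset.from_sql; a minimal sketch assuming a local SQLite database with a reviews table (database path and table name are illustrative):

from datasets import Dataset
ds = Dataset.from_sql("SELECT text, label FROM reviews", con="sqlite:///path/to/reviews.db")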

class datasets.packaged_modules.sql.Sql


( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None config_id: typing.Optional[str] = None **config_kwargs )

Images

class datasets.packaged_modules.imagefolder.ImageFolderConfig


( name: str = 'default' version: typing.Union[datasets.utils.version.Version, str, NoneType] = 0.0.0 data_dir: typing.Optional[str] = None data_files: typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None description: typing.Optional[str] = None features: typing.Optional[datasets.features.features.Features] = None drop_labels: bool = None drop_metadata: bool = None metadata_filenames: list = None filters: typing.Union[pyarrow._compute.Expression, list[tuple], list[list[tuple]], NoneType] = None )

BuilderConfig for ImageFolder.

class datasets.packaged_modules.imagefolder.ImageFolder


( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None config_id: typing.Optional[str] = None **config_kwargs )

Audio

class datasets.packaged_modules.audiofolder.AudioFolderConfig


( name: str = 'default' version: typing.Union[datasets.utils.version.Version, str, NoneType] = 0.0.0 data_dir: typing.Optional[str] = None data_files: typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None description: typing.Optional[str] = None features: typing.Optional[datasets.features.features.Features] = None drop_labels: bool = None drop_metadata: bool = None metadata_filenames: list = None filters: typing.Union[pyarrow._compute.Expression, list[tuple], list[list[tuple]], NoneType] = None )

BuilderConfig for AudioFolder.
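
For example, to load a directory of audio files with the AudioFolder builder (the directory path is illustrative):

from datasets import load_dataset
ds = load_dataset("audiofolder", data_dir="/path/to/audio_files", split="train")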

class datasets.packaged_modules.audiofolder.AudioFolder


( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None config_id: typing.Optional[str] = None **config_kwargs )

Videos

class datasets.packaged_modules.videofolder.VideoFolderConfig


( name: str = 'default' version: typing.Union[datasets.utils.version.Version, str, NoneType] = 0.0.0 data_dir: typing.Optional[str] = None data_files: typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None description: typing.Optional[str] = None features: typing.Optional[datasets.features.features.Features] = None drop_labels: bool = None drop_metadata: bool = None metadata_filenames: list = None filters: typing.Union[pyarrow._compute.Expression, list[tuple], list[list[tuple]], NoneType] = None )

BuilderConfig for VideoFolder.
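
For example, to load a directory of video files with the VideoFolder builder (the directory path is illustrative):

from datasets import load_dataset
ds = load_dataset("videofolder", data_dir="/path/to/videos", split="train")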

class datasets.packaged_modules.videofolder.VideoFolder


( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None config_id: typing.Optional[str] = None **config_kwargs )

HDF5

class datasets.packaged_modules.hdf5.HDF5Config


( name: str = 'default' version: typing.Union[datasets.utils.version.Version, str, NoneType] = 0.0.0 data_dir: typing.Optional[str] = None data_files: typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None description: typing.Optional[str] = None batch_size: typing.Optional[int] = None features: typing.Optional[datasets.features.features.Features] = None )

BuilderConfig for HDF5.

class datasets.packaged_modules.hdf5.HDF5


( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None config_id: typing.Optional[str] = None **config_kwargs )

ArrowBasedBuilder that converts HDF5 files to Arrow tables using the HF extension types.
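
A minimal sketch of loading HDF5 files, assuming the packaged "hdf5" builder name (the file path is illustrative):

from datasets import load_dataset
ds = load_dataset("hdf5", data_files="path/to/data.h5", split="train")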

Pdf

class datasets.packaged_modules.pdffolder.PdfFolderConfig


( name: str = 'default' version: typing.Union[datasets.utils.version.Version, str, NoneType] = 0.0.0 data_dir: typing.Optional[str] = None data_files: typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None description: typing.Optional[str] = None features: typing.Optional[datasets.features.features.Features] = None drop_labels: bool = None drop_metadata: bool = None metadata_filenames: list = None filters: typing.Union[pyarrow._compute.Expression, list[tuple], list[list[tuple]], NoneType] = None )

BuilderConfig for PdfFolder.
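
For example, to load a directory of PDF files with the PdfFolder builder (the directory path is illustrative):

from datasets import load_dataset
ds = load_dataset("pdffolder", data_dir="/path/to/pdfs", split="train")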

class datasets.packaged_modules.pdffolder.PdfFolder


( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None config_id: typing.Optional[str] = None **config_kwargs )

Nifti

class datasets.packaged_modules.niftifolder.NiftiFolderConfig


( name: str = 'default' version: typing.Union[datasets.utils.version.Version, str, NoneType] = 0.0.0 data_dir: typing.Optional[str] = None data_files: typing.Union[datasets.data_files.DataFilesDict, datasets.data_files.DataFilesPatternsDict, NoneType] = None description: typing.Optional[str] = None features: typing.Optional[datasets.features.features.Features] = None drop_labels: bool = None drop_metadata: bool = None metadata_filenames: list = None filters: typing.Union[pyarrow._compute.Expression, list[tuple], list[list[tuple]], NoneType] = None )

BuilderConfig for NiftiFolder.
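
For example, to load a directory of NIfTI scans with the NiftiFolder builder (the directory path is illustrative):

from datasets import load_dataset
ds = load_dataset("niftifolder", data_dir="/path/to/scans", split="train")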

class datasets.packaged_modules.niftifolder.NiftiFolder


( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None config_id: typing.Optional[str] = None **config_kwargs )

WebDataset

class datasets.packaged_modules.webdataset.WebDataset


( cache_dir: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None token: typing.Union[bool, str, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None storage_options: typing.Optional[dict] = None writer_batch_size: typing.Optional[int] = None config_id: typing.Optional[str] = None **config_kwargs )
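
For example, to stream a dataset stored as WebDataset TAR shards (the shard pattern is illustrative):

from datasets import load_dataset
ds = load_dataset("webdataset", data_files="path/to/shards/*.tar", split="train", streaming=True)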
