Main classes

DatasetInfo

class datasets.DatasetInfo

< source >

( description: str = citation: str = homepage: str = license: str = features: typing.Optional[datasets.features.features.Features] = None post_processed: typing.Optional[datasets.info.PostProcessedInfo] = None supervised_keys: typing.Optional[datasets.info.SupervisedKeysData] = None builder_name: typing.Optional[str] = None dataset_name: typing.Optional[str] = None config_name: typing.Optional[str] = None version: typing.Union[str, datasets.utils.version.Version, NoneType] = None splits: typing.Optional[dict] = None download_checksums: typing.Optional[dict] = None download_size: typing.Optional[int] = None post_processing_size: typing.Optional[int] = None dataset_size: typing.Optional[int] = None size_in_bytes: typing.Optional[int] = None )

Parameters

Information about a dataset.

DatasetInfo documents a dataset, including its name, version, and features. See the constructor arguments and properties for a full list.

Not all fields are known on construction and may be updated later.

from_directory

< source >

( dataset_info_dir: str storage_options: typing.Optional[dict] = None )

Parameters

Create DatasetInfo from the JSON file in dataset_info_dir.

This function updates all the dynamically generated fields (num_examples, hash, time of creation,…) of the DatasetInfo.

This will overwrite all previous metadata.

Example:

>>> from datasets import DatasetInfo
>>> ds_info = DatasetInfo.from_directory("/path/to/directory/")

write_to_directory

< source >

( dataset_info_dir pretty_print = False storage_options: typing.Optional[dict] = None )

Parameters

Write DatasetInfo and license (if present) as JSON files to dataset_info_dir.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds.info.write_to_directory("/path/to/directory/")

Dataset

The base class Dataset implements a Dataset backed by an Apache Arrow table.

class datasets.Dataset

< source >

( arrow_table: Table info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None indices_table: typing.Optional[datasets.table.Table] = None fingerprint: typing.Optional[str] = None )

A Dataset backed by an Arrow table.

add_column

< source >

( name: str column: typing.Union[list, ] new_fingerprint: str feature: typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.LargeList, datasets.features.features.Sequence, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image, datasets.features.video.Video, datasets.features.pdf.Pdf, NoneType] = None )

Parameters

Add column to Dataset.

Added in 1.7

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> more_text = ds["text"]
>>> ds.add_column(name="text_2", column=more_text)
Dataset({
    features: ['text', 'label', 'text_2'],
    num_rows: 1066
})

add_item

< source >

( item: dict new_fingerprint: str )

Parameters

Add item to Dataset.

Added in 1.7

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> new_review = {'label': 0, 'text': 'this movie is the absolute worst thing I have ever seen'}
>>> ds = ds.add_item(new_review)
>>> ds[-1]
{'label': 0, 'text': 'this movie is the absolute worst thing I have ever seen'}

from_file

< source >

( filename: str info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None indices_filename: typing.Optional[str] = None in_memory: bool = False )

Parameters

Instantiate a Dataset backed by an Arrow table at filename.
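As a rough sketch (the Arrow file path below is hypothetical):

from datasets import Dataset

# Memory-map an Arrow file previously written by the library (hypothetical path)
ds = Dataset.from_file("path/to/dataset.arrow")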

from_buffer

< source >

( buffer: Buffer info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None indices_buffer: typing.Optional[pyarrow.lib.Buffer] = None )

Parameters

Instantiate a Dataset backed by an Arrow buffer.
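A minimal sketch, assuming the buffer holds an Arrow table serialized in the Arrow IPC stream format:

import pyarrow as pa
from datasets import Dataset

# Serialize a small Arrow table into an in-memory buffer, then load it as a Dataset
table = pa.table({"text": ["good", "bad"], "label": [1, 0]})
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
ds = Dataset.from_buffer(sink.getvalue())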

from_pandas

< source >

( df: DataFrame features: typing.Optional[datasets.features.features.Features] = None info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None preserve_index: typing.Optional[bool] = None )

Parameters

Convert pandas.DataFrame to a pyarrow.Table to create a Dataset.

The column types in the resulting Arrow Table are inferred from the dtypes of the pandas.Series in the DataFrame. In the case of non-object Series, the NumPy dtype is translated to its Arrow equivalent. In the case of object, we need to guess the datatype by looking at the Python objects in this Series.

Be aware that Series of the object dtype don’t carry enough information to always lead to a meaningful Arrow type. In the case that we cannot infer a type, e.g. because the DataFrame is of length 0 or the Series only contains None/nan objects, the type is set to null. This behavior can be avoided by constructing explicit features and passing it to this function.

Important: a dataset created with from_pandas() lives in memory and therefore doesn’t have an associated cache directory. This may change in the future, but in the meantime if you want to reduce memory usage you should write it back on disk and reload using e.g. save_to_disk / load_from_disk.

Example:

ds = Dataset.from_pandas(df)

from_dict

< source >

( mapping: dict features: typing.Optional[datasets.features.features.Features] = None info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None )

Parameters

Convert dict to a pyarrow.Table to create a Dataset.

Important: a dataset created with from_dict() lives in memory and therefore doesn’t have an associated cache directory. This may change in the future, but in the meantime if you want to reduce memory usage you should write it back on disk and reload using e.g. save_to_disk / load_from_disk.
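A short example:

from datasets import Dataset

# Each key becomes a column; each list holds the column values
ds = Dataset.from_dict({"text": ["good movie", "bad movie"], "label": [1, 0]})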

from_generator

< source >

( generator: typing.Callable features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False gen_kwargs: typing.Optional[dict] = None num_proc: typing.Optional[int] = None split: NamedSplit = NamedSplit('train') **kwargs )

Parameters

Create a Dataset from a generator.

Example:

>>> def gen():
...     yield {"text": "Good", "label": 0}
...     yield {"text": "Bad", "label": 1}
...
>>> ds = Dataset.from_generator(gen)

>>> def gen(shards):
...     for shard in shards:
...         with open(shard) as f:
...             for line in f:
...                 yield {"line": line}
...
>>> shards = [f"data{i}.txt" for i in range(32)]
>>> ds = Dataset.from_generator(gen, gen_kwargs={"shards": shards})

The Apache Arrow table backing the dataset.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds.data
MemoryMappedTable
text: string
label: int64


text: [["compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .","the soundtrack alone is worth the price of admission .","rodriguez does a splendid job of racial profiling hollywood style--casting excellent latin actors of all ages--a trend long overdue .","beneath the film's obvious determination to shock at any cost lies considerable skill and determination , backed by sheer nerve .","bielinsky is a filmmaker of impressive talent .","so beautifully acted and directed , it's clear that washington most certainly has a new career ahead of him if he so chooses .","a visual spectacle full of stunning images and effects .","a gentle and engrossing character study .","it's enough to watch huppert scheming , with her small , intelligent eyes as steady as any noir villain , and to enjoy the perfectly pitched web of tension that chabrol spins .","an engrossing portrait of uncompromising artists trying to create something original against the backdrop of a corporate music industry that only seems to care about the bottom line .",...,"ultimately , jane learns her place as a girl , softens up and loses some of the intensity that made her an interesting character to begin with .","ah-nuld's action hero days might be over .","it's clear why deuces wild , which was shot two years ago , has been gathering dust on mgm's shelf .","feels like nothing quite so much as a middle-aged moviemaker's attempt to surround himself with beautiful , half-naked women .","when the precise nature of matthew's predicament finally comes into sharp focus , the revelation fails to justify the build-up .","this picture is murder by numbers , and as easy to be bored by as your abc's , despite a few whopping shootouts .","hilarious musical comedy though stymied by accents thick as mud .","if you are into splatter movies , then you will probably have a reasonably good time with the salton sea .","a dull , simple-minded and stereotypical tale of drugs , death and mind-numbing indifference on the inner-city streets .","the feature-length stretch . . . strains the show's concept ."]] label: [[1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0]]

The cache files containing the Apache Arrow table backing the dataset.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds.cache_files
[{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-validation.arrow'}]

Number of columns in the dataset.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds.num_columns
2

Number of rows in the dataset (same as Dataset.__len__()).

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds.num_rows
1066

Names of the columns in the dataset.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds.column_names
['text', 'label']

Shape of the dataset (number of rows, number of columns).

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds.shape
(1066, 2)

unique

< source >

( column: str ) → list

Parameters

List of unique elements in the given column.

Return a list of the unique elements in a column.

This is implemented in the low-level backend and as such, very fast.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds.unique('label')
[1, 0]

flatten

< source >

( new_fingerprint: typing.Optional[str] = None max_depth = 16 ) → Dataset

Parameters

A copy of the dataset with flattened columns.

Flatten the table. Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("rajpurkar/squad", split="train")
>>> ds.features
{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
 'context': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None)}
>>> ds.flatten()
Dataset({
    features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
    num_rows: 87599
})

cast

< source >

( features: Features batch_size: typing.Optional[int] = 1000 keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 num_proc: typing.Optional[int] = None ) → Dataset

Parameters

A copy of the dataset with casted features.

Cast the dataset to a new set of features.

Example:

>>> from datasets import load_dataset, ClassLabel, Value
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds.features
{'label': ClassLabel(names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}
>>> new_features = ds.features.copy()
>>> new_features['label'] = ClassLabel(names=['bad', 'good'])
>>> new_features['text'] = Value('large_string')
>>> ds = ds.cast(new_features)
>>> ds.features
{'label': ClassLabel(names=['bad', 'good'], id=None),
 'text': Value(dtype='large_string', id=None)}

cast_column

< source >

( column: str feature: typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.LargeList, datasets.features.features.Sequence, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image, datasets.features.video.Video, datasets.features.pdf.Pdf] new_fingerprint: typing.Optional[str] = None )

Parameters

Cast column to feature for decoding.

Example:

>>> from datasets import load_dataset, ClassLabel
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds.features
{'label': ClassLabel(names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}
>>> ds = ds.cast_column('label', ClassLabel(names=['bad', 'good']))
>>> ds.features
{'label': ClassLabel(names=['bad', 'good'], id=None),
 'text': Value(dtype='string', id=None)}

remove_columns

< source >

( column_names: typing.Union[str, list[str]] new_fingerprint: typing.Optional[str] = None ) → Dataset

Parameters

A copy of the dataset object without the columns to remove.

Remove one or several column(s) in the dataset and the features associated to them.

You can also remove a column using map() with remove_columns but the present method doesn’t copy the data of the remaining columns and is thus faster.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds = ds.remove_columns('label')
Dataset({
    features: ['text'],
    num_rows: 1066
})
>>> ds = ds.remove_columns(column_names=ds.column_names)
Dataset({
    features: [],
    num_rows: 0
})

rename_column

< source >

( original_column_name: str new_column_name: str new_fingerprint: typing.Optional[str] = None ) → Dataset

Parameters

A copy of the dataset with a renamed column.

Rename a column in the dataset, and move the features associated to the original column under the new column name.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds = ds.rename_column('label', 'label_new')
Dataset({
    features: ['text', 'label_new'],
    num_rows: 1066
})

rename_columns

< source >

( column_mapping: dict new_fingerprint: typing.Optional[str] = None ) → Dataset

Parameters

A copy of the dataset with renamed columns

Rename several columns in the dataset, and move the features associated to the original columns under the new column names.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds = ds.rename_columns({'text': 'text_new', 'label': 'label_new'})
Dataset({
    features: ['text_new', 'label_new'],
    num_rows: 1066
})

select_columns

< source >

( column_names: typing.Union[str, list[str]] new_fingerprint: typing.Optional[str] = None ) → Dataset

Parameters

A copy of the dataset object which only consists of selected columns.

Select one or several column(s) in the dataset and the features associated to them.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds.select_columns(['text'])
Dataset({
    features: ['text'],
    num_rows: 1066
})

class_encode_column

< source >

( column: str include_nulls: bool = False )

Parameters

Casts the given column as ClassLabel and updates the table.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("boolq", split="validation")
>>> ds.features
{'answer': Value(dtype='bool', id=None),
 'passage': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None)}
>>> ds = ds.class_encode_column('answer')
>>> ds.features
{'answer': ClassLabel(num_classes=2, names=['False', 'True'], id=None),
 'passage': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None)}

Number of rows in the dataset.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds.__len__
<bound method Dataset.__len__ of Dataset({
    features: ['text', 'label'],
    num_rows: 1066
})>

Iterate through the examples.

If a formatting is set with Dataset.set_format() rows will be returned with the selected format.
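A minimal sketch, assuming ds is a loaded Dataset with "text" and "label" columns as in the examples above:

for example in ds:
    # each `example` is a dict, e.g. {'text': ..., 'label': ...}
    print(example["text"])
    break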

iter

< source >

( batch_size: int drop_last_batch: bool = False )

Parameters

Iterate through the batches of size batch_size.

If a formatting is set with Dataset.set_format(), rows will be returned with the selected format.
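A minimal sketch of batched iteration (the column names are assumptions carried over from the examples above):

for batch in ds.iter(batch_size=32, drop_last_batch=True):
    # `batch` is a dict mapping column names to lists of 32 values
    texts, labels = batch["text"], batch["label"]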

formatted_as

< source >

( type: typing.Optional[str] = None columns: typing.Optional[list] = None output_all_columns: bool = False **format_kwargs )

Parameters

To be used in a with statement. Set __getitem__ return format (type and columns).
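A minimal sketch, assuming ds has a "label" column; the NumPy format only applies inside the with block:

with ds.formatted_as(type="numpy", columns=["label"]):
    labels = ds["label"]  # returned as a NumPy array inside the block
# outside the block, the previous format is restored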

set_format

< source >

( type: typing.Optional[str] = None columns: typing.Optional[list] = None output_all_columns: bool = False **format_kwargs )

Parameters

Set __getitem__ return format (type and columns). The data formatting is applied on-the-fly. The format type (for example “numpy”) is used to format batches when using __getitem__. It’s also possible to use custom transforms for formatting using set_transform().

It is possible to call map() after calling set_format(). Since map() may add new columns, the list of formatted columns gets updated: the columns added by map() are formatted as well, i.e. new formatted columns = (all columns - previously unformatted columns).

Example:

>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True)
>>> ds.set_format(type='numpy', columns=['text', 'label'])
>>> ds.format
{'type': 'numpy',
 'format_kwargs': {},
 'columns': ['text', 'label'],
 'output_all_columns': False}

set_transform

< source >

( transform: typing.Optional[typing.Callable] columns: typing.Optional[list] = None output_all_columns: bool = False )

Parameters

Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called. As with set_format(), this can be reset using reset_format().

Example:

from datasets import load_dataset from transformers import AutoTokenizer ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') def encode(batch): ... return tokenizer(batch['text'], padding=True, truncation=True, return_tensors='pt') ds.set_transform(encode) ds[0] {'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), 'input_ids': tensor([ 101, 29353, 2135, 15102, 1996, 9428, 20868, 2890, 8663, 6895, 20470, 2571, 3663, 2090, 4603, 3017, 3008, 1998, 2037, 24211, 5637, 1998, 11690, 2336, 1012, 102]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])}

Reset __getitem__ return format to python objects and all columns.

Same as self.set_format()

Example:

>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True)
>>> ds.set_format(type='numpy', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
>>> ds.format
{'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': 'numpy'}
>>> ds.reset_format()
>>> ds.format
{'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': None}

with_format

< source >

( type: typing.Optional[str] = None columns: typing.Optional[list] = None output_all_columns: bool = False **format_kwargs )

Parameters

Set __getitem__ return format (type and columns). The data formatting is applied on-the-fly. The format type (for example “numpy”) is used to format batches when using __getitem__.

It’s also possible to use custom transforms for formatting using with_transform().

Contrary to set_format(), with_format returns a new Dataset object.

Example:

from datasets import load_dataset from transformers import AutoTokenizer ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True) ds.format {'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'], 'format_kwargs': {}, 'output_all_columns': False, 'type': None} ds = ds.with_format("torch") ds.format {'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'], 'format_kwargs': {}, 'output_all_columns': False, 'type': 'torch'} ds[0] {'text': 'compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .', 'label': tensor(1), 'input_ids': tensor([ 101, 18027, 16310, 16001, 1103, 9321, 178, 11604, 7235, 6617, 1742, 2165, 2820, 1206, 6588, 22572, 12937, 1811, 2153, 1105, 1147, 12890, 19587, 6463, 1105, 15026, 1482, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])}

with_transform

< source >

( transform: typing.Optional[typing.Callable] columns: typing.Optional[list] = None output_all_columns: bool = False )

Parameters

Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called.

As with set_format(), this can be reset using reset_format().

Contrary to set_transform(), with_transform returns a new Dataset object.

Example:

from datasets import load_dataset from transformers import AutoTokenizer ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") def encode(example): ... return tokenizer(example["text"], padding=True, truncation=True, return_tensors='pt') ds = ds.with_transform(encode) ds[0] {'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), 'input_ids': tensor([ 101, 18027, 16310, 16001, 1103, 9321, 178, 11604, 7235, 6617, 1742, 2165, 2820, 1206, 6588, 22572, 12937, 1811, 2153, 1105, 1147, 12890, 19587, 6463, 1105, 15026, 1482, 119, 102]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])}

Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools).

Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one.

Be careful when running this command that no other process is currently using other cache files.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds.cleanup_cache_files()
10

map

< source >

( function: typing.Optional[typing.Callable] = None with_indices: bool = False with_rank: bool = False input_columns: typing.Union[str, list[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 drop_last_batch: bool = False remove_columns: typing.Union[str, list[str], NoneType] = None keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 features: typing.Optional[datasets.features.features.Features] = None disable_nullable: bool = False fn_kwargs: typing.Optional[dict] = None num_proc: typing.Optional[int] = None suffix_template: str = '_{rank:05d}_of_{num_proc:05d}' new_fingerprint: typing.Optional[str] = None desc: typing.Optional[str] = None try_original_type: typing.Optional[bool] = True )

Parameters

Apply a function to all the examples in the table (individually or in batches) and update the table. If your function returns a column that already exists, then it overwrites it.

You can specify whether the function should be batched or not with the batched parameter.

If the function is asynchronous, then map() will run your function in parallel, with up to one thousand simultaneous calls. It is recommended to use an asyncio.Semaphore in your function if you want to set a maximum number of operations that can run at the same time.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> def add_prefix(example):
...     example["text"] = "Review: " + example["text"]
...     return example
>>> ds = ds.map(add_prefix)
>>> ds[0:3]["text"]
['Review: compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .',
 'Review: the soundtrack alone is worth the price of admission .',
 'Review: rodriguez does a splendid job of racial profiling hollywood style--casting excellent latin actors of all ages--a trend long overdue .']

ds = ds.map(lambda example: tokenizer(example["text"]), batched=True)

ds = ds.map(add_prefix, num_proc=4)

filter

< source >

( function: typing.Optional[typing.Callable] = None with_indices: bool = False with_rank: bool = False input_columns: typing.Union[str, list[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 fn_kwargs: typing.Optional[dict] = None num_proc: typing.Optional[int] = None suffix_template: str = '_{rank:05d}_of_{num_proc:05d}' new_fingerprint: typing.Optional[str] = None desc: typing.Optional[str] = None )

Parameters

Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function.

If the function is asynchronous, then filter() will run your function in parallel, with up to one thousand simultaneous calls (configurable). It is recommended to use an asyncio.Semaphore in your function if you want to set a maximum number of operations that can run at the same time.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds.filter(lambda x: x["label"] == 1)
Dataset({
    features: ['text', 'label'],
    num_rows: 533
})

select

< source >

( indices: Iterable keep_in_memory: bool = False indices_cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 new_fingerprint: typing.Optional[str] = None )

Parameters

Create a new dataset with rows selected following the list/array of indices.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds.select(range(4))
Dataset({
    features: ['text', 'label'],
    num_rows: 4
})

sort

< source >

( column_names: typing.Union[str, collections.abc.Sequence[str]] reverse: typing.Union[bool, collections.abc.Sequence[bool]] = False null_placement: str = 'at_end' keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None indices_cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 new_fingerprint: typing.Optional[str] = None )

Parameters

Create a new dataset sorted according to a single or multiple columns.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset('cornell-movie-review-data/rotten_tomatoes', split='validation')
>>> ds['label'][:10]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> sorted_ds = ds.sort('label')
>>> sorted_ds['label'][:10]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> another_sorted_ds = ds.sort(['label', 'text'], reverse=[True, False])
>>> another_sorted_ds['label'][:10]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

shuffle

< source >

( seed: typing.Optional[int] = None generator: typing.Optional[numpy.random._generator.Generator] = None keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None indices_cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 new_fingerprint: typing.Optional[str] = None )

Parameters

Create a new Dataset where the rows are shuffled.

Currently shuffling uses numpy random generators. You can either supply a NumPy BitGenerator to use, or a seed to initiate NumPy’s default random generator (PCG64).

Shuffling takes the list of indices [0:len(my_dataset)] and shuffles it to create an indices mapping. However as soon as your Dataset has an indices mapping, the speed can become 10x slower. This is because there is an extra step to get the row index to read using the indices mapping, and most importantly, you aren’t reading contiguous chunks of data anymore. To restore the speed, you’d need to rewrite the entire dataset on your disk again using Dataset.flatten_indices(), which removes the indices mapping.

This may take a lot of time depending on the size of your dataset though:

my_dataset[0]  # fast
my_dataset = my_dataset.shuffle(seed=42)
my_dataset[0]  # up to 10x slower
my_dataset = my_dataset.flatten_indices()  # rewrite the shuffled dataset on disk as contiguous chunks of data
my_dataset[0]  # fast again

In this case, we recommend switching to an IterableDataset and leveraging its fast approximate shuffling method IterableDataset.shuffle().

It only shuffles the shards order and adds a shuffle buffer to your dataset, which keeps the speed of your dataset optimal:

my_iterable_dataset = my_dataset.to_iterable_dataset(num_shards=128)
for example in my_iterable_dataset:  # fast
    pass

shuffled_iterable_dataset = my_iterable_dataset.shuffle(seed=42, buffer_size=100)

for example in shuffled_iterable_dataset:  # as fast as before
    pass

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds['label'][:10]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

>>> shuffled_ds = ds.shuffle(seed=42)
>>> shuffled_ds['label'][:10]
[1, 0, 1, 1, 0, 0, 0, 0, 0, 0]

skip

< source >

( n: int )

Parameters

Create a new Dataset that skips the first n elements.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") list(ds.take(3)) [{'label': 1, 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, {'label': 1, 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, {'label': 1, 'text': 'effective but too-tepid biopic'}] ds = ds.skip(1) list(ds.take(3)) [{'label': 1, 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, {'label': 1, 'text': 'effective but too-tepid biopic'}, {'label': 1, 'text': 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .'}]

take

< source >

( n: int )

Parameters

Create a new Dataset with only the first n elements.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") small_ds = ds.take(2) list(small_ds) [{'label': 1, 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, {'label': 1, 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}]

train_test_split

< source >

( test_size: typing.Union[float, int, NoneType] = None train_size: typing.Union[float, int, NoneType] = None shuffle: bool = True stratify_by_column: typing.Optional[str] = None seed: typing.Optional[int] = None generator: typing.Optional[numpy.random._generator.Generator] = None keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None train_indices_cache_file_name: typing.Optional[str] = None test_indices_cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 train_new_fingerprint: typing.Optional[str] = None test_new_fingerprint: typing.Optional[str] = None )

Parameters

Return a dictionary (datasets.DatasetDict) with two random train and test subsets (train and test Dataset splits). Splits are created from the dataset according to test_size, train_size and shuffle.

This method is similar to scikit-learn train_test_split.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds = ds.train_test_split(test_size=0.2, shuffle=True)
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 852
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 214
    })
})

ds = ds.train_test_split(test_size=0.2, seed=42)

ds = load_dataset("imdb",split="train") Dataset({ features: ['text', 'label'], num_rows: 25000 }) ds = ds.train_test_split(test_size=0.2, stratify_by_column="label") DatasetDict({ train: Dataset({ features: ['text', 'label'], num_rows: 20000 }) test: Dataset({ features: ['text', 'label'], num_rows: 5000 }) })

shard

< source >

( num_shards: int index: int contiguous: bool = True keep_in_memory: bool = False indices_cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 )

Parameters

Return the index-nth shard from dataset split into num_shards pieces.

This shards deterministically. dataset.shard(n, i) splits the dataset into contiguous chunks, so it can be easily concatenated back together after processing. If len(dataset) % n == l, then the first l shards each have length (len(dataset) // n) + 1, and the remaining shards have length (len(dataset) // n). datasets.concatenate_datasets([dset.shard(n, i) for i in range(n)]) returns a dataset with the same order as the original.

Note: n should be less than or equal to the number of elements in the dataset, len(dataset).

On the other hand, dataset.shard(n, i, contiguous=False) contains all elements of the dataset whose index mod n = i.

Be sure to shard before using any randomizing operator (such as shuffle). It is best if the shard operator is used early in the dataset pipeline.

Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
>>> ds
Dataset({
    features: ['text', 'label'],
    num_rows: 1066
})
>>> ds.shard(num_shards=2, index=0)
Dataset({
    features: ['text', 'label'],
    num_rows: 533
})

repeat

< source >

( num_times: int )

Parameters

Create a new Dataset that repeats the underlying dataset num_times times.

Like itertools.repeat, repeating once just returns the full dataset.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") ds = ds.take(2).repeat(2) list(ds) [{'label': 1, 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, {'label': 1, 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, {'label': 1, 'text': 'effective but too-tepid biopic'}, {'label': 1, 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, {'label': 1, 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, {'label': 1, 'text': 'effective but too-tepid biopic'}]

to_tf_dataset

< source >

( batch_size: typing.Optional[int] = None columns: typing.Union[str, list[str], NoneType] = None shuffle: bool = False collate_fn: typing.Optional[typing.Callable] = None drop_remainder: bool = False collate_fn_args: typing.Optional[dict[str, typing.Any]] = None label_cols: typing.Union[str, list[str], NoneType] = None prefetch: bool = True num_workers: int = 0 num_test_batches: int = 20 )

Parameters

Create a tf.data.Dataset from the underlying Dataset. This tf.data.Dataset will load and collate batches from the Dataset, and is suitable for passing to methods like model.fit() or model.predict(). The dataset will yield dicts for both inputs and labels unless the dict would contain only a single key, in which case a raw tf.Tensor is yielded instead.

Example:

ds_train = ds["train"].to_tf_dataset( ... columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'], ... shuffle=True, ... batch_size=16, ... collate_fn=data_collator, ... )

push_to_hub

< source >

( repo_id: str config_name: str = 'default' set_default: typing.Optional[bool] = None split: typing.Optional[str] = None data_dir: typing.Optional[str] = None commit_message: typing.Optional[str] = None commit_description: typing.Optional[str] = None private: typing.Optional[bool] = None token: typing.Optional[str] = None revision: typing.Optional[str] = None create_pr: typing.Optional[bool] = False max_shard_size: typing.Union[str, int, NoneType] = None num_shards: typing.Optional[int] = None embed_external_files: bool = True )

Parameters

Pushes the dataset to the Hub as a Parquet dataset. The dataset is pushed using HTTP requests and does not require git or git-lfs to be installed.

The resulting Parquet files are self-contained by default. If your dataset contains Image, Audio or Video data, the Parquet files will store the bytes of your images, audio or video files. You can disable this by setting embed_external_files to False.

Example:

dataset.push_to_hub("/") dataset_dict.push_to_hub("/", private=True) dataset.push_to_hub("/", max_shard_size="1GB") dataset.push_to_hub("/", num_shards=1024)

If your dataset has multiple splits (e.g. train/validation/test):

>>> train_dataset.push_to_hub("<organization>/<dataset_id>", split="train")
>>> val_dataset.push_to_hub("<organization>/<dataset_id>", split="validation")

>>> dataset = load_dataset("<organization>/<dataset_id>")
>>> train_dataset = dataset["train"]
>>> val_dataset = dataset["validation"]

If you want to add a new configuration (or subset) to a dataset (e.g. if the dataset has multiple tasks/versions/languages):

>>> english_dataset.push_to_hub("<organization>/<dataset_id>", "en")
>>> french_dataset.push_to_hub("<organization>/<dataset_id>", "fr")

>>> english_dataset = load_dataset("<organization>/<dataset_id>", "en")
>>> french_dataset = load_dataset("<organization>/<dataset_id>", "fr")

save_to_disk

< source >

( dataset_path: typing.Union[str, bytes, os.PathLike] max_shard_size: typing.Union[str, int, NoneType] = None num_shards: typing.Optional[int] = None num_proc: typing.Optional[int] = None storage_options: typing.Optional[dict] = None )

Parameters

Saves a dataset to a dataset directory, or in a filesystem using any implementation of fsspec.spec.AbstractFileSystem.

For Image, Audio and Video data:

All the Image(), Audio() and Video() data are stored in the arrow files. If you want to store paths or urls, please use the Value(“string”) type.

Example:

>>> ds.save_to_disk("path/to/dataset/directory")
>>> ds.save_to_disk("path/to/dataset/directory", max_shard_size="1GB")
>>> ds.save_to_disk("path/to/dataset/directory", num_shards=1024)

load_from_disk

< source >

( dataset_path: typing.Union[str, bytes, os.PathLike] keep_in_memory: typing.Optional[bool] = None storage_options: typing.Optional[dict] = None ) → Dataset or DatasetDict

Parameters

Loads a dataset that was previously saved using save_to_disk from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.

Example:

ds = load_from_disk("path/to/dataset/directory")

flatten_indices

< source >

( keep_in_memory: bool = False cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 features: typing.Optional[datasets.features.features.Features] = None disable_nullable: bool = False num_proc: typing.Optional[int] = None new_fingerprint: typing.Optional[str] = None )

Parameters

Create and cache a new Dataset by flattening the indices mapping.
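A minimal sketch; after operations that create an indices mapping (such as shuffle() or select()), this rewrites the dataset as a contiguous Arrow table:

# Materialize the current selection/shuffle into contiguous rows,
# dropping the indices mapping so subsequent reads are fast again
ds = ds.flatten_indices()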

to_csv

< source >

( path_or_buf: typing.Union[str, bytes, os.PathLike, typing.BinaryIO] batch_size: typing.Optional[int] = None num_proc: typing.Optional[int] = None storage_options: typing.Optional[dict] = None **to_csv_kwargs ) → int

Parameters

The number of characters or bytes written.

Exports the dataset to CSV.

Example:

ds.to_csv("path/to/dataset/directory")

to_pandas

< source >

( batch_size: typing.Optional[int] = None batched: bool = False )

Parameters

Returns the dataset as a pandas.DataFrame. Can also return a generator for large datasets.
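A short sketch of both modes:

# Full conversion (loads the whole dataset into a single DataFrame)
df = ds.to_pandas()

# Batched conversion for large datasets: a generator of DataFrame chunks
for df_chunk in ds.to_pandas(batch_size=10_000, batched=True):
    ...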

to_dict

< source >

( batch_size: typing.Optional[int] = None )

Parameters

Returns the dataset as a Python dict. Can also return a generator for large datasets.
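A short sketch:

# Columnar dict, e.g. {"text": [...], "label": [...]}
data = ds.to_dict()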

to_json

< source >

( path_or_buf: typing.Union[str, bytes, os.PathLike, typing.BinaryIO] batch_size: typing.Optional[int] = None num_proc: typing.Optional[int] = None storage_options: typing.Optional[dict] = None **to_json_kwargs ) → int

Parameters

The number of characters or bytes written.

Export the dataset to JSON Lines or JSON.

The default output format is JSON Lines. To export to JSON, pass the lines=False argument and the desired orient.

Example:

ds.to_json("path/to/dataset/directory/filename.jsonl")

to_parquet

< source >

( path_or_buf: typing.Union[str, bytes, os.PathLike, typing.BinaryIO] batch_size: typing.Optional[int] = None storage_options: typing.Optional[dict] = None **parquet_writer_kwargs ) → int

Parameters

The number of characters or bytes written.

Exports the dataset to Parquet.

Example:

ds.to_parquet("path/to/dataset/directory")

to_sql

< source >

( name: str con: typing.Union[str, ForwardRef('sqlalchemy.engine.Connection'), ForwardRef('sqlalchemy.engine.Engine'), ForwardRef('sqlite3.Connection')] batch_size: typing.Optional[int] = None **sql_writer_kwargs ) → int

Parameters

The number of records written.

Exports the dataset to a SQL database.

Example:

ds.to_sql("data", "sqlite:///my_own_db.sql")

>>> import sqlite3
>>> con = sqlite3.connect("my_own_db.sql")
>>> with con:
...     ds.to_sql("data", con)

to_iterable_dataset

< source >

( num_shards: typing.Optional[int] = 1 )

Parameters

Get an datasets.IterableDataset from a map-style datasets.Dataset. This is equivalent to loading a dataset in streaming mode with datasets.load_dataset(), but much faster since the data is streamed from local files.

Contrary to map-style datasets, iterable datasets are lazy and can only be iterated over (e.g. using a for loop). Since they are read sequentially in training loops, iterable datasets are much faster than map-style datasets. All the transformations applied to iterable datasets like filtering or processing are done on-the-fly when you start iterating over the dataset.

Still, it is possible to shuffle an iterable dataset using datasets.IterableDataset.shuffle(). This is a fast approximate shuffling that works best if you have multiple shards and if you specify a buffer size that is big enough.

To get the best speed performance, make sure your dataset doesn’t have an indices mapping. If this is the case, the data are not read contiguously, which can be slow sometimes. You can use ds = ds.flatten_indices() to write your dataset in contiguous chunks of data and have optimal speed before switching to an iterable dataset.

Example:

Basic usage:

>>> ids = ds.to_iterable_dataset()
>>> for example in ids:
...     pass

With lazy filtering and processing:

>>> ids = ds.to_iterable_dataset()
>>> ids = ids.filter(filter_fn).map(process_fn)
>>> for example in ids:
...     pass

With sharding to enable efficient shuffling:

>>> ids = ds.to_iterable_dataset(num_shards=64)
>>> ids = ids.shuffle(buffer_size=10_000)
>>> for example in ids:
...     pass

With a PyTorch DataLoader:

>>> import torch
>>> ids = ds.to_iterable_dataset(num_shards=64)
>>> ids = ids.filter(filter_fn).map(process_fn)
>>> dataloader = torch.utils.data.DataLoader(ids, num_workers=4)
>>> for example in dataloader:
...     pass

With a PyTorch DataLoader and shuffling:

>>> import torch
>>> ids = ds.to_iterable_dataset(num_shards=64)
>>> ids = ids.shuffle(buffer_size=10_000)
>>> dataloader = torch.utils.data.DataLoader(ids, num_workers=4)
>>> for example in dataloader:
...     pass

In a distributed setup like PyTorch DDP, with a PyTorch DataLoader and shuffling:

>>> from datasets.distributed import split_dataset_by_node
>>> ids = ds.to_iterable_dataset(num_shards=512)
>>> ids = ids.shuffle(buffer_size=10_000, seed=42)
>>> ids = split_dataset_by_node(ids, world_size=8, rank=0)
>>> dataloader = torch.utils.data.DataLoader(ids, num_workers=4)
>>> for example in dataloader:
...     pass

With shuffling and multiple epochs:

>>> ids = ds.to_iterable_dataset(num_shards=64)
>>> ids = ids.shuffle(buffer_size=10_000, seed=42)
>>> for epoch in range(n_epochs):
...     ids.set_epoch(epoch)
...     for example in ids:
...         pass

Feel free to also use `IterableDataset.set_epoch()` when using a PyTorch DataLoader or in distributed setups.

add_faiss_index

< source >

( column: str index_name: typing.Optional[str] = None device: typing.Optional[int] = None string_factory: typing.Optional[str] = None metric_type: typing.Optional[int] = None custom_index: typing.Optional[ForwardRef('faiss.Index')] = None batch_size: int = 1000 train_size: typing.Optional[int] = None faiss_verbose: bool = False dtype = <class 'numpy.float32'> )

Parameters

Add a dense index using Faiss for fast retrieval. By default the index is built over the vectors of the specified column. You can specify device if you want to run it on GPU (device must be the GPU index). You can find more information about Faiss in the Faiss documentation.

Example:

>>> ds = datasets.load_dataset('crime_and_punish', split='train')
>>> ds_with_embeddings = ds.map(lambda example: {'embeddings': embed(example['line'])})
>>> ds_with_embeddings.add_faiss_index(column='embeddings')
>>> # query
>>> scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', embed('my new query'), k=10)
>>> # save index
>>> ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss')

>>> ds = datasets.load_dataset('crime_and_punish', split='train')
>>> # load index
>>> ds.load_faiss_index('embeddings', 'my_index.faiss')
>>> # query
>>> scores, retrieved_examples = ds.get_nearest_examples('embeddings', embed('my new query'), k=10)

add_faiss_index_from_external_arrays

< source >

( external_arrays: index_name: str device: typing.Optional[int] = None string_factory: typing.Optional[str] = None metric_type: typing.Optional[int] = None custom_index: typing.Optional[ForwardRef('faiss.Index')] = None batch_size: int = 1000 train_size: typing.Optional[int] = None faiss_verbose: bool = False dtype = <class 'numpy.float32'> )

Parameters

Add a dense index using Faiss for fast retrieval. The index is created using the vectors of external_arrays. You can specify device if you want to run it on GPU (device must be the GPU index). You can find more information about Faiss in the Faiss documentation.
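A minimal sketch, assuming the embeddings were computed outside the dataset (the array shape and index name below are hypothetical):

import numpy as np

# Hypothetical externally computed embeddings, one 768-dim vector per row of `ds`
external_embeddings = np.random.rand(len(ds), 768).astype(np.float32)
ds.add_faiss_index_from_external_arrays(
    external_arrays=external_embeddings,
    index_name="external_embeddings",
)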

save_faiss_index

< source >

( index_name: str file: typing.Union[str, pathlib.PurePath] storage_options: typing.Optional[dict] = None )

Parameters

Save a FaissIndex on disk.

load_faiss_index

< source >

( index_name: str file: typing.Union[str, pathlib.PurePath] device: typing.Union[int, list[int], NoneType] = None storage_options: typing.Optional[dict] = None )

Parameters

Load a FaissIndex from disk.

If you want to do additional configuration, you can access the Faiss index object with .get_index(index_name).faiss_index and adapt it to your needs.

add_elasticsearch_index

< source >

( column: str index_name: typing.Optional[str] = None host: typing.Optional[str] = None port: typing.Optional[int] = None es_client: typing.Optional[ForwardRef('elasticsearch.Elasticsearch')] = None es_index_name: typing.Optional[str] = None es_index_config: typing.Optional[dict] = None )

Parameters

Add a text index using ElasticSearch for fast retrieval. This is done in-place.

Example:

>>> es_client = elasticsearch.Elasticsearch()
>>> ds = datasets.load_dataset('crime_and_punish', split='train')
>>> ds.add_elasticsearch_index(column='line', es_client=es_client, es_index_name="my_es_index")
>>> scores, retrieved_examples = ds.get_nearest_examples('line', 'my new query', k=10)

load_elasticsearch_index

< source >

( index_name: str es_index_name: str host: typing.Optional[str] = None port: typing.Optional[int] = None es_client: typing.Optional[ForwardRef('Elasticsearch')] = None es_index_config: typing.Optional[dict] = None )

Parameters

Load an existing text index using ElasticSearch for fast retrieval.
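A minimal sketch, assuming an Elasticsearch index named "my_es_index" was previously created with add_elasticsearch_index():

import elasticsearch

es_client = elasticsearch.Elasticsearch()
# Reattach the existing Elasticsearch index under the identifier "line"
ds.load_elasticsearch_index("line", es_index_name="my_es_index", es_client=es_client)
scores, retrieved_examples = ds.get_nearest_examples("line", "my new query", k=10)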

List the names/identifiers of all the attached indexes.

get_index

< source >

( index_name: str )

Parameters

Return the index object for the given index_name.

drop_index

< source >

( index_name: str )

Parameters

Drop the index with the specified index_name.

search

< source >

( index_name: str query: typing.Union[str, ] k: int = 10 **kwargs ) → (scores, indices)

Parameters

Returns

(scores, indices)

A tuple of (scores, indices), where scores are the retrieval scores and indices are the indices of the retrieved examples in the dataset.

Find the nearest examples indices in the dataset to the query.
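A minimal sketch, assuming a Faiss index named "embeddings" has been added and query_embedding is a NumPy vector of matching dimension:

# Returns the retrieval scores and the row indices of the 5 closest examples
scores, indices = ds.search("embeddings", query_embedding, k=5)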

search_batch

< source >

( index_name: str queries: typing.Union[list[str], ] k: int = 10 **kwargs ) → (total_scores, total_indices)

Parameters

Returns

(total_scores, total_indices)

A tuple of (total_scores, total_indices), where total_scores are the retrieval scores per query and total_indices are the indices of the retrieved examples per query.

Find the nearest examples indices in the dataset to the query.

get_nearest_examples

< source >

( index_name: str query: typing.Union[str, ] k: int = 10 **kwargs ) → (scores, examples)

Parameters

Returns

(scores, examples)

A tuple of (scores, examples), where scores are the retrieval scores and examples are the retrieved examples.

Find the nearest examples in the dataset to the query.

get_nearest_examples_batch

< source >

( index_name: str queries: typing.Union[list[str], ] k: int = 10 **kwargs ) → (total_scores, total_examples)

Parameters

Returns

(total_scores, total_examples)

A tuple of (total_scores, total_examples), where total_scores are the retrieval scores per query and total_examples are the retrieved examples per query.

Find the nearest examples in the dataset to the query.

DatasetInfo object containing all the metadata in the dataset.

NamedSplit object corresponding to a named dataset split.

from_csv

< source >

( path_or_paths: typing.Union[str, bytes, os.PathLike, list[typing.Union[str, bytes, os.PathLike]]] split: typing.Optional[datasets.splits.NamedSplit] = None features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False num_proc: typing.Optional[int] = None **kwargs )

Parameters

Create Dataset from CSV file(s).

Example:

ds = Dataset.from_csv('path/to/dataset.csv')

from_json

< source >

( path_or_paths: typing.Union[str, bytes, os.PathLike, list[typing.Union[str, bytes, os.PathLike]]] split: typing.Optional[datasets.splits.NamedSplit] = None features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False field: typing.Optional[str] = None num_proc: typing.Optional[int] = None **kwargs )

Parameters

Create Dataset from JSON or JSON Lines file(s).

Example:

ds = Dataset.from_json('path/to/dataset.json')

from_parquet

< source >

( path_or_paths: typing.Union[str, bytes, os.PathLike, list[typing.Union[str, bytes, os.PathLike]]] split: typing.Optional[datasets.splits.NamedSplit] = None features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False columns: typing.Optional[list[str]] = None num_proc: typing.Optional[int] = None **kwargs )

Parameters

Create Dataset from Parquet file(s).

Example:

ds = Dataset.from_parquet('path/to/dataset.parquet')

from_text

< source >

( path_or_paths: typing.Union[str, bytes, os.PathLike, list[typing.Union[str, bytes, os.PathLike]]] split: typing.Optional[datasets.splits.NamedSplit] = None features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False num_proc: typing.Optional[int] = None **kwargs )

Parameters

Create Dataset from text file(s).

Example:

ds = Dataset.from_text('path/to/dataset.txt')

from_sql

< source >

( sql: typing.Union[str, ForwardRef('sqlalchemy.sql.Selectable')] con: typing.Union[str, ForwardRef('sqlalchemy.engine.Connection'), ForwardRef('sqlalchemy.engine.Engine'), ForwardRef('sqlite3.Connection')] features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False **kwargs )

Parameters

Create Dataset from SQL query or database table.

Example:

ds = Dataset.from_sql("test_data", "postgres:///db_name")

ds = Dataset.from_sql("SELECT sentence FROM test_data", "postgres:///db_name")

>>> from sqlalchemy import select, text
>>> stmt = select([text("sentence")]).select_from(text("test_data"))
>>> ds = Dataset.from_sql(stmt, "postgres:///db_name")

The returned dataset can only be cached if con is specified as a URI string.

align_labels_with_mapping

< source >

( label2id: dict label_column: str )

Parameters

Align the dataset's label ID and label name mapping to match an input label2id mapping. This is useful when you want to ensure that a model's predicted labels are aligned with the dataset. The alignment is done using the lowercase label names.

Example:

ds = load_dataset("nyu-mll/glue", "mnli", split="train")

>>> label2id = {'CONTRADICTION': 0, 'NEUTRAL': 1, 'ENTAILMENT': 2}
>>> ds_aligned = ds.align_labels_with_mapping(label2id, "label")

datasets.concatenate_datasets

< source >

( dsets: list info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None axis: int = 0 )

Parameters

Converts a list of Dataset with the same schema into a single Dataset.

Example:

ds3 = concatenate_datasets([ds1, ds2])

datasets.interleave_datasets

< source >

( datasets: list probabilities: typing.Optional[list[float]] = None seed: typing.Optional[int] = None info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None stopping_strategy: typing.Literal['first_exhausted', 'all_exhausted'] = 'first_exhausted' ) → Dataset or IterableDataset

Parameters

The return type depends on the input datasets parameter: Dataset if the input is a list of Dataset, IterableDataset if the input is a list of IterableDataset.

Interleave several datasets (sources) into a single dataset. The new dataset is constructed by alternating between the sources to get the examples.

You can use this function on a list of Dataset objects, or on a list of IterableDataset objects.

The resulting dataset ends when one of the source datasets runs out of examples, except when oversampling is used (stopping_strategy="all_exhausted"), in which case the resulting dataset ends when all datasets have run out of examples at least once.

Note for iterable datasets:

In a distributed setup or in PyTorch DataLoader workers, the stopping strategy is applied per process. Therefore the "first_exhausted" strategy on a sharded iterable dataset can generate fewer samples in total (up to 1 missing sample per subdataset per worker).

Example:

For regular datasets (map-style):

from datasets import Dataset, interleave_datasets
d1 = Dataset.from_dict({"a": [0, 1, 2]})
d2 = Dataset.from_dict({"a": [10, 11, 12]})
d3 = Dataset.from_dict({"a": [20, 21, 22]})
dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42, stopping_strategy="all_exhausted")
dataset["a"]
[10, 0, 11, 1, 2, 20, 12, 10, 0, 1, 2, 21, 0, 11, 1, 2, 0, 1, 12, 2, 10, 0, 22]
dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
dataset["a"]
[10, 0, 11, 1, 2]
dataset = interleave_datasets([d1, d2, d3])
dataset["a"]
[0, 10, 20, 1, 11, 21, 2, 12, 22]
dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
dataset["a"]
[0, 10, 20, 1, 11, 21, 2, 12, 22]
d1 = Dataset.from_dict({"a": [0, 1, 2]})
d2 = Dataset.from_dict({"a": [10, 11, 12, 13]})
d3 = Dataset.from_dict({"a": [20, 21, 22, 23, 24]})
dataset = interleave_datasets([d1, d2, d3])
dataset["a"]
[0, 10, 20, 1, 11, 21, 2, 12, 22]
dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
dataset["a"]
[0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 10, 24]
dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
dataset["a"]
[10, 0, 11, 1, 2]
dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42, stopping_strategy="all_exhausted")
dataset["a"]
[10, 0, 11, 1, 2, 20, 12, 13, ..., 0, 1, 2, 0, 24]

For datasets in streaming mode (iterable):

from datasets import load_dataset, interleave_datasets
d1 = load_dataset('allenai/c4', 'es', split='train', streaming=True)
d2 = load_dataset('allenai/c4', 'fr', split='train', streaming=True)
dataset = interleave_datasets([d1, d2])
iterator = iter(dataset)
next(iterator)
{'text': 'Comprar Zapatillas para niña en chancla con goma por...'}
next(iterator)
{'text': 'Le sacre de philippe ier, 23 mai 1059 - Compte Rendu...'}

datasets.distributed.split_dataset_by_node

< source >

( dataset: ~DatasetType rank: int world_size: int ) → Dataset or IterableDataset

Parameters

The dataset to be used on the node at rank rank.

Split a dataset for the node at rank rank in a pool of nodes of size world_size.

For map-style datasets:

Each node is assigned a chunk of data, e.g. rank 0 is given the first chunk of the dataset. To maximize data loading throughput, chunks are made of contiguous data on disk if possible.

For iterable datasets:

If the dataset has a number of shards that is a factor of world_size (i.e. if dataset.num_shards % world_size == 0), then the shards are evenly assigned across the nodes, which is the most optimized. Otherwise, each node keeps 1 example out of world_size, skipping the other examples.
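A minimal sketch of the usage, where rank and world_size would normally come from the distributed environment (for example torch.distributed or the RANK/WORLD_SIZE environment variables); the defaults below are only illustrative:

import os
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)
rank = int(os.environ.get("RANK", 0))              # illustrative default
world_size = int(os.environ.get("WORLD_SIZE", 1))  # illustrative default
ds_for_this_node = split_dataset_by_node(ds, rank=rank, world_size=world_size)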

When applying transforms on a dataset, the data are stored in cache files. The caching mechanism allows reloading an existing cache file if it has already been computed.

Reloading a dataset is possible since the cache files are named using the dataset fingerprint, which is updated after each transform.

If caching is disabled, the library will no longer reload cached dataset files when applying transforms to the datasets; instead, the transforms are recomputed and their results are not reused across sessions.

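As a short sketch, this caching behaviour can be toggled globally with the library's enable_caching() / disable_caching() helpers:

from datasets import disable_caching, enable_caching, is_caching_enabled

disable_caching()
is_caching_enabled()  # False: transforms are recomputed instead of reloaded from cache
enable_caching()
is_caching_enabled()  # True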

DatasetDict

Dictionary with split names as keys (‘train’, ‘test’ for example), and Dataset objects as values. It also has dataset transform methods like map or filter, to process all the splits at once.

A dictionary (dict of str: datasets.Dataset) with dataset transform methods (map, filter, etc.).
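For example, a DatasetDict behaves like a regular dictionary of splits, so individual splits can be accessed by name:

from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes")
list(ds.keys())  # e.g. ['train', 'validation', 'test']
ds["train"]      # a Dataset object for the train split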

The Apache Arrow tables backing each split.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds.data

The cache files containing the Apache Arrow table backing each split.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds.cache_files {'test': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-test.arrow'}], 'train': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-train.arrow'}], 'validation': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-validation.arrow'}]}

Number of columns in each split of the dataset.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds.num_columns {'test': 2, 'train': 2, 'validation': 2}

Number of rows in each split of the dataset.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds.num_rows {'test': 1066, 'train': 8530, 'validation': 1066}

Names of the columns in each split of the dataset.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds.column_names {'test': ['text', 'label'], 'train': ['text', 'label'], 'validation': ['text', 'label']}

Shape of each split of the dataset (number of rows, number of columns).

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds.shape {'test': (1066, 2), 'train': (8530, 2), 'validation': (1066, 2)}

unique

< source >

( column: str ) → Dict[str, list]

Parameters

Dictionary of unique elements in the given column.

Return a list of the unique elements in a column for each split.

This is implemented in the low-level backend and, as such, is very fast.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds.unique("label") {'test': [1, 0], 'train': [1, 0], 'validation': [1, 0]}

Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one. Be careful when running this command that no other process is currently using other cache files.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds.cleanup_cache_files() {'test': 0, 'train': 0, 'validation': 0}

map

< source >

( function: typing.Optional[typing.Callable] = None with_indices: bool = False with_rank: bool = False with_split: bool = False input_columns: typing.Union[str, list[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 drop_last_batch: bool = False remove_columns: typing.Union[str, list[str], NoneType] = None keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None cache_file_names: typing.Optional[dict[str, typing.Optional[str]]] = None writer_batch_size: typing.Optional[int] = 1000 features: typing.Optional[datasets.features.features.Features] = None disable_nullable: bool = False fn_kwargs: typing.Optional[dict] = None num_proc: typing.Optional[int] = None desc: typing.Optional[str] = None )

Parameters

Apply a function to all the examples in the table (individually or in batches) and update the table. If your function returns a column that already exists, then it overwrites it. The transformation is applied to all the datasets of the dataset dictionary.

You can specify whether the function should be batched or not with the batched parameter:

If the function is asynchronous, then map will run your function in parallel, with up to one thousand simultaneous calls. It is recommended to use an asyncio.Semaphore in your function if you want to set a maximum number of operations that can run at the same time.

Example:

from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes")
def add_prefix(example):
    example["text"] = "Review: " + example["text"]
    return example
ds = ds.map(add_prefix)
ds["train"][0:3]["text"]
['Review: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'Review: the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .',
 'Review: effective but too-tepid biopic']

ds = ds.map(lambda example: tokenizer(example["text"]), batched=True)

ds = ds.map(add_prefix, num_proc=4)
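As a sketch of the note above about asynchronous functions, a hypothetical async mapping function can bound its own concurrency with an asyncio.Semaphore (the semaphore size and the sleep stand-in are illustrative):

import asyncio

sem = asyncio.Semaphore(8)  # allow at most 8 concurrent calls (illustrative)

async def add_length(example):
    async with sem:
        await asyncio.sleep(0)  # stand-in for a real asynchronous API call
        example["length"] = len(example["text"])
        return example

ds = ds.map(add_length)  # map runs the async function concurrently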

filter

< source >

( function: typing.Optional[typing.Callable] = None with_indices: bool = False with_rank: bool = False input_columns: typing.Union[str, list[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None cache_file_names: typing.Optional[dict[str, typing.Optional[str]]] = None writer_batch_size: typing.Optional[int] = 1000 fn_kwargs: typing.Optional[dict] = None num_proc: typing.Optional[int] = None desc: typing.Optional[str] = None )

Parameters

Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function. The transformation is applied to all the datasets of the dataset dictionary.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds.filter(lambda x: x["label"] == 1) DatasetDict({ train: Dataset({ features: ['text', 'label'], num_rows: 4265 }) validation: Dataset({ features: ['text', 'label'], num_rows: 533 }) test: Dataset({ features: ['text', 'label'], num_rows: 533 }) })

sort

< source >

( column_names: typing.Union[str, collections.abc.Sequence[str]] reverse: typing.Union[bool, collections.abc.Sequence[bool]] = False null_placement: str = 'at_end' keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None indices_cache_file_names: typing.Optional[dict[str, typing.Optional[str]]] = None writer_batch_size: typing.Optional[int] = 1000 )

Parameters

Create a new dataset sorted according to a single or multiple columns.

Example:

from datasets import load_dataset ds = load_dataset('cornell-movie-review-data/rotten_tomatoes') ds['train']['label'][:10] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] sorted_ds = ds.sort('label') sorted_ds['train']['label'][:10] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] another_sorted_ds = ds.sort(['label', 'text'], reverse=[True, False]) another_sorted_ds['train']['label'][:10] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

shuffle

< source >

( seeds: typing.Union[int, dict[str, typing.Optional[int]], NoneType] = None seed: typing.Optional[int] = None generators: typing.Optional[dict[str, numpy.random._generator.Generator]] = None keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None indices_cache_file_names: typing.Optional[dict[str, typing.Optional[str]]] = None writer_batch_size: typing.Optional[int] = 1000 )

Parameters

Create a new Dataset where the rows are shuffled.

The transformation is applied to all the datasets of the dataset dictionary.

Currently shuffling uses numpy random generators. You can either supply a NumPy BitGenerator to use, or a seed to initiate NumPy’s default random generator (PCG64).

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds["train"]["label"][:10] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

shuffled_ds = ds.shuffle(seed=42) shuffled_ds["train"]["label"][:10] [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]

set_format

< source >

( type: typing.Optional[str] = None columns: typing.Optional[list] = None output_all_columns: bool = False **format_kwargs )

Parameters

Set __getitem__ return format (type and columns). The format is set for every dataset in the dataset dictionary.

It is possible to call map after calling set_format. Since map may add new columns, the list of formatted columns gets updated accordingly. In this case, if you apply map on a dataset to add a new column, then this column will be formatted:

new formatted columns = (all columns - previously unformatted columns)

Example:

from datasets import load_dataset
from transformers import AutoTokenizer
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding=True), batched=True)
ds.set_format(type="numpy", columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
ds["train"].format
{'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': 'numpy'}

Reset __getitem__ return format to python objects and all columns. The transformation is applied to all the datasets of the dataset dictionary.

Same as self.set_format()

Example:

from datasets import load_dataset
from transformers import AutoTokenizer
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding=True), batched=True)
ds.set_format(type="numpy", columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
ds["train"].format
{'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': 'numpy'}
ds.reset_format()
ds["train"].format
{'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': None}

formatted_as

< source >

( type: typing.Optional[str] = None columns: typing.Optional[list] = None output_all_columns: bool = False **format_kwargs )

Parameters

To be used in a with statement. Set __getitem__ return format (type and columns). The transformation is applied to all the datasets of the dataset dictionary.
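A minimal sketch: inside the with block the chosen format applies, and the previous format is restored on exit:

from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes")
with ds.formatted_as(type="numpy", columns=["label"]):
    labels = ds["train"]["label"]  # returned in NumPy format inside the block
# outside the block, the previous format is active again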

with_format

< source >

( type: typing.Optional[str] = None columns: typing.Optional[list] = None output_all_columns: bool = False **format_kwargs )

Parameters

Set __getitem__ return format (type and columns). The data formatting is applied on-the-fly. The format type (for example “numpy”) is used to format batches when using __getitem__. The format is set for every dataset in the dataset dictionary.

It’s also possible to use custom transforms for formatting using with_transform().

Contrary to set_format(), with_format returns a new DatasetDict object with new Dataset objects.

Example:

from datasets import load_dataset from transformers import AutoTokenizer ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True) ds["train"].format {'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'], 'format_kwargs': {}, 'output_all_columns': False, 'type': None} ds = ds.with_format("torch") ds["train"].format {'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'], 'format_kwargs': {}, 'output_all_columns': False, 'type': 'torch'} ds["train"][0] {'text': 'compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .', 'label': tensor(1), 'input_ids': tensor([ 101, 18027, 16310, 16001, 1103, 9321, 178, 11604, 7235, 6617, 1742, 2165, 2820, 1206, 6588, 22572, 12937, 1811, 2153, 1105, 1147, 12890, 19587, 6463, 1105, 15026, 1482, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])}

with_transform

< source >

( transform: typing.Optional[typing.Callable] columns: typing.Optional[list] = None output_all_columns: bool = False )

Parameters

Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called. The transform is set for every dataset in the dataset dictionary

As set_format(), this can be reset using reset_format().

Contrary to set_transform(), with_transform returns a new DatasetDict object with new Dataset objects.

Example:

from datasets import load_dataset from transformers import AutoTokenizer ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") def encode(example): ... return tokenizer(example['text'], truncation=True, padding=True, return_tensors="pt") ds = ds.with_transform(encode) ds["train"][0] {'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), 'input_ids': tensor([ 101, 1103, 2067, 1110, 17348, 1106, 1129, 1103, 6880, 1432, 112, 188, 1207, 107, 14255, 1389, 107, 1105, 1115, 1119, 112, 188, 1280, 1106, 1294, 170, 24194, 1256, 3407, 1190, 170, 11791, 5253, 188, 1732, 7200, 10947, 12606, 2895, 117, 179, 7766, 118, 172, 15554, 1181, 3498, 6961, 3263, 1137, 188, 1566, 7912, 14516, 6997, 119, 102]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])}

Flatten the Apache Arrow Table of each split (nested features are flattened). Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.

Example:

from datasets import load_dataset ds = load_dataset("rajpurkar/squad") ds["train"].features {'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None), 'context': Value(dtype='string', id=None), 'id': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None)} ds.flatten() DatasetDict({ train: Dataset({ features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'], num_rows: 87599 }) validation: Dataset({ features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'], num_rows: 10570 }) })

cast

< source >

( features: Features )

Parameters

Cast the dataset to a new set of features. The transformation is applied to all the datasets of the dataset dictionary.

Example:

from datasets import load_dataset, ClassLabel, Value ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds["train"].features {'label': ClassLabel(names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)} new_features = ds["train"].features.copy() new_features['label'] = ClassLabel(names=['bad', 'good']) new_features['text'] = Value('large_string') ds = ds.cast(new_features) ds["train"].features {'label': ClassLabel(names=['bad', 'good'], id=None), 'text': Value(dtype='large_string', id=None)}

cast_column

< source >

( column: str feature )

Parameters

Cast column to feature for decoding.

Example:

from datasets import load_dataset, ClassLabel ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds["train"].features {'label': ClassLabel(names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)} ds = ds.cast_column('label', ClassLabel(names=['bad', 'good'])) ds["train"].features {'label': ClassLabel(names=['bad', 'good'], id=None), 'text': Value(dtype='string', id=None)}

remove_columns

< source >

( column_names: typing.Union[str, list[str]] ) → DatasetDict

Parameters

A copy of the dataset object without the columns to remove.

Remove one or several column(s) from each split in the dataset and the features associated to the column(s).

The transformation is applied to all the splits of the dataset dictionary.

You can also remove a column using map() with remove_columns but the present method doesn’t copy the data of the remaining columns and is thus faster.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds = ds.remove_columns("label") DatasetDict({ train: Dataset({ features: ['text'], num_rows: 8530 }) validation: Dataset({ features: ['text'], num_rows: 1066 }) test: Dataset({ features: ['text'], num_rows: 1066 }) })

rename_column

< source >

( original_column_name: str new_column_name: str )

Parameters

Rename a column in the dataset and move the features associated to the original column under the new column name. The transformation is applied to all the datasets of the dataset dictionary.

You can also rename a column using map() with remove_columns, but the present method moves the features associated with the original column under the new column name and doesn’t copy the data to a new dataset, which makes it faster.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds = ds.rename_column("label", "label_new") DatasetDict({ train: Dataset({ features: ['text', 'label_new'], num_rows: 8530 }) validation: Dataset({ features: ['text', 'label_new'], num_rows: 1066 }) test: Dataset({ features: ['text', 'label_new'], num_rows: 1066 }) })

rename_columns

< source >

( column_mapping: dict ) → DatasetDict

Parameters

A copy of the dataset with renamed columns.

Rename several columns in the dataset, and move the features associated to the original columns under the new column names. The transformation is applied to all the datasets of the dataset dictionary.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds.rename_columns({'text': 'text_new', 'label': 'label_new'}) DatasetDict({ train: Dataset({ features: ['text_new', 'label_new'], num_rows: 8530 }) validation: Dataset({ features: ['text_new', 'label_new'], num_rows: 1066 }) test: Dataset({ features: ['text_new', 'label_new'], num_rows: 1066 }) })

select_columns

< source >

( column_names: typing.Union[str, list[str]] )

Parameters

Select one or several column(s) from each split in the dataset and the features associated to the column(s).

The transformation is applied to all the splits of the dataset dictionary.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") ds.select_columns("text") DatasetDict({ train: Dataset({ features: ['text'], num_rows: 8530 }) validation: Dataset({ features: ['text'], num_rows: 1066 }) test: Dataset({ features: ['text'], num_rows: 1066 }) })

class_encode_column

< source >

( column: str include_nulls: bool = False )

Parameters

Casts the given column as ClassLabel and updates the tables.

Example:

from datasets import load_dataset ds = load_dataset("boolq") ds["train"].features {'answer': Value(dtype='bool', id=None), 'passage': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None)} ds = ds.class_encode_column("answer") ds["train"].features {'answer': ClassLabel(num_classes=2, names=['False', 'True'], id=None), 'passage': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None)}

push_to_hub

< source >

( repo_id config_name: str = 'default' set_default: typing.Optional[bool] = None data_dir: typing.Optional[str] = None commit_message: typing.Optional[str] = None commit_description: typing.Optional[str] = None private: typing.Optional[bool] = None token: typing.Optional[str] = None revision: typing.Optional[str] = None create_pr: typing.Optional[bool] = False max_shard_size: typing.Union[str, int, NoneType] = None num_shards: typing.Optional[dict[str, int]] = None embed_external_files: bool = True )

Parameters

Pushes the DatasetDict to the hub as a Parquet dataset. The DatasetDict is pushed using HTTP requests and does not require git or git-lfs to be installed.

Each dataset split will be pushed independently. The pushed dataset will keep the original split names.

The resulting Parquet files are self-contained by default: if your dataset contains Image or Audio data, the Parquet files will store the bytes of your images or audio files. You can disable this by setting embed_external_files to False.

Example:

dataset_dict.push_to_hub("<organization>/<dataset_id>")
dataset_dict.push_to_hub("<organization>/<dataset_id>", private=True)
dataset_dict.push_to_hub("<organization>/<dataset_id>", max_shard_size="1GB")
dataset_dict.push_to_hub("<organization>/<dataset_id>", num_shards={"train": 1024, "test": 8})

If you want to add a new configuration (or subset) to a dataset (e.g. if the dataset has multiple tasks/versions/languages):

english_dataset.push_to_hub("<organization>/<dataset_id>", "en")
french_dataset.push_to_hub("<organization>/<dataset_id>", "fr")

english_dataset = load_dataset("<organization>/<dataset_id>", "en")
french_dataset = load_dataset("<organization>/<dataset_id>", "fr")
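A hedged sketch of pushing to a specific branch or opening a pull request instead of committing to the default branch (the repository id is a placeholder):

dataset_dict.push_to_hub("<organization>/<dataset_id>", revision="dev")  # push to the "dev" branch
dataset_dict.push_to_hub("<organization>/<dataset_id>", create_pr=True)  # open a pull request instead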

save_to_disk

< source >

( dataset_dict_path: typing.Union[str, bytes, os.PathLike] max_shard_size: typing.Union[str, int, NoneType] = None num_shards: typing.Optional[dict[str, int]] = None num_proc: typing.Optional[int] = None storage_options: typing.Optional[dict] = None )

Parameters

Saves a dataset dict to a filesystem using fsspec.spec.AbstractFileSystem.

For Image, Audio and Video data:

All the Image(), Audio() and Video() data are stored in the arrow files. If you want to store paths or urls, please use the Value(“string”) type.

Example:

dataset_dict.save_to_disk("path/to/dataset/directory")
dataset_dict.save_to_disk("path/to/dataset/directory", max_shard_size="1GB")
dataset_dict.save_to_disk("path/to/dataset/directory", num_shards={"train": 1024, "test": 8})
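Since saving goes through fsspec, a remote target can also be used by passing the filesystem's storage_options; the bucket name and credentials below are placeholders for an s3fs-backed path:

storage_options = {"key": "<aws_access_key_id>", "secret": "<aws_secret_access_key>"}  # placeholder credentials
dataset_dict.save_to_disk("s3://my-bucket/dataset/directory", storage_options=storage_options)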

load_from_disk

< source >

( dataset_dict_path: typing.Union[str, bytes, os.PathLike] keep_in_memory: typing.Optional[bool] = None storage_options: typing.Optional[dict] = None )

Parameters

Load a dataset that was previously saved using save_to_disk from a filesystem using fsspec.spec.AbstractFileSystem.

Example:

ds = load_from_disk('path/to/dataset/directory')

from_csv

< source >

( path_or_paths: dict features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False **kwargs )

Parameters

Create DatasetDict from CSV file(s).

Example:

from datasets import DatasetDict ds = DatasetDict.from_csv({'train': 'path/to/dataset.csv'})

from_json

< source >

( path_or_paths: dict features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False **kwargs )

Parameters

Create DatasetDict from JSON Lines file(s).

Example:

from datasets import DatasetDict ds = DatasetDict.from_json({'train': 'path/to/dataset.json'})

from_parquet

< source >

( path_or_paths: dict features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False columns: typing.Optional[list[str]] = None **kwargs )

Parameters

Create DatasetDict from Parquet file(s).

Example:

from datasets import DatasetDict ds = DatasetDict.from_parquet({'train': 'path/to/dataset/parquet'})

from_text

< source >

( path_or_paths: dict features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False **kwargs )

Parameters

Create DatasetDict from text file(s).

Example:

from datasets import DatasetDict ds = DatasetDict.from_text({'train': 'path/to/dataset.txt'})

IterableDataset

The base class IterableDataset implements an iterable Dataset backed by python generators.

class datasets.IterableDataset

< source >

( ex_iterable: _BaseExamplesIterable info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None formatting: typing.Optional[datasets.iterable_dataset.FormattingConfig] = None shuffling: typing.Optional[datasets.iterable_dataset.ShufflingConfig] = None distributed: typing.Optional[datasets.iterable_dataset.DistributedConfig] = None token_per_repo_id: typing.Optional[dict[str, typing.Union[str, bool, NoneType]]] = None )

A Dataset backed by an iterable.

from_generator

< source >

( generator: typing.Callable features: typing.Optional[datasets.features.features.Features] = None gen_kwargs: typing.Optional[dict] = None split: NamedSplit = NamedSplit('train') ) → IterableDataset

Parameters

Create an Iterable Dataset from a generator.

Example:

def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = IterableDataset.from_generator(gen)

def gen(shards):
    for shard in shards:
        with open(shard) as f:
            for line in f:
                yield {"line": line}

shards = [f"data{i}.txt" for i in range(32)]
ds = IterableDataset.from_generator(gen, gen_kwargs={"shards": shards})
ds = ds.shuffle(seed=42, buffer_size=10_000)
from torch.utils.data import DataLoader
dataloader = DataLoader(ds.with_format("torch"), num_workers=4)

remove_columns

< source >

( column_names: typing.Union[str, list[str]] ) → IterableDataset

Parameters

A copy of the dataset object without the columns to remove.

Remove one or several column(s) in the dataset and the features associated to them. The removal is done on-the-fly on the examples when iterating over the dataset.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) next(iter(ds)) {'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1} ds = ds.remove_columns("label") next(iter(ds)) {'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}

select_columns

< source >

( column_names: typing.Union[str, list[str]] ) → IterableDataset

Parameters

A copy of the dataset object with selected columns.

Select one or several column(s) in the dataset and the features associated to them. The selection is done on-the-fly on the examples when iterating over the dataset.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) next(iter(ds)) {'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1} ds = ds.select_columns("text") next(iter(ds)) {'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}

cast_column

< source >

( column: str feature: typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.LargeList, datasets.features.features.Sequence, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image, datasets.features.video.Video, datasets.features.pdf.Pdf] ) → IterableDataset

Parameters

Cast column to feature for decoding.

Example:

from datasets import load_dataset, Audio ds = load_dataset("PolyAI/minds14", name="en-US", split="train", streaming=True) ds.features {'audio': Audio(sampling_rate=8000, mono=True, decode=True, id=None), 'english_transcription': Value(dtype='string', id=None), 'intent_class': ClassLabel(num_classes=14, names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill'], id=None), 'lang_id': ClassLabel(num_classes=14, names=['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN'], id=None), 'path': Value(dtype='string', id=None), 'transcription': Value(dtype='string', id=None)} ds = ds.cast_column("audio", Audio(sampling_rate=16000)) ds.features {'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None), 'english_transcription': Value(dtype='string', id=None), 'intent_class': ClassLabel(num_classes=14, names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill'], id=None), 'lang_id': ClassLabel(num_classes=14, names=['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN'], id=None), 'path': Value(dtype='string', id=None), 'transcription': Value(dtype='string', id=None)}

cast

< source >

( features: Features ) → IterableDataset

Parameters

A copy of the dataset with casted features.

Cast the dataset to a new set of features.

Example:

from datasets import load_dataset, ClassLabel, Value ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) ds.features {'label': ClassLabel(names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)} new_features = ds.features.copy() new_features["label"] = ClassLabel(names=["bad", "good"]) new_features["text"] = Value("large_string") ds = ds.cast(new_features) ds.features {'label': ClassLabel(names=['bad', 'good'], id=None), 'text': Value(dtype='large_string', id=None)}

decode

< source >

( enable: bool = True num_threads: int = 0 ) → IterableDataset

Parameters

A copy of the dataset with casted features.

Enable or disable the dataset features decoding for audio, image, video.

When enabled (default), media types are decoded into their corresponding Python objects (for example, images are returned as PIL images rather than raw paths or bytes).

You can enable multithreading using num_threads. This is especially useful to speed up remote data streaming. However it can be slower than num_threads=0 for local data on fast disks.

Disabling decoding is useful if you want to iterate on the paths or bytes of the media files without actually decoding their content. To disable decoding you can use .decode(False), which is equivalent to calling .cast() or .cast_column() with all the Audio, Image and Video types set to decode=False.

Examples:

Disable decoding:

from datasets import load_dataset ds = load_dataset("sshh12/planet-textures", split="train", streaming=True) next(iter(ds)) {'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=2048x1024>, 'text': 'A distant celestial object with an icy crust, displaying a light blue shade, covered with round pits and rugged terrains.'} ds = ds.decode(False) ds.features {'image': Image(mode=None, decode=False, id=None), 'text': Value(dtype='string', id=None)} next(iter(ds)) { 'image': { 'path': 'hf://datasets/sshh12/planet-textures@69dc4cef7a5c4b2cfe387727ec8ea73d4bff7302/train/textures/0000.png', 'bytes': None }, 'text': 'A distant celestial object with an icy crust, displaying a light blue shade, covered with round pits and rugged terrains.' }

Speed up streaming with multithreading:

import os from datasets import load_dataset from tqdm import tqdm ds = load_dataset("sshh12/planet-textures", split="train", streaming=True) num_threads = min(32, (os.cpu_count() or 1) + 4) ds = ds.decode(num_threads=num_threads) for _ in tqdm(ds):
... ...

iter

< source >

( batch_size: int drop_last_batch: bool = False )

Parameters

Iterate through the batches of size batch_size.
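A short sketch: each yielded batch is a dict mapping column names to lists of batch_size values:

from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)
for batch in ds.iter(batch_size=8, drop_last_batch=True):
    print(len(batch["text"]))  # 8
    break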

map

< source >

( function: typing.Optional[typing.Callable] = None with_indices: bool = False input_columns: typing.Union[str, list[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 drop_last_batch: bool = False remove_columns: typing.Union[str, list[str], NoneType] = None features: typing.Optional[datasets.features.features.Features] = None fn_kwargs: typing.Optional[dict] = None )

Parameters

Apply a function to all the examples in the iterable dataset (individually or in batches) and update them. If your function returns a column that already exists, then it overwrites it. The function is applied on-the-fly on the examples when iterating over the dataset.

You can specify whether the function should be batched or not with the batched parameter:

If the function is asynchronous, then map will run your function in parallel, with up to one thousand simultaneous calls. It is recommended to use an asyncio.Semaphore in your function if you want to set a maximum number of operations that can run at the same time.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) def add_prefix(example): ... example["text"] = "Review: " + example["text"] ... return example ds = ds.map(add_prefix) list(ds.take(3)) [{'label': 1, 'text': 'Review: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, {'label': 1, 'text': 'Review: the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, {'label': 1, 'text': 'Review: effective but too-tepid biopic'}]

rename_column

< source >

( original_column_name: str new_column_name: str ) → IterableDataset

Parameters

A copy of the dataset with a renamed column.

Rename a column in the dataset, and move the features associated to the original column under the new column name.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) next(iter(ds)) {'label': 1, 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'} ds = ds.rename_column("text", "movie_review") next(iter(ds)) {'label': 1, 'movie_review': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}

filter

< source >

( function: typing.Optional[typing.Callable] = None with_indices = False input_columns: typing.Union[str, list[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 fn_kwargs: typing.Optional[dict] = None )

Parameters

Apply a filter function to all the elements so that the dataset only includes examples according to the filter function. The filtering is done on-the-fly when iterating over the dataset.

If the function is asynchronous, then filter will run your function in parallel, with up to one thousand simultaneous calls (configurable). It is recommended to use an asyncio.Semaphore in your function if you want to set a maximum number of operations that can run at the same time.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) ds = ds.filter(lambda x: x["label"] == 0) list(ds.take(3)) [{'label': 0, 'movie_review': 'simplistic , silly and tedious .'}, {'label': 0, 'movie_review': "it's so laddish and juvenile , only teenage boys could possibly find it funny ."}, {'label': 0, 'movie_review': 'exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable .'}]

shuffle

< source >

( seed = None generator: typing.Optional[numpy.random._generator.Generator] = None buffer_size: int = 1000 )

Parameters

Randomly shuffles the elements of this dataset.

This dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.

For instance, if your dataset contains 10,000 elements but buffer_size is set to 1000, then shuffle will initially select a random element from only the first 1000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1000 element buffer.

If the dataset is made of several shards, it also shuffles the order of the shards. However, if the order has been fixed by using skip() or take(), then the order of the shards is kept unchanged.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) list(ds.take(3)) [{'label': 1, 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, {'label': 1, 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, {'label': 1, 'text': 'effective but too-tepid biopic'}] shuffled_ds = ds.shuffle(seed=42) list(shuffled_ds.take(3)) [{'label': 1, 'text': "a sports movie with action that's exciting on the field and a story you care about off it ."}, {'label': 1, 'text': 'at its best , the good girl is a refreshingly adult take on adultery . . .'}, {'label': 1, 'text': "sam jones became a very lucky filmmaker the day wilco got dropped from their record label , proving that one man's ruin may be another's fortune ."}]

batch

< source >

( batch_size: int drop_last_batch: bool = False )

Parameters

Group samples from the dataset into batches.

Example:

ds = load_dataset("some_dataset", streaming=True) batched_ds = ds.batch(batch_size=32)

skip

< source >

( n: int )

Parameters

Create a new IterableDataset that skips the first n elements.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) list(ds.take(3)) [{'label': 1, 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, {'label': 1, 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, {'label': 1, 'text': 'effective but too-tepid biopic'}] ds = ds.skip(1) list(ds.take(3)) [{'label': 1, 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, {'label': 1, 'text': 'effective but too-tepid biopic'}, {'label': 1, 'text': 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .'}]

take

< source >

( n: int )

Parameters

Create a new IterableDataset with only the first n elements.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) small_ds = ds.take(2) list(small_ds) [{'label': 1, 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, {'label': 1, 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}]

shard

< source >

( num_shards: int index: int contiguous: bool = True )

Parameters

Return the index-nth shard from dataset split into num_shards pieces.

This shards deterministically. dataset.shard(n, i) splits the dataset into contiguous chunks, so it can be easily concatenated back together after processing. If dataset.num_shards % n == l, then the first l datasets each have (dataset.num_shards // n) + 1 shards, and the remaining datasets have (dataset.num_shards // n) shards. datasets.concatenate_datasets([dset.shard(n, i) for i in range(n)]) returns a dataset with the same order as the original. In particular, dataset.shard(dataset.num_shards, i) returns a dataset with 1 shard.

Note: n should be less than or equal to the number of shards in the dataset, dataset.num_shards.

On the other hand, dataset.shard(n, i, contiguous=False) contains all the shards of the dataset whose index mod n = i.

Be sure to shard before using any randomizing operator (such as shuffle). It is best if the shard operator is used early in the dataset pipeline.

Example:

from datasets import load_dataset ds = load_dataset("amazon_polarity", split="train", streaming=True) ds Dataset({ features: ['label', 'title', 'content'], num_shards: 4 }) ds.shard(num_shards=2, index=0) Dataset({ features: ['label', 'title', 'content'], num_shards: 2 })

repeat

< source >

( num_times: typing.Optional[int] )

Parameters

Create a new IterableDataset that repeats the underlying dataset num_times times.

N.B. The effect of calling shuffle after repeat depends significantly on buffer size. With buffer_size 1, duplicate data is never seen in the same iteration, even after shuffling: ds.repeat(n).shuffle(seed=42, buffer_size=1) is equivalent to ds.shuffle(seed=42, buffer_size=1).repeat(n), and only shuffles shard orders within each iteration. With buffer size >= (num samples in the dataset * num_times), we get full shuffling of the repeated data, i.e. we can observe duplicates in the same iteration.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") ds = ds.take(2).repeat(2) list(ds) [{'label': 1, 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, {'label': 1, 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, {'label': 1, 'text': 'effective but too-tepid biopic'}, {'label': 1, 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, {'label': 1, 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, {'label': 1, 'text': 'effective but too-tepid biopic'}]

Load the state_dict of the dataset. The iteration will restart at the next example from when the state was saved.

Resuming returns exactly where the checkpoint was saved except in two cases:

  1. examples from shuffle buffers are lost when resuming and the buffers are refilled with new data
  2. combinations of .with_format(arrow) and batched .map() may skip one batch.

Example:

from datasets import Dataset, concatenate_datasets ds = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3) for idx, example in enumerate(ds): ... print(example) ... if idx == 2: ... state_dict = ds.state_dict() ... print("checkpoint") ... break ds.load_state_dict(state_dict) print(f"restart from checkpoint") for example in ds: ... print(example)

which returns:

{'a': 0} {'a': 1} {'a': 2} checkpoint restart from checkpoint {'a': 3} {'a': 4} {'a': 5}

from torchdata.stateful_dataloader import StatefulDataLoader ds = load_dataset("deepmind/code_contests", streaming=True, split="train") dataloader = StatefulDataLoader(ds, batch_size=32, num_workers=4)

state_dict = dataloader.state_dict()

dataloader.load_state_dict(state_dict)

Get the current state_dict of the dataset. It corresponds to the state at the latest example it yielded.

Resuming returns exactly where the checkpoint was saved except in two cases:

  1. examples from shuffle buffers are lost when resuming and the buffers are refilled with new data
  2. combinations of .with_format(arrow) and batched .map() may skip one batch.

Example:

from datasets import Dataset, concatenate_datasets ds = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3) for idx, example in enumerate(ds): ... print(example) ... if idx == 2: ... state_dict = ds.state_dict() ... print("checkpoint") ... break ds.load_state_dict(state_dict) print(f"restart from checkpoint") for example in ds: ... print(example)

which returns:

{'a': 0} {'a': 1} {'a': 2} checkpoint restart from checkpoint {'a': 3} {'a': 4} {'a': 5}

from torchdata.stateful_dataloader import StatefulDataLoader ds = load_dataset("deepmind/code_contests", streaming=True, split="train") dataloader = StatefulDataLoader(ds, batch_size=32, num_workers=4)

state_dict = dataloader.state_dict()

dataloader.load_state_dict(state_dict)

DatasetInfo object containing all the metadata in the dataset.

NamedSplit object corresponding to a named dataset split.

IterableDatasetDict

Dictionary with split names as keys (‘train’, ‘test’ for example), and IterableDataset objects as values.

map

< source >

( function: typing.Optional[typing.Callable] = None with_indices: bool = False with_split: bool = False input_columns: typing.Union[str, list[str], NoneType] = None batched: bool = False batch_size: int = 1000 drop_last_batch: bool = False remove_columns: typing.Union[str, list[str], NoneType] = None fn_kwargs: typing.Optional[dict] = None )

Parameters

Apply a function to all the examples in the iterable dataset (individually or in batches) and update them. If your function returns a column that already exists, then it overwrites it. The function is applied on-the-fly on the examples when iterating over the dataset. The transformation is applied to all the datasets of the dataset dictionary.

You can specify whether the function should be batched or not with the batched parameter:

If the function is asynchronous, then map will run your function in parallel, with up to one thousand simultaneous calls. It is recommended to use an asyncio.Semaphore in your function if you want to set a maximum number of operations that can run at the same time.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) def add_prefix(example): ... example["text"] = "Review: " + example["text"] ... return example ds = ds.map(add_prefix) next(iter(ds["train"])) {'label': 1, 'text': 'Review: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}

filter

< source >

( function: typing.Optional[typing.Callable] = None with_indices = False input_columns: typing.Union[str, list[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 fn_kwargs: typing.Optional[dict] = None )

Parameters

Apply a filter function to all the elements so that the dataset only includes examples according to the filter function. The filtering is done on-the-fly when iterating over the dataset. The filtering is applied to all the datasets of the dataset dictionary.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) ds = ds.filter(lambda x: x["label"] == 0) list(ds["train"].take(3)) [{'label': 0, 'text': 'Review: simplistic , silly and tedious .'}, {'label': 0, 'text': "Review: it's so laddish and juvenile , only teenage boys could possibly find it funny ."}, {'label': 0, 'text': 'Review: exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable .'}]

shuffle

< source >

( seed = None generator: typing.Optional[numpy.random._generator.Generator] = None buffer_size: int = 1000 )

Parameters

Randomly shuffles the elements of this dataset. The shuffling is applied to all the datasets of the dataset dictionary.

This dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.

For instance, if your dataset contains 10,000 elements but buffer_size is set to 1000, then shuffle will initially select a random element from only the first 1000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1000 element buffer.

If the dataset is made of several shards, it also shuffles the order of the shards. However, if the order has been fixed by using skip() or take(), then the order of the shards is kept unchanged.

Example:

from datasets import load_dataset ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) list(ds["train"].take(3)) [{'label': 1, 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, {'label': 1, 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}, {'label': 1, 'text': 'effective but too-tepid biopic'}] ds = ds.shuffle(seed=42) list(ds["train"].take(3)) [{'label': 1, 'text': "a sports movie with action that's exciting on the field and a story you care about off it ."}, {'label': 1, 'text': 'at its best , the good girl is a refreshingly adult take on adultery . . .'}, {'label': 1, 'text': "sam jones became a very lucky filmmaker the day wilco got dropped from their record label , proving that one man's ruin may be another's fortune ."}]

with_format

< source >

( type: typing.Optional[str] = None )

Parameters

Return a dataset with the specified format.

Example:

from datasets import load_dataset from transformers import AutoTokenizer ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation", streaming=True) tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True) ds = ds.with_format("torch") next(iter(ds)) {'text': 'compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .', 'label': tensor(1), 'input_ids': tensor([ 101, 18027, 16310, 16001, 1103, 9321, 178, 11604, 7235, 6617, 1742, 2165, 2820, 1206, 6588, 22572, 12937, 1811, 2153, 1105, 1147, 12890, 19587, 6463, 1105, 15026, 1482, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])}

cast

< source >

( features: Features ) → IterableDatasetDict

Parameters

A copy of the dataset with the features cast to the new types.

Cast the dataset to a new set of features. The type casting is applied to all the datasets of the dataset dictionary.

Example:

from datasets import load_dataset, ClassLabel, Value
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
ds["train"].features
{'label': ClassLabel(names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)}
new_features = ds["train"].features.copy()
new_features['label'] = ClassLabel(names=['bad', 'good'])
new_features['text'] = Value('large_string')
ds = ds.cast(new_features)
ds["train"].features
{'label': ClassLabel(names=['bad', 'good'], id=None), 'text': Value(dtype='large_string', id=None)}

cast_column

< source >

( column: str feature: typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.LargeList, datasets.features.features.Sequence, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image, datasets.features.video.Video, datasets.features.pdf.Pdf] )

Parameters

Cast column to feature for decoding. The type casting is applied to all the datasets of the dataset dictionary.

Example:

from datasets import load_dataset, ClassLabel
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
ds["train"].features
{'label': ClassLabel(names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)}
ds = ds.cast_column('label', ClassLabel(names=['bad', 'good']))
ds["train"].features
{'label': ClassLabel(names=['bad', 'good'], id=None), 'text': Value(dtype='string', id=None)}

remove_columns

< source >

( column_names: typing.Union[str, list[str]] ) → IterableDatasetDict

Parameters

A copy of the dataset object without the columns to remove.

Remove one or several column(s) in the dataset and the features associated with them. The removal is done on-the-fly on the examples when iterating over the dataset. The removal is applied to all the datasets of the dataset dictionary.

Example:

from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
ds = ds.remove_columns("label")
next(iter(ds["train"]))
{'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}

rename_column

< source >

( original_column_name: str new_column_name: str ) → IterableDatasetDict

Parameters

A copy of the dataset with a renamed column.

Rename a column in the dataset, and move the features associated with the original column under the new column name. The renaming is applied to all the datasets of the dataset dictionary.

Example:

from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
ds = ds.rename_column("text", "movie_review")
next(iter(ds["train"]))
{'label': 1, 'movie_review': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}

rename_columns

< source >

( column_mapping: dict ) → IterableDatasetDict

Parameters

A copy of the dataset with renamed columns.

Rename several columns in the dataset, and move the features associated with the original columns under the new column names. The renaming is applied to all the datasets of the dataset dictionary.

Example:

from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
ds = ds.rename_columns({"text": "movie_review", "label": "rating"})
next(iter(ds["train"]))
{'movie_review': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'rating': 1}

select_columns

< source >

( column_names: typing.Union[str, list[str]] ) → IterableDatasetDict

Parameters

A copy of the dataset object with only selected columns.

Select one or several column(s) in the dataset and the features associated with them. The selection is done on-the-fly on the examples when iterating over the dataset. The selection is applied to all the datasets of the dataset dictionary.

Example:

from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
ds = ds.select_columns("text")
next(iter(ds["train"]))
{'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}

Features

class datasets.Features

< source >

( *args **kwargs )

A special dictionary that defines the internal structure of a dataset.

Instantiated with a dictionary of type dict[str, FieldType], where keys are the desired column names, and values are the type of that column.

FieldType can be one of the following:
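For instance, a minimal sketch of a Features definition that combines a few common field types (the column names here are purely illustrative):

from datasets import Features, Value, ClassLabel, Sequence

features = Features({
    "text": Value("string"),                      # scalar string column
    "label": ClassLabel(names=["neg", "pos"]),    # integer class labels with names
    "scores": Sequence(Value("float32")),         # variable-length list of floats
})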

copy

< source >

( )

Make a deep copy of Features.

Example:

from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
copy_of_features = ds.features.copy()
copy_of_features
{'label': ClassLabel(names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)}

decode_batch

< source >

( batch: dict token_per_repo_id: typing.Optional[dict[str, typing.Union[str, bool, NoneType]]] = None )

Parameters

Decode batch with custom feature decoding.

decode_column

< source >

( column: list column_name: str )

Parameters

Decode column with custom feature decoding.

decode_example

< source >

( example: dict token_per_repo_id: typing.Optional[dict[str, typing.Union[str, bool, NoneType]]] = None )

Parameters

Decode example with custom feature decoding.
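A minimal sketch, assuming an Image column and a small in-memory image so that no file needs to exist on disk:

import numpy as np
from datasets import Features, Image, Value

features = Features({"id": Value("int32"), "image": Image()})
# Encode a tiny in-memory image, then decode it back with custom feature decoding
encoded = features.encode_example({"id": 0, "image": np.zeros((4, 4, 3), dtype=np.uint8)})
decoded = features.decode_example(encoded)
# decoded["image"] is now a PIL.Image.Image instance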

encode_batch

< source >

( batch )

Parameters

Encode batch into a format for Arrow.

encode_column

< source >

( column column_name: str )

Parameters

Encode column into a format for Arrow.

encode_example

< source >

( example )

Parameters

Encode example into a format for Arrow.
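A minimal sketch, assuming a ClassLabel column whose string labels should be converted to their integer ids before being written to Arrow:

from datasets import ClassLabel, Features, Value

features = Features({"text": Value("string"), "label": ClassLabel(names=["neg", "pos"])})
features.encode_example({"text": "great movie", "label": "pos"})
# expected: {'text': 'great movie', 'label': 1}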

flatten

< source >

( max_depth: int = 16 ) → Features

Flatten the features. Every dictionary column is removed and is replaced by all the subfields it contains. The new fields are named by concatenating the name of the original column and the subfield name like this: <original>.<subfield>.

If a column contains nested dictionaries, then all the lower-level subfields names are also concatenated to form new columns: <original>.<subfield>.<subsubfield>, etc.

Example:

from datasets import load_dataset
ds = load_dataset("rajpurkar/squad", split="train")
ds.features.flatten()
{'answers.answer_start': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'answers.text': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'context': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None)}

from_arrow_schema

< source >

( pa_schema: Schema )

Parameters

Construct Features from Arrow Schema. It also checks the schema metadata for Hugging Face Datasets features. Non-nullable fields are not supported and are set to nullable.

Also, pa.dictionary is not supported and its underlying type is used instead. Therefore, datasets converts DictionaryArray objects to their actual values.
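A minimal sketch of constructing Features from a plain pyarrow schema (the column names are illustrative):

import pyarrow as pa
from datasets import Features

schema = pa.schema({"text": pa.string(), "stars": pa.int32()})
Features.from_arrow_schema(schema)
# expected: {'text': Value(dtype='string', id=None), 'stars': Value(dtype='int32', id=None)}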

from_dict

< source >

( dic ) → Features

Parameters

Construct Features from dict.

Regenerate the nested feature object from a deserialized dict. We use the _type key to infer the dataclass name of the feature FieldType.

It allows for a convenient constructor syntax to define features from deserialized JSON dictionaries. This function is used in particular when deserializing a DatasetInfo that was dumped to a JSON object. It acts as an analogue of Features.from_arrow_schema and handles the recursive field-by-field instantiation, but doesn't require any mapping to/from pyarrow, other than taking advantage of the pyarrow primitive dtype mapping that Value performs automatically.

Example:

Features.from_dict({'_type': {'dtype': 'string', 'id': None, '_type': 'Value'}})
{'_type': Value(dtype='string', id=None)}

reorder_fields_as

< source >

( other: Features )

Parameters

Reorder Features fields to match the field order of other Features.

The order of the fields is important since it matters for the underlying arrow data. Re-ordering the fields makes it possible for the underlying arrow data types to match.

Example:

from datasets import Features, Sequence, Value

f1 = Features({"root": Sequence({"a": Value("string"), "b": Value("string")})})
f2 = Features({"root": {"b": Sequence(Value("string")), "a": Sequence(Value("string"))}})
assert f1.type != f2.type

f1.reorder_fields_as(f2)
{'root': Sequence(feature={'b': Value(dtype='string', id=None), 'a': Value(dtype='string', id=None)}, length=-1, id=None)}
assert f1.reorder_fields_as(f2).type == f2.type

Scalar

class datasets.Value

< source >

( dtype: str id: typing.Optional[str] = None )

Parameters

Scalar feature value of a particular data type.

The possible dtypes of Value are as follows:

Example:

from datasets import Features, Value
features = Features({'stars': Value(dtype='int32')})
features
{'stars': Value(dtype='int32', id=None)}

class datasets.ClassLabel

< source >

( num_classes: dataclasses.InitVar[typing.Optional[int]] = None names: list = None names_file: dataclasses.InitVar[typing.Optional[str]] = None id: typing.Optional[str] = None )

Parameters

Feature type for integer class labels.

There are 3 ways to define a ClassLabel, which correspond to the 3 arguments:

Under the hood the labels are stored as integers. You can use negative integers to represent unknown/missing labels.

Example:

from datasets import Features, ClassLabel
features = Features({'label': ClassLabel(num_classes=3, names=['bad', 'ok', 'good'])})
features
{'label': ClassLabel(names=['bad', 'ok', 'good'], id=None)}
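The three constructor arguments shown in the signature above correspond to the three ways of defining a ClassLabel; a minimal sketch (the names_file path is hypothetical, so that line is left commented out):

from datasets import ClassLabel

ClassLabel(num_classes=3)                  # from a number of classes only
ClassLabel(names=["bad", "ok", "good"])    # from an explicit list of label names
# ClassLabel(names_file="labels.txt")      # from a file with one label name per line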

cast_storage

< source >

( storage: typing.Union[pyarrow.lib.StringArray, pyarrow.lib.IntegerArray] ) → pa.Int64Array

Parameters

Array in the ClassLabel arrow storage type.

Cast an Arrow array to the ClassLabel arrow storage type. The Arrow types that can be converted to the ClassLabel pyarrow storage type are:

int2str

< source >

( values: typing.Union[int, collections.abc.Iterable] )

Conversion integer => class name string.

Regarding unknown/missing labels: passing negative integers raises ValueError.

Example:

from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
ds.features["label"].int2str(0)
'neg'

str2int

< source >

( values: typing.Union[str, collections.abc.Iterable] )

Conversion class name string => integer.

Example:

from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
ds.features["label"].str2int('neg')
0

Composite

class datasets.LargeList

< source >

( feature: typing.Any id: typing.Optional[str] = None )

Parameters

Feature type for large list data composed of child feature data type.

It is backed by pyarrow.LargeListType, which is like pyarrow.ListType but with 64-bit rather than 32-bit offsets.
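A minimal sketch of declaring a LargeList column, assuming a dataset whose list data may exceed the 32-bit offset range:

from datasets import Features, LargeList, Value

# LargeList behaves like a list feature but uses 64-bit offsets in Arrow
features = Features({"tokens": LargeList(Value("string"))})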

class datasets.Sequence

< source >

( feature: typing.Any length: int = -1 id: typing.Optional[str] = None )

Parameters

Construct a list of features from a single type or a dict of types. Mostly here for compatibility with tfds.

Example:

from datasets import Features, Sequence, Value, ClassLabel
features = Features({'post': Sequence(feature={'text': Value(dtype='string'), 'upvotes': Value(dtype='int32'), 'label': ClassLabel(num_classes=2, names=['hot', 'cold'])})})
features
{'post': Sequence(feature={'text': Value(dtype='string', id=None), 'upvotes': Value(dtype='int32', id=None), 'label': ClassLabel(names=['hot', 'cold'], id=None)}, length=-1, id=None)}

Translation

class datasets.Translation

< source >

( languages: list id: typing.Optional[str] = None )

Parameters

Feature for translations with fixed languages per example. Here for compatibility with tfds.

Example:

datasets.features.Translation(languages=['en', 'fr', 'de'])

yield {
    'en': 'the cat',
    'fr': 'le chat',
    'de': 'die katze'
}

Flatten the Translation feature into a dictionary.

class datasets.TranslationVariableLanguages

< source >

( languages: typing.Optional[list] = None num_languages: typing.Optional[int] = None id: typing.Optional[str] = None ) →

Parameters

Returns

Language codes sorted in ascending order or plain text translations, sorted to align with language codes.

Feature for translations with variable languages per example. Here for compatibility with tfds.

Example:

datasets.features.TranslationVariableLanguages(languages=['en', 'fr', 'de'])

yield {
    'en': 'the cat',
    'fr': ['le chat', 'la chatte'],
    'de': 'die katze'
}

{
    'language': ['en', 'de', 'fr', 'fr'],
    'translation': ['the cat', 'die katze', 'la chatte', 'le chat'],
}

Flatten the TranslationVariableLanguages feature into a dictionary.

Arrays

class datasets.Array2D

< source >

( shape: tuple dtype: str id: typing.Optional[str] = None )

Parameters

Create a two-dimensional array.

Example:

from datasets import Features, Array2D
features = Features({'x': Array2D(shape=(1, 3), dtype='int32')})

class datasets.Array3D

< source >

( shape: tuple dtype: str id: typing.Optional[str] = None )

Parameters

Create a three-dimensional array.

Example:

from datasets import Features, Array3D
features = Features({'x': Array3D(shape=(1, 2, 3), dtype='int32')})

class datasets.Array4D

< source >

( shape: tuple dtype: str id: typing.Optional[str] = None )

Parameters

Create a four-dimensional array.

Example:

from datasets import Features, Array4D
features = Features({'x': Array4D(shape=(1, 2, 2, 3), dtype='int32')})

class datasets.Array5D

< source >

( shape: tuple dtype: str id: typing.Optional[str] = None )

Parameters

Create a five-dimensional array.

Example:

from datasets import Features, Array5D
features = Features({'x': Array5D(shape=(1, 2, 2, 3, 3), dtype='int32')})

Audio

class datasets.Audio

< source >

( sampling_rate: typing.Optional[int] = None mono: bool = True decode: bool = True id: typing.Optional[str] = None )

Parameters

Audio Feature to extract audio data from an audio file.

Input: The Audio feature accepts as input:

Example:

from datasets import load_dataset, Audio
ds = load_dataset("PolyAI/minds14", name="en-US", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
ds[0]["audio"]
{'array': array([ 2.3443763e-05, 2.1729663e-04, 2.2145823e-04, ..., 3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 16000}

cast_storage

< source >

( storage: typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray] ) → pa.StructArray

Parameters

Array in the Audio arrow storage type, that is pa.struct({"bytes": pa.binary(), "path": pa.string()}).

Cast an Arrow array to the Audio arrow storage type. The Arrow types that can be converted to the Audio pyarrow storage type are:

decode_example

< source >

( value: dict token_per_repo_id: typing.Optional[dict[str, typing.Union[str, bool, NoneType]]] = None ) → dict

Parameters

Decode example audio file into audio data.

embed_storage

< source >

( storage: StructArray ) → pa.StructArray

Parameters

Array in the Audio arrow storage type, that is pa.struct({"bytes": pa.binary(), "path": pa.string()}).

Embed audio files into the Arrow array.

encode_example

< source >

( value: typing.Union[str, bytes, dict] ) → dict

Parameters

Encode example into a format for Arrow.
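A minimal sketch: a path string is stored as-is, without reading the file (the path shown is hypothetical):

from datasets import Audio

Audio(sampling_rate=16000).encode_example("path/to/audio.wav")
# expected: {'bytes': None, 'path': 'path/to/audio.wav'}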

If in the decodable state, raise an error, otherwise flatten the feature into a dictionary.

Image

class datasets.Image

< source >

( mode: typing.Optional[str] = None decode: bool = True id: typing.Optional[str] = None )

Parameters

Image Feature to read image data from an image file.

Input: The Image feature accepts as input:

Examples:

from datasets import load_dataset, Image
ds = load_dataset("AI-Lab-Makerere/beans", split="train")
ds.features["image"]
Image(decode=True, id=None)
ds[0]["image"]
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x500 at 0x15E52E7F0>
ds = ds.cast_column('image', Image(decode=False))
ds[0]["image"]
{'bytes': None, 'path': '/root/.cache/huggingface/datasets/downloads/extracted/b0a21163f78769a2cf11f58dfc767fb458fc7cea5c05dccc0144a2c0f0bc1292/train/healthy/healthy_train.85.jpg'}

cast_storage

< source >

( storage: typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray, pyarrow.lib.ListArray] ) → pa.StructArray

Parameters

Array in the Image arrow storage type, that is pa.struct({"bytes": pa.binary(), "path": pa.string()}).

Cast an Arrow array to the Image arrow storage type. The Arrow types that can be converted to the Image pyarrow storage type are:

decode_example

< source >

( value: dict token_per_repo_id = None )

Parameters

Decode example image file into image data.

embed_storage

< source >

( storage: StructArray ) → pa.StructArray

Parameters

Array in the Image arrow storage type, that is pa.struct({"bytes": pa.binary(), "path": pa.string()}).

Embed image files into the Arrow array.

encode_example

< source >

( value: typing.Union[str, bytes, dict, numpy.ndarray, ForwardRef('PIL.Image.Image')] )

Parameters

Encode example into a format for Arrow.

If in the decodable state, return the feature itself, otherwise flatten the feature into a dictionary.

Video

class datasets.Video

< source >

( decode: bool = True id: typing.Optional[str] = None )

Parameters

Experimental. Video Feature to read video data from a video file.

Input: The Video feature accepts as input:

Examples:

from datasets import Dataset, Video
ds = Dataset.from_dict({"video": ["path/to/Screen Recording.mov"]}).cast_column("video", Video())
ds.features["video"]
Video(decode=True, id=None)
ds[0]["video"]
<torchvision.io.video_reader.VideoReader object at 0x325b1aae0>
ds = ds.cast_column('video', Video(decode=False))
ds[0]["video"]
{'bytes': None, 'path': 'path/to/Screen Recording.mov'}

cast_storage

< source >

( storage: typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray, pyarrow.lib.ListArray] ) → pa.StructArray

Parameters

Array in the Video arrow storage type, that is pa.struct({"bytes": pa.binary(), "path": pa.string()}).

Cast an Arrow array to the Video arrow storage type. The Arrow types that can be converted to the Video pyarrow storage type are:

decode_example

< source >

( value: typing.Union[str, datasets.features.video.Example] token_per_repo_id: typing.Optional[dict[str, typing.Union[bool, str]]] = None )

Parameters

Decode example video file into video data.

encode_example

< source >

( value: typing.Union[str, bytes, datasets.features.video.Example, numpy.ndarray, ForwardRef('VideoReader')] )

Parameters

Encode example into a format for Arrow.

If in the decodable state, return the feature itself, otherwise flatten the feature into a dictionary.

Pdf

class datasets.Pdf

< source >

( decode: bool = True id: typing.Optional[str] = None )

Parameters

Experimental. Pdf Feature to read pdf documents from a pdf file.

Input: The Pdf feature accepts as input:

Examples:

from datasets import Dataset, Pdf
ds = Dataset.from_dict({"pdf": ["path/to/pdf/file.pdf"]}).cast_column("pdf", Pdf())
ds.features["pdf"]
Pdf(decode=True, id=None)
ds[0]["pdf"]
<pdfplumber.pdf.PDF object at 0x7f8a1c2d8f40>
ds = ds.cast_column("pdf", Pdf(decode=False))
ds[0]["pdf"]
{'bytes': None, 'path': 'path/to/pdf/file.pdf'}

cast_storage

< source >

( storage: typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray, pyarrow.lib.ListArray] ) → pa.StructArray

Parameters

Array in the Pdf arrow storage type, that is pa.struct({"bytes": pa.binary(), "path": pa.string()}).

Cast an Arrow array to the Pdf arrow storage type. The Arrow types that can be converted to the Pdf pyarrow storage type are:

decode_example

< source >

( value: dict token_per_repo_id = None )

Parameters

Decode example pdf file into pdf data.

encode_example

< source >

( value: typing.Union[str, bytes, dict, ForwardRef('pdfplumber.pdf.PDF')] )

Parameters

Encode example into a format for Arrow.

If in the decodable state, return the feature itself, otherwise flatten the feature into a dictionary.

Filesystems

datasets.filesystems.is_remote_filesystem

< source >

( fs: AbstractFileSystem )

Parameters

Checks if fs is a remote filesystem.
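A minimal sketch using fsspec filesystems; a local filesystem is expected to be reported as not remote, while an in-memory one counts as remote:

import fsspec
from datasets.filesystems import is_remote_filesystem

local_fs = fsspec.filesystem("file")
memory_fs = fsspec.filesystem("memory")
print(is_remote_filesystem(local_fs))   # expected: False
print(is_remote_filesystem(memory_fs))  # expected: True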

Fingerprint

class datasets.fingerprint.Hasher

< source >

( )

Hasher that accepts python objects as inputs.
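A minimal sketch, assuming the one-shot Hasher.hash classmethod and the incremental update()/hexdigest() interface; hashes are deterministic hexadecimal strings:

from datasets.fingerprint import Hasher

# One-shot hashing of an arbitrary (picklable) python object
fingerprint = Hasher.hash({"text": "hello", "label": 1})

# Incremental hashing of several objects
hasher = Hasher()
hasher.update("some value")
hasher.update([1, 2, 3])
print(hasher.hexdigest())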
