dataset_utils — Model Optimizer 0.31.0

Utility functions for getting samples and forward loop functions for different datasets.

Functions

create_forward_loop Creates and returns a forward loop function configured for a specific model, dataset, and tokenizer.
get_dataset_dataloader Get a dataloader with the dataset name and tokenizer of the target model.
get_max_batch_size Get the maximum batch size that can be used for the model.
get_supported_datasets Retrieves a list of supported datasets.

create_forward_loop(model=None, dataset_name='cnn_dailymail', tokenizer=None, batch_size=1, num_samples=512, max_sample_length=512, device=None, include_labels=False, dataloader=None)

Creates and returns a forward loop function configured for a specific model, dataset, and tokenizer.

This function initializes a forward loop function tailored to process batches of data from the specified dataset using the given model and tokenizer. The forward loop function, when called, iterates over the dataset, applies the tokenizer to prepare the input data, feeds it into the model, and returns the model’s predictions.

Parameters:

Return type:

Callable

Example usage for quantization:

import modelopt.torch.quantization as mtq
from modelopt.torch.utils import create_forward_loop

# Initialize model and tokenizer
# ...

# Create forward loop for calibration
forward_loop = create_forward_loop(
    model=model, dataset_name="cnn_dailymail", tokenizer=tokenizer
)

# Quantize the model with the calibration dataset
mtq.quantize(model, quant_cfg, forward_loop=forward_loop)

Returns:

A forward loop function that can be called with no arguments. When called, this function iterates over the dataset specified by dataset_name.

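If you already have a calibration dataloader, the dataloader argument lets you pass it in directly instead of having the forward loop build one from dataset_name and tokenizer. The sketch below is a minimal illustration of that path, not a verbatim API example: it assumes the DataLoader returned by get_dataset_dataloader (documented below) is accepted as-is via dataloader=, and "gpt2" is only a placeholder checkpoint.

from transformers import AutoModelForCausalLM, AutoTokenizer
from modelopt.torch.utils import create_forward_loop, get_dataset_dataloader

# Placeholder model and tokenizer; substitute your own checkpoint.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Reuse a pre-built dataloader for calibration (assumed to be accepted via dataloader=).
dataloader = get_dataset_dataloader(dataset_name="cnn_dailymail", tokenizer=tokenizer, batch_size=2)
forward_loop = create_forward_loop(model=model, dataloader=dataloader)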

get_dataset_dataloader(dataset_name='cnn_dailymail', tokenizer=None, batch_size=1, num_samples=512, max_sample_length=512, device=None, include_labels=False)

Get a dataloader with the dataset name and tokenizer of the target model.

Parameters:

Returns:

An instance of DataLoader.

Return type:

DataLoader
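A minimal usage sketch, assuming get_dataset_dataloader is importable from modelopt.torch.utils like the other helpers on this page and that a Hugging Face tokenizer is passed in; "gpt2" is only a placeholder checkpoint.

from transformers import AutoTokenizer
from modelopt.torch.utils import get_dataset_dataloader

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# 512 samples from cnn_dailymail, tokenized and truncated to 512 tokens each.
dataloader = get_dataset_dataloader(
    dataset_name="cnn_dailymail",
    tokenizer=tokenizer,
    batch_size=4,
    num_samples=512,
    max_sample_length=512,
)

for batch in dataloader:
    ...  # each batch holds tokenized inputs ready to feed the model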

get_max_batch_size(model, max_sample_length=512, sample_memory_usage_ratio=1.0, sample_input_single_batch=None)

Get the maximum batch size that can be used for the model.

Parameters:
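A minimal sketch of feeding the estimate into calibration, assuming get_max_batch_size is exported from modelopt.torch.utils alongside the other helpers; the checkpoint name is a placeholder, and the model is moved to the GPU only so the memory probe reflects the device it will run on.

from transformers import AutoModelForCausalLM, AutoTokenizer
from modelopt.torch.utils import create_forward_loop, get_max_batch_size

model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")  # placeholder model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Estimate the largest batch of 512-token samples that fits in memory.
batch_size = get_max_batch_size(model, max_sample_length=512)

# Use the estimate when building the calibration forward loop.
forward_loop = create_forward_loop(
    model=model,
    dataset_name="cnn_dailymail",
    tokenizer=tokenizer,
    batch_size=batch_size,
)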

get_supported_datasets()

Retrieves a list of supported datasets.

Returns:

A list of strings, where each string is the name of a supported dataset.

Return type:

list[str]

Example usage:

from modelopt.torch.utils import get_supported_datasets

print("Supported datasets:", get_supported_datasets())