Processors (original) (raw)

Processors can mean two different things in the Transformers library:

Multi-modal processors

Any multi-modal model will require an object to encode or decode the data that groups several modalities (among text, vision and audio). This is handled by objects called processors, which group together two or more processing objects such as tokenizers (for the text modality), image processors (for vision) and feature extractors (for audio).

Those processors inherit from the following base class that implements the saving and loading functionality:

class transformers.ProcessorMixin

< source >

( *args **kwargs )

This is a mixin used to provide saving/loading functionality for all processor classes.

apply_chat_template

< source >

( conversation: typing.Union[list[dict[str, str]], list[list[dict[str, str]]]] chat_template: typing.Optional[str] = None **kwargs: typing_extensions.Unpack[transformers.processing_utils.AllKwargsForChatTemplate] )

Parameters

Similar to the apply_chat_template method on tokenizers, this method applies a Jinja template to input conversations to turn them into a single tokenizable string.

The input is expected to be in the following format, where each message content is a list consisting of text and optionally image or video inputs. One can also provide an image, video, URL or local path which will be used to formpixel_values when return_dict=True. If not provided, one will get only the formatted text, optionally tokenized text.

conversation = [ { “role”: “user”, “content”: [ {“type”: “image”, “image”: “https://www.ilankelman.org/stopsigns/australia.jpg”}, {“type”: “text”, “text”: “Please describe this image in detail.”}, ], }, ]

from_args_and_dict

< source >

( args processor_dict: dict **kwargs ) → ~processing_utils.ProcessingMixin

Parameters

Returns

~processing_utils.ProcessingMixin

The processor object instantiated from those parameters.

Instantiates a type of ~processing_utils.ProcessingMixin from a Python dictionary of parameters.

from_pretrained

< source >

( pretrained_model_name_or_path: typing.Union[str, os.PathLike] cache_dir: typing.Union[str, os.PathLike, NoneType] = None force_download: bool = False local_files_only: bool = False token: typing.Union[bool, str, NoneType] = None revision: str = 'main' **kwargs )

Parameters

Instantiate a processor associated with a pretrained model.

This class method is simply calling the feature extractorfrom_pretrained(), image processorImageProcessingMixin and the tokenizer~tokenization_utils_base.PreTrainedTokenizer.from_pretrained methods. Please refer to the docstrings of the methods above for more information.

get_processor_dict

< source >

( pretrained_model_name_or_path: typing.Union[str, os.PathLike] **kwargs ) → Tuple[Dict, Dict]

Parameters

Returns

Tuple[Dict, Dict]

The dictionary(ies) that will be used to instantiate the processor object.

From a pretrained_model_name_or_path, resolve to a dictionary of parameters, to be used for instantiating a processor of type ~processing_utils.ProcessingMixin using from_args_and_dict.

post_process_image_text_to_text

< source >

( generated_outputs skip_special_tokens = True **kwargs ) → List[str]

Parameters

The decoded text.

Post-process the output of a vlm to decode the text.

prepare_and_validate_optional_call_args

< source >

( *args )

Matches optional positional arguments to their corresponding names in optional_call_argsin the processor class in the order they are passed to the processor call.

Note that this should only be used in the __call__ method of the processors with special arguments. Special arguments are arguments that aren’t text, images, audio, nor videosbut also aren’t passed to the tokenizer, image processor, etc. Examples of such processors are:

Also note that passing by position to the processor call is now deprecated and will be disallowed in future versions. We only have this for backward compatibility.

Example: Suppose that the processor class has optional_call_args = ["arg_name_1", "arg_name_2"].

And we define the call method as:

def call( self, text: str, images: Optional[ImageInput] = None, *arg, audio=None, videos=None, )

Then, if we call the processor as:

images = [...] processor("What is common in these images?", images, arg_value_1, arg_value_2)

Then, this method will return:

{ "arg_name_1": arg_value_1, "arg_name_2": arg_value_2, }

which we could then pass as kwargs to `self._merge_kwargs`

push_to_hub

< source >

( repo_id: str use_temp_dir: typing.Optional[bool] = None commit_message: typing.Optional[str] = None private: typing.Optional[bool] = None token: typing.Union[bool, str, NoneType] = None max_shard_size: typing.Union[str, int, NoneType] = '5GB' create_pr: bool = False safe_serialization: bool = True revision: typing.Optional[str] = None commit_description: typing.Optional[str] = None tags: typing.Optional[list[str]] = None **deprecated_kwargs )

Parameters

Upload the processor files to the 🤗 Model Hub.

Examples:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google-bert/bert-base-cased")

processor.push_to_hub("my-finetuned-bert")

processor.push_to_hub("huggingface/my-finetuned-bert")

register_for_auto_class

< source >

( auto_class = 'AutoProcessor' )

Parameters

Register this class with a given auto class. This should only be used for custom feature extractors as the ones in the library are already mapped with AutoProcessor.

This API is experimental and may have some slight breaking changes in the next releases.

save_pretrained

< source >

( save_directory push_to_hub: bool = False **kwargs )

Parameters

Saves the attributes of this processor (feature extractor, tokenizer…) in the specified directory so that it can be reloaded using the from_pretrained() method.

This class method is simply calling save_pretrained() andsave_pretrained(). Please refer to the docstrings of the methods above for more information.

to_dict

< source >

( ) → Dict[str, Any]

Dictionary of all the attributes that make up this processor instance.

Serializes this instance to a Python dictionary.

to_json_file

< source >

( json_file_path: typing.Union[str, os.PathLike] )

Parameters

Save this instance to a JSON file.

to_json_string

< source >

( ) → str

String containing all the attributes that make up this feature_extractor instance in JSON format.

Serializes this instance to a JSON string.

Deprecated processors

All processors follow the same architecture which is that of theDataProcessor. The processor returns a list ofInputExample. TheseInputExample can be converted toInputFeatures in order to be fed to the model.

Base class for data converters for sequence classification data sets.

get_example_from_tensor_dict

< source >

( tensor_dict )

Parameters

Gets an example from a dict with tensorflow tensors.

Gets the list of labels for this data set.

Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are. This method converts examples to the correct format.

class transformers.InputExample

< source >

( guid: str text_a: str text_b: typing.Optional[str] = None label: typing.Optional[str] = None )

Parameters

A single training/test example for simple sequence classification.

Serializes this instance to a JSON string.

class transformers.InputFeatures

< source >

( input_ids: typing.List[int] attention_mask: typing.Optional[typing.List[int]] = None token_type_ids: typing.Optional[typing.List[int]] = None label: typing.Union[int, float, NoneType] = None )

Parameters

A single set of features of data. Property names are the same names as the corresponding inputs to a model.

Serializes this instance to a JSON string.

GLUE

General Language Understanding Evaluation (GLUE) is a benchmark that evaluates the performance of models across a diverse set of existing NLU tasks. It was released together with the paper GLUE: A multi-task benchmark and analysis platform for natural language understanding

This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI.

Those processors are:

Additionally, the following method can be used to load values from a data file and convert them to a list ofInputExample.

transformers.glue_convert_examples_to_features

< source >

( examples: typing.Union[typing.List[transformers.data.processors.utils.InputExample], ForwardRef('tf.data.Dataset')] tokenizer: PreTrainedTokenizer max_length: typing.Optional[int] = None task = None label_list = None output_mode = None )

Parameters

Loads a data file into a list of InputFeatures

XNLI

The Cross-Lingual NLI Corpus (XNLI) is a benchmark that evaluates the quality of cross-lingual text representations. XNLI is crowd-sourced dataset based on MultiNLI: pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).

It was released together with the paper XNLI: Evaluating Cross-lingual Sentence Representations

This library hosts the processor to load the XNLI data:

Please note that since the gold labels are available on the test set, evaluation is performed on the test set.

An example using these processors is given in the run_xnli.py script.

SQuAD

The Stanford Question Answering Dataset (SQuAD) is a benchmark that evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper SQuAD: 100,000+ Questions for Machine Comprehension of Text. The second version (v2.0) was released alongside the paper Know What You Don’t Know: Unanswerable Questions for SQuAD.

This library hosts a processor for each of the two versions:

Processors

Those processors are:

They both inherit from the abstract class ~data.processors.utils.SquadProcessor

class transformers.data.processors.squad.SquadProcessor

< source >

( )

Processor for the SQuAD data set. overridden by SquadV1Processor and SquadV2Processor, used by the version 1.1 and version 2.0 of SQuAD, respectively.

get_dev_examples

< source >

( data_dir filename = None )

Parameters

Returns the evaluation example from the data directory.

get_examples_from_dataset

< source >

( dataset evaluate = False )

Parameters

Creates a list of SquadExample using a TFDS dataset.

Examples:

import tensorflow_datasets as tfds

dataset = tfds.load("squad")

training_examples = get_examples_from_dataset(dataset, evaluate=False) evaluation_examples = get_examples_from_dataset(dataset, evaluate=True)

get_train_examples

< source >

( data_dir filename = None )

Parameters

Returns the training examples from the data directory.

Additionally, the following method can be used to convert SQuAD examples into~data.processors.utils.SquadFeatures that can be used as model inputs.

transformers.squad_convert_examples_to_features

< source >

( examples tokenizer max_seq_length doc_stride max_query_length is_training padding_strategy = 'max_length' return_dataset = False threads = 1 tqdm_enabled = True )

Parameters

Converts a list of examples into a list of features that can be directly given as input to a model. It is model-dependant and takes advantage of many of the tokenizer’s features to create the model’s inputs.

Example:

processor = SquadV2Processor() examples = processor.get_dev_examples(data_dir)

features = squad_convert_examples_to_features( examples=examples, tokenizer=tokenizer, max_seq_length=args.max_seq_length, doc_stride=args.doc_stride, max_query_length=args.max_query_length, is_training=not evaluate, )

These processors as well as the aforementioned method can be used with files containing the data as well as with the_tensorflow_datasets_ package. Examples are given below.

Example usage

Here is an example using the processors as well as the conversion method using data files:

processor = SquadV2Processor() examples = processor.get_dev_examples(squad_v2_data_dir)

processor = SquadV1Processor() examples = processor.get_dev_examples(squad_v1_data_dir)

features = squad_convert_examples_to_features( examples=examples, tokenizer=tokenizer, max_seq_length=max_seq_length, doc_stride=args.doc_stride, max_query_length=max_query_length, is_training=not evaluate, )

Using tensorflow_datasets is as easy as using a data file:

tfds_examples = tfds.load("squad") examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)

features = squad_convert_examples_to_features( examples=examples, tokenizer=tokenizer, max_seq_length=max_seq_length, doc_stride=args.doc_stride, max_query_length=max_query_length, is_training=not evaluate, )

Another example using these processors is given in the run_squad.py script.

< > Update on GitHub