Data Collator

Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of the same type as the elements of train_dataset or eval_dataset.

To be able to build batches, data collators may apply some processing (like padding). Some of them (like DataCollatorForLanguageModeling) also apply some random data augmentation (like random masking) on the formed batch.

Examples of use can be found in the example scripts or example notebooks.
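As a minimal sketch of the pattern (the checkpoint name and texts are illustrative), a collator is simply a callable that turns a list of dataset elements into a batch, so it can be passed to Trainer(data_collator=...) or used as a DataLoader collate_fn:

```python
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Any list of tokenized examples works; the collator is called on each mini batch.
examples = [tokenizer(text) for text in ["a short text", "a second, noticeably longer text"]]
loader = DataLoader(examples, batch_size=2, collate_fn=collator)
batch = next(iter(loader))  # dict of padded tensors: input_ids, attention_mask, ...
```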

Default data collator

transformers.default_data_collator

( features: typing.List[transformers.data.data_collator.InputDataClass] return_tensors = 'pt' )

Very simple data collator that simply collates batches of dict-like objects and performs special handling for potential keys named:

  • label: handles a single value (int or float) per object
  • label_ids: handles a list of values per object

Does not do any additional preprocessing: property names of the input object will be used as corresponding inputs to the model. See glue and ner for examples of how it's useful.
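A minimal sketch (the token ids are illustrative) of how already equal-length features are stacked and how the "label" key becomes a "labels" tensor:

```python
from transformers import default_data_collator

features = [
    {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1], "label": 0},
    {"input_ids": [101, 2008, 102], "attention_mask": [1, 1, 1], "label": 1},
]
batch = default_data_collator(features)
print(batch["input_ids"].shape)  # torch.Size([2, 3])
print(batch["labels"])           # tensor([0, 1])
```

Note that no padding is applied, so the features must already have the same length.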

DefaultDataCollator

class transformers.DefaultDataCollator

( return_tensors: str = 'pt' )

Parameters

  • return_tensors (str, optional, defaults to "pt"): The type of Tensor to return. Allowable values are "np", "pt" and "tf".

Very simple data collator that simply collates batches of dict-like objects and performs special handling for potential keys named:

  • label: handles a single value (int or float) per object
  • label_ids: handles a list of values per object

Does not do any additional preprocessing: property names of the input object will be used as corresponding inputs to the model. See glue and ner for examples of how it's useful.

This is an object (like other data collators) rather than a pure function like default_data_collator. This can be helpful if you need to set a return_tensors value at initialization.
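A minimal sketch (token ids are illustrative) of fixing the tensor type at initialization:

```python
from transformers import DefaultDataCollator

collator = DefaultDataCollator(return_tensors="np")
features = [
    {"input_ids": [101, 2023, 102], "label": 0},
    {"input_ids": [101, 2008, 102], "label": 1},
]
batch = collator(features)
print(type(batch["input_ids"]))  # <class 'numpy.ndarray'>
```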

DataCollatorWithPadding

class transformers.DataCollatorWithPadding

( tokenizer: PreTrainedTokenizerBase padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = True max_length: typing.Optional[int] = None pad_to_multiple_of: typing.Optional[int] = None return_tensors: str = 'pt' )

Parameters

  • tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast): The tokenizer used for encoding the data.
  • padding (bool, str or PaddingStrategy, optional, defaults to True): Padding strategy: True or 'longest' pads to the longest sequence in the batch, 'max_length' pads to max_length (or the model's maximum if max_length is not set), and False or 'do_not_pad' adds no padding.
  • max_length (int, optional): Maximum length of the returned sequences (and the padding length when padding='max_length').
  • pad_to_multiple_of (int, optional): If set, pad the sequence to a multiple of the provided value (useful for Tensor Cores on NVIDIA hardware).
  • return_tensors (str, optional, defaults to "pt"): The type of Tensor to return. Allowable values are "np", "pt" and "tf".

Data collator that will dynamically pad the inputs received.
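A minimal sketch (the checkpoint name and texts are illustrative): each batch is padded to its own longest sequence rather than to a fixed max_length, optionally rounded up with pad_to_multiple_of:

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorWithPadding(tokenizer=tokenizer, pad_to_multiple_of=8)

features = [tokenizer("a short example"), tokenizer("a noticeably longer example sentence")]
batch = collator(features)
# Padded to the longest sequence in the batch, rounded up to a multiple of 8.
print(batch["input_ids"].shape, batch["attention_mask"].shape)
```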

DataCollatorForTokenClassification

class transformers.DataCollatorForTokenClassification

( tokenizer: PreTrainedTokenizerBase padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = True max_length: typing.Optional[int] = None pad_to_multiple_of: typing.Optional[int] = None label_pad_token_id: int = -100 return_tensors: str = 'pt' )

Parameters

  • tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast): The tokenizer used for encoding the data.
  • padding (bool, str or PaddingStrategy, optional, defaults to True): Padding strategy for the inputs (see DataCollatorWithPadding).
  • max_length (int, optional): Maximum length of the returned sequences (and the padding length when padding='max_length').
  • pad_to_multiple_of (int, optional): If set, pad the sequence to a multiple of the provided value.
  • label_pad_token_id (int, optional, defaults to -100): The id to use when padding the labels (-100 will be automatically ignored by PyTorch loss functions).
  • return_tensors (str, optional, defaults to "pt"): The type of Tensor to return. Allowable values are "np", "pt" and "tf".

Data collator that will dynamically pad the inputs received, as well as the labels.
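A minimal sketch (the checkpoint name, token ids and labels are illustrative): inputs are padded by the tokenizer while label sequences are padded with label_pad_token_id, which PyTorch loss functions ignore:

```python
from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

features = [
    {"input_ids": [101, 2023, 102], "labels": [-100, 3, -100]},
    {"input_ids": [101, 2023, 2003, 102], "labels": [-100, 3, 4, -100]},
]
batch = collator(features)
print(batch["labels"])  # the shorter label sequence is padded with -100
```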

DataCollatorForSeq2Seq

class transformers.DataCollatorForSeq2Seq

( tokenizer: PreTrainedTokenizerBase model: typing.Optional[typing.Any] = None padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = True max_length: typing.Optional[int] = None pad_to_multiple_of: typing.Optional[int] = None label_pad_token_id: int = -100 return_tensors: str = 'pt' )

Parameters

  • tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast): The tokenizer used for encoding the data.
  • model (PreTrainedModel, optional): The model that is being trained. If set and it has a prepare_decoder_input_ids_from_labels method, it is used to prepare the decoder_input_ids.
  • padding (bool, str or PaddingStrategy, optional, defaults to True): Padding strategy for the inputs (see DataCollatorWithPadding).
  • max_length (int, optional): Maximum length of the returned sequences (and the padding length when padding='max_length').
  • pad_to_multiple_of (int, optional): If set, pad the sequence to a multiple of the provided value.
  • label_pad_token_id (int, optional, defaults to -100): The id to use when padding the labels (-100 will be automatically ignored by PyTorch loss functions).
  • return_tensors (str, optional, defaults to "pt"): The type of Tensor to return. Allowable values are "np", "pt" and "tf".

Data collator that will dynamically pad the inputs received, as well as the labels.
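A minimal sketch (the checkpoint name and token ids are illustrative): labels are padded with -100 and, when a model is passed, decoder_input_ids are prepared from the labels via the model's prepare_decoder_input_ids_from_labels method:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

features = [
    {"input_ids": [37, 423, 1], "labels": [100, 19, 1]},
    {"input_ids": [37, 423, 55, 1], "labels": [100, 19, 55, 1]},
]
batch = collator(features)
print(batch.keys())  # input_ids, attention_mask, labels, decoder_input_ids
```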

DataCollatorForLanguageModeling

class transformers.DataCollatorForLanguageModeling

( tokenizer: PreTrainedTokenizerBase mlm: bool = True mlm_probability: float = 0.15 mask_replace_prob: float = 0.8 random_replace_prob: float = 0.1 pad_to_multiple_of: typing.Optional[int] = None tf_experimental_compile: bool = False return_tensors: str = 'pt' seed: typing.Optional[int] = None )

Parameters

  • tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast): The tokenizer used for encoding the data.
  • mlm (bool, optional, defaults to True): Whether to use masked language modeling. If False, the labels are the inputs with the padding tokens set to -100 (causal language modeling).
  • mlm_probability (float, optional, defaults to 0.15): The probability with which to (randomly) select tokens for masking when mlm is True.
  • mask_replace_prob (float, optional, defaults to 0.8): The fraction of selected tokens that are replaced with the tokenizer's mask token.
  • random_replace_prob (float, optional, defaults to 0.1): The fraction of selected tokens that are replaced with a random token from the vocabulary.
  • pad_to_multiple_of (int, optional): If set, pad the sequence to a multiple of the provided value.
  • tf_experimental_compile (bool, optional, defaults to False): Whether to compile the TensorFlow masking function with XLA.
  • return_tensors (str, optional, defaults to "pt"): The type of Tensor to return. Allowable values are "np", "pt" and "tf".
  • seed (int, optional): Seed for the random number generator used for masking.

Data collator used for language modeling. Inputs are dynamically padded to the maximum length of a batch if they are not all of the same length.

For best performance, this data collator should be used with a dataset having items that are dictionaries or BatchEncoding, with the "special_tokens_mask" key, as returned by a PreTrainedTokenizer or a PreTrainedTokenizerFast with the argument return_special_tokens_mask=True.

Example options and expectations for mask_replace_prob and random_replace_prob:

  1. Default Behavior:
    • mask_replace_prob=0.8, random_replace_prob=0.1.
    • Expect 80% of masked tokens replaced with [MASK], 10% replaced with random tokens, and 10% left unchanged.
  2. All masked tokens replaced by [MASK]:
    • mask_replace_prob=1.0, random_replace_prob=0.0.
    • Expect all masked tokens to be replaced with [MASK]. No tokens are left unchanged or replaced with random tokens.
  3. No [MASK] replacement, only random tokens:
    • mask_replace_prob=0.0, random_replace_prob=1.0.
    • Expect all masked tokens to be replaced with random tokens. No [MASK] replacements or unchanged tokens.
  4. Balanced replacement:
    • mask_replace_prob=0.5, random_replace_prob=0.4.
    • Expect 50% of masked tokens replaced with [MASK], 40% replaced with random tokens, and 10% left unchanged.

Note: The sum of mask_replace_prob and random_replace_prob must not exceed 1. If their sum is less than 1, the remaining proportion will consist of masked tokens left unchanged.
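A minimal sketch (the checkpoint name and sentences are illustrative) of the default masked language modeling setup; passing return_special_tokens_mask=True to the tokenizer lets the collator skip recomputing the special tokens mask:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

features = [
    tokenizer(text, return_special_tokens_mask=True)
    for text in ["hello world", "masked language modeling with dynamic padding"]
]
batch = collator(features)
# labels hold the original ids at masked positions and -100 everywhere else
print(batch["input_ids"].shape, batch["labels"].shape)
```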

numpy_mask_tokens

( inputs: typing.Any special_tokens_mask: typing.Optional[typing.Any] = None )

Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.

tf_mask_tokens

( inputs: typing.Any vocab_size mask_token_id special_tokens_mask: typing.Optional[typing.Any] = None )

Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.

torch_mask_tokens

( inputs: typing.Any special_tokens_mask: typing.Optional[typing.Any] = None )

Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.

DataCollatorForWholeWordMask

class transformers.DataCollatorForWholeWordMask

( tokenizer: PreTrainedTokenizerBase mlm: bool = True mlm_probability: float = 0.15 mask_replace_prob: float = 0.8 random_replace_prob: float = 0.1 pad_to_multiple_of: typing.Optional[int] = None tf_experimental_compile: bool = False return_tensors: str = 'pt' seed: typing.Optional[int] = None )

Data collator used for language modeling that masks entire words.

This collator relies on the details of BertTokenizer's subword tokenization, specifically that subword tokens are prefixed with ##. For tokenizers that do not adhere to this scheme, this collator will produce an output that is roughly equivalent to DataCollatorForLanguageModeling.
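A minimal sketch (the checkpoint name and sentence are illustrative), using a BERT-style WordPiece tokenizer so that ##-prefixed subword pieces are masked together with the word they belong to:

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

features = [tokenizer(text) for text in ["tokenizers split uncommon words into subwords"]]
batch = collator(features)
# All pieces of a selected word share the same masking decision.
print(batch["input_ids"], batch["labels"])
```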

numpy_mask_tokens

( inputs: typing.Any mask_labels: typing.Any )

Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. Passing mask_labels means whole word masking (wwm) is used; the given indices are masked directly according to that reference.

tf_mask_tokens

( inputs: typing.Any mask_labels: typing.Any )

Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. Passing mask_labels means whole word masking (wwm) is used; the given indices are masked directly according to that reference.

torch_mask_tokens

( inputs: typing.Any mask_labels: typing.Any )

Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. Passing mask_labels means whole word masking (wwm) is used; the given indices are masked directly according to that reference.

DataCollatorForPermutationLanguageModeling

class transformers.DataCollatorForPermutationLanguageModeling

( tokenizer: PreTrainedTokenizerBase plm_probability: float = 0.16666666666666666 max_span_length: int = 5 return_tensors: str = 'pt' )

Data collator used for permutation language modeling.
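A minimal sketch (the checkpoint name and token ids are illustrative; note that this collator requires each sequence to have an even number of tokens):

```python
from transformers import AutoTokenizer, DataCollatorForPermutationLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
collator = DataCollatorForPermutationLanguageModeling(tokenizer=tokenizer, plm_probability=1 / 6)

features = [{"input_ids": [35, 1023, 24, 98, 410, 17, 56, 3, 4, 5]}]  # even length
batch = collator(features)
print(batch.keys())  # input_ids, perm_mask, target_mapping, labels
```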

numpy_mask_tokens

( inputs: typing.Any )

The masked tokens to be predicted for a particular sequence are determined by the following algorithm:

  1. Start from the beginning of the sequence by setting cur_len = 0 (number of tokens processed so far).
  2. Sample a span_length from the interval [1, max_span_length] (length of span of tokens to be masked)
  3. Reserve a context of length context_length = span_length / plm_probability to surround span to be masked
  4. Sample a starting point start_index from the interval [cur_len, cur_len + context_length - span_length] and mask tokens start_index:start_index + span_length
  5. Set cur_len = cur_len + context_length. If cur_len < max_len (i.e. there are tokens remaining in the sequence to be processed), repeat from Step 2.

tf_mask_tokens

( inputs: typing.Any )

The masked tokens to be predicted for a particular sequence are determined by the following algorithm:

  1. Start from the beginning of the sequence by setting cur_len = 0 (number of tokens processed so far).
  2. Sample a span_length from the interval [1, max_span_length] (length of span of tokens to be masked)
  3. Reserve a context of length context_length = span_length / plm_probability to surround span to be masked
  4. Sample a starting point start_index from the interval [cur_len, cur_len + context_length - span_length] and mask tokens start_index:start_index + span_length
  5. Set cur_len = cur_len + context_length. If cur_len < max_len (i.e. there are tokens remaining in the sequence to be processed), repeat from Step 2.

torch_mask_tokens

( inputs: typing.Any )

The masked tokens to be predicted for a particular sequence are determined by the following algorithm:

  1. Start from the beginning of the sequence by setting cur_len = 0 (number of tokens processed so far).
  2. Sample a span_length from the interval [1, max_span_length] (length of span of tokens to be masked)
  3. Reserve a context of length context_length = span_length / plm_probability to surround span to be masked
  4. Sample a starting point start_index from the interval [cur_len, cur_len + context_length - span_length] and mask tokens start_index:start_index + span_length
  5. Set cur_len = cur_len + context_length. If cur_len < max_len (i.e. there are tokens remaining in the sequence to be processed), repeat from Step 2.

DataCollatorWithFlattening

class transformers.DataCollatorWithFlattening

( *args return_position_ids = True separator_id = -100 return_flash_attn_kwargs = False return_seq_idx = False **kwargs )

Data collator used for the padding-free approach. Does the following:

  • concatenates the whole mini batch into a single long sequence of shape [1, total_tokens]
  • uses separator_id (default -100) to separate sequences within the concatenated labels
  • adds no padding and returns input_ids, labels and position_ids by default
  • optionally returns the FlashAttention kwargs (return_flash_attn_kwargs=True) and a seq_idx tensor marking which sequence each token belongs to (return_seq_idx=True)

Using DataCollatorWithFlattening will flatten the entire mini batch into a single long sequence. Make sure your attention computation is able to handle it!
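A minimal sketch (token ids and labels are illustrative): examples are concatenated instead of padded, and position_ids restart at 0 at each example boundary:

```python
from transformers import DataCollatorWithFlattening

collator = DataCollatorWithFlattening()
features = [
    {"input_ids": [101, 2023, 102], "labels": [-100, 3, -100]},
    {"input_ids": [101, 2008, 2003, 102], "labels": [-100, 4, 5, -100]},
]
batch = collator(features)
print(batch["input_ids"].shape)   # torch.Size([1, 7]): the whole mini batch as one sequence
print(batch["position_ids"][0])   # tensor([0, 1, 2, 0, 1, 2, 3])
```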

DataCollatorForMultipleChoice

class transformers.DataCollatorForMultipleChoice

( tokenizer: PreTrainedTokenizerBase padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = True max_length: typing.Optional[int] = None pad_to_multiple_of: typing.Optional[int] = None return_tensors: str = 'pt' )

Parameters

  • tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast): The tokenizer used for encoding the data.
  • padding (bool, str or PaddingStrategy, optional, defaults to True): Padding strategy for the inputs (see DataCollatorWithPadding).
  • max_length (int, optional): Maximum length of the returned sequences (and the padding length when padding='max_length').
  • pad_to_multiple_of (int, optional): If set, pad the sequence to a multiple of the provided value.
  • return_tensors (str, optional, defaults to "pt"): The type of Tensor to return. Allowable values are "np", "pt" and "tf".

Data collator that dynamically pads a batch of nested examples for multiple choice, so that all choices of all examples have the same length.
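A minimal sketch (the checkpoint name, prompt and choices are illustrative): each feature holds one tokenized (prompt, choice) pair per candidate, and the collator pads them and returns tensors of shape (batch_size, num_choices, seq_len):

```python
from transformers import AutoTokenizer, DataCollatorForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)

prompt = "The sky is"
choices = ["blue.", "a database engine."]
features = [{**tokenizer([prompt, prompt], choices), "label": 0}]
batch = collator(features)
print(batch["input_ids"].shape)  # (1, 2, seq_len)
print(batch["labels"])           # tensor([0])
```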
