Utilities for Tokenizers

Most of these utilities are only useful if you are studying the code of the tokenizers in the library.

class transformers.PreTrainedTokenizerBase

( **kwargs )

Base class for PreTrainedTokenizer and PreTrainedTokenizerFast.

Handles shared (mostly boilerplate) methods for those two classes.

__call__

( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None text_pair: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None text_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None text_pair_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy, NoneType] = None max_length: typing.Optional[int] = None stride: int = 0 is_split_into_words: bool = False pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs ) → BatchEncoding

Returns a BatchEncoding with the encoded sequences and additional model-ready fields (e.g. input_ids, token_type_ids, attention_mask), depending on the arguments.

Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.
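
For illustration, a minimal sketch (the checkpoint name is only an example; return_tensors="pt" assumes PyTorch is installed):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# Tokenize a small batch, padding to the longest sequence in the batch
batch = tokenizer(["Hello world!", "How are you?"], padding=True, return_tensors="pt")

print(batch["input_ids"].shape)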

apply_chat_template

( conversation: typing.Union[typing.List[typing.Dict[str, str]], typing.List[typing.List[typing.Dict[str, str]]]] tools: typing.Optional[typing.List[typing.Union[typing.Dict, typing.Callable]]] = None documents: typing.Optional[typing.List[typing.Dict[str, str]]] = None chat_template: typing.Optional[str] = None add_generation_prompt: bool = False continue_final_message: bool = False tokenize: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: bool = False max_length: typing.Optional[int] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_dict: bool = False return_assistant_tokens_mask: bool = False tokenizer_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None **kwargs ) → Union[List[int], Dict]

Returns

Union[List[int], Dict]

A list of token ids representing the tokenized chat so far, including control tokens. This output is ready to pass to the model, either directly or via methods like generate(). If return_dict is set, will return a dict of tokenizer outputs instead.

Converts a list of dictionaries with "role" and "content" keys to a list of token ids. This method is intended for use with chat models, and will read the tokenizer’s chat_template attribute to determine the format and control tokens to use when converting.
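
A minimal sketch (the checkpoint is illustrative; any model whose tokenizer defines a chat_template works the same way):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

chat = [{"role": "user", "content": "Hello, how are you?"}]

# add_generation_prompt appends the tokens that cue the model to reply
ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True)

print(tokenizer.decode(ids))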

as_target_tokenizer

( )

Temporarily sets the tokenizer for encoding the targets. Useful for tokenizers associated with sequence-to-sequence models that need a slightly different processing for the labels.

batch_decode

( sequences: typing.Union[typing.List[int], typing.List[typing.List[int]], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), ForwardRef('tf.Tensor')] skip_special_tokens: bool = False clean_up_tokenization_spaces: typing.Optional[bool] = None **kwargs ) → List[str]

The list of decoded sentences.

Convert a list of lists of token ids into a list of strings by calling decode.
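
A short sketch (checkpoint name illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

ids = tokenizer(["Hello world!", "How are you?"])["input_ids"]

# skip_special_tokens drops [CLS]/[SEP] from the decoded strings
print(tokenizer.batch_decode(ids, skip_special_tokens=True))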

batch_encode_plus

( batch_text_or_text_pairs: typing.Union[typing.List[str], typing.List[typing.Tuple[str, str]], typing.List[typing.List[str]], typing.List[typing.Tuple[typing.List[str], typing.List[str]]], typing.List[typing.List[int]], typing.List[typing.Tuple[typing.List[int], typing.List[int]]]] add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy, NoneType] = None max_length: typing.Optional[int] = None stride: int = 0 is_split_into_words: bool = False pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True split_special_tokens: bool = False **kwargs ) → BatchEncoding

Returns a BatchEncoding with the encoded sequences and additional model-ready fields, as for __call__.

Tokenize and prepare for the model a list of sequences or a list of pairs of sequences.

This method is deprecated, __call__ should be used instead.

build_inputs_with_special_tokens

( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]

The model input with special tokens.

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.

This implementation does not add special tokens and this method should be overridden in a subclass.

clean_up_tokenization

( out_string: str ) → str

The cleaned-up string.

Clean up a list of simple English tokenization artifacts, such as spaces before punctuation and abbreviated forms.

convert_tokens_to_string

( tokens: typing.List[str] ) → str

The joined tokens.

Converts a sequence of tokens into a single string. The simplest way to do it is " ".join(tokens), but we often want to remove sub-word tokenization artifacts at the same time.

create_token_type_ids_from_sequences

( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]

The token type ids.

Create the token type IDs corresponding to the sequences passed. What are token type IDs?

Should be overridden in a subclass if the model has a special way of building those.
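
A small sketch of BERT's convention (zeros over the first segment, ones over the second; checkpoint illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("hello"))
ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("world"))

# [CLS] a [SEP] -> 0s, b [SEP] -> 1s, e.g. [0, 0, 0, 1, 1]
print(tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b))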

decode

( token_ids: typing.Union[int, typing.List[int], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), ForwardRef('tf.Tensor')] skip_special_tokens: bool = False clean_up_tokenization_spaces: typing.Optional[bool] = None **kwargs ) → str

The decoded sentence.

Converts a sequence of ids in a string, using the tokenizer and vocabulary with options to remove special tokens and clean up tokenization spaces.

Similar to doing self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids)).
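
For example (checkpoint illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

ids = tokenizer.encode("Hello world!")

print(tokenizer.decode(ids))                            # keeps [CLS] ... [SEP]
print(tokenizer.decode(ids, skip_special_tokens=True))  # special tokens removed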

encode

( text: typing.Union[str, typing.List[str], typing.List[int]] text_pair: typing.Union[str, typing.List[str], typing.List[int], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy, NoneType] = None max_length: typing.Optional[int] = None stride: int = 0 padding_side: typing.Optional[str] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None **kwargs ) → List[int], torch.Tensor, tf.Tensor or np.ndarray

Returns

List[int], torch.Tensor, tf.Tensor or np.ndarray

The tokenized ids of the text.

Converts a string to a sequence of ids (integer), using the tokenizer and vocabulary.

Same as doing self.convert_tokens_to_ids(self.tokenize(text)).
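
A quick sketch of that equivalence (checkpoint illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

ids = tokenizer.encode("Hello world!", add_special_tokens=False)

# Without special tokens, encode() matches tokenize + convert_tokens_to_ids
assert ids == tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world!"))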

encode_plus

( text: typing.Union[str, typing.List[str], typing.List[int]] text_pair: typing.Union[str, typing.List[str], typing.List[int], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy, NoneType] = None max_length: typing.Optional[int] = None stride: int = 0 is_split_into_words: bool = False pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs ) → BatchEncoding

Returns a BatchEncoding with the encoded sequences and additional model-ready fields, as for __call__.

Tokenize and prepare for the model a sequence or a pair of sequences.

This method is deprecated, __call__ should be used instead.

from_pretrained

( pretrained_model_name_or_path: typing.Union[str, os.PathLike] *init_inputs cache_dir: typing.Union[str, os.PathLike, NoneType] = None force_download: bool = False local_files_only: bool = False token: typing.Union[bool, str, NoneType] = None revision: str = 'main' trust_remote_code = False **kwargs )

Instantiate a PreTrainedTokenizerBase (or a derived class) from a predefined tokenizer.

Passing token=True is required when you want to use a private model.

Examples:

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")

tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

tokenizer = BertTokenizer.from_pretrained("./test/saved_model/")

tokenizer = BertTokenizer.from_pretrained("./test/saved_model/my_vocab.txt")

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased", unk_token="")

assert tokenizer.unk_token == ""

get_chat_template

( chat_template: typing.Optional[str] = None tools: typing.Optional[typing.List[typing.Dict]] = None ) → str

The chat template string.

Retrieve the chat template string used for tokenizing chat messages. This template is used internally by the apply_chat_template method and can also be used externally to retrieve the model’s chat template for better generation tracking.

get_special_tokens_mask

( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None already_has_special_tokens: bool = False ) → A list of integers in the range [0, 1]

Returns

A list of integers in the range [0, 1]

1 for a special token, 0 for a sequence token.

Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model or encode_plus methods.
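
For example (checkpoint illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

ids = tokenizer.encode("hello world")  # [CLS] hello world [SEP]

# 1 marks special tokens, 0 marks regular sequence tokens, e.g. [1, 0, 0, 1]
print(tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True))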

get_vocab

( ) → Dict[str, int]

Returns the vocabulary as a dictionary of token to index.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.
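
For example (checkpoint illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

vocab = tokenizer.get_vocab()

assert vocab["hello"] == tokenizer.convert_tokens_to_ids("hello")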

pad

( encoded_inputs: typing.Union[transformers.tokenization_utils_base.BatchEncoding, typing.List[transformers.tokenization_utils_base.BatchEncoding], typing.Dict[str, typing.List[int]], typing.Dict[str, typing.List[typing.List[int]]], typing.List[typing.Dict[str, typing.List[int]]]] padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = True max_length: typing.Optional[int] = None pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_attention_mask: typing.Optional[bool] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None verbose: bool = True )

Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length in the batch.

Padding side (left/right) and padding token ids are defined at the tokenizer level (with self.padding_side, self.pad_token_id and self.pad_token_type_id).

Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.

If the encoded_inputs passed are a dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the result will use the same type unless you provide a different tensor type with return_tensors. In the case of PyTorch tensors, however, you will lose the specific device of your tensors.
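
A minimal sketch (checkpoint illustrative; return_tensors="pt" assumes PyTorch is installed):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# Encode without padding first, then pad the whole batch in a single call
encoded = [tokenizer(text) for text in ["short", "a much longer sentence"]]
batch = tokenizer.pad(encoded, padding=True, return_tensors="pt")

print(batch["input_ids"].shape)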

prepare_for_model

( ids: typing.List[int] pair_ids: typing.Optional[typing.List[int]] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy, NoneType] = None max_length: typing.Optional[int] = None stride: int = 0 pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True prepend_batch_axis: bool = False **kwargs ) → BatchEncoding

Returns a BatchEncoding with the encoded sequences and additional model-ready fields, as for __call__.

Prepares a sequence of input ids, or a pair of sequences of input ids, so that it can be used by the model. It adds special tokens, truncates sequences if overflowing while taking into account the special tokens, and manages a moving window (with user-defined stride) for overflowing tokens. Note that for pair_ids different from None and truncation_strategy = longest_first or True, it is not possible to return overflowing tokens. Such a combination of arguments will raise an error.
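
For example (checkpoint illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("hello world"))

# Adds [CLS]/[SEP] and builds the usual model-ready fields from raw ids
enc = tokenizer.prepare_for_model(ids)

print(enc["input_ids"])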

prepare_seq2seq_batch

( src_texts: typing.List[str] tgt_texts: typing.Optional[typing.List[str]] = None max_length: typing.Optional[int] = None max_target_length: typing.Optional[int] = None padding: str = 'longest' return_tensors: typing.Optional[str] = None truncation: bool = True **kwargs ) → BatchEncoding

A BatchEncoding with the following fields:

input_ids — List of token ids to be fed to the encoder.
attention_mask — List of indices specifying which tokens should be attended to by the model.
labels — List of token ids for tgt_texts.

The full set of keys [input_ids, attention_mask, labels] will only be returned if tgt_texts is passed. Otherwise, input_ids and attention_mask will be the only keys.

Prepare model inputs for translation. For best performance, translate one sentence at a time.

push_to_hub

( repo_id: str use_temp_dir: typing.Optional[bool] = None commit_message: typing.Optional[str] = None private: typing.Optional[bool] = None token: typing.Union[bool, str, NoneType] = None max_shard_size: typing.Union[int, str, NoneType] = '5GB' create_pr: bool = False safe_serialization: bool = True revision: typing.Optional[str] = None commit_description: typing.Optional[str] = None tags: typing.Optional[list[str]] = None **deprecated_kwargs )

Upload the tokenizer files to the 🤗 Model Hub.

Examples:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

tokenizer.push_to_hub("my-finetuned-bert")

tokenizer.push_to_hub("huggingface/my-finetuned-bert")

register_for_auto_class

( auto_class = 'AutoTokenizer' )

Register this class with a given auto class. This should only be used for custom tokenizers as the ones in the library are already mapped with AutoTokenizer.

This API is experimental and may have some slight breaking changes in the next releases.

save_chat_templates

( save_directory: typing.Union[str, os.PathLike] tokenizer_config: dict filename_prefix: typing.Optional[str] save_jinja_files: bool )

Writes chat templates out to the save directory if we’re using the new format, and removes them from the tokenizer config if present. If we’re using the legacy format, it doesn’t write any files, and instead writes the templates to the tokenizer config in the correct format.

save_pretrained

( save_directory: typing.Union[str, os.PathLike] legacy_format: typing.Optional[bool] = None filename_prefix: typing.Optional[str] = None push_to_hub: bool = False **kwargs ) → A tuple of str

The files saved.

Save the full tokenizer state.

This method makes sure the full tokenizer can then be re-loaded using the from_pretrained() class method.

Warning: This won't save modifications you may have applied to the tokenizer after the instantiation (for instance, modifying tokenizer.do_lower_case after creation).
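
A quick round-trip sketch (the checkpoint and directory path are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

tokenizer.save_pretrained("./my-tokenizer")
reloaded = AutoTokenizer.from_pretrained("./my-tokenizer")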

save_vocabulary

( save_directory: str filename_prefix: typing.Optional[str] = None ) → Tuple(str)

Paths to the files saved.

Save only the vocabulary of the tokenizer (vocabulary + added tokens).

This method won't save the configuration and special token mappings of the tokenizer. Use save_pretrained() to save the whole state of the tokenizer.

tokenize

( text: str pair: typing.Optional[str] = None add_special_tokens: bool = False **kwargs ) → List[str]

The list of tokens.

Converts a string into a sequence of tokens, replacing unknown tokens with the unk_token.
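
For example (checkpoint illustrative; the exact pieces depend on the vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# WordPiece splits rare words into sub-word pieces prefixed with '##'
print(tokenizer.tokenize("Tokenization!"))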

truncate_sequences

( ids: typing.List[int] pair_ids: typing.Optional[typing.List[int]] = None num_tokens_to_remove: int = 0 truncation_strategy: typing.Union[str, transformers.tokenization_utils_base.TruncationStrategy] = 'longest_first' stride: int = 0 ) → Tuple[List[int], List[int], List[int]]

Returns

Tuple[List[int], List[int], List[int]]

The truncated ids, the truncated pair_ids and the list of overflowing tokens. Note: the longest_first strategy returns an empty list of overflowing tokens if a pair of sequences (or a batch of pairs) is provided.

Truncates a sequence pair in-place following the strategy.
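
A small sketch with raw id lists (checkpoint illustrative; the default strategy is longest_first):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

ids = list(range(10))
pair_ids = list(range(100, 108))

# Remove 4 tokens, trimming the currently longer sequence first
ids, pair_ids, overflow = tokenizer.truncate_sequences(ids, pair_ids=pair_ids, num_tokens_to_remove=4)

print(len(ids), len(pair_ids))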

class transformers.SpecialTokensMixin

( verbose = False **kwargs )

A mixin derived by PreTrainedTokenizer and PreTrainedTokenizerFast to handle specific behaviors related to special tokens. In particular, this class holds the attributes that can be used to directly access these special tokens in a model-independent manner, and allows setting and updating the special tokens.

add_special_tokens

( special_tokens_dict: typing.Dict[str, typing.Union[str, tokenizers.AddedToken, typing.Sequence[typing.Union[str, tokenizers.AddedToken]]]] replace_additional_special_tokens = True ) → int

Number of tokens added to the vocabulary.

Add a dictionary of special tokens (eos, pad, cls, etc.) to the encoder and link them to class attributes. If special tokens are NOT in the vocabulary, they are added to it (indexed starting from the last index of the current vocabulary).

When adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of the model so that its embedding matrix matches the tokenizer.

In order to do that, please use the resize_token_embeddings() method.

Using add_special_tokens will ensure your special tokens can be used in several ways:

Special tokens can be skipped when decoding using skip_special_tokens = True.

Special tokens are carefully handled by the tokenizer (they are never split).

You can easily refer to special tokens using tokenizer class attributes like tokenizer.cls_token. This makes it easy to develop model-agnostic training and fine-tuning scripts.

When possible, special tokens are already registered for provided pretrained models (for instance BertTokenizer's cls_token is already registered to be '[CLS]' and XLM's one is also registered to be '</s>').

Examples:

tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
model = GPT2Model.from_pretrained("openai-community/gpt2")

special_tokens_dict = {"cls_token": "<CLS>"}

num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print("We have added", num_added_toks, "tokens")

model.resize_token_embeddings(len(tokenizer))

assert tokenizer.cls_token == "<CLS>"

add_tokens

( new_tokens: typing.Union[str, tokenizers.AddedToken, typing.Sequence[typing.Union[str, tokenizers.AddedToken]]] special_tokens: bool = False ) → int

Number of tokens added to the vocabulary.

Add a list of new tokens to the tokenizer class. If the new tokens are not in the vocabulary, they are added to it with indices starting from length of the current vocabulary and will be isolated before the tokenization algorithm is applied. Added tokens and tokens from the vocabulary of the tokenization algorithm are therefore not treated in the same way.

Note, when adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of the model so that its embedding matrix matches the tokenizer.

In order to do that, please use the resize_token_embeddings() method.

Examples:

tokenizer = BertTokenizerFast.from_pretrained("google-bert/bert-base-uncased")
model = BertModel.from_pretrained("google-bert/bert-base-uncased")

num_added_toks = tokenizer.add_tokens(["new_tok1", "my_new-tok2"])
print("We have added", num_added_toks, "tokens")

model.resize_token_embeddings(len(tokenizer))

sanitize_special_tokens is now deprecated, kept for backward compatibility, and will be removed in transformers v5.

class transformers.tokenization_utils_base.TruncationStrategy

( value names = None module = None qualname = None type = None start = 1 )

Possible values for the truncation argument in PreTrainedTokenizerBase.__call__(). Useful for tab-completion in an IDE.

class transformers.CharSpan

( start: int end: int )

Parameters

start (int) — Index of the first character in the original string.
end (int) — Index of the character following the last character in the original string.

Character span in the original string.

class transformers.TokenSpan

( start: int end: int )

Parameters

start (int) — Index of the first token in the span.
end (int) — Index of the token following the last token in the span.

Token span in an encoded string (list of tokens).