LayoutLMv3

Overview

The LayoutLMv3 model was proposed in LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. LayoutLMv3 simplifies LayoutLMv2 by using patch embeddings (as in ViT) instead of leveraging a CNN backbone, and pre-trains the model on 3 objectives: masked language modeling (MLM), masked image modeling (MIM) and word-patch alignment (WPA).
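
To make the patch-embedding scheme concrete, here is a minimal sketch (an illustration, not code from the paper) that derives the number of visual patch tokens from the configuration defaults documented further below:

from transformers import LayoutLMv3Config

config = LayoutLMv3Config()  # defaults: 224x224 input images, 16x16 patches (as in ViT)
num_patches = (config.input_size // config.patch_size) ** 2
print(num_patches)  # 196 patch embeddings form the visual token sequence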

The abstract from the paper is the following:

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.

LayoutLMv3 architecture. Taken from the original paper.

This model was contributed by nielsr. The TensorFlow version of this model was added by chriskoo, tokec, and lre. The original code can be found here.

Usage tips

In terms of data processing, LayoutLMv3 is nearly identical to its predecessor LayoutLMv2, except that:

images need to be resized and normalized with channels in regular RGB format, whereas LayoutLMv2 normalizes the images internally and expects the channels in BGR format;

text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.

Because of these differences in data preprocessing, one can use LayoutLMv3Processor, which internally combines a LayoutLMv3ImageProcessor (for the image modality) and a LayoutLMv3Tokenizer/LayoutLMv3TokenizerFast (for the text modality), to prepare all data for the model.

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LayoutLMv3. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2 resources you can adapt for LayoutLMv3 tasks. For these notebooks, take care to use LayoutLMv3Processor instead when preparing data for the model!

Document question answering

LayoutLMv3Config

class transformers.LayoutLMv3Config

< source >

( vocab_size = 50265 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.1 attention_probs_dropout_prob = 0.1 max_position_embeddings = 512 type_vocab_size = 2 initializer_range = 0.02 layer_norm_eps = 1e-05 pad_token_id = 1 bos_token_id = 0 eos_token_id = 2 max_2d_position_embeddings = 1024 coordinate_size = 128 shape_size = 128 has_relative_attention_bias = True rel_pos_bins = 32 max_rel_pos = 128 rel_2d_pos_bins = 64 max_rel_2d_pos = 256 has_spatial_attention_bias = True text_embed = True visual_embed = True input_size = 224 num_channels = 3 patch_size = 16 classifier_dropout = None **kwargs )

Parameters

This is the configuration class to store the configuration of a LayoutLMv3Model. It is used to instantiate a LayoutLMv3 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the LayoutLMv3 microsoft/layoutlmv3-base architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

from transformers import LayoutLMv3Config, LayoutLMv3Model

# Initializing a LayoutLMv3 microsoft/layoutlmv3-base style configuration
configuration = LayoutLMv3Config()

# Initializing a model (with random weights) from that configuration
model = LayoutLMv3Model(configuration)

# Accessing the model configuration
configuration = model.config

LayoutLMv3FeatureExtractor

Preprocess an image or a batch of images. (LayoutLMv3FeatureExtractor is the legacy preprocessing class; new code should prefer the equivalent LayoutLMv3ImageProcessor below.)

LayoutLMv3ImageProcessor

class transformers.LayoutLMv3ImageProcessor

< source >

( do_resize: bool = True size: typing.Optional[typing.Dict[str, int]] = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_value: float = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.Iterable[float], NoneType] = None image_std: typing.Union[float, typing.Iterable[float], NoneType] = None apply_ocr: bool = True ocr_lang: typing.Optional[str] = None tesseract_config: typing.Optional[str] = '' **kwargs )

Parameters

Constructs a LayoutLMv3 image processor.

preprocess

< source >

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_resize: typing.Optional[bool] = None size: typing.Optional[typing.Dict[str, int]] = None resample = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, typing.Iterable[float], NoneType] = None image_std: typing.Union[float, typing.Iterable[float], NoneType] = None apply_ocr: typing.Optional[bool] = None ocr_lang: typing.Optional[str] = None tesseract_config: typing.Optional[str] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

Parameters

Preprocess an image or batch of images.
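
As a usage sketch (the image path is a placeholder), the apply_ocr flag controls whether Tesseract OCR is run in addition to the resize/rescale/normalize pipeline; apply_ocr=True requires the pytesseract package and the Tesseract binary:

from PIL import Image
from transformers import LayoutLMv3ImageProcessor

image = Image.open("document.png").convert("RGB")  # placeholder path

# Default: also run Tesseract OCR and return words plus normalized boxes
image_processor = LayoutLMv3ImageProcessor(apply_ocr=True)
features = image_processor(image, return_tensors="pt")
print(features.keys())  # pixel_values, words, boxes

# OCR already done elsewhere: only resize, rescale and normalize
image_processor = LayoutLMv3ImageProcessor(apply_ocr=False)
features = image_processor(image, return_tensors="pt")
print(features["pixel_values"].shape)  # (1, 3, 224, 224)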

LayoutLMv3ImageProcessorFast

class transformers.LayoutLMv3ImageProcessorFast

< source >

( **kwargs: typing_extensions.Unpack[transformers.models.layoutlmv3.image_processing_layoutlmv3_fast.LayoutLMv3FastImageProcessorKwargs] )

Parameters

Constructs a fast LayoutLMv3 image processor.

preprocess

< source >

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] **kwargs: typing_extensions.Unpack[transformers.models.layoutlmv3.image_processing_layoutlmv3_fast.LayoutLMv3FastImageProcessorKwargs] ) β†’ <class 'transformers.image_processing_base.BatchFeature'>

Parameters

Returns

<class 'transformers.image_processing_base.BatchFeature'>
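
A minimal sketch of the fast processor, which returns a BatchFeature like the slow class; it assumes the fast class mirrors the slow processor's apply_ocr flag, and the image path is a placeholder:

from PIL import Image
from transformers import LayoutLMv3ImageProcessorFast

image = Image.open("document.png").convert("RGB")  # placeholder path
image_processor = LayoutLMv3ImageProcessorFast.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False
)
batch = image_processor(image, return_tensors="pt")  # a BatchFeature
print(batch["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])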

LayoutLMv3Tokenizer

class transformers.LayoutLMv3Tokenizer

< source >

( vocab_file merges_file errors = 'replace' bos_token = '&lt;s&gt;' eos_token = '&lt;/s&gt;' sep_token = '&lt;/s&gt;' cls_token = '&lt;s&gt;' unk_token = '&lt;unk&gt;' pad_token = '&lt;pad&gt;' mask_token = '&lt;mask&gt;' add_prefix_space = True cls_token_box = [0, 0, 0, 0] sep_token_box = [0, 0, 0, 0] pad_token_box = [0, 0, 0, 0] pad_token_label = -100 only_label_first_subword = True **kwargs )

Parameters

Construct a LayoutLMv3 tokenizer, based on the RoBERTa tokenizer (byte-level Byte-Pair Encoding, BPE). LayoutLMv3Tokenizer can be used to turn words, word-level bounding boxes and optional word labels into token-level input_ids, attention_mask, token_type_ids, bbox, and optional labels (for token classification).

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

LayoutLMv3Tokenizer runs end-to-end tokenization: punctuation splitting and byte-pair encoding. It also turns the word-level bounding boxes into token-level bounding boxes.

__call__

< source >

( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] text_pair: typing.Union[typing.List[str], typing.List[typing.List[str]], NoneType] = None boxes: typing.Union[typing.List[typing.List[int]], typing.List[typing.List[typing.List[int]]], NoneType] = None word_labels: typing.Union[typing.List[int], typing.List[typing.List[int]], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = None max_length: typing.Optional[int] = None stride: int = 0 pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs )

Parameters

Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences with word-level normalized bounding boxes and optional labels.
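
A small sketch of the word/box interface (the words and coordinates are made up; boxes must already be normalized to the 0-1000 range):

from transformers import LayoutLMv3Tokenizer

tokenizer = LayoutLMv3Tokenizer.from_pretrained("microsoft/layoutlmv3-base")

words = ["hello", "world"]  # e.g. words from an OCR engine
boxes = [[637, 773, 693, 782], [698, 773, 733, 782]]  # normalized 0-1000 boxes

encoding = tokenizer(words, boxes=boxes, return_tensors="pt")
# Each word-level box is repeated for every sub-word token that word produces
print(encoding["input_ids"].shape, encoding["bbox"].shape)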

save_vocabulary

< source >

( save_directory: str filename_prefix: typing.Optional[str] = None )

LayoutLMv3TokenizerFast

class transformers.LayoutLMv3TokenizerFast

< source >

( vocab_file = None merges_file = None tokenizer_file = None errors = 'replace' bos_token = '&lt;s&gt;' eos_token = '&lt;/s&gt;' sep_token = '&lt;/s&gt;' cls_token = '&lt;s&gt;' unk_token = '&lt;unk&gt;' pad_token = '&lt;pad&gt;' mask_token = '&lt;mask&gt;' add_prefix_space = True trim_offsets = True cls_token_box = [0, 0, 0, 0] sep_token_box = [0, 0, 0, 0] pad_token_box = [0, 0, 0, 0] pad_token_label = -100 only_label_first_subword = True **kwargs )

Parameters

Construct a "fast" LayoutLMv3 tokenizer (backed by HuggingFace's tokenizers library), based on byte-level Byte-Pair Encoding (BPE).

This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

__call__

< source >

( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] text_pair: typing.Union[typing.List[str], typing.List[typing.List[str]], NoneType] = None boxes: typing.Union[typing.List[typing.List[int]], typing.List[typing.List[typing.List[int]]], NoneType] = None word_labels: typing.Union[typing.List[int], typing.List[typing.List[int]], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = None max_length: typing.Optional[int] = None stride: int = 0 pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs )

Parameters

Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences with word-level normalized bounding boxes and optional labels.
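
The fast tokenizer accepts the same words/boxes/word_labels interface as the slow one; a sketch (words, boxes and labels are made up) showing return_offsets_mapping, which is only available on fast tokenizers, and label propagation under the default only_label_first_subword=True:

from transformers import LayoutLMv3TokenizerFast

tokenizer = LayoutLMv3TokenizerFast.from_pretrained("microsoft/layoutlmv3-base")

words = ["hello", "world"]
boxes = [[637, 773, 693, 782], [698, 773, 733, 782]]
word_labels = [0, 1]  # one label per word

encoding = tokenizer(
    words, boxes=boxes, word_labels=word_labels, return_offsets_mapping=True
)
# Only the first sub-word of each word keeps its label; the rest are set to -100
print(encoding["labels"])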

LayoutLMv3Processor

class transformers.LayoutLMv3Processor

< source >

( image_processor = None tokenizer = None **kwargs )

Parameters

Constructs a LayoutLMv3 processor which combines a LayoutLMv3 image processor and a LayoutLMv3 tokenizer into a single processor.

LayoutLMv3Processor offers all the functionalities you need to prepare data for the model.

It first uses LayoutLMv3ImageProcessor to resize and normalize document images, and optionally applies OCR to get words and normalized bounding boxes. These are then provided to LayoutLMv3Tokenizer or LayoutLMv3TokenizerFast, which turns the words and bounding boxes into token-level input_ids, attention_mask, token_type_ids, bbox. Optionally, one can provide integer word_labels, which are turned into token-level labels for token classification tasks (such as FUNSD, CORD).

__call__

< source >

( images text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None text_pair: typing.Union[typing.List[str], typing.List[typing.List[str]], NoneType] = None boxes: typing.Union[typing.List[typing.List[int]], typing.List[typing.List[typing.List[int]]], NoneType] = None word_labels: typing.Union[typing.List[int], typing.List[typing.List[int]], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = None max_length: typing.Optional[int] = None stride: int = 0 pad_to_multiple_of: typing.Optional[int] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None **kwargs )

This method first forwards the images argument to LayoutLMv3ImageProcessor.__call__(). In case LayoutLMv3ImageProcessor was initialized with apply_ocr set to True, it passes the obtained words and bounding boxes along with the additional arguments to LayoutLMv3Tokenizer.__call__() and returns the output, together with the resized and normalized pixel_values. In case LayoutLMv3ImageProcessor was initialized with apply_ocr set to False, it passes the words (text/text_pair) and boxes specified by the user along with the additional arguments to LayoutLMv3Tokenizer.__call__() and returns the output, together with the resized and normalized pixel_values.

Please refer to the docstring of the above two methods for more information.
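
A sketch of the two processor flows described above (the image path is a placeholder; words and boxes are made up):

from PIL import Image
from transformers import LayoutLMv3Processor

image = Image.open("document.png").convert("RGB")  # placeholder path

# Flow 1: let the processor run OCR (default apply_ocr=True)
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
encoding = processor(image, return_tensors="pt")

# Flow 2: supply your own words and normalized boxes
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
words = ["hello", "world"]
boxes = [[637, 773, 693, 782], [698, 773, 733, 782]]
encoding = processor(image, words, boxes=boxes, return_tensors="pt")
print(encoding.keys())  # input_ids, attention_mask, bbox, pixel_values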

LayoutLMv3Model

class transformers.LayoutLMv3Model

< source >

( config )

Parameters

The bare LayoutLMv3 Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< source >

( input_ids: typing.Optional[torch.LongTensor] = None bbox: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.FloatTensor] = None token_type_ids: typing.Optional[torch.LongTensor] = None position_ids: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.FloatTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) β†’ transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.BaseModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.

The LayoutLMv3Model forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import AutoProcessor, AutoModel
from datasets import load_dataset

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = AutoModel.from_pretrained("microsoft/layoutlmv3-base")

dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train", trust_remote_code=True)
example = dataset[0]
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")

outputs = model(**encoding)
last_hidden_states = outputs.last_hidden_state

LayoutLMv3ForSequenceClassification

class transformers.LayoutLMv3ForSequenceClassification

< source >

( config )

Parameters

LayoutLMv3 Model with a sequence classification head on top (a linear layer on top of the final hidden state of the [CLS] token), e.g. for document image classification tasks such as the RVL-CDIP dataset.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< source >

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.FloatTensor] = None token_type_ids: typing.Optional[torch.LongTensor] = None position_ids: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.FloatTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None bbox: typing.Optional[torch.LongTensor] = None pixel_values: typing.Optional[torch.LongTensor] = None ) β†’ transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.

The LayoutLMv3ForSequenceClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import AutoProcessor, AutoModelForSequenceClassification
from datasets import load_dataset
import torch

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = AutoModelForSequenceClassification.from_pretrained("microsoft/layoutlmv3-base")

dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train", trust_remote_code=True)
example = dataset[0]
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
sequence_label = torch.tensor([1])

outputs = model(**encoding, labels=sequence_label)
loss = outputs.loss
logits = outputs.logits

LayoutLMv3ForTokenClassification

class transformers.LayoutLMv3ForTokenClassification

< source >

( config )

Parameters

LayoutLMv3 Model with a token classification head on top (a linear layer on top of the final hidden states), e.g. for sequence labeling (information extraction) tasks such as FUNSD, SROIE, CORD and Kleister-NDA.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< source >

( input_ids: typing.Optional[torch.LongTensor] = None bbox: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.FloatTensor] = None token_type_ids: typing.Optional[torch.LongTensor] = None position_ids: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.FloatTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None pixel_values: typing.Optional[torch.LongTensor] = None ) β†’ transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.TokenClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.

The LayoutLMv3ForTokenClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import AutoProcessor, AutoModelForTokenClassification
from datasets import load_dataset

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = AutoModelForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=7)

dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train", trust_remote_code=True)
example = dataset[0]
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]
word_labels = example["ner_tags"]

encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")

outputs = model(**encoding)
loss = outputs.loss
logits = outputs.logits

LayoutLMv3ForQuestionAnswering

class transformers.LayoutLMv3ForQuestionAnswering

< source >

( config )

Parameters

The LayoutLMv3 transformer with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< source >

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.FloatTensor] = None token_type_ids: typing.Optional[torch.LongTensor] = None position_ids: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.FloatTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None start_positions: typing.Optional[torch.LongTensor] = None end_positions: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None bbox: typing.Optional[torch.LongTensor] = None pixel_values: typing.Optional[torch.LongTensor] = None ) β†’ transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.QuestionAnsweringModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.

The LayoutLMv3ForQuestionAnswering forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import AutoProcessor, AutoModelForQuestionAnswering
from datasets import load_dataset
import torch

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = AutoModelForQuestionAnswering.from_pretrained("microsoft/layoutlmv3-base")

dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train", trust_remote_code=True)
example = dataset[0]
image = example["image"]
question = "what's his name?"
words = example["tokens"]
boxes = example["bboxes"]

encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
start_positions = torch.tensor([1])
end_positions = torch.tensor([3])

outputs = model(**encoding, start_positions=start_positions, end_positions=end_positions)
loss = outputs.loss
start_scores = outputs.start_logits
end_scores = outputs.end_logits

TFLayoutLMv3Model

class transformers.TFLayoutLMv3Model

< source >

( config *inputs **kwargs )

Parameters

The bare LayoutLMv3 Model transformer outputting raw hidden-states without any specific head on top. This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

having all inputs as keyword arguments (like PyTorch models), or

having all inputs as a list, tuple or dict in the first positional argument.

The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit() things should "just work" for you - just pass your inputs and labels in any format that model.fit() supports! If, however, you want to use the second format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with the Keras Functional API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:

a single Tensor with input_ids only and nothing else: model(input_ids)

a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])

a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})

Note that when creating models and layers with subclassing then you don't need to worry about any of this, as you can just pass inputs like you would to any other Python function!

call

< source >

( input_ids: tf.Tensor | None = None bbox: tf.Tensor | None = None attention_mask: tf.Tensor | None = None token_type_ids: tf.Tensor | None = None position_ids: tf.Tensor | None = None head_mask: tf.Tensor | None = None inputs_embeds: tf.Tensor | None = None pixel_values: tf.Tensor | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None training: bool = False ) β†’ transformers.modeling_tf_outputs.TFBaseModelOutput or tuple(tf.Tensor)

Parameters

A transformers.modeling_tf_outputs.TFBaseModelOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.

The TFLayoutLMv3Model forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import AutoProcessor, TFAutoModel
from datasets import load_dataset

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = TFAutoModel.from_pretrained("microsoft/layoutlmv3-base")

dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train", trust_remote_code=True)
example = dataset[0]
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]

encoding = processor(image, words, boxes=boxes, return_tensors="tf")

outputs = model(**encoding)
last_hidden_states = outputs.last_hidden_state

TFLayoutLMv3ForSequenceClassification

class transformers.TFLayoutLMv3ForSequenceClassification

< source >

( config: LayoutLMv3Config **kwargs )

Parameters

LayoutLMv3 Model with a sequence classification head on top (a linear layer on top of the final hidden state of the [CLS] token), e.g. for document image classification tasks such as the RVL-CDIP dataset.

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

having all inputs as keyword arguments (like PyTorch models), or

having all inputs as a list, tuple or dict in the first positional argument.

The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit() things should "just work" for you - just pass your inputs and labels in any format that model.fit() supports! If, however, you want to use the second format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with the Keras Functional API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:

a single Tensor with input_ids only and nothing else: model(input_ids)

a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])

a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})

Note that when creating models and layers with subclassing then you don't need to worry about any of this, as you can just pass inputs like you would to any other Python function!

call

< source >

( input_ids: tf.Tensor | None = None attention_mask: tf.Tensor | None = None token_type_ids: tf.Tensor | None = None position_ids: tf.Tensor | None = None head_mask: tf.Tensor | None = None inputs_embeds: tf.Tensor | None = None labels: tf.Tensor | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None bbox: tf.Tensor | None = None pixel_values: tf.Tensor | None = None training: Optional[bool] = False ) β†’ transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)

Parameters

A transformers.modeling_tf_outputs.TFSequenceClassifierOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.

The TFLayoutLMv3ForSequenceClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import AutoProcessor, TFAutoModelForSequenceClassification
from datasets import load_dataset
import tensorflow as tf

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = TFAutoModelForSequenceClassification.from_pretrained("microsoft/layoutlmv3-base")

dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train", trust_remote_code=True)
example = dataset[0]
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]

encoding = processor(image, words, boxes=boxes, return_tensors="tf")
sequence_label = tf.convert_to_tensor([1])

outputs = model(**encoding, labels=sequence_label)
loss = outputs.loss
logits = outputs.logits

TFLayoutLMv3ForTokenClassification

class transformers.TFLayoutLMv3ForTokenClassification

< source >

( config: LayoutLMv3Config **kwargs )

Parameters

LayoutLMv3 Model with a token classification head on top (a linear layer on top of the final hidden states), e.g. for sequence labeling (information extraction) tasks such as FUNSD, SROIE, CORD and Kleister-NDA.

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

having all inputs as keyword arguments (like PyTorch models), or

having all inputs as a list, tuple or dict in the first positional argument.

The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit() things should "just work" for you - just pass your inputs and labels in any format that model.fit() supports! If, however, you want to use the second format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with the Keras Functional API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:

a single Tensor with input_ids only and nothing else: model(input_ids)

a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])

a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})

Note that when creating models and layers with subclassing then you don't need to worry about any of this, as you can just pass inputs like you would to any other Python function!

call

< source >

( input_ids: tf.Tensor | None = None bbox: tf.Tensor | None = None attention_mask: tf.Tensor | None = None token_type_ids: tf.Tensor | None = None position_ids: tf.Tensor | None = None head_mask: tf.Tensor | None = None inputs_embeds: tf.Tensor | None = None labels: tf.Tensor | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None pixel_values: tf.Tensor | None = None training: Optional[bool] = False ) β†’ transformers.modeling_tf_outputs.TFTokenClassifierOutput or tuple(tf.Tensor)

Parameters

A transformers.modeling_tf_outputs.TFTokenClassifierOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.

The TFLayoutLMv3ForTokenClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import AutoProcessor, TFAutoModelForTokenClassification
from datasets import load_dataset

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = TFAutoModelForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=7)

dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train", trust_remote_code=True)
example = dataset[0]
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]
word_labels = example["ner_tags"]

encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="tf")

outputs = model(**encoding)
loss = outputs.loss
logits = outputs.logits

TFLayoutLMv3ForQuestionAnswering

class transformers.TFLayoutLMv3ForQuestionAnswering

< source >

( config: LayoutLMv3Config **kwargs )

Parameters

LayoutLMv3 Model with a span classification head on top for extractive question-answering tasks such as DocVQA (a linear layer on top of the text part of the hidden-states output to compute span start logits and span end logits).

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

having all inputs as keyword arguments (like PyTorch models), or

having all inputs as a list, tuple or dict in the first positional argument.

The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit() things should "just work" for you - just pass your inputs and labels in any format that model.fit() supports! If, however, you want to use the second format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with the Keras Functional API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:

a single Tensor with input_ids only and nothing else: model(input_ids)

a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])

a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})

Note that when creating models and layers with subclassing then you don't need to worry about any of this, as you can just pass inputs like you would to any other Python function!

call

< source >

( input_ids: tf.Tensor | None = None attention_mask: tf.Tensor | None = None token_type_ids: tf.Tensor | None = None position_ids: tf.Tensor | None = None head_mask: tf.Tensor | None = None inputs_embeds: tf.Tensor | None = None start_positions: tf.Tensor | None = None end_positions: tf.Tensor | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None bbox: tf.Tensor | None = None pixel_values: tf.Tensor | None = None return_dict: Optional[bool] = None training: bool = False ) β†’ transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput or tuple(tf.Tensor)

Parameters

A transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.

The TFLayoutLMv3ForQuestionAnswering forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import AutoProcessor, TFAutoModelForQuestionAnswering
from datasets import load_dataset
import tensorflow as tf

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = TFAutoModelForQuestionAnswering.from_pretrained("microsoft/layoutlmv3-base")

dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train", trust_remote_code=True)
example = dataset[0]
image = example["image"]
question = "what's his name?"
words = example["tokens"]
boxes = example["bboxes"]

encoding = processor(image, question, words, boxes=boxes, return_tensors="tf")
start_positions = tf.convert_to_tensor([1])
end_positions = tf.convert_to_tensor([3])

outputs = model(**encoding, start_positions=start_positions, end_positions=end_positions)
loss = outputs.loss
start_scores = outputs.start_logits
end_scores = outputs.end_logits
