LayoutLM

This model was released on 2019-12-31 and added to Hugging Face Transformers on 2020-11-16.

PyTorch

LayoutLM jointly learns text and the document layout rather than focusing only on text. It incorporates positional layout information and visual features of words from the document images.

You can find all the original LayoutLM checkpoints under the LayoutLM collection.

Click on the LayoutLM models in the right sidebar for more examples of how to apply LayoutLM to different vision and language tasks.

The example below demonstrates document question answering with the LayoutLMForQuestionAnswering class.

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, LayoutLMForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("impira/layoutlm-document-qa", add_prefix_space=True)
model = LayoutLMForQuestionAnswering.from_pretrained("impira/layoutlm-document-qa", dtype=torch.float16)

dataset = load_dataset("nielsr/funsd", split="train")
example = dataset[0]
question = "what's his name?"
words = example["words"]
boxes = example["bboxes"]

encoding = tokenizer(
    question.split(), words, is_split_into_words=True, return_token_type_ids=True, return_tensors="pt"
)
bbox = []
for i, s, w in zip(encoding.input_ids[0], encoding.sequence_ids(0), encoding.word_ids(0)):
    if s == 1:
        bbox.append(boxes[w])
    elif i == tokenizer.sep_token_id:
        bbox.append([1000] * 4)
    else:
        bbox.append([0] * 4)
encoding["bbox"] = torch.tensor([bbox])

word_ids = encoding.word_ids(0)
outputs = model(**encoding)
start_scores = outputs.start_logits
end_scores = outputs.end_logits
start, end = word_ids[start_scores.argmax(-1)], word_ids[end_scores.argmax(-1)]
print(" ".join(words[start : end + 1]))
```
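The bbox-construction loop above can be sketched in isolation with plain Python lists. The `input_ids`, `sequence_ids`, and `word_ids` values below are made-up stand-ins for what the tokenizer would return for a question/document pair:

```python
# Hypothetical tokenizer outputs for a question + two document words.
# sequence_ids: 0 = question token, 1 = document token, None = special token.
input_ids = [101, 2054, 102, 3419, 4068, 102]  # [CLS] what [SEP] john smith [SEP]
sequence_ids = [None, 0, None, 1, 1, None]
word_ids = [None, 0, None, 0, 1, None]
sep_token_id = 102

boxes = [[10, 10, 50, 20], [55, 10, 90, 20]]  # one normalized box per document word

bbox = []
for i, s, w in zip(input_ids, sequence_ids, word_ids):
    if s == 1:  # document token: reuse its word's bounding box
        bbox.append(boxes[w])
    elif i == sep_token_id:  # [SEP]: conventionally assigned the full-page box
        bbox.append([1000] * 4)
    else:  # [CLS] and question tokens: zero box
        bbox.append([0] * 4)

print(bbox)
```

Each subword token ends up with exactly one box, so `bbox` lines up with `input_ids` position by position.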

Notes

LayoutLM expects each word's bounding box in `(x0, y0, x1, y1)` format, normalized to a 0-1000 scale relative to the page size. You can normalize boxes with a helper like the following, where `width` and `height` come from the document image:

```python
def normalize_bbox(bbox, width, height):
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]

from PIL import Image

image = Image.open(name_of_your_document).convert("RGB")

width, height = image.size
```
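As a self-contained sketch of this normalization, applied to hypothetical OCR output from an 850x1100 pixel page (the pixel coordinates are made up for illustration):

```python
def normalize_bbox(bbox, width, height):
    # Scale (x0, y0, x1, y1) pixel coordinates to the 0-1000 range LayoutLM expects.
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]

# Hypothetical OCR word boxes in pixels for an 850x1100 page.
pixel_boxes = [[85, 110, 170, 132], [425, 990, 510, 1012]]
boxes = [normalize_bbox(b, width=850, height=1100) for b in pixel_boxes]
print(boxes)  # [[100, 100, 200, 120], [500, 900, 600, 920]]
```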

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LayoutLM. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

LayoutLMConfig

class transformers.LayoutLMConfig


( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None vocab_size: int = 30522 hidden_size: int = 768 num_hidden_layers: int = 12 num_attention_heads: int = 12 intermediate_size: int = 3072 hidden_act: str = 'gelu' hidden_dropout_prob: float | int = 0.1 attention_probs_dropout_prob: float | int = 0.1 max_position_embeddings: int = 512 type_vocab_size: int = 2 initializer_range: float = 0.02 layer_norm_eps: float = 1e-12 pad_token_id: int | None = 0 eos_token_id: int | list[int] | None = None bos_token_id: int | None = None use_cache: bool = True max_2d_position_embeddings: int = 1024 tie_word_embeddings: bool = True )

This is the configuration class to store the configuration of a LayoutLMModel. It is used to instantiate a LayoutLM model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the microsoft/layoutlm-base-uncased architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Examples:

```python
from transformers import LayoutLMConfig, LayoutLMModel

# Initializing a LayoutLM configuration
configuration = LayoutLMConfig()

# Initializing a model from the configuration
model = LayoutLMModel(configuration)

# Accessing the model configuration
configuration = model.config
```

LayoutLMTokenizer

class transformers.BertTokenizer


( vocab: str | dict[str, int] | None = None do_lower_case: bool = True unk_token: str = '[UNK]' sep_token: str = '[SEP]' pad_token: str = '[PAD]' cls_token: str = '[CLS]' mask_token: str = '[MASK]' tokenize_chinese_chars: bool = True strip_accents: bool | None = None **kwargs )

Construct a BERT tokenizer (backed by Hugging Face's tokenizers library), based on WordPiece.

This tokenizer inherits from TokenizersBackend which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

__call__


( text: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] | None = None text_pair: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] | None = None text_target: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] | None = None text_pair_target: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] | None = None add_special_tokens: bool = True padding: bool | str | PaddingStrategy = False truncation: bool | str | TruncationStrategy | None = None max_length: int | None = None stride: int = 0 is_split_into_words: bool = False pad_to_multiple_of: int | None = None padding_side: str | None = None return_tensors: str | TensorType | None = None return_token_type_ids: bool | None = None return_attention_mask: bool | None = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True tokenizer_kwargs: dict[str, Any] | None = None **kwargs ) → BatchEncoding

Main method to tokenize and prepare one or several sequences, or one or several pairs of sequences, for the model. Returns a BatchEncoding with the encoded inputs.

LayoutLMTokenizerFast

class transformers.BertTokenizer


( vocab: str | dict[str, int] | None = None do_lower_case: bool = True unk_token: str = '[UNK]' sep_token: str = '[SEP]' pad_token: str = '[PAD]' cls_token: str = '[CLS]' mask_token: str = '[MASK]' tokenize_chinese_chars: bool = True strip_accents: bool | None = None **kwargs )

Construct a BERT tokenizer (backed by Hugging Face's tokenizers library), based on WordPiece.

This tokenizer inherits from TokenizersBackend which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

__call__


( text: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] | None = None text_pair: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] | None = None text_target: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] | None = None text_pair_target: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] | None = None add_special_tokens: bool = True padding: bool | str | PaddingStrategy = False truncation: bool | str | TruncationStrategy | None = None max_length: int | None = None stride: int = 0 is_split_into_words: bool = False pad_to_multiple_of: int | None = None padding_side: str | None = None return_tensors: str | TensorType | None = None return_token_type_ids: bool | None = None return_attention_mask: bool | None = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True tokenizer_kwargs: dict[str, Any] | None = None **kwargs ) → BatchEncoding

Main method to tokenize and prepare one or several sequences, or one or several pairs of sequences, for the model. Returns a BatchEncoding with the encoded inputs.

LayoutLMModel

class transformers.LayoutLMModel


( config )

The bare LayoutLM Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( input_ids: torch.LongTensor | None = None bbox: torch.LongTensor | None = None attention_mask: torch.FloatTensor | None = None token_type_ids: torch.LongTensor | None = None position_ids: torch.LongTensor | None = None inputs_embeds: torch.FloatTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → BaseModelOutputWithPooling or tuple(torch.FloatTensor)

Returns

BaseModelOutputWithPooling or tuple(torch.FloatTensor)

A BaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False), comprising various elements depending on the configuration (LayoutLMConfig) and inputs.

The LayoutLMModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

```python
import torch
from transformers import AutoTokenizer, LayoutLMModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased")

words = ["Hello", "world"]
normalized_word_boxes = [637, 773, 693, 782], [698, 773, 733, 782]

token_boxes = []
for word, box in zip(words, normalized_word_boxes):
    word_tokens = tokenizer.tokenize(word)
    token_boxes.extend([box] * len(word_tokens))
# add bounding boxes for the [CLS] and [SEP] tokens
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

encoding = tokenizer(" ".join(words), return_tensors="pt")
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]
token_type_ids = encoding["token_type_ids"]
bbox = torch.tensor([token_boxes])

outputs = model(
    input_ids=input_ids, bbox=bbox, attention_mask=attention_mask, token_type_ids=token_type_ids
)

last_hidden_states = outputs.last_hidden_state
```

LayoutLMForMaskedLM

class transformers.LayoutLMForMaskedLM


( config )

The LayoutLM Model with a language modeling head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( input_ids: torch.LongTensor | None = None bbox: torch.LongTensor | None = None attention_mask: torch.FloatTensor | None = None token_type_ids: torch.LongTensor | None = None position_ids: torch.LongTensor | None = None inputs_embeds: torch.FloatTensor | None = None labels: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → MaskedLMOutput or tuple(torch.FloatTensor)

Returns

MaskedLMOutput or tuple(torch.FloatTensor)

A MaskedLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False), comprising various elements depending on the configuration (LayoutLMConfig) and inputs.

The LayoutLMForMaskedLM forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

```python
import torch
from transformers import AutoTokenizer, LayoutLMForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForMaskedLM.from_pretrained("microsoft/layoutlm-base-uncased")

words = ["Hello", "[MASK]"]
normalized_word_boxes = [637, 773, 693, 782], [698, 773, 733, 782]

token_boxes = []
for word, box in zip(words, normalized_word_boxes):
    word_tokens = tokenizer.tokenize(word)
    token_boxes.extend([box] * len(word_tokens))
# add bounding boxes for the [CLS] and [SEP] tokens
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

encoding = tokenizer(" ".join(words), return_tensors="pt")
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]
token_type_ids = encoding["token_type_ids"]
bbox = torch.tensor([token_boxes])

labels = tokenizer("Hello world", return_tensors="pt")["input_ids"]

outputs = model(
    input_ids=input_ids,
    bbox=bbox,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    labels=labels,
)

loss = outputs.loss
```

LayoutLMForSequenceClassification

class transformers.LayoutLMForSequenceClassification


( config )

LayoutLM Model with a sequence classification head on top (a linear layer on top of the pooled output), e.g. for document image classification tasks such as the RVL-CDIP dataset.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( input_ids: torch.LongTensor | None = None bbox: torch.LongTensor | None = None attention_mask: torch.FloatTensor | None = None token_type_ids: torch.LongTensor | None = None position_ids: torch.LongTensor | None = None inputs_embeds: torch.FloatTensor | None = None labels: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → SequenceClassifierOutput or tuple(torch.FloatTensor)

Returns

SequenceClassifierOutput or tuple(torch.FloatTensor)

A SequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False), comprising various elements depending on the configuration (LayoutLMConfig) and inputs.

The LayoutLMForSequenceClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

```python
import torch
from transformers import AutoTokenizer, LayoutLMForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForSequenceClassification.from_pretrained("microsoft/layoutlm-base-uncased")

words = ["Hello", "world"]
normalized_word_boxes = [637, 773, 693, 782], [698, 773, 733, 782]

token_boxes = []
for word, box in zip(words, normalized_word_boxes):
    word_tokens = tokenizer.tokenize(word)
    token_boxes.extend([box] * len(word_tokens))
# add bounding boxes for the [CLS] and [SEP] tokens
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

encoding = tokenizer(" ".join(words), return_tensors="pt")
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]
token_type_ids = encoding["token_type_ids"]
bbox = torch.tensor([token_boxes])
sequence_label = torch.tensor([1])

outputs = model(
    input_ids=input_ids,
    bbox=bbox,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    labels=sequence_label,
)

loss = outputs.loss
logits = outputs.logits
```

LayoutLMForTokenClassification

class transformers.LayoutLMForTokenClassification


( config )

LayoutLM Model with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for sequence labeling (information extraction) tasks such as the FUNSD dataset and the SROIE dataset.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( input_ids: torch.LongTensor | None = None bbox: torch.LongTensor | None = None attention_mask: torch.FloatTensor | None = None token_type_ids: torch.LongTensor | None = None position_ids: torch.LongTensor | None = None inputs_embeds: torch.FloatTensor | None = None labels: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → TokenClassifierOutput or tuple(torch.FloatTensor)

Returns

TokenClassifierOutput or tuple(torch.FloatTensor)

A TokenClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False), comprising various elements depending on the configuration (LayoutLMConfig) and inputs.

The LayoutLMForTokenClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

```python
import torch
from transformers import AutoTokenizer, LayoutLMForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained("microsoft/layoutlm-base-uncased")

words = ["Hello", "world"]
normalized_word_boxes = [637, 773, 693, 782], [698, 773, 733, 782]

token_boxes = []
for word, box in zip(words, normalized_word_boxes):
    word_tokens = tokenizer.tokenize(word)
    token_boxes.extend([box] * len(word_tokens))
# add bounding boxes for the [CLS] and [SEP] tokens
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

encoding = tokenizer(" ".join(words), return_tensors="pt")
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]
token_type_ids = encoding["token_type_ids"]
bbox = torch.tensor([token_boxes])
token_labels = torch.tensor([1, 1, 0, 0]).unsqueeze(0)  # batch size of 1

outputs = model(
    input_ids=input_ids,
    bbox=bbox,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    labels=token_labels,
)

loss = outputs.loss
logits = outputs.logits
```

LayoutLMForQuestionAnswering

class transformers.LayoutLMForQuestionAnswering


( config has_visual_segment_embedding = True )

The LayoutLM Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( input_ids: torch.LongTensor | None = None bbox: torch.LongTensor | None = None attention_mask: torch.FloatTensor | None = None token_type_ids: torch.LongTensor | None = None position_ids: torch.LongTensor | None = None inputs_embeds: torch.FloatTensor | None = None start_positions: torch.LongTensor | None = None end_positions: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → QuestionAnsweringModelOutput or tuple(torch.FloatTensor)

Returns

QuestionAnsweringModelOutput or tuple(torch.FloatTensor)

A QuestionAnsweringModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False), comprising various elements depending on the configuration (LayoutLMConfig) and inputs.

The LayoutLMForQuestionAnswering forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

In the example below, we prepare a question + context pair for the LayoutLM model. The model predicts the answer as a span within the text parsed from the image.

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, LayoutLMForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("impira/layoutlm-document-qa", add_prefix_space=True)
model = LayoutLMForQuestionAnswering.from_pretrained("impira/layoutlm-document-qa", revision="1e3ebac")

dataset = load_dataset("nielsr/funsd", split="train")
example = dataset[0]
question = "what's his name?"
words = example["words"]
boxes = example["bboxes"]

encoding = tokenizer(
    question.split(), words, is_split_into_words=True, return_token_type_ids=True, return_tensors="pt"
)
bbox = []
for i, s, w in zip(encoding.input_ids[0], encoding.sequence_ids(0), encoding.word_ids(0)):
    if s == 1:
        bbox.append(boxes[w])
    elif i == tokenizer.sep_token_id:
        bbox.append([1000] * 4)
    else:
        bbox.append([0] * 4)
encoding["bbox"] = torch.tensor([bbox])

word_ids = encoding.word_ids(0)
outputs = model(**encoding)
start_scores = outputs.start_logits
end_scores = outputs.end_logits
start, end = word_ids[start_scores.argmax(-1)], word_ids[end_scores.argmax(-1)]
print(" ".join(words[start : end + 1]))
# M. Hamann P. Harper, P. Martinez
```
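The final span-decoding step (mapping the argmax of the start/end logits back through `word_ids` to words) can be sketched with plain lists; the scores and ids below are hypothetical stand-ins for model outputs:

```python
words = ["name:", "John", "Smith", "Engineer"]
word_ids = [None, 0, 1, 2, 3, None]  # token index -> word index ([CLS]/[SEP] map to None)
start_scores = [0.1, 0.2, 3.5, 0.3, 0.1, 0.0]  # highest at token 2 -> word 1 ("John")
end_scores = [0.0, 0.1, 0.4, 2.9, 0.2, 0.1]    # highest at token 3 -> word 2 ("Smith")

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

start, end = word_ids[argmax(start_scores)], word_ids[argmax(end_scores)]
answer = " ".join(words[start : end + 1])
print(answer)  # John Smith
```

Decoding at the word level (rather than the token level) returns the original OCR words, so the answer text is not affected by WordPiece splitting.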
