Donut

PyTorch

Donut (Document Understanding Transformer) is a visual document understanding model that doesn’t require an Optical Character Recognition (OCR) engine. Unlike traditional approaches that extract text with OCR before processing, Donut employs an end-to-end Transformer-based architecture to analyze document images directly. This eliminates OCR-related inefficiencies, making it more accurate and adaptable to diverse languages and formats.

Donut features a vision encoder (Swin) and a text decoder (BART). Swin converts document images into embeddings, and BART processes them into meaningful text sequences.

You can find all the original Donut checkpoints under the Naver Clova Information Extraction organization.

Click on the Donut models in the right sidebar for more examples of how to apply Donut to different language and vision tasks.

The examples below demonstrate how to perform document understanding tasks using Donut with Pipeline and AutoModel.

import torch
from datasets import load_dataset
from transformers import pipeline

pipeline = pipeline(
    task="document-question-answering",
    model="naver-clova-ix/donut-base-finetuned-docvqa",
    device=0,
    torch_dtype=torch.float16
)
dataset = load_dataset("hf-internal-testing/example-documents", split="test")
image = dataset[0]["image"]

pipeline(image=image, question="What time is the coffee break?")

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.

The example below uses torchao to quantize only the weights to int4.

import torch
from datasets import load_dataset
from transformers import TorchAoConfig, AutoProcessor, AutoModelForVision2Seq

quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
processor = AutoProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = AutoModelForVision2Seq.from_pretrained(
    "naver-clova-ix/donut-base-finetuned-docvqa",
    quantization_config=quantization_config
)

dataset = load_dataset("hf-internal-testing/example-documents", split="test")
image = dataset[0]["image"]
question = "What time is the coffee break?"
task_prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"
inputs = processor(image, task_prompt, return_tensors="pt")

outputs = model.generate(
    input_ids=inputs.input_ids,
    pixel_values=inputs.pixel_values,
    max_length=512
)
answer = processor.decode(outputs[0], skip_special_tokens=True)
print(answer)


DonutSwinConfig

class transformers.DonutSwinConfig


( image_size = 224 patch_size = 4 num_channels = 3 embed_dim = 96 depths = [2, 2, 6, 2] num_heads = [3, 6, 12, 24] window_size = 7 mlp_ratio = 4.0 qkv_bias = True hidden_dropout_prob = 0.0 attention_probs_dropout_prob = 0.0 drop_path_rate = 0.1 hidden_act = 'gelu' use_absolute_embeddings = False initializer_range = 0.02 layer_norm_eps = 1e-05 **kwargs )


This is the configuration class to store the configuration of a DonutSwinModel. It is used to instantiate a Donut model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Donut naver-clova-ix/donut-base architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

from transformers import DonutSwinConfig, DonutSwinModel

# Initializing a Donut naver-clova-ix/donut-base style configuration
configuration = DonutSwinConfig()

# Initializing a model (with random weights) from the configuration
model = DonutSwinModel(configuration)

# Accessing the model configuration
configuration = model.config

DonutImageProcessor

class transformers.DonutImageProcessor


( do_resize: bool = True size: typing.Optional[typing.Dict[str, int]] = None resample: Resampling = <Resampling.BILINEAR: 2> do_thumbnail: bool = True do_align_long_axis: bool = False do_pad: bool = True do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None **kwargs )


Constructs a Donut image processor.

preprocess


( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_resize: typing.Optional[bool] = None size: typing.Optional[typing.Dict[str, int]] = None resample: Resampling = None do_thumbnail: typing.Optional[bool] = None do_align_long_axis: typing.Optional[bool] = None do_pad: typing.Optional[bool] = None random_padding: bool = False do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )


Preprocess an image or batch of images.
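For illustration, here is a minimal sketch that runs the processor on a synthetic image; the checkpoint name is an official Donut checkpoint, but the random input and the printed shape are illustrative:

from transformers import DonutImageProcessor
from PIL import Image
import numpy as np

# load the image processor shipped with an official Donut checkpoint
processor = DonutImageProcessor.from_pretrained("naver-clova-ix/donut-base")

# random RGB array standing in for a scanned document page
image = Image.fromarray(np.random.randint(0, 256, (800, 600, 3), dtype=np.uint8))

# resizing, thumbnailing, padding, rescaling and normalization happen in one call
inputs = processor(image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # e.g. torch.Size([1, 3, 2560, 1920]) for donut-base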

DonutImageProcessorFast

class transformers.DonutImageProcessorFast


( **kwargs: typing_extensions.Unpack[transformers.models.donut.image_processing_donut_fast.DonutFastImageProcessorKwargs] )


Constructs a fast Donut image processor.

preprocess


( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] **kwargs: typing_extensions.Unpack[transformers.models.donut.image_processing_donut_fast.DonutFastImageProcessorKwargs] ) → <class 'transformers.image_processing_base.BatchFeature'>


Returns

<class 'transformers.image_processing_base.BatchFeature'>
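A minimal sketch of the fast processor (it requires torchvision; loading through AutoImageProcessor with use_fast=True is one way to select it):

from transformers import AutoImageProcessor
from PIL import Image
import numpy as np

# use_fast=True selects DonutImageProcessorFast when it is available
processor = AutoImageProcessor.from_pretrained("naver-clova-ix/donut-base", use_fast=True)

image = Image.fromarray(np.random.randint(0, 256, (800, 600, 3), dtype=np.uint8))

# preprocess returns a BatchFeature holding the pixel_values
batch = processor(image, return_tensors="pt")
print(type(batch), batch["pixel_values"].shape)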

DonutFeatureExtractor

Preprocess an image or a batch of images.

DonutProcessor

class transformers.DonutProcessor


( image_processor = None tokenizer = None **kwargs )


Constructs a Donut processor which wraps a Donut image processor and an XLMRoBERTa tokenizer into a single processor.

DonutProcessor offers all the functionalities of DonutImageProcessor and XLMRobertaTokenizer/XLMRobertaTokenizerFast. See the __call__() and decode() for more information.
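A minimal sketch of a combined call, mirroring the quantized example above; the synthetic image is a stand-in, and the exact keys in the output can vary across transformers versions:

from transformers import DonutProcessor
from PIL import Image
import numpy as np

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")

# synthetic stand-in for a document image
image = Image.fromarray(np.random.randint(0, 256, (800, 600, 3), dtype=np.uint8))
task_prompt = "<s_docvqa><s_question>What time is the coffee break?</s_question><s_answer>"

# one call preprocesses the image and tokenizes the task prompt
inputs = processor(image, task_prompt, return_tensors="pt")
print({k: tuple(v.shape) for k, v in inputs.items()})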

__call__


( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] = None text: typing.Union[str, typing.List[str], NoneType] = None audio = None videos = None **kwargs: typing_extensions.Unpack[transformers.models.donut.processing_donut.DonutProcessorKwargs] )

When used in normal mode, this method forwards all its arguments to AutoImageProcessor’s __call__() and returns its output. If used in the context as_target_processor() this method forwards all its arguments to DonutTokenizer’s ~DonutTokenizer.__call__. Please refer to the docstring of the above two methods for more information.

from_pretrained


( pretrained_model_name_or_path: typing.Union[str, os.PathLike] cache_dir: typing.Union[str, os.PathLike, NoneType] = None force_download: bool = False local_files_only: bool = False token: typing.Union[bool, str, NoneType] = None revision: str = 'main' **kwargs )


Instantiate a processor associated with a pretrained model.

This class method is simply calling the feature extractor from_pretrained(), image processor ImageProcessingMixin and the tokenizer ~tokenization_utils_base.PreTrainedTokenizer.from_pretrained methods. Please refer to the docstrings of the methods above for more information.

save_pretrained


( save_directory push_to_hub: bool = False **kwargs )


Saves the attributes of this processor (feature extractor, tokenizer…) in the specified directory so that it can be reloaded using the from_pretrained() method.

This class method is simply calling save_pretrained() and save_pretrained(). Please refer to the docstrings of the methods above for more information.
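A save-and-reload round trip (the local directory path is illustrative):

from transformers import DonutProcessor

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")

# writes the image processor config and tokenizer files to the directory
processor.save_pretrained("./donut-processor")

# reload the identical processor from disk
reloaded = DonutProcessor.from_pretrained("./donut-processor")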

batch_decode

This method forwards all its arguments to DonutTokenizer’s batch_decode(). Please refer to the docstring of this method for more information.

decode

This method forwards all its arguments to DonutTokenizer’s decode(). Please refer to the docstring of this method for more information.

DonutSwinModel

class transformers.DonutSwinModel


( config add_pooling_layer = True use_mask_token = False )


The bare Donut Swin Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward


( pixel_values: typing.Optional[torch.FloatTensor] = None bool_masked_pos: typing.Optional[torch.BoolTensor] = None head_mask: typing.Optional[torch.FloatTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None interpolate_pos_encoding: bool = False return_dict: typing.Optional[bool] = None ) → transformers.models.donut.modeling_donut_swin.DonutSwinModelOutput or tuple(torch.FloatTensor)


Returns

transformers.models.donut.modeling_donut_swin.DonutSwinModelOutput or tuple(torch.FloatTensor)

A transformers.models.donut.modeling_donut_swin.DonutSwinModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (DonutSwinConfig) and inputs.

The DonutSwinModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
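A forward-pass sketch; the random input image and the printed shape are illustrative:

from transformers import AutoImageProcessor, DonutSwinModel
import torch
from PIL import Image
import numpy as np

image_processor = AutoImageProcessor.from_pretrained("naver-clova-ix/donut-base")
model = DonutSwinModel.from_pretrained("naver-clova-ix/donut-base")

# random RGB array standing in for a document image
image = Image.fromarray(np.random.randint(0, 256, (800, 600, 3), dtype=np.uint8))
inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# one embedding per patch position after the final Swin stage
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 4800, 1024]) for donut-base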

DonutSwinForImageClassification

class transformers.DonutSwinForImageClassification


( config )


DonutSwin Model transformer with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for ImageNet.

Note that it’s possible to fine-tune DonutSwin on higher resolution images than the ones it has been trained on, by setting interpolate_pos_encoding to True in the forward of the model. This will interpolate the pre-trained position embeddings to the higher resolution.
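For example, a sketch of that flag in use; the checkpoint and the input size are illustrative (and kept small so the sketch runs quickly), and note that the classification head of this base checkpoint is randomly initialized:

import torch
from transformers import DonutSwinForImageClassification

model = DonutSwinForImageClassification.from_pretrained("naver-clova-ix/donut-base")

# an input size that differs from the checkpoint's default (illustrative)
pixel_values = torch.randn(1, 3, 1024, 768)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, interpolate_pos_encoding=True)
print(outputs.logits.shape)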

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward


( pixel_values: typing.Optional[torch.FloatTensor] = None head_mask: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None interpolate_pos_encoding: bool = False return_dict: typing.Optional[bool] = None ) → transformers.models.donut.modeling_donut_swin.DonutSwinImageClassifierOutput or tuple(torch.FloatTensor)


Returns

transformers.models.donut.modeling_donut_swin.DonutSwinImageClassifierOutput or tuple(torch.FloatTensor)

A transformers.models.donut.modeling_donut_swin.DonutSwinImageClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (DonutSwinConfig) and inputs.

The DonutSwinForImageClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

from transformers import AutoImageProcessor, DonutSwinForImageClassification
import torch
from datasets import load_dataset

dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
image = dataset["test"]["image"][0]

image_processor = AutoImageProcessor.from_pretrained("naver-clova-ix/donut-base")
model = DonutSwinForImageClassification.from_pretrained("naver-clova-ix/donut-base")

inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# model predicts one of the labels defined in the model config
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
