doctr.models - docTR documentation

doctr.models.classification

doctr.models.classification.vgg16_bn_r(pretrained: bool = False, **kwargs: Any) → VGG[source]

VGG-16 architecture as described in “Very Deep Convolutional Networks for Large-Scale Image Recognition”, modified by adding batch normalization, rectangular pooling and a simpler classification head.

    import torch
    from doctr.models import vgg16_bn_r
    model = vgg16_bn_r(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

VGG feature extractor

doctr.models.classification.resnet18(pretrained: bool = False, **kwargs: Any) → ResNet[source]

ResNet-18 architecture as described in “Deep Residual Learning for Image Recognition”.

    import torch
    from doctr.models import resnet18
    model = resnet18(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

A resnet18 model

doctr.models.classification.resnet34(pretrained: bool = False, **kwargs: Any) → ResNet[source]

ResNet-34 architecture as described in “Deep Residual Learning for Image Recognition”.

    import torch
    from doctr.models import resnet34
    model = resnet34(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

A resnet34 model

doctr.models.classification.resnet50(pretrained: bool = False, **kwargs: Any) → ResNet[source]

ResNet-50 architecture as described in “Deep Residual Learning for Image Recognition”.

    import torch
    from doctr.models import resnet50
    model = resnet50(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

A resnet50 model

doctr.models.classification.resnet31(pretrained: bool = False, **kwargs: Any) → ResNet[source]

ResNet-31 architecture with rectangular pooling windows, as described in “Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition”. Downsizing: (H, W) -> (H/8, W/4)

    import torch
    from doctr.models import resnet31
    model = resnet31(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

A resnet31 model
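
The downsizing rule quoted for resnet31, (H, W) -> (H/8, W/4), can be checked with plain arithmetic. The helper below is a hypothetical illustration (`resnet31_feature_size` is not part of docTR) and assumes floor division; the actual layers may round differently for sizes that are not multiples of 8 and 4.

```python
def resnet31_feature_size(height: int, width: int) -> tuple[int, int]:
    """Return the (H/8, W/4) feature-map size stated in the description.

    Hypothetical helper, not a docTR function. Floor division is an assumption.
    """
    return height // 8, width // 4


# The 512 x 512 input used in the resnet31 example
print(resnet31_feature_size(512, 512))  # (64, 128)
# A 32 x 128 word crop, as used in the recognition examples
print(resnet31_feature_size(32, 128))  # (4, 32)
```

The width is reduced less than the height, which keeps more horizontal resolution for text lines that are much wider than they are tall.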

doctr.models.classification.mobilenet_v3_small(pretrained: bool = False, **kwargs: Any) → MobileNetV3[source]

MobileNetV3-Small architecture as described in “Searching for MobileNetV3”.

    import torch
    from doctr.models import mobilenet_v3_small
    model = mobilenet_v3_small(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

a torch.nn.Module

doctr.models.classification.mobilenet_v3_large(pretrained: bool = False, **kwargs: Any) → MobileNetV3[source]

MobileNetV3-Large architecture as described in “Searching for MobileNetV3”.

    import torch
    from doctr.models import mobilenet_v3_large
    model = mobilenet_v3_large(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

a torch.nn.Module

doctr.models.classification.mobilenet_v3_small_r(pretrained: bool = False, **kwargs: Any) → MobileNetV3[source]

MobileNetV3-Small architecture as described in “Searching for MobileNetV3”, with rectangular pooling.

    import torch
    from doctr.models import mobilenet_v3_small_r
    model = mobilenet_v3_small_r(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

a torch.nn.Module

doctr.models.classification.mobilenet_v3_large_r(pretrained: bool = False, **kwargs: Any) → MobileNetV3[source]

MobileNetV3-Large architecture as described in “Searching for MobileNetV3”, with rectangular pooling.

    import torch
    from doctr.models import mobilenet_v3_large_r
    model = mobilenet_v3_large_r(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

a torch.nn.Module

doctr.models.classification.mobilenet_v3_small_crop_orientation(pretrained: bool = False, **kwargs: Any) → MobileNetV3[source]

MobileNetV3-Small architecture as described in “Searching for MobileNetV3”.

    import torch
    from doctr.models import mobilenet_v3_small_crop_orientation
    model = mobilenet_v3_small_crop_orientation(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

a torch.nn.Module

doctr.models.classification.mobilenet_v3_small_page_orientation(pretrained: bool = False, **kwargs: Any) → MobileNetV3[source]

MobileNetV3-Small architecture as described in “Searching for MobileNetV3”.

    import torch
    from doctr.models import mobilenet_v3_small_page_orientation
    model = mobilenet_v3_small_page_orientation(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

a torch.nn.Module

doctr.models.classification.magc_resnet31(pretrained: bool = False, **kwargs: Any) → ResNet[source]

ResNet-31 architecture with Multi-Aspect Global Context Attention, as described in “MASTER: Multi-Aspect Non-local Network for Scene Text Recognition”.

    import torch
    from doctr.models import magc_resnet31
    model = magc_resnet31(pretrained=False)
    input_tensor = torch.rand((1, 3, 224, 224), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

A feature extractor model

doctr.models.classification.vit_s(pretrained: bool = False, **kwargs: Any) → VisionTransformer[source]

VisionTransformer-S architecture as described in “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. Patches: (H, W) -> (H/8, W/8)

NOTE: unofficial config used in ViTSTR and ParSeq

    import torch
    from doctr.models import vit_s
    model = vit_s(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 32), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

A feature extractor model

doctr.models.classification.vit_b(pretrained: bool = False, **kwargs: Any) → VisionTransformer[source]

VisionTransformer-B architecture as described in “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. Patches: (H, W) -> (H/8, W/8)

    import torch
    from doctr.models import vit_b
    model = vit_b(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 32), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

A feature extractor model

doctr.models.classification.textnet_tiny(pretrained: bool = False, **kwargs: Any) → TextNet[source]

Implements the TextNet architecture from “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”. Based on the official PyTorch implementation (https://github.com/czczup/FAST).

    import torch
    from doctr.models import textnet_tiny
    model = textnet_tiny(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

A TextNet tiny model

doctr.models.classification.textnet_small(pretrained: bool = False, **kwargs: Any) → TextNet[source]

Implements the TextNet architecture from “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”. Based on the official PyTorch implementation (https://github.com/czczup/FAST).

    import torch
    from doctr.models import textnet_small
    model = textnet_small(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

A TextNet small model

doctr.models.classification.textnet_base(pretrained: bool = False, **kwargs: Any) → TextNet[source]

Implements the TextNet architecture from “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”. Based on the official PyTorch implementation (https://github.com/czczup/FAST).

    import torch
    from doctr.models import textnet_base
    model = textnet_base(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

A TextNet base model

doctr.models.classification.vip_tiny(pretrained: bool = False, **kwargs: Any) → VIPNet[source]

VIP-Tiny encoder architecture. Corresponds to the SVIPTRv2-T variant in the paper (VIPTRv2 function in the official implementation: https://github.com/cxfyxl/VIPTR/blob/main/modules/VIPTRv2.py).

Parameters:

Returns:

VIPNet model

doctr.models.classification.vip_base(pretrained: bool = False, **kwargs: Any) → VIPNet[source]

VIP-Base encoder architecture. Corresponds to the SVIPTRv2-B variant in the paper (VIPTRv2B function in the official implementation: https://github.com/cxfyxl/VIPTR/blob/main/modules/VIPTRv2.py).

Parameters:

Returns:

VIPNet model

doctr.models.classification.crop_orientation_predictor(arch: Any = 'mobilenet_v3_small_crop_orientation', pretrained: bool = False, batch_size: int = 128, **kwargs: Any) → OrientationPredictor[source]

Crop orientation classification architecture.

    import numpy as np
    from doctr.models import crop_orientation_predictor
    model = crop_orientation_predictor(arch='mobilenet_v3_small_crop_orientation', pretrained=True)
    input_crop = (255 * np.random.rand(256, 256, 3)).astype(np.uint8)
    out = model([input_crop])

Parameters:

Returns:

OrientationPredictor
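
The `batch_size` argument in the signature suggests that the predictor works through its input list in chunks. The sketch below only illustrates that chunking idea in plain Python; `iter_batches` is a hypothetical helper, and docTR's internal batching may differ.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `batch_size` items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 300 placeholder crops split with the default batch_size of 128
crops = list(range(300))
print([len(batch) for batch in iter_batches(crops, 128)])  # [128, 128, 44]
```

Under this assumption, a larger `batch_size` means fewer forward passes at the cost of more memory per pass.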

doctr.models.classification.page_orientation_predictor(arch: Any = 'mobilenet_v3_small_page_orientation', pretrained: bool = False, batch_size: int = 4, **kwargs: Any) → OrientationPredictor[source]

Page orientation classification architecture.

    import numpy as np
    from doctr.models import page_orientation_predictor
    model = page_orientation_predictor(arch='mobilenet_v3_small_page_orientation', pretrained=True)
    input_page = (255 * np.random.rand(512, 512, 3)).astype(np.uint8)
    out = model([input_page])

Parameters:

Returns:

OrientationPredictor

doctr.models.detection

doctr.models.detection.linknet_resnet18(pretrained: bool = False, **kwargs: Any) → LinkNet[source]

LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”.

    import torch
    from doctr.models import linknet_resnet18
    model = linknet_resnet18(pretrained=True).eval()
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

text detection architecture

doctr.models.detection.linknet_resnet34(pretrained: bool = False, **kwargs: Any) → LinkNet[source]

LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”.

    import torch
    from doctr.models import linknet_resnet34
    model = linknet_resnet34(pretrained=True).eval()
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

text detection architecture

doctr.models.detection.linknet_resnet50(pretrained: bool = False, **kwargs: Any) → LinkNet[source]

LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”.

    import torch
    from doctr.models import linknet_resnet50
    model = linknet_resnet50(pretrained=True).eval()
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

text detection architecture

doctr.models.detection.db_resnet50(pretrained: bool = False, **kwargs: Any) → DBNet[source]

DBNet as described in “Real-time Scene Text Detection with Differentiable Binarization”, using a ResNet-50 backbone.

    import torch
    from doctr.models import db_resnet50
    model = db_resnet50(pretrained=True)
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

text detection architecture

doctr.models.detection.db_mobilenet_v3_large(pretrained: bool = False, **kwargs: Any) → DBNet[source]

DBNet as described in “Real-time Scene Text Detection with Differentiable Binarization”, using a MobileNet V3 Large backbone.

    import torch
    from doctr.models import db_mobilenet_v3_large
    model = db_mobilenet_v3_large(pretrained=True)
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

text detection architecture

doctr.models.detection.fast_tiny(pretrained: bool = False, **kwargs: Any) → FAST[source]

FAST as described in “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”, using a tiny TextNet backbone.

    import torch
    from doctr.models import fast_tiny
    model = fast_tiny(pretrained=True)
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

text detection architecture

doctr.models.detection.fast_small(pretrained: bool = False, **kwargs: Any) → FAST[source]

FAST as described in “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”, using a small TextNet backbone.

    import torch
    from doctr.models import fast_small
    model = fast_small(pretrained=True)
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

text detection architecture

doctr.models.detection.fast_base(pretrained: bool = False, **kwargs: Any) → FAST[source]

FAST as described in “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”, using a base TextNet backbone.

    import torch
    from doctr.models import fast_base
    model = fast_base(pretrained=True)
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)

Parameters:

Returns:

text detection architecture

doctr.models.detection.detection_predictor(arch: Any = 'fast_base', pretrained: bool = False, assume_straight_pages: bool = True, preserve_aspect_ratio: bool = True, symmetric_pad: bool = True, batch_size: int = 2, **kwargs: Any) → DetectionPredictor[source]

Text detection architecture.

    import numpy as np
    from doctr.models import detection_predictor
    model = detection_predictor(arch='db_resnet50', pretrained=True)
    input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
    out = model([input_page])

Parameters:

Returns:

Detection predictor
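
The `preserve_aspect_ratio` and `symmetric_pad` flags in the signature concern how a page is fitted to the model's input size. The sketch below works through the arithmetic for the 600 x 800 page from the example, assuming a 1024 x 1024 target (the size used in the detection model examples above). `fit_with_padding` is a hypothetical helper, and the rounding and pad placement are assumptions rather than docTR's exact behavior.

```python
def fit_with_padding(height, width, target=(1024, 1024), symmetric_pad=True):
    """Resize (height, width) to fit inside `target` without distortion,
    then return the resized size and the (top, bottom, left, right) padding.

    Hypothetical helper: rounding and pad placement are assumptions.
    """
    target_h, target_w = target
    # Use the smaller scale factor so that both sides fit inside the target
    scale = min(target_h / height, target_w / width)
    new_h, new_w = round(height * scale), round(width * scale)
    pad_h, pad_w = target_h - new_h, target_w - new_w
    if symmetric_pad:
        # Split the padding evenly around the image
        top, left = pad_h // 2, pad_w // 2
    else:
        # Keep the image in the top-left corner, pad bottom and right only
        top, left = 0, 0
    return (new_h, new_w), (top, pad_h - top, left, pad_w - left)


print(fit_with_padding(600, 800))  # ((768, 1024), (128, 128, 0, 0))
print(fit_with_padding(600, 800, symmetric_pad=False))  # ((768, 1024), (0, 256, 0, 0))
```

In both cases the page content keeps its 3:4 proportions; only the placement of the padding changes.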

doctr.models.recognition

doctr.models.recognition.crnn_vgg16_bn(pretrained: bool = False, **kwargs: Any) → CRNN[source]

CRNN with a VGG-16 backbone as described in “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition”.

    import torch
    from doctr.models import crnn_vgg16_bn
    model = crnn_vgg16_bn(pretrained=True)
    input_tensor = torch.rand(1, 3, 32, 128)
    out = model(input_tensor)

Parameters:

Returns:

text recognition architecture

doctr.models.recognition.crnn_mobilenet_v3_small(pretrained: bool = False, **kwargs: Any) → CRNN[source]

CRNN with a MobileNet V3 Small backbone as described in “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition”.

    import torch
    from doctr.models import crnn_mobilenet_v3_small
    model = crnn_mobilenet_v3_small(pretrained=True)
    input_tensor = torch.rand(1, 3, 32, 128)
    out = model(input_tensor)

Parameters:

Returns:

text recognition architecture

doctr.models.recognition.crnn_mobilenet_v3_large(pretrained: bool = False, **kwargs: Any) → CRNN[source]

CRNN with a MobileNet V3 Large backbone as described in “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition”.

    import torch
    from doctr.models import crnn_mobilenet_v3_large
    model = crnn_mobilenet_v3_large(pretrained=True)
    input_tensor = torch.rand(1, 3, 32, 128)
    out = model(input_tensor)

Parameters:

Returns:

text recognition architecture

doctr.models.recognition.sar_resnet31(pretrained: bool = False, **kwargs: Any) → SAR[source]

SAR with a ResNet-31 feature extractor as described in “Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition”.

    import torch
    from doctr.models import sar_resnet31
    model = sar_resnet31(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 128))
    out = model(input_tensor)

Parameters:

Returns:

text recognition architecture

doctr.models.recognition.master(pretrained: bool = False, **kwargs: Any) → MASTER[source]

MASTER as described in the paper (https://arxiv.org/pdf/1910.02562.pdf).

    import torch
    from doctr.models import master
    model = master(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 128))
    out = model(input_tensor)

Parameters:

Returns:

text recognition architecture

doctr.models.recognition.vitstr_small(pretrained: bool = False, **kwargs: Any) → ViTSTR[source]

ViTSTR-Small as described in “Vision Transformer for Fast and Efficient Scene Text Recognition”.

    import torch
    from doctr.models import vitstr_small
    model = vitstr_small(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 128))
    out = model(input_tensor)

Parameters:

Returns:

text recognition architecture

doctr.models.recognition.vitstr_base(pretrained: bool = False, **kwargs: Any) → ViTSTR[source]

ViTSTR-Base as described in “Vision Transformer for Fast and Efficient Scene Text Recognition”.

    import torch
    from doctr.models import vitstr_base
    model = vitstr_base(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 128))
    out = model(input_tensor)

Parameters:

Returns:

text recognition architecture

doctr.models.recognition.parseq(pretrained: bool = False, **kwargs: Any) → PARSeq[source]

PARSeq architecture from “Scene Text Recognition with Permuted Autoregressive Sequence Models”.

    import torch
    from doctr.models import parseq
    model = parseq(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 128))
    out = model(input_tensor)

Parameters:

Returns:

text recognition architecture

doctr.models.recognition.viptr_tiny(pretrained: bool = False, **kwargs: Any) → VIPTR[source]

VIPTR-Tiny as described in “A Vision Permutable Extractor for Fast and Efficient Scene Text Recognition”.

    import torch
    from doctr.models import viptr_tiny
    model = viptr_tiny(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 128))
    out = model(input_tensor)

Parameters:

Returns:

a VIPTR model instance

Return type:

VIPTR

doctr.models.recognition.recognition_predictor(arch: Any = 'crnn_vgg16_bn', pretrained: bool = False, symmetric_pad: bool = False, batch_size: int = 128, **kwargs: Any) → RecognitionPredictor[source]

Text recognition architecture.

    import numpy as np
    from doctr.models import recognition_predictor
    model = recognition_predictor(pretrained=True)
    input_page = (255 * np.random.rand(32, 128, 3)).astype(np.uint8)
    out = model([input_page])

Parameters:

Returns:

Recognition predictor

doctr.models.zoo

doctr.models.ocr_predictor(det_arch: Any = 'fast_base', reco_arch: Any = 'crnn_vgg16_bn', pretrained: bool = False, pretrained_backbone: bool = True, assume_straight_pages: bool = True, preserve_aspect_ratio: bool = True, symmetric_pad: bool = True, export_as_straight_boxes: bool = False, detect_orientation: bool = False, straighten_pages: bool = False, detect_language: bool = False, **kwargs: Any) → OCRPredictor[source]

End-to-end OCR architecture using one model for localization, and another for text recognition.

    import numpy as np
    from doctr.models import ocr_predictor
    model = ocr_predictor('db_resnet50', 'crnn_vgg16_bn', pretrained=True)
    input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
    out = model([input_page])

Parameters:

Returns:

OCR predictor
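
The description above, one model for localization and another for recognition, amounts to a detect, crop, recognize loop. The sketch below illustrates that data flow with stand-in functions: `detect`, `recognize`, `crop`, and `run_ocr` are hypothetical placeholders, not docTR APIs, and the relative-box format is an assumption made for illustration.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), relative to page size
Image = List[List[int]]  # grayscale page as rows of pixel values


def crop(page: Image, box: Box) -> Image:
    """Cut the region described by a relative box out of the page."""
    height, width = len(page), len(page[0])
    xmin, ymin, xmax, ymax = box
    rows = page[int(ymin * height):int(ymax * height)]
    return [row[int(xmin * width):int(xmax * width)] for row in rows]


def run_ocr(
    page: Image,
    detect: Callable[[Image], List[Box]],
    recognize: Callable[[Image], str],
) -> List[Tuple[Box, str]]:
    """Stage 1 localizes text, stage 2 reads each cropped region."""
    return [(box, recognize(crop(page, box))) for box in detect(page)]


# Stand-in models: a fixed box and a "recognizer" that reports the crop size
page = [[0] * 8 for _ in range(4)]
detect = lambda _page: [(0.0, 0.0, 0.5, 0.5)]
recognize = lambda region: f"{len(region)}x{len(region[0])}"
print(run_ocr(page, detect, recognize))  # [((0.0, 0.0, 0.5, 0.5), '2x4')]
```

Keeping the two stages behind separate callables mirrors the `det_arch` and `reco_arch` arguments in the signature: either stage can be swapped without touching the other.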

doctr.models.kie_predictor(det_arch: Any = 'fast_base', reco_arch: Any = 'crnn_vgg16_bn', pretrained: bool = False, pretrained_backbone: bool = True, assume_straight_pages: bool = True, preserve_aspect_ratio: bool = True, symmetric_pad: bool = True, export_as_straight_boxes: bool = False, detect_orientation: bool = False, straighten_pages: bool = False, detect_language: bool = False, **kwargs: Any) → KIEPredictor[source]

End-to-end KIE architecture using one model for localization, and another for text recognition.

    import numpy as np
    from doctr.models import kie_predictor
    model = kie_predictor('db_resnet50', 'crnn_vgg16_bn', pretrained=True)
    input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
    out = model([input_page])

Parameters:

Returns:

KIE predictor

doctr.models.factory

doctr.models.factory.login_to_hub() → None[source]

Log in to the Hugging Face Hub.

doctr.models.factory.from_hub(repo_id: str, **kwargs: Any)[source]

Instantiate & load a pretrained model from HF hub.

    from doctr.models import from_hub
    model = from_hub("mindee/fasterrcnn_mobilenet_v3_large_fpn")

Parameters:

Returns:

Model loaded with the checkpoint

doctr.models.factory.push_to_hf_hub(model: Any, model_name: str, task: str, **kwargs) → None[source]

Save model and its configuration on HF hub

    from doctr.models import login_to_hub, push_to_hf_hub
    from doctr.models.recognition import crnn_mobilenet_v3_small
    login_to_hub()
    model = crnn_mobilenet_v3_small(pretrained=True)
    push_to_hf_hub(model, 'my-model', 'recognition', arch='crnn_mobilenet_v3_small')

Parameters: