doctr.models - docTR documentation
doctr.models.classification
doctr.models.classification.vgg16_bn_r(pretrained: bool = False, **kwargs: Any) → VGG
VGG-16 architecture as described in “Very Deep Convolutional Networks for Large-Scale Image Recognition”, modified by adding batch normalization, rectangular pooling and a simpler classification head.
    import torch
    from doctr.models import vgg16_bn_r
    model = vgg16_bn_r(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on ImageNet
- **kwargs – keyword arguments of the VGG architecture
Returns:
VGG feature extractor
doctr.models.classification.resnet18(pretrained: bool = False, **kwargs: Any) → ResNet
ResNet-18 architecture as described in “Deep Residual Learning for Image Recognition”.
    import torch
    from doctr.models import resnet18
    model = resnet18(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the ResNet architecture
Returns:
A resnet18 model
doctr.models.classification.resnet34(pretrained: bool = False, **kwargs: Any) → ResNet
ResNet-34 architecture as described in “Deep Residual Learning for Image Recognition”.
    import torch
    from doctr.models import resnet34
    model = resnet34(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the ResNet architecture
Returns:
A resnet34 model
doctr.models.classification.resnet50(pretrained: bool = False, **kwargs: Any) → ResNet
ResNet-50 architecture as described in “Deep Residual Learning for Image Recognition”.
    import torch
    from doctr.models import resnet50
    model = resnet50(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the ResNet architecture
Returns:
A resnet50 model
doctr.models.classification.resnet31(pretrained: bool = False, **kwargs: Any) → ResNet
ResNet-31 architecture with rectangular pooling windows as described in “Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition”. Downsizing: (H, W) -> (H/8, W/4)
    import torch
    from doctr.models import resnet31
    model = resnet31(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the ResNet architecture
Returns:
A resnet31 model
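The downsizing factors stated for resnet31, (H, W) -> (H/8, W/4), can be worked through with plain arithmetic. The helper below is hypothetical (it is not part of docTR) and assumes the factors are applied with floor division; the exact rounding in the real network depends on its pooling and padding configuration.

```python
from typing import Tuple


def resnet31_feature_map_size(height: int, width: int) -> Tuple[int, int]:
    """Approximate spatial size of the resnet31 output for an (H, W) input.

    Assumption: height is divided by 8 and width by 4, rounding down.
    """
    return height // 8, width // 4


# A typical text-recognition crop keeps more horizontal than vertical resolution.
print(resnet31_feature_map_size(32, 128))   # (4, 32)
print(resnet31_feature_map_size(512, 512))  # (64, 128)
```

The asymmetric factors keep more resolution along the width, which is the reading direction for horizontal text.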
doctr.models.classification.mobilenet_v3_small(pretrained: bool = False, **kwargs: Any) → MobileNetV3
MobileNetV3-Small architecture as described in “Searching for MobileNetV3”.
    import torch
    from doctr.models import mobilenet_v3_small
    model = mobilenet_v3_small(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the MobileNetV3 architecture
Returns:
a torch.nn.Module
doctr.models.classification.mobilenet_v3_large(pretrained: bool = False, **kwargs: Any) → MobileNetV3
MobileNetV3-Large architecture as described in “Searching for MobileNetV3”.
    import torch
    from doctr.models import mobilenet_v3_large
    model = mobilenet_v3_large(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the MobileNetV3 architecture
Returns:
a torch.nn.Module
doctr.models.classification.mobilenet_v3_small_r(pretrained: bool = False, **kwargs: Any) → MobileNetV3
MobileNetV3-Small architecture as described in “Searching for MobileNetV3”, with rectangular pooling.
    import torch
    from doctr.models import mobilenet_v3_small_r
    model = mobilenet_v3_small_r(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the MobileNetV3 architecture
Returns:
a torch.nn.Module
doctr.models.classification.mobilenet_v3_large_r(pretrained: bool = False, **kwargs: Any) → MobileNetV3
MobileNetV3-Large architecture as described in “Searching for MobileNetV3”, with rectangular pooling.
    import torch
    from doctr.models import mobilenet_v3_large_r
    model = mobilenet_v3_large_r(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the MobileNetV3 architecture
Returns:
a torch.nn.Module
doctr.models.classification.mobilenet_v3_small_crop_orientation(pretrained: bool = False, **kwargs: Any) → MobileNetV3
MobileNetV3-Small architecture as described in “Searching for MobileNetV3”.
    import torch
    from doctr.models import mobilenet_v3_small_crop_orientation
    model = mobilenet_v3_small_crop_orientation(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the MobileNetV3 architecture
Returns:
a torch.nn.Module
doctr.models.classification.mobilenet_v3_small_page_orientation(pretrained: bool = False, **kwargs: Any) → MobileNetV3
MobileNetV3-Small architecture as described in “Searching for MobileNetV3”.
    import torch
    from doctr.models import mobilenet_v3_small_page_orientation
    model = mobilenet_v3_small_page_orientation(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the MobileNetV3 architecture
Returns:
a torch.nn.Module
doctr.models.classification.magc_resnet31(pretrained: bool = False, **kwargs: Any) → ResNet
ResNet-31 architecture with Multi-Aspect Global Context Attention as described in “MASTER: Multi-Aspect Non-local Network for Scene Text Recognition”.
    import torch
    from doctr.models import magc_resnet31
    model = magc_resnet31(pretrained=False)
    input_tensor = torch.rand((1, 3, 224, 224), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the ResNet architecture
Returns:
A feature extractor model
doctr.models.classification.vit_s(pretrained: bool = False, **kwargs: Any) → VisionTransformer
VisionTransformer-S architecture as described in “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. Patches: (H, W) -> (H/8, W/8)
NOTE: unofficial config used in ViTSTR and ParSeq
    import torch
    from doctr.models import vit_s
    model = vit_s(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 32), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the VisionTransformer architecture
Returns:
A feature extractor model
doctr.models.classification.vit_b(pretrained: bool = False, **kwargs: Any) → VisionTransformer
VisionTransformer-B architecture as described in “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. Patches: (H, W) -> (H/8, W/8)
    import torch
    from doctr.models import vit_b
    model = vit_b(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 32), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the VisionTransformer architecture
Returns:
A feature extractor model
doctr.models.classification.textnet_tiny(pretrained: bool = False, **kwargs: Any) → TextNet
Implements the TextNet architecture from “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”. Implementation based on the official PyTorch implementation: https://github.com/czczup/FAST.
    import torch
    from doctr.models import textnet_tiny
    model = textnet_tiny(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the TextNet architecture
Returns:
A TextNet tiny model
doctr.models.classification.textnet_small(pretrained: bool = False, **kwargs: Any) → TextNet
Implements the TextNet architecture from “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”. Implementation based on the official PyTorch implementation: https://github.com/czczup/FAST.
    import torch
    from doctr.models import textnet_small
    model = textnet_small(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the TextNet architecture
Returns:
A TextNet small model
doctr.models.classification.textnet_base(pretrained: bool = False, **kwargs: Any) → TextNet
Implements the TextNet architecture from “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”. Implementation based on the official PyTorch implementation: https://github.com/czczup/FAST.
    import torch
    from doctr.models import textnet_base
    model = textnet_base(pretrained=False)
    input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained – boolean, True if model is pretrained
- **kwargs – keyword arguments of the TextNet architecture
Returns:
A TextNet base model
doctr.models.classification.vip_tiny(pretrained: bool = False, **kwargs: Any) → VIPNet
VIP-Tiny encoder architecture. Corresponds to the SVIPTRv2-T variant in the paper (VIPTRv2 function in the official implementation: https://github.com/cxfyxl/VIPTR/blob/main/modules/VIPTRv2.py).
Parameters:
- pretrained – whether to load pretrained weights
- **kwargs – optional arguments
Returns:
VIPNet model
doctr.models.classification.vip_base(pretrained: bool = False, **kwargs: Any) → VIPNet
VIP-Base encoder architecture. Corresponds to the SVIPTRv2-B variant in the paper (VIPTRv2B function in the official implementation: https://github.com/cxfyxl/VIPTR/blob/main/modules/VIPTRv2.py).
Parameters:
- pretrained – whether to load pretrained weights
- **kwargs – optional arguments
Returns:
VIPNet model
doctr.models.classification.crop_orientation_predictor(arch: Any = 'mobilenet_v3_small_crop_orientation', pretrained: bool = False, batch_size: int = 128, **kwargs: Any) → OrientationPredictor
Crop orientation classification architecture.
    import numpy as np
    from doctr.models import crop_orientation_predictor
    model = crop_orientation_predictor(arch='mobilenet_v3_small_crop_orientation', pretrained=True)
    input_crop = (255 * np.random.rand(256, 256, 3)).astype(np.uint8)
    out = model([input_crop])
Parameters:
- arch – name of the architecture to use (e.g. ‘mobilenet_v3_small_crop_orientation’)
- pretrained – If True, returns a model pre-trained on our recognition crops dataset
- batch_size – number of samples the model processes in parallel
- **kwargs – keyword arguments to be passed to the OrientationPredictor
Returns:
OrientationPredictor
doctr.models.classification.page_orientation_predictor(arch: Any = 'mobilenet_v3_small_page_orientation', pretrained: bool = False, batch_size: int = 4, **kwargs: Any) → OrientationPredictor
Page orientation classification architecture.
    import numpy as np
    from doctr.models import page_orientation_predictor
    model = page_orientation_predictor(arch='mobilenet_v3_small_page_orientation', pretrained=True)
    input_page = (255 * np.random.rand(512, 512, 3)).astype(np.uint8)
    out = model([input_page])
Parameters:
- arch – name of the architecture to use (e.g. ‘mobilenet_v3_small_page_orientation’)
- pretrained – If True, returns a model pre-trained on our recognition crops dataset
- batch_size – number of samples the model processes in parallel
- **kwargs – keyword arguments to be passed to the OrientationPredictor
Returns:
OrientationPredictor
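The batch_size parameter of the two orientation predictors (128 for crops and 4 for pages by default) sets how many samples are processed in parallel. As a rough illustration of what that means for a list of inputs, the sketch below splits a sequence into consecutive batches. The iter_batches helper is hypothetical and uses only the standard library; it is not docTR's internal batching code.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(samples: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `batch_size` samples."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])


# Ten placeholder pages with batch_size=4 give two full batches and one partial batch.
pages = [f"page_{i}" for i in range(10)]
batches = list(iter_batches(pages, batch_size=4))
print([len(batch) for batch in batches])  # [4, 4, 2]
```

A larger batch size generally trades memory for throughput, which would be consistent with small crops defaulting to a much larger batch than full pages.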
doctr.models.detection
doctr.models.detection.linknet_resnet18(pretrained: bool = False, **kwargs: Any) → LinkNet
LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”.
    import torch
    from doctr.models import linknet_resnet18
    model = linknet_resnet18(pretrained=True).eval()
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text detection dataset
- **kwargs – keyword arguments of the LinkNet architecture
Returns:
text detection architecture
doctr.models.detection.linknet_resnet34(pretrained: bool = False, **kwargs: Any) → LinkNet
LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”.
    import torch
    from doctr.models import linknet_resnet34
    model = linknet_resnet34(pretrained=True).eval()
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text detection dataset
- **kwargs – keyword arguments of the LinkNet architecture
Returns:
text detection architecture
doctr.models.detection.linknet_resnet50(pretrained: bool = False, **kwargs: Any) → LinkNet
LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”.
    import torch
    from doctr.models import linknet_resnet50
    model = linknet_resnet50(pretrained=True).eval()
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text detection dataset
- **kwargs – keyword arguments of the LinkNet architecture
Returns:
text detection architecture
doctr.models.detection.db_resnet50(pretrained: bool = False, **kwargs: Any) → DBNet
DBNet as described in “Real-time Scene Text Detection with Differentiable Binarization”, using a ResNet-50 backbone.
    import torch
    from doctr.models import db_resnet50
    model = db_resnet50(pretrained=True)
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text detection dataset
- **kwargs – keyword arguments of the DBNet architecture
Returns:
text detection architecture
doctr.models.detection.db_mobilenet_v3_large(pretrained: bool = False, **kwargs: Any) → DBNet
DBNet as described in “Real-time Scene Text Detection with Differentiable Binarization”, using a MobileNet V3 Large backbone.
    import torch
    from doctr.models import db_mobilenet_v3_large
    model = db_mobilenet_v3_large(pretrained=True)
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text detection dataset
- **kwargs – keyword arguments of the DBNet architecture
Returns:
text detection architecture
doctr.models.detection.fast_tiny(pretrained: bool = False, **kwargs: Any) → FAST
FAST as described in “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”, using a tiny TextNet backbone.
    import torch
    from doctr.models import fast_tiny
    model = fast_tiny(pretrained=True)
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text detection dataset
- **kwargs – keyword arguments of the FAST architecture
Returns:
text detection architecture
doctr.models.detection.fast_small(pretrained: bool = False, **kwargs: Any) → FAST
FAST as described in “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”, using a small TextNet backbone.
    import torch
    from doctr.models import fast_small
    model = fast_small(pretrained=True)
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text detection dataset
- **kwargs – keyword arguments of the FAST architecture
Returns:
text detection architecture
doctr.models.detection.fast_base(pretrained: bool = False, **kwargs: Any) → FAST
FAST as described in “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”, using a base TextNet backbone.
    import torch
    from doctr.models import fast_base
    model = fast_base(pretrained=True)
    input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text detection dataset
- **kwargs – keyword arguments of the FAST architecture
Returns:
text detection architecture
doctr.models.detection.detection_predictor(arch: Any = 'fast_base', pretrained: bool = False, assume_straight_pages: bool = True, preserve_aspect_ratio: bool = True, symmetric_pad: bool = True, batch_size: int = 2, **kwargs: Any) → DetectionPredictor
Text detection architecture.
    import numpy as np
    from doctr.models import detection_predictor
    model = detection_predictor(arch='db_resnet50', pretrained=True)
    input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
    out = model([input_page])
Parameters:
- arch – name of the architecture or model itself to use (e.g. ‘db_resnet50’)
- pretrained – If True, returns a model pre-trained on our text detection dataset
- assume_straight_pages – If True, fit straight boxes to the page
- preserve_aspect_ratio – If True, pad the input document image to preserve the aspect ratio before running the detection model on it
- symmetric_pad – if True, pad the image symmetrically instead of padding at the bottom-right
- batch_size – number of samples the model processes in parallel
- **kwargs – optional keyword arguments passed to the architecture
Returns:
Detection predictor
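To make the interplay of preserve_aspect_ratio and symmetric_pad more concrete, the sketch below computes the resized shape and the padding for one page. It is a standalone illustration: the plan_resize_and_pad helper and the 1024 x 1024 target size are assumptions chosen for the example, and docTR's actual preprocessing may round or pad differently.

```python
from typing import Tuple


def plan_resize_and_pad(
    image_hw: Tuple[int, int],
    target_hw: Tuple[int, int],
    symmetric_pad: bool = True,
) -> Tuple[Tuple[int, int], Tuple[int, int, int, int]]:
    """Return the resized (height, width) and the (left, top, right, bottom) padding."""
    img_h, img_w = image_hw
    tgt_h, tgt_w = target_hw
    # Scale by the tighter dimension so the aspect ratio is preserved.
    scale = min(tgt_h / img_h, tgt_w / img_w)
    new_h = min(tgt_h, round(img_h * scale))
    new_w = min(tgt_w, round(img_w * scale))
    pad_h, pad_w = tgt_h - new_h, tgt_w - new_w
    # Symmetric padding centers the page; otherwise all padding goes bottom-right.
    top, left = (pad_h // 2, pad_w // 2) if symmetric_pad else (0, 0)
    return (new_h, new_w), (left, top, pad_w - left, pad_h - top)


# A 600 x 800 page fitted into an assumed 1024 x 1024 model input.
print(plan_resize_and_pad((600, 800), (1024, 1024), symmetric_pad=True))
# ((768, 1024), (0, 128, 0, 128))
print(plan_resize_and_pad((600, 800), (1024, 1024), symmetric_pad=False))
# ((768, 1024), (0, 0, 0, 256))
```

In both cases the page content is scaled identically; only the position of the padding changes, which matters when mapping predicted boxes back to the original page coordinates.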
doctr.models.recognition
doctr.models.recognition.crnn_vgg16_bn(pretrained: bool = False, **kwargs: Any) → CRNN
CRNN with a VGG-16 backbone as described in “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition”.
    import torch
    from doctr.models import crnn_vgg16_bn
    model = crnn_vgg16_bn(pretrained=True)
    input_tensor = torch.rand(1, 3, 32, 128)
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- **kwargs – keyword arguments of the CRNN architecture
Returns:
text recognition architecture
doctr.models.recognition.crnn_mobilenet_v3_small(pretrained: bool = False, **kwargs: Any) → CRNN
CRNN with a MobileNet V3 Small backbone as described in “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition”.
    import torch
    from doctr.models import crnn_mobilenet_v3_small
    model = crnn_mobilenet_v3_small(pretrained=True)
    input_tensor = torch.rand(1, 3, 32, 128)
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- **kwargs – keyword arguments of the CRNN architecture
Returns:
text recognition architecture
doctr.models.recognition.crnn_mobilenet_v3_large(pretrained: bool = False, **kwargs: Any) → CRNN
CRNN with a MobileNet V3 Large backbone as described in “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition”.
    import torch
    from doctr.models import crnn_mobilenet_v3_large
    model = crnn_mobilenet_v3_large(pretrained=True)
    input_tensor = torch.rand(1, 3, 32, 128)
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- **kwargs – keyword arguments of the CRNN architecture
Returns:
text recognition architecture
doctr.models.recognition.sar_resnet31(pretrained: bool = False, **kwargs: Any) → SAR
SAR with a ResNet-31 feature extractor as described in “Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition”.
    import torch
    from doctr.models import sar_resnet31
    model = sar_resnet31(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 128))
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- **kwargs – keyword arguments of the SAR architecture
Returns:
text recognition architecture
doctr.models.recognition.master(pretrained: bool = False, **kwargs: Any) → MASTER
MASTER as described in the paper: https://arxiv.org/pdf/1910.02562.pdf.
    import torch
    from doctr.models import master
    model = master(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 128))
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- **kwargs – keyword arguments passed to the MASTER architecture
Returns:
text recognition architecture
doctr.models.recognition.vitstr_small(pretrained: bool = False, **kwargs: Any) → ViTSTR
ViTSTR-Small as described in “Vision Transformer for Fast and Efficient Scene Text Recognition”.
    import torch
    from doctr.models import vitstr_small
    model = vitstr_small(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 128))
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- **kwargs – keyword arguments of the ViTSTR architecture
Returns:
text recognition architecture
doctr.models.recognition.vitstr_base(pretrained: bool = False, **kwargs: Any) → ViTSTR
ViTSTR-Base as described in “Vision Transformer for Fast and Efficient Scene Text Recognition”.
    import torch
    from doctr.models import vitstr_base
    model = vitstr_base(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 128))
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- **kwargs – keyword arguments of the ViTSTR architecture
Returns:
text recognition architecture
doctr.models.recognition.parseq(pretrained: bool = False, **kwargs: Any) → PARSeq
PARSeq architecture from “Scene Text Recognition with Permuted Autoregressive Sequence Models”.
    import torch
    from doctr.models import parseq
    model = parseq(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 128))
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- **kwargs – keyword arguments of the PARSeq architecture
Returns:
text recognition architecture
doctr.models.recognition.viptr_tiny(pretrained: bool = False, **kwargs: Any) → VIPTR
VIPTR-Tiny as described in “A Vision Permutable Extractor for Fast and Efficient Scene Text Recognition”.
    import torch
    from doctr.models import viptr_tiny
    model = viptr_tiny(pretrained=False)
    input_tensor = torch.rand((1, 3, 32, 128))
    out = model(input_tensor)
Parameters:
- pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- **kwargs – keyword arguments of the VIPTR architecture
Returns:
a VIPTR model instance
Return type:
VIPTR
doctr.models.recognition.recognition_predictor(arch: Any = 'crnn_vgg16_bn', pretrained: bool = False, symmetric_pad: bool = False, batch_size: int = 128, **kwargs: Any) → RecognitionPredictor
Text recognition architecture.
    import numpy as np
    from doctr.models import recognition_predictor
    model = recognition_predictor(pretrained=True)
    input_page = (255 * np.random.rand(32, 128, 3)).astype(np.uint8)
    out = model([input_page])
Parameters:
- arch – name of the architecture or model itself to use (e.g. ‘crnn_vgg16_bn’)
- pretrained – If True, returns a model pre-trained on our text recognition dataset
- symmetric_pad – if True, pad the image symmetrically instead of padding at the bottom-right
- batch_size – number of samples the model processes in parallel
- **kwargs – optional parameters to be passed to the architecture
Returns:
Recognition predictor
doctr.models.zoo
doctr.models.ocr_predictor(det_arch: Any = 'fast_base', reco_arch: Any = 'crnn_vgg16_bn', pretrained: bool = False, pretrained_backbone: bool = True, assume_straight_pages: bool = True, preserve_aspect_ratio: bool = True, symmetric_pad: bool = True, export_as_straight_boxes: bool = False, detect_orientation: bool = False, straighten_pages: bool = False, detect_language: bool = False, **kwargs: Any) → OCRPredictor
End-to-end OCR architecture using one model for localization, and another for text recognition.
    import numpy as np
    from doctr.models import ocr_predictor
    model = ocr_predictor('db_resnet50', 'crnn_vgg16_bn', pretrained=True)
    input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
    out = model([input_page])
Parameters:
- det_arch – name of the detection architecture or the model itself to use (e.g. ‘db_resnet50’, ‘db_mobilenet_v3_large’)
- reco_arch – name of the recognition architecture or the model itself to use (e.g. ‘crnn_vgg16_bn’, ‘sar_resnet31’)
- pretrained – If True, returns a model pre-trained on our OCR dataset
- pretrained_backbone – If True, returns a model with a pretrained backbone
- assume_straight_pages – if True, speeds up the inference by assuming you only pass straight pages without rotated textual elements.
- preserve_aspect_ratio – If True, pad the input document image to preserve the aspect ratio before running the detection model on it.
- symmetric_pad – if True, pad the image symmetrically instead of padding at the bottom-right.
- export_as_straight_boxes – when assume_straight_pages is set to False, export final predictions (potentially rotated) as straight bounding boxes.
- detect_orientation – if True, the estimated general page orientation will be added to the predictions for each page. Doing so will slightly increase the overall latency.
- straighten_pages – if True, estimates the general page orientation from the median line orientation of the segmentation map, then rotates the page before passing it again to the deep learning detection module. Doing so will improve performance for documents with page-uniform rotations.
- detect_language – if True, the language prediction will be added to the predictions for each page. Doing so will slightly increase the overall latency.
- **kwargs – keyword arguments of OCRPredictor
Returns:
OCR predictor
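The export_as_straight_boxes option describes turning potentially rotated predictions into straight bounding boxes. One common way to do this, shown here as a sketch rather than as docTR's internal code, is to take the axis-aligned box that encloses the four corners of the rotated polygon. The polygon_to_straight_box helper and the sample coordinates (given as relative values between 0 and 1) are assumptions for illustration.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def polygon_to_straight_box(polygon: Sequence[Point]) -> Tuple[float, float, float, float]:
    """Return (xmin, ymin, xmax, ymax) of the axis-aligned box enclosing the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


# Four corners of a slightly rotated word box, as (x, y) pairs.
rotated = [(0.30, 0.20), (0.70, 0.30), (0.68, 0.40), (0.28, 0.30)]
print(polygon_to_straight_box(rotated))  # (0.28, 0.2, 0.7, 0.4)
```

The enclosing box is always at least as large as the rotated polygon, so straight boxes exported this way can overlap more than the original rotated predictions.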
doctr.models.kie_predictor(det_arch: Any = 'fast_base', reco_arch: Any = 'crnn_vgg16_bn', pretrained: bool = False, pretrained_backbone: bool = True, assume_straight_pages: bool = True, preserve_aspect_ratio: bool = True, symmetric_pad: bool = True, export_as_straight_boxes: bool = False, detect_orientation: bool = False, straighten_pages: bool = False, detect_language: bool = False, **kwargs: Any) → KIEPredictor
End-to-end KIE architecture using one model for localization, and another for text recognition.
    import numpy as np
    from doctr.models import kie_predictor
    model = kie_predictor('db_resnet50', 'crnn_vgg16_bn', pretrained=True)
    input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
    out = model([input_page])
Parameters:
- det_arch – name of the detection architecture or the model itself to use (e.g. ‘db_resnet50’, ‘db_mobilenet_v3_large’)
- reco_arch – name of the recognition architecture or the model itself to use (e.g. ‘crnn_vgg16_bn’, ‘sar_resnet31’)
- pretrained – If True, returns a model pre-trained on our OCR dataset
- pretrained_backbone – If True, returns a model with a pretrained backbone
- assume_straight_pages – if True, speeds up the inference by assuming you only pass straight pages without rotated textual elements.
- preserve_aspect_ratio – If True, pad the input document image to preserve the aspect ratio before running the detection model on it.
- symmetric_pad – if True, pad the image symmetrically instead of padding at the bottom-right.
- export_as_straight_boxes – when assume_straight_pages is set to False, export final predictions (potentially rotated) as straight bounding boxes.
- detect_orientation – if True, the estimated general page orientation will be added to the predictions for each page. Doing so will slightly increase the overall latency.
- straighten_pages – if True, estimates the general page orientation from the median line orientation of the segmentation map, then rotates the page before passing it again to the deep learning detection module. Doing so will improve performance for documents with page-uniform rotations.
- detect_language – if True, the language prediction will be added to the predictions for each page. Doing so will slightly increase the overall latency.
- **kwargs – keyword arguments of KIEPredictor
Returns:
KIE predictor
doctr.models.factory
doctr.models.factory.login_to_hub() → None
Log in to the Hugging Face Hub.
doctr.models.factory.from_hub(repo_id: str, **kwargs: Any)
Instantiate and load a pretrained model from the Hugging Face Hub.
    from doctr.models import from_hub
    model = from_hub("mindee/fasterrcnn_mobilenet_v3_large_fpn")
Parameters:
- repo_id – Hugging Face model hub repository
- **kwargs – keyword arguments of hf_hub_download or snapshot_download
Returns:
Model loaded with the checkpoint
doctr.models.factory.push_to_hf_hub(model: Any, model_name: str, task: str, **kwargs) → None
Save the model and its configuration on the Hugging Face Hub.
    from doctr.models import login_to_hub, push_to_hf_hub
    from doctr.models.recognition import crnn_mobilenet_v3_small
    login_to_hub()
    model = crnn_mobilenet_v3_small(pretrained=True)
    push_to_hf_hub(model, 'my-model', 'recognition', arch='crnn_mobilenet_v3_small')
Parameters:
- model – PyTorch model to be saved
- model_name – name of the model which is also the repository name
- task – task name
- **kwargs – keyword arguments for push_to_hf_hub