Choosing the right model - docTR documentation

The full Optical Character Recognition (OCR) task can be seen as two consecutive tasks: text detection and text recognition. Whether they are performed at once or separately, each task corresponds to a type of deep learning architecture.

For a given task, docTR provides a Predictor, which is composed of 2 components:

Text Detection

The task consists of localizing textual elements in a given image. While those text elements can represent many things, in docTR, we will consider uninterrupted character sequences (words). Additionally, the localization can take several forms: from straight bounding boxes (delimited by the 2D coordinates of the top-left and bottom-right corners), to polygons, or binary segmentation (flagging which pixels belong to this element, and which don’t). Our latest detection models work with rotated and skewed documents!
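The localization forms listed above can be converted into one another. The following is a minimal sketch in plain Python, not docTR code: `box_to_polygon` and `mask_to_box` are hypothetical helper names, and the mask is assumed to be a nested list of 0/1 values.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def box_to_polygon(top_left: Point, bottom_right: Point) -> List[Point]:
    """Expand a straight box into its 4 corners, clockwise from the top-left."""
    (xmin, ymin), (xmax, ymax) = top_left, bottom_right
    return [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)]


def mask_to_box(mask: List[List[int]]) -> Tuple[Point, Point]:
    """Return the tightest straight box around the non-zero pixels of a mask."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        raise ValueError("mask contains no foreground pixel")
    return (min(cols), min(rows)), (max(cols), max(rows))


# Binary segmentation: 1 flags the pixels that belong to the text element
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
top_left, bottom_right = mask_to_box(mask)
polygon = box_to_polygon(top_left, bottom_right)
print(top_left, bottom_right, polygon)
```

Going from a mask to a box loses information (the exact pixel shape), which is why rotated or skewed text is better described by polygons than by straight boxes.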

Available architectures

The following architectures are currently supported:

For a comprehensive comparison, we have compiled a detailed benchmark on publicly available datasets:

| Architecture | Input shape | # params | FUNSD Recall | FUNSD Precision | CORD Recall | CORD Precision | sec/it (B: 1) |
|---|---|---|---|---|---|---|---|
| db_resnet34 | (1024, 1024, 3) | 22.4 M | 82.76 | 76.75 | 89.20 | 71.74 | 0.8 |
| db_resnet50 | (1024, 1024, 3) | 25.4 M | 83.56 | 86.68 | 92.61 | 86.39 | 1.1 |
| db_mobilenet_v3_large | (1024, 1024, 3) | 4.2 M | 82.69 | 84.63 | 94.51 | 70.28 | 0.5 |
| linknet_resnet18 | (1024, 1024, 3) | 11.5 M | 81.64 | 85.52 | 88.92 | 82.74 | 0.6 |
| linknet_resnet34 | (1024, 1024, 3) | 21.6 M | 81.62 | 82.95 | 86.26 | 81.06 | 0.7 |
| linknet_resnet50 | (1024, 1024, 3) | 28.8 M | 81.78 | 82.47 | 87.29 | 85.54 | 1.0 |
| fast_tiny | (1024, 1024, 3) | 13.5 M (8.5M) | 84.90 | 85.04 | 93.73 | 76.26 | 0.7 (0.4) |
| fast_small | (1024, 1024, 3) | 14.7 M (9.7M) | 85.36 | 86.68 | 94.09 | 78.53 | 0.7 (0.5) |
| fast_base | (1024, 1024, 3) | 16.3 M (10.6M) | 84.95 | 86.73 | 94.39 | 85.36 | 0.8 (0.5) |

All text detection models above have been evaluated using both the training and evaluation sets of FUNSD and CORD (cf. doctr.datasets). Explanations about the metrics being used are available in Task evaluation.

Disclaimer: both FUNSD subsets combined have 199 pages, which might not be representative enough of the model capabilities.

Seconds per iteration (with a batch size of 1) is computed after a warmup phase of 100 tensors, by measuring the average number of processed tensors per second over 1000 samples. Those results were obtained on an 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz.
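The timing protocol described above (untimed warmup, then an average over a fixed number of samples) can be sketched as follows. This is only an illustration, not the benchmark script behind the table: `seconds_per_iteration` and `dummy_inference` are hypothetical names, and the dummy function stands in for a model forward pass on one batch.

```python
import time
from typing import Callable


def seconds_per_iteration(fn: Callable[[], object], warmup: int = 100, samples: int = 1000) -> float:
    """Average wall-clock seconds per call of `fn`, measured after a warmup."""
    for _ in range(warmup):  # warmup calls are executed but not timed
        fn()
    start = time.perf_counter()
    for _ in range(samples):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / samples


def dummy_inference() -> int:
    # Stand-in for a model forward pass (assumption for this sketch)
    return sum(range(1000))


# Smaller counts than in the documentation, to keep the example fast
sec_per_it = seconds_per_iteration(dummy_inference, warmup=10, samples=50)
print(f"{sec_per_it:.6f} s/it")
```

The warmup matters because the first calls of a deep learning model often include one-off costs (memory allocation, kernel selection) that would otherwise inflate the average.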

Detection predictors

detection_predictor wraps your detection model to make it easily usable with your favorite deep learning framework.

    import numpy as np
    from doctr.models import detection_predictor

    model = detection_predictor('db_resnet50')
    dummy_img = (255 * np.random.rand(800, 600, 3)).astype(np.uint8)
    out = model([dummy_img])

You can pass specific boolean arguments to the predictor:

* pretrained: set pretrained=True to load weights pretrained on a specific dataset. With pretrained=False, which is the default, the model is randomly initialized and will produce no useful results.
* assume_straight_pages: if you work with straight documents only, the predictor will fit straight bounding boxes to the text areas.
* preserve_aspect_ratio: set it if you want to preserve the aspect ratio of your documents when they are resized before being sent to the model.
* symmetric_pad: if you choose to preserve the aspect ratio, the image will be padded symmetrically and not from the bottom-right.

For instance, this snippet instantiates a detection predictor able to detect text on rotated documents while preserving the aspect ratio:

    from doctr.models import detection_predictor

    predictor = detection_predictor(
        'db_resnet50',
        pretrained=True,
        assume_straight_pages=False,
        preserve_aspect_ratio=True,
    )
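To see what preserve_aspect_ratio and symmetric_pad mean geometrically, the arithmetic can be sketched in plain Python. This is a sketch under assumptions and not docTR's preprocessing code: `resize_and_pad_geometry` is a hypothetical helper, the target size (1024, 1024) is taken from the input shape in the benchmark table, and the rounding strategy is a guess.

```python
from typing import Tuple


def resize_and_pad_geometry(
    height: int, width: int, target: Tuple[int, int], symmetric_pad: bool = False
) -> Tuple[Tuple[int, int], Tuple[int, int, int, int]]:
    """Return the resized (height, width) and the (top, bottom, left, right) padding.

    The image is scaled by a single factor so that it fits inside `target`,
    then padded so that it reaches exactly `target`.
    """
    target_h, target_w = target
    scale = min(target_h / height, target_w / width)
    new_h, new_w = round(height * scale), round(width * scale)
    pad_h, pad_w = target_h - new_h, target_w - new_w
    if symmetric_pad:
        top, left = pad_h // 2, pad_w // 2  # split the padding on both sides
    else:
        top, left = 0, 0  # all the padding goes to the bottom and right
    return (new_h, new_w), (top, pad_h - top, left, pad_w - left)


# An 800 x 600 page resized for a (1024, 1024) model input
print(resize_and_pad_geometry(800, 600, (1024, 1024)))
print(resize_and_pad_geometry(800, 600, (1024, 1024), symmetric_pad=True))
```

Without aspect ratio preservation, the same page would simply be stretched to 1024 x 1024, which distorts character shapes on non-square documents.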

Text Recognition

The task consists of transcribing the character sequence in a given image.

Available architectures

The following architectures are currently supported:

For a comprehensive comparison, we have compiled a detailed benchmark on publicly available datasets:

| Architecture | Input shape | # params | FUNSD Exact | FUNSD Partial | CORD Exact | CORD Partial | sec/it (B: 64) |
|---|---|---|---|---|---|---|---|
| crnn_vgg16_bn | (32, 128, 3) | 15.8 M | 88.21 | 88.95 | 95.47 | 95.91 | 0.6 |
| crnn_mobilenet_v3_small | (32, 128, 3) | 2.1 M | 87.25 | 87.99 | 93.91 | 94.34 | 0.05 |
| crnn_mobilenet_v3_large | (32, 128, 3) | 4.5 M | 87.38 | 88.09 | 94.46 | 94.92 | 0.08 |
| master | (32, 128, 3) | 58.7 M | 88.57 | 89.39 | 95.73 | 96.21 | 17.6 |
| sar_resnet31 | (32, 128, 3) | 55.4 M | 88.10 | 88.88 | 94.83 | 95.29 | 4.9 |
| vitstr_small | (32, 128, 3) | 21.4 M | 88.00 | 88.82 | 95.40 | 95.78 | 1.5 |
| vitstr_base | (32, 128, 3) | 85.2 M | 88.33 | 89.09 | 95.32 | 95.71 | 4.1 |
| parseq | (32, 128, 3) | 23.8 M | 88.53 | 89.24 | 95.56 | 95.91 | 2.2 |
| viptr_tiny | (32, 128, 3) | 3.2 M | 86.03 | 86.71 | 93.08 | 93.47 | 0.08 |

All text recognition models above have been evaluated using both the training and evaluation sets of FUNSD and CORD (cf. doctr.datasets). Explanations about the metric being used (exact match) are available in Task evaluation.
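As a rough illustration of an exact-match metric, the sketch below counts the share of predictions that equal their ground truth. It is not docTR's evaluation code: `exact_match_rate` is a hypothetical helper, and the case-insensitive variant is only an illustrative relaxation, not the definition of the "Partial" column (see Task evaluation for that).

```python
from typing import Sequence


def exact_match_rate(
    ground_truths: Sequence[str], predictions: Sequence[str], ignore_case: bool = False
) -> float:
    """Share of predictions that are identical to their ground truth."""
    if len(ground_truths) != len(predictions):
        raise ValueError("ground truths and predictions must have the same length")
    if not ground_truths:
        return 0.0

    def normalize(text: str) -> str:
        return text.lower() if ignore_case else text

    matches = sum(normalize(gt) == normalize(pred) for gt, pred in zip(ground_truths, predictions))
    return matches / len(ground_truths)


gts = ["No.", "RECEIPT", "DATE", "Total"]
preds = ["No.", "RECEIPT", "DATE", "total"]
strict = exact_match_rate(gts, preds)
relaxed = exact_match_rate(gts, preds, ignore_case=True)
print(strict, relaxed)
```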

While most of our recognition models were trained on our French vocab (cf. Supported Vocabs), you can easily access the vocab of any model as follows:

    from doctr.models import recognition_predictor

    predictor = recognition_predictor('crnn_vgg16_bn')
    print(predictor.model.cfg['vocab'])

Disclaimer: both FUNSD subsets combined have 30595 word-level crops, which might not be representative enough of the model capabilities.

Seconds per iteration (with a batch size of 64) is computed after a warmup phase of 100 tensors, by measuring the average number of processed tensors per second over 1000 samples. Those results were obtained on an 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz.

Recognition predictors

recognition_predictor wraps your recognition model to make it easily usable with your favorite deep learning framework.

    import numpy as np
    from doctr.models import recognition_predictor

    model = recognition_predictor('crnn_vgg16_bn')
    dummy_img = (255 * np.random.rand(50, 150, 3)).astype(np.uint8)
    out = model([dummy_img])

End-to-End OCR

The task consists of both localizing and transcribing textual elements in a given image.

Available architectures

You can use any combination of detection and recognition models supported by docTR.

For a comprehensive comparison, we have compiled a detailed benchmark on publicly available datasets:

| Architecture | FUNSD Recall | FUNSD Precision | CORD Recall | CORD Precision |
|---|---|---|---|---|
| db_resnet50 + crnn_vgg16_bn | 73.37 | 76.11 | 84.80 | 79.09 |
| db_resnet50 + crnn_mobilenet_v3_small | 73.06 | 75.79 | 84.64 | 78.94 |
| db_resnet50 + crnn_mobilenet_v3_large | 73.17 | 75.90 | 84.96 | 79.25 |
| db_resnet50 + master | 73.90 | 76.66 | 85.84 | 80.07 |
| db_resnet50 + sar_resnet31 | 73.58 | 76.33 | 85.64 | 79.88 |
| db_resnet50 + vitstr_small | 73.06 | 75.79 | 85.95 | 80.17 |
| db_resnet50 + vitstr_base | 73.70 | 76.46 | 85.76 | 79.99 |
| db_resnet50 + parseq | 73.52 | 76.27 | 85.91 | 80.13 |
| Gvision text detection | 59.50 | 62.50 | 75.30 | 59.03 |
| Gvision doc. text detection | 64.00 | 53.30 | 68.90 | 61.10 |
| AWS textract | 78.10 | 83.00 | 87.50 | 66.00 |
| Azure Form Recognizer (v3.2) | 79.42 | 85.89 | 89.62 | 88.93 |

All OCR models above have been evaluated using both the training and evaluation sets of FUNSD and CORD (cf. doctr.datasets). Explanations about the metrics being used are available in Task evaluation.

Disclaimer: both FUNSD subsets combined have 199 pages, which might not be representative enough of the model capabilities.

Two-stage approaches

Those architectures involve one stage of text detection, and one stage of text recognition. The text detection stage is used to produce cropped images that are passed to the text recognition block. Everything is wrapped up with ocr_predictor.

    import numpy as np
    from doctr.models import ocr_predictor

    model = ocr_predictor('db_resnet50', 'crnn_vgg16_bn', pretrained=True)
    input_page = (255 * np.random.rand(800, 600, 3)).astype(np.uint8)
    out = model([input_page])
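To make the two-stage flow more concrete, the sketch below shows how relative detection boxes could be turned into crops for the recognition stage. It is an illustration under assumptions, not docTR's implementation: the image is a plain nested list of pixel values, boxes are assumed to be straight (xmin, ymin, xmax, ymax) tuples in relative coordinates, and `crop_boxes` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), relative to the page


def crop_boxes(image: Sequence[Sequence[int]], boxes: Sequence[Box]) -> List[List[List[int]]]:
    """Cut one crop per box out of a 2D image stored as a list of rows."""
    height, width = len(image), len(image[0])
    crops = []
    for xmin, ymin, xmax, ymax in boxes:
        # Convert relative coordinates to pixel indices
        x0, x1 = int(xmin * width), int(xmax * width)
        y0, y1 = int(ymin * height), int(ymax * height)
        crops.append([list(row[x0:x1]) for row in image[y0:y1]])
    return crops


# A 4 x 8 "image" whose pixel value encodes its position (row * 10 + column)
image = [[row * 10 + col for col in range(8)] for row in range(4)]
crops = crop_boxes(image, [(0.25, 0.5, 0.75, 1.0)])
print(crops[0])
```

Each crop would then be resized to the recognition model's input shape and transcribed independently, which is why detection quality bounds the end-to-end results.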

You can pass specific boolean arguments to the predictor: assume_straight_pages, preserve_aspect_ratio and symmetric_pad. Those 3 go straight to the detection predictor, as described above (in the detection part).

Additional arguments which can be passed to the ocr_predictor are:

For instance, this snippet instantiates an end-to-end ocr_predictor working with rotated documents, which preserves the aspect ratio of the documents, and returns polygons:

    from doctr.models import ocr_predictor

    model = ocr_predictor(
        'linknet_resnet18',
        pretrained=True,
        assume_straight_pages=False,
        preserve_aspect_ratio=True,
    )

Additionally, you can change the batch size of the underlying detection and recognition predictors to optimize the performance depending on your hardware:

    from doctr.models import ocr_predictor

    model = ocr_predictor(pretrained=True, det_bs=4, reco_bs=1024)
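Conceptually, a batch size controls how many inputs are sent through a model in one forward pass. The sketch below only illustrates that splitting; `split_into_batches` is a hypothetical helper and does not reflect how docTR batches its inputs internally.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 10 pages processed with a detection batch size of 4 -> 3 forward passes
pages = [f"page_{i}" for i in range(10)]
batch_sizes = [len(batch) for batch in split_into_batches(pages, batch_size=4)]
print(batch_sizes)
```

Larger batches mean fewer forward passes but more memory per pass, so the best values depend on the hardware at hand.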

To modify the output structure you can pass the following arguments to the predictor which will be handled by the underlying DocumentBuilder:

For example, to disable the automatic grouping of lines into blocks:

    from doctr.models import ocr_predictor

    model = ocr_predictor(pretrained=True, resolve_blocks=False)

Running the predictors on GPU

You can run the predictors on GPU by specifying the appropriate device.

Here’s how to do it for both NVIDIA and Apple Silicon (MPS) GPUs:

    import torch
    from doctr.models import ocr_predictor

    # For NVIDIA GPU
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    predictor = ocr_predictor(pretrained=True).to(device)

    # Alternatively: predictor = ocr_predictor(pretrained=True).cuda()

    # For Apple Silicon (MPS)
    device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
    predictor = ocr_predictor(pretrained=True).to(device)

The same approach applies to all standalone predictors:

Just create the predictor instance and move it to the appropriate device. To enable half-precision inference, you can append .half() after moving the predictor to the device.

What should I do with the output?

The ocr_predictor returns a Document object with a nested structure (with Page, Block, Line, Word, Artefact). To get a better understanding of our document model, check our Document structure section.

Here is a typical Document layout:

    Document(
      (pages): [Page(
        dimensions=(340, 600)
        (blocks): [Block(
          (lines): [Line(
            (words): [
              Word(value='No.', confidence=0.91),
              Word(value='RECEIPT', confidence=0.99),
              Word(value='DATE', confidence=0.96),
            ]
          )]
          (artefacts): []
        )]
      )]
    )

To get only the text content of the Document, you can use the render method:

text_output = result.render()

For reference, here is the output for the Document above:

You can also export them as a nested dict, more appropriate for JSON format:

json_output = result.export()

For reference, here is the export for the same Document as above:

    {
      'pages': [
          {
              'page_idx': 0,
              'dimensions': (340, 600),
              'orientation': {'value': None, 'confidence': None},
              'language': {'value': None, 'confidence': None},
              'blocks': [
                  {
                      'geometry': ((0.1357421875, 0.0361328125), (0.8564453125, 0.8603515625)),
                      'lines': [
                          {
                              'geometry': ((0.1357421875, 0.0361328125), (0.8564453125, 0.8603515625)),
                              'words': [
                                  {
                                      'value': 'No.',
                                      'confidence': 0.914085328578949,
                                      'geometry': ((0.5478515625, 0.06640625), (0.5810546875, 0.0966796875)),
                                      'objectness_score': 0.96,
                                      'crop_orientation': {'value': 0, 'confidence': None},
                                  },
                                  {
                                      'value': 'RECEIPT',
                                      'confidence': 0.9949972033500671,
                                      'geometry': ((0.1357421875, 0.0361328125), (0.51171875, 0.1630859375)),
                                      'objectness_score': 0.99,
                                      'crop_orientation': {'value': 0, 'confidence': None},
                                  },
                                  {
                                      'value': 'DATE',
                                      'confidence': 0.9578408598899841,
                                      'geometry': ((0.1396484375, 0.3232421875), (0.185546875, 0.3515625)),
                                      'objectness_score': 0.99,
                                      'crop_orientation': {'value': 0, 'confidence': None},
                                  }
                              ]
                          }
                      ],
                      'artefacts': []
                  }
              ]
          }
      ]
    }
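A nested dict like the one above can be walked with plain loops, for example to keep only confident words and convert their relative geometry to pixels. This is a sketch under assumptions: `words_in_pixels` is a hypothetical helper, `dimensions` is assumed to be (height, width), and each word geometry is assumed to be a straight box ((xmin, ymin), (xmax, ymax)); rotated outputs would need different handling. The sample dict is a shortened, made-up export with the same keys.

```python
from typing import Any, Dict, List, Tuple

PixelBox = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax) in pixels


def words_in_pixels(export: Dict[str, Any], min_confidence: float = 0.0) -> List[Tuple[str, PixelBox]]:
    """Collect (text, pixel box) pairs for every sufficiently confident word."""
    results = []
    for page in export["pages"]:
        height, width = page["dimensions"]  # assumed to be (height, width)
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    if word["confidence"] < min_confidence:
                        continue
                    (xmin, ymin), (xmax, ymax) = word["geometry"]
                    box = (round(xmin * width), round(ymin * height),
                           round(xmax * width), round(ymax * height))
                    results.append((word["value"], box))
    return results


# Shortened, made-up export following the structure shown above
sample_export = {
    "pages": [{
        "page_idx": 0,
        "dimensions": (340, 600),
        "blocks": [{
            "lines": [{
                "words": [
                    {"value": "No.", "confidence": 0.91, "geometry": ((0.5, 0.1), (0.6, 0.2))},
                    {"value": "DATE", "confidence": 0.40, "geometry": ((0.1, 0.3), (0.2, 0.4))},
                ],
            }],
            "artefacts": [],
        }],
    }],
}

confident_words = words_in_pixels(sample_export, min_confidence=0.5)
print(confident_words)
```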

To export the output as XML (hocr-format) you can use the export_as_xml method:

    xml_output = result.export_as_xml()
    for output in xml_output:
        xml_bytes_string = output[0]
        xml_element = output[1]

For reference, here is a sample XML byte string output:

The sample is an hOCR document titled "docTR - hOCR" whose recognized text reads "Hello XML World".

Advanced options

We provide a few advanced options to customize the behavior of the predictor to your needs:

Modifying the binarization threshold and the box threshold of the detection model is useful to detect (possibly fewer) text regions more accurately with a higher threshold, or to detect more text regions with a lower threshold.

    import numpy as np
    from doctr.models import ocr_predictor

    predictor = ocr_predictor('db_resnet50', 'crnn_vgg16_bn', pretrained=True)

    # Modify the binarization threshold and the box threshold
    predictor.det_predictor.model.postprocessor.bin_thresh = 0.5
    predictor.det_predictor.model.postprocessor.box_thresh = 0.2

    input_page = (255 * np.random.rand(800, 600, 3)).astype(np.uint8)
    out = predictor([input_page])
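The role of the two thresholds can be illustrated on a tiny probability map. This is a simplified stand-in, not docTR's post-processor: the functions `binarize`, `box_score` and `keep_boxes` are hypothetical, the candidate boxes are given by hand instead of being extracted from the binary map, and the use of `>=` comparisons is an assumption.

```python
from typing import List, Sequence, Tuple

PixelBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end coordinates exclusive


def binarize(prob_map: Sequence[Sequence[float]], bin_thresh: float) -> List[List[int]]:
    """Flag the pixels whose text probability reaches `bin_thresh`."""
    return [[1 if p >= bin_thresh else 0 for p in row] for row in prob_map]


def box_score(prob_map: Sequence[Sequence[float]], box: PixelBox) -> float:
    """Mean text probability inside a box."""
    x0, y0, x1, y1 = box
    values = [p for row in prob_map[y0:y1] for p in row[x0:x1]]
    return sum(values) / len(values)


def keep_boxes(prob_map: Sequence[Sequence[float]], boxes: Sequence[PixelBox], box_thresh: float) -> List[PixelBox]:
    """Discard the candidate boxes whose mean score is below `box_thresh`."""
    return [box for box in boxes if box_score(prob_map, box) >= box_thresh]


prob_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.6, 0.1],
    [0.0, 0.1, 0.4, 0.3],
]
candidates = [(1, 1, 3, 3), (2, 3, 4, 4)]

print(binarize(prob_map, bin_thresh=0.5))
print(keep_boxes(prob_map, candidates, box_thresh=0.5))  # stricter: fewer boxes
print(keep_boxes(prob_map, candidates, box_thresh=0.2))  # looser: more boxes
```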

If you deal with documents which contain only small rotations (~ -45 to 45 degrees), you can disable the page orientation classification to speed up the inference.

This will only have an effect with assume_straight_pages=False and/or straighten_pages=True and/or detect_orientation=True.

    from doctr.models import ocr_predictor

    model = ocr_predictor(pretrained=True, assume_straight_pages=False, disable_page_orientation=True)

If you deal with documents which contain only horizontal text, you can disable the crop orientation classification to speed up the inference.

This will only have an effect with assume_straight_pages=False and/or straighten_pages=True.

    from doctr.models import ocr_predictor

    model = ocr_predictor(pretrained=True, assume_straight_pages=False, disable_crop_orientation=True)

You can also add a hook to the ocr_predictor to manipulate the location predictions in the middle of the pipeline:

    from doctr.models import ocr_predictor

    class CustomHook:
        def __call__(self, loc_preds):
            # Manipulate the location predictions here
            # 1. The output structure needs to be the same as the input location predictions
            # 2. Be aware that the coordinates are relative and need to be between 0 and 1
            return loc_preds

    my_hook = CustomHook()

    predictor = ocr_predictor(pretrained=True)

    # Add a hook in the middle of the pipeline
    predictor.add_hook(my_hook)

    # You can also add multiple hooks which will be executed sequentially
    for hook in [my_hook, my_hook, my_hook]:
        predictor.add_hook(hook)
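To illustrate what sequential execution of hooks implies, here is a self-contained sketch that does not use docTR. Everything in it is hypothetical: `MiniPipeline` is a stand-in that only stores and applies hooks, and location predictions are assumed to be nested lists of relative [xmin, ymin, xmax, ymax] boxes, whereas the real predictor may use a different structure (for example arrays).

```python
from typing import Callable, List

Boxes = List[List[float]]  # each box: [xmin, ymin, xmax, ymax], relative coordinates


class ExpandHook:
    """Grow every box by a fixed relative margin on each side."""

    def __init__(self, margin: float) -> None:
        self.margin = margin

    def __call__(self, loc_preds: Boxes) -> Boxes:
        m = self.margin
        return [[x0 - m, y0 - m, x1 + m, y1 + m] for x0, y0, x1, y1 in loc_preds]


class ClipHook:
    """Clamp every coordinate to the [0, 1] range."""

    def __call__(self, loc_preds: Boxes) -> Boxes:
        return [[min(1.0, max(0.0, c)) for c in box] for box in loc_preds]


class MiniPipeline:
    """Hypothetical stand-in for a predictor: it only stores and applies hooks."""

    def __init__(self) -> None:
        self.hooks: List[Callable[[Boxes], Boxes]] = []

    def add_hook(self, hook: Callable[[Boxes], Boxes]) -> None:
        self.hooks.append(hook)

    def run_hooks(self, loc_preds: Boxes) -> Boxes:
        for hook in self.hooks:  # hooks run in the order they were added
            loc_preds = hook(loc_preds)
        return loc_preds


pipeline = MiniPipeline()
for hook in [ExpandHook(margin=0.05), ClipHook()]:
    pipeline.add_hook(hook)

result = pipeline.run_hooks([[0.02, 0.5, 0.99, 0.6]])
print(result)
```

Because hooks are chained, order matters: expanding first and clipping last keeps the final coordinates between 0 and 1, while the reverse order could produce boxes that leave the page.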