doctr.io - docTR documentation

The io module enables users to easily access content from documents and export analysis results to structured formats.

Document structure

Structural organization of the documents.

Word

A Word is an uninterrupted sequence of characters.

class doctr.io.Word(value: str, confidence: float, geometry: tuple[tuple[float, float], tuple[float, float]] | ndarray, objectness_score: float, crop_orientation: dict[str, Any])

Implements a word element
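
The `geometry` argument in the signature above is a pair of corner points. The following is a minimal sketch of turning such a box into pixel coordinates. It assumes that the corners are ordered `((xmin, ymin), (xmax, ymax))`, that they are expressed as fractions of the page size, and that page dimensions are given as `(height, width)`. The helper `to_pixels` is hypothetical and is not part of `doctr.io`.

```python
def to_pixels(geometry, dimensions):
    """Convert a relative ((xmin, ymin), (xmax, ymax)) box to integer pixel corners.

    Assumption: coordinates are fractions in [0, 1] and dimensions is (height, width).
    """
    (xmin, ymin), (xmax, ymax) = geometry
    height, width = dimensions
    return (
        (round(xmin * width), round(ymin * height)),
        (round(xmax * width), round(ymax * height)),
    )


# A box covering the middle half of the width on an 800 x 600 (H x W) page
box = to_pixels(((0.25, 0.5), (0.75, 0.625)), dimensions=(800, 600))
print(box)  # ((150, 400), (450, 500))
```

Keeping boxes relative until the last moment means the same geometry stays valid if the page image is resized.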

Line

A Line is a collection of Words aligned spatially and meant to be read together (on a two-column page, on the same horizontal, we will consider that there are two Lines).

class doctr.io.Line(words: list[Word], geometry: tuple[tuple[float, float], tuple[float, float]] | ndarray | None = None, objectness_score: float | None = None)

Implements a line element as a collection of words
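
To illustrate the two-column remark above, here is a rough sketch that splits words lying on the same horizontal into separate lines whenever the gap between neighbours is large. This is not docTR's own grouping logic: the function `split_into_lines`, the `(text, xmin, xmax)` tuples, and the `max_gap` threshold are all assumptions made for the example.

```python
def split_into_lines(words, max_gap=0.1):
    """Group words on one horizontal into lines, breaking at large horizontal gaps.

    words: list of (text, xmin, xmax) tuples with relative x coordinates (assumed).
    max_gap: hypothetical threshold above which two neighbours belong to different lines.
    """
    lines, current = [], []
    prev_xmax = None
    for text, xmin, xmax in sorted(words, key=lambda w: w[1]):
        # Start a new line when the gap to the previous word is too wide
        if current and xmin - prev_xmax > max_gap:
            lines.append(current)
            current = []
        current.append(text)
        prev_xmax = xmax
    if current:
        lines.append(current)
    return [" ".join(line) for line in lines]


row = [
    ("Hello", 0.05, 0.15),
    ("world", 0.17, 0.27),
    ("Second", 0.55, 0.68),
    ("column", 0.70, 0.82),
]
lines = split_into_lines(row)
print(lines)  # ['Hello world', 'Second column']
```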

Artefact

An Artefact is a non-textual element (e.g. QR code, picture, chart, signature, logo, etc.).

class doctr.io.Artefact(artefact_type: str, confidence: float, geometry: tuple[tuple[float, float], tuple[float, float]])

Implements a non-textual element

Block

A Block is a collection of Lines (e.g. an address written on several lines) and Artefacts (e.g. a graph with its title underneath).

class doctr.io.Block(lines: list[Line] = [], artefacts: list[Artefact] = [], geometry: tuple[tuple[float, float], tuple[float, float]] | ndarray | None = None, objectness_score: float | None = None)

Implements a block element as a collection of lines and artefacts

Page

A Page is a collection of Blocks that were on the same physical page.

class doctr.io.Page(page: ndarray, blocks: list[Block], page_idx: int, dimensions: tuple[int, int], orientation: dict[str, Any] | None = None, language: dict[str, Any] | None = None)

Implements a page element as a collection of blocks

show(interactive: bool = True, preserve_aspect_ratio: bool = False, **kwargs) → None

Overlay the result on a given image

Document

A Document is a collection of Pages.

class doctr.io.Document(pages: list[Page])

Implements a document element as a collection of pages

Parameters:

pages – list of page elements

show(**kwargs) → None

Overlay the result on a given image
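
The hierarchy described in this section (Document, Page, Block, Line, Word) can be walked with nested loops. The sketch below uses small dataclass stand-ins rather than the real `doctr.io` classes, so it runs without docTR installed. It assumes that each element exposes its children under the same name as the constructor argument (`pages`, `blocks`, `lines`, `words`); the helper `iter_words` is hypothetical.

```python
from dataclasses import dataclass, field


# Stand-ins that mirror only the fields needed here (not the real doctr.io classes)
@dataclass
class Word:
    value: str
    confidence: float


@dataclass
class Line:
    words: list


@dataclass
class Block:
    lines: list = field(default_factory=list)


@dataclass
class Page:
    blocks: list
    page_idx: int


@dataclass
class Document:
    pages: list


def iter_words(document):
    """Yield (page_idx, word) for every word, in reading order of the hierarchy."""
    for page in document.pages:
        for block in page.blocks:
            for line in block.lines:
                for word in line.words:
                    yield page.page_idx, word


doc = Document(pages=[
    Page(page_idx=0, blocks=[
        Block(lines=[Line(words=[Word("Hello", 0.99), Word("world", 0.42)])]),
    ]),
])

# Collect the words whose recognition confidence falls below a chosen threshold
low_confidence = [word.value for _, word in iter_words(doc) if word.confidence < 0.5]
print(low_confidence)  # ['world']
```

The stand-in `Block` uses `default_factory` so that instances created without arguments do not share one list.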

File reading

High-performance file reading and conversion to processable structured data.

doctr.io.read_pdf(file: str | Path | bytes, scale: int = 2, rgb_mode: bool = True, password: str | None = None, **kwargs: Any) → list[ndarray]

Read a PDF file and convert each page into an image in numpy format

from doctr.io import read_pdf
doc = read_pdf("path/to/your/doc.pdf")

Returns:

the list of pages decoded as numpy ndarray of shape H x W x C
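
The `scale` argument controls the size of the returned arrays. The sketch below estimates the expected `H x W x C` shape, assuming that a scale of 1 corresponds to 72 dpi (one pixel per PDF point) and that the renderer rounds sizes up. Both points are assumptions about the rendering backend, and `rendered_shape` is a hypothetical helper, not a `doctr.io` function.

```python
import math


def rendered_shape(width_pt, height_pt, scale=2, channels=3):
    """Estimate the (H, W, C) shape of a rendered page.

    width_pt, height_pt: page size in PDF points (1 point = 1/72 inch).
    Assumption: scale=1 renders one pixel per point, and sizes are rounded up.
    """
    return (math.ceil(height_pt * scale), math.ceil(width_pt * scale), channels)


# US Letter is 612 x 792 points; the default scale of 2 is then roughly 144 dpi
shape = rendered_shape(612, 792, scale=2)
print(shape)  # (1584, 1224, 3)
```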

doctr.io.read_img_as_numpy(file: str | Path | bytes, output_size: tuple[int, int] | None = None, rgb_output: bool = True) → ndarray

Read an image file into numpy format

from doctr.io import read_img_as_numpy
page = read_img_as_numpy("path/to/your/doc.jpg")

Returns:

the page decoded as numpy ndarray of shape H x W x 3

doctr.io.read_img_as_tensor(img_path: str | Path, dtype: dtype = torch.float32) → Tensor

Read an image file as a PyTorch tensor

Returns:

decoded image as a tensor

doctr.io.decode_img_as_tensor(img_content: bytes, dtype: dtype = torch.float32) → Tensor

Read a byte stream as a PyTorch tensor

Returns:

decoded image as a tensor

doctr.io.read_html(url: str, **kwargs: Any) → bytes

Read a web page and convert it into a PDF file, returned as a bytes stream

from doctr.io import read_html
doc = read_html("https://www.yoursite.com")

Returns:

decoded PDF file as a bytes stream
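
Since `read_html` returns bytes and `read_pdf` accepts bytes, the two can be chained. The following is a minimal sketch, assuming docTR and its HTML rendering dependency are installed. The helpers `is_pdf_bytes` and `url_to_pages` are hypothetical, and the `%PDF-` check is a generic sanity test on the PDF header rather than anything docTR performs.

```python
def is_pdf_bytes(data: bytes) -> bool:
    """Return True if the byte stream starts with the PDF header marker '%PDF-'."""
    return data[:5] == b"%PDF-"


def url_to_pages(url: str):
    """Render a web page to PDF bytes, then decode the PDF into page images."""
    # Imported lazily so that is_pdf_bytes can be used without docTR installed
    from doctr.io import read_html, read_pdf

    pdf_stream = read_html(url)
    if not is_pdf_bytes(pdf_stream):
        raise ValueError("read_html did not return a PDF byte stream")
    return read_pdf(pdf_stream)


print(is_pdf_bytes(b"%PDF-1.7 sample"))  # True
print(is_pdf_bytes(b"<html></html>"))    # False
```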

class doctr.io.DocumentFile

Read a document from multiple extensions

classmethod from_pdf(file: str | Path | bytes, **kwargs) → list[ndarray]

Read a PDF file

from doctr.io import DocumentFile
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")

Returns:

the list of pages decoded as numpy ndarray of shape H x W x 3

classmethod from_url(url: str, **kwargs) → list[ndarray]

Interpret a web page as a PDF document

from doctr.io import DocumentFile
doc = DocumentFile.from_url("https://www.yoursite.com")

Returns:

the list of pages decoded as numpy ndarray of shape H x W x 3

classmethod from_images(files: Sequence[str | Path | bytes] | str | Path | bytes, **kwargs) → list[ndarray]

Read an image file (or a collection of image files) and convert each one into an image in numpy format

from doctr.io import DocumentFile
pages = DocumentFile.from_images(["path/to/your/page1.png", "path/to/your/page2.png"])

Returns:

the list of pages decoded as numpy ndarray of shape H x W x 3
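
The three `DocumentFile` constructors above can be selected from the input itself. The sketch below is one possible way to do that, assuming docTR is installed when `load_document` is called. Both `pick_loader` and `load_document` are hypothetical helpers, and the set of image suffixes is an assumption rather than a list taken from docTR.

```python
from pathlib import Path

# Assumption: a few common image suffixes, not an official list of supported formats
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def pick_loader(source: str) -> str:
    """Return the name of the DocumentFile constructor suited to the source."""
    if source.startswith(("http://", "https://")):
        return "from_url"
    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return "from_pdf"
    if suffix in IMAGE_SUFFIXES:
        return "from_images"
    raise ValueError(f"Unsupported source: {source!r}")


def load_document(source: str):
    """Load a URL, PDF, or image as a list of page arrays."""
    # Imported lazily so that pick_loader can be used without docTR installed
    from doctr.io import DocumentFile

    return getattr(DocumentFile, pick_loader(source))(source)


print(pick_loader("path/to/your/doc.pdf"))      # from_pdf
print(pick_loader("https://www.yoursite.com"))  # from_url
print(pick_loader("path/to/your/page1.png"))    # from_images
```

All three constructors return a list of page arrays, so code that consumes the result does not need to know which one was used.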