📖 Usage - Pix2Text

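Automatic Model Download

The first time you use Pix2Text, it automatically downloads the open-source models it needs and stores them in ~/.pix2text (on Windows the default path is C:\Users\<username>\AppData\Roaming\pix2text). The models for CnOCR and CnSTD are stored in ~/.cnocr and ~/.cnstd respectively (on Windows the default paths are C:\Users\<username>\AppData\Roaming\cnocr and C:\Users\<username>\AppData\Roaming\cnstd). Please be patient during the download: if the default download site is not reachable from your network, the system automatically tries other available sites, so it may take a while. For a machine without a network connection, download the models on another machine first and then copy them into the corresponding directories.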
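To check whether the models are already in place (for example on an offline machine after copying them over), a small helper can compute the default directories described above. This is only a sketch: `default_model_dir` and `report_model_dirs` are hypothetical names, and reading the `APPDATA` environment variable is an assumption about how the Windows path is resolved, not Pix2Text's own logic.

```python
import os
import platform
from pathlib import Path


def default_model_dir(package: str, system: str = "") -> Path:
    """Return the default model directory described above for `package`.

    Hypothetical helper; `package` is one of 'pix2text', 'cnocr', 'cnstd'.
    """
    system = system or platform.system()
    if system == "Windows":
        # Assumption: APPDATA points at the user's AppData/Roaming folder.
        appdata = os.environ.get("APPDATA")
        base = Path(appdata) if appdata else Path.home() / "AppData" / "Roaming"
        return base / package
    return Path.home() / f".{package}"


def report_model_dirs() -> dict:
    """Map each package name to (directory, whether it exists on this machine)."""
    report = {}
    for package in ("pix2text", "cnocr", "cnstd"):
        model_dir = default_model_dir(package)
        report[package] = (model_dir, model_dir.is_dir())
    return report


if __name__ == "__main__":
    for package, (model_dir, exists) in report_model_dirs().items():
        print(f"{package}: {model_dir} ({'found' if exists else 'missing'})")
```

A "missing" entry only means the default directory does not exist yet; it says nothing about whether a particular model version inside it is complete.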

If the automatic download fails, download the model files manually; see huggingface.co/breezedeus (or its mirror for mainland China).

For details, see Model Download.

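Initialization

Pix2Text is the main recognition class. It provides several recognition functions for the content of different types of images and PDF files.

Pix2Text supports two initialization modes:

Method 1: Configuration-based initialization (recommended)

Initialize by passing a total_configs configuration dictionary. This is the most common approach: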

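```python
from pix2text import Pix2Text

# Use the default configuration
p2t = Pix2Text()

# Use a custom configuration
# (text_formula_config is a dict of OCR settings; a full definition appears in a later example)
total_config = {
    'layout': {'scores_thresh': 0.45},
    'text_formula': text_formula_config,
}
p2t = Pix2Text(total_configs=total_config, enable_table=True, device='cuda')
```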

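The full set of initialization parameters is:

```python
class Pix2Text(object):
    def __init__(
        self,
        *,
        total_configs: Optional[dict] = None,
        enable_formula: bool = True,
        enable_table: bool = True,
        device: Optional[str] = None,
        layout_parser: Optional[LayoutParser] = None,
        text_formula_ocr: Optional[TextFormulaOCR] = None,
        table_ocr: Optional[TableOCR] = None,
        **kwargs,
    ):
        ...
```

A note on how these parameters interact:

When total_configs is passed, configuration-based initialization is used; otherwise component mode is used (you pass in engine objects that have already been built).

An example with a fuller configuration: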

```python
import os
from pix2text import Pix2Text

text_formula_config = dict(
    languages=('en', 'ch_sim'),  # languages to recognize
    mfd=dict(  # initialization parameters for MFD
        model_path=os.path.expanduser(
            '~/.pix2text/1.1/mfd-onnx/mfd-v20240618.onnx'
        ),  # note: change this to where your model file is stored
    ),
    formula=dict(
        model_name='mfr-pro',
        model_backend='onnx',
        model_dir=os.path.expanduser(
            '~/.pix2text/1.1/mfr-pro-onnx'
        ),  # note: change this to where your model files are stored
    ),
    text=dict(
        rec_model_name='doc-densenet_lite_666-gru_large',
        rec_model_backend='onnx',
        rec_model_fp=os.path.expanduser(
            '~/.cnocr/2.3/doc-densenet_lite_666-gru_large/cnocr-v2.3-doc-densenet_lite_666-gru_large-epoch=005-ft-model.onnx'  # noqa
        ),  # note: change this to where your model file is stored
    ),
)
total_config = {
    'layout': {'scores_thresh': 0.45},
    'text_formula': text_formula_config,
}
p2t = Pix2Text(total_configs=total_config)
```
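Because the configuration above points at model files on disk, it can help to verify those paths before constructing Pix2Text. The following is a minimal sketch, not part of Pix2Text: `missing_model_paths` is a hypothetical helper, and the set of keys it treats as paths (`model_path`, `model_dir`, `rec_model_fp`) is simply taken from the example above.

```python
import os
from typing import Any, Dict, List

# Keys in the example config above whose values are filesystem paths.
PATH_KEYS = ("model_path", "model_dir", "rec_model_fp")


def missing_model_paths(config: Dict[str, Any], prefix: str = "") -> List[str]:
    """Return 'dotted.key -> path' entries for configured paths that do not exist."""
    missing = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections such as 'mfd' or 'formula'.
            missing.extend(missing_model_paths(value, dotted))
        elif key in PATH_KEYS and isinstance(value, str):
            if not os.path.exists(os.path.expanduser(value)):
                missing.append(f"{dotted} -> {value}")
    return missing


example_config = {
    "mfd": {"model_path": "~/.pix2text/1.1/mfd-onnx/mfd-v20240618.onnx"},
    "formula": {"model_name": "mfr-pro", "model_dir": "~/.pix2text/1.1/mfr-pro-onnx"},
}
for entry in missing_model_paths(example_config):
    print("missing:", entry)
```

An empty result means every configured path exists; it does not validate that the files are the right models.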

An example that uses a VLM API for text and formula recognition:

```python
import os
from pix2text import Pix2Text

model_name = os.getenv("GEMINI_MODEL")  # "gemini/gemini-2.0-flash-lite"
api_key = os.getenv("GEMINI_API_KEY")  # ""

total_config = {
    'layout': None,
    'text_formula': {
        "model_type": "VlmTextFormulaOCR",  # the class name to use
        "model_name": model_name,
        "api_key": api_key,
    },
    "table": {
        "model_type": "VlmTableOCR",  # the class name to use
        "model_name": model_name,
        "api_key": api_key,
    },
}
p2t = Pix2Text(total_configs=total_config)
```

For the values accepted by model_name and api_key, see the LiteLLM documentation.
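Since `os.getenv` silently returns `None` for an unset variable, a small wrapper can fail early with a clear message. This is a sketch under assumptions: `build_vlm_config` is a hypothetical helper, and it only reproduces the dictionary layout from the VLM example above.

```python
import os
from typing import Mapping, Optional


def build_vlm_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the VLM `total_configs` dict shown above from environment variables."""
    env = os.environ if env is None else env
    model_name = env.get("GEMINI_MODEL")
    api_key = env.get("GEMINI_API_KEY")
    missing = [
        name
        for name, value in (("GEMINI_MODEL", model_name), ("GEMINI_API_KEY", api_key))
        if not value
    ]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    shared = {"model_name": model_name, "api_key": api_key}
    return {
        "layout": None,
        "text_formula": {"model_type": "VlmTextFormulaOCR", **shared},
        "table": {"model_type": "VlmTableOCR", **shared},
    }
```

The result can then be passed as `Pix2Text(total_configs=build_vlm_config())`.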

Method 2: Using the from_config class method

You can also initialize through the from_config class method, which is functionally identical to Method 1:

```python
@classmethod
def from_config(
    cls,
    total_configs: Optional[dict] = None,
    enable_formula: bool = True,
    enable_table: bool = True,
    device: str = None,
    **kwargs,
):
    """
    Create a Pix2Text object from the configuration.
    Args:
        total_configs (dict): The total configuration; default value is `None`, which means to use the default configuration.
            If not None, it should contain the following keys:

                * `layout`: The layout parser configuration
                * `text_formula`: The TextFormulaOCR configuration
                * `table`: The table OCR configuration
        enable_formula (bool): Whether to enable formula recognition; default value is `True`
        enable_table (bool): Whether to enable table recognition; default value is `True`
        device (str): The device to run the model; optional values are 'cpu', 'gpu' or 'cuda';
            default value is `None`, which means to select the device automatically
        **kwargs (dict): Other arguments

    Returns: a Pix2Text object

    """
```

Usage example: p2t = Pix2Text.from_config(total_configs=total_config)
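The `device` argument accepts 'cpu', 'gpu' or 'cuda', with `None` meaning automatic selection. If you would rather choose explicitly, something like the sketch below could work. It assumes PyTorch is installed and is the right thing to probe for CUDA support; `pick_device` is a hypothetical helper, not part of Pix2Text.

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return 'cuda' when a CUDA GPU is usable via PyTorch, otherwise 'cpu'."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # optional dependency for this sketch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# Hypothetical usage (requires pix2text):
# p2t = Pix2Text.from_config(device=pick_device())
print(pick_device())
```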

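For more initialization examples, see tests/test_pix2text.py

Recognition Interfaces

Pix2Text provides separate recognition functions for different types of images and PDF files. They are described one by one below.

1. Function .recognize_pdf()

This function recognizes the content of an entire PDF file. The PDF may consist only of images with no text content, as in the example file examples/test-doc.pdf. You can specify which pages to recognize, and you can also assign a number to the PDF file. The function is defined as follows: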

```python
def recognize_pdf(
    self,
    pdf_fp: Union[str, Path],
    pdf_number: int = 0,
    pdf_id: Optional[str] = None,
    page_numbers: Optional[List[int]] = None,
    **kwargs,
) -> Document:
    """
    recognize a pdf file
    Args:
        pdf_fp (Union[str, Path]): pdf file path
        pdf_number (int): pdf number
        pdf_id (str): pdf id
        page_numbers (List[int]): page numbers to recognize; default is `None`, which means to recognize all pages
        kwargs (dict): Optional keyword arguments. The same as `recognize_page`

    Returns: a Document object. Use `doc.to_markdown('output-dir')` to get the markdown output of the recognized document.

    """
```

Function Description

Return value: a Document object. Use doc.to_markdown('output-dir') to get the Markdown output of the recognition result.

Example

```python
from pix2text import Pix2Text

pdf_fp = 'examples/test-doc.pdf'
p2t = Pix2Text.from_config()
doc = p2t.recognize_pdf(
    pdf_fp,
    page_numbers=[0, 1],
    table_as_image=True,
    save_debug_res='./output-debug',
)
doc.to_markdown('output-pdf-md')
```
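For long PDFs it can be convenient to process a few pages per call through `page_numbers`. The sketch below only generates the page lists. It assumes page numbers are zero-based (the example above passes `[0, 1]`, which suggests so) and that you already know the page count; `page_batches` is a hypothetical helper.

```python
from typing import Iterator, List


def page_batches(num_pages: int, batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive lists of 0-based page numbers, `batch_size` at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, num_pages, batch_size):
        yield list(range(start, min(start + batch_size, num_pages)))


# Hypothetical usage (requires pix2text and a real PDF):
# for i, pages in enumerate(page_batches(num_pages=10, batch_size=4)):
#     doc = p2t.recognize_pdf('examples/test-doc.pdf', page_numbers=pages)
#     doc.to_markdown(f'output-pdf-md/part-{i}')
print(list(page_batches(5, 2)))  # [[0, 1], [2, 3], [4]]
```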

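2. Function .recognize_page()

This function recognizes the content of a single page image with a complex layout. The image may contain multiple columns, figures, tables, and so on, as in the example image examples/page2.png. The function is defined as follows: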

```python
def recognize_page(
    self,
    img: Union[str, Path, Image.Image],
    page_number: int = 0,
    page_id: Optional[str] = None,
    **kwargs,
) -> Page:
    """
    Analyze the layout of the image, and then recognize the information contained in each section.

    Args:
        img (str or Image.Image): an image path, or `Image.Image` loaded by `Image.open()`
        page_number (int): page number; default value is `0`
        page_id (str): page id; default value is `None`, which means to use the `str(page_number)`
        kwargs ():
            * resized_shape (int): Resize the image width to this size for processing; default value is `768`
            * mfr_batch_size (int): batch size for MFR; When running on GPU, this value is suggested to be set to greater than 1; default value is `1`
            * embed_sep (tuple): Prefix and suffix for embedding latex; only effective when `return_text` is `True`; default value is `(' $', '$ ')`
            * isolated_sep (tuple): Prefix and suffix for isolated latex; only effective when `return_text` is `True`; default value is two-dollar signs
            * line_sep (str): The separator between lines of text; only effective when `return_text` is `True`; default value is a line break
            * auto_line_break (bool): Automatically line break the recognized text; only effective when `return_text` is `True`; default value is `True`
            * det_text_bbox_max_width_expand_ratio (float): Expand the width of the detected text bbox. This value represents the maximum expansion ratio above and below relative to the original bbox height; default value is `0.3`
            * det_text_bbox_max_height_expand_ratio (float): Expand the height of the detected text bbox. This value represents the maximum expansion ratio above and below relative to the original bbox height; default value is `0.2`
            * embed_ratio_threshold (float): The overlap threshold for embed formulas and text lines; default value is `0.6`.
                When the overlap between an embed formula and a text line is greater than or equal to this threshold,
                the embed formula and the text line are considered to be on the same line;
                otherwise, they are considered to be on different lines.
            * table_as_image (bool): If `True`, the table will be recognized as an image (don't parse the table content as text) ; default value is `False`
            * title_contain_formula (bool): If `True`, the title of the page will be recognized as a mixed image (text and formula). If `False`, it will be recognized as a text; default value is `False`
            * text_contain_formula (bool): If `True`, the text of the page will be recognized as a mixed image (text and formula). If `False`, it will be recognized as a text; default value is `True`
            * formula_rec_kwargs (dict): generation arguments passed to formula recognizer `latex_ocr`; default value is `{}`
            * save_debug_res (str): if `save_debug_res` is set, the directory to save the debug results; default value is `None`, which means not to save

    Returns: a Page object. Use `page.to_markdown('output-dir')` to get the markdown output of the recognized page.
    """
```

Function Description

Return value: a Page object. Use page.to_markdown('output-dir') to get the Markdown output of the recognition result.

Example

```python
from pix2text import Pix2Text

img_fp = 'examples/page2.png'
p2t = Pix2Text.from_config()
out_page = p2t.recognize_page(
    img_fp,
    title_contain_formula=False,
    text_contain_formula=False,
    save_debug_res='./output-debug',
)
out_page.to_markdown('output-page-md')
```
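When a document is available as a folder of page images, each image can be passed to `recognize_page` together with a `page_number`. The sketch below only works out the ordering: it sorts file names "naturally" so that `page10.png` comes after `page2.png`. `natural_key` and `numbered_pages` are hypothetical helpers, and the file-naming scheme is an assumption.

```python
import re
from typing import Iterable, List, Tuple


def natural_key(name: str) -> list:
    """Split digit runs from text so that 'page10' sorts after 'page2'."""
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", name)
    ]


def numbered_pages(paths: Iterable[str]) -> List[Tuple[int, str]]:
    """Pair each path with a 0-based page number, in natural file-name order."""
    return list(enumerate(sorted(paths, key=natural_key)))


# Hypothetical usage (requires pix2text):
# for number, path in numbered_pages(image_paths):
#     page = p2t.recognize_page(path, page_number=number)
print(numbered_pages(["page10.png", "page2.png", "page1.png"]))
```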

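3. Function .recognize_text_formula()

This function recognizes the content of an image that contains both text and formulas (such as a screenshot of a paragraph), as in the example image examples/mixed.jpg. The function is defined as follows: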

```python
def recognize_text_formula(
    self,
    img: Union[str, Path, Image.Image],
    return_text: bool = True,
    **kwargs,
) -> Union[str, List[str], List[Any], List[List[Any]]]:
    """
    Analyze the layout of the image, and then recognize the information contained in each section.

    Args:
        img (str or Image.Image): an image path, or `Image.Image` loaded by `Image.open()`
        return_text (bool): Whether to return the recognized text; default value is `True`
        kwargs ():
            * resized_shape (int): Resize the image width to this size for processing; default value is `768`
            * save_analysis_res (str): Save the mfd result image in this file; default is `None`, which means not to save
            * mfr_batch_size (int): batch size for MFR; When running on GPU, this value is suggested to be set to greater than 1; default value is `1`
            * embed_sep (tuple): Prefix and suffix for embedding latex; only effective when `return_text` is `True`; default value is `(' $', '$ ')`
            * isolated_sep (tuple): Prefix and suffix for isolated latex; only effective when `return_text` is `True`; default value is two-dollar signs
            * line_sep (str): The separator between lines of text; only effective when `return_text` is `True`; default value is a line break
            * auto_line_break (bool): Automatically line break the recognized text; only effective when `return_text` is `True`; default value is `True`
            * det_text_bbox_max_width_expand_ratio (float): Expand the width of the detected text bbox. This value represents the maximum expansion ratio above and below relative to the original bbox height; default value is `0.3`
            * det_text_bbox_max_height_expand_ratio (float): Expand the height of the detected text bbox. This value represents the maximum expansion ratio above and below relative to the original bbox height; default value is `0.2`
            * embed_ratio_threshold (float): The overlap threshold for embed formulas and text lines; default value is `0.6`.
                When the overlap between an embed formula and a text line is greater than or equal to this threshold,
                the embed formula and the text line are considered to be on the same line;
                otherwise, they are considered to be on different lines.
            * table_as_image (bool): If `True`, the table will be recognized as an image; default value is `False`
            * formula_rec_kwargs (dict): generation arguments passed to formula recognizer `latex_ocr`; default value is `{}`

    Returns: a str when `return_text` is `True`; or a list of ordered (top to bottom, left to right) dicts when `return_text` is `False`,
        with each dict representing one detected box, containing keys:

           * `type`: The category of the image; Optional: 'text', 'isolated', 'embedding'
           * `text`: The recognized text or Latex formula
           * `score`: The confidence score [0, 1]; the higher, the more confident
           * `position`: Position information of the block, `np.ndarray`, with shape of [4, 2]
           * `line_number`: The line number of the box (first line `line_number==0`), boxes with the same value indicate they are on the same line

    """
```

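Function Description

Return value: when return_text is True, returns a string; when return_text is False, returns an ordered (top to bottom, left to right) list of dicts, each representing one detected box and containing the following keys:
- type: the category of the box; one of 'text', 'isolated', 'embedding'
- text: the recognized text or LaTeX formula
- score: the confidence score in [0, 1]; the higher, the more confident
- position: the position of the block, an np.ndarray of shape [4, 2]
- line_number: the line number of the box (the first line has line_number==0); boxes with the same value are on the same line

Example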

```python
from pix2text import Pix2Text

img_fp = 'examples/mixed.jpg'
p2t = Pix2Text.from_config()
out = p2t.recognize_text_formula(
    img_fp,
    save_analysis_res='./output-debug',
)
```
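With `return_text=False` the result is a list of box dicts, and the `line_number` key makes it straightforward to rebuild lines yourself. The following is a simplified sketch on hand-written sample data, not the library's own merging logic: `boxes_to_text` and `left_edge` are hypothetical helpers, the `$...$` and `$$...$$` wrappers are chosen for illustration, and each point in `position` is assumed to be an (x, y) pair.

```python
from itertools import groupby
from typing import Any, Dict, List


def left_edge(box: Dict[str, Any]) -> float:
    """Smallest x coordinate among the four corner points in `position`."""
    return min(float(point[0]) for point in box["position"])


def boxes_to_text(boxes: List[Dict[str, Any]]) -> str:
    """Join boxes into text, producing one output line per `line_number`."""
    ordered = sorted(boxes, key=lambda b: (b["line_number"], left_edge(b)))
    lines = []
    for _, line_boxes in groupby(ordered, key=lambda b: b["line_number"]):
        parts = []
        for box in line_boxes:
            if box["type"] == "embedding":
                parts.append(f"${box['text']}$")
            elif box["type"] == "isolated":
                parts.append(f"$${box['text']}$$")
            else:
                parts.append(box["text"])
        lines.append(" ".join(parts))
    return "\n".join(lines)


# Hand-written sample in the documented shape (not real model output).
sample = [
    {"type": "text", "text": "Let", "score": 0.98, "line_number": 0,
     "position": [[0, 0], [30, 0], [30, 20], [0, 20]]},
    {"type": "embedding", "text": "x > 0", "score": 0.95, "line_number": 0,
     "position": [[35, 0], [80, 0], [80, 20], [35, 20]]},
    {"type": "isolated", "text": "E = mc^2", "score": 0.97, "line_number": 1,
     "position": [[10, 30], [90, 30], [90, 60], [10, 60]]},
]
print(boxes_to_text(sample))
```

The same functions accept real results where `position` is an `np.ndarray`, since they only iterate over its rows.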

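4. Function .recognize_formula()

This function recognizes the content of an image that contains only a formula, as in the example image examples/formula2.png. The function is defined as follows: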

```python
def recognize_formula(
    self,
    imgs: Union[str, Path, Image.Image, List[str], List[Path], List[Image.Image]],
    batch_size: int = 1,
    return_text: bool = True,
    rec_config: Optional[dict] = None,
    **kwargs,
) -> Union[str, List[str], Dict[str, Any], List[Dict[str, Any]]]:
    """
    Recognize pure Math Formula images to LaTeX Expressions
    Args:
        imgs (Union[str, Path, Image.Image, List[str], List[Path], List[Image.Image]]): The image or list of images
        batch_size (int): The batch size
        return_text (bool): Whether to return only the recognized text; default value is `True`
        rec_config (Optional[dict]): The config for recognition
        **kwargs (): Special model parameters. Not used for now

    Returns: The LaTeX Expression or list of LaTeX Expressions;
        str or List[str] when `return_text` is True;
        Dict[str, Any] or List[Dict[str, Any]] when `return_text` is False, with the following keys:

            * `text`: The recognized LaTeX text
            * `score`: The confidence score [0, 1]; the higher, the more confident

    """
```

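Function Description

Return value: when return_text is True, returns a string (or a list of strings when a list of images is passed); when return_text is False, returns a dict (or a list of dicts when a list of images is passed) with the following keys:
- text: the recognized LaTeX text
- score: the confidence score in [0, 1]; the higher, the more confident

Example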

```python
from pix2text import Pix2Text

img_fp = 'examples/formula2.png'
p2t = Pix2Text.from_config()
out = p2t.recognize_formula(
    img_fp,
    save_analysis_res='./output-debug',
)
```
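When `return_text=False`, each formula result carries a `score`, which can be used to discard low-confidence output. A minimal sketch on hand-written sample data follows; `confident_formulas` is a hypothetical helper and the `0.8` threshold is an arbitrary illustration, not a recommended value.

```python
from typing import Any, Dict, List, Union

FormulaResult = Dict[str, Any]


def confident_formulas(
    results: Union[FormulaResult, List[FormulaResult]],
    min_score: float = 0.8,
) -> List[str]:
    """Return the LaTeX strings whose confidence score is at least `min_score`."""
    if isinstance(results, dict):  # a single image yields a single dict
        results = [results]
    return [r["text"] for r in results if r["score"] >= min_score]


# Hand-written sample in the documented shape (not real model output).
sample = [
    {"text": r"\frac{a}{b}", "score": 0.96},
    {"text": r"\sum_{i} x_i", "score": 0.42},
]
print(confident_formulas(sample))
```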

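5. Function .recognize_text()

This function recognizes the content of an image that contains only text, as in the example image examples/general.jpg. The function is defined as follows: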

```python
def recognize_text(
    self,
    imgs: Union[str, Path, Image.Image, List[str], List[Path], List[Image.Image]],
    return_text: bool = True,
    rec_config: Optional[dict] = None,
    **kwargs,
) -> Union[str, List[str], List[Any], List[List[Any]]]:
    """
    Recognize a pure Text Image.
    Args:
        imgs (Union[str, Path, Image.Image, List[str], List[Path], List[Image.Image]]): The image or list of images
        return_text (bool): Whether to return only the recognized text; default value is `True`
        rec_config (Optional[dict]): The config for recognition
        kwargs (): Other parameters for `text_ocr.ocr()`

    Returns: Text str or list of text strs when `return_text` is True;
        `List[Any]` or `List[List[Any]]` when `return_text` is False, with the same length as `imgs` and the following keys:

            * `position`: Position information of the block, `np.ndarray`, with a shape of [4, 2]
            * `text`: The recognized text
            * `score`: The confidence score [0, 1]; the higher, the more confident

    """
```

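Function Description

Return value: when return_text is True, returns a string (or a list of strings when a list of images is passed); when return_text is False, returns an ordered (top to bottom, left to right) list of dicts, each representing one detected box (or one such list per image when a list of images is passed), containing the following keys:
- position: the position of the block, an np.ndarray of shape [4, 2]
- text: the recognized text
- score: the confidence score in [0, 1]; the higher, the more confident

Example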

```python
from pix2text import Pix2Text

img_fp = 'examples/general.jpg'
p2t = Pix2Text.from_config()
out = p2t.recognize_text(img_fp)
```
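The `position` value is a set of four corner points with shape [4, 2]. Downstream tools often want an axis-aligned rectangle instead, which can be derived as in the sketch below. It assumes each row is an (x, y) pair, which the documentation above does not state explicitly; `to_bbox` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def to_bbox(position: Sequence[Sequence[float]]) -> Tuple[float, float, float, float]:
    """Convert four (x, y) corner points to (xmin, ymin, xmax, ymax)."""
    xs = [float(point[0]) for point in position]
    ys = [float(point[1]) for point in position]
    return min(xs), min(ys), max(xs), max(ys)


# A slightly rotated box, written by hand for illustration.
print(to_bbox([[10, 5], [110, 8], [109, 40], [9, 37]]))  # (9.0, 5.0, 110.0, 40.0)
```

The function works for plain nested lists as well as for an `np.ndarray`, because it only indexes each row.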

6. Function .recognize()

If the interfaces above feel like too many to keep track of, this function calls the appropriate one for you based on the file type you specify.

```python
def recognize(
    self,
    img: Union[str, Path, Image.Image],
    file_type: Literal[
        'pdf', 'page', 'text_formula', 'formula', 'text'
    ] = 'text_formula',
    **kwargs,
) -> Union[Document, Page, str, List[str], List[Any], List[List[Any]]]:
    """
    Recognize the content of the image or pdf file according to the specified type.
    It will call the corresponding recognition function `.recognize_{file_type}()` according to the `file_type`.
    Args:
        img (Union[str, Path, Image.Image]): The image/pdf file path or `Image.Image` object
        file_type (str): Supported image types: 'pdf', 'page', 'text_formula', 'formula', 'text'
        **kwargs (dict): Arguments for the corresponding recognition function

    Returns: recognized results

    """
```

Function Description

Return value: depends on file_type. See the corresponding function above for details.

Example

```python
from pix2text import Pix2Text

img_fp = 'examples/general.jpg'
p2t = Pix2Text.from_config()
out = p2t.recognize(img_fp, file_type='text')  # equivalent to p2t.recognize_text(img_fp)
```
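The docstring says `recognize` calls `.recognize_{file_type}()`. One way such name-based dispatch could work is shown below with a stand-in class. This is an illustration of the pattern only, not Pix2Text's actual implementation; `FakeRecognizer` and `guess_file_type` are hypothetical, and picking `file_type` from the file extension is an assumption (an image extension cannot tell a page from a formula).

```python
from pathlib import Path

SUPPORTED_TYPES = ("pdf", "page", "text_formula", "formula", "text")


def guess_file_type(path: str, image_default: str = "text_formula") -> str:
    """Return 'pdf' for .pdf files, otherwise the caller-chosen image type."""
    if image_default not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type: {image_default}")
    return "pdf" if Path(path).suffix.lower() == ".pdf" else image_default


class FakeRecognizer:
    """Stand-in using the same method-naming scheme, only to show the dispatch."""

    def recognize_pdf(self, fp, **kwargs):
        return f"pdf:{fp}"

    def recognize_text(self, fp, **kwargs):
        return f"text:{fp}"

    def recognize(self, fp, file_type="text_formula", **kwargs):
        # Look up the method named after the file type and call it.
        return getattr(self, f"recognize_{file_type}")(fp, **kwargs)


recognizer = FakeRecognizer()
path = "examples/test-doc.pdf"
print(recognizer.recognize(path, file_type=guess_file_type(path)))
```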

For more usage examples, see tests/test_pix2text.py