📖 Usage - Pix2Text

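Automatic Model Download

The first time you use Pix2Text, the required open-source models are downloaded automatically and stored in ~/.pix2text (on Windows the default path is C:\Users\<username>\AppData\Roaming\pix2text). The CnOCR and CnSTD models are stored in ~/.cnocr and ~/.cnstd respectively (on Windows, C:\Users\<username>\AppData\Roaming\cnocr and C:\Users\<username>\AppData\Roaming\cnstd). Please be patient during the download: if the default download site is unreachable from your network, the system automatically falls back to other available sites, which can take considerably longer. For a machine without network access, download the models on another machine first and then copy them into the corresponding directories.

If the automatic download fails, download the model files manually; see huggingface.co/breezedeus (a mirror for mainland China is also available).

See the Model Download page for details.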
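As a minimal sketch of the directory layout described above, the helper below computes the default model directory for a package and platform. `default_model_dir` is a hypothetical helper written for illustration, not part of Pix2Text; it only mirrors the default paths listed in this section and ignores any override mechanism the libraries may offer.

```python
from pathlib import PurePosixPath, PureWindowsPath


def default_model_dir(package, platform, home):
    """Return the default model directory described in this section.

    `package` is 'pix2text', 'cnocr' or 'cnstd'; `platform` follows the
    `sys.platform` convention; `home` is the user's home directory.
    Hypothetical helper for illustration only.
    """
    if platform.startswith('win'):
        # Windows: <home>\AppData\Roaming\<package>
        return PureWindowsPath(home) / 'AppData' / 'Roaming' / package
    # Other platforms: <home>/.<package>
    return PurePosixPath(home) / f'.{package}'


for name in ('pix2text', 'cnocr', 'cnstd'):
    print(default_model_dir(name, 'linux', '/home/alice'))
    print(default_model_dir(name, 'win32', r'C:\Users\alice'))
```

When preparing an offline machine, these are the directories to copy from a machine that has already downloaded the models.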

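Initialization

Pix2Text is the main recognition class. It provides several recognition functions for the content of different types of images and PDF files.

Pix2Text supports two initialization modes:

Method 1: Configuration-based initialization (recommended)

Pass a total_configs dictionary to the constructor. This is the most common approach: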

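```python
from pix2text import Pix2Text

# Use the default configuration
p2t = Pix2Text()

# Use a custom configuration
# (text_formula_config is defined in the full configuration example later in this section)
total_config = {
    'layout': {'scores_thresh': 0.45},
    'text_formula': text_formula_config,
}
p2t = Pix2Text(total_configs=total_config, enable_table=True, device='cuda')
```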

The full set of initialization parameters is:

```python
class Pix2Text(object):
    def __init__(
        self,
        *,
        total_configs: Optional[dict] = None,
        enable_formula: bool = True,
        enable_table: bool = True,
        device: Optional[str] = None,
        layout_parser: Optional[LayoutParser] = None,
        text_formula_ocr: Optional[TextFormulaOCR] = None,
        table_ocr: Optional[TableOCR] = None,
        **kwargs,
    ):
        ...
```

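The parameters are:

- total_configs: the overall configuration. The default is None, which uses the default configuration. Otherwise it may contain the following keys:
  - layout: configuration of the layout analysis model;
  - text_formula: configuration of the text and formula recognition model;
  - table: configuration of the table recognition model.
- enable_formula: whether to enable formula recognition; default is True.
- enable_table: whether to enable table recognition; default is True (only effective with configuration-based initialization).
- device: the device to run the models on; valid values are 'cpu', 'gpu' or 'cuda'; default is None, which selects the device automatically (only effective with configuration-based initialization).
- layout_parser: a layout analysis model object; default is None, which uses the default layout analysis model (component mode only).
- text_formula_ocr: a text and formula recognition model object; default is None, which uses the default text and formula recognition model (component mode only).
- table_ocr: a table recognition model object; default is None, which disables table recognition (component mode only).
- **kwargs: other arguments.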

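When total_configs is passed, configuration-based initialization is used; otherwise component mode is used, in which pre-built engine objects are passed in directly.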
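Because total_configs is a plain dictionary, a misspelled key is easy to miss. The sketch below checks a dictionary against the three documented keys before it is handed to Pix2Text. `check_total_configs` is a hypothetical local helper; whether Pix2Text itself rejects or ignores unknown keys is not stated here, so this check is an assumption about what is useful, not a description of library behaviour.

```python
ALLOWED_KEYS = {'layout', 'text_formula', 'table'}


def check_total_configs(total_configs):
    """Return `total_configs` unchanged, or raise if it has undocumented keys.

    `None` is accepted because it means "use the default configuration".
    This is a local sanity check, not Pix2Text behaviour.
    """
    if total_configs is None:
        return None
    unknown = set(total_configs) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f'unexpected keys in total_configs: {sorted(unknown)}')
    return total_configs


print(check_total_configs({'layout': {'scores_thresh': 0.45}}))
```

The checked dictionary can then be passed as `Pix2Text(total_configs=...)`.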

An example with explicit configuration:

```python
import os
from pix2text import Pix2Text

text_formula_config = dict(
    languages=('en', 'ch_sim'),  # languages to recognize
    mfd=dict(  # initialization parameters for MFD
        model_path=os.path.expanduser(
            '~/.pix2text/1.1/mfd-onnx/mfd-v20240618.onnx'
        ),  # note: change this to where your model file is stored
    ),
    formula=dict(
        model_name='mfr-pro',
        model_backend='onnx',
        model_dir=os.path.expanduser(
            '~/.pix2text/1.1/mfr-pro-onnx'
        ),  # note: change this to where your model files are stored
    ),
    text=dict(
        rec_model_name='doc-densenet_lite_666-gru_large',
        rec_model_backend='onnx',
        rec_model_fp=os.path.expanduser(
            '~/.cnocr/2.3/doc-densenet_lite_666-gru_large/cnocr-v2.3-doc-densenet_lite_666-gru_large-epoch=005-ft-model.onnx'  # noqa
        ),  # note: change this to where your model file is stored
    ),
)
total_config = {
    'layout': {'scores_thresh': 0.45},
    'text_formula': text_formula_config,
}
p2t = Pix2Text(total_configs=total_config)
```

An example that uses a VLM API for text and formula recognition:

```python
import os
from pix2text import Pix2Text

model_name = os.getenv("GEMINI_MODEL")  # "gemini/gemini-2.0-flash-lite"
api_key = os.getenv("GEMINI_API_KEY")  # ""

total_config = {
    'layout': None,
    'text_formula': {
        "model_type": "VlmTextFormulaOCR",  # class name
        "model_name": model_name,
        "api_key": api_key,
    },
    "table": {
        "model_type": "VlmTableOCR",  # class name
        "model_name": model_name,
        "api_key": api_key,
    },
}
p2t = Pix2Text(total_configs=total_config)
```

For the values to use for model_name and api_key, see the LiteLLM documentation.
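`os.getenv()` returns None when a variable is unset, so a missing key would only surface later as a failed API call. A minimal sketch of building the same VLM configuration with an early check is shown below. `build_vlm_configs` is a hypothetical helper; the environment variable names and the `model_type` values are taken from the example above.

```python
import os


def build_vlm_configs(env=os.environ):
    """Build the VLM `total_configs` dict from environment variables.

    Raises RuntimeError early if a required variable is missing or empty.
    Hypothetical helper for illustration only.
    """
    required = ('GEMINI_MODEL', 'GEMINI_API_KEY')
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f'missing environment variables: {missing}')
    shared = {'model_name': env['GEMINI_MODEL'], 'api_key': env['GEMINI_API_KEY']}
    return {
        'layout': None,
        'text_formula': {'model_type': 'VlmTextFormulaOCR', **shared},
        'table': {'model_type': 'VlmTableOCR', **shared},
    }


# Demonstration with placeholder values instead of real credentials.
demo_env = {'GEMINI_MODEL': 'gemini/gemini-2.0-flash-lite', 'GEMINI_API_KEY': 'placeholder'}
print(build_vlm_configs(demo_env)['text_formula']['model_type'])
```

The returned dictionary has the same shape as `total_config` in the example above and can be passed as `total_configs`.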

Method 2: Using the `from_config` class method

You can also initialize through the from_config class method, which is functionally identical to Method 1:

```python
@classmethod
def from_config(
    cls,
    total_configs: Optional[dict] = None,
    enable_formula: bool = True,
    enable_table: bool = True,
    device: str = None,
    **kwargs,
):
    """
    Create a Pix2Text object from the configuration.
    Args:
        total_configs (dict): The total configuration; default value is None, which means to use the default configuration.
            If not None, it should contain the following keys:

                * `layout`: The layout parser configuration
                * `text_formula`: The TextFormulaOCR configuration
                * `table`: The table OCR configuration
        enable_formula (bool): Whether to enable formula recognition; default value is `True`
        enable_table (bool): Whether to enable table recognition; default value is `True`
        device (str): The device to run the model; optional values are 'cpu', 'gpu' or 'cuda';
            default value is `None`, which means to select the device automatically
        **kwargs (dict): Other arguments

    Returns: a Pix2Text object

    """
```

Usage example: `p2t = Pix2Text.from_config(total_configs=total_config)`.

For more initialization examples, see tests/test_pix2text.py.

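Recognition Interfaces

The Pix2Text class provides separate recognition functions for different types of images and PDF files. Each is described below.

1. Function `.recognize_pdf()`

This function recognizes the content of an entire PDF file. The PDF may consist of images only, with no text layer, as in the sample file examples/test-doc.pdf. You can specify which pages to recognize and assign a number to the PDF file. The function is defined as follows: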

```python
def recognize_pdf(
    self,
    pdf_fp: Union[str, Path],
    pdf_number: int = 0,
    pdf_id: Optional[str] = None,
    page_numbers: Optional[List[int]] = None,
    **kwargs,
) -> Document:
    """
    recognize a pdf file
    Args:
        pdf_fp (Union[str, Path]): pdf file path
        pdf_number (int): pdf number
        pdf_id (str): pdf id
        page_numbers (List[int]): page numbers to recognize; default is None, which means to recognize all pages
        kwargs (dict): Optional keyword arguments. The same as recognize_page

    Returns: a Document object. Use `doc.to_markdown('output-dir')` to get the markdown output of the recognized document.

    """
```

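Function description:

- Input parameter pdf_fp: path of the PDF file;
- Input parameter pdf_number: number of the PDF file; default is 0;
- Input parameter pdf_id: ID of the PDF file; default is None;
- Input parameter page_numbers: list of pages to recognize. Pages are counted from 0, so [0, 1] recognizes only the first and second pages. Default is None, which recognizes all pages;
- Input parameter **kwargs: other arguments; see the function recognize_page() below.

Return value: a Document object. Use doc.to_markdown('output-dir') to get the Markdown output of the recognized document.

Example: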

```python
from pix2text import Pix2Text

img_fp = 'examples/test-doc.pdf'
p2t = Pix2Text.from_config()
out_md = p2t.recognize_pdf(
    img_fp,
    page_numbers=[0, 1],
    table_as_image=True,
    save_debug_res='./output-debug',
)
out_md.to_markdown('output-pdf-md')
```
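Since `page_numbers` counts from 0 while PDF viewers usually count from 1, an off-by-one mistake is easy to make. The sketch below converts a human-style page specification into the zero-based list that `page_numbers` expects. `to_page_numbers` and its `'1-3,5'` syntax are hypothetical conveniences for illustration, not part of Pix2Text.

```python
def to_page_numbers(spec):
    """Convert a 1-based page spec such as '1-3,5' to 0-based page numbers."""
    pages = []
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition('-')
        start = int(first)
        end = int(last) if last else start
        if start < 1 or end < start:
            raise ValueError(f'invalid page range: {part!r}')
        # Shift to 0-based; the range end stays inclusive of `end`.
        pages.extend(range(start - 1, end))
    return pages


print(to_page_numbers('1-2'))  # [0, 1]
```

The result can be passed directly, for example `p2t.recognize_pdf(img_fp, page_numbers=to_page_numbers('1-2'))`.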

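2. Function `.recognize_page()`

This function recognizes the content of a page image with a complex layout. The image may contain multiple columns, figures, tables and so on, as in the sample image examples/page2.png. The function is defined as follows: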

```python
def recognize_page(
    self,
    img: Union[str, Path, Image.Image],
    page_number: int = 0,
    page_id: Optional[str] = None,
    **kwargs,
) -> Page:
    """
    Analyze the layout of the image, and then recognize the information contained in each section.

    Args:
        img (str or Image.Image): an image path, or `Image.Image` loaded by `Image.open()`
        page_number (int): page number; default value is `0`
        page_id (str): page id; default value is `None`, which means to use the `str(page_number)`
        kwargs ():
            * resized_shape (int): Resize the image width to this size for processing; default value is `768`
            * mfr_batch_size (int): batch size for MFR; When running on GPU, this value is suggested to be set to greater than 1; default value is `1`
            * embed_sep (tuple): Prefix and suffix for embedding latex; only effective when `return_text` is `True`; default value is `(' $', '$ ')`
            * isolated_sep (tuple): Prefix and suffix for isolated latex; only effective when `return_text` is `True`; default value is two-dollar signs
            * line_sep (str): The separator between lines of text; only effective when `return_text` is `True`; default value is a line break
            * auto_line_break (bool): Automatically line break the recognized text; only effective when `return_text` is `True`; default value is `True`
            * det_text_bbox_max_width_expand_ratio (float): Expand the width of the detected text bbox. This value represents the maximum expansion ratio above and below relative to the original bbox height; default value is `0.3`
            * det_text_bbox_max_height_expand_ratio (float): Expand the height of the detected text bbox. This value represents the maximum expansion ratio above and below relative to the original bbox height; default value is `0.2`
            * embed_ratio_threshold (float): The overlap threshold for embed formulas and text lines; default value is `0.6`.
                When the overlap between an embed formula and a text line is greater than or equal to this threshold,
                the embed formula and the text line are considered to be on the same line;
                otherwise, they are considered to be on different lines.
            * table_as_image (bool): If `True`, the table will be recognized as an image (don't parse the table content as text) ; default value is `False`
            * title_contain_formula (bool): If `True`, the title of the page will be recognized as a mixed image (text and formula). If `False`, it will be recognized as a text; default value is `False`
            * text_contain_formula (bool): If `True`, the text of the page will be recognized as a mixed image (text and formula). If `False`, it will be recognized as a text; default value is `True`
            * formula_rec_kwargs (dict): generation arguments passed to formula recognizer `latex_ocr`; default value is `{}`
            * save_debug_res (str): if `save_debug_res` is set, the directory to save the debug results; default value is `None`, which means not to save

    Returns: a Page object. Use `page.to_markdown('output-dir')` to get the markdown output of the recognized page.
    """
```

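Function description:

- Input parameter img: an image path or an Image.Image object;
- Input parameter page_number: page number; default is 0;
- Input parameter page_id: page ID; default is None, in which case str(page_number) is used;
- kwargs: other arguments, as follows:
  - resized_shape: resize the image width to this value for processing; default is 768;
  - mfr_batch_size: batch size used for MFR prediction; a value greater than 1 is recommended when running on a GPU; default is 1;
  - embed_sep: prefix and suffix for embedded LaTeX; only effective when return_text is True; default is (' $', '$ ');
  - isolated_sep: prefix and suffix for isolated LaTeX; only effective when return_text is True; default is two dollar signs;
  - line_sep: separator between lines of text; only effective when return_text is True; default is a line break;
  - auto_line_break: whether to automatically break lines in the recognized text; only effective when return_text is True; default is True;
  - det_text_bbox_max_width_expand_ratio: expands the width of detected text boxes; the value is the maximum expansion ratio relative to the original box height; default is 0.3;
  - det_text_bbox_max_height_expand_ratio: expands the height of detected text boxes; the value is the maximum expansion ratio relative to the original box height; default is 0.2;
  - embed_ratio_threshold: overlap threshold between embedded formulas and text lines; default is 0.6. When the overlap between an embedded formula and a text line is greater than or equal to this threshold, they are considered to be on the same line; otherwise they are considered to be on different lines;
  - table_as_image: if True, tables are recognized as images (the table content is not parsed as text); default is False;
  - title_contain_formula: if True, page titles are recognized as mixed images (text and formulas). If False, they are recognized as text images (formulas are not recognized); default is False;
  - text_contain_formula: if True, page text is recognized as mixed images (text and formulas). If False, it is recognized as text (formulas are not recognized); default is True;
  - formula_rec_kwargs: generation arguments passed to the formula recognizer latex_ocr; default is {};
  - save_debug_res: if set, intermediate parsing results are saved to this directory for debugging; default is None, which saves nothing.

Return value: a Page object. Use page.to_markdown('output-dir') to get the Markdown output of the recognized page.

Example: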

```python
from pix2text import Pix2Text

img_fp = 'examples/page2.png'
p2t = Pix2Text.from_config()
out_page = p2t.recognize_page(
    img_fp,
    title_contain_formula=False,
    text_contain_formula=False,
    save_debug_res='./output-debug',
)
out_page.to_markdown('output-page-md')
```
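When several page images need the same recognition options, the keyword arguments can be collected once and reused. The following is a sketch under stated assumptions: `pages_to_markdown` and the `page-<n>` output directory naming are hypothetical, and only `recognize_page()` and `to_markdown()` as described above are called on the objects passed in.

```python
from pathlib import Path


def pages_to_markdown(p2t, image_paths, out_root, **kwargs):
    """Recognize each page image and write its Markdown to its own directory.

    `p2t` is expected to be a Pix2Text instance. `kwargs` are forwarded
    unchanged to `recognize_page()` for every image.
    """
    out_dirs = []
    for page_number, img_fp in enumerate(image_paths):
        page = p2t.recognize_page(img_fp, page_number=page_number, **kwargs)
        out_dir = Path(out_root) / f'page-{page_number}'  # hypothetical naming
        page.to_markdown(str(out_dir))
        out_dirs.append(out_dir)
    return out_dirs


# Intended usage (requires pix2text and its models to be installed):
#   p2t = Pix2Text.from_config()
#   pages_to_markdown(p2t, ['examples/page2.png'], 'output-pages', table_as_image=True)
```

Passing `page_number` explicitly keeps the default page ID, `str(page_number)`, distinct for each image.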

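3. Function `.recognize_text_formula()`

This function recognizes the content of an image that contains both text and formulas, such as a screenshot of a paragraph. See the sample image examples/mixed.jpg. The function is defined as follows: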

```python
def recognize_text_formula(
    self,
    img: Union[str, Path, Image.Image],
    return_text: bool = True,
    **kwargs,
) -> Union[str, List[str], List[Any], List[List[Any]]]:
    """
    Analyze the layout of the image, and then recognize the information contained in each section.

    Args:
        img (str or Image.Image): an image path, or `Image.Image` loaded by `Image.open()`
        return_text (bool): Whether to return the recognized text; default value is `True`
        kwargs ():
            * resized_shape (int): Resize the image width to this size for processing; default value is `768`
            * save_analysis_res (str): Save the mfd result image in this file; default is `None`, which means not to save
            * mfr_batch_size (int): batch size for MFR; When running on GPU, this value is suggested to be set to greater than 1; default value is `1`
            * embed_sep (tuple): Prefix and suffix for embedding latex; only effective when `return_text` is `True`; default value is `(' $', '$ ')`
            * isolated_sep (tuple): Prefix and suffix for isolated latex; only effective when `return_text` is `True`; default value is two-dollar signs
            * line_sep (str): The separator between lines of text; only effective when `return_text` is `True`; default value is a line break
            * auto_line_break (bool): Automatically line break the recognized text; only effective when `return_text` is `True`; default value is `True`
            * det_text_bbox_max_width_expand_ratio (float): Expand the width of the detected text bbox. This value represents the maximum expansion ratio above and below relative to the original bbox height; default value is `0.3`
            * det_text_bbox_max_height_expand_ratio (float): Expand the height of the detected text bbox. This value represents the maximum expansion ratio above and below relative to the original bbox height; default value is `0.2`
            * embed_ratio_threshold (float): The overlap threshold for embed formulas and text lines; default value is `0.6`.
                When the overlap between an embed formula and a text line is greater than or equal to this threshold,
                the embed formula and the text line are considered to be on the same line;
                otherwise, they are considered to be on different lines.
            * table_as_image (bool): If `True`, the table will be recognized as an image; default value is `False`
            * formula_rec_kwargs (dict): generation arguments passed to formula recognizer `latex_ocr`; default value is `{}`

    Returns: a str when `return_text` is `True`; or a list of ordered (top to bottom, left to right) dicts when `return_text` is `False`,
        with each dict representing one detected box, containing keys:

           * `type`: The category of the image; Optional: 'text', 'isolated', 'embedding'
           * `text`: The recognized text or Latex formula
           * `score`: The confidence score [0, 1]; the higher, the more confident
           * `position`: Position information of the block, `np.ndarray`, with shape of [4, 2]
           * `line_number`: The line number of the box (first line `line_number==0`), boxes with the same value indicate they are on the same line

    """
```

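Function description:

- Input parameter img: an image path or an Image.Image object;
- Input parameter return_text: whether to return plain text; when False, a list with structured information is returned; default is True;
- Input parameter kwargs: other arguments, as follows:
  - resized_shape: resize the image width to this value for processing; default is 768;
  - save_analysis_res: file name in which to save the image of the MFD analysis result; default is None, which saves nothing;
  - mfr_batch_size: batch size used for MFR prediction; a value greater than 1 is recommended when running on a GPU; default is 1;
  - embed_sep: prefix and suffix for embedded LaTeX; only effective when return_text is True; default is (' $', '$ ');
  - isolated_sep: prefix and suffix for isolated LaTeX; only effective when return_text is True; default is two dollar signs;
  - line_sep: separator between lines of text; only effective when return_text is True; default is a line break;
  - auto_line_break: whether to automatically break lines in the recognized text; only effective when return_text is True; default is True;
  - det_text_bbox_max_width_expand_ratio: expands the width of detected text boxes; the value is the maximum expansion ratio relative to the original box height; default is 0.3;
  - det_text_bbox_max_height_expand_ratio: expands the height of detected text boxes; the value is the maximum expansion ratio relative to the original box height; default is 0.2;
  - embed_ratio_threshold: overlap threshold between embedded formulas and text lines; default is 0.6. When the overlap between an embedded formula and a text line is greater than or equal to this threshold, they are considered to be on the same line; otherwise they are considered to be on different lines;
  - table_as_image: if True, tables are recognized as images; default is False;
  - formula_rec_kwargs: generation arguments passed to the formula recognizer latex_ocr; default is {}.

Return value: a string when return_text is True. When return_text is False, an ordered (top to bottom, left to right) list of dicts, each representing one detected box, with the following keys:

- type: category of the box; one of 'text', 'isolated', 'embedding';
- text: the recognized text or LaTeX formula;
- score: confidence score in [0, 1]; higher means more confident;
- position: position of the block, an np.ndarray of shape [4, 2];
- line_number: line number of the box (the first line has line_number==0); boxes with the same value are on the same line.

Example: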

```python
from pix2text import Pix2Text

img_fp = 'examples/mixed.jpg'
p2t = Pix2Text.from_config()
out = p2t.recognize_text_formula(
    img_fp,
    save_analysis_res='./output-debug',
)
```
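With `return_text=False` the caller receives one dict per detected box, and `line_number` indicates which boxes share a line. The sketch below shows one way to rebuild text lines from such a list. It is not the library's own merging logic; `boxes_to_lines` is a hypothetical helper, the sample data is made up, and the `(' $', '$ ')` separators are an assumed default. It relies on the boxes already being ordered left to right within a line, as the return value description states.

```python
from collections import defaultdict


def boxes_to_lines(boxes, embed_sep=(' $', '$ ')):
    """Rebuild text lines from the structured output (`return_text=False`).

    Boxes sharing a `line_number` are concatenated in the order given;
    embedded formulas are wrapped in `embed_sep`.
    """
    lines = defaultdict(list)
    for box in boxes:
        text = box['text']
        if box['type'] == 'embedding':
            text = embed_sep[0] + text + embed_sep[1]
        lines[box['line_number']].append(text)
    return [''.join(parts).strip() for _, parts in sorted(lines.items())]


# Hypothetical sample shaped like the documented return value
# (`score` and `position` omitted for brevity).
sample = [
    {'type': 'text', 'text': 'The area is', 'line_number': 0},
    {'type': 'embedding', 'text': r'\pi r^2', 'line_number': 0},
    {'type': 'isolated', 'text': r'E = mc^2', 'line_number': 1},
]
print(boxes_to_lines(sample))
```

Working from the structured output like this makes it possible to apply custom separators or to skip low-confidence boxes before joining.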

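4. Function `.recognize_formula()`

This function recognizes the content of an image that contains only a formula, such as the sample image examples/formula2.png. The function is defined as follows: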

```python
def recognize_formula(
    self,
    imgs: Union[str, Path, Image.Image, List[str], List[Path], List[Image.Image]],
    batch_size: int = 1,
    return_text: bool = True,
    rec_config: Optional[dict] = None,
    **kwargs,
) -> Union[str, List[str], Dict[str, Any], List[Dict[str, Any]]]:
    """
    Recognize pure Math Formula images to LaTeX Expressions
    Args:
        imgs (Union[str, Path, Image.Image, List[str], List[Path], List[Image.Image]]): The image or list of images
        batch_size (int): The batch size
        return_text (bool): Whether to return only the recognized text; default value is True
        rec_config (Optional[dict]): The config for recognition
        **kwargs (): Special model parameters. Not used for now

    Returns: The LaTeX Expression or list of LaTeX Expressions;
        str or List[str] when `return_text` is True;
        Dict[str, Any] or List[Dict[str, Any]] when `return_text` is False, with the following keys:

            * `text`: The recognized LaTeX text
            * `score`: The confidence score [0, 1]; the higher, the more confident

    """
```

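Function description:

- Input parameter imgs: an image path or Image.Image object, or a list of image paths or Image.Image objects;
- Input parameter batch_size: batch size; default is 1;
- Input parameter return_text: whether to return plain text; when False, structured results are returned; default is True;
- Input parameter rec_config: recognition configuration; optional;
- Input parameter kwargs: other arguments; currently unused.

Return value: when return_text is True, a LaTeX string (or a list of strings when a list of images is passed). When return_text is False, a dict (or a list of dicts) with the following keys:

- text: the recognized LaTeX text;
- score: confidence score in [0, 1]; higher means more confident.

Example: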

```python
from pix2text import Pix2Text

img_fp = 'examples/formula2.png'
p2t = Pix2Text.from_config()
out = p2t.recognize_formula(
    img_fp,
    save_analysis_res='./output-debug',
)
```
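The `score` field returned with `return_text=False` can be used to discard uncertain results. A minimal sketch follows; `confident_formulas` is a hypothetical helper, the `0.5` threshold is an arbitrary illustration rather than a recommended value, and the sample data is made up.

```python
def confident_formulas(results, min_score=0.5):
    """Return the LaTeX strings whose score is at least `min_score`.

    Accepts a single result dict or a list of dicts, matching the two
    shapes described for `return_text=False`.
    """
    if isinstance(results, dict):
        results = [results]
    return [r['text'] for r in results if r['score'] >= min_score]


# Hypothetical results shaped like the documented return value.
sample = [
    {'text': r'\frac{a}{b}', 'score': 0.98},
    {'text': r'x_', 'score': 0.21},
]
print(confident_formulas(sample))
```

With a real Pix2Text instance, the input would come from `p2t.recognize_formula(imgs, return_text=False)`.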

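5. Function `.recognize_text()`

This function recognizes the content of an image that contains only text, such as the sample image examples/general.jpg. The function is defined as follows: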

```python
def recognize_text(
    self,
    imgs: Union[str, Path, Image.Image, List[str], List[Path], List[Image.Image]],
    return_text: bool = True,
    rec_config: Optional[dict] = None,
    **kwargs,
) -> Union[str, List[str], List[Any], List[List[Any]]]:
    """
    Recognize a pure Text Image.
    Args:
        imgs (Union[str, Path, Image.Image, List[str], List[Path], List[Image.Image]]): The image or list of images
        return_text (bool): Whether to return only the recognized text; default value is True
        rec_config (Optional[dict]): The config for recognition
        kwargs (): Other parameters for text_ocr.ocr()

    Returns: Text str or list of text strs when `return_text` is True;
        `List[Any]` or `List[List[Any]]` when `return_text` is False, with the same length as `imgs` and the following keys:

            * `position`: Position information of the block, `np.ndarray`, with a shape of [4, 2]
            * `text`: The recognized text
            * `score`: The confidence score [0, 1]; the higher, the more confident

    """
```

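Function description:

- Input parameter imgs: an image path or Image.Image object, or a list of image paths or Image.Image objects;
- Input parameter return_text: whether to return plain text; when False, a list with structured information is returned; default is True;
- Input parameter rec_config: recognition configuration; optional;
- Input parameter kwargs: other arguments; see the function text_ocr.ocr().

Return value: when return_text is True, a string (or a list of strings when a list of images is passed). When return_text is False, an ordered (top to bottom, left to right) list of dicts, one per detected box (or a list of such lists when a list of images is passed), with the following keys:

- position: position of the block, an np.ndarray of shape [4, 2];
- text: the recognized text;
- score: confidence score in [0, 1]; higher means more confident.

Example: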

```python
from pix2text import Pix2Text

img_fp = 'examples/general.jpg'
p2t = Pix2Text.from_config()
out = p2t.recognize_text(img_fp)
```
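The `position` value in the structured output is a `[4, 2]` array of corner points. For cropping or drawing, an axis-aligned bounding box is often more convenient. The sketch below derives one; `to_bbox` is a hypothetical helper, and the sample uses plain nested lists in place of an `np.ndarray`, which the function handles the same way because it only iterates over the rows.

```python
def to_bbox(position):
    """Convert a [4, 2] array of (x, y) corner points to (xmin, ymin, xmax, ymax)."""
    xs = [float(point[0]) for point in position]
    ys = [float(point[1]) for point in position]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical box shaped like one entry of the documented return value.
sample = {
    'position': [[10, 20], [110, 22], [109, 48], [9, 46]],
    'text': 'hello',
    'score': 0.93,
}
print(to_bbox(sample['position']))
```

The four corners need not form a perfect rectangle, so the bounding box may be slightly larger than the detected quadrilateral.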

6. Function `.recognize()`

If the interfaces above feel like too many to keep track of, this function calls the appropriate one for you based on the file type you specify.

```python
def recognize(
    self,
    img: Union[str, Path, Image.Image],
    file_type: Literal[
        'pdf', 'page', 'text_formula', 'formula', 'text'
    ] = 'text_formula',
    **kwargs,
) -> Union[Document, Page, str, List[str], List[Any], List[List[Any]]]:
    """
    Recognize the content of the image or pdf file according to the specified type.
    It will call the corresponding recognition function `.recognize_{file_type}()` according to the `file_type`.
    Args:
        img (Union[str, Path, Image.Image]): The image/pdf file path or `Image.Image` object
        file_type (str): Supported image types: 'pdf', 'page', 'text_formula', 'formula', 'text'
        **kwargs (dict): Arguments for the corresponding recognition function

    Returns: recognized results

    """
```

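Function description:

- Input parameter img: path of an image or PDF file, or an Image.Image object;
- Input parameter file_type: type of the input; one of 'pdf', 'page', 'text_formula', 'formula', 'text';
- Input parameter kwargs: other arguments; see the corresponding functions above.

Return value: depends on file_type; see the corresponding functions above.

Example: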

```python
from pix2text import Pix2Text

img_fp = 'examples/general.jpg'
p2t = Pix2Text.from_config()
out = p2t.recognize(img_fp, file_type='text')  # equivalent to p2t.recognize_text(img_fp)
```
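The docstring states that `recognize()` calls `.recognize_{file_type}()`. The dispatch idea can be sketched in a few lines. This is an illustration of the pattern under that stated behaviour, not the library's actual implementation; `dispatch_recognize` is a hypothetical stand-in that works on any object exposing the documented method names.

```python
FILE_TYPES = ('pdf', 'page', 'text_formula', 'formula', 'text')


def dispatch_recognize(p2t, img, file_type='text_formula', **kwargs):
    """Call `p2t.recognize_<file_type>(img, **kwargs)` for a supported type."""
    if file_type not in FILE_TYPES:
        raise ValueError(f'unsupported file_type: {file_type!r}')
    # Look the method up by name, e.g. 'text' -> p2t.recognize_text
    method = getattr(p2t, f'recognize_{file_type}')
    return method(img, **kwargs)


# Intended usage (requires pix2text and its models to be installed):
#   p2t = Pix2Text.from_config()
#   dispatch_recognize(p2t, 'examples/general.jpg', file_type='text')
```

Validating `file_type` before the lookup turns a typo into a clear `ValueError` instead of an `AttributeError`.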

For more usage examples, see tests/test_pix2text.py.
