Calling recover_span_quad() fails because (dict) argument 'span' must have attribute 'chars' as well as key 'chars'

Describe the bug (mandatory)

I'm calling recover_span_quad(line['dir'], span, chars), where span is a dict obtained from get_text('rawdict'). The call aborts with "need 'rawdict' option to sub-select chars" because span has no 'chars' attribute: it is a dict, so 'chars' is a key, not an attribute.

I tried wrapping span in a class so that span['chars'] and span.chars both work, but then the call to recover_quad() (made from recover_span_quad()) aborts with "bad span argument" because type(span) is no longer dict.

I'm not a Python expert, but I don't see how recover_span_quad() can possibly work, since it requires span both to be a dict and to have a 'chars' attribute. Am I missing something?
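
To illustrate the key-versus-attribute conflict described above, here is a minimal sketch in plain Python, with no PyMuPDF required. The two checks it performs (hasattr(span, 'chars') and type(span) is dict) are assumptions inferred from the two error messages; they are not copied from fitz/utils.py. SpanWrapper and SpanDict are hypothetical names used only for this illustration.

```python
# Stand-in for a span as returned by get_text('rawdict'); real spans carry more keys.
span = {"chars": [{"c": "4"}, {"c": "2"}]}


class SpanWrapper:
    """Hypothetical wrapper: 'chars' works as a key and as an attribute."""

    def __init__(self, data):
        self._data = data
        self.chars = data["chars"]

    def __getitem__(self, key):
        return self._data[key]


class SpanDict(dict):
    """Hypothetical dict subclass that also exposes 'chars' as an attribute."""

    @property
    def chars(self):
        return self["chars"]


wrapped = SpanWrapper(span)
subclassed = SpanDict(span)

# A plain dict has the key but not the attribute.
plain_has_key = "chars" in span
plain_has_attr = hasattr(span, "chars")

# Both workarounds gain the attribute ...
wrapped_has_attr = hasattr(wrapped, "chars")
subclassed_has_attr = hasattr(subclassed, "chars")

# ... but neither passes an exact type comparison against dict.
wrapped_is_dict = type(wrapped) is dict
subclassed_is_dict = type(subclassed) is dict
subclassed_isinstance = isinstance(subclassed, dict)
```

If the library really performs both checks this way, no object can pass them together: an exact type(span) is dict comparison rejects every subclass and wrapper, while a plain dict never gains a 'chars' attribute. A membership test such as 'chars' in span would accept the plain dict.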

To Reproduce (mandatory)

import re

import fitz  # PyMuPDF

filename = "example.pdf"  # placeholder: any PDF document
doc = fitz.open(filename)
for page in doc:
    page_dict = page.get_text('rawdict')
    # keep only text blocks (type 0)
    for block in filter(lambda x: x['type'] == 0, page_dict['blocks']):
        for line in block['lines']:
            for span in line['spans']:
                word = []  # char dicts of the word collected so far
                for char in span['chars']:
                    c = char['c']
                    if re.match(r'\s', c):
                        if word:
                            text = ''.join(x['c'] for x in word)
                            ## this call aborts because span is a dict and doesn't have a 'chars' attribute!
                            quad = fitz.recover_span_quad(line['dir'], span, chars=word)
                            word = []  # start collecting the next word
                    else:
                        word.append(char)

Expected behavior (optional)

The call to recover_span_quad() should succeed.

Your configuration (mandatory)

>>> print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)
3.9.5 (default, Jun 24 2021, 15:50:05)
[Clang 12.0.0 (clang-1200.0.32.29)]
 darwin

PyMuPDF 1.19.1: Python bindings for the MuPDF 1.19.0 library.
Version date: 2021-10-23 00:00:01.
Built for Python 3.9 on darwin (64-bit).

Additional context (optional)

Here's the source code: https://github.com/pymupdf/PyMuPDF/blob/master/fitz/utils.py

Broader context: I'm trying to extract numbers and highlight them, ignoring superscripts, so I use get_text('rawdict') to get access to the span's 'flags' key. But when I concatenate the individual chars so I can match them against my regex, and then try to get the quad that spans the extracted number, I run into this error.
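
As a possible stopgap for the use case just described, the sketch below groups a span's chars into words and takes the union of the per-char 'bbox' tuples instead of calling recover_span_quad(). It is an approximation, not a replacement: an axis-aligned box matches the quad only for unrotated text. The helper names words_with_bboxes and numbers_in_span, and the number pattern, are hypothetical; the code assumes the rawdict layout in which each char has 'c' and 'bbox' and each span has 'flags' with bit 0 marking superscripts.

```python
import re

SUPERSCRIPT = 1  # bit 0 of a span's 'flags' (assumed rawdict convention)


def words_with_bboxes(span):
    """Split a rawdict-style span into (text, bbox) pairs, one per word."""
    results = []
    word = []  # char dicts of the word being collected

    def flush():
        if word:
            text = "".join(ch["c"] for ch in word)
            boxes = [ch["bbox"] for ch in word]
            # Union of the char boxes: (x0, y0, x1, y1)
            bbox = (
                min(b[0] for b in boxes),
                min(b[1] for b in boxes),
                max(b[2] for b in boxes),
                max(b[3] for b in boxes),
            )
            results.append((text, bbox))
            word.clear()

    for ch in span["chars"]:
        if ch["c"].isspace():
            flush()
        else:
            word.append(ch)
    flush()  # the last word has no trailing whitespace
    return results


def numbers_in_span(span, pattern=r"\d+(\.\d+)?"):
    """Return (text, bbox) for words that are numbers, skipping superscript spans."""
    if span["flags"] & SUPERSCRIPT:
        return []
    return [(t, b) for t, b in words_with_bboxes(span) if re.fullmatch(pattern, t)]
```

Each returned bbox could then be wrapped as fitz.Rect(bbox) and passed to page.add_highlight_annot(), on the assumption that the text is horizontal.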

Wonderful library: thanks so much for providing it!

[Update: fixed the example to pass in a list of 'char' dicts instead of a list of characters; this doesn't change the problem.]