Inconsistent behavior of `getText()`

Describe the bug

`getText()` returns different text content depending on the PyMuPDF version and the value of the `opt` parameter. This appears to have started with v1.18.2. If the text can be obtained as HTML, why not as plain text (and with the other options)?

Returned text per version and opt:

v1.18.1 and earlier:

- text "Hello World\n"
- html "Hello World\n"
- json "Hello World\n"
- blocks "Hello World\n"
- words "Hello World\n"

v1.18.2:

- text "\n"
- html "Hello World\n"
- json "\n"
- blocks ""
- words ""

v1.18.3 and later:

- text ""
- html "Hello World\n"
- json "\n"
- blocks ""
- words ""

To Reproduce

import fitz as pymupdf

doc = pymupdf.open('pdf-example-with-bug')  # see section above
page = doc.loadPage(0)
print(page.getText('text'))
print(page.getText('html'))
print(page.getText('json'))
print(page.getText('blocks'))
print(page.getText('words'))

NOTE: The '\n' returned in text mode is visible only after encoding the returned string, and it is not directly visible in the other modes either.
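Since a bare `print()` hides the difference between `"\n"` and `""`, a small helper that prints `repr()` of each result may make the comparison easier. This is only a sketch: `dump_text_outputs` and `OPTS` are hypothetical names, not part of PyMuPDF, and the helper assumes a page object exposing `getText(opt)` as in the script above.

```python
# Hypothetical helper (not part of PyMuPDF): collect repr() of every
# getText() result so that whitespace-only and empty outputs are visible.
OPTS = ("text", "html", "json", "blocks", "words")


def dump_text_outputs(page, opts=OPTS):
    """Return a dict mapping each opt to repr(page.getText(opt))."""
    # repr() shows '\n' explicitly and distinguishes '' from [].
    return {opt: repr(page.getText(opt)) for opt in opts}
```

With a real document, this could be called as `dump_text_outputs(doc.loadPage(0))` and the resulting dict printed once per installed PyMuPDF version.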

Expected behavior

The returned text should be consistent across all values of `opt`. If it can be obtained as HTML, it should be obtainable with the other options as well (as in v1.18.1).
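The expectation can be written down as a rough check. The sketch below assumes the outputs have already been collected into a dict keyed by `opt`; `missing_opts` is a hypothetical name, and the word-by-word substring test is a deliberately loose heuristic, since `words` returns each word separately and `html`/`json` wrap the text in markup.

```python
# Hypothetical consistency check: list the opts whose output does not
# contain every word of the expected text.
def missing_opts(outputs, expected="Hello World"):
    """Return the opts in `outputs` that lack any word of `expected`."""
    words = expected.split()
    return [
        opt
        for opt, result in outputs.items()
        # str() lets the same test cover strings and lists of tuples.
        if not all(word in str(result) for word in words)
    ]
```

An empty list would mean every option returned the expected text; for the affected versions, the list would name the options that lost it.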

Your configuration

Linux 4.19.160-1-MANJARO
gcc (GCC) 10.2.0
Python 3.8.6 (default, Sep 30 2020, 04:00:38)
PyMuPDF installed via pip (all mentioned versions), Python bindings for the MuPDF 1.18.0 library.
