Inconsistent behavior of getText()
Describe the bug
File: pdf-example-with-bug.zip
The method getText() returns different text content depending on the PyMuPDF version used and the value of the opt parameter. This seems to have started with v1.18.2. If the text can be obtained as HTML, why not as plain text (and in the other formats)?
Returned text per version and opt:
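| Version  | text            | html            | json            | blocks          | words           |
|----------|-----------------|-----------------|-----------------|-----------------|-----------------|
| v1.18.1- | "Hello World\n" | "Hello World\n" | "Hello World\n" | "Hello World\n" | "Hello World\n" |
| v1.18.2  | "\n"            | "Hello World\n" | "\n"            | ""              | ""              |
| v1.18.3+ | ""              | "Hello World\n" | "\n"            | ""              | ""              |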
To Reproduce
import fitz as pymupdf

doc = pymupdf.open('pdf-example-with-bug')  # see section above
page = doc.loadPage(0)
print(page.getText('text'))
print(page.getText('html'))
print(page.getText('json'))
print(page.getText('blocks'))
print(page.getText('words'))
NOTE: The '\n' in text mode becomes visible only after encoding the returned string; it is not directly visible in the other modes either.
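One way to make such whitespace visible when comparing results is to print the repr() of each value instead of the value itself. This is only a minimal sketch: the helper name describe is hypothetical and not part of PyMuPDF, and the sample strings are typed in by hand from the values reported for v1.18.2, not produced by a real call.

```python
def describe(opt, result):
    """Return a one-line summary in which newlines and empty results are visible."""
    # !r applies repr(), so a newline is rendered as a backslash followed by 'n'
    # and an empty string is rendered as a pair of quotes.
    return f"{opt}: {result!r}"


# Hand-typed sample values (assumption: simplified, not real getText() output).
samples = [("text", "\n"), ("html", "Hello World\n"), ("words", "")]
for opt, result in samples:
    print(describe(opt, result))
```

With a real document, the same helper could be applied to each page.getText(opt) result from the reproduction script.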
Expected behavior
The returned text should be consistent across all values of opt. If it can be obtained as HTML, it should be obtainable for the other opt values as well (see v1.18.1).
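The expectation above could be checked automatically by reducing every result to plain text and comparing. The following is a rough sketch under stated assumptions, not a definitive test: the helper names plain_text and is_consistent are hypothetical, 'text' and 'html' results are assumed to be strings, 'json' is assumed to be a JSON string nested as blocks, lines, spans, text, and 'blocks' and 'words' are assumed to be sequences of tuples whose item at index 4 holds the text. The sample data is hand-written to match those shapes and is not real output.

```python
import json
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the character data of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text(opt, result):
    """Reduce one result to whitespace-normalized plain text (assumed shapes)."""
    if opt == "text":
        raw = result
    elif opt == "html":
        collector = _TextCollector()
        collector.feed(result)
        collector.close()
        raw = " ".join(collector.parts)
    elif opt == "json":
        page = json.loads(result)
        raw = " ".join(
            span["text"]
            for block in page.get("blocks", [])
            for line in block.get("lines", [])
            for span in line.get("spans", [])
        )
    elif opt in ("blocks", "words"):
        # Assumption: the text of each tuple sits at index 4.
        raw = " ".join(item[4] for item in result)
    else:
        raise ValueError(f"unsupported opt: {opt!r}")
    # Collapse all runs of whitespace so that layout differences do not matter.
    return " ".join(raw.split())


def is_consistent(results):
    """Return True if every opt in the mapping yields the same plain text."""
    return len({plain_text(opt, res) for opt, res in results.items()}) <= 1


# Hand-written sample shaped like the assumed results (not real output).
sample = {
    "text": "Hello World\n",
    "html": "<div><p><span>Hello World</span></p></div>\n",
    "words": [(0, 0, 1, 1, "Hello", 0, 0, 0), (1, 0, 2, 1, "World", 0, 0, 1)],
}
print(is_consistent(sample))
```

For a real page, the mapping could be built as {opt: page.getText(opt) for opt in ('text', 'html', 'json', 'blocks', 'words')}. Joining HTML fragments with spaces is a simplification and may split words that span several tags.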
Your configuration
- Linux 4.19.160-1-MANJARO
- gcc (GCC) 10.2.0
- Python 3.8.6 (default, Sep 30 2020, 04:00:38)
- PyMuPDF installed via pip (all mentioned versions), Python bindings for the MuPDF 1.18.0 library.