persistent get_text() formatting
Is your feature request related to a problem? Please describe.
I have a problem extracting text from a document: the text of the page comes out in the wrong order.
import fitz  # PyMuPDF
doc = fitz.open('document.pdf')
text = doc.get_page_text(4)  # page numbers are 0-based
So I added sort=True to resolve this:
import fitz  # PyMuPDF
doc = fitz.open('document.pdf')
text = doc.get_page_text(4, sort=True)  # same page, sorted output
That resolved the ordering problem, but a new one appeared: some characters in the text were replaced with <?>.
I found some information about this behavior on this page: https://pymupdf.readthedocs.io/en/latest/recipes-common-issues-and-their-solutions.html#problem-unreadable-text
But I don't think this is the desired behaviour.
Describe the solution you'd like
I don't know how the package is implemented, but would it be possible to use the same text formatters in get_text("blocks") as in get_text("text")?
That would resolve the inconsistent formatting when the sort argument is changed.
Additional context
I'm sorry, I cannot send you the PDF file. It is an internal company file.
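Since the file cannot be shared, one way to narrow this down might be to report which characters differ between the two extractions. The following is only a rough sketch: the helper char_differences is hypothetical (it is not part of PyMuPDF), and it assumes that ignoring whitespace is acceptable because sorting can legitimately move line breaks.

```python
from collections import Counter


def char_differences(plain_text, sorted_text):
    """Return characters whose counts differ between the two extractions.

    Hypothetical helper, not part of PyMuPDF. Whitespace is ignored
    because sorting may legitimately change line breaks.
    """
    plain = Counter(ch for ch in plain_text if not ch.isspace())
    ordered = Counter(ch for ch in sorted_text if not ch.isspace())
    only_plain = plain - ordered    # characters missing when sort=True
    only_sorted = ordered - plain   # characters that appear only when sort=True
    return dict(only_plain), dict(only_sorted)


# Possible usage with the calls from this report (needs PyMuPDF and the PDF):
# import fitz
# doc = fitz.open('document.pdf')
# lost, gained = char_differences(
#     doc.get_page_text(4), doc.get_page_text(4, sort=True)
# )
# print(lost, gained)
```

The two dictionaries should show which original characters disappear and which replacement characters show up with sort=True, without exposing the document's content.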