Document.select() behaves unexpectedly on certain PDF files
Description of the bug
Document.select() does not work correctly on certain PDF files.
I want to extract text from PDF files. If a PDF has more than 30 pages, I keep only the first 30 pages.
The attached PDF file has 33 pages, so the code should select the first 30 pages and extract their text.
Instead, it extracts only some bullets and dashes, and I can't figure out why.
The code works perfectly on other PDF files.
946f8445-6373-4f32-994c-04c495e2e7e9.pdf
Here is my code.
import os
import pathlib

import fitz  # PyMuPDF


def get_all_page_from_pdf(document, last_page=None):
    # Optionally keep only the pages before last_page.
    if last_page:
        document.select(list(range(0, last_page)))
    # Cap the document at its first 30 pages.
    if document.page_count > 30:
        document.select(list(range(0, 30)))
    return iter(page for page in document)


path = "path to the pdf file"
filename = os.path.basename(path)
file_type = pathlib.Path(filename).suffix
# Read the whole file into memory and close the handle afterwards.
with open(path, "rb") as read_file:
    file_data = read_file.read()
doc = fitz.open(filename=filename, stream=file_data, filetype=file_type)
for i, page in enumerate(get_all_page_from_pdf(doc)):
    text = page.get_text()
    print(i, text)
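As a possible stopgap while select() misbehaves, the first 30 pages could be read by index without modifying the document at all. This is only a minimal sketch: `first_pages` is a hypothetical helper name, it relies only on `page_count` and integer indexing of the document, and I have not verified that it avoids the symptom on the attached file.

```python
def first_pages(document, limit=30):
    """Yield up to `limit` pages from the start of `document`.

    The document is left untouched: no call to select() is made.
    `document` only needs a `page_count` attribute and integer indexing.
    """
    count = min(limit, document.page_count)
    for page_number in range(count):
        yield document[page_number]


# Intended use with the script above (assumes `doc` is the opened document):
# for i, page in enumerate(first_pages(doc)):
#     print(i, page.get_text())
```

Because nothing is removed from the document, this sidesteps select() rather than fixing it, so it is only suitable when the trimmed document itself is not needed afterwards.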
How to reproduce the bug
Run the script above against the attached PDF file.
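To narrow down whether select() itself changes the extracted text, one option is to compare each page's text before and after the call. The sketch below is an assumption about how such a check could look, not part of the original script: `pages_changed_by_select` is a hypothetical name, and it uses only `page_count`, integer indexing, `select()` and `get_text()`.

```python
def pages_changed_by_select(document, limit=30):
    """Return the page numbers whose text differs after select().

    Text is captured as plain strings before select() is called, so no
    page object from before the call is reused afterwards.
    """
    count = min(limit, document.page_count)
    before = [document[i].get_text() for i in range(count)]
    # Keep only the first `count` pages, as the original script does.
    document.select(list(range(count)))
    after = [document[i].get_text() for i in range(count)]
    return [i for i in range(count) if before[i] != after[i]]


# Intended use (assumes `doc` is a freshly opened document):
# print(pages_changed_by_select(doc))
```

An empty list would suggest the text is already wrong before select() runs; a non-empty list would point at select() as the step that alters the output.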
PyMuPDF version
1.24.7
Operating system
Linux
Python version
3.10