Document.select() behaves weirdly with certain kinds of PDF files

Description of the bug

Document.select() does not work correctly with certain PDF files.
I want to extract text from PDF files. If a PDF has more than 30 pages, I keep only the first 30 pages.
The attached PDF file has 33 pages, so the code should select the first 30 pages and extract text from them.
Instead, it extracts only some bullets and dashes from the file, and I can't figure out why.
The same code works perfectly with other PDF files.
946f8445-6373-4f32-994c-04c495e2e7e9.pdf

Here is my code.

import os
import pathlib

import fitz


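def get_all_page_from_pdf(document, last_page=None):
    # Optionally keep only the pages before last_page.
    if last_page:
        document.select(list(range(0, last_page)))
    # Cap the document at its first 30 pages.
    if document.page_count > 30:
        document.select(list(range(0, 30)))
    # A generator expression is already an iterator, so iter() is not needed.
    return (page for page in document)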


path = "path to the pdf file"
filename = os.path.basename(path)
file_type = pathlib.Path(filename).suffix

# Read the whole file into memory; the with-block closes the file handle.
with open(path, "rb") as read_file:
    file_data = read_file.read()

doc = fitz.open(filename=filename, stream=file_data, filetype=file_type)

for i, page in enumerate(get_all_page_from_pdf(doc)):
    text = page.get_text()
    print(i, text)
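
For comparison, here is a minimal sketch that reads the first pages by index instead of calling select(). Whether this avoids the problem on the attached file is an assumption; I have not confirmed it. The helper name iter_first_pages_text and its limit parameter are made up for illustration. It relies only on page_count, load_page() and get_text().

```python
def iter_first_pages_text(document, limit=30):
    """Yield (index, text) for at most `limit` pages, without Document.select()."""
    # The document itself is not modified; pages are loaded by index instead.
    for index in range(min(limit, document.page_count)):
        page = document.load_page(index)
        yield index, page.get_text()


# Hypothetical usage with PyMuPDF (requires `import fitz`):
#     doc = fitz.open("path to the pdf file")
#     for i, text in iter_first_pages_text(doc):
#         print(i, text)
```

If this variant returns the full text for the same file, that would suggest the problem is in how select() rewrites the document rather than in get_text().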

How to reproduce the bug

You can reproduce the issue by running the script above on the attached PDF file.
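
To narrow the report down, a check like the following could list which of the kept pages lose text after select(). This is only a sketch: the function name texts_changed_by_select and the keep parameter are hypothetical, and it assumes that reading the text before select() does not itself change the outcome.

```python
def texts_changed_by_select(document, keep=30):
    """Return indices of kept pages whose text differs after select()."""
    count = min(keep, document.page_count)
    # Text of the first `count` pages in the untouched document.
    before = [document.load_page(i).get_text() for i in range(count)]
    # Keep only those pages, as the script in this report does.
    document.select(list(range(count)))
    # Text of the same pages after the document was reduced.
    after = [document.load_page(i).get_text() for i in range(count)]
    return [i for i in range(count) if before[i] != after[i]]


# Hypothetical usage with PyMuPDF (requires `import fitz`):
#     doc = fitz.open("path to the pdf file")
#     print(texts_changed_by_select(doc))
```

An empty list would mean select() left the text of the kept pages unchanged.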

PyMuPDF version

1.24.7

Operating system

Linux

Python version

3.10