Cannot get Tessdata with Tesseract-OCR 5

Description of the bug

The pymupdf.get_tessdata() function raises an unexpected error when the installed version of Tesseract OCR is not 4.0 (tested on the latest Debian, with Tesseract 5).

>>> import pymupdf
>>> pymupdf.get_tessdata()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<...>/venv/lib/python3.11/site-packages/pymupdf/__init__.py", line 18082, in get_tessdata
    for sub_response in response.iterdir():
                        ^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'iterdir'

>>> pymupdf.version
('1.24.9', '1.24.8', '20240724000001')

How to reproduce the bug

I haven't looked into the details yet, but I think the problem lies here:

# determine tessdata via iteration over subfolders
tessdata = None
for sub_response in response.iterdir():
    for sub_sub in sub_response.iterdir():
        if str(sub_sub).endswith("tessdata"):
            tessdata = sub_sub
            break

I have the latest Debian with Tesseract OCR 5.3.0, installed in /usr/share/tesseract-ocr/5/tessdata/.
The function get_tessdata() expects it in /usr/share/tesseract-ocr/4.00/tessdata; otherwise it searches for it with whereis tesseract-ocr.

However, it then calls iterdir() directly on the subprocess response, which is a list of bytes objects rather than a pathlib.Path, and that raises the error.

>>> import subprocess
>>> cp = subprocess.run('whereis tesseract-ocr', shell=1, capture_output=1, check=0)
>>> cp
CompletedProcess(args='whereis tesseract-ocr', returncode=0, stdout=b'tesseract-ocr: /usr/share/tesseract-ocr\n', stderr=b'')
>>> response = cp.stdout.strip().split()
>>> response
[b'tesseract-ocr:', b'/usr/share/tesseract-ocr']
>>> type(response), type(response[0])
(<class 'list'>, <class 'bytes'>)

>>> response.iterdir()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'iterdir'
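One way to bridge the gap between the raw whereis output and something that supports iterdir() is to decode the bytes and build pathlib.Path objects from them. The sketch below is only an illustration: parse_whereis is a hypothetical helper, not part of PyMuPDF, and it assumes the reported locations contain no spaces.

```python
from pathlib import Path


def parse_whereis(stdout: bytes) -> list[Path]:
    """Turn raw `whereis` output into a list of Path objects.

    Hypothetical helper for illustration; not part of PyMuPDF.
    """
    # Output looks like b"name: /path/one /path/two\n".
    # Everything after the first colon is the list of locations.
    _, _, locations = stdout.decode("utf-8").partition(":")
    # Assumption: locations contain no spaces, so whitespace splitting is safe.
    return [Path(p) for p in locations.split()]


# The output captured above yields one usable directory.
print(parse_whereis(b"tesseract-ocr: /usr/share/tesseract-ocr\n"))
# When whereis finds nothing, only the label is printed and the list is empty.
print(parse_whereis(b"tesseract-ocr:\n"))
```

Returning an empty list for the "nothing found" case lets the caller raise a clear error instead of failing with an IndexError on response[1].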

I don't quite know the inner workings of Tesseract or PyMuPDF, but it seems that this function is looking for a sub-sub-folder whose name ends with tessdata, and should find it under the second element of response. So I guess something like this should work?

import pathlib
import subprocess

cp = subprocess.run('whereis tesseract-ocr', shell=1, capture_output=1, check=0)
response = cp.stdout.strip().split()
response_dir = pathlib.Path(response[1].decode("utf-8"))

# response_dir == PosixPath('/usr/share/tesseract-ocr')

for sub_dir in response_dir.iterdir():
    for sub_sub_dir in sub_dir.iterdir():
        if sub_sub_dir.name.endswith("tessdata"):
            tessdata = str(sub_sub_dir)
            break

# tessdata == '/usr/share/tesseract-ocr/5/tessdata'
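A slightly more defensive variant of the same idea might look like the sketch below. It is not PyMuPDF's implementation: find_tessdata is a hypothetical name, and the choice to return the first match in sorted order is an assumption made only to keep the result deterministic. Compared with the loop above, it skips plain files (calling iterdir() on a file raises NotADirectoryError) and returns from a function, so the search stops at the first hit instead of only leaving the inner loop.

```python
from __future__ import annotations

from pathlib import Path


def find_tessdata(root: Path) -> str | None:
    """Return the first <root>/<sub>/<...tessdata> directory found, else None.

    Hypothetical helper for illustration; not PyMuPDF's implementation.
    """
    if not root.is_dir():
        return None
    # sorted() makes the result deterministic when several versions coexist.
    for version_dir in sorted(root.iterdir()):
        if not version_dir.is_dir():
            continue  # plain files have no children to list
        for sub_dir in sorted(version_dir.iterdir()):
            if sub_dir.is_dir() and sub_dir.name.endswith("tessdata"):
                return str(sub_dir)  # returning exits both loops at once
    return None


# Directory reported by whereis in this issue; prints None if it is absent.
print(find_tessdata(Path("/usr/share/tesseract-ocr")))
```

If both a 4.00 and a 5 folder exist, this sketch picks whichever sorts first, which may not be the version actually in use.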

Yeah, I know I should set the TESSDATA_PREFIX environment variable anyway, but since the expected 4.0 version of Tesseract OCR is about six years old now and no longer seems to be in the Debian repos, I guess it wouldn't hurt to handle this case (unless 5.0 is unsupported)?
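Until the lookup is fixed, the environment variable can also be set from Python before get_tessdata() is called. The sketch below assumes, as the remark above implies, that get_tessdata() consults TESSDATA_PREFIX before falling back to its own search. ensure_tessdata_prefix is a hypothetical helper, and the candidate paths are just the two mentioned in this report.

```python
from __future__ import annotations

import os
from pathlib import Path


def ensure_tessdata_prefix(candidates: list[str]) -> str | None:
    """Set TESSDATA_PREFIX to the first existing candidate if it is unset.

    Hypothetical helper for illustration; not a PyMuPDF function.
    """
    current = os.environ.get("TESSDATA_PREFIX")
    if current:
        return current  # respect a value that was already exported
    for candidate in candidates:
        if Path(candidate).is_dir():
            os.environ["TESSDATA_PREFIX"] = candidate
            return candidate
    return None  # nothing found; the environment is left untouched


# Paths mentioned in this report; extend the list for other layouts.
ensure_tessdata_prefix([
    "/usr/share/tesseract-ocr/5/tessdata",
    "/usr/share/tesseract-ocr/4.00/tessdata",
])
```

The change only affects the current process and its children, so it has to run before the OCR code that depends on it.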

Thanks for developing PyMuPDF! :)

PyMuPDF version

1.24.9

Operating system

Linux

Python version

3.11