PDF renderer: Tesseract inserts spaces for non-text blocks it finds · Issue #3957 · tesseract-ocr/tesseract
Description
Environment
- Tesseract Version: 5.2.0 and 4.1.3
- Platform: Windows 10, x64
Current Behavior:
Tesseract detects standalone spaces and inserts them into the output PDF. It appears to be confused by the horizontal lines in the attached image.
I overlay this PDF on top of another PDF that already contains text, and the inserted spaces cause problems in the resulting PDF because they intersect with the existing text.
Expected Behavior:
Tesseract finding standalone spaces does not make sense to me. I would expect Tesseract to find only the characters in the logo and disregard the lines. At the very least, the lines should be recognized as underscores or something similar, not spaces.
Suggested Fix:
I am unsure whether the problem should be fixed in Tesseract itself or whether the spaces should be filtered out in the PDF renderer. I think it should be fixed in Tesseract, as that would fix all output formats.
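Until this is fixed, one possible workaround is to filter the whitespace-only words out of Tesseract's output before building the overlay. The sketch below is only an illustration of that idea, not how Tesseract or its PDF renderer works: it assumes the TSV output format (12 tab-separated columns, with word entries at level 5) instead of the PDF, and the helper names `parse_tsv` and `drop_blank_words` and the sample rows are made up for this example.

```python
# Column layout of Tesseract's TSV output.
TSV_COLUMNS = [
    "level", "page_num", "block_num", "par_num", "line_num", "word_num",
    "left", "top", "width", "height", "conf", "text",
]


def parse_tsv(tsv_text):
    """Parse TSV text (header line first) into a list of row dicts."""
    rows = []
    for line in tsv_text.splitlines()[1:]:  # skip the header line
        if not line:
            continue
        # Split at most 11 times so the text column is kept in one piece.
        fields = line.split("\t", len(TSV_COLUMNS) - 1)
        # Pad short rows so a missing text column becomes an empty string.
        fields += [""] * (len(TSV_COLUMNS) - len(fields))
        rows.append(dict(zip(TSV_COLUMNS, fields)))
    return rows


def drop_blank_words(rows):
    """Remove word-level rows (level 5) whose text is empty or whitespace."""
    return [
        row for row in rows
        if not (row["level"] == "5" and row["text"].strip() == "")
    ]


# Hypothetical sample: one real word and one standalone space, as might be
# reported for a horizontal line. The values are invented for illustration.
sample = "\n".join([
    "\t".join(TSV_COLUMNS),
    "\t".join(["4", "1", "1", "1", "1", "0", "10", "10", "200", "40", "-1", ""]),
    "\t".join(["5", "1", "1", "1", "1", "1", "10", "10", "90", "40", "95", "LOGO"]),
    "\t".join(["5", "1", "2", "1", "1", "1", "10", "80", "400", "3", "95", " "]),
])

kept = drop_blank_words(parse_tsv(sample))
kept_words = [row["text"] for row in kept if row["level"] == "5"]
```

This only hides the symptom in one output format. The remaining boxes would still have to be turned into an overlay by other means, which is why a fix inside Tesseract would be preferable.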