PDF renderer: Tesseract inserts spaces for non-text blocks it finds · Issue #3957 · tesseract-ocr/tesseract

@bleze

Description

Environment

Current Behavior:

Tesseract detects spaces and inserts them into the output PDF. It seems to be confused by the horizontal lines in the attached image.
I overlay this PDF on top of another PDF that already contains text, and the inserted spaces cause problems in the resulting PDF because they intersect the existing text.

Expected Behavior:

Tesseract finding standalone spaces does not make sense to me. I would expect Tesseract to find only the characters in the logo and disregard the lines. At the very least, the lines should be recognized as underscores or something similar, not spaces.

Suggested Fix:

I am unsure whether the problem should be fixed in Tesseract itself or whether the spaces should be filtered out in the PDF renderer. I think it should be fixed in Tesseract, as that would fix all output formats.
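
To illustrate the "filter the spaces" idea, here is a minimal sketch of a caller-side workaround. It is not a fix inside Tesseract or its PDF renderer: it assumes the caller works with Tesseract's TSV output (whose header includes `level` and `text` columns, with `level` 5 marking word rows) and simply drops word rows whose text is empty or whitespace-only. The function name `drop_blank_words` and the sample rows are hypothetical and made up for illustration.

```python
import csv
import io


def drop_blank_words(tsv_text):
    """Return word-level TSV rows (level 5) whose text is not blank.

    Assumption: tsv_text is Tesseract TSV output including its header row.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    kept = []
    for row in reader:
        if row["level"] != "5":
            continue  # skip page, block, paragraph and line rows
        text = row["text"] or ""  # a missing trailing field is read as None
        if text.strip():
            kept.append(row)
    return kept


# Hypothetical sample: one real word and one whitespace-only "word"
# of the kind described above (a wide, thin box over a horizontal line).
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t80\t20\t95\tLOGO\n"
    "5\t1\t2\t1\t1\t1\t0\t50\t600\t3\t95\t \n"
)

words = drop_blank_words(SAMPLE_TSV)
print([w["text"] for w in words])
```

A filter like this only helps consumers of the TSV data; the PDF output would still contain the spaces, which is why a fix in Tesseract itself seems preferable.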

[Attached image: spaces]