PDF renderer: Tesseract inserts spaces for non-text blocks it finds #3957

Closed
bleze opened this issue Nov 4, 2022 · 5 comments · Fixed by #3959

@bleze

bleze commented Nov 4, 2022

Environment

  • Tesseract Version: 5.2.0 and 4.1.3
  • Platform: Windows 10, x64

Current Behavior:

Spaces are found by Tesseract and inserted in output PDF. It seems to be confused by the horizontal lines in the attached image.
I use this PDF to overlay on top of another PDF which already contains text, and this causes problems in the resulting PDF due to intersections.

Expected Behavior:

Tesseract finding standalone spaces does not make sense to me. I would expect Tesseract to only find characters in the logo and disregard the lines. At least the lines should be underscores or something - not spaces.

Suggested Fix:

I am unsure whether the problem should be fixed in Tesseract itself or whether the spaces should be filtered out in the PDF renderer. I lean toward fixing it in Tesseract, since that would fix all output formats.

[Attached image: spaces]

@amitdo
Collaborator

amitdo commented Nov 6, 2022

Please post hocr output.

@bleze
Author

bleze commented Nov 7, 2022

The hOCR file was created like this:

tesseract.exe "C:\Tesseract\v.png" "c:\tesseract\v" --tessdata-dir "C:\Tesseract\tessdata_best-main" -l dan --psm 4 --oem 1 -c tessedit_create_hocr=1

Let me know if these parameters are not correct, thanks.

v.zip
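
For anyone who wants to inspect an hOCR file like the one attached, here is a minimal Python sketch that counts the element classes it contains. It assumes the non-text blocks show up with the classes `ocr_photo` and `ocr_separator` and the words with `ocrx_word`. The `SAMPLE_HOCR` fragment is invented for illustration and is not the content of v.zip.

```python
from collections import Counter
from html.parser import HTMLParser


class HocrClassCounter(HTMLParser):
    """Counts how often each class name appears on elements in an hOCR file."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        cls = dict(attrs).get("class")
        if cls:
            for name in cls.split():
                self.counts[name] += 1


def count_hocr_classes(hocr_text):
    """Return a Counter mapping class name -> number of elements."""
    parser = HocrClassCounter()
    parser.feed(hocr_text)
    parser.close()
    return parser.counts


# Invented fragment for illustration only (not the real attachment).
SAMPLE_HOCR = """
<div class='ocr_page' id='page_1'>
  <div class='ocr_photo' id='block_1_1' title="bbox 10 10 200 80"></div>
  <div class='ocr_separator' id='block_1_2' title="bbox 10 90 400 92"></div>
  <div class='ocr_separator' id='block_1_3' title="bbox 10 120 400 122"></div>
  <span class='ocrx_word' id='word_1_1' title='bbox 20 20 90 60'>Logo</span>
</div>
"""

if __name__ == "__main__":
    for name, n in sorted(count_hocr_classes(SAMPLE_HOCR).items()):
        print(f"{name}: {n}")
```

To check a real file, read it as text (for example with `open(path, encoding="utf-8").read()`) and pass the result to `count_hocr_classes`.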

@amitdo
Collaborator

amitdo commented Nov 7, 2022

Tesseract sees this as a photo plus a few line separators.

The hocr renderer correctly drops the spaces.

The pdf renderer should ignore any non-text blocks.
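
The idea of ignoring non-text blocks can be sketched roughly as follows. This is only an illustration in Python, not the actual change to the PDF renderer (which is C++). The `Block` type, the `kind` labels and `text_for_pdf` are hypothetical names made up for this sketch.

```python
from dataclasses import dataclass

# Hypothetical block labels; the real layout analysis uses its own block types.
TEXT_KINDS = {"text"}


@dataclass
class Block:
    kind: str  # e.g. "text", "photo", "separator" (labels assumed for this sketch)
    text: str  # whatever the recognizer produced for the block


def text_for_pdf(blocks):
    """Keep only the output of text blocks; skip photos and line separators."""
    return [b.text for b in blocks if b.kind in TEXT_KINDS]


if __name__ == "__main__":
    page = [
        Block("photo", " "),      # a logo detected as an image
        Block("separator", " "),  # a horizontal line
        Block("text", "Logo"),
    ]
    print(text_for_pdf(page))
```

With a filter like this, the standalone spaces that come from photo and separator blocks never reach the invisible text layer, so they cannot intersect with text in the underlying PDF.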

@amitdo
Collaborator

amitdo commented Nov 7, 2022

Related PR: #3723.

amitdo added the bug label Nov 7, 2022
amitdo changed the title from "Tesseract finds standalone spaces and inserts them in rendered PDF" to "PDF renderer: Tesseract inserts spaces for non-text blocks it finds" Nov 7, 2022
amitdo added 3 commits to amitdo/tesseract that referenced this issue Nov 7, 2022
amitdo added the PDF label Nov 7, 2022
amitdo added a commit to amitdo/tesseract that referenced this issue Nov 8, 2022

@amitdo
Collaborator

amitdo commented Nov 8, 2022

@bleze,

See the fixed pdf output here:
#3959 (comment)
