PDF renderer: Tesseract inserts spaces for non-text blocks it finds #3957

bleze · 2022-11-04T09:42:49Z

Environment

Tesseract Version: 5.2.0 and 4.1.3
Platform: Windows 10, x64

Current Behavior:

Tesseract finds standalone spaces and inserts them into the output PDF. It seems to be confused by the horizontal lines in the attached image.
I overlay this PDF on top of another PDF that already contains text, and the stray spaces intersect with that text and cause problems in the resulting PDF.

Expected Behavior:

Tesseract finding standalone spaces does not make sense to me. I would expect Tesseract to find only the characters in the logo and disregard the lines. At the very least, the lines should be output as underscores or something similar, not spaces.

Suggested Fix:

I am unsure whether the problem should be fixed in Tesseract itself or whether the spaces should be filtered out in the PDF renderer. I lean towards Tesseract itself, as that would fix all output formats.

amitdo · 2022-11-06T16:24:09Z

Please post hocr output.

bleze · 2022-11-07T07:02:00Z

The hOCR file was created like this: tesseract.exe "C:\Tesseract\v.png" "c:\tesseract\v" --tessdata-dir "C:\Tesseract\tessdata_best-main" -l dan --psm 4 --oem 1 -c tessedit_create_hocr=1

Let me know if these parameters are not correct, thanks.

v.zip
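
For anyone who wants to compare the hOCR and PDF renderers on the same image, the command above can be assembled programmatically. This is only a sketch: the helper name `build_tesseract_command` is made up for illustration, the paths are the ones quoted in the report, and it assumes the PDF renderer can be switched on with `tessedit_create_pdf=1` in the same way as `tessedit_create_hocr=1`.

```python
def build_tesseract_command(image, output_base, tessdata_dir, renderer):
    """Return the argument list for one Tesseract run.

    `renderer` is "hocr" or "pdf" and selects the matching
    tessedit_create_* config variable. (Hypothetical helper.)
    """
    if renderer not in ("hocr", "pdf"):
        raise ValueError(f"unsupported renderer: {renderer}")
    return [
        "tesseract",
        str(image),
        str(output_base),
        "--tessdata-dir", str(tessdata_dir),
        "-l", "dan",      # Danish language data, as in the report
        "--psm", "4",     # page segmentation mode used in the report
        "--oem", "1",     # OCR engine mode used in the report
        "-c", f"tessedit_create_{renderer}=1",
    ]


# Paths are the ones quoted in the report; adjust them for your machine.
for renderer in ("hocr", "pdf"):
    cmd = build_tesseract_command(
        r"C:\Tesseract\v.png",
        r"c:\tesseract\v",
        r"C:\Tesseract\tessdata_best-main",
        renderer,
    )
    print(cmd)
```

Each list can be handed to `subprocess.run(cmd, check=True)` on a machine where `tesseract` is on the PATH; the sketch only prints the commands so that it runs without Tesseract installed.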

amitdo · 2022-11-07T07:32:22Z

Tesseract sees this as a photo + a few line separators.

The hocr renderer correctly drops the spaces.

The pdf renderer should ignore any non-text blocks.
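
One way to check this kind of diagnosis on your own files is to tally the element classes in the hOCR output. The sketch below assumes the hOCR renderer marks non-text blocks with the hOCR classes `ocr_photo` and `ocr_separator` and words with `ocrx_word`. The sample markup is hand-written for illustration and is not the content of the attached v.zip.

```python
from collections import Counter
from html.parser import HTMLParser


class HocrClassCounter(HTMLParser):
    """Count hOCR element classes (class names starting with "ocr")."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                for cls in value.split():
                    if cls.startswith("ocr"):
                        self.counts[cls] += 1


def count_hocr_classes(hocr_text):
    """Parse an hOCR string and return a Counter of its ocr* classes."""
    parser = HocrClassCounter()
    parser.feed(hocr_text)
    parser.close()
    return parser.counts


# Hand-written sample, NOT the real output for the reported image.
SAMPLE = """
<div class='ocr_page' id='page_1' title='bbox 0 0 600 200'>
 <div class='ocr_photo' id='block_1_1' title='bbox 10 10 200 120'></div>
 <div class='ocr_separator' id='block_1_2' title='bbox 0 130 600 134'></div>
 <div class='ocr_separator' id='block_1_3' title='bbox 0 150 600 154'></div>
</div>
"""

counts = count_hocr_classes(SAMPLE)
print(dict(counts))
```

A page that consists only of a photo and separators would show no `ocrx_word` entries at all, which is consistent with the hOCR output containing no stray spaces.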

amitdo · 2022-11-07T07:41:58Z

Related PR: #3723.

Fix tesseract-ocr#3957.
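
The idea behind the fix, ignoring non-text blocks when building the PDF text layer, can be modelled in a few lines. This is a simplified illustration and not the code in pdfrenderer.cpp: the `Block` class, the `kind` strings and `words_for_text_layer` are hypothetical stand-ins for Tesseract's own block types and iterators.

```python
from dataclasses import dataclass, field

# Hypothetical block kinds for illustration; Tesseract's own block-type
# enum distinguishes many more text and non-text categories.
TEXT_KINDS = {"text"}


@dataclass
class Block:
    kind: str                                  # e.g. "text", "image", "hline"
    words: list = field(default_factory=list)  # recognized strings in the block


def words_for_text_layer(blocks):
    """Yield words from text blocks only, skipping images and separators."""
    for block in blocks:
        if block.kind not in TEXT_KINDS:
            continue  # non-text block: emit nothing, not even a space
        yield from block.words


page = [
    Block("image", [" "]),   # logo detected as a photo
    Block("hline", [" "]),   # horizontal rule detected as a separator
    Block("text", ["Hello", "world"]),
]
print(list(words_for_text_layer(page)))
```

Filtering on the block type, rather than on whether a word happens to be whitespace, keeps legitimate spaces inside text blocks untouched.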

amitdo · 2022-11-08T07:03:29Z

@bleze,

See the fixed pdf output here:
#3959 (comment)

amitdo added the bug label Nov 7, 2022

amitdo changed the title ~~Tesseract finds standalone spaces and inserts them in rendered PDF~~ PDF renderer: Tesseract inserts spaces for non-text blocks it finds Nov 7, 2022

amitdo added a commit to amitdo/tesseract that referenced this issue Nov 7, 2022

pdfrenderer.cpp: Ignore non-text blocks

40a944c

Fix tesseract-ocr#3957.

amitdo added a commit to amitdo/tesseract that referenced this issue Nov 7, 2022

pdfrenderer.cpp: Ignore non-text blocks

1b38293

Fix tesseract-ocr#3957.

amitdo mentioned this issue Nov 7, 2022

pdfrenderer.cpp: Ignore non-text blocks #3959

Merged

amitdo added a commit to amitdo/tesseract that referenced this issue Nov 7, 2022

pdfrenderer.cpp: Ignore non-text blocks

f4c1946

Fix tesseract-ocr#3957.

amitdo added the PDF label Nov 7, 2022

amitdo added a commit to amitdo/tesseract that referenced this issue Nov 8, 2022

pdfrenderer.cpp: Ignore non-text blocks

c196456

Fix tesseract-ocr#3957.

stweil closed this as completed in #3959 Nov 10, 2022
