Tesseract LSTM 4.0: letters repeat in recognized text #884

blueshade7 · 2017-05-05T17:11:48Z

When I run the tesseract command-line program (Windows prebuilt binary, 4.0.0 alpha) on this image in LSTM mode, I get:
LoOrenm 1pPpSsSUlI

Why do the letters repeat? Is it stuttering?
In Tesseract mode (oem=0), I get the correct text: Lorem ipsum
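
For anyone who wants to compare the two engines side by side, here is a minimal sketch that runs the command-line program once per engine mode. The file name `lorem.png` and the helper names are placeholders of mine, not from the report. It assumes `tesseract` is on the PATH and that the installed traineddata contains the legacy model (needed for `--oem 0`).

```python
import os
import shutil
import subprocess


def build_command(image_path, oem):
    """Build a tesseract CLI call that prints the recognized text to stdout.

    oem 0 selects the legacy engine, oem 1 the LSTM engine.
    """
    return ["tesseract", image_path, "stdout", "--oem", str(oem)]


def recognize(image_path, oem):
    """Run tesseract and return the recognized text with surrounding whitespace removed."""
    result = subprocess.run(
        build_command(image_path, oem),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    image = "lorem.png"  # hypothetical file name; substitute the test image
    if shutil.which("tesseract") is None or not os.path.exists(image):
        print("tesseract or the test image is not available; nothing to compare")
    else:
        for oem in (0, 1):
            print(f"oem={oem}: {recognize(image, oem)!r}")
```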

stweil · 2017-05-05T17:26:54Z

This might be one more example where the old 3.x recognizer produces better results than 4.x with LSTM. See here for the related discussion.

blueshade7 · 2017-05-12T06:01:52Z

I dug into this and found that the letter bounding boxes are narrower than they should be.
I debugged a hand-built Tesseract 4 on another platform, so the output differs a bit from the above, but PageIterator::BoundingBox returns boxes narrower than the actual glyphs, as shown below. It seems the same glyph image is recognized several times because the horizontal scan stride is shorter than it should be:

letter L BoundingBox=(2, 48, 375, 230)
letter o BoundingBox=(484, 76, 521, 236)
letter O BoundingBox=(521, 76, 559, 236)
letter r BoundingBox=(559, 76, 1043, 236)
letter e BoundingBox=(1119, 76, 1438, 236)
letter I BoundingBox=(1527, 76, 1564, 230)
letter n BoundingBox=(1564, 76, 1611, 230)
letter m BoundingBox=(1611, 76, 1658, 230)
letter n BoundingBox=(1658, 76, 1890, 230)
letter 1 BoundingBox=(2182, 1, 2436, 230)
letter p BoundingBox=(2607, 76, 2645, 295)
letter P BoundingBox=(2645, 76, 2682, 295)
letter p BoundingBox=(2682, 76, 2784, 295)
letter S BoundingBox=(2784, 76, 2826, 295)
letter s BoundingBox=(2999, 76, 3036, 237)
letter S BoundingBox=(3036, 76, 3129, 237)
letter U BoundingBox=(3186, 74, 3390, 236)
letter l BoundingBox=(3390, 74, 3548, 236)
letter I BoundingBox=(3548, 74, 3582, 236)
letter M BoundingBox=(3790, 76, 4172, 230)
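
A rough way to spot the suspicious entries in the listing above is to look for consecutive boxes that share an edge (the right x of one equals the left x of the next) and carry the same letter, ignoring case. This is only a heuristic sketch over the dumped values, not anything Tesseract does internally; the function names are my own.

```python
import re

# Matches lines such as "letter L BoundingBox=(2, 48, 375, 230)".
BOX_RE = re.compile(r"letter (\S) BoundingBox=\((\d+), (\d+), (\d+), (\d+)\)")

LISTING = """\
letter L BoundingBox=(2, 48, 375, 230)
letter o BoundingBox=(484, 76, 521, 236)
letter O BoundingBox=(521, 76, 559, 236)
letter r BoundingBox=(559, 76, 1043, 236)
letter e BoundingBox=(1119, 76, 1438, 236)
letter I BoundingBox=(1527, 76, 1564, 230)
letter n BoundingBox=(1564, 76, 1611, 230)
letter m BoundingBox=(1611, 76, 1658, 230)
letter n BoundingBox=(1658, 76, 1890, 230)
letter 1 BoundingBox=(2182, 1, 2436, 230)
letter p BoundingBox=(2607, 76, 2645, 295)
letter P BoundingBox=(2645, 76, 2682, 295)
letter p BoundingBox=(2682, 76, 2784, 295)
letter S BoundingBox=(2784, 76, 2826, 295)
letter s BoundingBox=(2999, 76, 3036, 237)
letter S BoundingBox=(3036, 76, 3129, 237)
letter U BoundingBox=(3186, 74, 3390, 236)
letter l BoundingBox=(3390, 74, 3548, 236)
letter I BoundingBox=(3548, 74, 3582, 236)
letter M BoundingBox=(3790, 76, 4172, 230)
"""


def parse_boxes(text):
    """Return a list of (letter, left, top, right, bottom) tuples."""
    boxes = []
    for match in BOX_RE.finditer(text):
        letter = match.group(1)
        left, top, right, bottom = (int(g) for g in match.groups()[1:])
        boxes.append((letter, left, top, right, bottom))
    return boxes


def abutting_repeats(boxes):
    """Return (previous, current) letter pairs whose boxes touch and whose
    letters are equal when case is ignored."""
    pairs = []
    for prev, cur in zip(boxes, boxes[1:]):
        shares_edge = prev[3] == cur[1]  # previous right == current left
        same_letter = prev[0].lower() == cur[0].lower()
        if shares_edge and same_letter:
            pairs.append((prev[0], cur[0]))
    return pairs


boxes = parse_boxes(LISTING)
repeats = abutting_repeats(boxes)
print(repeats)
```

On this listing the check flags o/O, p/P, P/p and s/S. It does not catch confusions such as I/n/m/n, where the touching boxes carry different letters.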

kolomiyets · 2017-05-12T08:25:41Z

I also noticed double characters in the output, but they disappear (although not completely) as soon as the model gets better (~ < 0.1%).

blueshade7 · 2017-05-17T22:11:48Z

I drew a bounding box around each recognized letter in this image. Some are spot on, but many are off, even though the text is correctly recognized as "Simple Test". Note that the boxes are intentionally drawn offset at the top and bottom to minimize the chance of box overlaps.

Shreeshrii · 2017-05-18T03:06:31Z

This is probably because the LSTM engine trains on text lines rather than on separate letters.

@theraysmith can clarify.

stweil · 2020-11-09T07:37:26Z

I cannot reproduce the issue with the latest Tesseract or with release 4.1.1. Both produce "Lorem 17S tira", which is not correct but does not show duplicated characters.

Shreeshrii mentioned this issue Jun 28, 2017

Incorrect recognotion of specific words - additional letters inserted #1011

Closed

amitdo mentioned this issue Aug 1, 2017

German - Characters added to result multiple times (aä / AÄ) #1060

Open

Shreeshrii mentioned this issue Apr 22, 2018

Tesseract inserting additional alternative characters #1465

Open

stweil added the accuracy label Nov 7, 2020

stweil mentioned this issue Nov 7, 2020

Character confusion fix suggestion #3144

Open

amitdo added the diplopia label Mar 17, 2021

amitdo closed this as completed Jun 30, 2021
