Tesseract LSTM 4.0: letters repeat in recognized text #884
Comments
This might be one more example where the old 3.x recognizer produces better results than 4.x with LSTM. See here for the related discussion.
I dug into this and found that the letter bounding boxes are narrower than they should be. Letter L: BoundingBox=(2, 48, 375, 230)
I also noticed doubled characters in the output, but they mostly (although not completely) disappear as soon as the model gets better (~ < 0.1%).
This is probably because the LSTM engine is trained on whole text lines rather than on separate letters. @theraysmith can clarify.
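To make the line-based idea concrete, here is a toy sketch of CTC-style greedy decoding, where a recognizer emits one label per frame across the line and repeats are merged afterwards. It is not Tesseract's actual decoder, and the frame sequences and the `_` blank symbol are made up for illustration. It only shows one way a doubled letter could survive: if neighbouring frames disagree on case, the two labels differ and are not merged. Whether that is what happens in this issue is an assumption.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen for this sketch only


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Merge runs of identical per-frame labels, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


# Made-up per-frame labels for the word "Lorem".
frames_consistent = list("LL_oo_rr_ee_mm")
# Same frames, except one frame of "o" was labelled "O" instead.
frames_case_flip = list("LL_oO_rr_ee_mm")

print(ctc_greedy_collapse(frames_consistent))  # Lorem
print(ctc_greedy_collapse(frames_case_flip))   # LoOrem
```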
I cannot reproduce the issue with the latest Tesseract or with release 4.1.1. Both produce
When I run the tesseract command-line program (Windows prebuilt binary, 4.0.0 alpha) on this image in LSTM mode, I get:
LoOrenm 1pPpSsSUlI
Why do the letters repeat? Stuttering?
In legacy Tesseract mode (oem=0), I get the correct text: Lorem ipsum
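For anyone trying to reproduce the comparison, a minimal sketch of running both engines from Python could look like the following. It assumes `tesseract` is on the PATH and that the installed traineddata supports the legacy engine (needed for `--oem 0`). The file name `lorem.png` and the helper names are placeholders, not part of the original report.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, oem):
    """Build a command that prints recognized text to stdout.

    oem 0 selects the legacy engine, oem 1 the LSTM engine.
    """
    return ["tesseract", image_path, "stdout", "--oem", str(oem)]


def run_ocr(image_path, oem):
    """Run tesseract and return its text output (helper name is hypothetical)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, oem),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Example usage (placeholder image name):
#   legacy_text = run_ocr("lorem.png", oem=0)
#   lstm_text = run_ocr("lorem.png", oem=1)
```

Comparing the two outputs side by side makes it easy to check whether the repeated letters are specific to the LSTM engine for a given build.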