How to improve multilingual OCR outcome? #3919
I suggest using the latest supported Tesseract version, 5.2, with a model that supports all Latin-based languages, for example the script/Latin model.
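A minimal sketch of such an invocation, assuming the model has been installed as script/Latin.traineddata under your tessdata directory and that the input image is named input.png:

tesseract input.png output -l script/Latin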
Why do you think using scripts is better than a single language? How does that work? Are those models for whole alphabets?
Yes.
Is there any list where I can check which languages are covered by which script?
See here: tesseract-ocr/tessdata#62 (comment). AFAIK, there is no complete detailed list. You can extract the components of any traineddata file.
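One way to do that, assuming the combine_tessdata utility that ships with Tesseract (the file name here is only an example):

combine_tessdata -u Latin.traineddata Latin.

This unpacks the model's components (unicharset, dawg files, and so on) into files prefixed with Latin. in the current directory.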
Then you can examine the unicharset file and the dawg file.
Isn't that an answer to my own question, then? https://github.com/tesseract-ocr/langdata_lstm/blob/main/script/Latin.langs.txt
Do you possibly know why a script model could be better than a combination of single languages? (I mean, why could script/Latin be better than -l eng+deu+pol?) I mean some technical explanation :) of how Tesseract works?
I was not aware of this file :) The other scripts do not have a langs.txt file.
All scripts have it :)
:) I didn't notice it. I looked into a few folders and didn't see a langs.txt like the Latin folder has.
I need to OCR some poor-quality documents which contain different alphabets, e.g. German/Polish/English.
I did OCR with all of the alphabets at first because I thought it would be faster (I mean: -l deu+pol+eng).
I noticed that some characters were misidentified, so I thought the result would be better after reducing the number of alphabets to those that actually appear in the document (I checked this manually).
But after reducing the languages, some characters are replaced incorrectly with other characters, even though the first identification was correct.
Example:
Current Behavior:
-l eng+pol+deu
the result:
DÖNĘR
-l eng+deu
the result:
DONER
Expected Behavior:
The correct spelling:
DÖNER
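For reference, the full invocations would be something along these lines (the image file name here is only a placeholder):

tesseract document.png output -l eng+pol+deu
tesseract document.png output -l eng+deu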
Is there anything I could possibly do to improve the outcome? Are there any additional parameters I could use?
Environment