How to improve multilingual OCR outcome? #3919

Closed
Kate-D-S opened this issue Sep 16, 2022 · 10 comments
Comments

@Kate-D-S

I need to OCR some poor-quality documents which contain different alphabets, e.g. German/Polish/English.
I ran OCR with all of the alphabets at first because I thought it would be faster (I mean: -l deu+pol+eng).

I noticed that some characters were misidentified, so I thought the result would be better after reducing the number of alphabets to those that actually appear in the document (I checked this manually).
But after reducing the languages, some characters are replaced incorrectly with other characters, even though the first identification was correct.

Example:

Current Behavior:

-l eng+pol+deu
the result: DÖNĘR

-l eng+deu
the result: DONER

Expected Behavior:

The correct spelling: DÖNER

Is there anything I could possibly do to improve the outcome? Any additional parameters I could use?
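
For reference, a minimal sketch of the invocations described above; the input file name scan.png and the output base out are assumptions, not taken from the report:

# all three languages combined
tesseract scan.png out -l eng+pol+deu

# reduced to the languages that actually appear in the document
tesseract scan.png out -l eng+deu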


Environment

  • Tesseract Version: 4.1.1
@stweil
Contributor

stweil commented Sep 16, 2022

I suggest using the latest supported Tesseract version, 5.2, with a model which supports all Latin-based languages, for example script/Latin (or script/Fraktur for historic texts).
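
A minimal sketch of such an invocation (the input file name scan.png is an assumption; Latin.traineddata needs to be present in the script/ subdirectory of the tessdata directory, e.g. downloaded from the tessdata_best or tessdata_fast repository):

tesseract scan.png out -l script/Latin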

@Kate-D-S
Author

Why do you think using scripts is better than single languages? How does that work? Are those models for whole alphabets?

@stweil
Contributor

stweil commented Sep 22, 2022

Yes, script/Latin is basically a combination of all languages which use Latin script (English, Spanish, Italian, French, German, Polish and more).

@Kate-D-S
Author

Is there any list where I can check what languages are covered by which script?

@amitdo
Collaborator

amitdo commented Oct 12, 2022

See here: tesseract-ocr/tessdata#62 (comment)

AFAIK, there is no full detailed list.

You can extract any traineddata file by using:

combine_tessdata -e ...

Then you can examine the unicharset file and the dawg file.
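
A hedged sketch of that workflow for the Latin model; the file names are assumptions, and the exact component names can differ between models:

# list the components contained in the traineddata file
combine_tessdata -d Latin.traineddata

# unpack all components into files prefixed with "Latin."
combine_tessdata -u Latin.traineddata Latin.

# turn the extracted word dawg back into a plain word list
dawg2wordlist Latin.lstm-unicharset Latin.lstm-word-dawg Latin.wordlist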

@Kate-D-S
Author

Kate-D-S commented Oct 13, 2022

Isn't that the answer to my own question, then?

https://github.com/tesseract-ocr/langdata_lstm/blob/main/script/Latin.langs.txt

@Kate-D-S
Author

Do you possibly know why a script could be better than a combination of single languages? (I mean, why could script/Latin be better than -l eng+deu+pol?) I mean some technical explanation :) of how Tesseract works.

@amitdo
Collaborator

amitdo commented Oct 13, 2022

Isn't that the answer to my own question, then?

https://github.com/tesseract-ocr/langdata_lstm/blob/main/script/Latin.langs.txt

I was not aware of this file :)

The other scripts do not have a langs.txt file.

@Kate-D-S
Author

All scripts have it :)
https://github.com/tesseract-ocr/langdata_lstm/tree/main/script
(at the bottom of the page)

@amitdo
Collaborator

amitdo commented Oct 13, 2022

:)

I didn't notice it. I looked into a few folders and didn't see a langs.txt file like the one the Latin folder has.
