How to improve multilingual OCR outcome? #3919

Closed
Kate-D-S opened this issue Sep 16, 2022 · 10 comments
Comments

@Kate-D-S

I need to OCR some poor-quality documents which contain different alphabets, e.g. German/Polish/English.
I ran OCR with all of the alphabets at first because I thought it would be faster (I mean: -l deu+pol+eng).

I noticed that some characters were misidentified, so I thought the result would be better after reducing the number of alphabets to those that actually appear in the document (I checked this manually).
But after reducing the languages, some characters are replaced incorrectly with other characters, even though the first identification was correct.

Example:

Current Behavior:

-l eng+pol+deu
the result: DÖNĘR

-l eng+deu
the result: DONER

Expected Behavior:

The correct spelling: DÖNER

Is there anything I could possibly do to improve the outcome? Any additional parameters I could use?
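
For reference, a minimal sketch of the invocations described above; the input file name scan.png and the output base out are assumptions, not taken from the report:

# all three languages combined
tesseract scan.png out -l eng+pol+deu

# reduced to the languages that actually appear in the document
tesseract scan.png out -l eng+deu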


Environment

  • Tesseract Version: 4.1.1
@stweil
Contributor

stweil commented Sep 16, 2022

I suggest using the latest supported Tesseract version, 5.2, with a model which supports all Latin-based languages, for example script/Latin (or script/Fraktur for historic texts).
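
A minimal sketch of such an invocation (the input file name scan.png is an assumption; Latin.traineddata needs to be present in the script/ subdirectory of the tessdata directory, e.g. downloaded from the tessdata_best or tessdata_fast repository):

tesseract scan.png out -l script/Latin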

@Kate-D-S
Author

Why do you think using scripts is better than single languages? How does that work? Are those models for whole alphabets?

@stweil
Contributor

stweil commented Sep 22, 2022

Yes, script/Latin is basically a combination of all languages which use Latin script (English, Spanish, Italian, French, German, Polish and more).

@Kate-D-S
Author

Is there any list where I can check what languages are covered by which script?

@amitdo
Collaborator

amitdo commented Oct 12, 2022

See here: tesseract-ocr/tessdata#62 (comment)

AFAIK, there is no full detailed list.

You can extract any traineddata file by using:

combine_tessdata -e ...

Then you can examine the unicharset file and the dawg file.
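
A hedged sketch of that workflow for the Latin model; the file names are assumptions, and the exact component names can differ between models:

# list the components contained in the traineddata file
combine_tessdata -d Latin.traineddata

# unpack all components into files prefixed with "Latin."
combine_tessdata -u Latin.traineddata Latin.

# turn the extracted word dawg back into a plain word list
dawg2wordlist Latin.lstm-unicharset Latin.lstm-word-dawg Latin.wordlist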

@Kate-D-S
Author

Kate-D-S commented Oct 13, 2022

Isn't that the answer to my own question, then?

https://github.com/tesseract-ocr/langdata_lstm/blob/main/script/Latin.langs.txt

@Kate-D-S
Author

Do you possibly know why a script could be better than a combination of single languages? (I mean, why could script/Latin be better than -l eng+deu+pol?) I mean some technical explanation :) of how Tesseract works.

@amitdo
Collaborator

amitdo commented Oct 13, 2022

Isn't that the answer to my own question, then?

https://github.com/tesseract-ocr/langdata_lstm/blob/main/script/Latin.langs.txt

I was not aware of this file :)

The other scripts do not have a langs.txt file.

@Kate-D-S
Author

All scripts have it :)
https://github.com/tesseract-ocr/langdata_lstm/tree/main/script
(at the bottom of the page)

@amitdo
Collaborator

amitdo commented Oct 13, 2022

:)

I didn't notice it. I looked into a few folders and didn't see a langs.txt file like the one the Latin folder has.
