
Added best traineddatas for 4.00 alpha #62

Open · amitdo opened this issue Aug 1, 2017 · 22 comments

amitdo commented Aug 1, 2017

https://github.com/tesseract-ocr/tessdata/tree/3a94ddd47be0

@theraysmith, how should we present those 'best' files to our users?
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

Do you plan to push more updates to the best directory and/or to the root dir in the next few weeks?

stweil commented Aug 1, 2017

The new files include two files for German Fraktur: best/Fraktur.traineddata and best/frk.traineddata. According to my first tests, both are better than the old deu_frak.traineddata and much better than the old frk.traineddata. There is no clear winner between the two new files: in some cases -l Fraktur gives better results, in other cases -l frk is better. Even a 3.05-based Fraktur model is still better for some words, but in general the new LSTM-based models win.

Ray, it would be interesting to know the training differences between the two new Fraktur traineddata files. Did they use different fonts, training material, or dictionaries?
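
A minimal way to reproduce this kind of comparison, assuming both new models are installed in the active tessdata directory and page.png is a sample Fraktur scan (both names are illustrative, not from the thread):

```bash
# Run the same scan through both new Fraktur models and diff the output.
tesseract page.png out_Fraktur -l Fraktur   # script-level model
tesseract page.png out_frk -l frk           # language-level model
diff out_Fraktur.txt out_frk.txt            # see where the models disagree
```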

amitdo commented Aug 1, 2017

Related comment from Ray:
tesseract-ocr/tesseract#995 (comment)

> Two parallel sets of tessdata: "best" and "fast". "Fast" will exceed the speed of legacy Tesseract in real time, provided you have the required parallelism components, and in total CPU will be only slightly slower for English. It is way faster for most non-Latin languages, while being <5% worse than "best". Only "best" will be retrainable, as "fast" will be integer.
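
A rough way to check the speed/accuracy trade-off Ray describes, assuming the two variants are downloaded into separate directories (directory names and page.png are illustrative):

```bash
# Time the float ("best") and integer ("fast") models on the same image.
time tesseract --tessdata-dir ./tessdata_best page.png out_best -l eng
time tesseract --tessdata-dir ./tessdata_fast page.png out_fast -l eng
diff out_best.txt out_fast.txt   # gauge how much accuracy the speed costs
```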

amitdo commented Aug 1, 2017

My guess is that the uppercase traineddata files are 'one script, multiple languages' models.

theraysmith commented Aug 1, 2017 via email

Shreeshrii commented Aug 1, 2017 via email

amitdo commented Aug 1, 2017

New traineddata files:
Arabic.traineddata
Armenian.traineddata
Bengali.traineddata
Canadian_Aboriginal.traineddata
Cherokee.traineddata
Cyrillic.traineddata
Devanagari.traineddata
Ethiopic.traineddata
Fraktur.traineddata
Georgian.traineddata
Greek.traineddata
Gujarati.traineddata
Gurmukhi.traineddata
HanS.traineddata
HanS_vert.traineddata
HanT.traineddata
HanT_vert.traineddata
Hangul.traineddata
Hangul_vert.traineddata
Hebrew.traineddata
Japanese.traineddata
Japanese_vert.traineddata
Kannada.traineddata
Khmer.traineddata
Lao.traineddata
Latin.traineddata
Malayalam.traineddata
Myanmar.traineddata
Oriya.traineddata
Sinhala.traineddata
Syriac.traineddata
Tamil.traineddata
Telugu.traineddata
Thaana.traineddata
Thai.traineddata
Tibetan.traineddata
Vietnamese.traineddata
bre.traineddata
chi_sim_vert.traineddata
chi_tra_vert.traineddata
cos.traineddata
div.traineddata
fao.traineddata
fil.traineddata
fry.traineddata
gla.traineddata
hye.traineddata
jpn_vert.traineddata
kor_vert.traineddata
kur_ara.traineddata
ltz.traineddata
mon.traineddata
mri.traineddata
oci.traineddata
que.traineddata
snd.traineddata
sun.traineddata
tat.traineddata
ton.traineddata
yor.traineddata

stweil commented Aug 2, 2017

> It will be possible to add new characters by fine tuning!

That's great! Then I can add missing characters (like the paragraph sign § for Fraktur) myself. Thank you, Ray.
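
For reference, a hedged sketch of how such fine-tuning could look, following the documented lstmtraining workflow; the list file, the iteration count, and the frk/frk.traineddata starter file (a traineddata whose unicharset already contains the new character) are assumptions for illustration, not commands from this thread:

```bash
# 1. Extract the LSTM model from the float ("best") traineddata.
combine_tessdata -e tessdata/best/frk.traineddata frk.lstm

# 2. Fine-tune it on lines containing the new character (e.g. §).
#    frk.training_files.txt lists .lstmf files rendered from such text
#    (assumed name); frk/frk.traineddata must carry the extended unicharset.
lstmtraining \
  --continue_from frk.lstm \
  --old_traineddata tessdata/best/frk.traineddata \
  --traineddata frk/frk.traineddata \
  --model_output frk_plus \
  --train_listfile frk.training_files.txt \
  --max_iterations 400

# 3. Turn the resulting checkpoint back into a usable traineddata file.
lstmtraining \
  --stop_training \
  --continue_from frk_plus_checkpoint \
  --traineddata frk/frk.traineddata \
  --model_output frk_new.traineddata
```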

stweil commented Aug 2, 2017

Ray, issue #65 lists two regressions for Fraktur (missing §, ß/B confusion in word list).

theraysmith commented Aug 3, 2017 via email

stweil commented Aug 4, 2017

The new files can be installed locally in tessdata/best and used like this: tesseract ... -l best/eng. That way we can preserve the current directory structure (also when fast is added), and there is no need to rename best/eng.traineddata to best_eng.traineddata in local installations.

I assume that older versions of Tesseract work with hierarchies of languages, too.
That offers new possibilities: the rather lengthy list of languages could be organized in folders, for example for Latin-based languages, Indic languages, etc.

Of course tesseract --list-langs should be improved to search recursively for language files.
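
A sketch of the layout described above; the installation path is illustrative, so adjust it to your own tessdata directory:

```bash
# Keep the new models in a best/ subdirectory instead of renaming them.
mkdir -p /usr/share/tesseract-ocr/tessdata/best
cp eng.traineddata /usr/share/tesseract-ocr/tessdata/best/

tesseract page.png out -l best/eng   # picks up tessdata/best/eng.traineddata
tesseract page.png out2 -l eng       # still uses the regular model
```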

Shreeshrii commented:

> used like this: tesseract ... -l best/eng

That is great.

I was using --tessdata-dir ../../../tessdata/best

but this is much easier :-)

Shreeshrii commented:

> FYI: The wordlists are generated files, so it isn't a good idea to modify
> them, as the modifications will likely get overwritten in a future training.

@theraysmith

The training wiki changes say that new traineddata can be built by providing wordlists. Here you mention that they are generated.

Can you explain whether user-provided wordlists override the ones in traineddata, and how that would impact recognition?

I haven't tried training with the new code yet.

PS: I hope you have seen the language-specific feedback provided under issues in tessdata.
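
One way to inspect the wordlist packed into a traineddata file, using tools shipped with Tesseract (the frk example is illustrative; output file names follow combine_tessdata's unpacking convention):

```bash
combine_tessdata -u frk.traineddata frk.     # unpack all components to frk.*
dawg2wordlist frk.lstm-unicharset frk.lstm-word-dawg frk.wordlist
less frk.wordlist                            # browse the generated word list
```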

amitdo commented Aug 4, 2017

amitdo commented Dec 15, 2017

http://usir.salford.ac.uk/44370/1/PID4978585.pdf
ICDAR2017 Competition on Recognition of Early Indian Printed Documents – REID2017

Shreeshrii commented:

@theraysmith commented on Aug 3, 2017:

> I have the required change in the code already, but haven't yet run the synthetic data generation.
>
> I will put the deleted words in the bad_words lists, so my next run of training will not contain them.

@theraysmith @jbreiden Can you confirm that the traineddata files in the GitHub repo are the result of this improved training?

stweil commented May 25, 2018

They aren't, because they were added in July 2017 – that is before that comment.

Shreeshrii commented:

What about tessdata_fast?

Initial import to github (on behalf of Ray), committed by Jeff Breidenbach on Sep 15, 2017

stweil commented May 25, 2018

tessdata_fast changed the LSTM model, but not the word list and other components. I just looked for B/ß confusions. While deu.traineddata looks good (no B/ß confusions), frk.traineddata contains lots of them, for example auBer instead of außer. frk.traineddata also contains lots of words which are typically not printed in Fraktur: neither eBay nor PCMCIA are words I would expect in old books or newspapers.
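
A quick heuristic check for such entries, run on a wordlist extracted as in the earlier sketch (the grep pattern is illustrative and not exhaustive):

```bash
# Words with an uppercase B between lowercase letters, e.g. auBer.
grep -E '^[[:lower:]]+B[[:lower:]]+$' frk.wordlist | head
# Count modern words one would not expect in Fraktur material.
grep -Ec 'eBay|PCMCIA' frk.wordlist
```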

ghost commented Jun 11, 2018

@theraysmith, can you update langdata/ara?

kmprerna commented:

> New traineddata files: (the same list quoted in full earlier in this thread)

From where can we download these traineddata files for better accuracy?

kmprerna commented:

When I use this traineddata on an image with Hindi text, it takes a long time to extract the text and does not give a 100% accurate result. How can I reduce the response time?
