List of the 10 training fonts? #1

Open
koch-aai opened this issue Oct 1, 2018 · 6 comments

@koch-aai

koch-aai commented Oct 1, 2018

Thanks for uploading this trained model - could you possibly provide some info about the training data?

Specifically the fonts used and the average string length. Has this been tested on SVHN by any chance?

Thanks!

@Shreeshrii
Owner

Shreeshrii commented Oct 1, 2018

  1. It has NOT been tested at all. It is a proof-of-concept finetuning run. Users are encouraged to finetune for their own use case, fonts, etc.

  2. The training text used for the latest training version, which includes commas, is at
    https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text

  3. Modified versions of eng.punc and eng.numbers have been used. These could be modified further based on user requirements and might yield minor improvements in recognition. The files used can be compared with the ones in langdata/eng and are available at
    https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.punc
    https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.numbers

  4. The fonts used for the last training were those listed in src/training/language_specific.sh for Latin script (with minor modifications). Again, finetuning with the fonts used in the images to be OCRed will lead to better accuracy.
    Here is the list used:

./digits/eng.Arial_Bold.exp0.lstmf
./digits/eng.Arial_Bold_Italic.exp0.lstmf
./digits/eng.Arial.exp0.lstmf
./digits/eng.Arial_Italic.exp0.lstmf
./digits/eng.Courier_New_Bold.exp0.lstmf
./digits/eng.Courier_New_Bold_Italic.exp0.lstmf
./digits/eng.Courier_New.exp0.lstmf
./digits/eng.Courier_New_Italic.exp0.lstmf
./digits/eng.FreeMono.exp0.lstmf
./digits/eng.FreeSans.exp0.lstmf
./digits/eng.FreeSerif.exp0.lstmf
./digits/eng.Georgia_Bold.exp0.lstmf
./digits/eng.Georgia_Bold_Italic.exp0.lstmf
./digits/eng.Georgia.exp0.lstmf
./digits/eng.Georgia_Italic.exp0.lstmf
./digits/eng.Times_New_Roman_Bold.exp0.lstmf
./digits/eng.Times_New_Roman_Bold_Italic.exp0.lstmf
./digits/eng.Times_New_Roman.exp0.lstmf
./digits/eng.Times_New_Roman_Italic.exp0.lstmf
./digits/eng.Trebuchet_MS_Bold.exp0.lstmf
./digits/eng.Trebuchet_MS_Bold_Italic.exp0.lstmf
./digits/eng.Trebuchet_MS.exp0.lstmf
./digits/eng.Trebuchet_MS_Italic.exp0.lstmf
./digits/eng.Verdana_Bold.exp0.lstmf
./digits/eng.Verdana_Bold_Italic.exp0.lstmf
./digits/eng.Verdana.exp0.lstmf
./digits/eng.Verdana_Italic.exp0.lstmf
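
The file names in this list appear to follow a `<lang>.<Font_Name>.exp<N>.lstmf` pattern, so the font names can be read back out of them. The sketch below is only an illustration of that: the function name `font_name` and the shortened sample list are assumptions, not part of the training scripts.

```python
from pathlib import PurePosixPath

# A few entries copied from the list above (shortened for the example).
LSTMF_PATHS = [
    "./digits/eng.Arial_Bold.exp0.lstmf",
    "./digits/eng.Arial.exp0.lstmf",
    "./digits/eng.FreeMono.exp0.lstmf",
    "./digits/eng.Times_New_Roman_Bold_Italic.exp0.lstmf",
]


def font_name(lstmf_path: str) -> str:
    """Return the font name embedded in '<lang>.<Font_Name>.exp<N>.lstmf'.

    Hypothetical helper: it assumes exactly four dot-separated fields,
    which holds for every file name in the list above.
    """
    parts = PurePosixPath(lstmf_path).name.split(".")
    if len(parts) != 4 or parts[3] != "lstmf":
        raise ValueError(f"unexpected file name: {lstmf_path}")
    # Underscores in the file name stand in for spaces in the font name.
    return parts[1].replace("_", " ")


if __name__ == "__main__":
    for path in LSTMF_PATHS:
        print(font_name(path))
```
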

An earlier training with 10 fonts used only the non-italic versions of the fonts and did not include the free fonts (FreeMono, FreeSans, FreeSerif).
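
For readers who want to finetune on their own fonts as suggested, a rough sketch of assembling an `lstmtraining` call might look like the following. It only builds and prints the command and does not run it. The flags are the ones used for finetuning in the Tesseract 4 training documentation, but every path here (`eng.lstm`, `eng.traineddata`, `my_fonts.training_files.txt`, `output/digits_custom`) is a hypothetical placeholder, the iteration count is arbitrary, and `finetune_command` is an illustrative helper rather than anything from this repository.

```python
import shlex
from typing import List


def finetune_command(
    base_lstm: str,
    base_traineddata: str,
    list_file: str,
    output_prefix: str,
    max_iterations: int = 400,
) -> List[str]:
    """Build (but do not run) an lstmtraining finetune command.

    base_lstm is assumed to be the LSTM model extracted from an existing
    traineddata file (for example with `combine_tessdata -e`), and
    list_file a text file naming one .lstmf file per line.
    """
    return [
        "lstmtraining",
        "--continue_from", base_lstm,
        "--traineddata", base_traineddata,
        "--train_listfile", list_file,
        "--model_output", output_prefix,
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    cmd = finetune_command(
        base_lstm="eng.lstm",                       # hypothetical path
        base_traineddata="eng.traineddata",         # hypothetical path
        list_file="my_fonts.training_files.txt",    # hypothetical path
        output_prefix="output/digits_custom",       # hypothetical path
    )
    # Print a copy-pasteable shell line instead of executing anything.
    print(shlex.join(cmd))
```

The `.lstmf` files named in the list file would first have to be generated from images rendered in the target fonts; that step is not shown here.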

@Bech007

Bech007 commented Sep 10, 2019

Hi @Shreeshrii,
I have a .txt dataset that I want to train Tesseract on, but I don't know how to do that.
Thanks

@myjun1124

This digits traineddata was useful to me. After preprocessing, it worked well for recognizing the temperature shown on a thermal camera's screen.

@Shreeshrii
Owner

Thanks for the comment @myjun1124. Glad to know it worked for you.

@nikhilcms

Hi @Shreeshrii, I found that Tesseract 4.1.1 works well for extracting words, but it often fails to extract digits (specifically bold ones). How can I solve this issue?
