Added best traineddatas for 4.00 alpha #62
The new files include two files for German Fraktur. Ray, it would be interesting to know how the training of the two new Fraktur traineddata files differs. Did they use different fonts, training material, or dictionaries? |
Related comment from Ray:
|
My guess is that the upper case traineddata files are for 'one script multi langs'. |
I'm currently working on the training documentation, before committing more
code, so as not to leave training broken for more than maybe an hour or so.
Here's a quick bullet list of what's going on:
- Initial capitals indicate the one model for all languages in that script,
so e.g. Latin covers all Latin-based languages except vie, which has its own
Vietnamese model. Most of the script models include English training data as well
as the script, but not Cyrillic, as that would have a major ambiguity
problem. Devanagari is hin+san+mar+nep+eng, and Fraktur is basically a
combination of all the Latin-based languages that have an 'old' variant,
etc. I would be interested to hear more feedback on the script models, such as
Stefan already provided for Fraktur.
- The tessdata directory doesn't have to be called tessdata any more, so
I was thinking of a structure that allows perhaps best, fast, and legacy as
separate directories or repos.
- I noticed git complain about the size of Latin.traineddata (~100 MB),
but haven't yet followed the pointer to Git's large-file support.
- The current code can run the 'best' models and the existing models,
but incremental and fine-tuning training will be tied to 'best' with a
future commit/push (due to a switch to the Adam optimizer and the move of the
unicharset/recoder).
- Fine-tuning/incremental training will not be possible from the 'fast'
models, as they are 8-bit integer. It will be possible to convert a tuned
'best' model to integer to make it faster, but some of the speed in 'fast' comes
from the smaller model (see the sketch after this list).
- It will be possible to add new characters by fine tuning! I got that
working yesterday, and just need to finish updating the documentation.
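The last two bullets describe a workflow that is not spelled out anywhere in this thread, so here is a minimal sketch of it under stated assumptions: it uses the standard combine_tessdata and lstmtraining tools with their documented flags, and all file names, the output directory, and the iteration count are placeholders rather than values taken from this discussion.

```python
import os
import subprocess

def run(cmd):
    """Run a training-tool command and fail loudly if it errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Hypothetical inputs: a 'best' (float) model and a list of .lstmf training files.
BEST = "eng.traineddata"           # float model from the 'best' set (assumption)
LSTM = "eng.lstm"                  # extracted LSTM component
TRAIN_LIST = "train_listfile.txt"  # list of .lstmf files prepared separately

os.makedirs("tuned", exist_ok=True)

# 1. Extract the float LSTM network from the 'best' traineddata.
run(["combine_tessdata", "-e", BEST, LSTM])

# 2. Fine-tune from the extracted model.
run(["lstmtraining",
     "--continue_from", LSTM,
     "--traineddata", BEST,
     "--train_listfile", TRAIN_LIST,
     "--model_output", "tuned/eng_tuned",
     "--max_iterations", "400"])

# 3. Stop training and write a tuned float traineddata from the checkpoint.
run(["lstmtraining", "--stop_training",
     "--continue_from", "tuned/eng_tuned_checkpoint",
     "--traineddata", BEST,
     "--model_output", "eng_tuned.traineddata"])

# 4. Optionally convert the tuned float model to 8-bit integer for speed.
run(["lstmtraining", "--stop_training", "--convert_to_int",
     "--continue_from", "tuned/eng_tuned_checkpoint",
     "--traineddata", BEST,
     "--model_output", "eng_tuned_fast.traineddata"])
```

The only difference between producing a tuned float ('best'-style) model and a tuned integer ('fast'-style) model in this sketch is the --convert_to_int flag in the final --stop_training step.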
|
New traineddata files: |
That's great! Then I can add missing characters (like the paragraph sign § for Fraktur) myself. Thank you, Ray. |
Ray, issue #65 lists two regressions for Fraktur (missing §, ß/B confusion in word list). |
FYI: The wordlists are generated files, so it isn't a good idea to modify
them, as the modifications will likely get overwritten in a future training.
To help prevent the ß/B confusion, the words that you want to lose from
the wordlists need to go in langdata/lang/lang.bad_words.
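As an editorial aside, here is a minimal sketch of what the bad_words mechanism amounts to on the user side, assuming plain one-word-per-line files; the file names are hypothetical and nothing here is taken from the actual training scripts.

```python
from pathlib import Path

def filter_wordlist(wordlist_path: str, bad_words_path: str, out_path: str) -> int:
    """Write a copy of the generated wordlist with all bad_words entries removed."""
    bad = set(Path(bad_words_path).read_text(encoding="utf-8").split())
    kept = [w for w in Path(wordlist_path).read_text(encoding="utf-8").splitlines()
            if w and w not in bad]
    Path(out_path).write_text("\n".join(kept) + "\n", encoding="utf-8")
    return len(kept)

if __name__ == "__main__":
    # e.g. drop 'auBer'-style B/ß confusions listed in a frk.bad_words file
    n = filter_wordlist("frk.wordlist", "frk.bad_words", "frk.wordlist.filtered")
    print(f"kept {n} words")
```

The filtered list would then still have to be converted to a dawg and combined back into the traineddata by the normal training pipeline, which is presumably what the next training run does automatically.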
Since I spotted the edits to the deu/frk wordlists before overwriting them,
I will put the deleted words in the bad_words lists, so my next run of
training will not contain them.
Looks like I also need to add § to the desired_characters.
I have not yet committed the new wordlists, desired_characters, etc., since I
discovered a bug. The RTL languages have their wordlists reversed, which
doesn't make sense. They should be plain text readable by someone who knows
the language, and the reversal should be done before the words are
converted to dawgs. I have the required change in the code already, but
haven't yet run the synthetic data generation.
|
The new files can be installed locally in I assume that older versions of Tesseract work with hierarchies of languages, too. Of course |
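For illustration only, a hedged sketch of pointing Tesseract 4 at such a locally installed directory; the directory name, language code, and image name are assumptions, while --tessdata-dir, -l, and --oem are the standard command-line options.

```python
import subprocess

# Hypothetical local layout: the downloaded *.traineddata files live in
# ./tessdata_best (the directory name is an assumption, not from this thread).
TESSDATA_DIR = "./tessdata_best"

# Run the LSTM engine (--oem 1) against the local data directory; 'frk' and
# the image name are placeholders.
subprocess.run([
    "tesseract", "page.png", "page",      # writes page.txt
    "--tessdata-dir", TESSDATA_DIR,
    "-l", "frk",
    "--oem", "1",
], check=True)

print(open("page.txt", encoding="utf-8").read())
```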
That is great. I was using but this is much easier :-) |
The training wiki changes say that new traineddata can be built by providing wordlists, yet here you mention that the wordlists are generated files. Can you explain whether user-provided wordlists override the ones in the traineddata, and how that would affect recognition? I haven't tried training with the new code yet. PS: I hope you have seen the language-specific feedback provided under issues in tessdata. |
http://usir.salford.ac.uk/44370/1/PID4978585.pdf |
@theraysmith commented on Aug 3, 2017
@theraysmith @jbreiden Can you confirm that the traineddata files in the GitHub repo are the result of this improved training? |
They aren't, because they were added in July 2017 – that is before that comment. |
What about tessdata_fast?
|
tessdata_fast changed the LSTM model, but not the word list or other components. I just looked for B/ß confusions. While deu.traineddata looks good (no B/ß confusions), frk.traineddata contains lots of them, for example auBer instead of außer. frk.traineddata also contains many words which are typically not printed in Fraktur: neither eBay nor PCMCIA is a word I would expect in old books or newspapers. |
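A possible way to reproduce that B/ß check (not taken from this thread): extract the LSTM word dawg, convert it back to a plain wordlist, and scan it for pairs that differ only in B versus ß. The component and tool names below follow standard combine_tessdata and dawg2wordlist usage, and the file names are placeholders.

```python
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# Extract the LSTM word dawg and unicharset from the traineddata, then turn
# the dawg back into a plain, human-readable wordlist.
run(["combine_tessdata", "-e", "frk.traineddata",
     "frk.lstm-word-dawg", "frk.lstm-unicharset"])
run(["dawg2wordlist", "frk.lstm-unicharset",
     "frk.lstm-word-dawg", "frk.wordlist.txt"])

# Flag words where replacing 'B' with 'ß' yields another word in the list,
# e.g. 'auBer' alongside 'außer'.
with open("frk.wordlist.txt", encoding="utf-8") as f:
    words = set(f.read().split())
suspects = sorted(w for w in words if "B" in w and w.replace("B", "ß") in words)
print("\n".join(suspects))
```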
@theraysmith can you update the |
From where can we download these trained data files for better accuracy? |
When I use this trained data on an image with Hindi text, it takes a long time to extract the text and does not give a fully accurate result. How can I reduce the response time? |
https://github.com/tesseract-ocr/tessdata/tree/3a94ddd47be0
@theraysmith, how should we present those 'best' files to our users?
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
Do you plan to push more updates to the best directory and/or to the root dir in the next few weeks?