Tag a new version for LSTM 4.0 #995
Comments
It would be good to decide about using semantic versioning soon. Maybe it can be used for the next tag. |
I have not seen any comments against semver.
It may be good to set up some kind of automatic update that increases the PATCH
version based on the commit count, to reduce manual administrative updates.
@stweil From what I have read about semver, if you were to implement the
zipped traineddata and related changes, it should cause a change in MINOR
version.
So, with that should it be 4.1.0alpha ?
```
Given a version number MAJOR.MINOR.PATCH, increment the:
MAJOR version when you make incompatible API changes,
MINOR version when you add functionality in a backwards-compatible manner,
and
PATCH version when you make backwards-compatible bug fixes.
Additional labels for pre-release and build metadata are available as
extensions to the MAJOR.MINOR.PATCH format.
```
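As a rough illustration of the rule quoted above, here is a minimal Python sketch. The `bump` helper and its change labels (`"breaking"`, `"feature"`, `"fix"`) are hypothetical names chosen for this example; they are not part of Tesseract or of any semver tool.

```python
def bump(version: str, change: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version for a given kind of change.

    `change` is a hypothetical label used only in this sketch.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "breaking":   # incompatible API change
        return f"{major + 1}.0.0"
    if change == "feature":    # backwards-compatible functionality
        return f"{major}.{minor + 1}.0"
    if change == "fix":        # backwards-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")
```

Under this reading, a backwards-compatible feature added on top of 4.0.0 would give 4.1.0, which is the case asked about above.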
|
The first version 4 release will be 4.0.0. What 4.1.0alpha are you talking about? We don't care about changes in dev branches. |
We could tag the current release as a pre-release or as a release candidate. According to semver.org, it could be called something like |
OK. Still, it will be good to have new tags when changes are substantial enough from previous commits. For example,
That said, I have only done some cursory reading regarding semver. So, I am happy with whatever tag/version is used, as long as there is some demarcation. The reason for asking for this is that people are using/trying to use master branch/4.0/LSTM and ask questions, where the version info says -alpha or -dev, and it is difficult to try and figure out what the issue is without knowing the version being used. |
I vote for this format which includes date - easy to identify which version is more recent.
|
An example of how 4.00.00alpha is NOT compatible with the current master branch eg. --oem options. |
@theraysmith, can you give us an update on your work? When are we going to see it? |
Hi, same: can you give us an update on your work? When are we going to see 4.0 released? |
+1 for a new tag. Since Ray does not reply, I suggest to still use 'alpha'.
|
@zdenop, can you do it, or at least add your comment here? |
I'm about ready to update the traineddatas. I have a training run almost
complete, and with accuracy that meets with my satisfaction.
There are a few regressions, but not too serious.
First though, I have to get some code reviewed in Google, and then make
some commits to github to match the new traineddatas.
Before that, there is the matter of a major pull...
Here's what's coming:
- Fix to issue 653: New components in traineddata file for the
unicharset, recoder and version string. Backwards compatible change, so the
LSTM component can still read older files.
- Change in training system. The above change makes open source training
impossible. Will add a new program to build a starter traineddata from a
unicharset and optional word lists.
- New "normalization" code to clean corpus text in all languages. That
was a big part of the work.
- Improvements to the trained networks to improve accuracy on single
characters and single words.
- 2 parallel sets of tessdata. "best" and "fast". "Fast" will exceed the
speed of legacy Tesseract in real time, provided you have the required
parallelism components, and in total CPU only slightly slower for English.
Way faster for most non-latin languages, while being <5% worse than "best"
Only "best" will be retrainable, as "fast" will be integer.
I have other stuff that is still incomplete, but that is a good list for
now.
BTW, in case you hadn't noticed, there was a breaking change that made old
lstmf files unusable. That was needed to fix LSTM for OSD. It has to know
the language of each training sample.
The new traineddatas will mostly be smaller than the older ones, as they
won't contain the legacy components, and no bigram dawgs are needed.
--
Ray.
|
Superb. Anything we could do to help you ? Cheers. |
@theraysmith Thanks for the update. Look forward to it. Any estimate of expected date? @zdenop I think this is a good reason to freeze the 'alpha' state by tagging the repo with the current version as 4.0.0-alpha.YYYYMMDD, since Ray is going to be making major changes. |
That's good news.
If I got that right, it would be horrible. Being able to create new traineddata is essential for me. |
@Shreeshrii: I do not understand what you want. A tag will not freeze anything. A tag is just a specific point in history that marks something important (e.g. a new version). Tagging should be driven by a developer who knows the roadmap, not by users... |
Exactly my point :-) When Ray makes his next set of commits, that will change the codebase as well as traineddata substantially. I am sure it will be tagged by Ray at that time, probably as a beta or release candidate. My request to you to tag current commit (as an example) is to mark a point in history where a lot of development has taken place since the original 4.00.00alpha tag. In fact, that original tag just marked the start of the 4.00.00alpha development and many bugs in that original tag (missing lstm.train file etc.) have been fixed later. Also, if the new changes by Ray will not allow for open source training :-( then the current github version will be the one which allows users to do their own training. So, it is certainly deserving of a tag in my opinion :-) |
Open source training:
OK, I overstated it a bit.
One of my commits will temporarily break the training process. After doing
so, I will correct the documentation and add the new tool (which I have
already written) as quickly as possible after.
To help:
No more breaking commits! If it doesn't produce perfect results on
phototest, it broke something!
Cutting down on the code cleanup while I am working on it will also help.
When I have committed the new corpus cleanup code, it would be useful to
have any experts in any of the following scripts review the code and make
comments:
Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada,
Malayalam, Sinhala, Thai, Myanmar, Khmer.
There are script-specific cleanup rules in there.
Since I plan to commit new copies of the training data (unicharsets,
wordlists, training text etc) then at that point they will match
Dates:
I was going to get started this week, but now I have to debug my pull from
github, which has broken tests (of the legacy engine), so that will take
time to fix. I'm hoping it's simple, but it is bizarre.
Even when it is fixed, there are 1500 lines of change from github for
someone here to review.
I *really* want to get 4.00 finished (in beta) in the next 5-6 weeks.
--
Ray.
|
When I have committed the new corpus cleanup code, it would be useful to
have any experts in any of the following scripts review the code and make
comments:
Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada,
Malayalam, Sinhala, Thai, Myanmar, Khmer.
There are script-specific cleanup rules in there.
What kind of expertise do you need regarding the Indic scripts?
ShreeDevi
|
The code determines what makes a valid/invalid sequence of unicodes in the
script, for instance, is it allowed to have two matras in a row? It gets
more difficult with questions over what category the additional characters
are.
--
Ray.
|
No, it is not valid to have any two matras in a row - Devanagari 093E-094C.
However, these can be followed by Anusvar, Chandrabindu or Visarga, i.e. 0901-0903.
In case of Vedic Sanskrit, these can be followed by the Vedic accents, e.g. 0951, 0952, 1CDA etc.
However, I have seen samples in legacy fonts where a number of separate matras are used to create another one, e.g. using unicode points as example, 093E followed by 0947 to create 094b - ा े to make ो
Similarly in legacy fonts, half letters (letter followed by virama) may be followed by aa maatraa to create the complete letter in cases such as ga, sha etc., i.e. 0936 + 094D + 093E to create 0936 for sha.
It is possible that some converters from legacy font to unicode retain these errors.
Also, in case of Vedic Sanskrit, the valid order should be matra, combining mark (anusvar, visarga), vedic accent. Some fonts incorrectly use matra, vedic accent and combining mark, which will lead to a dotted circle, e.g. अंशाः॑ vs अंशा॑ः
For a sample of Vedic Sanskrit and its ground truth, see
Will your new Sanskrit traineddata be able to OCR this? |
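The two Devanagari ordering rules described in the comment above could be checked along the following lines. This is only a sketch under stated assumptions: it covers just the code point ranges named in that comment (matras U+093E to U+094C, combining marks U+0901 to U+0903, Vedic accents U+0951, U+0952 and U+1CDA), the function names are made up for this example, and it is not the corpus cleanup code Ray refers to.

```python
# Code point sets taken from the comment above; real text needs more cases.
MATRAS = range(0x093E, 0x094D)          # U+093E..U+094C (dependent vowel signs)
COMBINING_MARKS = range(0x0901, 0x0904)  # chandrabindu, anusvara, visarga
VEDIC_ACCENTS = {0x0951, 0x0952, 0x1CDA}  # only the accents named in the thread


def has_consecutive_matras(text: str) -> bool:
    """True if two matras appear directly after one another."""
    return any(
        ord(a) in MATRAS and ord(b) in MATRAS
        for a, b in zip(text, text[1:])
    )


def has_misordered_accent(text: str) -> bool:
    """True if a Vedic accent is directly followed by anusvara/visarga.

    The expected order is matra, combining mark, Vedic accent.
    """
    return any(
        ord(a) in VEDIC_ACCENTS and ord(b) in COMBINING_MARKS
        for a, b in zip(text, text[1:])
    )
```

For example, the legacy-font sequence 093E followed by 0947 would be flagged by `has_consecutive_matras`, while the single code point 094B would not.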
Will you remove the code of the legacy engine in this round? |
On Wed, Jul 12, 2017 at 9:39 PM, Shreeshrii ***@***.***> wrote:
> No, it is not valid to have any two matras in a row - Devanagari 093E-094C.
> However, these can be followed by Anusvar, Chandrabindu or Visarga i.e.
> 0901-0903
It seems that Malayalam is unique in allowing multiple 0d02 (Anusvara)?
> In case of Vedic Sanskrit, these can be followed by the Vedic accents eg.
> 0951, 0952, 1CDA etc
> However, I have seen samples in legacy fonts where a number of separate
> matras are used to create another one eg. 093E followed by 0947 to create
> 094b
These are specifically disallowed by Unicode, but the rules seem to be
very script-specific, and not very consistently documented in the Unicode
standard. I don't think the rules are addressed properly for all scripts.
> Similarly in legacy fonts, half letters (letter followed by virama) maybe
> followed by aa maatraa to create the complete letter in cases such as ga,
> sha etc. i.e. 0936 + 094D + 093E to create 0936 for sha
> It is possible that some converters from legacy font to unicode retain
> these errors.
> Also, in case of Vedic Sanskrit, the valid order should be matra,
> combining mark (anusvar, visarga), vedic accent . Some fonts incorrectly
> use matra, vedic accent and combining mark which will lead to dotted
> circle. eg. अंशाः॑ vs अंशा॑ः
The code aims to disallow text designed for such legacy fonts.
The documentation that I have found is very good for Devanagari, but
lacking for some of the other scripts.
For instance, there is a big table in the Unicode standard for Myanmar
(http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf), but it doesn't cover
any of the extension Myanmar characters, and isn't explicit about whether
the table represents a specific valid order or not. The existence of a lot
of legacy Myanmar text on the web that is designed for non-compliant fonts
makes it harder to determine whether the filter is correct.
--
Ray.
|
That is still an open question.
I have limited time to spend on it (therefore resistant to delaying tactics
changing types in the dead code to POSIX).
Whether enough uses of Tesseract can be covered by the new engine is still
being debated, and the new models that I have need to be evaluated before
enough of the community is convinced.
I accept the requirement to add one or more new characters without the need
for full retraining, and will not delete the legacy code until that need is
addressed. (I think it can be done).
The legacy code is used by the OSD model and deletion of the legacy code is
also blocked by a good enough replacement.
--
Ray.
|
That does not sound right. Please see
I did a search on ംം (two anusvarams in Malayalam script) and most of the hits in the search results are PDFs.
FYI, PDFs created from documents with text in Unicode fonts for complex scripts do not save the Unicode text correctly. Devanagari text copied from these PDFs is not correct. I assume the same holds for Malayalam and other Indian scripts, and that might be causing this double anusvar problem.
Newer PDFs created in a special manner, e.g. with 'actual text' with xelatex, are OK (e.g. http://sanskritdocuments.org/doc_devii/annapurna.pdf), but those created by various other software are not (http://www.sanskritweb.net/sansdocs/nala-d.pdf). @jbreiden can give you the technical reasoning for this.
Google search does show PDFs as part of the search results, so there is some internal OCR (is it tesseract???) being done on the PDFs, books etc. as part of the search process. But it may not be fully correct.
So for the training corpus, I would suggest avoiding text taken from PDFs (in case it is being used). |
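@theraysmith Regarding Malayalam, double anusvara:
Please see
http://unicode.org/charts/PDF/U0D00.pdf
http://www.omniglot.com/language/numbers/malayalam.htm
Zero in Malayalam script (pujyam) looks very much like the sign for anusvar.
Also, there are different anusvars shown in the Unicode chart:
0D00 $ഀ MALAYALAM SIGN COMBINING ANUSVARA ABOVE
0D02 $ം MALAYALAM SIGN ANUSVARA
• used in Prakrit language texts to indicate gemination of the following consonant
0D3B $഻ MALAYALAM SIGN VERTICAL BAR VIRAMA
0D3C $഼ MALAYALAM SIGN CIRCULAR VIRAMA
I will look up more info and post under an issue in langdata |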
Direct from the unicode standard:
Anusvara. The anusvara can be seen multiple times after vowels, whether
independent letters or dependent vowel signs, as in vxxxx <0D08, 0D02,
0D02, 0D02, 0D02>. Vowel signs can also be seen after digits, as in 355wx
<0033, 0035, 0035, 0D3E, 0D02>. More generally, rendering engines should be
prepared to handle Malayalam letters (including vowel letters), digits
(both European and Malayalam), dashes, U+00A0 no-break space and U+25CC
dotted circle as base characters for the Malayalam vowel signs, U+0D4D
malayalam sign virama, U+0D02 malayalam sign anusvara, and U+0D03 malayalam
sign visarga. They should also be prepared to handle multiple combining
marks on those bases.
Is it wrong?
On Fri, Jul 14, 2017 at 12:00 AM, Shreeshrii ***@***.***> wrote:
> @theraysmith <https://github.com/theraysmith> Regarding Malayalam, double
> anusvara
> Please see
> http://unicode.org/charts/PDF/U0D00.pdf
> http://www.omniglot.com/language/numbers/malayalam.htm
> zero in Malayalam script - pujyam looks very much like the sign for
> anusvar.
> Also, there are different anusvars shown in unicode chart--
> 0D00 $ഀ MALAYALAM SIGN COMBINING ANUSVARA ABOVE
> 0D02 $ം MALAYALAM SIGN ANUSVARA
> • used in Prakrit language texts to indicate gemination of the following
> consonant
> 0D3B $഻ MALAYALAM SIGN VERTICAL BAR VIRAMA
> 0D3C $഼ MALAYALAM SIGN CIRCULAR VIRAMA
> I will look up more info and post under an issue in langdata
--
Ray.
|
Very Sorry! I misread the dashboards. Looks like the slightly older code |
|
That's desired. GitHub also allows marking such releases as pre-release – just edit the release information of the new release. That should minimize problems for other people. The release of today would be 4.0.0-alpha.20180302. |
ok. but do we expect more code/fixes to come for 4.0 release?
On Fri, 2. 3. 2018, 7:23, Stefan Weil wrote:
> > Tagging repo will cause release in github [...]
> That's desired. GitHub also allows marking such releases as *pre-release*
> – just edit the release information of the new release.
> The release of today would be *4.0.0-alpha.20180302*.
|
Yes, why not? I don't plan to stop sending code / fixes. :-), other people will continue sending fixes, too. So either we'll have a 4.0.0-alpha.20180401, or a 4.0.0 without alpha, or a 4.0.1, or Ray sends a bunch of code which justifies a 4.1.0, ... |
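To make the ordering of the tags mentioned above concrete, here is a small Python sketch of a dated alpha tag and a sort key that follows the precedence rules on semver.org as I understand them (a pre-release ranks below its release, numeric identifiers compare numerically and rank below alphanumeric ones). The helper names `alpha_tag` and `precedence_key` are hypothetical, and build metadata (`+...`) is deliberately not handled.

```python
from datetime import date


def alpha_tag(core: str, day: date) -> str:
    """Build a tag such as 4.0.0-alpha.20180302 from a core version and a date."""
    return f"{core}-alpha.{day:%Y%m%d}"


def precedence_key(tag: str):
    """Sort key approximating semver precedence (no build metadata support)."""
    core, _, pre = tag.partition("-")
    core_key = tuple(int(part) for part in core.split("."))
    if not pre:
        # A plain release sorts after every pre-release of the same core.
        return (core_key, 1, ())
    identifiers = tuple(
        (0, int(part), "") if part.isdigit() else (1, 0, part)
        for part in pre.split(".")
    )
    return (core_key, 0, identifiers)


tags = ["4.0.1", "4.0.0", "4.0.0-alpha.20180401", "4.1.0", "4.0.0-alpha.20180302"]
ordered = sorted(tags, key=precedence_key)
```

With a date in the pre-release part, a later alpha always sorts after an earlier one, and all alphas sort before 4.0.0 itself.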
@jbreiden Please check with Ray. Thanks! |
I would prefer Ray to speak for himself, too! However, I don't think there will be large Tesseract changes from him in either short or medium term. |
Zdenko, I also think we should finally release 4.0.0. It's time to get rid of the alpha status. |
If you decide to release it soon, don't forget to first update |
Ha! Looks like they took |
Jeff,
is there any info from Ray about 4.00 release? Or at least how to tag
"Ubuntu" release (4.00RC1, 4.00beta?...)?
Zdenko
2018-03-03 5:21 GMT+01:00 jbreiden:
> Ha! Looks like they took 40f4311 after all, one day after deadline.
> https://launchpad.net/ubuntu/+source/tesseract
|
We should make the decision ourselves. What about this proposal:
https://semver.org/ Mark any non final 4.0.0 as 'pre-release'. https://help.github.com/articles/creating-releases/
For each (pre-)release, update |
No info.
Millions of people will use commit 40f4311 because of Ubuntu, and I think the main benefit of a tag is to help understand bug reports coming from these users. There have been many good tag proposals in this thread from @amitdo and @stweil and @zdenop and @WilliamTambellini. I don't have a strong opinion about which one is best. If I was forced to choose, I'd probably tag commit 40f4311 with |
Jeff, the traineddata files have a version string of 4.00.00alpha with a date (062917 if I remember correctly). tesseract also reports version of 4.00.00alpha. Will it be possible to change these in the Ubuntu 18.04 packages now? |
No more changes possible. Everything will look exactly as described here: #995 (comment) |
done.
Zdenko
2018-03-09 19:25 GMT+01:00 jbreiden:
> > is there any info from Ray about 4.00 release?
> No info. Ray is very busy with other work, so I don't expect major changes
> from him in short or medium term.
> > Or at least how to tag "Ubuntu" release (4.00RC1, 4.00beta?...)?
> Millions of people will use commit 40f4311
> because of Ubuntu, and I think the main benefit of a tag is to help
> understand bug reports coming from these users. There have been many good
> tag proposals in this thread from @amitdo <https://github.com/amitdo> and
> @stweil <https://github.com/stweil> and @zdenop
> <https://github.com/zdenop> and @WilliamTambellini
> <https://github.com/williamtambellini>. I don't have a strong opinion
> about which one is best. If I was forced to choose, I'd probably tag commit
> 40f4311
> with 4.0.0-beta.1. If that feels like too much commitment, then use a very
> specific tag like ubuntu18.04. Whatever is chosen, I think it makes sense
> to apply the same tag to the fast training data at commit 0e00fe6.
|
Great!!!!
|
I suggest that we release |
April 2018 :-) |
For the final release should the files in tessdata repo be updated? These have models for legacy tesseract that we need to keep. However the LSTM models in those were improved in tessdata_best and then further improved / made faster in tessdata_fast. I suggest that we update all the lstm related files in tessdata with files from tessdata_fast. eg. for Hindi.
For hindi, following files in the traineddata in tessdata repo
should be replaced by the following from tessdata_fast
Also, the version string should be updated appropriately to reflect the combo. This will also make the size of traineddata files in tessdata repo smaller. |
Good idea, but it should not delay the final 4.0.0 release. |
Thinking about this some more, I think a better alternative will be to remove the lstm files from the traineddata in tessdata. This will ensure there is no conflict in different config files needed for legacy and LSTM models. The traineddata file will become smaller. There will be no need to update the lstm models in tessdata in future. It will be easier for users: tessdata for --oem 0 @stweil could implement a check that --oem 0 is only being used with traineddata files that have a version string of However, this misses the case where default --oem mode was set to 2 or 1 in the config files in tessdata. I will look to see how many such cases are there. |
$ grep engine_mode *.config
ara.config:tessedit_ocr_engine_mode 1
hin.config:tessedit_ocr_engine_mode 2
Only two languages come up. For Hindi, OEM 1 with tessdata_fast is much better than 2 in tessdata.
I propose to replace these two traineddata files by their counterparts in tessdata_fast. Since their version string will not be Version string:Pre-4.0.0, the program should not crash if the check is implemented. We can document this in the readme in the tessdata repo. |
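The same scan as the grep above can be sketched in Python, which makes it easier to act on the result afterwards. This assumes the per-language `*.config` files have already been unpacked into one directory and that each setting sits on its own line as `name value`; the function name `engine_modes` is made up for this example.

```python
from pathlib import Path


def engine_modes(config_dir: str) -> dict:
    """Map language code to the tessedit_ocr_engine_mode set in its .config file."""
    modes = {}
    for path in sorted(Path(config_dir).glob("*.config")):
        for line in path.read_text(encoding="utf-8").splitlines():
            parts = line.split()
            # Only lines of the form "tessedit_ocr_engine_mode <int>" are of interest.
            if len(parts) == 2 and parts[0] == "tessedit_ocr_engine_mode":
                modes[path.stem] = int(parts[1])
    return modes
```

For the two files reported in this comment, the expected result would be a mapping of `ara` to 1 and `hin` to 2; languages whose config does not set the variable are simply absent.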
@stweil Since you probably use --oem 0 in your projects, what do you think of this idea?
|
Those ocr_engine_mode settings may be due to the historical presence of cube, and
may not be optimal for the current implementation.
…On Mon, Mar 19, 2018 at 1:24 AM Shreeshrii ***@***.***> wrote:
$grep engine_mode *.config
ara.config:tessedit_ocr_engine_mode 1
hin.config:tessedit_ocr_engine_mode 2
Only two languages come up. For Hindi, OEM 1 with tessdata_fast is much
better than 2 in tessdata.
I propose, to replace these two traineddata files by their counterparts in
tessdata_fast. Since their version string will not be Version
string:Pre-4.0.0, the program should not crash, if the check is implemented.
--
Ray.
|
Ray, Since you mentioned that best can be integerized to make it faster, and there are already three repos with traineddata files, I thought of updating the lstm files in the traineddata in tessdata with the integerized best with a Version string such as: Version string:Pre-4.0.0+4.00.00alpha:nld:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1]:best2int20180321 i.e. Version string from tessdata+ Version string from tessdata_best appended by best2int20180321 |
Many fixes have been made to master branch for 4.0 since the 4.00.00alpha release in November 2016. A number of assertions have been fixed.
@zdenop Please add a new tag eg. 4.0.0alpha-1 / 2 (numbering as you consider appropriate). Thanks!