Tag a new version for LSTM 4.0 #995

Shreeshrii · 2017-06-17T12:18:45Z

Many fixes have been made to master branch for 4.0 since the 4.00.00alpha release in November 2016. A number of assertions have been fixed.

@zdenop Please add a new tag eg. 4.0.0alpha-1 / 2 (numbering as you consider appropriate). Thanks!

stweil · 2017-06-18T11:08:10Z

It would be good to decide about using semantic versioning soon. Maybe it can be used for the next tag.

Shreeshrii · 2017-06-19T05:44:33Z

I have not seen any comments against semver. Maybe good to set up some kind of auto-update for increasing the PATCH version based on commit numbers to reduce manual administrative updates.

@stweil From what I have read about semver, if you were to implement the zipped traineddata and related changes, it should cause a change in MINOR version. So, with that, should it be 4.1.0alpha?

```
Given a version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you make incompatible API changes,
MINOR version when you add functionality in a backwards-compatible manner, and
PATCH version when you make backwards-compatible bug fixes.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.
```
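For illustration only, the semver rule quoted above can be sketched in a few lines of Python. The helper name `bump` is hypothetical and not part of any Tesseract tooling; it assumes a plain MAJOR.MINOR.PATCH string without pre-release labels.

```python
import re

# Plain MAJOR.MINOR.PATCH only; pre-release labels are deliberately not handled here.
CORE_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def bump(version: str, part: str) -> str:
    """Increment one part of a MAJOR.MINOR.PATCH version, resetting lower parts to 0."""
    match = CORE_VERSION.match(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(group) for group in match.groups())
    if part == "major":   # incompatible API change
        return f"{major + 1}.0.0"
    if part == "minor":   # backwards-compatible new functionality
        return f"{major}.{minor + 1}.0"
    if part == "patch":   # backwards-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


if __name__ == "__main__":
    print(bump("4.0.0", "minor"))
    print(bump("4.0.0", "patch"))
```

Under this reading, a backwards-compatible feature added on top of 4.0.0 would move the version to 4.1.0, and a pure bug fix would move it to 4.0.1.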

egorpugin · 2017-06-19T08:10:28Z

First 4 version will be 4.0.0. What 4.1.0alpha are you talking about? We don't care about changes in dev branches.

stweil · 2017-06-19T08:23:47Z

We could tag the current release as a pre-release or as a release candidate. According to semver.org, it could be called something like 4.0.0-rc.1 (that's how semver.org named its own releases), 4.0.0-beta.1 or 4.0.0-beta.20170619.

Shreeshrii · 2017-06-19T08:25:11Z

We don't care about changes in dev branches.

OK.

Still, it will be good to have new tags when changes are substantial enough from previous commits. For example,

change of LSTM mode from --oem 4 to --oem 1 after removal of cube
change in .lstmf and .lstm file formats after update regarding endianness
proposed change in traineddata files to zipped format

That said, I have only done some cursory reading regarding semver. So, I am happy with whatever tag/version is used, as long as there is some demarcation.

The reason for asking for this is that people are using/trying to use master branch/4.0/LSTM and ask questions, where the version info says -alpha or -dev, and it is difficult to figure out what the issue is without knowing the version being used.

Shreeshrii · 2017-06-19T09:51:32Z

I vote for this format which includes date - easy to identify which version is more recent.

4.0.0-beta.20170619
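A rough sketch of why a dated tag makes it easy to see which version is more recent: under the precedence rules on semver.org, numeric pre-release identifiers compare numerically, so a later date sorts later. The function names `dated_tag` and `precedence_key` are made up for this example, and the code assumes well-formed tags.

```python
from datetime import date


def dated_tag(base: str, stage: str, day: date) -> str:
    """Build a tag such as 4.0.0-beta.20170619 from a base version, stage and date."""
    return f"{base}-{stage}.{day:%Y%m%d}"


def precedence_key(version: str):
    """Sort key following semver precedence (build metadata after '+' is ignored)."""
    version = version.split("+", 1)[0]
    core, _, pre = version.partition("-")
    numbers = tuple(int(n) for n in core.split("."))
    if not pre:
        # A normal release ranks above all of its pre-releases.
        return (numbers, 1, ())
    identifiers = tuple(
        # Numeric identifiers rank below alphanumeric ones and compare as integers.
        (0, int(part), "") if part.isdigit() else (1, 0, part)
        for part in pre.split(".")
    )
    return (numbers, 0, identifiers)


if __name__ == "__main__":
    tags = [
        "4.0.0",
        "4.0.0-rc.1",
        dated_tag("4.0.0", "beta", date(2017, 6, 19)),
        dated_tag("4.0.0", "alpha", date(2017, 6, 19)),
    ]
    for tag in sorted(tags, key=precedence_key):
        print(tag)
```

Sorting with this key places alpha before beta before rc before the final 4.0.0, and two tags of the same stage are ordered by their date.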

Shreeshrii · 2017-06-25T14:54:54Z

Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/V1tyGHIenbI/SUVuXheJAwAJ

An example of how 4.00.00alpha is NOT compatible with the current master branch eg. --oem options.

amitdo · 2017-06-26T19:26:33Z

@theraysmith, can you give us an update on your work? When are we going to see it?

WilliamTambellini · 2017-06-29T01:09:58Z

Hi, same: can you give us an update on your work? When are we going to see 4.0 released?

amitdo · 2017-06-29T11:39:45Z

+1 for a new tag.

Since Ray does not reply, I suggest to still use 'alpha'.

4.0.0-alpha.YYYYMMDD

amitdo · 2017-07-11T11:48:52Z

@zdenop, can you do it, or at least add your comment here?

theraysmith · 2017-07-12T00:24:56Z

I'm about ready to update the traineddatas. I have a training run almost complete, and with accuracy that meets with my satisfaction. There are a few regressions, but not too serious.

First though, I have to get some code reviewed in Google, and then make some commits to github to match the new traineddatas. Before that, there is the matter of a major pull...

Here's what's coming:

- Fix to issue 653: New components in traineddata file for the unicharset, recoder and version string. Backwards compatible change, so the LSTM component can still read older files.
- Change in training system. The above change makes open source training impossible. Will add a new program to build a starter traineddata from a unicharset and optional word lists.
- New "normalization" code to clean corpus text in all languages. That was a big part of the work.
- Improvements to the trained networks to improve accuracy on single characters and single words.
- 2 parallel sets of tessdata. "best" and "fast". "Fast" will exceed the speed of legacy Tesseract in real time, provided you have the required parallelism components, and in total CPU only slightly slower for English. Way faster for most non-latin languages, while being <5% worse than "best". Only "best" will be retrainable, as "fast" will be integer.

I have other stuff that is still incomplete, but that is a good list for now.

BTW, in case you hadn't noticed, there was a breaking change that made old lstmf files unusable. That was needed to fix LSTM for OSD. It has to know the language of each training sample.

The new traineddatas will mostly be smaller than the older ones, as they won't contain the legacy components, and no bigram dawgs are needed.

WilliamTambellini · 2017-07-12T01:02:45Z

Superb. Anything we could do to help you ? Cheers.

Shreeshrii · 2017-07-12T03:57:34Z

@theraysmith Thanks for the update. Look forward to it. Any estimate of expected date?

@zdenop I think this is a good reason to freeze the 'alpha' state by tagging the repo with the current version as 4.0.0-alpha.YYYYMMDD, since Ray is going to be making major changes.

stweil · 2017-07-12T05:24:38Z

I'm about ready to update the traineddatas.

That's good news.

The above change makes open source training impossible.

If I got that right, it would be horrible. Being able to create new traineddata is essential for me.

zdenop · 2017-07-12T06:00:03Z

@Shreeshrii: I do not understand what you want. Tag will not freeze anything. Tag is just specific points in history to mark something important (e.g. new version). Tagging should be driven by a developer who knows the roadmap and not by users...

Shreeshrii · 2017-07-12T06:27:12Z

@zdenop

Tag is just specific points in history to mark something important (e.g. new version).

Exactly my point :-)

When Ray makes his next set of commits, that will change the codebase as well as traineddata substantially. I am sure it will be tagged by Ray at that time, probably as a beta or release candidate.

My request to you to tag current commit (as an example) is to mark a point in history where a lot of development has taken place since the original 4.00.00alpha tag. In fact, that original tag just marked the start of the 4.00.00alpha development and many bugs in that original tag (missing lstm.train file etc.) have been fixed later.

Also, if the new changes by Ray will not allow for open source training :-( then the current github version will be the one which allows users to do their own training. So, it is certainly deserving of a tag in my opinion :-)

theraysmith · 2017-07-12T17:27:45Z

Open source training:
OK, I overstated it a bit. One of my commits will temporarily break the training process. After doing so, I will correct the documentation and add the new tool (which I have already written) as quickly as possible after.

To help:
No more breaking commits! If it doesn't produce perfect results on phototest, it broke something!
Cutting down on the code cleanup while I am working on it will also help.
When I have committed the new corpus cleanup code, it would be useful to have any experts in any of the following scripts review the code and make comments: Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Myanmar, Khmer. There are script-specific cleanup rules in there.
Since I plan to commit new copies of the training data (unicharsets, wordlists, training text etc) then at that point they will match

Dates:
I was going to get started this week, but now I have to debug my pull from github, which has broken tests (of the legacy engine), so that will take time to fix. I'm hoping it's simple, but it is bizarre.
Even when it is fixed, there are 1500 lines of change from github for someone here to review.
I *really* want to get 4.00 finished (in beta) in the next 5-6 weeks.

Shreeshrii · 2017-07-13T01:39:47Z

When I have committed the new corpus cleanup code, it would be useful to have any experts in any of the following scripts review the code and make comments: Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Myanmar, Khmer. There are script-specific cleanup rules in there.

What kind of expertise do you need regarding the Indic scripts?

theraysmith · 2017-07-13T04:13:17Z

The code determines what makes a valid/invalid sequence of unicodes in the script, for instance, is it allowed to have two matras in a row? It gets more difficult with questions over what category the additional characters are.

Shreeshrii · 2017-07-13T04:39:34Z

No, it is not valid to have any two matras in a row - Devanagari 093E-094C.

However, these can be followed by Anusvar, Chandrabindu or Visarga i.e. 0901-0903

In case of Vedic Sanskrit, these can be followed by the Vedic accents eg. 0951, 0952, 1CDA etc

However, I have seen samples in legacy fonts where a number of separate matras are used to create another one eg. using unicode points as example 093E followed by 0947 to create 094b - ा े to make ो

Similarly in legacy fonts, half letters (letter followed by virama) may be followed by aa maatraa to create the complete letter in cases such as ga, sha etc. i.e. 0936 + 094D + 093E to create 0936 for sha

It is possible that some converters from legacy font to unicode retain these errors.

Also, in case of Vedic Sanskrit, the valid order should be matra, combining mark (anusvar, visarga), vedic accent . Some fonts incorrectly use matra, vedic accent and combining mark which will lead to dotted circle. eg. अंशाः॑ vs अंशा॑ः
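As a minimal sketch of the two rules described above (no two matras in a row, and Vedic accents coming after anusvar/visarga rather than before), something like the following could flag suspicious sequences. This is only an illustration built from the code point ranges given in this comment; it is not the corpus cleanup code Ray refers to, and `find_problems` is a hypothetical name.

```python
# Code point ranges as stated in the comment above (assumptions, not a full grammar).
MATRAS = range(0x093E, 0x094D)        # U+093E..U+094C dependent vowel signs
NASAL_MARKS = range(0x0901, 0x0904)   # chandrabindu, anusvara, visarga
VEDIC_ACCENTS = {0x0951, 0x0952, 0x1CDA}


def find_problems(text: str):
    """Return (index, reason) pairs for adjacent code points that break the rules."""
    problems = []
    for i in range(1, len(text)):
        prev, cur = ord(text[i - 1]), ord(text[i])
        if prev in MATRAS and cur in MATRAS:
            problems.append((i, "two matras in a row"))
        elif prev in VEDIC_ACCENTS and cur in NASAL_MARKS:
            problems.append((i, "vedic accent before combining mark"))
    return problems


if __name__ == "__main__":
    legacy = "\u0915\u093E\u0947"                           # ka + aa matra + e matra
    good_order = "\u0905\u0902\u0936\u093E\u0903\u0951"     # ... matra, visarga, accent
    bad_order = "\u0905\u0902\u0936\u093E\u0951\u0903"      # ... matra, accent, visarga
    for sample in (legacy, good_order, bad_order):
        print(sample, find_problems(sample))
```

The check only looks at adjacent pairs, so it would miss longer-range ordering errors; it is meant to show the kind of rule involved, not to be complete.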

For a sample of Vedic Sanskrit and its ground truth, see
https://github.com/Shreeshrii/tess4training/blob/master/BRH-test.tif
https://github.com/Shreeshrii/tess4training/blob/master/BRH-test.txt

Will your new sanskrit traineddata be able to OCR this?

amitdo · 2017-07-13T12:17:47Z

The new traineddatas will mostly be smaller than the older ones, as they
won't contain the legacy components, and no bigram dawgs are needed.

Will you remove the code of the legacy engine in this round?

theraysmith · 2017-07-13T16:37:53Z

No, it is not valid to have any two matras in a row - Devanagari 093E-094C. However, these can be followed by Anusvar, Chandrabindu or Visarga i.e. 0901-0903

It seems that Malayalam is unique in allowing multiple 0d02 (Anusvara)?

In case of Vedic Sanskrit, these can be followed by the Vedic accents eg. 0951, 0952, 1CDA etc However, I have seen samples in legacy fonts where a number of separate matras are used to create another one eg. 093E followed by 0947 to create 094b

These are specifically dis-allowed by unicode, but the rules seem to be very script-specific, and not very consistently documented in the unicode standard. I don't think the rules are addressed properly for all scripts.

Similarly in legacy fonts, half letters (letter followed by virama) maybe followed by aa maatraa to create the complete letter in cases such as ga, sha etc. i.e. 0936 + 094D + 093E to create 0936 for sha It is possible that some converters from legacy font to unicode retain these errors. Also, in case of Vedic Sanskrit, the valid order should be matra, combining mark (anusvar, visarga), vedic accent . Some fonts incorrectly use matra, vedic accent and combining mark which will lead to dotted circle. eg. अंशाः॑ vs अंशा॑ः

The code aims to dis-allow text designed for such legacy fonts. The documentation that I have found is very good for Devanagari, but lacking for some of the other scripts. For instance, there is a big table in the unicode standard for Myanmar, ( http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf) but it doesn't cover any of the extension Myanmar characters, and isn't explicit about whether the table represents a specific valid order or not. The existence of a lot of legacy Myanmar text on the web that is designed for non-compliant fonts doesn't help make it easier to determine whether the filter is correct.

theraysmith · 2017-07-13T16:43:04Z

That is still an open question. I have limited time to spend on it (therefore resistant to delaying tactics changing types in the dead code to POSIX). Whether enough uses of Tesseract can be covered by the new engine is still being debated, and the new models that I have need to be evaluated before enough of the community is convinced. I accept the requirement to add one or more new characters without the need for full retraining, and will not delete the legacy code until that need is addressed. (I think it can be done). The legacy code is used by the OSD model and deletion of the legacy code is also blocked by a good enough replacement.

Shreeshrii · 2017-07-13T17:11:19Z

It seems that Malayalam is unique in allowing multiple 0d02 (Anusvara)?

That does not sound right. Please see
https://en.wikipedia.org/wiki/Malayalam_script#Anusvaram

I did a search on ംം (two anusvarams in malayalam script) and most of them show up in the search result in pdfs.

FYI, pdfs created with documents having text in unicode fonts for complex scripts do not save the unicode text correctly. Devanagari text copied from these pdf is not correct, I assume similarly for malayalam and other Indian scripts, and that might be causing this double anusvar problem.

Newer pdfs created in a special manner, eg. with 'actual text' with xelatex, are ok (eg. http://sanskritdocuments.org/doc_devii/annapurna.pdf), but those created from various other software are not (http://www.sanskritweb.net/sansdocs/nala-d.pdf).

@jbreiden can give you the technical reasoning for this.

Google search does show pdfs as part of the search results, so there is some internal OCR (is it tesseract???) being done on the pdfs, books etc as part of the search process. But it may not be fully correct.

So for the corpus for training, I would suggest to avoid text taken from pdfs (in case it is being used).

Shreeshrii · 2017-07-14T07:00:17Z

@theraysmith Regarding Malayalam, double anusvara

Please see
http://unicode.org/charts/PDF/U0D00.pdf
http://www.alanwood.net/unicode/malayalam.html
http://www.omniglot.com/language/numbers/malayalam.htm

zero in Malayalam script - pujyam looks very much like the sign for anusvar.

Also, there are different anusvars shown in unicode chart--

0D00 $ഀ MALAYALAM SIGN COMBINING ANUSVARA ABOVE
0D02 $ം MALAYALAM SIGN ANUSVARA
• used in Prakrit language texts to indicate gemination of the following consonant

0D3B $഻ MALAYALAM SIGN VERTICAL BAR VIRAMA
0D3C $഼ MALAYALAM SIGN CIRCULAR VIRAMA

I will look up more info and post under an issue in langdata

theraysmith · 2017-07-14T15:52:37Z

Direct from the unicode standard:

Anusvara. The anusvara can be seen multiple times after vowels, whether independent letters or dependent vowel signs, as in ഈംംംം <0D08, 0D02, 0D02, 0D02, 0D02>. Vowel signs can also be seen after digits, as in 355ാം <0033, 0035, 0035, 0D3E, 0D02>. More generally, rendering engines should be prepared to handle Malayalam letters (including vowel letters), digits (both European and Malayalam), dashes, U+00A0 no-break space and U+25CC dotted circle as base characters for the Malayalam vowel signs, U+0D4D malayalam sign virama, U+0D02 malayalam sign anusvara, and U+0D03 malayalam sign visarga. They should also be prepared to handle multiple combining marks on those bases.

Is it wrong?
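To make the two quoted code point sequences easier to inspect, here is a small sketch using Python's standard `unicodedata` module. The constant and function names are invented for this example; only the code points come from the passage above.

```python
import unicodedata

# Code point sequences as listed in the quoted passage.
REPEATED_ANUSVARA = [0x0D08, 0x0D02, 0x0D02, 0x0D02, 0x0D02]
VOWEL_SIGN_AFTER_DIGITS = [0x0033, 0x0035, 0x0035, 0x0D3E, 0x0D02]


def to_text(code_points):
    """Join a list of code points into a string."""
    return "".join(chr(cp) for cp in code_points)


def names(code_points):
    """Look up the Unicode character name of each code point."""
    return [unicodedata.name(chr(cp)) for cp in code_points]


def longest_run(code_points, target):
    """Length of the longest run of consecutive `target` code points."""
    best = current = 0
    for cp in code_points:
        current = current + 1 if cp == target else 0
        best = max(best, current)
    return best


if __name__ == "__main__":
    for sequence in (REPEATED_ANUSVARA, VOWEL_SIGN_AFTER_DIGITS):
        print(to_text(sequence), names(sequence), longest_run(sequence, 0x0D02))
```

A filter that rejects every repeated U+0D02 would drop the first sequence, which is the case the question is about.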

jbreiden · 2018-03-02T04:39:41Z

Very Sorry! I misread the dashboards. Looks like the slightly older code 4.00~git2207-766b7bd6-3.1 will ship, which is missing some of the last minute improvements. I believe it is no longer possible to change the version string (or anything else about Tesseract) for Ubuntu 18.04.

zdenop · 2018-03-02T06:01:21Z

Tagging repo will cause release in github and AFAIR it cause problem for some people.
Other distributions will take:

the latest github master (to include all additional fixes)
the latest stable release

Nobody would care what other distributions did...
I would prefer Ray give clear statement about next step for 4.0 release.

stweil · 2018-03-02T06:23:22Z

Tagging repo will cause release in github [...]

That's desired. GitHub also allows marking such releases as pre-release – just edit the release information of the new release. That should minimize problems for other people.

The release of today would be 4.0.0-alpha.20180302.

zdenop · 2018-03-02T06:26:56Z

OK, but do we expect more code/fixes to come for the 4.0 release?

stweil · 2018-03-02T06:33:17Z

Yes, why not? I don't plan to stop sending code / fixes. :-) Other people will continue sending fixes, too. So either we'll have a 4.0.0-alpha.20180401, or a 4.0.0 without alpha, or a 4.0.1, or Ray sends a bunch of code which justifies a 4.1.0, ...

Shreeshrii · 2018-03-02T11:08:28Z

I would prefer Ray give clear statement about next step for 4.0 release.

@jbreiden Please check with Ray. Thanks!

jbreiden · 2018-03-02T16:54:45Z

I would prefer Ray to speak for himself, too! However, I don't think there will be large Tesseract changes from him in either short or medium term.

amitdo · 2018-03-02T17:38:32Z

Zdenko, I also think we should finally release 4.0.0. It's time to get rid of the alpha status.

amitdo · 2018-03-02T18:06:52Z

If you decide to release it soon, don't forget to first update ccutil/version.h

jbreiden · 2018-03-03T04:21:40Z

Ha! Looks like they took 40f43111 after all, one day after deadline.

https://launchpad.net/ubuntu/+source/tesseract

zdenop · 2018-03-05T09:30:03Z

Jeff, is there any info from Ray about 4.00 release? Or at least how to tag "Ubuntu" release (4.00RC1, 4.00beta?...)?

amitdo · 2018-03-09T13:35:44Z

We should make the decision ourselves.

What about this proposal:

Tag commit 40f4311 as 4.00-alpha.2+git.2219.40f43111.
3-4 weeks from now, tag the latest commit in master as 4.0.0-beta.1.
Release 4.0.0 30-60 days after beta1 (maybe with one more beta and one rc).

https://semver.org/
https://packages.ubuntu.com/bionic/tesseract-ocr

Mark any non final 4.0.0 as 'pre-release'.

https://help.github.com/articles/creating-releases/

If the release is unstable, select This is a pre-release to notify users that it's not ready for production.

For each (pre-)release, update ccutil/version.h.
https://github.com/tesseract-ocr/tesseract/blob/master/ccutil/version.h

jbreiden · 2018-03-09T18:25:17Z

is there any info from Ray about 4.00 release?

No info.

Or at least how to tag "Ubuntu" release (4.00RC1, 4.00beta?...)?

Millions of people will use commit 40f4311 because of Ubuntu, and I think the main benefit of a tag is to help understand bug reports coming from these users. There have been many good tag proposals in this thread from @amitdo and @stweil and @zdenop and @WilliamTambellini. I don't have a strong opinion about which one is best. If I was forced to choose, I'd probably tag commit 40f4311 with 4.0.0-beta.1 If that feels like too much commitment, then use a very specific tag like ubuntu18.04. Whatever is chosen, I think it makes sense to apply the same tag to the fast training data at commit 0e00fe6.

Shreeshrii · 2018-03-09T18:58:09Z

Jeff, the traineddata files have a version string of 4.00.00alpha with a date (062917 if I remember correctly). tesseract also reports version of 4.00.00alpha. Will it be possible to change these in the Ubuntu 18.04 packages now?

jbreiden · 2018-03-10T03:22:49Z

No more changes possible. Everything will look exactly as described here: #995 (comment)

zdenop · 2018-03-10T07:41:58Z

Done.

Shreeshrii · 2018-03-10T08:23:15Z

Great!!!!


amitdo · 2018-03-18T09:47:16Z

I suggest that we release 4.0.0 (final) by the end of April. About two weeks before that release, we should release 4.0.0-rc.1.
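To illustrate why a tag such as 4.0.0-rc.1 would sort before the final 4.0.0, here is a minimal Python sketch of semver-style precedence. The function name `semver_key` is made up for this illustration, and the sketch assumes well-formed MAJOR.MINOR.PATCH tags with an optional pre-release part; it is not part of Tesseract or of any release tooling.

```python
def semver_key(version):
    """Return a sort key that follows semver precedence rules.

    Assumes a well-formed MAJOR.MINOR.PATCH[-prerelease][+build] string.
    Build metadata is ignored, as it does not affect precedence.
    """
    version = version.split("+", 1)[0]
    core, _, prerelease = version.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    if not prerelease:
        # A normal release ranks above every pre-release of the same core.
        return (major, minor, patch, 1, ())
    identifiers = tuple(
        # Numeric identifiers compare numerically and rank below
        # alphanumeric ones, which compare as ASCII strings.
        (0, int(item), "") if item.isdigit() else (1, 0, item)
        for item in prerelease.split(".")
    )
    return (major, minor, patch, 0, identifiers)


if __name__ == "__main__":
    tags = ["4.0.0", "4.0.0-rc.1", "4.0.0-beta.1", "4.0.0-beta.20170619"]
    for tag in sorted(tags, key=semver_key):
        print(tag)
```

Note that the existing 4.00.00alpha string is not valid semver (leading zeros, no hyphen), so it is deliberately left out of the example.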

amitdo · 2018-03-18T10:01:32Z

April 2018 :-)

Shreeshrii · 2018-03-18T16:26:02Z

For the final release should the files in tessdata repo be updated?

These have models for legacy tesseract that we need to keep.

However, the LSTM models in those were improved in tessdata_best and then further improved / made faster in tessdata_fast.

I suggest that we update all the LSTM-related files in tessdata with files from tessdata_fast.

E.g., for Hindi:

# combine_tessdata -d ./tessdata/hin.traineddata
Version string:Pre-4.0.0
0:config:size=739, offset=192
1:unicharset:size=180616, offset=931
2:unicharambigs:size=90293, offset=181547
3:inttemp:size=12791027, offset=271840
4:pffmtable:size=24823, offset=13062867
5:normproto:size=225187, offset=13087690
6:punc-dawg:size=426, offset=13312877
7:word-dawg:size=837458, offset=13313303
8:number-dawg:size=410, offset=14150761
9:freq-dawg:size=1242, offset=14151171
17:lstm:size=8874565, offset=14152413
18:lstm-punc-dawg:size=4322, offset=23026978
19:lstm-word-dawg:size=2726578, offset=23031300
20:lstm-number-dawg:size=122, offset=25757878
23:version:size=9, offset=25758000

# combine_tessdata -d ./tessdata_best/hin.traineddata
Version string:4.00.00alpha:hin:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=633, offset=192
17:lstm:size=11738347, offset=825
18:lstm-punc-dawg:size=3154, offset=11739172
19:lstm-word-dawg:size=143834, offset=11742326
20:lstm-number-dawg:size=234, offset=11886160
21:lstm-unicharset:size=7975, offset=11886394
22:lstm-recoder:size=1111, offset=11894369
23:version:size=80, offset=11895480

# combine_tessdata -d ./tessdata_fast/hin.traineddata
Version string:4.00.00alpha:hin:synth20170629
0:config:size=633, offset=192
17:lstm:size=965584, offset=825
18:lstm-punc-dawg:size=3154, offset=966409
19:lstm-word-dawg:size=143834, offset=969563
20:lstm-number-dawg:size=234, offset=1113397
21:lstm-unicharset:size=7975, offset=1113631
22:lstm-recoder:size=1111, offset=1121606
23:version:size=30, offset=1122717

For Hindi, the following components of the traineddata in the tessdata repo

0:config:size=739, offset=192
17:lstm:size=8874565, offset=14152413
18:lstm-punc-dawg:size=4322, offset=23026978
19:lstm-word-dawg:size=2726578, offset=23031300
20:lstm-number-dawg:size=122, offset=25757878
23:version:size=9, offset=25758000

should be replaced by the following from tessdata_fast

0:config:size=633, offset=192
17:lstm:size=965584, offset=825
18:lstm-punc-dawg:size=3154, offset=966409
19:lstm-word-dawg:size=143834, offset=969563
20:lstm-number-dawg:size=234, offset=1113397
21:lstm-unicharset:size=7975, offset=1113631
22:lstm-recoder:size=1111, offset=1121606

Also, the version string should be updated appropriately to reflect the combo.

This will also make the size of traineddata files in tessdata repo smaller.
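The proposed swap can be sketched in Python by parsing the `combine_tessdata -d` listings shown above. This is only a planning aid under stated assumptions: `parse_listing`, `plan_merge`, and the `FROM_FAST` set are names invented for this sketch, and it computes which components would end up in the combined file without touching any traineddata.

```python
import re

# One component line of `combine_tessdata -d` output, e.g.
# "17:lstm:size=965584, offset=825". The "Version string:" line never matches.
ENTRY_RE = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=\d+$")

# Component indices that the proposal takes from tessdata_fast.
FROM_FAST = {0, 17, 18, 19, 20, 21, 22}
# The version component is rewritten separately, so it is never copied.
VERSION_INDEX = 23


def parse_listing(text):
    """Map component index -> (name, size) for each listed component."""
    entries = {}
    for line in text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match:
            index, name, size = match.groups()
            entries[int(index)] = (name, int(size))
    return entries


def plan_merge(legacy, fast):
    """Keep legacy-only components and take config/LSTM parts from fast."""
    merged = {
        index: entry
        for index, entry in legacy.items()
        if index not in FROM_FAST and index != VERSION_INDEX
    }
    merged.update(
        {index: entry for index, entry in fast.items() if index in FROM_FAST}
    )
    return dict(sorted(merged.items()))


if __name__ == "__main__":
    legacy = parse_listing(
        "1:unicharset:size=180616, offset=931\n"
        "17:lstm:size=8874565, offset=14152413"
    )
    fast = parse_listing(
        "17:lstm:size=965584, offset=825\n"
        "21:lstm-unicharset:size=7975, offset=1113631"
    )
    for index, (name, size) in plan_merge(legacy, fast).items():
        print(index, name, size)
```

Summing the sizes in the result gives a rough idea of how much smaller the combined file would be.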

amitdo · 2018-03-18T18:03:24Z

Good idea, but it should not delay the final 4.0.0 release.

Shreeshrii · 2018-03-19T08:07:15Z

Thinking about this some more, I think a better alternative will be to remove the lstm files from the traineddata in tessdata.

This will ensure there is no conflict in different config files needed for legacy and LSTM models.

The traineddata file will become smaller.

There will be no need to update the lstm models in tessdata in future.

It will be easier for users:

tessdata for --oem 0
tessdata_fast for --oem 1
tessdata_best for LSTM training

@stweil could implement a check that --oem 0 is only being used with traineddata files that have a version string of Version string:Pre-4.0.0.

However, this misses the case where the default --oem mode was set to 2 or 1 in the config files in tessdata. I will look to see how many such cases there are.
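A rough Python sketch of the check described here, to make the idea concrete. The function name `check_engine_mode` and the error text are hypothetical; the real check would live in Tesseract's C++ code and might look quite different.

```python
# Version string carried by the legacy models in the tessdata repo.
LEGACY_VERSION = "Pre-4.0.0"
# --oem 0 selects the legacy engine only.
OEM_LEGACY_ONLY = 0


def check_engine_mode(oem, version_string):
    """Raise ValueError when --oem 0 is requested for a non-legacy file.

    Any other engine mode is accepted without further checks in this sketch.
    """
    if oem == OEM_LEGACY_ONLY and version_string.strip() != LEGACY_VERSION:
        raise ValueError(
            "--oem 0 needs a traineddata with version string "
            f"{LEGACY_VERSION!r}, got {version_string!r}"
        )


if __name__ == "__main__":
    check_engine_mode(0, "Pre-4.0.0")
    check_engine_mode(1, "4.00.00alpha:hin:synth20170629")
    try:
        check_engine_mode(0, "4.00.00alpha:hin:synth20170629")
    except ValueError as error:
        print("rejected:", error)
```

Failing early with a clear message is the point: the user learns that the file has no legacy model instead of hitting a crash later.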

Shreeshrii · 2018-03-19T08:24:29Z

$ grep engine_mode *.config
ara.config:tessedit_ocr_engine_mode 1
hin.config:tessedit_ocr_engine_mode 2

Only two languages come up. For Hindi, OEM 1 with tessdata_fast is much better than 2 in tessdata.

I propose replacing these two traineddata files in tessdata with their counterparts from tessdata_fast.

Since their version string will not be Version string:Pre-4.0.0, the program should not crash if the check is implemented.

We can document this in readme in tessdata repo.
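For anyone who wants to repeat the grep above from a script, here is a small Python equivalent. It assumes the .config files have already been unpacked into one directory; `find_engine_modes` is a name chosen for this sketch only.

```python
from pathlib import Path

# Config parameter that sets the default OCR engine mode.
PARAM = "tessedit_ocr_engine_mode"


def find_engine_modes(config_dir):
    """Map config file name -> engine mode for files that set PARAM."""
    modes = {}
    for path in sorted(Path(config_dir).glob("*.config")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            fields = line.split()
            # Expect "tessedit_ocr_engine_mode <integer>".
            if len(fields) >= 2 and fields[0] == PARAM and fields[1].isdigit():
                modes[path.name] = int(fields[1])
    return modes


if __name__ == "__main__":
    for name, mode in find_engine_modes(".").items():
        print(f"{name}: {PARAM} {mode}")
```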

Shreeshrii · 2018-03-21T08:31:34Z

@stweil Since you probably use --oem 0 in your projects, what do you think of this idea?

remove the lstm files from the traineddata in tessdata.

theraysmith · 2018-03-21T17:42:09Z

Those ocr_engine_mode settings may be due to the historical presence of cube, and may not be optimal for the current implementation.


Shreeshrii · 2018-03-21T17:56:02Z

Ray,

Since you mentioned that best can be integerized to make it faster, and there are already three repos with traineddata files, I thought of updating the lstm files in the traineddata in tessdata with the integerized best with a Version string such as:

Version string:Pre-4.0.0+4.00.00alpha:nld:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1]:best2int20180321

i.e., the version string from tessdata + the version string from tessdata_best, followed by best2int20180321
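The naming scheme can be written down as a short Python sketch. Both helper names are invented here, and the field names used by `split_version` (version, lang, training, network) are my own labels for the colon-separated parts, not official terms.

```python
def split_version(version_string):
    """Split a tessdata_best/fast version string into labelled parts.

    The network spec is the optional fourth part and contains no colons
    in the examples above, so at most three splits are needed.
    """
    keys = ("version", "lang", "training", "network")
    return dict(zip(keys, version_string.split(":", 3)))


def combined_version(legacy, best, conversion_date):
    """Join the legacy and best version strings and tag the conversion date."""
    return f"{legacy}+{best}:best2int{conversion_date}"


if __name__ == "__main__":
    best = (
        "4.00.00alpha:nld:synth20170629:"
        "[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1]"
    )
    print(split_version(best)["lang"])
    print(combined_version("Pre-4.0.0", best, "20180321"))
```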

hotchkiss87 mentioned this issue Jul 13, 2017

Compiler Warnings #1036

Closed

This was referenced Jul 14, 2017

Would like to help for Burmese/Myanmar language training? tesseract-ocr/langdata#13

Open

khmer - not working with --oem 1 tesseract-ocr/langdata#43

Closed

amitdo mentioned this issue Mar 20, 2018

fast vs. best #1404

Closed

Shreeshrii mentioned this issue Mar 27, 2018

RFC: Tesseract 4.0.0 – open tasks #1423

Closed

zdenop closed this as completed Sep 27, 2018

amitdo added the RFC label Mar 24, 2021
