Tag a new version for LSTM 4.0 #995

Closed
Shreeshrii opened this issue Jun 17, 2017 · 108 comments

@Shreeshrii
Collaborator

Shreeshrii commented Jun 17, 2017

Many fixes have been made to master branch for 4.0 since the 4.00.00alpha release in November 2016. A number of assertions have been fixed.

@zdenop Please add a new tag, e.g. 4.0.0alpha-1 / 2 (numbering as you consider appropriate). Thanks!

@stweil
Contributor

stweil commented Jun 18, 2017

It would be good to decide about using semantic versioning soon. Maybe it can be used for the next tag.

@Shreeshrii
Collaborator Author

Shreeshrii commented Jun 19, 2017 via email

@egorpugin
Contributor

egorpugin commented Jun 19, 2017

The first version 4 release will be 4.0.0. What 4.1.0alpha are you talking about? We don't care about changes in dev branches.

@stweil
Contributor

stweil commented Jun 19, 2017

We could tag the current release as a pre-release or as a release candidate. According to semver.org, it could be called something like 4.0.0-rc.1 (that's how semver.org named its own releases), 4.0.0-beta.1 or 4.0.0-beta.20170619.

@Shreeshrii
Collaborator Author

Shreeshrii commented Jun 19, 2017

We don't care about changes in dev branches.

OK.

Still, it will be good to have new tags when changes are substantial enough from previous commits. For example,

  • change of LSTM mode from --oem 4 to --oem 1 after removal of cube
  • change in .lstmf and .lstm file formats after update regarding endianness
  • proposed change in traineddata files to zipped format

That said, I have only done some cursory reading regarding semver. So, I am happy with whatever tag/version is used, as long as there is some demarcation.

The reason for asking is that people are using (or trying to use) the master branch / 4.0 / LSTM and asking questions, while the version info only says -alpha or -dev. Without knowing the exact version being used, it is difficult to figure out what the issue is.

@Shreeshrii
Collaborator Author

I vote for this format which includes date - easy to identify which version is more recent.

4.0.0-beta.20170619
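
To illustrate how the tag names proposed in this thread would sort, here is a minimal Python sketch of the precedence rules described on semver.org. The function names (`parse`, `compare`) are made up for this illustration and are not part of Tesseract or of any library. The sketch is also more lenient than strict semver: it does not reject leading zeros.

```python
import re
from functools import cmp_to_key

# MAJOR.MINOR.PATCH, optional -prerelease, optional +build metadata.
SEMVER_RE = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+([0-9A-Za-z.-]+))?$"
)


def parse(tag):
    """Split a tag into ((major, minor, patch), [pre-release identifiers])."""
    m = SEMVER_RE.match(tag)
    if m is None:
        raise ValueError(f"not a semver string: {tag!r}")
    major, minor, patch, pre, _build = m.groups()  # build metadata is ignored
    return (int(major), int(minor), int(patch)), (pre.split(".") if pre else [])


def _cmp_ident(a, b):
    """Compare two pre-release identifiers."""
    a_num, b_num = a.isdigit(), b.isdigit()
    if a_num and b_num:
        return (int(a) > int(b)) - (int(a) < int(b))  # numeric comparison
    if a_num != b_num:
        return -1 if a_num else 1  # numeric identifiers rank below alphanumeric
    return (a > b) - (a < b)  # plain ASCII ordering


def compare(x, y):
    """Return -1, 0 or 1 depending on the precedence of x relative to y."""
    (core_x, pre_x), (core_y, pre_y) = parse(x), parse(y)
    if core_x != core_y:
        return -1 if core_x < core_y else 1
    if not pre_x or not pre_y:
        # A version without a pre-release part outranks one that has it.
        return (not pre_x) - (not pre_y)
    for a, b in zip(pre_x, pre_y):
        c = _cmp_ident(a, b)
        if c:
            return c
    # All shared identifiers are equal: the longer list has higher precedence.
    return (len(pre_x) > len(pre_y)) - (len(pre_x) < len(pre_y))


tags = ["4.0.0", "4.0.0-rc.1", "4.0.0-beta.20170619", "4.0.0-beta.1"]
ordered = sorted(tags, key=cmp_to_key(compare))
```

With a date as the last identifier, two tags of the same stage (both beta, say) are ordered by date, because all-digit identifiers are compared numerically.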

@Shreeshrii
Collaborator Author

Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/V1tyGHIenbI/SUVuXheJAwAJ

This is an example of how 4.00.00alpha is NOT compatible with the current master branch, e.g. in the --oem options.

@amitdo
Collaborator

amitdo commented Jun 26, 2017

@theraysmith, can you give us an update on your work? When are we going to see it?

@WilliamTambellini

Hi, same: can you give us an update on your work? When are we going to see 4.0 released?

@amitdo
Collaborator

amitdo commented Jun 29, 2017

+1 for a new tag.

Since Ray has not replied, I suggest we still use 'alpha'.

4.0.0-alpha.YYYYMMDD

@amitdo
Collaborator

amitdo commented Jul 11, 2017

@zdenop, can you do it, or at least add your comment here?

@theraysmith
Contributor

theraysmith commented Jul 12, 2017 via email

@WilliamTambellini

Superb. Is there anything we could do to help you? Cheers.

@Shreeshrii
Collaborator Author

@theraysmith Thanks for the update. Look forward to it. Any estimate of expected date?

@zdenop I think this is a good reason to freeze the 'alpha' state by tagging the repo with the current version as 4.0.0-alpha.YYYYMMDD, since Ray is going to be making major changes.

@stweil
Contributor

stweil commented Jul 12, 2017

I'm about ready to update the traineddatas.

That's good news.

The above change makes open source training impossible.

If I got that right, it would be horrible. Being able to create new traineddata is essential for me.

@zdenop
Contributor

zdenop commented Jul 12, 2017

@Shreeshrii: I do not understand what you want. A tag will not freeze anything. A tag is just a specific point in history that marks something important (e.g. a new version). Tagging should be driven by a developer who knows the roadmap, not by users...

@Shreeshrii
Collaborator Author

Shreeshrii commented Jul 12, 2017

@zdenop

A tag is just a specific point in history that marks something important (e.g. a new version).

Exactly my point :-)

When Ray makes his next set of commits, that will change the codebase as well as traineddata substantially. I am sure it will be tagged by Ray at that time, probably as a beta or release candidate.

My request that you tag the current commit (as an example) is meant to mark a point in history after which a lot of development has taken place since the original 4.00.00alpha tag. In fact, that original tag just marked the start of 4.00.00alpha development, and many bugs in it (missing lstm.train file etc.) were fixed later.

Also, if the new changes by Ray will not allow for open source training :-( then the current github version will be the one which allows users to do their own training. So, it is certainly deserving of a tag in my opinion :-)

@theraysmith
Contributor

theraysmith commented Jul 12, 2017 via email

@Shreeshrii
Collaborator Author

Shreeshrii commented Jul 13, 2017 via email

@theraysmith
Contributor

theraysmith commented Jul 13, 2017 via email

@Shreeshrii
Collaborator Author

Shreeshrii commented Jul 13, 2017

No, it is not valid to have any two matras in a row - Devanagari 093E-094C.

However, these can be followed by Anusvar, Chandrabindu or Visarga i.e. 0901-0903

In case of Vedic Sanskrit, these can be followed by the Vedic accents eg. 0951, 0952, 1CDA etc

However, I have seen samples in legacy fonts where a number of separate matras are used to create another one eg. using unicode points as example 093E followed by 0947 to create 094b - ा े to make ो

Similarly, in legacy fonts, half letters (letter followed by virama) may be followed by aa maatraa to create the complete letter in cases such as ga, sha etc., i.e. 0936 + 094D + 093E to create 0936 for sha.

It is possible that some converters from legacy font to unicode retain these errors.

Also, in the case of Vedic Sanskrit, the valid order should be matra, combining mark (anusvar, visarga), vedic accent. Some fonts incorrectly use matra, vedic accent and then combining mark, which leads to a dotted circle, e.g. अंशाः॑ vs अंशा॑ः
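
The rules above can be turned into a rough check. This is only a minimal Python sketch under assumptions: the function name `find_problems` is made up, the accent set contains only the three code points named above (0951, 0952, 1CDA), and it covers just two rules (no two matras in a row, and no Vedic accent placed before anusvar/chandrabindu/visarga). It is not a full Devanagari syllable validator.

```python
# Code point ranges taken from the description above.
MATRAS = range(0x093E, 0x094D)            # U+093E..U+094C, dependent vowel signs
NASAL_SIGNS = range(0x0901, 0x0904)       # U+0901..U+0903
VEDIC_ACCENTS = {0x0951, 0x0952, 0x1CDA}  # only the examples named in this thread


def find_problems(text):
    """Return (index, reason) pairs for suspicious adjacent code points."""
    problems = []
    for i in range(1, len(text)):
        prev, cur = ord(text[i - 1]), ord(text[i])
        if prev in MATRAS and cur in MATRAS:
            problems.append((i, "two matras in a row"))
        elif prev in VEDIC_ACCENTS and cur in NASAL_SIGNS:
            problems.append((i, "vedic accent before anusvar/visarga"))
    return problems


# Legacy-font style sequence: aa matra followed by e matra (instead of U+094B).
legacy = "\u0915\u093E\u0947"
# Matra, visarga, accent (expected order) vs. matra, accent, visarga.
good_order = "\u0905\u0902\u0936\u093E\u0903\u0951"
bad_order = "\u0905\u0902\u0936\u093E\u0951\u0903"

for sample in (legacy, good_order, bad_order):
    print(find_problems(sample))
```

A check like this could be used to filter a training corpus for text that came out of a lossy legacy-font conversion.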

For a sample of Vedic Sanskrit and its ground truth, see
https://github.com/Shreeshrii/tess4training/blob/master/BRH-test.tif
https://github.com/Shreeshrii/tess4training/blob/master/BRH-test.txt

Will your new sanskrit traineddata be able to OCR this?

@amitdo
Collaborator

amitdo commented Jul 13, 2017

The new traineddatas will mostly be smaller than the older ones, as they
won't contain the legacy components, and no bigram dawgs are needed.

Will you remove the code of the legacy engine in this round?

@theraysmith
Contributor

theraysmith commented Jul 13, 2017 via email

@theraysmith
Contributor

theraysmith commented Jul 13, 2017 via email

@Shreeshrii
Collaborator Author

Shreeshrii commented Jul 13, 2017

It seems that Malayalam is unique in allowing multiple 0d02 (Anusvara)?

That does not sound right. Please see
https://en.wikipedia.org/wiki/Malayalam_script#Anusvaram

I did a search on ംം (two anusvarams in Malayalam script) and most of the hits in the search results are in PDFs.

FYI, PDFs created from documents with text in Unicode fonts for complex scripts do not store the Unicode text correctly. Devanagari text copied from such PDFs is not correct; I assume the same holds for Malayalam and other Indian scripts, and that might be causing this double anusvar problem.

Newer PDFs created in a special manner, e.g. with 'actual text' via xelatex, are OK (e.g. http://sanskritdocuments.org/doc_devii/annapurna.pdf), but those created by various other software are not (http://www.sanskritweb.net/sansdocs/nala-d.pdf).

@jbreiden can give you the technical reasoning for this.

Google search does show pdfs as part of the search results, so there is some internal OCR (is it tesseract???) being done on the pdfs, books etc as part of the search process. But it may not be fully correct.

So for the corpus for training, I would suggest to avoid text taken from pdfs (in case it is being used).

@Shreeshrii
Collaborator Author

Shreeshrii commented Jul 14, 2017

@theraysmith Regarding Malayalam, double anusvara

Please see
http://unicode.org/charts/PDF/U0D00.pdf
http://www.alanwood.net/unicode/malayalam.html
http://www.omniglot.com/language/numbers/malayalam.htm

Zero in Malayalam script (pujyam) looks very much like the sign for anusvar.

Also, there are different anusvars shown in unicode chart--

0D00 ഀ MALAYALAM SIGN COMBINING ANUSVARA ABOVE
0D02 ം MALAYALAM SIGN ANUSVARA
• used in Prakrit language texts to indicate gemination of the following consonant

0D3B ഻ MALAYALAM SIGN VERTICAL BAR VIRAMA
0D3C ഼ MALAYALAM SIGN CIRCULAR VIRAMA

I will look up more info and post under an issue in langdata

@theraysmith
Contributor

theraysmith commented Jul 14, 2017 via email

@jbreiden
Contributor

jbreiden commented Mar 2, 2018

Very Sorry! I misread the dashboards. Looks like the slightly older code 4.00~git2207-766b7bd6-3.1 will ship, which is missing some of the last minute improvements. I believe it is no longer possible to change the version string (or anything else about Tesseract) for Ubuntu 18.04.

@zdenop
Contributor

zdenop commented Mar 2, 2018

  1. Tagging the repo will create a release on GitHub, and AFAIR that caused problems for some people.
  2. Other distributions will take either:
  • the latest github master (to include all additional fixes)
  • the latest stable release

Nobody will care what another distribution did...
I would prefer that Ray give a clear statement about the next step for the 4.0 release.

@stweil
Contributor

stweil commented Mar 2, 2018

Tagging the repo will create a release on GitHub [...]

That's desired. GitHub also allows marking such releases as pre-release – just edit the release information of the new release. That should minimize problems for other people.

The release of today would be 4.0.0-alpha.20180302.

@zdenop
Contributor

zdenop commented Mar 2, 2018 via email

@stweil
Contributor

stweil commented Mar 2, 2018

Yes, why not? I don't plan to stop sending code / fixes :-), and other people will continue sending fixes, too. So either we'll have a 4.0.0-alpha.20180401, or a 4.0.0 without alpha, or a 4.0.1, or Ray sends a bunch of code which justifies a 4.1.0, ...

@Shreeshrii
Collaborator Author

I would prefer that Ray give a clear statement about the next step for the 4.0 release.

@jbreiden Please check with Ray. Thanks!

@jbreiden
Contributor

jbreiden commented Mar 2, 2018

I would prefer Ray to speak for himself, too! However, I don't think there will be large Tesseract changes from him in either short or medium term.

@amitdo
Collaborator

amitdo commented Mar 2, 2018

Zdenko, I also think we should finally release 4.0.0. It's time to get rid of the alpha status.

@amitdo
Collaborator

amitdo commented Mar 2, 2018

If you decide to release it soon, don't forget to first update ccutil/version.h

@jbreiden
Contributor

jbreiden commented Mar 3, 2018

Ha! Looks like they took 40f43111 after all, one day after deadline.

https://launchpad.net/ubuntu/+source/tesseract

@zdenop
Contributor

zdenop commented Mar 5, 2018 via email

@amitdo
Collaborator

amitdo commented Mar 9, 2018

We should make the decision ourselves.

What about this proposal:

  • Tag commit 40f4311 as 4.00-alpha.2+git.2219.40f43111.
  • 3-4 weeks from now, tag the latest commit in master as 4.0.0-beta.1.
  • Release 4.0.0 30-60 days after beta1 (maybe with one more beta and one rc).

https://semver.org/
https://packages.ubuntu.com/bionic/tesseract-ocr

Mark any non final 4.0.0 as 'pre-release'.

https://help.github.com/articles/creating-releases/

  1. If the release is unstable, select This is a pre-release to notify users that it's not ready for production.

For each (pre-)release, update ccutil/version.h.
https://github.com/tesseract-ocr/tesseract/blob/master/ccutil/version.h

@jbreiden
Contributor

jbreiden commented Mar 9, 2018

is there any info from Ray about 4.00 release?

No info.

Or at least how to tag "Ubuntu" release (4.00RC1, 4.00beta?...)?

Millions of people will use commit 40f4311 because of Ubuntu, and I think the main benefit of a tag is to help understand bug reports coming from these users. There have been many good tag proposals in this thread from @amitdo and @stweil and @zdenop and @WilliamTambellini. I don't have a strong opinion about which one is best. If I were forced to choose, I'd probably tag commit 40f4311 with 4.0.0-beta.1. If that feels like too much commitment, then use a very specific tag like ubuntu18.04. Whatever is chosen, I think it makes sense to apply the same tag to the fast training data at commit 0e00fe6.

@Shreeshrii
Collaborator Author

Jeff, the traineddata files have a version string of 4.00.00alpha with a date (062917 if I remember correctly). tesseract also reports version of 4.00.00alpha. Will it be possible to change these in the Ubuntu 18.04 packages now?

@jbreiden
Contributor

jbreiden commented Mar 10, 2018

No more changes possible. Everything will look exactly as described here: #995 (comment)

@zdenop
Contributor

zdenop commented Mar 10, 2018 via email

@Shreeshrii
Collaborator Author

Shreeshrii commented Mar 10, 2018 via email

@amitdo
Collaborator

amitdo commented Mar 18, 2018

I suggest that we release 4.0.0 (final) by the end of April. About 2 weeks before this release, we should release 4.0.0-rc.1.

@amitdo
Collaborator

amitdo commented Mar 18, 2018

April 2018 :-)

@Shreeshrii
Collaborator Author

Shreeshrii commented Mar 18, 2018

For the final release should the files in tessdata repo be updated?

These have models for legacy tesseract that we need to keep.

However the LSTM models in those were improved in tessdata_best and then further improved / made faster in tessdata_fast.

I suggest that we update all the lstm related files in tessdata with files from tessdata_fast.

eg. for Hindi.

# combine_tessdata -d ./tessdata/hin.traineddata
Version string:Pre-4.0.0
0:config:size=739, offset=192
1:unicharset:size=180616, offset=931
2:unicharambigs:size=90293, offset=181547
3:inttemp:size=12791027, offset=271840
4:pffmtable:size=24823, offset=13062867
5:normproto:size=225187, offset=13087690
6:punc-dawg:size=426, offset=13312877
7:word-dawg:size=837458, offset=13313303
8:number-dawg:size=410, offset=14150761
9:freq-dawg:size=1242, offset=14151171
17:lstm:size=8874565, offset=14152413
18:lstm-punc-dawg:size=4322, offset=23026978
19:lstm-word-dawg:size=2726578, offset=23031300
20:lstm-number-dawg:size=122, offset=25757878
23:version:size=9, offset=25758000

# combine_tessdata -d ./tessdata_best/hin.traineddata
Version string:4.00.00alpha:hin:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=633, offset=192
17:lstm:size=11738347, offset=825
18:lstm-punc-dawg:size=3154, offset=11739172
19:lstm-word-dawg:size=143834, offset=11742326
20:lstm-number-dawg:size=234, offset=11886160
21:lstm-unicharset:size=7975, offset=11886394
22:lstm-recoder:size=1111, offset=11894369
23:version:size=80, offset=11895480

# combine_tessdata -d ./tessdata_fast/hin.traineddata
Version string:4.00.00alpha:hin:synth20170629
0:config:size=633, offset=192
17:lstm:size=965584, offset=825
18:lstm-punc-dawg:size=3154, offset=966409
19:lstm-word-dawg:size=143834, offset=969563
20:lstm-number-dawg:size=234, offset=1113397
21:lstm-unicharset:size=7975, offset=1113631
22:lstm-recoder:size=1111, offset=1121606
23:version:size=30, offset=1122717

For Hindi, the following components of the traineddata in the tessdata repo

0:config:size=739, offset=192
17:lstm:size=8874565, offset=14152413
18:lstm-punc-dawg:size=4322, offset=23026978
19:lstm-word-dawg:size=2726578, offset=23031300
20:lstm-number-dawg:size=122, offset=25757878
23:version:size=9, offset=25758000

should be replaced by the following from tessdata_fast

0:config:size=633, offset=192
17:lstm:size=965584, offset=825
18:lstm-punc-dawg:size=3154, offset=966409
19:lstm-word-dawg:size=143834, offset=969563
20:lstm-number-dawg:size=234, offset=1113397
21:lstm-unicharset:size=7975, offset=1113631
22:lstm-recoder:size=1111, offset=1121606

Also, the version string should be updated appropriately to reflect the combo.

This will also make the size of traineddata files in tessdata repo smaller.
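
To put a number on the size difference, the listings above can be parsed and summed. This is a minimal Python sketch; the helper names (`parse_listing`, `lstm_bytes`) are made up for illustration, and the input is assumed to be text in the same `index:name:size=..., offset=...` layout as the `combine_tessdata -d` output pasted above.

```python
import re

# One component line, e.g. "17:lstm:size=965584, offset=825".
ENTRY_RE = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_listing(text):
    """Return {component_name: size_in_bytes} for every component line."""
    sizes = {}
    for line in text.splitlines():
        m = ENTRY_RE.match(line.strip())
        if m:  # the "Version string:" line and blank lines are skipped
            sizes[m.group(2)] = int(m.group(3))
    return sizes


def lstm_bytes(sizes):
    """Total size of all components whose name starts with 'lstm'."""
    return sum(size for name, size in sizes.items() if name.startswith("lstm"))


# Entries copied from the Hindi listings in this comment.
TESSDATA_HIN = """Version string:Pre-4.0.0
0:config:size=739, offset=192
17:lstm:size=8874565, offset=14152413
18:lstm-punc-dawg:size=4322, offset=23026978
19:lstm-word-dawg:size=2726578, offset=23031300
20:lstm-number-dawg:size=122, offset=25757878
23:version:size=9, offset=25758000
"""

FAST_HIN = """Version string:4.00.00alpha:hin:synth20170629
0:config:size=633, offset=192
17:lstm:size=965584, offset=825
18:lstm-punc-dawg:size=3154, offset=966409
19:lstm-word-dawg:size=143834, offset=969563
20:lstm-number-dawg:size=234, offset=1113397
21:lstm-unicharset:size=7975, offset=1113631
22:lstm-recoder:size=1111, offset=1121606
"""

old = lstm_bytes(parse_listing(TESSDATA_HIN))
new = lstm_bytes(parse_listing(FAST_HIN))
print(f"LSTM components: {old} bytes now, {new} bytes after the swap")
```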

@amitdo
Collaborator

amitdo commented Mar 18, 2018

Good idea, but it should not delay the final 4.0.0 release.

@Shreeshrii
Collaborator Author

Thinking about this some more, I think a better alternative will be to remove the lstm files from the traineddata in tessdata.

This will ensure there is no conflict in different config files needed for legacy and LSTM models.

The traineddata file will become smaller.

There will be no need to update the lstm models in tessdata in future.

It will be easier for users:

tessdata for --oem 0
tessdata_fast for --oem 1
tessdata_best for LSTM training

@stweil could implement a check that --oem 0 is only used with traineddata files that have a version string of Version string:Pre-4.0.0.

However, this misses the case where default --oem mode was set to 2 or 1 in the config files in tessdata. I will look to see how many such cases are there.

@Shreeshrii
Collaborator Author

Shreeshrii commented Mar 19, 2018

$ grep engine_mode *.config
ara.config:tessedit_ocr_engine_mode 1
hin.config:tessedit_ocr_engine_mode 2

Only two languages come up. For Hindi, OEM 1 with tessdata_fast is much better than 2 in tessdata.

I propose replacing these two traineddata files in tessdata with their counterparts from tessdata_fast.

Since their version string will not be Version string:Pre-4.0.0, the program should not crash, if the check is implemented.

We can document this in readme in tessdata repo.
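
The proposed check could look roughly like this. It is a hypothetical Python sketch of the logic only, not Tesseract code: `version_string` and `check_engine_mode` are invented names, and the version string is assumed to be read from the first `Version string:` line of a `combine_tessdata -d` style listing.

```python
LEGACY_VERSION = "Pre-4.0.0"  # version string of traineddata with legacy models
PREFIX = "Version string:"


def version_string(listing):
    """Extract the version string from a combine_tessdata -d style listing."""
    for line in listing.splitlines():
        if line.startswith(PREFIX):
            return line[len(PREFIX):]
    return None  # no version line found


def check_engine_mode(oem, version):
    """Return an error message, or None if the combination looks usable."""
    if oem == 0 and version != LEGACY_VERSION:
        # Fail with a clear message instead of crashing later on.
        return f"--oem 0 needs legacy data, but traineddata reports {version!r}"
    return None


legacy_listing = "Version string:Pre-4.0.0\n0:config:size=739, offset=192"
fast_listing = "Version string:4.00.00alpha:hin:synth20170629\n0:config:size=633, offset=192"

print(check_engine_mode(0, version_string(legacy_listing)))
print(check_engine_mode(0, version_string(fast_listing)))
```

With such a check, running --oem 0 against the replaced ara and hin files would produce an error message rather than a crash.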

@amitdo amitdo mentioned this issue Mar 20, 2018
@Shreeshrii
Collaborator Author

@stweil Since you probably use --oem 0 in your projects, what do you think of this idea?

remove the lstm files from the traineddata in tessdata.

@theraysmith
Contributor

theraysmith commented Mar 21, 2018 via email

@Shreeshrii
Collaborator Author

Shreeshrii commented Mar 21, 2018

Ray,

Since you mentioned that best can be integerized to make it faster, and there are already three repos with traineddata files, I thought of updating the lstm files in the traineddata in tessdata with the integerized best with a Version string such as:

Version string:Pre-4.0.0+4.00.00alpha:nld:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1]:best2int20180321

i.e. the Version string from tessdata + the Version string from tessdata_best, with best2int20180321 appended
