deu: Remove unwanted dependency #17

Merged Feb 2, 2018

@stweil stweil commented Feb 1, 2018

The data included a configuration which required frk.traineddata
("tessedit_load_sublangs frk"). Remove that.

Signed-off-by: Stefan Weil [email protected]

@stweil
Contributor Author

stweil commented Feb 1, 2018

@jbreiden, there is no such dependency for tessdata_fast/deu.traineddata, so that's fine for the new Debian package tesseract-ocr-deu. Other traineddata files have dependencies which are currently not modeled in their Debian package (for example tesseract-ocr-aze), resulting in errors for the user:

# tesseract -l aze anyimage.png -
Error opening data file /usr/share/tesseract-ocr/4.00/aze_cyrl.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'aze_cyrl'

@jbreiden

jbreiden commented Feb 1, 2018

How can I see a complete list of these dependencies?

@stweil
Contributor Author

stweil commented Feb 1, 2018

Here it is:

$ grep sublangs *.config
aze.config:tessedit_load_sublangs aze_cyrl
aze_cyrl.config:tessedit_load_sublangs aze
srp_latn.config:# tessedit_load_sublangs srp
uzb.config:tessedit_load_sublangs uzb_cyrl
uzb_cyrl.config:tessedit_load_sublangs uzb

So aze depends on aze_cyrl and aze_cyrl depends on aze. In addition, uzb depends on uzb_cyrl and vice versa.

I extracted the *.config files from all tessdata_fast/*.traineddata using combine_tessdata -u.
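For anyone who wants to turn that grep output into a dependency map (for example to model package dependencies), here is a minimal Python sketch. The function name `parse_sublang_deps` is made up for illustration, and splitting multiple sublanguages on `+` is an assumption about how the value is written; every config listed above names a single sublanguage. Commented-out settings such as the one in srp_latn.config are skipped.

```python
from collections import defaultdict

# Output of `grep sublangs *.config` as quoted in this thread.
GREP_OUTPUT = """\
aze.config:tessedit_load_sublangs aze_cyrl
aze_cyrl.config:tessedit_load_sublangs aze
srp_latn.config:# tessedit_load_sublangs srp
uzb.config:tessedit_load_sublangs uzb_cyrl
uzb_cyrl.config:tessedit_load_sublangs uzb
"""

SUFFIX = ".config"


def parse_sublang_deps(grep_output):
    """Map each language to the sublanguages its config asks to load."""
    deps = defaultdict(list)
    for line in grep_output.splitlines():
        filename, sep, setting = line.partition(":")
        if not sep:
            continue  # not a "file:match" line
        tokens = setting.split()
        # A commented-out setting starts with "#", so it is ignored here.
        if len(tokens) < 2 or tokens[0] != "tessedit_load_sublangs":
            continue
        lang = filename[:-len(SUFFIX)] if filename.endswith(SUFFIX) else filename
        # Assumption: several sublanguages would be joined with "+".
        deps[lang].extend(name for name in tokens[1].split("+") if name)
    return dict(deps)


if __name__ == "__main__":
    for lang, sublangs in sorted(parse_sublang_deps(GREP_OUTPUT).items()):
        print(f"{lang} requires: {', '.join(sublangs)}")
```

The resulting map contains only aze, aze_cyrl, uzb and uzb_cyrl, which matches the mutual dependencies described above.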

@jbreiden

jbreiden commented Feb 1, 2018

I'd better get that fixed before the International Summit of the Book, which will be held in Baku this year.

@jbreiden

jbreiden commented Feb 1, 2018

Okay, package dependencies updated. Will be in Debian unstable tomorrow. Thanks for pointing this out.

@amitdo

amitdo commented Feb 1, 2018

tesseract-ocr/tessdata@3a94ddd#commitcomment-23584234

> 'jpn' loads 'jpn_vert' as a secondary language so it can try it in case the text is rendered vertically. This seems to work most of the time as a reasonable solution.

That does not match Stefan's list. The same principle applies to the two 'chi' models.

@stweil
Contributor Author

stweil commented Feb 1, 2018

That comment is currently wrong for tessdata_fast (which is used for Debian).
tessdata_best includes more dependencies (for aze, aze_cyrl, ben, chi_sim, chi_tra, ell, jpn, kor, mal, srp, srp_latn, tel, uzb and uzb_cyrl).
I wonder whether this difference between tessdata_best and tessdata_fast is really intentional.
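To make the difference concrete, here is a small Python sketch that compares the two sets of languages. The tessdata_fast set is taken from the grep output earlier in this thread and the tessdata_best set from the list in this comment; the names `FAST_WITH_DEPS`, `BEST_WITH_DEPS` and `only_in_best` are made up for illustration.

```python
# Languages whose config loads a sublanguage, per the grep output on tessdata_fast.
FAST_WITH_DEPS = {"aze", "aze_cyrl", "uzb", "uzb_cyrl"}

# Languages reported in this comment as having dependencies in tessdata_best.
BEST_WITH_DEPS = {
    "aze", "aze_cyrl", "ben", "chi_sim", "chi_tra", "ell", "jpn",
    "kor", "mal", "srp", "srp_latn", "tel", "uzb", "uzb_cyrl",
}


def only_in_best(best, fast):
    """Return the languages that have dependencies in best but not in fast."""
    return sorted(best - fast)


if __name__ == "__main__":
    print(" ".join(only_in_best(BEST_WITH_DEPS, FAST_WITH_DEPS)))
```

This prints the ten languages whose dependencies exist only in tessdata_best: ben chi_sim chi_tra ell jpn kor mal srp srp_latn tel.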

@jbreiden

jbreiden commented Feb 1, 2018

> I wonder whether this difference between tessdata_best and tessdata_fast is really intentional.

Can't possibly be intentional. @theraysmith

@Shreeshrii
Contributor

Probably just confirms the view that tessdata_fast is NOT the integer version of tessdata_best. Rather, it is the result of a different training, maybe with a different network spec.

@stweil
Contributor Author

stweil commented Feb 2, 2018

See also my related question on the tesseract-dev forum.

@zdenop zdenop merged commit 3e6ec16 into tesseract-ocr:master Feb 2, 2018
@stweil stweil deleted the deu branch February 2, 2018 11:16
@theraysmith
Contributor

Jeff is right. The differences in dependencies are not intentional.

How does fast relate to best:
Best is what it says it is. For languages where we have eval data, it is the network configuration that yielded the best results on the eval data.
Fast is a speed/accuracy compromise, based on my own judgement, as to what offered the best "value for money" in speed vs accuracy. For some languages, this is still best, but for most not.
The "best value for money" network configuration was then integerized for further speed.
If you want best to run faster, it is easy to integerize "best" at the cost of a small loss in accuracy.
It seemed pointless to add to the confusopoly of langdatas further by providing the integerized best.

For languages that have no eval data, both best and fast are a guess, based on using a configuration that worked well for the most closely related language.

@Shreeshrii
Contributor

Thanks for the clarification, Ray.

> For some languages, this is still best, but for most not.

Can you share the list of languages where it is best?
