deu: Remove unwanted dependency #17

stweil · 2018-02-01T14:30:02Z

The data included a configuration which required frk.traineddata
("tessedit_load_sublangs frk"). Remove that.

Signed-off-by: Stefan Weil [email protected]

stweil · 2018-02-01T15:05:30Z

@jbreiden, there is no such dependency for tessdata_fast/deu.traineddata, so that's fine for the new Debian package tesseract-ocr-deu. Other traineddata files have dependencies which are currently not modeled in their Debian package (for example tesseract-ocr-aze), resulting in errors for the user:

# tesseract -l aze anyimage.png -
Error opening data file /usr/share/tesseract-ocr/4.00/aze_cyrl.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'aze_cyrl'

jbreiden · 2018-02-01T18:45:59Z

How can I see a complete list of these dependencies?

stweil · 2018-02-01T18:53:04Z

Here it is:

$ grep sublangs *.config
aze.config:tessedit_load_sublangs aze_cyrl
aze_cyrl.config:tessedit_load_sublangs aze
srp_latn.config:# tessedit_load_sublangs srp
uzb.config:tessedit_load_sublangs uzb_cyrl
uzb_cyrl.config:tessedit_load_sublangs uzb

So aze depends on aze_cyrl, and aze_cyrl depends on aze. In addition, uzb depends on uzb_cyrl and vice versa.

I extracted the *.config files from all tessdata_fast/*.traineddata using combine_tessdata -u.
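The extraction and grep steps described above could be scripted roughly as follows. This is a minimal sketch, not the exact procedure used in this thread: the function names `extract_configs` and `list_sublang_deps` and the directory arguments are hypothetical, and it assumes `combine_tessdata` is on the PATH. Note that `combine_tessdata -u` unpacks every component of a traineddata file, not only the config.

```shell
#!/bin/sh
# Hypothetical helper: unpack every *.traineddata in src_dir into out_dir.
# "combine_tessdata -u FILE PREFIX" writes all components as PREFIX<component>,
# so deu.traineddata yields out_dir/deu.config (if it contains a config).
extract_configs() {
    src_dir="$1"
    out_dir="$2"
    mkdir -p "$out_dir"
    for f in "$src_dir"/*.traineddata; do
        [ -f "$f" ] || continue
        lang=$(basename "$f" .traineddata)
        combine_tessdata -u "$f" "$out_dir/$lang." >/dev/null
    done
}

# Hypothetical helper: print "lang: sublangs" for each active
# tessedit_load_sublangs line found in the *.config files of a directory.
# Lines commented out with a leading "#" are skipped.
list_sublang_deps() {
    dir="$1"
    for cfg in "$dir"/*.config; do
        [ -f "$cfg" ] || continue
        lang=$(basename "$cfg" .config)
        awk -v lang="$lang" '
            $1 == "tessedit_load_sublangs" {
                $1 = ""              # drop the variable name
                sub(/^ +/, "")       # trim the leading separator
                print lang ": " $0
            }' "$cfg"
    done
}
```

A possible invocation would be `extract_configs tessdata_fast /tmp/configs` followed by `list_sublang_deps /tmp/configs`. Unlike a plain grep, this skips commented-out entries such as the one in srp_latn.config.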

jbreiden · 2018-02-01T19:14:50Z

I'd better get that fixed before the International Summit of the Book, which will be held in Baku this year.

jbreiden · 2018-02-01T19:46:28Z

Okay, package dependencies updated. Will be in Debian unstable tomorrow. Thanks for pointing this out.

amitdo · 2018-02-01T20:05:12Z

tesseract-ocr/tessdata@3a94ddd#commitcomment-23584234

'jpn' loads 'jpn_vert' as a secondary language so it can try it in case the text is rendered vertically. This seems to work most of the time as a reasonable solution.

That does not match Stefan's list. The same principle applies to the two 'chi' languages.

stweil · 2018-02-01T20:15:57Z

That comment is currently wrong for tessdata_fast (which is used for Debian).
tessdata_best includes more dependencies (for aze, aze_cyrl, ben, chi_sim, chi_tra, ell, jpn, kor, mal, srp, srp_latn, tel, uzb and uzb_cyrl).
I wonder whether this difference between tessdata_best and tessdata_fast is really intentional.

jbreiden · 2018-02-01T20:29:07Z

> I wonder whether this difference between tessdata_best and tessdata_fast is really intentional.

Can't possibly be intentional. @theraysmith

Shreeshrii · 2018-02-02T03:33:00Z

This probably just confirms the view that tessdata_fast is NOT the integer version of tessdata_best. Rather, it is the result of a different training, maybe with a different network spec.

stweil · 2018-02-02T06:17:49Z

See also my related question on the tesseract-dev forum.

theraysmith · 2018-03-20T02:58:07Z

Jeff is right. The differences in dependencies are not intentional.

How does fast relate to best:
Best is what it says it is. For languages where we have eval data, it is the network configuration that yielded the best results on the eval data.
Fast is a speed/accuracy compromise, based on my own judgement, as to what offered the best "value for money" in speed vs accuracy. For some languages, this is still best, but for most not.
The "best value for money" network configuration was then integerized for further speed.
If you want best to run faster, it is easy to integerize "best" at the cost of a small loss in accuracy.
It seemed pointless to add to the confusopoly of langdatas further by providing the integerized best.

For languages that have no eval data, both best and fast are a guess, based on using a configuration that worked well for the most closely related language.
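For readers who want to try the integerization of a "best" model mentioned above, one possible approach is sketched here. It assumes the `-c` mode of `combine_tessdata`, which converts the LSTM component to integer in place, so the sketch works on a copy. The function name `integerize_copy` and the file paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy a tessdata_best model and integerize the copy.
# "combine_tessdata -c FILE" rewrites FILE in place, so the original
# float model is left untouched.
integerize_copy() {
    src="$1"
    dst="$2"
    if [ ! -f "$src" ]; then
        echo "no such file: $src" >&2
        return 1
    fi
    cp "$src" "$dst" && combine_tessdata -c "$dst"
}
```

For example, `integerize_copy tessdata_best/deu.traineddata deu_int.traineddata` would leave the original file as it was and write the integer model next to it. The result is not expected to match the corresponding tessdata_fast file, since fast uses a different network configuration for most languages.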

Shreeshrii · 2018-03-20T04:46:14Z

Thanks for the clarification, Ray.

> For some languages, this is still best, but for most not.

Can you share the list of languages where it is best?

deu: Remove unwanted dependency

ed5410b

zdenop merged commit 3e6ec16 into tesseract-ocr:master Feb 2, 2018

stweil deleted the deu branch February 2, 2018 11:16

Shreeshrii referenced this pull request in nguyenq/VietOCR3 Mar 20, 2018

Upgrade Tesseract 4.00 fast language packs

e3f5490

Shreeshrii mentioned this pull request Mar 20, 2018

FYI - tesseract4.0.0-beta.1 traineddata files for scripts manisandro/gImageReader#323

Closed

Shreeshrii mentioned this pull request Mar 20, 2018

Updated based on Ray's comment tesseract-ocr/tessdata_fast#13

Merged

amitdo mentioned this pull request Mar 20, 2018

fast vs. best tesseract-ocr/tesseract#1404

Closed
