
fast vs. best #1404

Closed
amitdo opened this issue Mar 20, 2018 · 25 comments

@amitdo
Collaborator

amitdo commented Mar 20, 2018

#943 (comment)

theraysmith commented on May 23, 2017

Far greater performance improvements can be made by making the network smaller. As I already indicated, I have had some very good results in this area, with a network 3x faster than the legacy code (for English) and much faster than the legacy code for complex scripts.

#995 (comment)

theraysmith commented on Jul 12, 2017

2 parallel sets of tessdata: "best" and "fast". "Fast" will exceed the speed of legacy Tesseract in real time, provided you have the required parallelism components, and in total CPU will be only slightly slower for English. Way faster for most non-Latin languages, while being <5% worse than "best".

tesseract-ocr/tessdata_best#17 (comment)

@theraysmith commented on Mar 20, 2018

How does fast relate to best:
Best is what it says it is. For languages where we have eval data, it is the network configuration that yielded the best results on the eval data.
Fast is a speed/accuracy compromise, based on my own judgement as to what offered the best "value for money" in speed vs. accuracy. For some languages this is still best, but for most it is not.
The "best value for money" network configuration was then integerized for further speed.
If you want best to run faster, it is easy to integerize "best" at the cost of a small loss in accuracy.
It seemed pointless to add to the confusopoly of langdatas further by providing the integerized best.

For languages that have no eval data, both best and fast are a guess, based on using a configuration that worked well for the most closely related language.

@stweil
Contributor

stweil commented Mar 20, 2018

Still unsolved: How to build tessdata_fast from tessdata_best. We know how to replace the float data in tessdata_best by integer data, but we don't know how the network was made smaller. See also the discussion on Google Groups.

@Shreeshrii
Collaborator

Fast may be using a different network spec; e.g., Ray's DAS 2016 slides show a different network string. It is possible that Ray has not posted some of those methods to GitHub.

@stweil
Contributor

stweil commented Mar 20, 2018

Yes, I also think that tessdata_fast uses a different network spec. Ray said that one step was making the network smaller. So which spec was used for each language / script, and how was the conversion done? Maybe @jbreiden can find that out.

@amitdo
Collaborator Author

amitdo commented Mar 20, 2018

My guess is that fast and best were trained independently.

First, fast is trained with a spec that produces a smaller net than best. As a result of the smaller model, prediction will be faster.
Then, the float->int conversion is done, which further reduces the size of the model and makes it even faster if your CPU supports AVX2.
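The float→int step can be pictured as standard 8-bit weight quantization. The sketch below is a generic per-row symmetric quantization in NumPy, offered only as an illustration of the idea; the function names and the per-row scaling scheme are my assumptions, not Tesseract's actual conversion code.

```python
import numpy as np

def quantize_rows(weights, bits=8):
    """Quantize a float weight matrix to int8, one scale per row."""
    qmax = 2 ** (bits - 1) - 1                         # 127 for int8
    scales = np.abs(weights).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                          # avoid div-by-zero for all-zero rows
    q = np.round(weights / scales).astype(np.int8)
    return q, scales.ravel()

def dequantize(q, scales):
    """Recover approximate float weights from the int8 matrix and row scales."""
    return q.astype(np.float32) * scales[:, None]

w = np.array([[0.5, -1.0, 0.25],
              [0.1,  0.2, -0.4]], dtype=np.float32)
q, s = quantize_rows(w)
w2 = dequantize(q, s)
```

The int8 matrix is a quarter the size of a float32 one, and integer dot products vectorize well with SIMD instruction sets such as AVX2, which matches the size and speed effects described above. The rounding error per weight is at most half a scale step, so accuracy loss is small.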

@jbreiden
Contributor

@amitdo is exactly right. You cannot derive fast from best. I'm sorry that I said the opposite earlier; I was confused and wrong.

@theraysmith
Contributor

theraysmith commented Mar 20, 2018 via email

@Shreeshrii
Collaborator

The network configuration is stored in the lstm data in the traineddata.
With a small change to combine_tessdata, I produced the attached.

Ray,

Some of the languages have two entries, eg. for Hindi

Version string:4.00.00alpha:hin:synth20170629
LSTM training info:Network str:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx384O1c1], flags=41, iteration=2736900, sample_iteration=2737076, null_char=2, learning_rate=0.001, momentum=0.5, adam_beta=0.999

and

Version string:4.00.00alpha:hin:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
LSTM training info:Network str:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1], flags=40, iteration=803600, sample_iteration=803651, null_char=2, learning_rate=0.001, momentum=0.5, adam_beta=0.999

I see flags=41 and flags=40. What is the difference between the two?

I don't see any reference to these flags in https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line
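For what it's worth, treated as a bitmask the two values differ only in their lowest bit, which would be consistent with one entry being the integerized model and the other the float one. The meaning of bit 0 is a guess here, not something confirmed in this thread:

```python
# The two Hindi entries report flags=41 and flags=40.
a, b = 41, 40
print(format(a, "06b"))  # 101001
print(format(b, "06b"))  # 101000
print(a ^ b)             # 1 -> the two flag values differ only in bit 0
```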


@theraysmith
Contributor

theraysmith commented Mar 21, 2018 via email

@amitdo
Collaborator Author

amitdo commented Mar 21, 2018

@theraysmith, your best list ended at 'Kannada'.

@amitdo
Collaborator Author

amitdo commented Mar 25, 2018

Then, the float->int conversion is done, which further reduces the size of the model and makes it even faster if your CPU supports AVX2.

I wonder if the int net will be faster than the float net with just AVX/SSE.

@stweil
Contributor

stweil commented Mar 26, 2018

I wonder whether Tesseract could generate the integer net on the fly from a best net right after starting an OCR process.

@ghost

ghost commented Apr 19, 2018

@Shreeshrii have you figured out how to generate Best or Fast manually?

@Shreeshrii
Collaborator

@Christophered What do you mean by 'manually'?

They can be generated using tesstrain.sh, lstmtraining, etc., as described in the wiki page.

The same checkpoint can be used for creating both 'best' and 'fast' formats, i.e. float and integer models.

lstmtraining \
  --stop_training \
  --continue_from $train_output_dir/plusdeva_checkpoint \
  --traineddata $train_output_dir/$Lang/$Lang.traineddata \
  --model_output $best_trained_data_file

lstmtraining \
  --stop_training \
  --convert_to_int \
  --continue_from $train_output_dir/plusdeva_checkpoint \
  --traineddata $train_output_dir/$Lang/$Lang.traineddata \
  --model_output $fast_trained_data_file

An existing 'best' float model can be converted to 'fast' integer using combine_tessdata.

@ghost

ghost commented Apr 19, 2018

@Shreeshrii I meant what are the original specifications and settings to generate a training model similar to Best or Fast from scratch, not fine-tuning.

@Shreeshrii
Collaborator

To duplicate Ray's training we would need the same langdata, font list, etc. That info is not available.

The network spec is listed in the version string for best. For fast, I have added the info from Ray on a wiki page.

@ghost

ghost commented Apr 19, 2018

@theraysmith Can you share the base data that you used for training Arabic (configs, prohibited characters, word list, etc.)? Also, how many lines and fonts did you use, and how much time did it take to train? I am thinking of seriously improving the Tesseract Arabic model for everyone. Waiting for your reply, Ray.

@Shreeshrii
Collaborator

Shreeshrii commented Oct 2, 2018

Network specs for 'tessdata_fast' for all languages and scripts are available on the wiki page - https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast#version-string--40000alpha--network-specification

sample

Version string:4.00.00alpha:ara:synth20170629
LSTM training info:Network str:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1], flags=41, iteration=5521100, sample_iteration=5544718, null_char=2, learning_rate=0.001, momentum=0.5, adam_beta=0.999

Fontlists and langdata for LSTM training are available in the new repository https://github.com/tesseract-ocr/langdata_lstm

e.g. Arabic files are in https://github.com/tesseract-ocr/langdata_lstm/tree/master/ara

@EhsanKia

EhsanKia commented May 22, 2020

Sorry, the last few comments are a little conflicting. Is it true that simply converting best to int generates fast, or do they have different net specs?

From the wiki page, eng_best uses
[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
whereas eng_fast uses:
[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1]
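Comparing the two spec strings token by token makes the difference concrete. The sketch below is a rough tokenizer whose grammar is inferred only from the spec strings quoted in this thread; it is not the real VGSL parser:

```python
import re

# Layer patterns guessed from the examples above:
# Ct<h>,<w>,<d> convolution, Mp<h>,<w> maxpool,
# L{f,r}{x,y}[s]<n> LSTM layers, O1c<n> output.
LAYER = re.compile(r"Ct\d+,\d+,\d+|Mp\d+,\d+|L[fr][xy]s?\d+|O1c\d+")

def layers(spec):
    """Split a spec string into its layer tokens."""
    return LAYER.findall(spec)

best = "[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]"
fast = "[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1]"

for b, f in zip(layers(best), layers(fast)):
    marker = "  <- differs" if b != f else ""
    print(f"{b:12} {f:12}{marker}")
```

Running this shows the two specs share the same shape but eng_fast uses narrower LSTM layers (Lfys48 vs Lfys64, and Lfx192 vs Lfx512 before the output), i.e. a smaller network, not just integerized weights.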

So they are each trained separately with different configuration, and it's not possible to convert best into fast?

At this point, are there scripts allowing us, using all the available repos such as langdata_lstm, to exactly regenerate tessdata_fast/tessdata_best (assuming we have the fonts, as those maybe can't be distributed)? I know all the data for the tesstrain.sh flags is in langdata_lstm, but it would be nice if we could also add a file such as train_<lang>_<version>.sh which looks like

tesstrain.sh \
  --flag1=value1 \
  --flag2=value2 \
  --net_spec=....

With the exact command used to generate that tessdata.

@amitdo
Collaborator Author

amitdo commented May 22, 2020

Is it true that simply converting best to int generates fast,

Not 'fast' as in the one from the fast repo, but still much faster than best.

or do they have different net specs?

best (float) -> int won't change the spec.

From the wiki page....
So they are each trained separately with different configuration, and it's not possible to convert best into fast?

You got that right.

@Shreeshrii
Collaborator

converting best to int generates fast

Converting tessdata_best to int generates the int version of the best model, which is faster than the best model and has similar accuracy.

The models in tessdata_fast are int models and may have been trained with a different network spec.

