Adding khmLimon.traineddata #27
Khm (unicode) + Limon (S1S2F1F2R1R2) finetune traineddata
…etuneEngine Adding KhmLimon finetune engine
Thank you for this. Do you have any statistics on improved accuracy with this for Limon fonts? I am not sure what the policy is regarding community-contributed traineddata files, though we do have Ancient Greek traineddata contributed by Nick White. |
Hi Shreeshrii, https://github.com/eddieantonio/isri-ocr-evaluation-tools 2018_06_04_KhmLimon_result_for_github.zip Thanks, |
Thank you! @theraysmith @jbreiden @zdenop What is the policy for adding community contributed tessdata? In addition to this, I have generated traineddata files for Coptic and Javanese. |
Accuracy_Tesseract_4.0
|
Acc_LimonS1S2F1F2R1R2UnicodeEngine: Fine Tune For Limon S1, S2, F1, F2, R1 & R2 Unicode. Net_spec value: "[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx384 O1c1]"
|
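For readers unfamiliar with the Net_spec string quoted in the report, the following is a minimal Python sketch that splits it into tokens and prints a plain-language reading of each one. The letter meanings follow my understanding of Tesseract's VGSL notation; the function names (`describe_layer`, `describe_spec`) are hypothetical helpers, not part of Tesseract, and the parser only covers the token shapes that occur in this particular string.

```python
import re

# Net spec string quoted in the accuracy report in this thread.
NET_SPEC = "[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx384 O1c1]"

# Letter codes as understood from the VGSL notation (assumption, see lead-in).
ACTIVATIONS = {"s": "sigmoid", "t": "tanh", "r": "relu", "l": "linear", "m": "softmax"}
DIRECTIONS = {"f": "forward", "r": "reverse", "b": "bidirectional"}
OUTPUT_TYPES = {"l": "logistic", "s": "softmax", "c": "softmax with CTC"}


def describe_layer(token):
    """Return a short description of one token (hypothetical helper)."""
    m = re.fullmatch(r"(\d+),(\d+),(\d+),(\d+)", token)
    if m:
        b, h, w, d = m.groups()
        # A width of 0 is read here as "variable width".
        return f"input: batch={b}, height={h}, width={w} (0 = variable), depth={d}"
    m = re.fullmatch(r"C([strlm])(\d+),(\d+),(\d+)", token)
    if m:
        act, y, x, depth = m.groups()
        return f"convolution {y}x{x}, {depth} outputs, {ACTIVATIONS[act]}"
    m = re.fullmatch(r"Mp(\d+),(\d+)", token)
    if m:
        y, x = m.groups()
        return f"max-pool {y}x{x}"
    m = re.fullmatch(r"L([frb])([xy])(s?)(\d+)", token)
    if m:
        direction, axis, summ, n = m.groups()
        extra = ", summarising" if summ else ""
        return f"LSTM {DIRECTIONS[direction]} over {axis}{extra}, {n} outputs"
    m = re.fullmatch(r"O([012])([lsc])(\d+)", token)
    if m:
        dims, kind, classes = m.groups()
        return f"output: {dims}-d, {OUTPUT_TYPES[kind]}, {classes} class(es) as written"
    raise ValueError(f"unrecognised token: {token!r}")


def describe_spec(spec):
    """Split a bracketed spec on whitespace and describe each token."""
    return [describe_layer(tok) for tok in spec.strip("[]").split()]


if __name__ == "__main__":
    for line in describe_spec(NET_SPEC):
        print(line)
```

The class count in the final token is reported only as written in the string; as far as I know, the trainer takes the real number of classes from the unicharset rather than from this value.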
@zdenop I suggest that you merge this PR. The accuracy reports show that there is over 10% improvement in the recognition on newly finetuned fonts without any appreciable loss in accuracy for the earlier list. |
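As a rough illustration of what a character-accuracy figure like the ones in these reports measures, here is a minimal Python sketch. It assumes the common definition (ground-truth length minus edit distance, divided by ground-truth length) and counts Unicode code points rather than Khmer grapheme clusters. It is not the implementation used by the evaluation tools linked above, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        prev = cur
    return prev[-1]


def char_accuracy(ground_truth, ocr_output):
    """Fraction of ground-truth characters recovered (assumed definition)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_output)
    return (len(ground_truth) - errors) / len(ground_truth)


if __name__ == "__main__":
    # One substituted character out of four.
    print(char_accuracy("abcd", "abxd"))
```

Comparing this value for the base model and the fine-tuned model on the same test lines gives the kind of per-font difference discussed here.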
It is a question whether we want to mix Google-produced data with custom data... |
@zdenop In that case may I suggest an additional repo, tessdata_contrib |
Adding a new repository would be a possible solution. But wouldn't it be easier if people managed their own repositories with their tessdata contribution, and Tesseract could maintain a list with such repositories in the Wiki? In any case such tessdata contributions should ideally document everything needed to reproduce the training process (fonts, images, ground truth, texts, scripts, documentation, ...). |
I agree with @stweil: this is the best solution. Google will not update contributors' files... |
ok. I have added a page at
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-Contributions
ShreeDevi
|
Thanks. |
@zdenop Will you be able to apply this PR to the new repo or does @phyrumsk need to submit a new PR for https://github.com/tesseract-ocr/tessdata_contrib? |
tesseract-ocr/tessdata_contrib#1 adds this to tessdata_contrib repo. |
It's also possible to apply the original pull request in tessdata_contrib. That would preserve the original author and time, so that is the preferred solution. @zdenop, I currently cannot do this because of missing rights for tessdata_contrib. @phyrumsk, @Shreeshrii, it would be good to also have some documentation for each contributed model (who generated it, what is the name of the language or script, how can the training be reproduced, accuracy reports, ...). I suggest one documentation file per model ( |
I would prefer it too :-) |
I have added some more links there. It is difficult to do accuracy estimates because we do not have test data for all languages. Also, my training is mainly a proof of concept; I do not have the system resources to do millions of iterations to get high accuracy (error rates below 1). @stweil If you are able to run the training with more fonts/lines etc., I can provide the training text/scripts used. |
@stweil: try now. |
Thank you. This pull request was now merged in tesseract-ocr/tessdata_contrib@05fa41a. It can be closed here (I don't have the rights to do that myself). |
Hi, we fine-tuned the tesseract_best engine, which supported only Khmer Unicode fonts, with six new Limon fonts (Limon S1, S2, F1, F2, R1 and R2), using the same Tesseract netspec.
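The exact commands used for this fine-tuning are not shown in the thread. As a hedged sketch of what such a run typically looks like, the Python below only assembles (and does not execute) the two command lines I would expect: extracting the LSTM component from an existing model with `combine_tessdata -e`, then continuing training with `lstmtraining --continue_from`. All file paths, the iteration count, and the helper name `build_finetune_commands` are hypothetical placeholders.

```python
def build_finetune_commands(base_traineddata, train_listfile, output_prefix,
                            max_iterations=3000):
    """Return argv lists for a fine-tuning run (hypothetical helper).

    Nothing is executed here; the caller decides whether to run the commands.
    """
    # Assumed file name for the extracted LSTM model.
    lstm_path = base_traineddata.replace(".traineddata", ".lstm")

    # Step 1: extract the LSTM model from the existing traineddata.
    extract = ["combine_tessdata", "-e", base_traineddata, lstm_path]

    # Step 2: continue training from that model on the new font data.
    # The net spec is inherited from the extracted model, so none is passed.
    train = [
        "lstmtraining",
        "--continue_from", lstm_path,
        "--traineddata", base_traineddata,
        "--train_listfile", train_listfile,
        "--model_output", output_prefix,
        "--max_iterations", str(max_iterations),
    ]
    return [extract, train]


if __name__ == "__main__":
    # Placeholder paths; substitute real ones before running anything.
    cmds = build_finetune_commands(
        "khm.traineddata",              # hypothetical base model
        "limon/khm.training_files.txt",  # hypothetical list of .lstmf files
        "output/khmLimon",              # hypothetical checkpoint prefix
    )
    for cmd in cmds:
        print(" ".join(cmd))
```

Because training continues from the extracted model, the network keeps its original architecture, which is consistent with the statement that the same netspec was used.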