
Fine tuning training for a mix language tessdata #15

Closed
amitm02 opened this issue Jan 4, 2018 · 6 comments

@amitm02

amitm02 commented Jan 4, 2018

If I understand correctly, traineddata files that start with a capital letter are "mixed-language" traineddata (e.g. Hebrew = heb+eng).
Was it produced by combining the "heb" and "eng" traineddata files, or was it trained from scratch on mixed-language data?
Is there anything I should do differently if I want to fine-tune the "Hebrew" traineddata compared to the "heb" traineddata?

@ngduyanhece

I have the same question about jpn.traineddata and Japanese.traineddata: what are the differences between them, and how can I fine-tune Japanese.traineddata?

@Shreeshrii
Contributor

Unpack the traineddata (Hebrew or Japanese). Run dawg2wordlist to get the input wordlist files, in case you want to change them.

You may need to add 'Hebrew' or 'Japanese' as valid language codes in training/language-specific.sh and create subfolders for them under langdata with the unpacked files.

Alternatively, you can modify the heb or jpn langdata folders with the new files and train using the Hebrew or Japanese best traineddata for extracting the lstm model to continue from.
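The unpack-and-extract steps above can be sketched as shell commands. This is a minimal illustration, assuming Hebrew.traineddata has been downloaded from the tessdata_best repository and that the Tesseract training tools are on PATH; file names follow the usual LSTM component naming.

```shell
# Unpack all components of the traineddata into files prefixed "Hebrew."
# (produces Hebrew.lstm, Hebrew.lstm-unicharset, Hebrew.lstm-word-dawg, ...).
combine_tessdata -u Hebrew.traineddata Hebrew.

# Convert the packed word dawg back into a plain-text wordlist,
# which can then be edited and re-used for training.
dawg2wordlist Hebrew.lstm-unicharset Hebrew.lstm-word-dawg Hebrew.wordlist
```
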

See

https://github.com/tesseract-ocr/tesseract/blob/e2e0b90666d3798c6e38de9ebc0524b3c2573dea/doc/combine_tessdata.1.asc

https://github.com/tesseract-ocr/tesseract/blob/e2e0b90666d3798c6e38de9ebc0524b3c2573dea/doc/dawg2wordlist.1.asc

https://github.com/tesseract-ocr/tesseract/blob/e2e0b90666d3798c6e38de9ebc0524b3c2573dea/training/language-specific.sh

@amitm02
Author

amitm02 commented Jan 16, 2018

@Shreeshrii thanks for replying. I'm not sure I understand the answers to my original questions.
I'm using tesseract 4 with LSTM.

  1. Was the "Hebrew" tessdata produced by combining the "heb" and "eng" traineddata files, or was it trained from scratch on mixed-language data?

  2. I want to do fine-tuning on scanned document images that builds upon the existing "Hebrew" tessdata. Is there anything I should do differently than with the "heb" tessdata?

@Shreeshrii
Contributor

The training was done by @theraysmith at Google. I only know what he has posted in these forums. Please see tesseract-ocr/tessdata#62 (comment), where he explains the difference between models for 'scripts' and models for 'languages'.

@kotebeg

kotebeg commented Sep 10, 2019

I am going to fine-tune one of the tesseract_best traineddata files with new fonts, but I am not sure how many pages I should use for training, how many iterations to run, or how to avoid degrading the existing traineddata file. Are there any recommendations about those parameters?

@Shreeshrii
Contributor

@kotebeg Please see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

Use the tesseract-ocr Google group for asking questions.
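For reference, a fine-tuning run of the kind discussed in this thread can be sketched as below. This is only an illustration, not a recommendation: the file names, the listfile, and the iteration count are placeholders, and it assumes the lstm model has already been extracted from a best traineddata and that .lstmf training samples have been generated.

```shell
# Extract the LSTM model from the best traineddata to continue training from.
combine_tessdata -e Hebrew.traineddata Hebrew.lstm

# Fine-tune starting from the extracted model.
# heb.training_files.txt is an assumed listfile naming the .lstmf samples;
# --max_iterations 400 is illustrative (the wiki discusses suitable values).
lstmtraining \
  --model_output output/hebrew_finetuned \
  --continue_from Hebrew.lstm \
  --traineddata Hebrew.traineddata \
  --train_listfile heb.training_files.txt \
  --max_iterations 400
```
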

@stweil stweil closed this as completed May 17, 2023