Geresh and Gershayim are not included #130
Comments
Duplicate of #82 (comment) |
Anyway, *.training_text files have not been updated for years. |
Is there a way to affect the scanned webpages? |
Yes, with some hints from other files. I don't remember the fine details right now. |
I think 'desired_words' and 'forbidden_words' can also be used. |
These lists are used in Ray's synthetic training data creation pipeline. As
far as I know, the tesstrain.sh training process does not use them.
On Thu 5 Jul, 2018, 9:26 PM, Amit D. wrote:
I think 'desired_words' and 'forbidden_words' can also be used.
|
True. |
See tesseract-ocr/tessdata#62 (comment), where theraysmith commented on Aug 3, 2017.
So for undesired words a 'lang.bad_words' file should be used. |
vie has an 'alphabet' file: |
https://github.com/tesseract-ocr/langdata/blob/106c9b31bea9d30814fc116cbcb9c267dee7df70/heb/heb.training_text
I couldn't find the Hebrew punctuation marks Geresh (U+05F3) or Gershayim (U+05F4) in the training text linked above.
https://en.wikipedia.org/wiki/Geresh
https://en.wikipedia.org/wiki/Gershayim
These were not widely used until pretty recently when a new keyboard layout was introduced.
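For anyone who wants to repeat the check on their own copy of a training text, here is a minimal sketch in Python. The helper name `count_hebrew_punctuation` is hypothetical and not part of the Tesseract or langdata tooling; it only counts the two Unicode code points named above, so ASCII apostrophes or double quotes used as stand-ins are not counted.

```python
# Hypothetical helper for illustration; not part of Tesseract or langdata.
GERESH = "\u05F3"     # HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05F4"  # HEBREW PUNCTUATION GERSHAYIM


def count_hebrew_punctuation(text: str) -> dict:
    """Count occurrences of Geresh and Gershayim in a string."""
    return {
        "geresh": text.count(GERESH),
        "gershayim": text.count(GERSHAYIM),
    }


if __name__ == "__main__":
    # Inline sample containing one Geresh and one Gershayim.
    sample = "\u05D2\u05F3 \u05E6\u05D4\u05F4\u05DC"
    print(count_hebrew_punctuation(sample))
```

To check a real file, read it as UTF-8 (for example a local copy of heb.training_text, assuming you have downloaded one) and pass its contents to the function. Zero for both counts would match what is reported here.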