Geresh and Gershayim are not included #130

Open
yarons opened this issue Jul 5, 2018 · 11 comments

yarons commented Jul 5, 2018

https://github.com/tesseract-ocr/langdata/blob/106c9b31bea9d30814fc116cbcb9c267dee7df70/heb/heb.training_text

I couldn't find the Hebrew punctuation marks Geresh (׳) or Gershayim (״) anywhere in the training text linked above.

https://en.wikipedia.org/wiki/Geresh
https://en.wikipedia.org/wiki/Gershayim

These were not widely used until pretty recently when a new keyboard layout was introduced.
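For anyone who wants to check this locally, here is a minimal sketch that counts the two marks in a text, alongside the ASCII apostrophe and quotation mark that are often typed in their place. The function name `count_marks` is made up for illustration; the code points U+05F3 (Geresh) and U+05F4 (Gershayim) come from the Unicode Hebrew block.

```python
import unicodedata
from collections import Counter

GERESH = "\u05F3"      # HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05F4"   # HEBREW PUNCTUATION GERSHAYIM
ASCII_SUBSTITUTES = ("'", '"')  # commonly typed instead of the Hebrew marks


def count_marks(text):
    """Return {Unicode character name: occurrences} for the marks of interest."""
    counts = Counter(text)
    marks = (GERESH, GERSHAYIM) + ASCII_SUBSTITUTES
    return {unicodedata.name(ch): counts[ch] for ch in marks}


if __name__ == "__main__":
    import sys

    # Assumed usage: pass the path to a local copy of heb.training_text.
    with open(sys.argv[1], encoding="utf-8") as handle:
        for name, occurrences in count_marks(handle.read()).items():
            print(f"{name}: {occurrences}")
```

If the Hebrew marks come back as zero while the ASCII substitutes are non-zero, that would be consistent with the corpus having been typed on the older keyboard layout.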

amitdo commented Jul 5, 2018

Duplicate of #82 (comment)

amitdo commented Jul 5, 2018

Anyway, *.training_text files have not been updated for years.
They are automatically generated from a web corpus.

yarons (Author) commented Jul 5, 2018

Is there a way to influence which web pages get scanned?

amitdo commented Jul 5, 2018

Yes, with some hints from other files.

I don't remember the fine details right now.

amitdo commented Jul 5, 2018

I think 'desired_words' and 'forbidden_words' can also be used.

Shreeshrii (Contributor) commented Jul 5, 2018 via email

amitdo commented Jul 5, 2018

True.

amitdo commented Jul 6, 2018

tesseract-ocr/tessdata#62 (comment)

> theraysmith commented on Aug 3, 2017
>
> FYI: The wordlists are generated files, so it isn't a good idea to modify them, as the modifications will likely get overwritten in a future training. To help prevent the ß/B confusion, the words that you want to lose from the wordlists need to go in langdata/lang/lang.bad_words.

So for undesired words a 'lang.bad_words' file should be used.
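As a rough illustration of that layout, the sketch below builds the `langdata/<lang>/<lang>.bad_words` path and appends words to it without duplicating existing entries. The helper names are hypothetical, and the one-word-per-line UTF-8 format is an assumption, not something confirmed in this thread; this is not part of Tesseract's tooling.

```python
from pathlib import Path


def bad_words_path(langdata_root, lang):
    """Build langdata/<lang>/<lang>.bad_words, following the pattern quoted above."""
    return Path(langdata_root) / lang / f"{lang}.bad_words"


def add_bad_words(langdata_root, lang, words):
    """Append words to the bad_words file, skipping ones already listed.

    Assumption: the file holds one word per line, encoded as UTF-8.
    """
    path = bad_words_path(langdata_root, lang)
    existing = []
    if path.exists():
        existing = path.read_text(encoding="utf-8").splitlines()

    merged = list(existing)
    seen = set(existing)
    for word in words:
        if word not in seen:
            merged.append(word)
            seen.add(word)

    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(merged) + "\n", encoding="utf-8")
    return path
```

For example, `add_bad_words("langdata", "heb", ["some_word"])` would write to `langdata/heb/heb.bad_words` in a local checkout.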
