Geresh and Gershayim are not included #130

yarons · 2018-07-05T11:52:08Z

https://github.com/tesseract-ocr/langdata/blob/106c9b31bea9d30814fc116cbcb9c267dee7df70/heb/heb.training_text

I couldn't find the Hebrew punctuation marks Geresh (׳, U+05F3) or Gershayim (״, U+05F4) in the training text linked above.

https://en.wikipedia.org/wiki/Geresh
https://en.wikipedia.org/wiki/Gershayim

These characters were not widely used until fairly recently, when a new keyboard layout was introduced.
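
A quick way to check a training text for these two characters is to count their code points directly. The following is only a minimal sketch: the function name `count_hebrew_punctuation` and the sample string are made up for illustration, and it looks only at U+05F3 and U+05F4, not at the ASCII apostrophe or double quote that are often typed in their place.

```python
GERESH = "\u05F3"     # HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05F4"  # HEBREW PUNCTUATION GERSHAYIM


def count_hebrew_punctuation(text: str) -> dict:
    """Count occurrences of Geresh and Gershayim in the given text."""
    return {
        "geresh": text.count(GERESH),
        "gershayim": text.count(GERSHAYIM),
    }


# Hypothetical sample: one real Gershayim, one real Geresh,
# and one ASCII double quote used as a fallback (not counted).
sample = "צה\u05F4ל, ג\u05F3ירפה, צה\"ל"
counts = count_hebrew_punctuation(sample)
print(counts)  # {'geresh': 1, 'gershayim': 1}
```

To run it against the real file, read `heb.training_text` as UTF-8 and pass the contents to the function; a result of zero for both keys would confirm the observation above.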

amitdo · 2018-07-05T12:47:23Z

Duplicate of #82 (comment)

amitdo · 2018-07-05T13:37:26Z

Anyway, *.training_text files have not been updated for years.
They are automatically generated from a web corpus.

yarons · 2018-07-05T13:53:17Z

Is there a way to influence which web pages are scanned?

amitdo · 2018-07-05T14:15:21Z

Yes, with some hints from other files.

I don't remember the fine details right now.

amitdo · 2018-07-05T15:31:46Z

https://github.com/tesseract-ocr/langdata/blob/master/ces/desired_characters

amitdo · 2018-07-05T15:37:16Z

The opposite:
https://github.com/tesseract-ocr/langdata/blob/master/ara/forbidden_characters

amitdo · 2018-07-05T15:56:54Z

I think 'desired_words' and 'forbidden_words' can also be used.

Shreeshrii · 2018-07-05T16:20:17Z

These lists are used in Ray's synthetic training data creation pipeline. As far as I know, the tesstrain.sh training process does not use them.

amitdo · 2018-07-05T16:27:32Z

True.

amitdo · 2018-07-06T05:57:03Z

tesseract-ocr/tessdata#62 (comment)

theraysmith commented on Aug 3, 2017:

> FYI: The wordlists are generated files, so it isn't a good idea to modify them, as the modifications will likely get overwritten in a future training. To help prevent the ß/B confusion, the words that you want to lose from the wordlists need to go in langdata/lang/lang.bad_words.

So for undesired words, a 'lang.bad_words' file should be used.
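
To illustrate the idea only: the sketch below filters a generated wordlist against a bad-words list. It is not the actual training pipeline, which is not shown in this thread. The function name `filter_wordlist`, the one-word-per-line format, and the sample words are all assumptions.

```python
def filter_wordlist(words, bad_words):
    """Return the words that do not appear in bad_words.

    Both arguments are iterables of strings, e.g. lines read from files.
    Assumes one word per line (an assumption, not a documented format).
    """
    banned = {w.strip() for w in bad_words if w.strip()}
    return [w.strip() for w in words if w.strip() and w.strip() not in banned]


# Hypothetical data modelled on the ß/B confusion mentioned above.
wordlist = ["Straße", "StraBe", "groß", "groB"]
bad_words = ["StraBe", "groB"]
kept = filter_wordlist(wordlist, bad_words)
print(kept)  # ['Straße', 'groß']
```

The point of keeping the exclusions in a separate file is the one made in the quote: a regenerated wordlist can be filtered again, whereas manual edits to the generated file would be lost.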

amitdo · 2018-07-06T08:16:26Z

vie has an 'alphabet' file:
https://github.com/tesseract-ocr/langdata/blob/master/vie/alphabet
