Add U+02BC to Devanagari.unicharset #34

Open
Shreeshrii opened this issue Jan 2, 2017 · 3 comments

Comments

@Shreeshrii
Contributor

Some languages of India make use of U+02BC “ ’ ” modifier letter apostrophe, either as a tone mark or as a length mark in their texts written in Devanagari script.

eg. ख’ल्ल
ित’लकना
दख’ना
खर’
कत’ पड़ा’ गेल’?

@Shreeshrii
Contributor Author

See tesseract-ocr/tesseract#561 for a list of fonts, with links to the TTF files, that can be used for Devanagari training.

@theraysmith
Contributor

The examples you give are all U+2019, so which is it? 2019 or 2bc?

@Shreeshrii
Contributor Author

As per http://www.unicode.org/versions/Unicode9.0.0/ch12.pdf page 21, it is U+02BC.

http://www.fileformat.info/info/unicode/char/02bc/index.htm quotes the following regarding U+02BC:

"Comments
apostrophe
glottal stop, glottalization, ejective
many languages use this as a letter of their alphabets
used as a tone marker in Bodo, Dogri, and Maithili
U+2019 is the preferred character for a punctuation apostrophe"

In terms of Tesseract, it would apply to the 'bih' traineddata, as the Bihari group of languages written in Devanagari script includes Maithili.

It is quite possible that the examples that I had copied used the wrong apostrophe.
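
One way to settle which apostrophe a copied example actually contains is to print the code point of each apostrophe-like character. The following is only a minimal sketch using Python's standard `unicodedata` module; the helper names `find_apostrophes` and `to_modifier_apostrophe` are hypothetical and not part of Tesseract or its training tools.

```python
import unicodedata

# Apostrophe-like characters that are easy to confuse visually:
# U+0027 APOSTROPHE, U+2019 RIGHT SINGLE QUOTATION MARK,
# U+02BC MODIFIER LETTER APOSTROPHE.
APOSTROPHES = {"\u0027", "\u2019", "\u02bc"}


def find_apostrophes(text):
    """Return (index, 'U+XXXX', Unicode name) for each apostrophe-like character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in APOSTROPHES
    ]


def to_modifier_apostrophe(text):
    """Replace U+2019 with U+02BC (hypothetical helper).

    Assumption: every U+2019 in the text is a tone or length mark,
    not a punctuation apostrophe or closing quote.
    """
    return text.replace("\u2019", "\u02bc")


# Two Devanagari letters followed by U+2019, written with escapes so the
# code point is unambiguous regardless of how this text was copied.
sample = "\u0916\u0930\u2019"
print(find_apostrophes(sample))
print(find_apostrophes(to_modifier_apostrophe(sample)))
```

A blanket replacement like `to_modifier_apostrophe` would also convert genuine punctuation apostrophes, so it is only appropriate for text where the mark is known to be a tone or length mark.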
