Best traineddata feedback - Sanskrit #64
Comments
Also missing are the following vowels and their combining signs, though these are rarely used.
ऌ | 2316 | ऌ | 090C | DEVANAGARI LETTER VOCALIC L
ॄ | 2372 | ॄ | 0944 | DEVANAGARI VOWEL SIGN VOCALIC RR
See the following for OCR evaluation reports: Sanskrit plus one Vedic Sanskrit sample. For old orthography and a listing of the alphabet, images for testing are at
My suggestion for fixing this kind of problem (in ALL language traineddata files) is to add the full alphabet of the language to the desired_characters file for that language, rather than only adding characters which do not get picked up otherwise. So, basically, the desired_characters file becomes the desired unicharset file.
An alternative would be to have a small hand-crafted training text which has the full alphabet and desired characters, and a mechanism to ensure that it gets picked up while building the synthetic training image for each font.
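As a rough sketch of either suggestion, the full character list could be generated from the Unicode Devanagari block instead of being typed by hand. The function name `devanagari_block` is hypothetical, and the one-character-per-line output is an assumption about what a desired_characters file or a hand-crafted training text would need. The block also contains letters used only by other languages, so the list would still need manual trimming for Sanskrit.

```python
import unicodedata


def devanagari_block():
    """Return every assigned character in the Devanagari block (U+0900..U+097F)."""
    chars = []
    for cp in range(0x0900, 0x0980):
        ch = chr(cp)
        # unicodedata.name() returns the default ("") for unassigned code points.
        if unicodedata.name(ch, ""):
            chars.append(ch)
    return chars


def as_lines(chars):
    """Join characters one per line (assumed layout, not confirmed by the source)."""
    return "\n".join(chars) + "\n"
```

Calling `as_lines(devanagari_block())` gives text that could be reviewed and then pasted into the file; rare letters such as ऌ (U+090C) and ॄ (U+0944) are included because they are assigned code points in the block.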
Recognition of old orthography is quite poor; see the error reports at https://shreeshrii.github.io/tess4eval_deva/
I looked at the Sanskrit wordlist from lstm-word-dawg. It has many words with special characters which should not be there. Some examples: ~रं ~निॐत ~*थ सा$$मर्षतयेव
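One way to find such entries is to flag every word that contains a character which is not a Devanagari letter or combining mark. This is only a sketch: `is_clean_word` and `suspicious_words` are hypothetical helper names, and treating ZWJ/ZWNJ as allowed is an assumption. It expects the wordlist as plain text, one word per line, already extracted from the dawg file.

```python
import unicodedata

# Zero-width non-joiner and joiner are sometimes used in Devanagari text (assumed allowed).
ALLOWED_EXTRA = {"\u200c", "\u200d"}


def is_clean_word(word):
    """True if every character is a Devanagari letter or mark (or ZWJ/ZWNJ)."""
    if not word:
        return False
    for ch in word:
        if ch in ALLOWED_EXTRA:
            continue
        in_block = 0x0900 <= ord(ch) <= 0x097F
        # Category L* = letters, M* = combining marks; digits and dandas are excluded.
        if not (in_block and unicodedata.category(ch)[0] in "LM"):
            return False
    return True


def suspicious_words(words):
    """Return the words that contain anything outside the allowed set."""
    return [w for w in words if not is_clean_word(w)]
```

Words such as ~रं and सा$$मर्षतयेव are flagged because of the `~` and `$` characters, while ordinary words like औषध pass.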
See attached reports, run using https://github.com/eddieantonio/isri-ocr-evaluation-tools, which supports UTF-8 text.
san.lstm-unicharset does not have the following Devanagari characters:
ऐ 1 0,255,0,255,0,0,0,0,0,0 Devanagari 109 0 109 ऐ # ऐ [910 ]x
औ 1 0,255,0,255,0,0,0,0,0,0 Devanagari 92 0 92 औ # औ [914 ]x
झ 1 0,255,0,255,0,0,0,0,0,0 Devanagari 54 0 54 झ # झ [91d ]x
ळ 1 0,255,0,255,0,0,0,0,0,0 Devanagari 123 0 123 ळ # ळ [933 ]x
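A check like this can be automated. The sketch below assumes the usual unicharset text layout: a first line holding the entry count, then one entry per line with the unichar string as the first space-separated field. The function names are hypothetical. A character counts as covered if it appears inside any entry, which matters if entries hold more than one code point.

```python
def unichars_in(unicharset_text):
    """Collect the first field of every entry line, skipping the count line."""
    entries = set()
    for line in unicharset_text.splitlines()[1:]:
        if line.strip():
            entries.add(line.split(" ", 1)[0])
    return entries


def missing_chars(unicharset_text, required):
    """Return the required characters that appear in no unicharset entry."""
    covered = {ch for entry in unichars_in(unicharset_text) for ch in entry}
    return [ch for ch in required if ch not in covered]
```

For example, passing the contents of san.lstm-unicharset together with a string of required letters returns the subset that is absent, which for the file discussed here would be expected to include ऐ, औ, झ and ळ.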
Here are some words with these:
ऐकान्तिकस्य (aikaantikasya) = ultimate
ऐक्य (aikya) = unity
ऐच्छत् (aichchhat.h) = desired
ऐरावतं (airaavataM) = Airavata
ऐश्वरं (aishvaraM) = divine
ऐश्वर्य (aishvarya) = desire for power
औद्योगिक (audyogika) = industrial
औपम्येन (aupamyena) = by comparison
औशध (aushadha) = medicine
औषध (aushhadha) = medicine
औषधं (aushhadhaM) = medicine
औषधम् (aushhadham.h) = (n) medicine
औषधसूची (aushhadhasuuchii) = (f) syringe, injection
औषधिवन (aushhadhivana) = medicinal garden
औषधीः (aushhadhiiH) = vegetables
झषाणां (jhashhaaNaaM) = of all fish
झृम्बणम् (jhRimbaNam.h) = (n) yawning
मूळ (muuLa) = Nineteenth nakshatra
मङ्गळ (ma.ngaLa) = Auspiciousness and well-being
अ॒ग्निमी॑ळे
वी॒ळु
मृळय
इळा॒