Best traineddata feedback - Sanskrit #64

Open
Shreeshrii opened this issue Aug 1, 2017 · 7 comments

@Shreeshrii
Contributor

Shreeshrii commented Aug 1, 2017

san.lstm-unicharset does not have the following Devanagari characters:

ऐ 1 0,255,0,255,0,0,0,0,0,0 Devanagari 109 0 109 ऐ # ऐ [910 ]x
औ 1 0,255,0,255,0,0,0,0,0,0 Devanagari 92 0 92 औ # औ [914 ]x
झ 1 0,255,0,255,0,0,0,0,0,0 Devanagari 54 0 54 झ # झ [91d ]x
ळ 1 0,255,0,255,0,0,0,0,0,0 Devanagari 123 0 123 ळ # ळ [933 ]x

Here are some words with these:

ऐकान्तिकस्य (aikaantikasya) = ultimate
ऐक्य (aikya) = unity
ऐच्छत् (aichchhat.h) = desired
ऐरावतं (airaavataM) = Airavata
ऐश्वरं (aishvaraM) = divine
ऐश्वर्य (aishvarya) = desire for power

औद्योगिक (audyogika) = industrial
औपम्येन (aupamyena) = by comparison
औशध (aushadha) = medicine
औषध (aushhadha) = medicine
औषधं (aushhadhaM) = medicine
औषधम् (aushhadham.h) = (n) medicine
औषधसूची (aushhadhasuuchii) = (f) syringe, injection
औषधिवन (aushhadhivana) = medicinal garden
औषधीः (aushhadhiiH) = vegetables

झषाणां (jhashhaaNaaM) = of all fish
झृम्बणम् (jhRimbaNam.h) = (n) yawning

मूळ (muuLa) = Nineteenth nakshatra
मङ्गळ (ma.ngaLa) = Auspiciousness and well-being
अ॒ग्निमी॑ळे
वी॒ळु
मृळय
इळा॒
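A check like the one at the top of this comment can be scripted. The following is a minimal sketch, assuming the unicharset is a text file whose first line is the entry count and whose remaining lines each start with the unichar followed by a space. The function names are hypothetical and not part of Tesseract.

```python
def unichars_in(unicharset_text):
    """Return the set of unichars listed in a unicharset-style text.

    Assumption: line 1 is the entry count, and every later line
    begins with the unichar, followed by a space and its properties.
    """
    lines = unicharset_text.splitlines()
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def missing_chars(unicharset_text, wanted):
    """Return the characters from `wanted` that have no unicharset entry."""
    present = unichars_in(unicharset_text)
    return [ch for ch in wanted if ch not in present]


# Tiny illustrative sample, not the real san.lstm-unicharset.
sample = "\n".join([
    "3",
    "NULL 0 Common 0",
    "\u0915 1 0,255,0,255,0,0,0,0,0,0 Devanagari 1 0 1 \u0915",
    "\u0910 1 0,255,0,255,0,0,0,0,0,0 Devanagari 2 0 2 \u0910",
])

# Check the four letters reported above: AI, AU, JHA, LLA.
print(missing_chars(sample, "\u0910\u0914\u091d\u0933"))
```

To run it against a real file, read the file with `encoding="utf-8"` and pass its contents in place of `sample`.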

@Shreeshrii
Contributor Author

Shreeshrii commented Aug 1, 2017

Also missing are the following vowels and their combining signs, though these are used rarely.

ऌ | 2316 | ऌ | 090C | DEVANAGARI LETTER VOCALIC L
ॠ | 2400 | ॠ | 0960 | DEVANAGARI LETTER VOCALIC RR
ॡ | 2401 | ॡ | 0961 | DEVANAGARI LETTER VOCALIC LL

ॄ | 2372 | ॄ | 0944 | DEVANAGARI VOWEL SIGN VOCALIC RR
ॢ | 2402 | ॢ | 0962 | DEVANAGARI VOWEL SIGN VOCALIC L
ॣ | 2403 | ॣ | 0963 | DEVANAGARI VOWEL SIGN VOCALIC LL
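Rows like these can be regenerated from the characters themselves with Python's standard `unicodedata` module. This is only a sketch; `describe` is a hypothetical helper, and its output has four columns (character, decimal, hex, name) rather than the five shown above.

```python
import unicodedata


def describe(ch):
    """Format one character as: char | decimal | hex | Unicode name."""
    cp = ord(ch)
    return f"{ch} | {cp} | {cp:04X} | {unicodedata.name(ch)}"


# The rarely used vowels and vowel signs listed above.
rare_vowels = ["\u090c", "\u0960", "\u0961", "\u0944", "\u0962", "\u0963"]

for ch in rare_vowels:
    print(describe(ch))
```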

@Shreeshrii
Contributor Author

Shreeshrii commented Aug 1, 2017

See the following OCR evaluation reports:

Sanskrit + one Vedic Sanskrit sample
https://shreeshrii.github.io/tess4eval-san/

For old orthography and listing of alphabets
https://shreeshrii.github.io/tess4eval_deva/

Test images are at
https://github.com/Shreeshrii/tess4eval_deva/tree/master/images

@Shreeshrii
Contributor Author

My suggestion for fixing this kind of problem (in ALL language traineddata files) is to add the full alphabet of the language to the desired_characters file for that language (rather than only adding characters that do not get picked up otherwise).

So, basically the desired characters file becomes the desired unicharset file.
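One way to seed such a file is to enumerate the script's Unicode block. The sketch below rests on two assumptions: that desired_characters holds one character per line, and that the letters and marks of the Devanagari block (U+0900 to U+097F) are a reasonable starting point. That block also contains letters used only by other languages, so the list would still need manual pruning for Sanskrit. The function names are hypothetical.

```python
import unicodedata


def devanagari_letters_and_marks():
    """List letters (Lo) and combining marks (Mn, Mc) in U+0900..U+097F."""
    chars = []
    for cp in range(0x0900, 0x0980):
        ch = chr(cp)
        if unicodedata.category(ch) in {"Lo", "Mn", "Mc"}:
            chars.append(ch)
    return chars


def as_desired_characters(chars):
    """Render the list one character per line (assumed file layout)."""
    return "\n".join(chars) + "\n"


alphabet = devanagari_letters_and_marks()
text = as_desired_characters(alphabet)
```

Digits and punctuation such as the danda are left out on purpose here; add the `Nd` and `Po` categories to the set if they should be included.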

@Shreeshrii
Contributor Author

An alternative would be to have a small hand-crafted training text that contains the full alphabet and the desired characters, plus a mechanism to ensure that it gets picked up while building the synthetic training image for each font.
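Before relying on such a text, it helps to confirm that every required character really occurs in it. A minimal sketch, with a hypothetical `char_counts` helper and a made-up two-word sample in place of a real training text:

```python
from collections import Counter


def char_counts(training_text, required):
    """Count how often each required character occurs in the text."""
    counts = Counter(training_text)
    return {ch: counts[ch] for ch in required}


# Two words from the examples earlier in this thread: aikya, aushhadha.
sample_text = "\u0910\u0915\u094d\u092f \u0914\u0937\u0927"
required = "\u0910\u0914\u091d"  # AI, AU, JHA

counts = char_counts(sample_text, required)
uncovered = [ch for ch, n in counts.items() if n == 0]
```

Any character left in `uncovered` would need a word added to the hand-crafted text.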

@Shreeshrii
Contributor Author

Recognition of old orthography is quite poor - see error reports at https://shreeshrii.github.io/tess4eval_deva/

@Shreeshrii
Contributor Author

I looked at the Sanskrit wordlist from lstm-word-dawg.

It contains many entries with special characters that should not be there. Some examples:

~रं

~=
~=अनसल
~दअ
~दविरअ
~दारयत्‌
~दकइएन
~दह

~ऽ
~ध
~धनन

~निॐत
~नसर
~न¬
~ल
:((ॐ)):

~*थ
~*उभवो
~*धो

~¬त
~¬तजइ
~¬तमो
~¬¬
¬¬
~¬¬मऽअ
~¬¬चइत

सा$$मर्षतयेव
सा$मावस्या
सा$ब्रवीत्‌
सा$वीरा
सा$पि
सा$पत्रपा$न्यतः
सा$पत्रपा$न्यतः
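Entries like these could be filtered out before the wordlist is rebuilt. The sketch below is one possible filter, not how the existing wordlist was produced. It assumes a valid entry consists only of characters from the Devanagari block (U+0900 to U+097F), the Vedic Extensions block (U+1CD0 to U+1CFF), and the zero-width joiner and non-joiner. The function name is hypothetical.

```python
def is_clean_word(word):
    """True if every character is Devanagari, Vedic, or ZWNJ/ZWJ."""
    if not word:
        return False
    for ch in word:
        cp = ord(ch)
        in_devanagari = 0x0900 <= cp <= 0x097F
        in_vedic_ext = 0x1CD0 <= cp <= 0x1CFF
        is_joiner = cp in (0x200C, 0x200D)
        if not (in_devanagari or in_vedic_ext or is_joiner):
            return False
    return True


# Three entries: one valid word, two with stray ASCII symbols.
words = ["\u0914\u0937\u0927", "~\u0927", "\u0938\u093e$\u092a\u093f"]
clean = [w for w in words if is_clean_word(w)]
```

This rejects anything containing `~`, `$`, `*`, `=`, `¬`, or brackets, while keeping words that carry Vedic accent marks.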

@Shreeshrii
Contributor Author

See attached reports, run using https://github.com/eddieantonio/isri-ocr-evaluation-tools which supports utf-8 text.

ALL-san-imagessan-rpt.txt
