Best Traineddata Feedback - Hindi #66

Open
Shreeshrii opened this issue Aug 1, 2017 · 5 comments


Shreeshrii commented Aug 1, 2017

hin.lstm-unicharset is missing the following Devanagari characters and combining marks:

ङ 1 0,255,0,255,0,0,0,0,0,0 Devanagari 129 0 129 ङ # ङ [919 ]x
ऍ | 2317 | ऍ | 090D | DEVANAGARI LETTER CANDRA E
ॅ 0 0,255,0,255,0,0,0,0,0,0 Devanagari 124 17 124 ॅ # ॅ [945 ]
ॐ 1 0,255,0,255,0,0,0,0,0,0 Devanagari 158 0 158 ॐ # ॐ [950 ]x

पङ्कज
गङ्गा
ऍण्ड
ऍक्ट
डू यू हैव अ पॅन
फॅरनहाइट

ॐकार
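
A check like the one above can be scripted. The following is only a minimal sketch, assuming the unicharset is a text file whose first line holds the entry count and whose remaining lines each start with the symbol followed by a space. The `sample` lines are hypothetical stand-ins written in that layout, not the contents of the real hin.lstm-unicharset, and the function names are made up for illustration.

```python
import unicodedata

def unicharset_symbols(lines):
    """Collect the first whitespace-delimited field of every entry line."""
    symbols = set()
    for line in lines[1:]:  # assumption: the first line is the entry count
        fields = line.split()
        if fields:
            symbols.add(fields[0])
    return symbols

def missing_characters(lines, required):
    """Return the required characters that have no entry of their own."""
    present = unicharset_symbols(lines)
    return [ch for ch in required if ch not in present]

# The four code points reported in this issue: U+0919, U+090D, U+0945, U+0950.
REQUIRED = ["\u0919", "\u090D", "\u0945", "\u0950"]

# Hypothetical sample data, only to make the sketch runnable on its own.
sample = [
    "3",
    "NULL 0 Common 0",
    "\u0915 1 0,255,0,255,0,0,0,0,0,0 Devanagari 10 0 10 \u0915",
    "\u0919 1 0,255,0,255,0,0,0,0,0,0 Devanagari 129 0 129 \u0919",
]

for ch in missing_characters(sample, REQUIRED):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

To run it against a real file, read the lines with `open(path, encoding="utf-8").read().splitlines()` and pass them in place of `sample`.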


Shreeshrii commented Aug 1, 2017

See https://shreeshrii.github.io/tess4eval-san/ for accuracy reports on Hindi and Bihari language samples. The two languages are not separated in the reports.

The images used can be seen from
https://github.com/Shreeshrii/tess4eval-san/blob/master/0createcache.sh

I have NOT looked at wordlists yet because I was under the impression that they do not make much difference to accuracy for LSTM models. Is that correct, @theraysmith?


Shreeshrii commented Aug 2, 2017

Some of the Hindi recognition errors are caused by a different orthographic style being used for some of the letters. Please see https://shreeshrii.github.io/tess4eval-san/index-4-hinbest.html where the errors relate to




and

for bhojpurilokgatha005035mbp_0278.tif

Interestingly, these are recognized correctly in the original hin.traineddata for 4.00.00-alpha.

These errors can be fixed by ensuring that fonts with the different orthographies are used for training.

@theraysmith If you provide a list of Devanagari fonts used for training, I can check for this.


Shreeshrii commented Aug 2, 2017

For wordlists/training_text for modern languages, I would also suggest using the localization lists from unicode.org.

Please see:
http://www.unicode.org/cldr/charts/31/summary/hi.html
http://www.unicode.org/cldr/charts/31/summary/mr.html

See http://www.unicode.org/cldr/charts/31/summary/root.html
for the list of languages for which this information is available.

@Shreeshrii

Also see comments for #64 - feedback regarding Sanskrit

@Shreeshrii

See the attached reports, generated using https://github.com/eddieantonio/isri-ocr-evaluation-tools, which supports UTF-8 text.

ALL-hin-imageshin-rpt.txt
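
For readers who have not seen this kind of report, here is a rough sketch of a character accuracy figure of the sort such reports summarize. This is not the code of the ISRI tools. It assumes accuracy is (characters in ground truth minus edit-distance errors) divided by characters in ground truth, and it counts Unicode code points, which may not match how the tools count. The function names and the sample strings are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_accuracy(truth, ocr):
    """Fraction of ground-truth characters not accounted for by errors."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(truth, ocr)
    return (len(truth) - errors) / len(truth)

# Illustrative pair: U+0919 in the ground truth replaced by U+0921 in the output.
truth = "गङ्गा"
ocr = "गड्गा"
print(f"{character_accuracy(truth, ocr):.2%}")
```

A model whose unicharset lacks ङ cannot output that character at all, so every occurrence in the ground truth costs at least one error under this measure.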
