Suggest 'deva' for Devanagari #41

Shreeshrii · 2017-01-13T16:34:07Z

With LSTM training, the dictionary dawg files have become optional. In light of this, I want to suggest an additional traineddata file for the Devanagari script, which can cater to all the main languages written in it.

The reason for suggesting this is that when I tested OCR on a Marathi text, a lot of words with rakaara were not recognised correctly. The same page OCRed with the Sanskrit traineddata recognised those words correctly, but got some other words wrong.

So, in addition to the multiple traineddata for various languages written in Devanagari

Shreeshrii · 2017-01-13T16:35:24Z

One could add a Deva.traineddata that is trained on the training text for all these languages taken together.
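
As a rough illustration of "training text for all these languages taken together", the per-language training texts could be merged into one script-level file before training. This is only a sketch: the function name and the file names in the usage note are hypothetical, and it says nothing about how the actual training pipeline prepares its text.

```python
from pathlib import Path


def combine_training_texts(paths, out_path):
    """Merge several per-language training texts into one file.

    Blank lines and exact duplicate lines are dropped; the order of
    first appearance is kept. Returns the number of lines written.
    """
    seen = set()
    combined = []
    for path in paths:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.add(line)
                combined.append(line)
    Path(out_path).write_text("\n".join(combined) + "\n", encoding="utf-8")
    return len(combined)
```

For example, `combine_training_texts(["hin.training_text", "mar.training_text", "san.training_text"], "deva.training_text")` would write one merged file; those file names are assumptions made for illustration.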

amitdo · 2017-01-14T14:51:47Z

Related papers:

A Segmentation-Free Approach for Printed Devanagari Script Recognition (2015)
Tushar Karayil, Adnan Ul-Hasan, Thomas M. Breuel

Can we build language-independent OCR using LSTM networks? (2013)
Adnan Ul-Hasan, Thomas M. Breuel

More interesting papers about LSTM for OCR:
https://github.com/tmbdev/ocropy/wiki/Publications

Shreeshrii · 2017-01-15T15:33:32Z

List of Unicode Devanagari fonts that could be used for training, if they are not already being used:

tesseract-ocr/tesseract#561 (comment)

Sample of glyphs in different fonts:

tesseract-ocr/tesseract#654

amitdo · 2017-01-15T19:51:02Z

Similarly, it would be nice to have a generic traineddata for multiple Latin-script-based languages, as described in the paper I mentioned above.

Likewise, you could provide a generic Cyrillic traineddata.

amitdo · 2017-01-15T19:53:57Z

And maybe one based on the Arabic script.

Shreeshrii · 2017-01-16T10:29:47Z

Devanagari corpus

Marathi
http://www.cfilt.iitb.ac.in/hin_corp_unicode.tar
http://ltrc.iiit.ac.in/ltrc/internal/nlp/corpus/ftp/marathicorp.tgz

Hindi
http://www.cfilt.iitb.ac.in/hin_corp_unicode.tar
http://ltrc.iiit.ac.in/ltrc/internal/nlp/corpus/ftp/hindicorp.tgz
http://ocr.iiit.ac.in/Hindi100.html

Sanskrit
https://sa.wikibooks.org/
https://sa.wikisource.org/

amitdo · 2017-03-14T08:02:10Z

#41 (comment)
@stweil
I think it's related to your message here:
https://groups.google.com/forum/#!topic/tesseract-dev/8H_4K3vPRJE

stweil · 2017-03-14T08:15:11Z

Likewise, you could provide a generic Cyrillic traineddata.

I assume the same would be needed for Greek. Or would it be better to include Greek characters in the Latin training set? Several sciences (especially Physics and Mathematics) use single Greek characters in texts which are mostly written with Latin letters.

Shreeshrii · 2017-04-01T02:22:23Z

#59 (comment)

@theraysmith commented 2 days ago

I've also added an experiment to throw all the Latin languages together
into a single engine. (Actually a separate model for each of 36 scripts).
If that works it will solve the problem of reading Citroen in German and
picking up the e umlaut.
The downside is that this model has almost 400 characters in it, despite
carefully keeping out the long-tail graphics characters. Even if it does
work, it will be slower, but possibly not much slower than running 2
languages. It will have about 56 languages in it. I have some optimism that
this may work, ever since I discovered that the vie LSTM model gets the
phototest.tif image 100% correct.
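
To get a feel for how large a merged character set becomes, the distinct characters across several languages' texts can be counted directly. The sketch below is an assumption-laden illustration, not how the model's character set is actually built: it counts Unicode code points rather than the units a recognizer would use, and the `min_count` threshold is a hypothetical stand-in for "keeping out the long-tail" characters.

```python
from collections import Counter


def merged_charset(texts, min_count=1):
    """Return the set of non-space characters seen across all texts.

    Characters occurring fewer than min_count times in total are
    dropped, as a crude way to trim the long tail.
    """
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if not ch.isspace())
    return {ch for ch, n in counts.items() if n >= min_count}
```

Calling `len(merged_charset(texts))` on the texts of each language separately and then on all of them together shows how much the set grows when languages are combined.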

amitdo · 2017-08-01T18:28:35Z

This request was implemented by Ray:

tesseract-ocr/tessdata#62 (comment)

Shreeshrii · 2017-08-02T01:28:44Z

Thanks!

Shreeshrii closed this as completed Apr 1, 2017
