Kanji recognition

Introduction

This project was inspired by the TensorFlow tutorial on MNIST handwritten digits, which I followed while learning about Convolutional Neural Networks.

This project demonstrates building a CNN to recognize Japanese kanji characters.

Write-up

Labels

Moving away from the MNIST example, my first problem was the labels. I was learning RTK (Remembering the Kanji) at the time, so its characters were a good starting point: they were the ones I needed while learning Japanese. I spent a few days writing scrapers to collect those characters from a Memrise course and Wikipedia.
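As an illustration of the labeling step, here is a minimal sketch of turning scraped text into a label dictionary. The function name `build_label_dict` and the choice to keep only the CJK Unified Ideographs block (U+4E00 to U+9FFF) are assumptions made for this example; they are not necessarily what the scrapers or `kanji_label_dict.py` do.

```python
def build_label_dict(text):
    """Map each distinct kanji in `text` to a class index.

    Assumption: a "kanji" is any character in the CJK Unified
    Ideographs block (U+4E00..U+9FFF). Kana, Latin letters, and
    punctuation are skipped.
    """
    kanji = {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}
    # Sort by code point so the indices are stable between runs.
    return {ch: index for index, ch in enumerate(sorted(kanji))}


labels = build_label_dict("日本語のテキスト日")
print(labels)
```

Sorting before assigning indices keeps the mapping reproducible, which matters because a trained model's output layer is tied to those indices.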

Data

While writing those scrapers, I realized that I had no dataset for training. Because of that, I built a drawing/note-taking app with Cordova to generate some data without labeling it. That took a few weeks, and I was happy with it because it was one of my first experiences building a mobile app.

A few weeks later, I realized that I was generating nowhere near enough data for training. The MNIST example has around 6,000 training records for each label. I had ~2000 labels and fewer than 10 records for 20% of those labels. I needed to find a way to create data. Fonts came to mind. While I was learning Japanese with Anki, the default font for rendering Japanese was pretty bad for learners: the characters were not rendered the way we are supposed to write them. I got my hands on some Japanese fonts from the community that are suitable for learners, and that gave me the idea of using Japanese fonts to generate image data. It took me one to two months to complete the project.
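The font idea can be sketched roughly as follows. This is not the project's actual generator: it assumes Pillow 8.0 or newer (for `ImageDraw.textbbox`), and the names `center_offset` and `render_char`, the 64-pixel canvas, and the 0.8 font scale are all choices made for this example.

```python
from PIL import Image, ImageDraw, ImageFont


def center_offset(bbox, size):
    """Return the draw position that centers a glyph box on a square canvas.

    `bbox` is (left, top, right, bottom) as measured with the text
    drawn at position (0, 0).
    """
    left, top, right, bottom = bbox
    x = (size - (right - left)) // 2 - left
    y = (size - (bottom - top)) // 2 - top
    return x, y


def render_char(char, font_path, size=64):
    """Render one character as a grayscale image: white paper, black ink."""
    # Assumption: a font size of 80% of the canvas leaves a small margin.
    font = ImageFont.truetype(font_path, int(size * 0.8))
    image = Image.new("L", (size, size), color=255)
    draw = ImageDraw.Draw(image)
    bbox = draw.textbbox((0, 0), char, font=font)
    draw.text(center_offset(bbox, size), char, font=font, fill=0)
    return image
```

Calling `render_char("日", font_path)` with the path to any Japanese TrueType or OpenType font returns a `PIL.Image` that can be saved or converted to an array. Rendering every label with several fonts gives at least a few samples per class, though printed glyphs are only an approximation of handwriting.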

Training

With the data ready, I was able to train reasonably good models within a few weeks. I spent the next few months building applications that use the model: a web app demo, an Android app, and a desktop app for labeling my hard-won handwriting data.

New data

One day, I stumbled on the ETL Character Database, an image dataset that is perfect for my needs. It contains more data than I could write in the next 5 years. In addition, all of it is already labeled. It took me a few weeks to process one part of the dataset. It was the first time I had to research text encodings (ASCII, UTF-8, Shift-JIS, UTF-16, etc.). With the newfound dataset, the model performed significantly better than when it was trained on the font dataset.
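To make the encoding issue concrete, here is a small sketch that converts a two-byte JIS X 0208 code into a Python string and shows the same character in several encodings. It assumes the dataset labels are stored as JIS X 0208 codes, which is my reading of the ETL record format and not something stated above. The helper name `jis_x0208_to_char` is hypothetical.

```python
def jis_x0208_to_char(code):
    """Convert a two-byte JIS X 0208 code (e.g. 0x3021) to a character.

    EUC-JP encodes a JIS X 0208 character by setting the high bit of
    each byte, so Python's built-in "euc_jp" codec can do the lookup.
    """
    high, low = code >> 8, code & 0xFF
    if not (0x21 <= high <= 0x7E and 0x21 <= low <= 0x7E):
        raise ValueError(f"not a JIS X 0208 code: {code:#06x}")
    return bytes([high | 0x80, low | 0x80]).decode("euc_jp")


char = jis_x0208_to_char(0x3021)
for encoding in ("utf-8", "shift_jis", "utf-16-be"):
    # The same character has a different byte sequence in each encoding.
    print(encoding, char.encode(encoding).hex())
```

Once every label is a normal Python string, it can be matched against the label dictionary regardless of how the source file encoded it.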

Implementations

TensorFlow - Python - Train model with Python and TensorFlow.
tfjs (on gh-pages branch) - Use trained model to recognize handwriting from HTML canvas with JavaScript.
TensorFlow Lite - Use trained model to create handwriting input app on Android device with Java/Kotlin.

