!!New!!
An SDK to detect and recognize MICR lines has been released at https://github.com/DoubangoTelecom/ultimateMICR-SDK
The dataset contains more than 11 thousand images (.tif) with ground truth (.gt.txt), taken from real-life samples and augmented with a small amount of synthetic data.
The dataset is ready to be used for training with Tesseract v4.
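Before starting a training run, it can be worth checking that every image has its ground-truth file. The sketch below is not part of this repo: `find_unpaired` is a hypothetical helper, and it assumes that all files sit in one flat folder and that each `name.tif` is paired with `name.gt.txt`.

```python
from pathlib import Path


def find_unpaired(dataset_dir):
    """Return (images without ground truth, ground truths without image).

    Assumes a flat folder where "name.tif" is paired with "name.gt.txt".
    """
    root = Path(dataset_dir)
    # Strip the suffixes so both sets hold bare base names.
    images = {p.name[:-len(".tif")] for p in root.glob("*.tif")}
    truths = {p.name[:-len(".gt.txt")] for p in root.glob("*.gt.txt")}
    return sorted(images - truths), sorted(truths - images)
```

Calling `find_unpaired("path/to/dataset")` returns two lists of base names. Both are empty when the folder is consistent.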
If you're lazy and don't want to train the model yourself, try the ones in the tessdata_best (float model) or tessdata_fast (int model) folders.
When you develop an OCR app with Tesseract and get low accuracy, it is often hard to tell whether the problem is the model (traineddata) or the image pre-processing. Of course you can dump the pre-processed image to check that it is correctly binarized, but this takes time if you want to compute an accuracy score on thousands of images. To make your life easier, this repo contains a command-line application for Windows to test the accuracy.
This app is very easy to use:
- add your images in tesseractMICR/apps/images
- run tesseractMICR/apps/tesseract_recognizer.bat
- the predictions will be in tesseractMICR/apps/ocr.txt
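Once you have predictions, an accuracy score can be computed by comparing each predicted line with its ground truth. The sketch below is one common way to do it (exact-match line accuracy plus character accuracy from the Levenshtein distance), not necessarily the metric used by the authors. The layout of ocr.txt is not described here, so the code takes ready-made (prediction, ground truth) pairs; `edit_distance` and `score` are hypothetical names.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def score(pairs):
    """Return (line_accuracy, character_accuracy), both in [0, 1].

    pairs: iterable of (prediction, ground_truth) strings.
    """
    pairs = list(pairs)
    exact = sum(1 for pred, truth in pairs if pred == truth)
    errors = sum(edit_distance(pred, truth) for pred, truth in pairs)
    chars = sum(len(truth) for _, truth in pairs)
    line_acc = exact / len(pairs) if pairs else 0.0
    # Clamp at 0: a very noisy prediction can contain more errors than
    # the ground truth has characters.
    char_acc = max(0.0, 1.0 - errors / chars) if chars else 0.0
    return line_acc, char_acc
```

Running the same images against tessdata_best and tessdata_fast and comparing the two scores helps separate model problems from pre-processing problems: if both models score poorly on the same images, the pre-processing is the more likely culprit.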
This app will:
- detect MICR E-13B lines anywhere in the image
- extract the lines, de-skew and de-slant them
- binarize the lines
- use Tesseract for recognition
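To give an idea of what the binarization step involves, here is a minimal sketch using Otsu's global threshold. This is an assumption for illustration only: the method actually used by the app is not stated here, and `otsu_threshold` and `binarize` are hypothetical names. The input is a flat sequence of 8-bit grayscale values.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximizes the between-class
    variance when pixels are split into values <= t and values > t."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_low, sum_low = 0, 0
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue          # no pixels at or below t yet
        w_high = total - w_low
        if w_high == 0:
            break             # every pixel is already in the low class
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        var = w_low * w_high * (mean_low - mean_high) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(pixels):
    """Map dark pixels (ink) to 0 and light pixels (paper) to 255."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]
```

A global threshold like this works best on evenly lit scans. Images with shadows or gradients usually need a local (adaptive) method instead.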
You can edit tesseractMICR/apps/tesseract_recognizer.bat to change the path to the images or tessdata folders.
REM Usage: tesseract_recognizer.exe path_to_images_folder path_to_tessdata_folder
REM path_to_images_folder -> relative or absolute path to folder containing the images to process
REM path_to_tessdata_folder -> relative or absolute path to folder containing *.traineddata files
REM example: tesseract_recognizer.exe ./images ../tessdata_fast
REM another example: tesseract_recognizer.exe ./images ../tessdata_best
tesseract_recognizer.exe ./images ../tessdata_fast
The charset used in tesseractMICR/apps/ocr.txt is:
This application is GPGPU-accelerated using OpenCL. Make sure your GPU drivers are up to date.
This was developed as an internal R&D project and never went to production, as we ended up using TensorFlow.
Even as a PoC (proof of concept), it's already more accurate than all the commercial products we've tested: LEADTOOLS, Accusoft, Recogniform and ABBYY. The repo contains a command-line application to compare the accuracy (see above).
You can check our state-of-the-art implementation based on TensorFlow at https://www.doubango.org/webapps/micr/
To get help, please check our discussion group or Twitter account.