param char_whitelist for Text::OCRTesseract::create() should be an empty string instead of null, which falls back to [0-9a-zA-Z] #3457
Comments
Hi, with OpenCV 4.x/3.4, the following two statements produce different outputs. I feel this makes the usage a little hard to explain.
I ran a test program.
There are two possible countermeasures.
Which option is better?
Environment
OpenCV/OpenCV_contrib is at the 3.4 branch (2023/3/18).
Sample code
// g++ main.cpp -o a.out -l opencv_core -l opencv_imgcodecs -l opencv_text
#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/text.hpp>
#include <iostream>

// Run OCR once with the given language and whitelist, then print both the inputs and the result.
void trial(cv::Mat &img, char *lang, char *whitelist)
{
    cv::Ptr<cv::text::OCRTesseract> ocr =
        cv::text::OCRTesseract::create(NULL, lang, whitelist);
    std::string text;
    ocr->run(img, text);

    if (lang == NULL) {
        std::cout << "[INPUT ] lang = NULL";
    } else {
        std::cout << "[INPUT ] lang = \"" << lang << "\"";
    }
    if (whitelist == NULL) {
        std::cout << " whitelist = NULL" << std::endl;
    } else {
        std::cout << " whitelist = \"" << whitelist << "\"" << std::endl;
    }
    std::cout << "[OUTPUT] result is " << text << std::endl;
}

int main(void)
{
    cv::Mat img = cv::imread("MPLUS1.JPG", 1);

    // whitelist = NULL
    trial(img, NULL, NULL);
    trial(img, (char*)"eng", NULL);
    trial(img, (char*)"jpn", NULL);
    trial(img, (char*)"eng+jpn", NULL);

    // whitelist = explicit alphanumeric set
    trial(img, NULL, (char*)"0123456789abcdefghijklmnopqrstuvwxyz");
    trial(img, (char*)"eng", (char*)"0123456789abcdefghijklmnopqrstuvwxyz");
    trial(img, (char*)"jpn", (char*)"0123456789abcdefghijklmnopqrstuvwxyz");
    trial(img, (char*)"eng+jpn", (char*)"0123456789abcdefghijklmnopqrstuvwxyz");

    // whitelist = empty string
    trial(img, NULL, (char*)"");
    trial(img, (char*)"eng", (char*)"");
    trial(img, (char*)"jpn", (char*)"");
    trial(img, (char*)"eng+jpn", (char*)"");
    return 0;
}
Test image
Result
Thanks for your comprehensive sample demonstrating this issue. Personally, I would prefer option 2. If we only warn about this in the OpenCV documentation, users might not notice the fallback behavior until images without any Latin characters are recognized as a mysterious alphanumeric soup. Also, users of wrapper libraries in languages other than C/C++ and Python might not check the latest official OpenCV documentation very often, and Google is still indexing documentation for the 3.x branch.
Edit: I've found some historical info about versioning:
About breaking change
Many libraries (including OpenCV) basically don't want breaking changes in minor version upgrades. Applications should avoid worrying about library versions as much as possible. From time to time, for some reason, the library needs to be updated. So if we change the interface or the calculation results, we need a compelling reason.
Is default [0-9a-zA-Z] good for any language?
I verified the contents of the dictionary data in Tesseract. For example: https://github.com/tesseract-ocr/langdata/blob/main/eng/eng.wordlist Currently the text module implementation doesn't accept those wordlists as the default char_whitelist.
The wordlists currently contain non-[0-9a-zA-Z] characters, so I think it's difficult to technically explain why char_whitelist defaults to [0-9a-zA-Z]. I agree with this issue's proposal. I propose setting the default char_whitelist to "" (an empty string) from OpenCV 4.8.0/3.4.20. I believe this will improve recognition accuracy not only for languages containing non-ASCII characters such as CJK, but also for English (especially for sentences).
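To make the difference between the current and the proposed default concrete, here is a minimal sketch that models the fallback rule discussed in this thread. It is not the OpenCV source: the function names `effective_whitelist_current`, `effective_whitelist_proposed` and `is_allowed` are hypothetical, and the sketch assumes that an empty whitelist means "no restriction", as the thread describes for the empty-string case.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of the behavior discussed in this issue (not OpenCV code).
// The fallback set corresponds to [0-9a-zA-Z].
const std::string kAlnumFallback =
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

// Current behavior as reported: NULL silently becomes [0-9a-zA-Z].
std::string effective_whitelist_current(const char* char_whitelist) {
    return char_whitelist == nullptr ? kAlnumFallback
                                     : std::string(char_whitelist);
}

// Proposed behavior: NULL is treated like "", i.e. no restriction.
std::string effective_whitelist_proposed(const char* char_whitelist) {
    return char_whitelist == nullptr ? std::string()
                                     : std::string(char_whitelist);
}

// Assumption: an empty whitelist allows every character.
// utf8_char holds one UTF-8 encoded character (possibly several bytes).
bool is_allowed(const std::string& whitelist, const std::string& utf8_char) {
    assert(!utf8_char.empty());
    return whitelist.empty() ||
           whitelist.find(utf8_char) != std::string::npos;
}
```

Under this model, a hiragana character such as "あ" is rejected when the whitelist is left as NULL today, but accepted when the caller passes "" explicitly or when the proposed default is used.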
LGTM
In fact passing the empty string as the value of param
Will there be any new minor version released for the 3.x branch?
Hi, I tried to make a PR. The text module seems to have no tests for character recognition. https://github.com/opencv/opencv_contrib/tree/3.4/modules/text/test Knowing which language data is installed is a prerequisite for running character recognition tests. Adding test support to the text module is likely to be more difficult than writing this patch.
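One way a test could handle the language-data prerequisite is to check for the trained data files first and skip when they are missing. The sketch below is only an illustration under assumptions: `split_languages` and `has_language_data` are hypothetical helpers, not part of OpenCV or Tesseract, and the caller is assumed to know the tessdata directory (for example from the `TESSDATA_PREFIX` environment variable). It relies on Tesseract's `<lang>.traineddata` file naming and requires C++17 for `std::filesystem`.

```cpp
#include <cassert>
#include <filesystem>
#include <sstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: split a language spec such as "eng+jpn" into {"eng", "jpn"}.
std::vector<std::string> split_languages(const std::string& lang) {
    std::vector<std::string> parts;
    std::stringstream ss(lang);
    std::string item;
    while (std::getline(ss, item, '+')) {
        if (!item.empty()) parts.push_back(item);
    }
    return parts;
}

// Hypothetical helper: true only if every requested language has a
// "<lang>.traineddata" entry in tessdata_dir.
bool has_language_data(const fs::path& tessdata_dir, const std::string& lang) {
    assert(!tessdata_dir.empty());
    const std::vector<std::string> parts = split_languages(lang);
    if (parts.empty()) return false;
    for (const std::string& p : parts) {
        if (!fs::exists(tessdata_dir / (p + ".traineddata"))) return false;
    }
    return true;
}
```

A recognition test for "jpn" could then call `has_language_data(dir, "jpn")` up front and skip itself instead of failing on machines where only English data is installed.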
The current milestones are here: https://github.com/opencv/opencv/milestones It seems that those release milestones are planned.
The 3.4 branch is used only for bug fixes, not for implementing new features. https://github.com/opencv/opencv/wiki/ChangeLog#version3419
This problem also made me waste several hours!
I've found out that this fallback was made at the time of first introducing
System information (version)
Detailed description
opencv_contrib/modules/text/include/opencv2/text/ocr.hpp
Lines 156 to 157 in ed1873b
This behavior cost me hours figuring out why recognizing CJK characters with Tesseract works in Emgu.CV but not in OpenCvSharp:
shimat/opencvsharp#1542
shimat/opencvsharp#873
shimat/opencvsharp#1364
Steps to reproduce
Issue submission checklist
forum.opencv.org, Stack Overflow, etc and have not found any solution