recent change setlocale in baseapi.c causes Python loaded tesseract library to fail #1670
set the locale "C".
|
Thx. Any side effect by force setting to "C"? |
Sure. Setting the locale has lots of side effects. My default locale for python is de_DE.UTF-8, so the default can be different. You have to find out whether "C" works with your python code or whether you must restore the original locale after calling the Tesseract API. Tesseract currently requires the "C" locale because otherwise some functions can give bad results or fail. |
Thanks. I have added info to a new wiki page
https://github.com/tesseract-ocr/tesseract/wiki/4.0x-Common-Errors-and-Resolutions
|
This is going to cause huge problems for people who are running Tesseract as a library. Setting locale="C" will probably cause various unwanted side effects throughout the application. Setting and resetting the locale for the duration of Tesseract API calls is also problematic, for example in multithreaded applications. Instead of requiring locale="C", I suggest changing Tesseract to use something other than sscanf() so that strings are parsed in a locale-independent way. |
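To make the multithreading concern concrete, here is a minimal Python sketch. It assumes the caller wraps every Tesseract call in one shared helper; `forced_c_locale`, `_locale_lock` and `locale_seen_by_other_thread` are hypothetical names, not part of Tesseract or any binding. The locale is a process-wide setting, so a second thread observes the switch as well.

```python
import locale
import threading
from contextlib import contextmanager

# Hypothetical helper: one lock shared by every caller that switches the
# locale around a Tesseract call. It cannot protect code that ignores it.
_locale_lock = threading.Lock()


@contextmanager
def forced_c_locale():
    """Switch LC_ALL to "C" for the duration of the block, then restore it."""
    with _locale_lock:
        previous = locale.setlocale(locale.LC_ALL)  # query only, no change
        locale.setlocale(locale.LC_ALL, "C")
        try:
            yield
        finally:
            # Restore the saved setting even if the wrapped call raises.
            locale.setlocale(locale.LC_ALL, previous)


def locale_seen_by_other_thread():
    """Return the LC_ALL value that a second thread observes inside the block."""
    seen = []
    with forced_c_locale():
        worker = threading.Thread(
            target=lambda: seen.append(locale.setlocale(locale.LC_ALL)))
        worker.start()
        worker.join()
    return seen[0]
```

The lock only serializes callers that go through this helper. Any other thread that formats or parses locale-dependent data while the block is active still sees the "C" locale, which is the side effect described above.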
@laurikari, I agree. As soon as all *scanf code is replaced by code which does not depend on the locale, the assertions can be removed. We just had to make sure now that people don't get wrong results without any notice. |
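The risk of wrong results without any notice can be illustrated in Python, as an analogy only: `locale.atof` stands in here for a locale-dependent parser such as the *scanf family, and `float()` for a locale-independent one. The helper names and the choice of `de_DE.UTF-8` are assumptions made for this sketch, and that locale may not be installed on a given system.

```python
import locale


def parse_fixed(text):
    """Parse a decimal number with '.' as separator, whatever the locale is."""
    return float(text)  # Python's float() does not consult LC_NUMERIC


def compare(text="3.14", locale_name="de_DE.UTF-8"):
    """Return (locale-independent, locale-dependent) parses of the same text.

    Returns None when the requested locale is not available.
    """
    previous = locale.setlocale(locale.LC_NUMERIC)  # query only
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
    except locale.Error:
        return None
    try:
        return parse_fixed(text), locale.atof(text)
    finally:
        locale.setlocale(locale.LC_NUMERIC, previous)


print(compare())
```

Where a German locale is available, the second value typically differs from 3.14 because '.' is treated as a thousands separator, and no error is raised. That silent difference is the kind of failure the assertion was added to prevent.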
Even for C/C++ I usually call setlocale(LC_CTYPE, ""); as the first thing in Depending on |
Here is a (potentially incomplete) list of function calls which have to be replaced to get a Tesseract library which does not depend on the locale: 2018-10-08: |
So currently there's no other solution except setting: |
I am using a pattern like the one below to temporarily set the locale:
// Remember the current locale, switch to "C" for init, then restore it.
char *old_ctype = strdup(setlocale(LC_ALL, NULL));
setlocale(LC_ALL, "C");
tesseract::TessBaseAPI api;
api.InitForAnalysePage();
setlocale(LC_ALL, old_ctype);
free(old_ctype);
Is this correct or does it only bypass the assertion? |
It avoids the assertion, but the problem which was the reason for adding the assertion remains, so users risk getting wrong results or crashes later. |
Oh, that's not good. In my experiments it seemed to solve the problems in #1532 and I was able to OCR Japanese/Korean text, which I could not do before. I was hoping that the locale-sensitive operations were done during init. In my case, Tesseract is called by the user via language bindings, so I cannot permanently change the locale of the process. The only solution is to temporarily set the locale in the bindings when calling the Tesseract API. Our full bindings are pretty minimal. Where else do we need to temporarily set the locale to C? The OCR happens here:
api->ClearAdaptiveClassifier();
api->SetImage(image);
if(api->GetSourceYResolution() < 70)
  api->SetSourceResolution(300);
char *outText = HOCR ? api->GetHOCRText(0) : api->GetUTF8Text();
pixDestroy(&image);
api->Clear(); |
What about adding Jeroen's code to the Tesseract API init? |
That would not be a safe solution. See my previous answer. |
IMO, the right solution is here: |
I currently don't know a locale which influences
So those last three groups have to be fixed / replaced before we can remove the assertion. Instead of |
My current workaround for this looks like this:
import locale
from contextlib import contextmanager

@contextmanager
def c_locale(reset_to="C.UTF-8"):
    # Tesseract 4.0 asserts that the locale is "C".
    locale.setlocale(locale.LC_CTYPE, "C")
    try:
        yield
    finally:
        # Restore the requested locale even if the OCR call raises.
        locale.setlocale(locale.LC_CTYPE, reset_to)

with c_locale():
    from tesserocr import PyTessBaseAPI
    with PyTessBaseAPI() as api:
        api.Init(lang="deu")
        api.SetImage(box_image)  # box_image is defined elsewhere by the caller
        ocr_result = api.GetUTF8Text()
        print(ocr_result) |
Does anyone have an idea how to set the C locale for a JNA library when calling it from Java? |
And by the way, this also wouldn't work in a web environment, because this setting is VM-wide, so it would affect everything else that is happening in parallel as well. |
I found a way to set the locale to "C" from Java (using JNA). See here for a discussion: It works, but I am not sure about any side effects of this. |
This includes a fix for segfault on init problem mentioned by these two issues: tesseract-ocr/tesseract#1670 tesseract-ocr/tesseract#2151
C++17 has std::to_chars (https://en.cppreference.com/w/cpp/utility/to_chars). Compiler support is currently partial. |
This new backend uses a command call to avoid Tesseract bug 1670 (tesseract-ocr/tesseract#1670). Signed-off-by: Roberto Rosario <[email protected]>
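A command-call backend of this kind could look roughly like the following Python sketch. It is not the code from the referenced commit; `c_locale_env`, `build_command` and `ocr_via_cli` are hypothetical helpers. It assumes the `tesseract` executable is on the PATH and uses its documented `stdout` output base and `-l` language option. Setting `LC_ALL=C` for the child is a precaution only, since the child process has its own locale state in any case.

```python
import os
import subprocess


def c_locale_env(base_env=None):
    """Return a copy of the environment with LC_ALL forced to "C"."""
    env = dict(os.environ if base_env is None else base_env)
    env["LC_ALL"] = "C"
    return env


def build_command(image_path, lang="eng"):
    """Build the tesseract command line; "stdout" sends the text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_via_cli(image_path, lang="eng"):
    """Run tesseract in a child process and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, lang),
        env=c_locale_env(),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8")
```

Because the OCR runs in a separate process, the locale of the calling Python process is never touched, at the cost of process start-up time for every call.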
After typing |
Tesseract 4.1 and 5.0 no longer depend on the locale settings. |
Thanks. So my bindings need to compile with any current version of tesseract. So to summarize, I only need to set the locale if Tesseract |
That's right. |
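For bindings that must work with several Tesseract versions, the decision could be sketched in Python as follows. The threshold of 4.1 is taken from the statement above that Tesseract 4.1 and 5.0 no longer depend on the locale settings; `needs_c_locale` is a hypothetical helper, and how the version string is obtained is left to the binding.

```python
import re


def needs_c_locale(version_string):
    """Return True if the "C" locale workaround should be applied.

    Assumption: versions below 4.1 still depend on the locale. An
    unparseable version is treated as needing the workaround.
    """
    match = re.search(r"(\d+)\.(\d+)", version_string)
    if match is None:
        return True
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (4, 1)
```

A binding could call this once at load time and skip the locale switch entirely on newer versions.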
Can we close this issue? |
There was no recent activity and I think everything was answered, so I close it now. |
Workaround for python users
|
@wd: which tesseract version are you using? AFAIK this problem is solved in recent tesseract version. |
@zdenop I know it's solved in 4.1. But the Python Docker image uses Debian stable (buster) as the base image, which only includes Tesseract 4.0. |
Use |
@stweil I'm using Debian. First I tried to add the PPA using the command
I checked the source list file.
And checked the URL http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu/dists/, there isn't a dist named as After doing some research, I noticed that Ubuntu 18.10 (Cosmic) has the same version of libc6 (2.28) as Debian buster. But I think there isn't a binary version for Cosmic: http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu/dists/cosmic/main/binary-amd64/Packages.gz is an empty file. Is there anything I missed? I still can't install tesseract-ocr 4.1 on Debian buster. |
ping @AlexanderP |
buster: Fetch and install the GnuPG key
|
Thanks, I have finally upgraded tesseract-ocr to 4.1. I also added more notes to the wiki for users who want to install 4.1 on stable and other versions. Previously it was just a link, and I didn't realize that it has instructions on how to install it on Debian stable. |
It should not be needed since tesseract 4.1, see tesseract-ocr/tesseract#1670
Ubuntu 16.04, default locale is "en_US.UTF-8". I invoke the tesseract library via cffi. It now fails with the following error:
!strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 192
It worked fine before the baseapi.cpp locale assertion was introduced in commit 3292484 on 06/07/18.
Any suggestions for getting around this issue? Thx.
A C or C++ program seems to start with the default locale "C"; however, that is not the case for Python, where the default is "en_US.UTF-8".