recent change setlocale in baseapi.c causes Python loaded tesseract library to fail #1670

Closed
jwnsu opened this issue Jun 14, 2018 · 39 comments

@jwnsu

jwnsu commented Jun 14, 2018

Ubuntu 16.04, default locale is "en_US.UTF-8". I invoke the Tesseract library via cffi, and it now fails with the following error:
!strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 192

It worked fine before the baseapi.cpp locale assertion was introduced in commit 3292484 on 06/07/18.

Any suggestions for getting around this issue? Thanks.

C and C++ programs seem to start in the default "C" locale; however, that is not the case for Python, where the default is "en_US.UTF-8".
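
To check what the C library actually reports inside a Python process, the locale categories can be queried without changing them. This is only a minimal sketch; the helper name `current_locale_settings` is made up for illustration, and the values printed depend on the environment.

```python
import locale


def current_locale_settings():
    """Return the C library's current setting for a few locale categories."""
    categories = {
        "LC_ALL": locale.LC_ALL,
        "LC_CTYPE": locale.LC_CTYPE,
        "LC_NUMERIC": locale.LC_NUMERIC,
    }
    # Calling setlocale() with only a category queries the setting; it changes nothing.
    return {name: locale.setlocale(cat) for name, cat in categories.items()}


if __name__ == "__main__":
    for name, value in current_locale_settings().items():
        print(f"{name}: {value}")
```

If LC_ALL is reported as anything other than "C", the assertion quoted above would be expected to fire once the library is loaded into that process.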

@Shreeshrii
Collaborator

Shreeshrii commented Jun 14, 2018 via email

@jwnsu
Author

jwnsu commented Jun 14, 2018

Thanks. Are there any side effects of forcing the locale to "C"?

@stweil
Contributor

stweil commented Jun 14, 2018

Sure. Setting the locale has lots of side effects. My default locale for Python is de_DE.UTF-8, so the default can differ from system to system. You have to find out whether "C" works with your Python code, or whether you must restore the original locale after calling the Tesseract API.

Tesseract currently requires "C" locale because otherwise some functions can give bad results or fail.

@Shreeshrii
Collaborator

Shreeshrii commented Jun 14, 2018 via email

@laurikari

This is going to cause huge problems for people who are running Tesseract as a library. Setting locale="C" will probably cause various unwanted side-effects throughout the application.

Setting/resetting locale for the duration of Tesseract API calls is also problematic in multithreaded applications, for example.

I suggest that, instead of requiring locale="C", Tesseract be changed to use something other than sscanf() so that strings are parsed in a locale-independent way.

@stweil
Contributor

stweil commented Jun 18, 2018

@laurikari, I agree. As soon as all *scanf code is replaced by code which does not depend on the locale, the assertions can be removed. We just had to make sure now that people don't get wrong results without any notice.

@troplin

troplin commented Jun 22, 2018

Even for C/C++ I usually call

setlocale(LC_CTYPE, "");

as the first thing in main, which sets the locale to the value specified in the environment.

Depending on "C" locale seems quite bad to me.

@stweil
Contributor

stweil commented Jun 22, 2018

Related issues which are the reason why we currently enforce the "C" locale: #1250, #1532.

@stweil
Contributor

stweil commented Jun 22, 2018

Here is a (potentially incomplete) list of function calls which have to be replaced to get a Tesseract library which does not depend on the locale: atoi, isspace, strtod, strtof, strtol, sscanf.

2018-10-08: printf, fprintf and other *printf functions need fixes for formatting of float and double values.

@zindarod

So currently there's no other solution except setting: LC_ALL=C?

@jeroen
Contributor

jeroen commented Aug 10, 2018

I am using a pattern like the one below to temporarily set the locale to C while initializing the engine:

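// Save the current locale; setlocale() may reuse its buffer, so copy the string.
char *old_locale = strdup(setlocale(LC_ALL, NULL));
setlocale(LC_ALL, "C");
tesseract::TessBaseAPI api;
api.InitForAnalysePage();
// Restore the previous locale once the engine is initialized.
setlocale(LC_ALL, old_locale);
free(old_locale);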

Is this correct or does it only bypass the assertion?

@stweil
Contributor

stweil commented Aug 10, 2018

It avoids the assertion, but the problem that led to the assertion being added remains, so users risk getting wrong results or crashes later.

@jeroen
Contributor

jeroen commented Aug 10, 2018

Oh, that's not good. In my experiments it seemed to solve the problems in #1532, and I was able to OCR Japanese/Korean text, which I could not do before. I was hopeful that the locale-sensitive operations were done during init.

In my case, tesseract is called by the user via language bindings, so I cannot permanently change the locale of the process. The only solution is to temporarily set the locale in the bindings when calling the tesseract API.

Our full bindings are pretty minimal. Where else do we need to temporarily set the locale to C? The OCR happens here:

  api->ClearAdaptiveClassifier();
  api->SetImage(image);
  if(api->GetSourceYResolution() < 70)
    api->SetSourceResolution(300);
  char *outText = HOCR ? api->GetHOCRText(0) : api->GetUTF8Text();
  pixDestroy(&image);
  api->Clear();

@zdenop
Contributor

zdenop commented Sep 30, 2018

What about adding jeroen's code to the Tesseract API init?

@stweil
Contributor

stweil commented Oct 2, 2018

That would not be a safe solution. See my previous answer.

@amitdo
Collaborator

amitdo commented Oct 2, 2018

IMO, the right solution is here:
#1670 (comment)

@stweil
Contributor

stweil commented Oct 8, 2018

Here is a (potentially incomplete) list of function calls which have to be replaced to get a Tesseract library which does not depend on the locale: atoi, isspace, strtod, strtof, strtol, sscanf.

isspace is addressed by pull request #1965.

I currently don't know of a locale which influences atoi or strtol.

strtof, strtod cannot read values with a decimal point when the locale uses a decimal comma (like locale de_DE.UTF-8).

sscanf and other *scanf functions can have problems with bytes wrongly interpreted as space (see #1250 (comment)) and can also misinterpret float or double values with a decimal point.

*printf functions write float and double using the decimal separator defined by the locale, but Tesseract always expects a decimal point.

So those last three groups have to be fixed / replaced before we can remove the assertion.

Instead of strtof and strtod, strtof_l and strtod_l (or _strtof_l and _strtod_l for Windows) can be used.
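
The decimal-comma problem with strtod described above can be reproduced from Python by calling the C library's strtod through ctypes under different LC_NUMERIC settings. This is only a sketch, assuming a POSIX system where the C library can be loaded this way; the helper `strtod_under` is a name invented here, and it returns None when the requested locale is not installed.

```python
import ctypes
import ctypes.util
import locale

# Load the C library (assumption: POSIX; CDLL(None) falls back to the running process).
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.strtod.restype = ctypes.c_double
libc.strtod.argtypes = [ctypes.c_char_p, ctypes.POINTER(ctypes.c_char_p)]


def strtod_under(locale_name, text):
    """Parse `text` with the C strtod() while LC_NUMERIC is set to `locale_name`."""
    saved = locale.setlocale(locale.LC_NUMERIC)
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
    except locale.Error:
        return None  # the requested locale is not available on this machine
    try:
        return libc.strtod(text.encode("ascii"), None)
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)


if __name__ == "__main__":
    for name in ("C", "de_DE.UTF-8"):
        print(name, strtod_under(name, "1.5"))
```

Where de_DE.UTF-8 is installed, the second call should stop parsing at the decimal point, which is the kind of silently wrong result the assertion was added to prevent.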

@ephes

ephes commented Oct 20, 2018

My current workaround for this looks like this:

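import locale
from contextlib import contextmanager

@contextmanager
def c_locale(reset_to="C.UTF-8"):
    # Switch LC_CTYPE to "C" for the duration of the with-block.
    locale.setlocale(locale.LC_CTYPE, "C")
    try:
        yield
    finally:
        # Reset the locale even if the OCR call raises.
        locale.setlocale(locale.LC_CTYPE, reset_to)
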
with c_locale():
    from tesserocr import PyTessBaseAPI
    with PyTessBaseAPI() as api:
        api.Init(lang="deu")
        api.SetImage(box_image)
        ocr_result = api.GetUTF8Text()
        print(ocr_result)

@martin-huber

Does anyone have an idea how to set the C locale for a JNA library when calling it from Java?
I tried to set Locale.setDefaultLocale(Locale.ROOT), but this didn't help.
We are using tess4j, a JNA wrapper to use tesseract from Java, and using tesseract4 does not work because of the assertion.

@martin-huber

martin-huber commented Nov 9, 2018

I tried to set Locale.setDefaultLocale(Locale.ROOT), but this didn't help.

And by the way, this also wouldn't work in a web environment, because this setting is VM-wide, so it would affect everything else that is happening in parallel as well.

@martin-huber

I found a way to set the locale to "C" from Java (using JNA). See here for a discussion:
nguyenq/tess4j#106 (comment)

It works, but I am not sure about any side effects of this.

rhardih added a commit to rhardih/bad that referenced this issue Jan 9, 2019
This includes a fix for segfault on init problem mentioned by these two
issues:

tesseract-ocr/tesseract#1670
tesseract-ocr/tesseract#2151
rhardih added a commit to rhardih/bad that referenced this issue Jan 15, 2019
This includes a fix for segfault on init problem mentioned by these two
issues:

tesseract-ocr/tesseract#1670
tesseract-ocr/tesseract#2151
@stweil
Contributor

stweil commented May 2, 2019

Pull request #2420 replaces strtof and strtod which fixes more dependencies on the locale settings. The critical sscanf calls were already replaced by earlier commits.

I think we can now consider removing the assertion as soon as we have tested that the issues #1250 and #1532 are still fixed.

@amitdo
Collaborator

amitdo commented May 12, 2019

C++17 has to_chars() and from_chars().

https://en.cppreference.com/w/cpp/utility/to_chars
https://en.cppreference.com/w/cpp/utility/from_chars

Compiler support is currently partial.

mayanedms pushed a commit to mayan-edms/Mayan-EDMS that referenced this issue Jun 19, 2019
This new backend uses a command call to avoid
Tesseract bug 1670
(tesseract-ocr/tesseract#1670).

Signed-off-by: Roberto Rosario <[email protected]>
@agnelvishal

After typing export LC_ALL=C in the terminal, run the Python code in the same terminal. Running the Python code in a different terminal window won't work, because the variable is only set for that one shell session. If you are using an IDE, open the IDE from the terminal where export LC_ALL=C was entered.
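
The same idea can be expressed without relying on a particular terminal session: a parent script can start the OCR code as a child process with LC_ALL=C in the child's environment only. The sketch below is an illustration rather than a recommended setup; `env_with_c_locale` and `run_with_c_locale` are hypothetical helpers, and the demo child is just another Python interpreter that echoes the variable.

```python
import os
import subprocess
import sys


def env_with_c_locale(base_env=None):
    """Return a copy of the environment with LC_ALL forced to "C"."""
    env = dict(os.environ if base_env is None else base_env)
    env["LC_ALL"] = "C"
    return env


def run_with_c_locale(cmd):
    """Run `cmd` in a child process whose environment has LC_ALL=C."""
    return subprocess.run(
        cmd, env=env_with_c_locale(), capture_output=True, text=True, check=True
    )


if __name__ == "__main__":
    # Stand-in for the real OCR script: the child only reports what it inherited.
    child = [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
    print(run_with_c_locale(child).stdout.strip())
```

Only the child sees the changed variable, so the parent process and any IDE that launched it keep their own locale.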

@stweil
Contributor

stweil commented Jul 26, 2019

Tesseract 4.1 and 5.0 no longer depend on the locale settings.

@jeroen
Contributor

jeroen commented Jul 26, 2019

Thanks. My bindings need to compile with any current version of Tesseract. So, to summarize: I only need to set the locale for Tesseract 4.x < 4.1, and not for 3.x or for 4.1+?

@stweil
Contributor

stweil commented Jul 26, 2019

That's right.
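
The rule confirmed above (only 4.x releases before 4.1 need the "C" locale) could be encoded in a binding roughly as follows. This is a sketch under the assumption that the binding can obtain the library version as a string such as "4.0.0"; the function name `needs_c_locale` is invented for illustration.

```python
import re


def needs_c_locale(version):
    """Return True for Tesseract 4.0.x, the only series said to need the "C" locale."""
    match = re.match(r"\s*(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # 3.x has no assertion; 4.1 and later no longer depend on the locale.
    return major == 4 and minor < 1
```

A binding would then switch the locale around its API calls only when this returns True, and leave the process locale alone otherwise.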

@amitdo
Collaborator

amitdo commented Dec 5, 2019

Can we close this issue?

@stweil
Contributor

stweil commented Dec 5, 2019

There was no recent activity and I think everything was answered, so I close it now.

@stweil stweil closed this as completed Dec 5, 2019
@wd

wd commented Dec 28, 2019

Workaround for python users

import locale
locale.setlocale(locale.LC_CTYPE, 'C')  # set locale to C
import tesserocr
locale.setlocale(locale.LC_CTYPE, '')  # set locale back

@zdenop
Contributor

zdenop commented Dec 28, 2019

@wd: which Tesseract version are you using? AFAIK this problem is solved in recent Tesseract versions.

@wd

wd commented Dec 28, 2019

@zdenop I know it's solved in 4.1. But the Python Docker image uses Debian stable (buster) as its base image, which only includes Tesseract 4.0.

@stweil
Contributor

stweil commented Dec 28, 2019

Use add-apt-repository -y ppa:alex-p/tesseract-ocr before installing Tesseract in your Dockerfile to get a newer release.

@wd

wd commented Dec 29, 2019

@stweil I'm using Debian. First I tried to add the PPA with the add-apt-repository command and ran apt update, but got a 404 error.

Ign:3 http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu focal InRelease
Err:6 http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu focal Release
  404  Not Found [IP: 91.189.95.83 80]

I checked the source list file.

deb http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu focal main

I also checked the URL http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu/dists/, and there isn't a dist named focal.

After doing some research, I noticed that Ubuntu 18.10 (Cosmic) has the same version of libc6 (2.28) as Debian buster. But I think there isn't a binary version for Cosmic: http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu/dists/cosmic/main/binary-amd64/Packages.gz is an empty file.

Is there anything I missed? I still can't install tesseract-ocr 4.1 on Debian buster.

@Shreeshrii
Collaborator

ping @AlexanderP

@AlexanderP

buster:
deb https://notesalexp.org/tesseract-ocr/buster/ buster main
cosmic:
deb https://notesalexp.org/tesseract-ocr/cosmic/ cosmic main

Fetch and install the GnuPG key

sudo apt-get update -oAcquire::AllowInsecureRepositories=true
sudo apt-get install notesalexp-keyring -oAcquire::AllowInsecureRepositories=true
sudo apt-get update

@amitdo
Collaborator

amitdo commented Dec 29, 2019

@wd

wd commented Dec 30, 2019

Thanks, I have finally upgraded tesseract-ocr to 4.1. I also added more notes to the wiki for users who want to install 4.1 on stable and other versions. Previously it was just a link, and I didn't realize it had instructions on how to install it on Debian stable.
