recent change setlocale in baseapi.c causes Python loaded tesseract library to fail #1670

jwnsu · 2018-06-14T03:54:32Z

Ubuntu 16.04, default locale is "en_US.UTF-8". I invoke the Tesseract library via cffi, and it now fails with the following
error:
!strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 192

It worked fine before the baseapi.cpp locale assertion was introduced in commit 3292484 on 06/07/18.

Any suggestion to get around this issue? Thx.

A C or C++ program seems to start with the default locale "C"; however, that is not the case for Python, where the default is "en_US.UTF-8".

Shreeshrii · 2018-06-14T04:04:55Z

Set the locale to "C".

jwnsu · 2018-06-14T04:05:25Z

Thx. Are there any side effects of forcing the locale to "C"?

stweil · 2018-06-14T04:16:17Z

Sure. Setting the locale has lots of side effects. My default locale for Python is de_DE.UTF-8, so the default can be different. You have to find out whether "C" works with your Python code or whether you must restore the original locale after calling the Tesseract API.

Tesseract currently requires "C" locale because otherwise some functions can give bad results or fail.

Shreeshrii · 2018-06-14T05:42:10Z

Thanks. I have added info to a new wiki page: https://github.com/tesseract-ocr/tesseract/wiki/4.0x-Common-Errors-and-Resolutions

laurikari · 2018-06-18T08:13:44Z

This is going to cause huge problems for people who are running Tesseract as a library. Setting locale="C" will probably cause various unwanted side-effects throughout the application.

Setting/resetting locale for the duration of Tesseract API calls is also problematic in multithreaded applications, for example.

I suggest that, instead of requiring locale="C", Tesseract be changed to use something other than sscanf() so that strings are parsed in a locale-independent way.

stweil · 2018-06-18T11:44:35Z

@laurikari, I agree. As soon as all *scanf code is replaced by code which does not depend on the locale, the assertions can be removed. We just had to make sure now that people don't get wrong results without any notice.

troplin · 2018-06-22T09:37:58Z

Even for C/C++ I usually call

setlocale(LC_CTYPE, "");

as the first thing in main, which sets the locale to the value specified in the environment.

Depending on "C" locale seems quite bad to me.

stweil · 2018-06-22T15:25:11Z

Related issues which are the reason why we currently enforce the "C" locale: #1250, #1532.

stweil · 2018-06-22T18:02:25Z

Here is a (potentially incomplete) list of function calls which have to be replaced to get a Tesseract library which does not depend on the locale: atoi, isspace, strtod, strtof, strtol, sscanf.

2018-10-08: printf, fprintf and other *printf functions need fixes for formatting of float and double values.

zindarod · 2018-06-27T19:09:09Z

So currently there's no other solution except setting: LC_ALL=C?

jeroen · 2018-08-10T11:42:46Z

I am using a pattern like the one below to temporarily set the locale to C while initializing the engine:

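// Save a copy of the current locale; setlocale() may reuse its return buffer.
char *old_locale = strdup(setlocale(LC_ALL, NULL));
setlocale(LC_ALL, "C");
tesseract::TessBaseAPI api;
api.InitForAnalysePage();
// Restore the saved locale and release the copy.
setlocale(LC_ALL, old_locale);
free(old_locale);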

Is this correct or does it only bypass the assertion?

stweil · 2018-08-10T13:14:12Z

It avoids the assertion, but the problem which was the reason why this assertion was added remains, so users risk getting wrong results or crashes later.

jeroen · 2018-08-10T13:24:57Z

Oh, that's not good. In my experiments it seemed to solve the problems in #1532 and I was able to OCR Japanese/Korean text, which I could not do before. I was hopeful that the locale-sensitive operations were done during init.

In my case, Tesseract is called by the user via language bindings, so I cannot permanently change the locale of the process. The only solution is to temporarily set the locale in the bindings when calling the Tesseract API.

Our full bindings are pretty minimal. Where else do we need to temporarily set the locale to C? The OCR happens here:

  api->ClearAdaptiveClassifier();
  api->SetImage(image);
  if(api->GetSourceYResolution() < 70)
    api->SetSourceResolution(300);
  char *outText = HOCR ? api->GetHOCRText(0) : api->GetUTF8Text();
  pixDestroy(&image);
  api->Clear();

amitdo · 2018-08-10T13:53:30Z

@stweil

For POSIX:
https://stackoverflow.com/a/13919957

For Windows:
~~https://docs.microsoft.com/en-us/windows/desktop/api/winnls/nf-winnls-setthreadlocale~~
https://docs.microsoft.com/en-us/cpp/parallel/multithreading-and-locales

zdenop · 2018-09-30T14:51:09Z

What about implementing jeroen's code in the Tesseract API init?

stweil · 2018-10-02T17:47:46Z

That would not be a safe solution. See my previous answer.

amitdo · 2018-10-02T18:11:23Z

IMO, the right solution is here:
#1670 (comment)

stweil · 2018-10-08T15:33:54Z

Here is a (potentially incomplete) list of function calls which have to be replaced to get a Tesseract library which does not depend on the locale: atoi, isspace, strtod, strtof, strtol, sscanf.

isspace is addressed by pull request #1965.

I currently don't know a locale which influences atoi or strtol.

strtof, strtod cannot read values with a decimal point when the locale uses a decimal comma (like locale de_DE.UTF-8).

sscanf and other *scanf functions can have problems with bytes wrongly interpreted as space (see #1250 (comment)) and can also misinterpret float or double values with a decimal point.

*printf functions write float and double using the decimal separator defined by the locale, but Tesseract always expects a decimal point.

So those last three groups have to be fixed / replaced before we can remove the assertion.

Instead of strtof and strtod, strtof_l and strtod_l (or _strtof_l and _strtod_l for Windows) can be used.
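
The decimal separator difference described above can be inspected from Python. This is only an illustrative sketch: `decimal_point_for` is a hypothetical helper, and whether de_DE.UTF-8 is installed depends on the system, so the helper returns None when a locale is unavailable. Where de_DE.UTF-8 is installed, it is expected to report a comma.

```python
import locale


def decimal_point_for(locale_name):
    """Return the decimal separator of `locale_name`, or None if it is not installed."""
    previous = locale.setlocale(locale.LC_NUMERIC)  # query only, changes nothing
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
    except locale.Error:
        return None  # locale not available; nothing was changed
    try:
        return locale.localeconv()["decimal_point"]
    finally:
        # Put LC_NUMERIC back so the probe has no lasting effect.
        locale.setlocale(locale.LC_NUMERIC, previous)


for name in ("C", "de_DE.UTF-8"):
    print(name, repr(decimal_point_for(name)))
```

A C library routine such as strtod reads the same separator from the active locale, which is why a value written with a decimal point can be misread under a decimal-comma locale.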

ephes · 2018-10-20T08:27:39Z

My current workaround for this looks like this:

import locale
from contextlib import contextmanager

@contextmanager
def c_locale(reset_to="C.UTF-8"):
    # Tesseract 4.0 asserts that the locale is "C".
    locale.setlocale(locale.LC_CTYPE, "C")
    try:
        yield
    finally:
        # Switch back even if the OCR call raises.
        locale.setlocale(locale.LC_CTYPE, reset_to)

with c_locale():
    from tesserocr import PyTessBaseAPI
    with PyTessBaseAPI() as api:
        api.Init(lang="deu")
        api.SetImage(box_image)
        ocr_result = api.GetUTF8Text()
        print(ocr_result)

martin-huber · 2018-11-09T10:52:08Z

Does anyone have an idea how to set the C locale for a JNA library when calling it from Java?
I tried to set Locale.setDefault(Locale.ROOT), but this didn't help.
We are using tess4j, a JNA wrapper to use tesseract from Java, and using tesseract4 does not work because of the assertion.

martin-huber · 2018-11-09T10:55:16Z

I tried to set Locale.setDefault(Locale.ROOT), but this didn't help.

And by the way, this also wouldn't work in a web environment, because this setting is VM-wide, so it would affect everything else that is happening in parallel as well.

martin-huber · 2018-11-09T13:42:51Z

I found a way to set the locale to "C" from Java (using JNA). See here for a discussion:
nguyenq/tess4j#106 (comment)

It works, but I am not sure about any side effects of this.

stweil · 2019-05-02T06:35:22Z

Pull request #2420 replaces strtof and strtod which fixes more dependencies on the locale settings. The critical sscanf calls were already replaced by earlier commits.

I think we can now consider removing the assertion as soon as we have tested that the issues #1250 and #1532 are still fixed.

amitdo · 2019-05-12T11:47:09Z

C++17 has to_chars() and from_chars().

https://en.cppreference.com/w/cpp/utility/to_chars
https://en.cppreference.com/w/cpp/utility/from_chars

Compiler support is currently partial.

agnelvishal · 2019-07-24T12:43:13Z

After typing export LC_ALL=C in the terminal, run the Python code in the same terminal. Running the Python code in a different terminal window won't work. If you use an IDE, start it from the terminal where export LC_ALL=C was entered.
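
The same effect can be sketched from Python by starting the OCR code as a child process with LC_ALL=C in its environment, which leaves the parent process untouched. The helper name `run_with_c_locale` and the script name in the comment are hypothetical.

```python
import os
import subprocess
import sys


def run_with_c_locale(args):
    """Run `args` as a child process with LC_ALL=C and return the completed process."""
    # Copy the current environment and override only LC_ALL.
    env = dict(os.environ, LC_ALL="C")
    return subprocess.run(args, env=env, capture_output=True, text=True, check=True)


# Hypothetical usage, assuming ocr_script.py loads the Tesseract library:
#   result = run_with_c_locale([sys.executable, "ocr_script.py"])
#   print(result.stdout)
```

Like the shell export, this only affects the process that is started with the modified environment.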

stweil · 2019-07-26T06:15:21Z

Tesseract 4.1 and 5.0 no longer depend on the locale settings.

jeroen · 2019-07-26T09:02:53Z

Thanks. My bindings need to compile with any current version of Tesseract. So to summarize: I only need to set the locale for Tesseract 4.x < 4.1, and not for 3.x or for 4.1+?

stweil · 2019-07-26T10:47:54Z

That's right.
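
The rule confirmed above can be sketched as a small version check. This is an illustration only: `needs_c_locale` is a hypothetical helper, and it assumes the version is available as a string that starts with "major.minor", such as "4.0.0-beta.1".

```python
import re


def needs_c_locale(version):
    """Return True only for the Tesseract 4.0.x series (not 3.x, not 4.1 or later)."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) == (4, 0)
```

Bindings could call such a check once at load time and apply a locale workaround only when it returns True.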

amitdo · 2019-12-05T17:44:47Z

Can we close this issue?

stweil · 2019-12-05T20:06:19Z

There was no recent activity and I think everything was answered, so I close it now.

wd · 2019-12-28T03:12:32Z

Workaround for python users

import locale
locale.setlocale(locale.LC_CTYPE, 'C')  # set locale to C
import tesserocr
locale.setlocale(locale.LC_CTYPE, '')  # set locale back

zdenop · 2019-12-28T07:00:16Z

@wd: which Tesseract version are you using? AFAIK this problem is solved in recent Tesseract versions.

wd · 2019-12-28T07:41:38Z

@zdenop I know it's solved in 4.1. But the Python Docker image uses Debian stable (buster) as the base image, which only includes Tesseract 4.0.

stweil · 2019-12-28T08:29:18Z

Use add-apt-repository -y ppa:alex-p/tesseract-ocr before installing Tesseract in your Dockerfile to get a newer release.

wd · 2019-12-29T03:01:09Z

@stweil I'm using Debian. First I tried to add the PPA with add-apt-repository and ran apt update, but got a 404 error.

Ign:3 http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu focal InRelease
Err:6 http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu focal Release
  404  Not Found [IP: 91.189.95.83 80]

I checked the source list file.

deb http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu focal main

I also checked the URL http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu/dists/; there is no dist named focal.

After doing some research, I noticed that Ubuntu 18.10 (Cosmic) has the same version of libc6 (2.28) as Debian buster. But I think there is no binary version for Cosmic: http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu/dists/cosmic/main/binary-amd64/Packages.gz is an empty file.

Is there anything I missed? I still can't install tesseract-ocr 4.1 on Debian buster.

Shreeshrii · 2019-12-29T04:17:33Z

ping @AlexanderP

AlexanderP · 2019-12-29T06:56:01Z

buster:
deb https://notesalexp.org/tesseract-ocr/buster/ buster main
cosmic:
deb https://notesalexp.org/tesseract-ocr/cosmic/ cosmic main

Fetch and install the GnuPG key

sudo apt-get update -oAcquire::AllowInsecureRepositories=true
sudo apt-get install notesalexp-keyring -oAcquire::AllowInsecureRepositories=true
sudo apt-get update

amitdo · 2019-12-29T08:34:10Z

https://github.com/tesseract-ocr/tesseract/wiki#tesseract-4-packages-with-lstm-engine-and-related-traineddata

wd · 2019-12-30T01:22:11Z

Thanks, I have finally upgraded tesseract-ocr to 4.1. I also added more notes to the wiki for users who want to install 4.1 on stable and other versions. Previously it was just a link, and I didn't realize it has instructions on how to install it on Debian stable.

jxc928 mentioned this issue Jun 28, 2018

!strcmp(locale, "C"):Error:Assert failed:in file ../../../src/api/baseapi.cpp, line 191 nguyenq/tess4j#105

Closed

sirfz mentioned this issue Aug 29, 2018

Couldn't import tesserocr, because locale check error sirfz/tesserocr#137

Closed

bertsky mentioned this issue Sep 26, 2018

new locale assertions in Tesseract are incompatible with Click OCR-D/ocrd_tesserocr#23

Closed

zdenop mentioned this issue Oct 2, 2018

tesseract failed loading non-english language.traineddata #1250

Closed

stweil added the feature request label Nov 9, 2018

zdenop mentioned this issue Jan 8, 2019

SIGSEGV - new tesseract::TessBaseAPI() segfaults on Android #2151

Closed

rhardih added a commit to rhardih/bad that referenced this issue Jan 9, 2019

Updates to latest step tag

9edaa14

This includes a fix for segfault on init problem mentioned by these two issues: tesseract-ocr/tesseract#1670 tesseract-ocr/tesseract#2151

rhardih added a commit to rhardih/bad that referenced this issue Jan 15, 2019

Updates to latest step tag

3bd69f3

This includes a fix for segfault on init problem mentioned by these two issues: tesseract-ocr/tesseract#1670 tesseract-ocr/tesseract#2151

bertsky mentioned this issue Jan 25, 2019

!strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 209 sirfz/tesserocr#165

Closed

nijel mentioned this issue Feb 26, 2019

teserract 4.0 compatibility WeblateOrg/weblate#2581

Closed

stweil added this to the 4.1.0 milestone Apr 14, 2019

mayanedms pushed a commit to mayan-edms/Mayan-EDMS that referenced this issue Jun 19, 2019

Add new default Tesseract OCR backend

32cf0a0

This new backend uses a command call to avoid Tesseract bug 1670 (tesseract-ocr/tesseract#1670). Signed-off-by: Roberto Rosario <[email protected]>

jeroen mentioned this issue Jul 26, 2019

Undo locale workaround for Engine 4.1 + ropensci/tesseract#44

Closed

stweil closed this as completed Dec 5, 2019

xuhdev mentioned this issue Feb 19, 2020

Bump MAX Base IBM/MAX-OCR#6

Merged

canihavesomecoffee mentioned this issue Feb 21, 2020

[BUG] Segmentation fault when extracting ts file CCExtractor/ccextractor#1234

Closed

amitdo added the locale label Mar 21, 2021

stweil mentioned this issue Sep 24, 2021

pixConvertToPdf output invalid due to LC_NUMERIC DanBloomberg/leptonica#591

Open

nguyenq mentioned this issue Apr 28, 2022

Docker Image with Java 11 + tess4j:5.2.0 + Spring Boot 2.6.6 not working nguyenq/tess4j#231

Closed

nijel added a commit to WeblateOrg/weblate that referenced this issue Oct 17, 2023

screenshots: remove locale workaround for tesseract

3682f50

It should not be needed since tesseract 4.1, see tesseract-ocr/tesseract#1670
