!strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 209 #165

charhuang · 2018-12-02T19:21:06Z

import tesserocr
from PIL import Image

image = Image.open('image.png')
print(tesserocr.image_to_text(image))

Mac 10.14.1
tesserocr 2.3.1
Python 3.6.7

xiejun946 · 2018-12-03T11:22:01Z

I'm getting the same error as you.

xiejun946 · 2018-12-03T12:35:42Z

Try it like this; it works well for me:
import locale
locale.setlocale(locale.LC_ALL, 'C')
import tesserocr

abhishek-jain-infrrd · 2018-12-10T11:52:22Z

This can be fixed either by running export LC_ALL=C in the terminal or by calling locale.setlocale(locale.LC_ALL, 'C') in the Python file where you use tesserocr. If you still get the error in the PyCharm IDE, set LC_ALL=C as an environment variable in the run configuration for that file.

jameshe3 · 2019-01-17T01:14:50Z

Try it like this; it works well for me:
import locale
locale.setlocale(locale.LC_ALL, 'C')
import tesserocr

I ran into the same issue. Your solution helped. Thanks!

bertsky · 2019-01-25T20:48:35Z

It's mentioned in the FAQ as well. And BTW this measure is only meant to be provisional, to be removed as soon as certain legacy parts of the code base (locale-dependent functions like sscanf) are gone.

You might have problems with this workaround though: other parts of your code might depend on localization, e.g. on *.UTF-8. In that case, you can still try to set the locale to C merely temporarily (just before importing from tesserocr), and maybe also set it back afterwards (but the latter risks giving sub-optimal results or even a crash during recognition).

nijel · 2019-02-26T20:48:05Z

This is really ridiculous, it makes it nearly impossible to use the module in a program using locales. Does anybody have an idea what kind of breakage one can expect if locales are changed after importing the module? Or is the recommended approach to set the C locale before every tesserocr call and restore the correct locale after it?

PS: I've at least tried to relax the limitation to support the utf-8 locale as well in tesseract-ocr/tesseract#2272, as without that Python 3 before 3.7 has problems handling Unicode filenames...

PS2: Frequently changing locales might not be a good idea either, quoting the Python locale docs:

The C standard defines the locale as a program-wide property that may be relatively expensive to change. On top of that, some implementation are broken in such a way that frequent locale changes may cause core dumps. This makes the locale somewhat painful to use correctly.

bertsky · 2019-02-26T22:15:39Z

Indeed. I'll have a chance to ask @stweil about it tomorrow. The utf-8 patch was a good idea, thanks!

What you dug up about locale switching in Python does make matters worse. So maybe this should really be done on the C level in tesserocr? Has anyone tried that?

Interestingly, gImageReader does temporarily reset the locale, but only for Tesseract initialisation, not during recognition. And both libopencv-contrib/modules/text and ffmpeg/libavfilter do nothing about it (but seem to be based on version 3 still).

stweil · 2019-02-27T05:27:27Z

There is a whole number of functions in the standard C library which depend on the current locale, not only sscanf, but also sprintf and their relatives and several classifier functions. See my comment tesseract-ocr/tesseract#1670 (comment).

This causes problems like tesseract-ocr/tesseract#1250.

With my German locale de_DE.UTF-8, for example, Tesseract would read wrong float and double values from existing configuration files because German writes such numbers like 3,14159 instead of 3.14159. A file written with that locale would be incompatible with most of the rest of the world.
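The decimal-separator problem described above can be reproduced from Python as a rough illustration. This is only a sketch: parse_in_locale is a hypothetical helper, it uses Python's locale.atof as a stand-in for the C functions Tesseract calls (whose exact behaviour differs), and it assumes the de_DE.UTF-8 locale is installed, printing a note otherwise.

```python
import locale


def parse_in_locale(text, locale_name):
    """Parse `text` as a float using the number format of `locale_name`."""
    # Calling setlocale() without a second argument only queries the setting.
    saved = locale.setlocale(locale.LC_NUMERIC)
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
        return locale.atof(text)
    finally:
        # Always restore the previous numeric locale.
        locale.setlocale(locale.LC_NUMERIC, saved)


# In the C locale the dot is the decimal separator.
print(parse_in_locale("3.14159", "C"))

try:
    # With German conventions the comma is the decimal separator,
    # so the same digits written with a dot are interpreted differently.
    print(parse_in_locale("3,14159", "de_DE.UTF-8"))
    print(parse_in_locale("3.14159", "de_DE.UTF-8"))
except locale.Error:
    print("de_DE.UTF-8 is not installed on this system")
```

Comparing the last two printed values on a system with the German locale shows why a configuration file written under one locale cannot be read reliably under another.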

tesseract-ocr/tesseract@db9c7e0 shows how that can be fixed. Similar fixes are still needed for the rest of the Tesseract code. Then the hard requirement of a fixed locale setting and the assertions can be removed.

nijel · 2019-02-27T07:02:17Z

I understand the problems (I've fixed similar issues in many programs). I just find it strange that import tesserocr can lead to terminating Python. That's not really what I would expect from any library.

Also it would be great to know where all the C locale is needed to make things work properly. As @bertsky pointed out there are already programs switching locales for tesseract initialization, will that work properly or should it be there for the recognition as well (as outlined in tesseract-ocr/tesseract#1670 (comment))? Having this documented would make it easier to implement this properly.

stweil · 2019-02-27T07:11:34Z

Switching locale just to avoid the assertion and switching it back after Tesseract initialization will leave you with the risk of problems caused by a "wrong" locale. Depending on your own locale and on the code parts which you use it can work. If your locale does not use the comma as a decimal point and your primary characters are based on Latin, chances are high that it will work. With comma or with an Asian locale, chances are much lower. So to summarize, a temporary locale switch is not a general solution usable for everybody.

stweil · 2019-02-27T07:15:45Z

Also it would be great to know where all the C locale is needed to make things work properly.

Just search the code for the problematic function names. It is some work to decide when those code locations are used (text recognition or training, old recognizer or LSTM, ...). Therefore I think it is easier to replace all those code parts by code which does not depend on locale settings.
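A search like the one suggested here could be sketched roughly as follows. The function list is an assumption and certainly incomplete (only sscanf and sprintf are named in this thread), and find_locale_dependent_calls is a hypothetical helper, not part of Tesseract or tesserocr.

```python
import re

# Assumed, incomplete list of C library functions whose behaviour
# depends on the current locale.
LOCALE_DEPENDENT = (
    "sscanf", "fscanf", "sprintf", "snprintf", "fprintf",
    "strtod", "atof", "isalpha", "toupper", "tolower",
)

# Match a listed name as a whole word, followed by an opening parenthesis.
_PATTERN = re.compile(r"\b(" + "|".join(LOCALE_DEPENDENT) + r")\s*\(")


def find_locale_dependent_calls(source):
    """Return (line number, function name) pairs for suspicious calls."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


sample = (
    'int n = sscanf(buf, "%f", &x);\n'
    'puts("ok");\n'
    "double d = strtod(s, nullptr);\n"
)
print(find_locale_dependent_calls(sample))  # [(1, 'sscanf'), (3, 'strtod')]
```

A text search like this only lists candidate locations; deciding which of them are reached during recognition or training still needs manual review, as noted above.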

nijel · 2019-02-27T08:15:46Z

With the locale switching I'm trying to figure out a solution that would work now. It's certainly not nice, but I really don't see another way to use tesserocr.

The major problem is that upcoming Debian stable will come with tesseract 4.0.0, so apparently my code will have to deal with that version for years, no matter what you fix in upcoming tesseract releases. Not sure if something can be done in the tesserocr module to mitigate this or at least to make possible some runtime detection avoiding terminating Python...
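One possible shape for such a runtime check on the Python side, as a minimal sketch: it assumes the assertion fires whenever the process locale is not exactly "C" (as the assertion text in the title suggests), and both function names are hypothetical.

```python
import locale


def process_locale_is_c():
    """Return True if the whole process locale is currently "C"."""
    # Calling setlocale() without a second argument only queries the setting.
    return locale.setlocale(locale.LC_ALL) == "C"


def import_tesserocr_checked():
    """Import tesserocr, raising a Python exception instead of risking the assertion."""
    if not process_locale_is_c():
        raise RuntimeError(
            "locale is %r, but this Tesseract build may require 'C'"
            % locale.setlocale(locale.LC_ALL)
        )
    import tesserocr  # only reached when the locale is "C"
    return tesserocr
```

The caller can then catch RuntimeError and disable OCR or report the problem, rather than having the interpreter terminated.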

mikkelee · 2019-02-27T08:40:51Z

I'm currently doing this in Python 3.7:

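import locale
import logging
from contextlib import contextmanager

log = logging.getLogger(__name__)

@contextmanager
def c_locale():
    # Remember the current locale; fall back to en_US.UTF-8 if it cannot be parsed.
    try:
        currlocale = locale.getlocale()
    except ValueError:
        currlocale = ('en_US', 'UTF-8')
    log.debug(f'Switching to C from {currlocale}')
    locale.setlocale(locale.LC_ALL, "C")
    try:
        yield
    finally:
        # Restore the previous locale even if the body raised.
        log.debug(f'Switching to {currlocale} from C')
        locale.setlocale(locale.LC_ALL, currlocale)

def get_hocr(image):
    # Import and use tesserocr only while the C locale is active.
    with c_locale():
        from tesserocr import PyTessBaseAPI
        with PyTessBaseAPI() as tesseract:
            tesseract.SetImage(image)
            hocr = tesseract.GetHOCRText(0)
    return hocr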

It appears to work, as all access to Tesseract is done inside a C-locale.
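Because the locale is a process-wide setting, a variant of this idea for multi-threaded programs might serialize the switch with a lock. The following is only a sketch under that assumption: c_locale_locked is a hypothetical name, and the lock protects only code that goes through this helper, so other threads still see the C locale while the block runs.

```python
import locale
import threading
from contextlib import contextmanager

# Re-entrant so that nested use in the same thread does not deadlock.
_locale_lock = threading.RLock()


@contextmanager
def c_locale_locked():
    """Switch the process to the C locale for the duration of the block."""
    with _locale_lock:
        # Query the exact current setting so it can be passed back unchanged.
        saved = locale.setlocale(locale.LC_ALL)
        locale.setlocale(locale.LC_ALL, "C")
        try:
            yield
        finally:
            locale.setlocale(locale.LC_ALL, saved)


with c_locale_locked():
    # The locale is "C" inside the block; Tesseract calls would go here.
    current = locale.setlocale(locale.LC_ALL)
```

Saving the string returned by the query form of setlocale avoids the getlocale parsing step and its ValueError fallback.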

nijel · 2019-02-27T13:50:19Z

A similar workaround has been posted in tesseract-ocr/tesseract#1670 (comment). It's pretty straightforward to implement, but looking at the Python docs makes me a bit worried whether that is a good thing to do (see #165 (comment)). Anyway, I've implemented this in WeblateOrg/weblate@6724204 as I don't see a better way to address this for now.

agnelvishal · 2019-07-24T12:37:01Z

After typing export LC_ALL=C in the terminal, run the Python code in the same terminal. Running the Python code in a different terminal window won't work.
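The reason is that an exported variable is inherited only by processes started from that shell. The same effect can be had without touching the terminal by passing the variable to a child process directly; a minimal sketch, where run_with_c_locale is a hypothetical helper and the child command is just a placeholder that echoes the variable:

```python
import os
import subprocess
import sys


def run_with_c_locale(args):
    """Run a command with LC_ALL=C set only for that child process."""
    env = dict(os.environ, LC_ALL="C")
    return subprocess.run(
        args, env=env, capture_output=True, text=True, check=True
    )


# Placeholder child: a Python process that prints the LC_ALL it inherited.
result = run_with_c_locale(
    [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
)
print(result.stdout.strip())
```

In practice the placeholder command would be replaced by the script that imports tesserocr.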

stweil · 2019-07-26T06:02:19Z

Tesseract 4.1 and Tesseract 5 no longer require such locale hacks.

mikkelee · 2019-07-26T06:35:26Z

I can't find anything about Tesseract 5?

stweil · 2019-07-26T07:45:45Z

Tesseract 5 will be the next release. Tesseract Git master is the latest version targeting 5.0, 5.0.0-alpha is a pre-release.

mikkelee · 2019-07-26T07:48:42Z

Ah, thanks

sirfz · 2019-08-22T16:43:03Z

Solved in Tesseract v4.1
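For code that has to run against several Tesseract versions, the workaround could be applied only where it is needed. The sketch below assumes, based on the statements in this thread, that only the 4.0 series enforces the C locale; needs_c_locale is a hypothetical helper, and the version text is assumed to look like the string tesserocr.tesseract_version() reports (for example starting with "tesseract 4.0.0").

```python
import re


def needs_c_locale(version_text):
    """Guess from a version string whether the C locale workaround is needed."""
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        # Unknown format: be conservative and keep the workaround.
        return True
    major, minor = int(match.group(1)), int(match.group(2))
    # Assumption from this thread: only the 4.0 series asserts on the locale.
    return (major, minor) == (4, 0)


print(needs_c_locale("tesseract 4.0.0"))
print(needs_c_locale("tesseract 4.1.0"))
```

Note that with an affected version the check itself has to run after tesserocr was imported under the C locale, since the import is where the assertion is hit.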

nijel mentioned this issue Feb 26, 2019

teserract 4.0 compatibility WeblateOrg/weblate#2581

Closed

sirfz closed this as completed Aug 22, 2019

sirfz mentioned this issue Feb 8, 2020

can't import tesserocr #160

Closed

dynobo mentioned this issue May 11, 2021

Not working on Debian Buster dynobo/normcap#107

Closed
