!strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 209 #165

Closed
charhuang opened this issue Dec 2, 2018 · 20 comments

@charhuang

charhuang commented Dec 2, 2018

import tesserocr
from PIL import Image

image = Image.open('image.png')
print(tesserocr.image_to_text(image))

Mac 10.14.1
tesserocr 2.3.1
Python 3.6.7

@xiejun946

I am getting the same error as you.

@xiejun946

Try it like this; it works well for me:
import locale
locale.setlocale(locale.LC_ALL, 'C')
import tesserocr

@abhishek-jain-infrrd

This can be fixed either by running export LC_ALL=C in the terminal or by calling locale.setlocale(locale.LC_ALL, 'C') in the Python file that uses tesserocr. If you still get the error in the PyCharm IDE, set LC_ALL=C as an environment variable in the run configuration for that file.
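If you cannot edit the script or the shell profile, the environment-variable variant can also be applied per process. The following is a minimal sketch, not part of tesserocr: `run_with_c_locale` is a hypothetical helper name, and the `-c` snippet only stands in for your own OCR script.

```python
import os
import subprocess
import sys


def run_with_c_locale(args):
    """Run a command with LC_ALL=C set only for that child process."""
    # Copy the parent environment and override LC_ALL for the child only.
    env = dict(os.environ, LC_ALL="C")
    return subprocess.run(args, env=env, capture_output=True, text=True, check=True)


# Placeholder child: it adopts the locale from its environment and prints it.
# Replace the "-c" snippet with the path to your own script.
result = run_with_c_locale(
    [sys.executable, "-c", "import locale; print(locale.setlocale(locale.LC_ALL, ''))"]
)
print(result.stdout.strip())
```

The parent process keeps its own locale; only the child sees LC_ALL=C, which is the same effect as exporting the variable in the terminal that launches the script.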

@jameshe3

Try it like this; it works well for me:
import locale
locale.setlocale(locale.LC_ALL, 'C')
import tesserocr

I ran into the same issue. Your solution helped. Thanks!

@bertsky
Contributor

bertsky commented Jan 25, 2019

It's mentioned in the FAQ as well. And BTW this measure is only meant to be provisional, to be removed as soon as certain legacy parts of the code base (locale-dependent functions like sscanf) are gone.

You might have problems with this workaround, though: other parts of your code might depend on localization, e.g. on *.UTF-8. In that case, you can still try to set the locale to C only temporarily (just before importing from tesserocr), and maybe also set it back afterwards (but the latter risks giving sub-optimal results or even crashing during recognition).

@nijel
Contributor

nijel commented Feb 26, 2019

This is really ridiculous; it makes it nearly impossible to use the module in a program that uses locales. Does anybody have an idea what kind of breakage one can expect if the locale is changed after importing the module? Or is the recommended approach to set the C locale before every tesserocr call and restore the correct locale afterwards?

PS: I've at least tried to relax the limitation so that UTF-8 locales are supported as well in tesseract-ocr/tesseract#2272, because without that, Python 3 before 3.7 has problems handling Unicode filenames...

PS2: Frequently changing locales might not be a good idea either, quoting the Python locale docs:

The C standard defines the locale as a program-wide property that may be relatively expensive to change. On top of that, some implementations are broken in such a way that frequent locale changes may cause core dumps. This makes the locale somewhat painful to use correctly.

@bertsky
Contributor

bertsky commented Feb 26, 2019

Indeed. I'll have a chance to ask @stweil about it tomorrow. The utf-8 patch was a good idea, thanks!

What you dug up about locale switching in Python does make matters worse. So maybe this should really be done on the C level in tesserocr? Has anyone tried that?

Interestingly, gImageReader does temporarily reset the locale, but only for Tesseract initialisation, not during recognition. And both libopencv-contrib/modules/text and ffmpeg/libavfilter do nothing about it (but seem to be based on version 3 still).

@stweil
Contributor

stweil commented Feb 27, 2019

There are quite a number of functions in the standard C library which depend on the current locale: not only sscanf, but also sprintf and their relatives, and several character classification functions. See my comment tesseract-ocr/tesseract#1670 (comment).

This causes problems like tesseract-ocr/tesseract#1250.

With my German locale de_DE.UTF-8, for example, Tesseract would read wrong float and double values from existing configuration files because German writes such numbers like 3,14159 instead of 3.14159. A file written with that locale would be incompatible with most of the rest of the world.
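The decimal-separator difference can be observed from Python as well. This is only an illustrative sketch: `decimal_point_for` is a hypothetical helper, and whether `de_DE.UTF-8` is installed (and what it reports) depends on the system.

```python
import locale


def decimal_point_for(name):
    """Return the decimal separator used by locale `name`.

    Returns None if the system rejects the locale name.
    """
    # Query (do not change) the current numeric locale so it can be restored.
    saved = locale.setlocale(locale.LC_NUMERIC)
    try:
        locale.setlocale(locale.LC_NUMERIC, name)
        return locale.localeconv()["decimal_point"]
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)


print(decimal_point_for("C"))            # the C locale uses "."
print(decimal_point_for("de_DE.UTF-8"))  # typically "," where that locale is installed
```

C functions such as sscanf use the same separator when parsing, which is why a value written as 3.14159 can be misread under a locale whose separator is a comma.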

tesseract-ocr/tesseract@db9c7e0 shows how that can be fixed. Similar fixes are still needed for the rest of the Tesseract code. Then the hard requirement of a fixed locale setting and the assertions can be removed.

@nijel
Contributor

nijel commented Feb 27, 2019

I understand the problems (I've fixed similar issues in many programs). I just find it strange that import tesserocr can lead to terminating Python. That's not really what I would expect from any library.

Also it would be great to know where all the C locale is needed to make things work properly. As @bertsky pointed out there are already programs switching locales for tesseract initialization, will that work properly or should it be there for the recognition as well (as outlined in tesseract-ocr/tesseract#1670 (comment))? Having this documented would make it easier to implement this properly.

@stweil
Contributor

stweil commented Feb 27, 2019

Switching locale just to avoid the assertion and switching it back after Tesseract initialization will leave you with the risk of problems caused by a "wrong" locale. Depending on your own locale and on the code parts which you use it can work. If your locale does not use the comma as a decimal point and your primary characters are based on Latin, chances are high that it will work. With comma or with an Asian locale, chances are much lower. So to summarize, a temporary locale switch is not a general solution usable for everybody.

@stweil
Contributor

stweil commented Feb 27, 2019

Also it would be great to know where all the C locale is needed to make things work properly.

Just search the code for the problematic function names. It is some work to decide when those code locations are used (text recognition or training, old recognizer or LSTM, ...). Therefore I think it is easier to replace all those code parts by code which does not depend on locale settings.
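Such a search could be sketched roughly as follows. The function list is an illustrative, incomplete assumption based on the names mentioned in this thread (sscanf, sprintf, their relatives, and character classification functions); `find_locale_sensitive_calls` is a hypothetical helper, not a Tesseract tool.

```python
import re

# Illustrative and incomplete list of locale-dependent C library functions.
LOCALE_SENSITIVE = (
    "sscanf", "fscanf", "sprintf", "snprintf", "fprintf",
    "atof", "strtod", "isalpha", "isdigit", "toupper", "tolower",
)
_PATTERN = re.compile(r"\b(" + "|".join(LOCALE_SENSITIVE) + r")\s*\(")


def find_locale_sensitive_calls(source):
    """Return (line_number, function_name) for each suspicious call in `source`."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


sample = 'float x;\nsscanf(buf, "%f", &x);\nint n = strlen(buf);\n'
print(find_locale_sensitive_calls(sample))  # [(2, 'sscanf')]
```

A plain text match like this also hits comments and string literals, so the results still need the manual review described above.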

@nijel
Contributor

nijel commented Feb 27, 2019

With the locale switching I'm trying to figure out a solution that would work now. It's certainly not nice, but I really don't see another way to use tesserocr.

The major problem is that the upcoming Debian stable will come with tesseract 4.0.0, so apparently my code will have to deal with that version for years, no matter what you fix in upcoming tesseract releases. Not sure if something can be done in the tesserocr module to mitigate this, or at least to allow some runtime detection that avoids terminating Python...
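One possible form of such runtime detection on the application side is sketched below. It assumes the assertion compares the overall locale string to "C", as the message in the issue title suggests; `c_locale_active` and `load_tesserocr` are hypothetical helpers, not part of tesserocr.

```python
import locale


def c_locale_active():
    """Return True if the process-wide locale string is exactly "C"."""
    # Calling setlocale without a second argument only queries the setting.
    return locale.setlocale(locale.LC_ALL) == "C"


def load_tesserocr():
    """Import tesserocr only when the locale would not trip the assertion."""
    if not c_locale_active():
        # A Python exception can be caught; a failed C assertion cannot.
        raise RuntimeError(
            "tesserocr with Tesseract 4.0.0 needs the C locale; "
            "current locale is %r" % locale.setlocale(locale.LC_ALL)
        )
    import tesserocr
    return tesserocr
```

This does not remove the limitation; it only turns the hard abort into an error the calling program can handle.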

@mikkelee

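I'm currently doing this in Python 3.7:

import locale
import logging
from contextlib import contextmanager

log = logging.getLogger(__name__)

@contextmanager
def c_locale():
    # Remember the current locale; fall back to a default if it cannot be parsed.
    try:
        currlocale = locale.getlocale()
    except ValueError:
        currlocale = ('en_US', 'UTF-8')
    log.debug(f'Switching to C from {currlocale}')
    locale.setlocale(locale.LC_ALL, "C")
    try:
        yield
    finally:
        # Restore the previous locale even if the OCR call raises.
        log.debug(f'Switching to {currlocale} from C')
        locale.setlocale(locale.LC_ALL, currlocale)

def get_hocr(image):
    with c_locale():
        # Import inside the C locale so the assertion is not triggered.
        from tesserocr import PyTessBaseAPI
        with PyTessBaseAPI() as tesseract:
            tesseract.SetImage(image)
            hocr = tesseract.GetHOCRText(0)
    return hocr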

It appears to work, as all access to Tesseract is done inside a C-locale.

@nijel
Contributor

nijel commented Feb 27, 2019

A similar workaround has been posted in tesseract-ocr/tesseract#1670 (comment). It's pretty straightforward to implement; however, looking at the Python docs makes me a bit worried whether it is a good thing to do (see #165 (comment)). Anyway, I've implemented this in WeblateOrg/weblate@6724204 as I don't see a better way to address this for now.

@agnelvishal

After typing export LC_ALL=C in the terminal, run the Python code in the same terminal. Running the Python code in a different terminal window won't work, because the variable is only set for the shell session where it was exported.

@stweil
Contributor

stweil commented Jul 26, 2019

Tesseract 4.1 and Tesseract 5 no longer require such locale hacks.
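Code that has to support several Tesseract versions could apply the workaround only where it is needed. The sketch below makes two assumptions that should be checked locally: that the assertion is present only in the 4.0.x series, and that the version text (for example the string returned by tesserocr.tesseract_version()) begins with something like "tesseract 4.0.0". `has_locale_assertion` is a hypothetical helper.

```python
import re


def has_locale_assertion(version_text):
    """Guess from a version string whether the C-locale assertion applies.

    Assumption: only the Tesseract 4.0.x series enforces the C locale.
    """
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no version number found in %r" % version_text)
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) == (4, 0)


print(has_locale_assertion("tesseract 4.0.0"))        # True
print(has_locale_assertion("tesseract 4.1.0"))        # False
print(has_locale_assertion("tesseract 5.0.0-alpha"))  # False
```

Note that with an affected version, importing tesserocr to read the version already requires the C locale, so the version string would have to be obtained under the C locale or from another source such as package metadata.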

@mikkelee

I can't find anything about Tesseract 5?

@stweil
Contributor

stweil commented Jul 26, 2019

Tesseract 5 will be the next release. Tesseract Git master is the latest version targeting 5.0, 5.0.0-alpha is a pre-release.

@mikkelee

Ah, thanks

@sirfz
Owner

sirfz commented Aug 22, 2019

Solved in Tesseract v4.1
