!strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 209 #165
Comments
I get the same error as you. |
Try it like this; it works well for me. |
This can be fixed by either importing |
I ran into the same issue. Your solution helped. Thanks! |
It's mentioned in the FAQ as well. And BTW this measure is only meant to be provisional, to be removed as soon as certain legacy parts of the code base (locale-dependent functions like You might have problems with this workaround though: other parts of your code might depend on localization, e.g. on |
This is really ridiculous; it makes it nearly impossible to use the module in a program that uses locales. Does anybody have an idea what kind of breakage one can expect if the locale is changed after importing the module? Or is the recommended approach to set the C locale before every tesserocr call and restore the correct locale afterwards? PS: I've at least tried to relax the limitation to support a UTF-8 locale as well in tesseract-ocr/tesseract#2272, as without that Python 3 before 3.7 has problems handling Unicode filenames... PS2: Frequently changing locales might not be a good idea either, quoting the Python locale docs:
|
Indeed. I'll have a chance to ask @stweil about it tomorrow. The utf-8 patch was a good idea, thanks! What you dug up about locale switching in Python does make matters worse. So maybe this should really be done on the C level in tesserocr? Has anyone tried that? Interestingly, gImageReader does temporarily reset the locale, but only for Tesseract initialisation, not during recognition. And both libopencv-contrib/modules/text and ffmpeg/libavfilter do nothing about it (but seem to be based on version 3 still). |
There is a whole number of functions in the standard C library which depend on the current locale, not only This causes problems like tesseract-ocr/tesseract#1250. With my German locale de_DE.UTF-8, for example, Tesseract would read wrong float and double values from existing configuration files because German writes such numbers like tesseract-ocr/tesseract@db9c7e0 shows how that can be fixed. Similar fixes are still needed for the rest of the Tesseract code. Then the hard requirement of a fixed locale setting and the assertions can be removed. |
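To make the decimal-separator point concrete, here is a small Python sketch (not taken from Tesseract itself) that asks the C library which decimal point a given locale uses. The helper name `decimal_point_for` is made up for illustration, and the German locale is only queried if it happens to be installed on the system.

```python
import locale


def decimal_point_for(name: str) -> str:
    """Return the decimal separator of locale `name` (hypothetical helper).

    Only LC_NUMERIC is switched, and the previous setting is restored
    before returning, even if the requested locale is unavailable.
    """
    saved = locale.setlocale(locale.LC_NUMERIC)  # query only, no change
    try:
        locale.setlocale(locale.LC_NUMERIC, name)
        return locale.localeconv()["decimal_point"]
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)


print("C:", decimal_point_for("C"))
try:
    print("de_DE.UTF-8:", decimal_point_for("de_DE.UTF-8"))
except locale.Error:
    # Assumption: the locale may simply not be generated on this machine.
    print("de_DE.UTF-8 is not installed on this system")
```

Locale-dependent C functions that parse or print floating-point numbers follow this separator, which is why a configuration file written with `.` can be misread under a locale that expects something else.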
I understand the problems (I've fixed similar issues in many programs). I just find it strange that Also it would be great to know where exactly the C locale is needed to make things work properly. As @bertsky pointed out, there are already programs switching locales for Tesseract initialization; will that work properly, or should the switch be in place for recognition as well (as outlined in tesseract-ocr/tesseract#1670 (comment))? Having this documented would make it easier to implement this properly. |
Switching locale just to avoid the assertion and switching it back after Tesseract initialization will leave you with the risk of problems caused by a "wrong" locale. Depending on your own locale and on the code parts which you use it can work. If your locale does not use the comma as a decimal point and your primary characters are based on Latin, chances are high that it will work. With comma or with an Asian locale, chances are much lower. So to summarize, a temporary locale switch is not a general solution usable for everybody. |
Just search the code for the problematic function names. It is some work to decide when those code locations are used (text recognition or training, old recognizer or LSTM, ...). Therefore I think it is easier to replace all those code parts by code which does not depend on locale settings. |
With the locale switching I'm trying to figure out a solution that would work now. It's certainly not nice, but I really don't see another way to use tesserocr. The major problem is that the upcoming Debian stable will ship Tesseract 4.0.0, so apparently my code will have to deal with that version for years, no matter what you fix in upcoming Tesseract releases. I'm not sure whether something can be done in the tesserocr module to mitigate this, or at least to make some runtime detection possible that avoids terminating Python... |
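As for runtime detection on the Python side, one possible pre-check is sketched below. It only relies on the condition visible in the assertion message, `!strcmp(locale, "C")`; which locale categories Tesseract 4.0.0 actually inspects is an assumption here, so the sketch tests the strictest case (the whole `LC_ALL` setting reports `"C"`). The function name `locale_is_c` is hypothetical.

```python
import locale


def locale_is_c() -> bool:
    """Return True if the process-wide locale is currently plain "C".

    setlocale() without a second argument only queries the setting.
    If LC_ALL reports "C", every individual category is "C" as well.
    Assumption: this matches what the failing assertion compares.
    """
    return locale.setlocale(locale.LC_ALL) == "C"


if not locale_is_c():
    # The caller can raise a normal Python exception or switch locale
    # here instead of letting the C++ assertion abort the interpreter.
    print("Locale is not C:", locale.setlocale(locale.LC_ALL))
```

A caller could run this check before importing or calling tesserocr and fail with an ordinary exception. This does not remove the limitation; it only turns a hard abort into something a program can handle.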
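I'm currently doing this in Python 3.7:

import locale
import logging
from contextlib import contextmanager

log = logging.getLogger(__name__)

@contextmanager
def c_locale():
    """Temporarily switch to the C locale, then restore the previous one."""
    try:
        currlocale = locale.getlocale()
    except ValueError:
        currlocale = ('en_US', 'UTF-8')
    log.debug(f'Switching to C from {currlocale}')
    locale.setlocale(locale.LC_ALL, "C")
    try:
        yield
    finally:
        # Restore the previous locale even if the Tesseract call raises.
        log.debug(f'Switching to {currlocale} from C')
        locale.setlocale(locale.LC_ALL, currlocale)

def get_hocr(image):
    with c_locale():
        # Import inside the C locale so every Tesseract access happens there.
        from tesserocr import PyTessBaseAPI
        with PyTessBaseAPI() as tesseract:
            tesseract.SetImage(image)
            hocr = tesseract.GetHOCRText(0)
    return hocr

It appears to work, as all access to Tesseract is done inside a C locale. |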
A similar workaround has been posted in tesseract-ocr/tesseract#1670 (comment). It's pretty straightforward to implement; however, looking at the Python docs makes me a bit worried whether that is a good thing to do (see #165 (comment)). Anyway, I've implemented this in WeblateOrg/weblate@6724204, as I don't see a better way to address this for now. |
After typing |
Tesseract 4.1 and Tesseract 5 no longer require such locale hacks. |
I can't find anything about Tesseract 5? |
Tesseract 5 will be the next release. Tesseract Git master is the latest version targeting 5.0, 5.0.0-alpha is a pre-release. |
Ah, thanks |
Solved in Tesseract v4.1 |
import tesserocr
from PIL import Image
image = Image.open('image.png')
print(tesserocr.image_to_text(image))
Mac 10.14.1
tesserocr 2.3.1
Python 3.6.7