tesseract failed loading non-english language.traineddata #1250
Comments
If I load 'eng.traineddata', it works fine; loading a self-trained data file also works fine.
Seems like an issue related to locale settings. |
@amitdo Hi, I tried that, but it does not seem to work. Any other ideas?
No. Try the forum. |
Hi all,
I want Tesseract to write temporary image files to the local disk so that I can see each step's result. I compiled Tesseract with "--enable-debug", but after recognizing the image I cannot find the temporary image files. Has anyone met a similar problem? Thanks.
@amitdo I finally found the reason: it relates to the "locale". Here is the explanation.
@amitdo thanks a lot for your reading. |
@stweil, your thoughts on the suggested change? |
@githubgs, which locale did you use when it failed? |
@amitdo Sorry for the late reply. It seems that I saw the locale value only in an English environment.
Hey all, if you're just coming across this issue, I solved it by setting the locale in Python thus at the top of my script:
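The snippet posted with this comment is not preserved here. A minimal sketch of such a workaround, assuming the fix is simply to force the "C" locale for all categories before any Tesseract binding is initialized (the choice of `LC_ALL` is an assumption based on the later comments in this thread), might look like this:

```python
import locale

# Assumption: forcing the "C" locale for every category is enough.
# This must run before the Tesseract binding is imported or initialized.
locale.setlocale(locale.LC_ALL, "C")

# Calling setlocale without a second argument only queries the current setting.
current = locale.setlocale(locale.LC_ALL)
print("Active locale:", current)
```

Note that this changes the locale for the whole process, so any locale-dependent formatting elsewhere in the script is affected as well.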
I confirm that I also ran into this problem with the R bindings. All is fine for most languages, however asian languages like A workaround is to set This is |
I have the same issue in the following environment with IDEA: macOS 10.13.4. My workaround is adding an environment variable LC_CTYPE=C in IDEA; it works.
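When the program is launched from a script rather than from an IDE run configuration, the same environment-variable workaround can be sketched as follows. The helper name `env_with_c_ctype` is hypothetical, and dropping `LC_ALL` is an assumption on my part: it is removed only because a set `LC_ALL` takes precedence over `LC_CTYPE`.

```python
import os
import subprocess


def env_with_c_ctype(base=None):
    """Return a copy of the environment with LC_CTYPE forced to "C".

    Hypothetical helper for illustration; the parent environment is not modified.
    """
    env = dict(os.environ if base is None else base)
    env["LC_CTYPE"] = "C"
    # LC_ALL, when set, overrides LC_CTYPE, so remove it from the copy.
    env.pop("LC_ALL", None)
    return env


def run_with_c_ctype(cmd):
    """Run a command (a list of arguments) with LC_CTYPE=C in its environment."""
    return subprocess.run(cmd, env=env_with_c_ctype(), check=True)
```

A call such as `run_with_c_ctype(["python", "my_ocr_script.py"])` (the script name is a placeholder) would start the child process with the adjusted environment.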
Pull request #1649 makes Tesseract initialization fail if the locale settings are wrong. Users who get that failure must set the "C" locale in their code. |
Technically this issue was closed by enforcing the "C" locale in pull request #1649, but that causes problems requiring ugly workarounds in projects which use the Tesseract API from Python, Java or other languages which typically don't set the "C" locale. Therefore I suggest keeping it open.
@stweil: What about my suggestion to implement jeroen's code in TessBaseAPI::Init?
Make sure the environment variable |
@stweil: Is the assert still needed for a non-"C" LC_ALL?
It turns out that we're having this problem as well, on macOS with 3.05.01. I'm considering a patch to the Another alternative, which might make cleaner code, would be to use Thoughts welcome, and of course I'll contribute back patches. |
See #1670 |
The problem in |
@datalogics-kam, which locale settings failed in your test? I'd like to reproduce your problem to see whether it is fixed by new code. |
Function 2019-05-12: This was now done in pull request #2430. While implementing this, an unrelated bug was found and fixed, too. |
@stweil I've been able to reproduce it with Since we're still on 3.05.01 here, and I see that version 4 asserts that the locale must be "C", I'm going to put a fix in our code that uses the C locale when calling Tesseract, and restores the locale after. Since Thanks for the insights!
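The "switch to the C locale, call Tesseract, then restore" approach described above can be sketched in Python as a context manager. This is an illustration only, not the commenter's actual fix; the name `c_locale` is hypothetical, and because the locale is process-wide state, the sketch assumes no other thread depends on the locale while the block runs.

```python
import locale
from contextlib import contextmanager


@contextmanager
def c_locale():
    """Temporarily switch the process to the "C" locale (hypothetical helper)."""
    # Query the current setting so it can be restored afterwards.
    saved = locale.setlocale(locale.LC_ALL)
    locale.setlocale(locale.LC_ALL, "C")
    try:
        yield
    finally:
        # Restore the previous locale even if the wrapped call raised.
        locale.setlocale(locale.LC_ALL, saved)
```

The Tesseract call would then go inside a `with c_locale():` block, so the rest of the application keeps its original locale.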
How to test whether Tesseract code works with your locale: The following patch disables the assertions which check for the right locale and enables the current locale for all Tesseract code:
With this patch, not only |
Failing test on macOS with
2019-05-16: Fixed in pull request #2437 |
Failing test on macOS with
2019-05-18: Fixed in commit 36ed6da. |
@githubgs, this issue should be fixed now in branch 4.1 and in Git master. Can we close it? |
The function did not correctly read Chinese unichars into the local Class variable if the locale was set to de_DE.UTF-8 (or other incompatible locales). That resulted in a wrong ClassId which was used to write into the Cutoffs array without checking for valid bounds.

On macOS the result was a runtime error in baseapi_test (see GitHub issue tesseract-ocr#1250):

    [ RUN ] TesseractTest.InitConfigOnlyTest
    baseapi_test(21845,0x1134c45c0) malloc: *** error for object 0x927f96c28005e0: pointer being freed was not allocated
    baseapi_test(21845,0x1134c45c0) malloc: *** set a breakpoint in malloc_error_break to debug

Replacing sscanf by std::istringstream fixes that. Add also an assertion to catch future out-of-bounds writes.

Signed-off-by: Stefan Weil <[email protected]>
Environment
Current Behavior:
My situation is a little complicated.
I built Tesseract into a library for other applications to call. When "api->init" loads chi_sim, it fails, but ONLY in the IDE (PyCharm) environment. After debugging, I traced the failure to the function "load_via_fgets" in the file "tesseract/ccutil/unicharset.cpp": from row 825 on, sscanf returns 1/1/1/1/1/1/1 rather than 17/16/10/8/4/3/2, so it returns 'false' to the function "bool UNICHARSET::load_from_file(tesseract::TFile *file, bool skip_fragments)" at row 781.
ATTENTION: this does not happen on the terminal command line, only in the IDE. I also found the same problem reported for tess4j, link:.
Looking forward to hearing from you, thanks so much.
Expected Behavior:
Suggested Fix: