tesseract failed loading non-english language.traineddata #1250

Closed
gbolin opened this issue Dec 28, 2017 · 28 comments

@gbolin

gbolin commented Dec 28, 2017

Environment

  • Tesseract Version: <tesseract 4.00.00dev-690-g1b0379c2>
  • Platform: <mac OS 64bit, High Sierra, 10.13.1>

Current Behavior:

My situation is a little complicated.
I built Tesseract into a library for another application to call. When "api->Init" loads chi_sim it fails, but ONLY inside the IDE (PyCharm). After debugging, I traced the failure to the function "load_via_fgets" in the file "tesseract/ccutil/unicharset.cpp": from row 825 on, the sscanf calls return 1/1/1/1/1/1/1 rather than 17/16/10/8/4/3/2, so the function returns 'false' to "bool UNICHARSET::load_from_file(tesseract::TFile *file, bool skip_fragments)" at row 781.
ATTENTION: this does not happen on the terminal command line, only in the IDE. I also found the same problem reported for tess4j, link:.
Looking forward to hearing from you, thanks so much.

Expected Behavior:

Suggested Fix:

@gbolin
Author

gbolin commented Dec 28, 2017

If I load 'eng.traineddata' it works fine, and loading a self-trained data file also works fine.

@amitdo
Collaborator

amitdo commented Dec 28, 2017

Seems like an issue related to locale settings.
https://www.google.co.il/search?q=osx+%22pycharm%22+%22locale%22

@gbolin
Author

gbolin commented Dec 28, 2017

@amitdo Hi, I tried that, but it does not seem to work. Any other ideas?

@amitdo
Collaborator

amitdo commented Dec 28, 2017

any other ideas?

No. Try the forum.

@ITCoolie

ITCoolie commented Dec 29, 2017 via email

@gbolin gbolin changed the title tesseract failed loading language tesseract failed loading non-english language.traineddata Dec 29, 2017
@gbolin
Author

gbolin commented Dec 29, 2017

@amitdo I finally found the cause: it is related to the locale. Here is the explanation.
Running "combine_tessdata -U chi_sim.traineddata ./chi_sim." generates a file named "chi_sim.unicharset". (This file is the key to why non-English traineddata files sometimes cannot be loaded.) The function "bool UNICHARSET::load_via_fgets" in "tesseract/ccutil/unicharset.cpp:789" reads that unicharset file row by row. The problem shows up when it arrives here:
(v = sscanf(buffer, "%s %x %d,%d,%d,%d,%g,%g,%g,%g,%g,%g %63s %d %d %d %63s", unichar, &properties, &min_bottom, &max_bottom, &min_top, &max_top, &width, &width_sd, &bearing, &bearing_sd, &advance, &advance_sd, script, &other_case, &direction, &mirror, normed)) != 17
Let's say the buffer is "格 1 63,69,255,255,192,220,0,9,205,233 Han 7 0 7 格 # 格 [683c ]x".
sscanf calls isspace. The UTF-8 encoding of the character "格" is 0xE6 0xA0 0xBC,
and the byte 0xA0 was classified as whitespace, so the %s conversion is cut off in the middle of the character. That is the key reason!
Tesseract calls std::locale to pick up the default locale setting, and in unicharset.cpp that setting is exactly what makes sscanf fail.
This affects not only Chinese but other languages too: under a UTF-8-based locale, a character whose encoding contains certain byte values such as 0xA0 or 0x85 fails to load, especially on a non-English operating system.
How to solve it:
1: Change the system language to English. Maybe not a good idea, but it works for me.
2: Change the unicharset.cpp source code. I tried this on my own macOS machine, like this:

..... from row 823
char normed[64];
int v = -1;
// ---- added code: force the "C" locale before the line is parsed ----
std::locale lc("C");
std::locale::global(lc);
// ---- end of added code ----
if (fgets_cb->Run(buffer, sizeof (buffer)) == NULL ||
..... continue

@amitdo Thanks a lot for reading.
Regards, GS.
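
The byte-level explanation above can be checked with a small C++ sketch. This is not Tesseract code: `utf8_bytes`, `is_space_in_c_locale`, `count_c_locale_spaces` and `demo_ge_bytes` are hypothetical names introduced for illustration. It only shows what the classic "C" locale does with the three bytes of "格"; whether 0xA0 counts as whitespace under another locale depends on the platform's C library, so nothing is claimed about that here.

```cpp
#include <cassert>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: the raw bytes of a UTF-8 encoded string.
std::vector<unsigned char> utf8_bytes(const std::string &s) {
  return std::vector<unsigned char>(s.begin(), s.end());
}

// True if the classic "C" locale classifies the byte as whitespace.
bool is_space_in_c_locale(unsigned char byte) {
  return std::isspace(static_cast<char>(byte), std::locale::classic());
}

// Count the bytes of `s` that the "C" locale would treat as a separator.
int count_c_locale_spaces(const std::string &s) {
  int n = 0;
  for (unsigned char b : utf8_bytes(s)) {
    if (is_space_in_c_locale(b)) ++n;
  }
  return n;
}

// Usage sketch: U+683C is encoded as E6 A0 BC, and none of these bytes
// is whitespace in the "C" locale, so the character stays in one token.
void demo_ge_bytes() {
  const std::string ge = "\xE6\xA0\xBC";
  assert(utf8_bytes(ge).size() == 3);
  assert(count_c_locale_spaces(ge) == 0);
}
```

In the "C" locale only the ASCII whitespace characters separate fields, which is consistent with the workarounds in this thread that switch to "C" before loading the traineddata.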

@amitdo
Collaborator

amitdo commented Dec 29, 2017

@stweil, your thoughts on the suggested change?

@amitdo
Collaborator

amitdo commented Dec 29, 2017

@ITCoolie,
The right place to ask general questions is the forum.

@stweil
Contributor

stweil commented Dec 29, 2017

@githubgs, which locale did you use when it failed?

@gbolin
Author

gbolin commented Dec 30, 2017

@amitdo Sorry for the late reply.
The two screenshots below show the output of the 'env' command in the terminal.
For the first one the OS language was set to Chinese, for the second one to English.
[Screenshot: 'env' output with the OS language set to Chinese]
[Screenshot: 'env' output with the OS language set to English]

It seems that the locale value only shows up in the English environment.
Hope this is helpful to you.

@wxs

wxs commented Mar 29, 2018

Hey all, if you're just coming across this issue, I solved it by setting the locale in Python thus at the top of my script:

import locale
locale.setlocale(locale.LC_ALL, "C")

@jeroen
Contributor

jeroen commented Apr 22, 2018

I can confirm that I also ran into this problem with the R bindings. Most languages work fine, but Asian languages such as jpn and kor would not load with en_US.UTF-8.

A workaround is to set Sys.setlocale('LC_CTYPE', 'C') and then it works. However it is unclear to me if I can set it back to en_US.UTF-8 afterwards.

This is 3.05.01 by the way.

@delonzhou

I have the same issue in IDEA with the following environment:

macOS 10.13.4
tesseract 4.00.00alpha
leptonica-1.76.0
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE

My workaround is to add the environment variable LC_CTYPE=C in IDEA; with that it works.

@stweil
Contributor

stweil commented Jun 8, 2018

Pull request #1649 makes Tesseract initialization fail if the locale settings are wrong. Users who get that failure must set the "C" locale in their code.
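
For callers who want to check this up front, here is a minimal sketch modelled on the locale checks visible in the patch quoted later in this thread. It is not the code from pull request #1649. The helper names and the choice of LC_CTYPE as the third category are assumptions made for illustration.

```cpp
#include <cassert>
#include <clocale>
#include <cstring>

// True if the given locale category is currently "C" or "C.UTF-8".
bool category_is_c(int category) {
  const char *name = std::setlocale(category, nullptr);  // query only
  return name != nullptr &&
         (std::strcmp(name, "C") == 0 || std::strcmp(name, "C.UTF-8") == 0);
}

// Assumption: LC_ALL, LC_CTYPE and LC_NUMERIC are the categories to check.
bool locale_is_c() {
  return category_is_c(LC_ALL) && category_is_c(LC_CTYPE) &&
         category_is_c(LC_NUMERIC);
}

// Hypothetical helper: switch the whole process to "C" if it is not already.
bool ensure_c_locale() {
  if (locale_is_c()) return true;
  return std::setlocale(LC_ALL, "C") != nullptr && locale_is_c();
}

// Usage sketch: call this before initializing the Tesseract API.
void demo_ensure_c_locale() {
  bool ok = ensure_c_locale();
  assert(ok);
}
```

Note that std::setlocale changes the locale for the whole process, so a host application that relies on its own locale may prefer to restore the previous value afterwards.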

@stweil
Contributor

stweil commented Oct 1, 2018

Technically this issue was closed by enforcing the "C" locale in pull request #1649, but that causes problems requiring ugly workarounds in projects which use the Tesseract API from Python, Java or other languages which typically don't set the "C" locale.

Therefore I suggest keeping it open.

@zdenop
Contributor

zdenop commented Oct 2, 2018

@stweil : what about my suggestion to implement jeroen's code in TessBaseAPI::Init?

@iseegr8tfuldeadppl

Make sure the environment variable TESSDATA_PREFIX is set to your tessdata directory!
(for ex. C:\msys64\mingw32\share\tessdata).

@zdenop
Contributor

zdenop commented May 10, 2019

@stweil : Is assert still needed for non "C" LC_ALL?

@datalogics-kam

It turns out that we're having this problem as well, on macOS with 3.05.01. I'm considering a patch to the load_via_fgets code to use sscanf_l where available, which allows passing in a locale for the call rather than modifying the global locale.

Another alternative, which might make cleaner code, would be to use uselocale if it's available. That sets the locale for only the current thread, and then it can be set back to the previous locale at function exit. I might try this one first.

Thoughts welcome, and of course I'll contribute back patches.
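
A minimal sketch of the uselocale idea, assuming a POSIX.1-2008 platform (it does not build on Windows). `ScopedCLocale`, `parse_double_c` and `demo_scoped_c_locale` are hypothetical names, not part of Tesseract or of any patch in this thread.

```cpp
#include <cassert>
#include <cstdio>
#include <locale.h>
#ifdef __APPLE__
#include <xlocale.h>  // newlocale/uselocale are declared here on macOS
#endif

// Hypothetical RAII guard: switch only the calling thread to the "C" locale
// and restore the previous per-thread locale when the scope ends.
class ScopedCLocale {
 public:
  ScopedCLocale()
      : c_locale_(newlocale(LC_ALL_MASK, "C", static_cast<locale_t>(0))),
        previous_(c_locale_ ? uselocale(c_locale_) : static_cast<locale_t>(0)) {}
  ~ScopedCLocale() {
    if (c_locale_) {
      uselocale(previous_);   // restore first, then free the locale object
      freelocale(c_locale_);
    }
  }
  ScopedCLocale(const ScopedCLocale &) = delete;
  ScopedCLocale &operator=(const ScopedCLocale &) = delete;
  bool active() const { return c_locale_ != static_cast<locale_t>(0); }

 private:
  locale_t c_locale_;  // declared first so it is initialized first
  locale_t previous_;
};

// Parse a number with '.' as the decimal point, whatever locale the rest of
// the process uses.
bool parse_double_c(const char *text, double *value) {
  ScopedCLocale guard;
  return guard.active() && std::sscanf(text, "%lf", value) == 1;
}

// Usage sketch with one of the values from a normproto file.
void demo_scoped_c_locale() {
  double v = 0.0;
  bool ok = parse_double_c("0.232621", &v);
  assert(ok && v > 0.2326 && v < 0.2327);
}
```

Because uselocale affects only the calling thread, other threads of the host application keep their locale while the guard is alive.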

@amitdo
Collaborator

amitdo commented May 12, 2019

See #1670

@stweil
Contributor

stweil commented May 12, 2019

Is assert still needed for non "C" LC_ALL?

The problem in load_via_fgets which was mentioned by @datalogics-kam still exists: sscanf has to be replaced by C++ stringstream like in the other places.
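
A rough sketch of what such a replacement could look like. This is not the actual Tesseract change: `UnicharEntry`, `parse_entry` and `demo_parse_entry` are hypothetical names, and only the first few fields of a unicharset line are parsed.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical, simplified record: the leading fields of a unicharset line.
struct UnicharEntry {
  std::string unichar;
  unsigned int properties = 0;
  int min_bottom = 0, max_bottom = 0, min_top = 0, max_top = 0;
};

// Parse "<unichar> <hex properties> <a>,<b>,<c>,<d>..." independently of the
// process locale. Returns false if a field is missing or malformed.
bool parse_entry(const std::string &line, UnicharEntry *out) {
  std::istringstream stream(line);
  // With the classic locale only ASCII whitespace separates tokens, so a
  // 0xA0 byte inside a UTF-8 character cannot split the unichar.
  stream.imbue(std::locale::classic());
  char c1 = 0, c2 = 0, c3 = 0;
  stream >> out->unichar >> std::hex >> out->properties >> std::dec;
  stream >> out->min_bottom >> c1 >> out->max_bottom >> c2
         >> out->min_top >> c3 >> out->max_top;
  return !stream.fail() && c1 == ',' && c2 == ',' && c3 == ',';
}

// Usage sketch with the start of the sample line quoted earlier in the thread.
void demo_parse_entry() {
  UnicharEntry e;
  bool ok = parse_entry("\xE6\xA0\xBC 1 63,69,255,255,192,220", &e);
  assert(ok && e.unichar == "\xE6\xA0\xBC" && e.max_top == 255);
}
```

The stream carries its own locale, so the global locale of the host application does not have to be touched.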

@stweil
Contributor

stweil commented May 12, 2019

@datalogics-kam, which locale settings failed in your test? I'd like to reproduce your problem to see whether it is fixed by new code.

@stweil
Contributor

stweil commented May 12, 2019

Function ReadParamDesc also still needs a replacement for sscanf.

2019-05-12: This has now been done in pull request #2430. While implementing this, an unrelated bug was found and fixed, too.

@datalogics-kam

@stweil I've been able to reproduce it with LC_ALL, LC_CTYPE, and LC_NUMERIC set to "en_US.UTF-8". That's what the JVM was setting. In the unit tests for our OCR wrapper, I added a fixture to recreate that locale setting.

Since we're still on 3.05.01 here, and I see that version 4 asserts that the locale must be "C", I'm going to put a fix in our code that uses the C locale when calling Tesseract and restores the previous locale afterwards.

Since setlocale is global, if uselocale is available, our code will use that as it is thread-specific. I also found this code to set the locale on a thread basis on Windows: https://stackoverflow.com/a/17173977

Thanks for the insights!
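
A minimal sketch of the save-and-restore approach with plain setlocale, for platforms where uselocale is not available. `CLocaleScope`, `current_ctype` and `demo_c_locale_scope` are hypothetical names; the actual wrapper code mentioned above is not shown in this thread.

```cpp
#include <cassert>
#include <clocale>
#include <string>

// Hypothetical RAII guard: set one locale category to "C" for the current
// scope and restore the saved value afterwards. std::setlocale is
// process-wide, so this is not safe while other threads depend on the locale.
class CLocaleScope {
 public:
  explicit CLocaleScope(int category) : category_(category) {
    // Copy the name: the pointer returned by setlocale may be overwritten
    // by the next setlocale call.
    const char *current = std::setlocale(category_, nullptr);
    if (current != nullptr) saved_ = current;
    std::setlocale(category_, "C");
  }
  ~CLocaleScope() {
    if (!saved_.empty()) std::setlocale(category_, saved_.c_str());
  }
  CLocaleScope(const CLocaleScope &) = delete;
  CLocaleScope &operator=(const CLocaleScope &) = delete;

 private:
  int category_;
  std::string saved_;
};

// Name of the current LC_CTYPE locale, or "" if it cannot be queried.
std::string current_ctype() {
  const char *name = std::setlocale(LC_CTYPE, nullptr);
  return name ? name : "";
}

// Usage sketch: the call into Tesseract would go inside the inner scope.
void demo_c_locale_scope() {
  const std::string before = current_ctype();
  {
    CLocaleScope scope(LC_CTYPE);
    assert(current_ctype() == "C");
  }
  assert(current_ctype() == before);
}
```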

@stweil
Contributor

stweil commented May 15, 2019

How to test whether Tesseract code works with your locale:

The following patch disables the assertions which check for the right locale and enables the current locale for all Tesseract code:

diff --git a/src/api/baseapi.cpp b/src/api/baseapi.cpp
index 61b38f8e..72e892b8 100644
--- a/src/api/baseapi.cpp
+++ b/src/api/baseapi.cpp
@@ -209,6 +209,9 @@ TessBaseAPI::TessBaseAPI()
       rect_height_(0),
       image_width_(0),
       image_height_(0) {
+#if 1
+  setlocale(LC_ALL, "");
+#else
   const char *locale;
   locale = std::setlocale(LC_ALL, nullptr);
   ASSERT_HOST(!strcmp(locale, "C") || !strcmp(locale, "C.UTF-8"));
@@ -216,6 +219,7 @@ TessBaseAPI::TessBaseAPI()
   ASSERT_HOST(!strcmp(locale, "C") || !strcmp(locale, "C.UTF-8"));
   locale = std::setlocale(LC_NUMERIC, nullptr);
   ASSERT_HOST(!strcmp(locale, "C") || !strcmp(locale, "C.UTF-8"));
+#endif
 }
 
 TessBaseAPI::~TessBaseAPI() {

With this patch, not only tesseract but also all other command line tools and the tests use the current locale. Run make check and see that several tests will fail depending on your locale.

@stweil
Contributor

stweil commented May 15, 2019

Failing test on macOS with LANG=de_DE.UTF-8:

$ unittest/apiexample_test 
Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 4 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from EuroText
[ RUN      ] EuroText.FastLatinOCR
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../../../src/ccutil/unicharset.h, line 874

2019-05-16: Fixed in pull request #2437

@stweil
Contributor

stweil commented May 16, 2019

Failing test on macOS with LANG=de_DE.UTF-8:

$ unittest/baseapi_test
Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 12 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 10 tests from TesseractTest
[ RUN      ] TesseractTest.ArraySizeTest
[       OK ] TesseractTest.ArraySizeTest (0 ms)
[ RUN      ] TesseractTest.BasicTesseractTest
[       OK ] TesseractTest.BasicTesseractTest (1251 ms)
[ RUN      ] TesseractTest.IteratesParagraphsEvenIfNotDetected
[       OK ] TesseractTest.IteratesParagraphsEvenIfNotDetected (347 ms)
[ RUN      ] TesseractTest.HOCRWorksWithoutSetInputName
[       OK ] TesseractTest.HOCRWorksWithoutSetInputName (403 ms)
[ RUN      ] TesseractTest.HOCRContainsBaseline
[       OK ] TesseractTest.HOCRContainsBaseline (389 ms)
[ RUN      ] TesseractTest.RickSnyderNotFuckSnyder
[       OK ] TesseractTest.RickSnyderNotFuckSnyder (346 ms)
[ RUN      ] TesseractTest.AdaptToWordStrTest
Trying to adapt "136
" to "1 3 6"
Trying to adapt "256
" to "2 5 6"
Trying to adapt "410
" to "4 1 0"
Trying to adapt "432
" to "4 3 2"
Trying to adapt "540
" to "5 4 0"
Trying to adapt "692
" to "6 9 2"
Trying to adapt "779
" to "7 7 9"
Trying to adapt "793
" to "7 9 3"
Trying to adapt "808
" to "8 0 8"
Trying to adapt "815
" to "8 1 5"
Trying to adapt "12
" to "1 2"
Trying to adapt "12
" to "1 2"
[       OK ] TesseractTest.AdaptToWordStrTest (788 ms)
[ RUN      ] TesseractTest.BasicLSTMTest
[       OK ] TesseractTest.BasicLSTMTest (4525 ms)
[ RUN      ] TesseractTest.LSTMGeometryTest
[       OK ] TesseractTest.LSTMGeometryTest (615 ms)
[ RUN      ] TesseractTest.InitConfigOnlyTest
Error: unichar ? in normproto file is not in unichar set.
Error: unichar 0.232621 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar 0.231864 in normproto file is not in unichar set.
[...]
Error: unichar ? in normproto file is not in unichar set.
Error: unichar 0.233915 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar 0.221755 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar ? in normproto file is not in unichar set.
baseapi_test(21845,0x1134c45c0) malloc: *** error for object 0x927f96c28005e0: pointer being freed was not allocated
baseapi_test(21845,0x1134c45c0) malloc: *** set a breakpoint in malloc_error_break to debug
[INFO]  Lang eng took 327ms in regular init[INFO]  Lang chi_tra took 1422ms in regular initAbort trap: 6

2019-05-18: Fixed in commit 36ed6da.
2019-05-18: malloc/free issue fixed in commit 09edd1a.

@stweil
Contributor

stweil commented May 18, 2019

@githubgs, this issue should be fixed now in branch 4.1 and in Git master. Can we close it?

stweil added a commit to stweil/tesseract that referenced this issue May 18, 2019
The function did not correctly read Chinese unichars into the local
Class variable if the locale was set to de_DE.UTF-8 (or other
incompatible locales). That resulted in a wrong ClassId which was
used to write into the Cutoffs array without checking for valid bounds.

On macOS the result was a runtime error in baseapi_test (see GitHub
issue tesseract-ocr#1250):

    [ RUN      ] TesseractTest.InitConfigOnlyTest
    baseapi_test(21845,0x1134c45c0) malloc: *** error for object 0x927f96c28005e0: pointer being freed was not allocated
    baseapi_test(21845,0x1134c45c0) malloc: *** set a breakpoint in malloc_error_break to debug

Replacing sscanf by std::istringstream fixes that.
Add also an assertion to catch future out-of-bounds writes.

Signed-off-by: Stefan Weil <[email protected]>
stweil added a commit that referenced this issue May 18, 2019
@stweil stweil closed this as completed Jun 22, 2019
@amitdo amitdo added the locale label Mar 21, 2021