tesseract failed loading non-english language.traineddata #1250

gbolin · 2017-12-28T02:59:13Z

Environment

Tesseract Version: <tesseract 4.00.00dev-690-g1b0379c2>
Platform: <mac OS 64bit, High Sierra, 10.13.1>

Current Behavior:

My situation is a little complicated.
I built Tesseract as a library for another application to call. When "api->init" tries to load chi_sim, it fails, but ONLY in the IDE (PyCharm) environment. After debugging, I traced the failure to the function "load_via_fgets" in the file "tesseract/ccutil/unicharset.cpp": from row 825 on, sscanf returns 1/1/1/1/1/1/1 rather than 17/16/10/8/4/3/2, so the function returns 'false' to "bool UNICHARSET::load_from_file(tesseract::TFile *file, bool skip_fragments)" at row 781.
ATTENTION: this does not happen on the terminal command line, only in the IDE. I also found the same problem reported for tess4j, link:.
Looking forward to hearing from you, thanks so much.

Expected Behavior:

Suggested Fix:

gbolin · 2017-12-28T03:01:57Z

If I load 'eng.traineddata', it works fine. Loading a self-trained data file also works fine.

amitdo · 2017-12-28T08:33:57Z

Seems like an issue related to locale settings.
https://www.google.co.il/search?q=osx+%22pycharm%22+%22locale%22

gbolin · 2017-12-28T08:57:08Z

@amitdo Hi, I tried that, but it does not seem to work. Any other ideas?

amitdo · 2017-12-28T09:45:54Z

> any other ideas?

No. Try the forum.

ITCoolie · 2017-12-29T06:49:33Z

Hi all, I want Tesseract to write its temporary image files to local disk so that I can see the result of each step. I compiled Tesseract with "--enable-debug", but after recognizing an image I cannot find the temporary image files. Has anyone met a similar problem? Thanks.

gbolin · 2017-12-29T08:29:36Z

@amitdo I finally found the reason: it is related to the "locale". Here is the explanation.
Running "combine_tessdata -U chi_sim.traineddata ./chi_sim." generates a file named "chi_sim.unicharset" (this file is the key reason why non-English traineddata files sometimes cannot be loaded). The function "bool UNICHARSET::load_via_fgets" at "tesseract/ccutil/unicharset.cpp:789" reads that unicharset file row by row. The problem appears when it arrives here:
(v = sscanf(buffer, "%s %x %d,%d,%d,%d,%g,%g,%g,%g,%g,%g %63s %d %d %d %63s", unichar, &properties, &min_bottom, &max_bottom, &min_top, &max_top, &width, &width_sd, &bearing, &bearing_sd, &advance, &advance_sd, script, &other_case, &direction, &mirror, normed)) != 17
Let's say buffer is "格 1 63,69,255,255,192,220,0,9,205,233 Han 7 0 7 格 # 格 [683c ]x".
sscanf calls isspace while scanning. The UTF-8 encoding of the character "格" is 0xE6 0xA0 0xBC,
and the byte 0xA0 is recognized as a space, so the %s conversion stops in the middle of the character. That is the key reason!
Tesseract picks up the default locale setting through std::locale, and in unicharset.cpp that is exactly what makes sscanf fail.
This is not limited to Chinese. Under a UTF-8-based locale, any character whose encoding contains certain byte values such as 0xA0 or 0x85 will fail in the same way, especially on a non-English operating system.
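The byte-level effect described above can be checked in isolation. The following is a minimal sketch, not Tesseract code, and the helper name space_byte_positions is hypothetical. It reports which bytes of a UTF-8 string the current C locale classifies as whitespace. In the "C" locale only the ASCII space of the sample line is reported; according to the analysis above, the affected environments additionally report the 0xA0 byte inside "格".

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the indices of all bytes in `utf8` that the
// current C locale (as set by setlocale) treats as whitespace.
std::vector<std::size_t> space_byte_positions(const std::string &utf8) {
  std::vector<std::size_t> positions;
  for (std::size_t i = 0; i < utf8.size(); ++i) {
    // Convert to unsigned char first: passing a negative char value to
    // isspace is undefined behaviour.
    if (std::isspace(static_cast<unsigned char>(utf8[i]))) {
      positions.push_back(i);
    }
  }
  return positions;
}
```

Calling space_byte_positions("\xE6\xA0\xBC 1") right after setlocale(LC_ALL, "C") yields only index 3, the real separator. Running the same call after setlocale(LC_ALL, "") in a failing environment is one way to confirm whether index 1 (the 0xA0 byte) shows up too.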
How to solve:
1: Change the system language to English. Maybe not a good idea, but it works for me.
2: Change the unicharset.cpp source code. I tried this on my own macOS machine, like this:

// ... from row 823
char normed[64];
int v = -1;
// ---- added code (needs #include <locale>) ----
std::locale lc("C");
std::locale::global(lc);  // switches the locale for the whole process
// ---- end of added code ----
if (fgets_cb->Run(buffer, sizeof (buffer)) == NULL ||
// ... continues

@amitdo Thanks a lot for reading.
regards, GS.

amitdo · 2017-12-29T10:53:40Z

@stweil, your thoughts on the suggested change?

amitdo · 2017-12-29T11:00:36Z

@ITCoolie,
The right place to ask general questions is the forum.

stweil · 2017-12-29T15:03:24Z

@githubgs, which locale did you use when it failed?

gbolin · 2017-12-30T14:26:38Z

@amitdo Sorry for the late reply.
I ran the 'env' command in the terminal; the following two pictures show what you need.
The first was taken with the OS language set to Chinese, the second with it set to English.

It seems that the locale value only shows up in the English environment.
Hope this is helpful to you.

wxs · 2018-03-29T19:35:06Z

Hey all, if you're just coming across this issue, I solved it by setting the locale in Python thus at the top of my script:

import locale
locale.setlocale(locale.LC_ALL, "C")

jeroen · 2018-04-22T01:33:05Z

I confirm that I also ran into this problem with the R bindings. All is fine for most languages; however, Asian languages like jpn and kor would not load with en_US.UTF-8.

A workaround is to set Sys.setlocale('LC_CTYPE', 'C') and then it works. However it is unclear to me if I can set it back to en_US.UTF-8 afterwards.

This is 3.05.01 by the way.

delonzhou · 2018-06-05T05:31:21Z

I have the same issue in IDEA with the following environment.

macOS 10.13.4
tesseract 4.00.00alpha
leptonica-1.76.0
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE

My workaround is to add the environment variable LC_CTYPE=C in IDEA; it works.

stweil · 2018-06-08T15:46:50Z

Pull request #1649 makes Tesseract initialization fail if the locale settings are wrong. Users who get that failure must set the "C" locale in their code.
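A caller can probe in advance whether its process would trip such a check. The sketch below is modelled on the assertions visible in the baseapi.cpp diff quoted further down in this thread, which accept "C" or "C.UTF-8" for LC_ALL, LC_CTYPE and LC_NUMERIC. The function names are hypothetical, and the exact check in any given Tesseract version may differ.

```cpp
#include <clocale>
#include <cstring>

// Hypothetical helper: true if the given locale category is currently set to
// "C" or "C.UTF-8", the two names accepted by the assertions in the diff.
bool category_is_c_locale(int category) {
  // A null second argument only queries the setting; it changes nothing.
  const char *name = std::setlocale(category, nullptr);
  return name != nullptr &&
         (std::strcmp(name, "C") == 0 || std::strcmp(name, "C.UTF-8") == 0);
}

// Hypothetical helper: checks the same three categories as the diff.
bool locale_looks_safe_for_tesseract() {
  return category_is_c_locale(LC_ALL) && category_is_c_locale(LC_CTYPE) &&
         category_is_c_locale(LC_NUMERIC);
}
```

If locale_looks_safe_for_tesseract() returns false, calling std::setlocale(LC_ALL, "C") before constructing TessBaseAPI is the remedy this comment asks for.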

stweil · 2018-10-01T12:07:39Z

Technically this issue was closed by enforcing the "C" locale in pull request #1649, but that causes problems requiring ugly workarounds in projects which use the Tesseract API from Python, Java or other languages which typically don't set the "C" locale.

Therefore I suggest keeping it open.

zdenop · 2018-10-02T07:15:01Z

@stweil : what about my suggestion to implement jeroen's code in TessBaseAPI::Init?

iseegr8tfuldeadppl · 2019-05-06T00:07:19Z

Make sure the environment variable TESSDATA_PREFIX is set to your tessdata directory!
(for ex. C:\msys64\mingw32\share\tessdata).

zdenop · 2019-05-10T07:25:30Z

@stweil : Is assert still needed for non "C" LC_ALL?

datalogics-kam · 2019-05-11T14:45:48Z

It turns out that we're having this problem as well, on macOS with 3.05.01. I'm considering a patch to the load_via_fgets code to use sscanf_l where available, which will allow passing in a locale for the call, rather than modifying the global locale.

Another alternative, which might make cleaner code, would be to use uselocale if it's available. That sets the locale for only the current thread, and then it can be set back to the previous locale at function exit. I might try this one first.
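The uselocale idea could look roughly like the following RAII guard. This is a sketch that assumes a POSIX.1-2008 platform (newlocale, uselocale and freelocale do not exist on Windows). The class name ThreadCLocaleGuard is hypothetical, and this is not a patch from the Tesseract repository.

```cpp
#include <locale.h>  // newlocale, uselocale, freelocale, LC_ALL_MASK
#if defined(__APPLE__)
#include <xlocale.h>  // macOS declares the per-thread locale API here
#endif
#include <stdexcept>

// Hypothetical guard: switches only the calling thread to the "C" locale and
// restores the thread's previous locale when it goes out of scope.
class ThreadCLocaleGuard {
 public:
  ThreadCLocaleGuard()
      : c_locale_(newlocale(LC_ALL_MASK, "C", static_cast<locale_t>(0))) {
    if (c_locale_ == static_cast<locale_t>(0)) {
      throw std::runtime_error("newlocale(\"C\") failed");
    }
    // uselocale returns the locale that was active for this thread, which
    // may be LC_GLOBAL_LOCALE if none was set before.
    previous_ = uselocale(c_locale_);
  }

  ~ThreadCLocaleGuard() {
    uselocale(previous_);   // restore first, then release our locale object
    freelocale(c_locale_);
  }

  ThreadCLocaleGuard(const ThreadCLocaleGuard &) = delete;
  ThreadCLocaleGuard &operator=(const ThreadCLocaleGuard &) = delete;

 private:
  locale_t c_locale_;
  locale_t previous_;
};
```

Declaring a ThreadCLocaleGuard at the top of the scope that loads the traineddata would leave other threads, and the host application's global locale, untouched.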

Thoughts welcome, and of course I'll contribute back patches.

amitdo · 2019-05-12T00:01:49Z

See #1670

stweil · 2019-05-12T06:49:36Z

> Is assert still needed for non "C" LC_ALL?

The problem in load_via_fgets which was mentioned by @datalogics-kam still exists: sscanf has to be replaced by C++ stringstream like in the other places.
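As an illustration of that kind of replacement, the sketch below parses the leading fields of a unicharset line with a std::istringstream imbued with the classic "C" locale, so the result does not depend on what setlocale was called with. It is not the code from the Tesseract repository: the struct and function names are hypothetical, and only the first six fields of the sscanf format quoted earlier in this thread are handled.

```cpp
#include <ios>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical container for the first fields of one unicharset line.
struct UnicharFields {
  std::string unichar;
  unsigned int properties = 0;
  int min_bottom = 0;
  int max_bottom = 0;
  int min_top = 0;
  int max_top = 0;
};

// Parses "<unichar> <hex properties> <min_bottom>,<max_bottom>,<min_top>,<max_top>".
// Returns false if the line does not have that shape.
bool parse_unichar_fields(const std::string &line, UnicharFields *out) {
  std::istringstream stream(line);
  // Pin the stream to the classic "C" locale: only ASCII whitespace separates
  // tokens, so a byte such as 0xA0 inside a UTF-8 character is kept.
  stream.imbue(std::locale::classic());

  char c1 = 0, c2 = 0, c3 = 0;  // receive the three comma separators
  stream >> out->unichar >> std::hex >> out->properties >> std::dec >>
      out->min_bottom >> c1 >> out->max_bottom >> c2 >> out->min_top >> c3 >>
      out->max_top;
  return !stream.fail() && c1 == ',' && c2 == ',' && c3 == ',';
}
```

A stream carries its own locale object, which is why this approach is independent of the process-wide C locale that sscanf consults.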

stweil · 2019-05-12T07:27:11Z

@datalogics-kam, which locale settings failed in your test? I'd like to reproduce your problem to see whether it is fixed by new code.

stweil · 2019-05-12T08:00:32Z

Function ReadParamDesc also still needs a replacement for sscanf.

2019-05-12: This was now done in pull request #2430. While implementing this, an unrelated bug was found and fixed, too.

datalogics-kam · 2019-05-13T15:02:47Z

@stweil I've been able to reproduce it with LC_ALL, LC_CTYPE, and LC_NUMERIC set to "en_US.UTF-8". That's what the JVM was setting. In the unit tests for our OCR wrapper, I added a fixture to recreate that locale setting.

Since we're still on 3.05.01 here, and I see that version 4 asserts that the locale must be "C", I'm going to put a fix in our code that uses the C locale when calling Tesseract and restores the locale afterwards.

Since setlocale is global, if uselocale is available, our code will use that as it is thread-specific. I also found this code to set the locale on a thread basis on Windows: https://stackoverflow.com/a/17173977
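Where uselocale is not available, the save, switch and restore approach described above might be sketched as follows. The class name is hypothetical. Because setlocale changes the locale of the whole process, this is only safe if no other thread depends on the locale while the guard is alive.

```cpp
#include <clocale>
#include <string>

// Hypothetical guard: sets the process-wide locale to "C" and restores the
// previous setting when it goes out of scope.
class ScopedGlobalCLocale {
 public:
  ScopedGlobalCLocale() {
    // Query the current setting and copy it: the returned pointer may be
    // invalidated by the next setlocale call.
    if (const char *current = std::setlocale(LC_ALL, nullptr)) {
      saved_ = current;
      have_saved_ = true;
    }
    std::setlocale(LC_ALL, "C");
  }

  ~ScopedGlobalCLocale() {
    if (have_saved_) {
      std::setlocale(LC_ALL, saved_.c_str());
    }
  }

  ScopedGlobalCLocale(const ScopedGlobalCLocale &) = delete;
  ScopedGlobalCLocale &operator=(const ScopedGlobalCLocale &) = delete;

 private:
  std::string saved_;
  bool have_saved_ = false;
};
```

The guard would be declared immediately before the Tesseract call in the wrapper, so that the JVM's or interpreter's locale is back in place as soon as the call returns.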

Thanks for the insights!

stweil · 2019-05-15T20:33:31Z

How to test whether Tesseract code works with your locale:

The following patch disables the assertions which check for the right locale and enables the current locale for all Tesseract code:

diff --git a/src/api/baseapi.cpp b/src/api/baseapi.cpp
index 61b38f8e..72e892b8 100644
--- a/src/api/baseapi.cpp
+++ b/src/api/baseapi.cpp
@@ -209,6 +209,9 @@ TessBaseAPI::TessBaseAPI()
       rect_height_(0),
       image_width_(0),
       image_height_(0) {
+#if 1
+  setlocale(LC_ALL, "");
+#else
   const char *locale;
   locale = std::setlocale(LC_ALL, nullptr);
   ASSERT_HOST(!strcmp(locale, "C") || !strcmp(locale, "C.UTF-8"));
@@ -216,6 +219,7 @@ TessBaseAPI::TessBaseAPI()
   ASSERT_HOST(!strcmp(locale, "C") || !strcmp(locale, "C.UTF-8"));
   locale = std::setlocale(LC_NUMERIC, nullptr);
   ASSERT_HOST(!strcmp(locale, "C") || !strcmp(locale, "C.UTF-8"));
+#endif
 }
 
 TessBaseAPI::~TessBaseAPI() {

With this patch, not only tesseract but also all other command line tools and the tests use the current locale. Run make check and see that several tests will fail depending on your locale.

stweil · 2019-05-15T20:35:20Z

Failing test on macOS with LANG=de_DE.UTF-8:

$ unittest/apiexample_test 
Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 4 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from EuroText
[ RUN      ] EuroText.FastLatinOCR
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../../../src/ccutil/unicharset.h, line 874

2019-05-16: Fixed in pull request #2437

stweil · 2019-05-16T05:03:53Z

Failing test on macOS with LANG=de_DE.UTF-8:

$ unittest/baseapi_test
Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 12 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 10 tests from TesseractTest
[ RUN      ] TesseractTest.ArraySizeTest
[       OK ] TesseractTest.ArraySizeTest (0 ms)
[ RUN      ] TesseractTest.BasicTesseractTest
[       OK ] TesseractTest.BasicTesseractTest (1251 ms)
[ RUN      ] TesseractTest.IteratesParagraphsEvenIfNotDetected
[       OK ] TesseractTest.IteratesParagraphsEvenIfNotDetected (347 ms)
[ RUN      ] TesseractTest.HOCRWorksWithoutSetInputName
[       OK ] TesseractTest.HOCRWorksWithoutSetInputName (403 ms)
[ RUN      ] TesseractTest.HOCRContainsBaseline
[       OK ] TesseractTest.HOCRContainsBaseline (389 ms)
[ RUN      ] TesseractTest.RickSnyderNotFuckSnyder
[       OK ] TesseractTest.RickSnyderNotFuckSnyder (346 ms)
[ RUN      ] TesseractTest.AdaptToWordStrTest
Trying to adapt "136
" to "1 3 6"
Trying to adapt "256
" to "2 5 6"
Trying to adapt "410
" to "4 1 0"
Trying to adapt "432
" to "4 3 2"
Trying to adapt "540
" to "5 4 0"
Trying to adapt "692
" to "6 9 2"
Trying to adapt "779
" to "7 7 9"
Trying to adapt "793
" to "7 9 3"
Trying to adapt "808
" to "8 0 8"
Trying to adapt "815
" to "8 1 5"
Trying to adapt "12
" to "1 2"
Trying to adapt "12
" to "1 2"
[       OK ] TesseractTest.AdaptToWordStrTest (788 ms)
[ RUN      ] TesseractTest.BasicLSTMTest
[       OK ] TesseractTest.BasicLSTMTest (4525 ms)
[ RUN      ] TesseractTest.LSTMGeometryTest
[       OK ] TesseractTest.LSTMGeometryTest (615 ms)
[ RUN      ] TesseractTest.InitConfigOnlyTest
Error: unichar ? in normproto file is not in unichar set.
Error: unichar 0.232621 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar 0.231864 in normproto file is not in unichar set.
[...]
Error: unichar ? in normproto file is not in unichar set.
Error: unichar 0.233915 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar 0.221755 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar ? in normproto file is not in unichar set.
baseapi_test(21845,0x1134c45c0) malloc: *** error for object 0x927f96c28005e0: pointer being freed was not allocated
baseapi_test(21845,0x1134c45c0) malloc: *** set a breakpoint in malloc_error_break to debug
[INFO]  Lang eng took 327ms in regular init[INFO]  Lang chi_tra took 1422ms in regular initAbort trap: 6

2019-05-18: Fixed in commit 36ed6da.
2019-05-18: malloc/free issue fixed in commit 09edd1a.

stweil · 2019-05-18T05:54:12Z

@githubgs, this issue should be fixed now in branch 4.1 and in Git master. Can we close it?

The function did not correctly read Chinese unichars into the local Class variable if the locale was set to de_DE.UTF-8 (or other incompatible locales). That resulted in a wrong ClassId which was used to write into the Cutoffs array without checking for valid bounds.

On macOS the result was a runtime error in baseapi_test (see GitHub issue tesseract-ocr#1250):

[ RUN      ] TesseractTest.InitConfigOnlyTest
baseapi_test(21845,0x1134c45c0) malloc: *** error for object 0x927f96c28005e0: pointer being freed was not allocated
baseapi_test(21845,0x1134c45c0) malloc: *** set a breakpoint in malloc_error_break to debug

Replacing sscanf by std::istringstream fixes that.

Add also an assertion to catch future out-of-bounds writes.

Signed-off-by: Stefan Weil


gbolin changed the title ~~tesseract failed loading language~~ tesseract failed loading non-english language.traineddata Dec 29, 2017

nguyenq mentioned this issue Jan 14, 2018

Failed loading language Tesseract couldn't load any languages! nguyenq/tess4j#34

Closed

Shreeshrii mentioned this issue Apr 6, 2018

Error: Illegal Parameter specification! with Tesseract4Alpha #1010

Closed

jeroen mentioned this issue Apr 21, 2018

Bug with Japanese ropensci/tesseract#14

Closed

jeroen mentioned this issue Apr 26, 2018

Document LC_CTYPE for language data #1532

Closed

stweil mentioned this issue Jun 22, 2018

recent change setlocale in baseapi.c causes Python loaded tesseract library to fail #1670

Closed

stweil added the bug label Oct 1, 2018

stweil added this to the 4.1.0 milestone Feb 19, 2019

stweil mentioned this issue Feb 27, 2019

!strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 209 sirfz/tesserocr#165

Closed

stweil closed this as completed Jun 22, 2019

amitdo added the locale label Mar 21, 2021

Jerry-Gump mentioned this issue Sep 9, 2021

OCRTesseract can not recognize Chinese shimat/opencvsharp#873

Closed
