TessBaseAPI should not modify the global locale #3290

datalogics-kam · 2021-02-10T20:00:44Z

Environment

Tesseract Version: 4.1.1
Commit Number:
Platform: Darwin Xoots 19.6.0 Darwin Kernel Version 19.6.0: Tue Nov 10 00:10:30 PST 2020; root:xnu-6153.141.10~1/RELEASE_X86_64 x86_64

Current Behavior:

Tesseract modifies the global locale when the API is linked to other consumers, if built with Debug.

Expected Behavior:

Tesseract does not modify the locale.

Suggested Fix:

The following code in TessBaseAPI::TessBaseAPI() and its preceding code in 4.1.1 modify the global locale if a debug Tesseract is linked to a program. That would be the case when consuming Tesseract through Conan with a -s build_type=Debug.

In our case, it's changing the number formatting of one of our products. We are already careful to change the locale to "C" and restore the locale when calling Tesseract, but this resulted in a difficult to understand bug that we initially thought was in our code.

#if !defined(NDEBUG)
  // The Tesseract executables would use the "C" locale by default,
  // but other software which is linked against the Tesseract library
  // typically uses the locale from the user's environment.
  // Here the default is overridden to allow debugging of potential
  // problems caused by the locale settings.

  // Use the current locale if building debug code.
  try {
    std::locale::global(std::locale(""));
  } catch (const std::runtime_error& ex) {
    fprintf(stderr, "Warning: Could not set the current locale\n");
  }
#endif

The text was updated successfully, but these errors were encountered:

egorpugin · 2021-03-14T14:57:58Z

@stweil
I think we could remove that code. What do you think?
Or change that ifdef to #ifndef TESS_DEBUG, because it really uses global debug flag which may affect other code.
TESS_DEBUG will explicitly say that we want to trace exactly tess code.
But my preference is to remove that locale setting completely.

stweil · 2021-03-25T09:59:08Z

The Tesseract library is used in C/C++ applications like the tesseract executables. Such applications use the C locale by default and won't have problems with code parts which change the behaviour with different locales.

But the library is also used in other applications which don't use the C locale. Therefore it is important that the library works with any locale settings.

The code which was removed now made sure that debug versions of the Tesseract library always used the current locale of the user, so we had a broad test with different locales when people used debug builds.

Release builds should set NDEBUGwhich disables the locale settings. Then the TessBaseAPI does not modify the global locale.

Please revert the commit which removed the test code.

egorpugin · 2021-03-25T10:48:46Z

I do not understand and do not agree

The code which was removed

We do not need that code to make bugs visible.

Locale issues must be reported and fixed.
Setting locale is not what libtess should do.
If we need to check locale issues, we should find all places that use locales. What are those? I know c++ streams, what else?

egorpugin · 2021-03-25T11:18:13Z

Isn't default program locale is system locale?
Where can I read about opposite?

If first question is true (default program locale is system locale), then we already have many user locales tested in release and debug builds and removed code does not make any sense.
Am I wrong?

stweil · 2021-03-25T11:42:15Z

"ISO C says that all programs start by default in the standard ‘C’ locale." (see https://www.gnu.org/software/libc/manual/html_node/Setting-the-Locale.html). Therefore we won't notice any problems caused by other locales unless we explicitly set a different locale.

Java or Python programs which use the Tesseract library would be affected by such problems.

A lot of functions depend on locale settings, not only the C++ streams but also the scanf and printf family of functions (and therefore also tprintf) and some more. I remember that I made a list some years ago in an earlier issue.

egorpugin · 2021-03-25T12:35:57Z

"ISO C says that all programs start by default in the standard ‘C’ locale.

I see, thank you.
I've also found on https://en.cppreference.com/w/cpp/locale/setlocale
During program startup, the equivalent of std::setlocale(LC_ALL, "C"); is executed before any user code is run.

But if we need such test, why not create some unit test that will iterate over many locales, use several tess facilities that use locales.
(I agree that such unit test wil have smaller coverage.)

Then, next question is why do we have such issues? If we do some locale dependeny c++ io, we should imbue locale on streams explicitly (C locale to get locale independent results). If we use C print/scan functions, they should be checked on whether it is allowed to do locale dependent io.

I've already fixed some c++ io issue with streams during unittest CI setup. Seems like because of locale set in removed code or in unittest startup code.

I don't think that bugs should be catched in such way.
If we replace !NDEBUG with !TESS_DEBUG - no one will use that our def except our code (tess.exe, unittests).

Upd.:
Tess data must be stored in locale/os/arch/endianness/etc independent format. So, we don't need any locales here.
Also the binary data will be better as text data exactly uses locales.

stweil · 2021-03-25T12:47:46Z

why not create some unit test that will iterate over many locales

A locale has to be installed before it can be activated. Typical Linux installations only install a single locale depending on the language settings (in my case de_DE.UTF-8). The main contributors would at least cover some representative locales (RTL languages, Indic, Russian, Western European).

A typical example which can cause problems: training output from lstmtraining prints float numbers. Those numbers are parsed by scripts which plot the training results. Depending on the locale, the representation would look very different:

egorpugin · 2021-03-25T13:06:48Z

training output from lstmtraining prints float numbers.

I assume it must be fixed then? It should always print C locale numbers. If any script cannot parse them (works in some other locale), it is that script's problem, not ours.

Shreeshrii · 2021-03-26T03:43:29Z

training output from lstmtraining prints float numbers. Those numbers are parsed by scripts which plot the training results. Depending on the locale, the representation would look very different:

I had not considered this while creating the plotting scripts in tesstrain repo.

https://github.com/tesseract-ocr/tesstrain/pull/236/files#diff-c1286e969112d6bd21411905c6e5901139206ac5d7d995242af2eb0e37801106

the script uses . as the decimal delimiter when trying to get CER values from the log file.

datalogics-kam mentioned this issue Feb 10, 2021

[tesseract] Don't affect global locale with build_type=Debug conan-io/conan-center-index#4533

Merged

4 tasks

amitdo added the locale label Mar 21, 2021

egorpugin closed this as completed in 7c975a0 Mar 24, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TessBaseAPI should not modify the global locale #3290

TessBaseAPI should not modify the global locale #3290

datalogics-kam commented Feb 10, 2021 •

edited

Loading

egorpugin commented Mar 14, 2021

stweil commented Mar 25, 2021

egorpugin commented Mar 25, 2021

egorpugin commented Mar 25, 2021

stweil commented Mar 25, 2021

egorpugin commented Mar 25, 2021 •

edited

Loading

stweil commented Mar 25, 2021

egorpugin commented Mar 25, 2021

Shreeshrii commented Mar 26, 2021

TessBaseAPI should not modify the global locale #3290

TessBaseAPI should not modify the global locale #3290

Comments

datalogics-kam commented Feb 10, 2021 • edited Loading

Environment

Current Behavior:

Expected Behavior:

Suggested Fix:

egorpugin commented Mar 14, 2021

stweil commented Mar 25, 2021

egorpugin commented Mar 25, 2021

egorpugin commented Mar 25, 2021

stweil commented Mar 25, 2021

egorpugin commented Mar 25, 2021 • edited Loading

stweil commented Mar 25, 2021

egorpugin commented Mar 25, 2021

Shreeshrii commented Mar 26, 2021

datalogics-kam commented Feb 10, 2021 •

edited

Loading

egorpugin commented Mar 25, 2021 •

edited

Loading