Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TessBaseAPI should not modify the global locale #3290

Closed
datalogics-kam opened this issue Feb 10, 2021 · 9 comments
Closed

TessBaseAPI should not modify the global locale #3290

datalogics-kam opened this issue Feb 10, 2021 · 9 comments
Labels

Comments

@datalogics-kam
Copy link

datalogics-kam commented Feb 10, 2021

Environment

  • Tesseract Version: 4.1.1
  • Commit Number:
  • Platform: Darwin Xoots 19.6.0 Darwin Kernel Version 19.6.0: Tue Nov 10 00:10:30 PST 2020; root:xnu-6153.141.10~1/RELEASE_X86_64 x86_64

Current Behavior:

Tesseract modifies the global locale when the API is linked to other consumers, if built with Debug.

Expected Behavior:

Tesseract does not modify the locale.

Suggested Fix:

The following code in TessBaseAPI::TessBaseAPI() and its preceding code in 4.1.1 modify the global locale if a debug Tesseract is linked to a program. That would be the case when consuming Tesseract through Conan with a -s build_type=Debug.

In our case, it's changing the number formatting of one of our products. We are already careful to change the locale to "C" and restore the locale when calling Tesseract, but this resulted in a difficult to understand bug that we initially thought was in our code.

#if !defined(NDEBUG)
  // The Tesseract executables would use the "C" locale by default,
  // but other software which is linked against the Tesseract library
  // typically uses the locale from the user's environment.
  // Here the default is overridden to allow debugging of potential
  // problems caused by the locale settings.

  // Use the current locale if building debug code.
  try {
    std::locale::global(std::locale(""));
  } catch (const std::runtime_error& ex) {
    fprintf(stderr, "Warning: Could not set the current locale\n");
  }
#endif
@egorpugin
Copy link
Contributor

@stweil
I think we could remove that code. What do you think?
Or change that ifdef to #ifndef TESS_DEBUG, because it really uses global debug flag which may affect other code.
TESS_DEBUG will explicitly say that we want to trace exactly tess code.
But my preference is to remove that locale setting completely.

@stweil
Copy link
Contributor

stweil commented Mar 25, 2021

The Tesseract library is used in C/C++ applications like the tesseract executables. Such applications use the C locale by default and won't have problems with code parts which change the behaviour with different locales.

But the library is also used in other applications which don't use the C locale. Therefore it is important that the library works with any locale settings.

The code which was removed now made sure that debug versions of the Tesseract library always used the current locale of the user, so we had a broad test with different locales when people used debug builds.

Release builds should set NDEBUGwhich disables the locale settings. Then the TessBaseAPI does not modify the global locale.

Please revert the commit which removed the test code.

@egorpugin
Copy link
Contributor

I do not understand and do not agree

The code which was removed

We do not need that code to make bugs visible.


  1. Locale issues must be reported and fixed.
    Setting locale is not what libtess should do.
  2. If we need to check locale issues, we should find all places that use locales. What are those? I know c++ streams, what else?

@egorpugin
Copy link
Contributor

Isn't default program locale is system locale?
Where can I read about opposite?

If first question is true (default program locale is system locale), then we already have many user locales tested in release and debug builds and removed code does not make any sense.
Am I wrong?

@stweil
Copy link
Contributor

stweil commented Mar 25, 2021

"ISO C says that all programs start by default in the standard ‘C’ locale." (see https://www.gnu.org/software/libc/manual/html_node/Setting-the-Locale.html). Therefore we won't notice any problems caused by other locales unless we explicitly set a different locale.

Java or Python programs which use the Tesseract library would be affected by such problems.

A lot of functions depend on locale settings, not only the C++ streams but also the scanf and printf family of functions (and therefore also tprintf) and some more. I remember that I made a list some years ago in an earlier issue.

@egorpugin
Copy link
Contributor

egorpugin commented Mar 25, 2021

"ISO C says that all programs start by default in the standard ‘C’ locale.

I see, thank you.
I've also found on https://en.cppreference.com/w/cpp/locale/setlocale
During program startup, the equivalent of std::setlocale(LC_ALL, "C"); is executed before any user code is run.

But if we need such test, why not create some unit test that will iterate over many locales, use several tess facilities that use locales.
(I agree that such unit test wil have smaller coverage.)


Then, next question is why do we have such issues? If we do some locale dependeny c++ io, we should imbue locale on streams explicitly (C locale to get locale independent results). If we use C print/scan functions, they should be checked on whether it is allowed to do locale dependent io.

I've already fixed some c++ io issue with streams during unittest CI setup. Seems like because of locale set in removed code or in unittest startup code.

I don't think that bugs should be catched in such way.
If we replace !NDEBUG with !TESS_DEBUG - no one will use that our def except our code (tess.exe, unittests).

Upd.:
Tess data must be stored in locale/os/arch/endianness/etc independent format. So, we don't need any locales here.
Also the binary data will be better as text data exactly uses locales.

@stweil
Copy link
Contributor

stweil commented Mar 25, 2021

why not create some unit test that will iterate over many locales

A locale has to be installed before it can be activated. Typical Linux installations only install a single locale depending on the language settings (in my case de_DE.UTF-8). The main contributors would at least cover some representative locales (RTL languages, Indic, Russian, Western European).

A typical example which can cause problems: training output from lstmtraining prints float numbers. Those numbers are parsed by scripts which plot the training results. Depending on the locale, the representation would look very different:

3.141
3,141
8192.05
8192,05
8.192,05
8,192.05

@egorpugin
Copy link
Contributor

training output from lstmtraining prints float numbers.

I assume it must be fixed then? It should always print C locale numbers. If any script cannot parse them (works in some other locale), it is that script's problem, not ours.

@Shreeshrii
Copy link
Collaborator

training output from lstmtraining prints float numbers. Those numbers are parsed by scripts which plot the training results. Depending on the locale, the representation would look very different:

I had not considered this while creating the plotting scripts in tesstrain repo.

https://github.com/tesseract-ocr/tesstrain/pull/236/files#diff-c1286e969112d6bd21411905c6e5901139206ac5d7d995242af2eb0e37801106

the script uses . as the decimal delimiter when trying to get CER values from the log file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants