TessBaseAPI should not modify the global locale #3290
Comments
@stweil |
The Tesseract library is used in C/C++ applications like the But the library is also used in other applications which don't use the C locale. Therefore it is important that the library works with any locale settings. The code which has now been removed made sure that debug versions of the Tesseract library always used the current locale of the user, so we had a broad test with different locales when people used debug builds. Release builds should set Please revert the commit which removed the test code. |
I do not understand and do not agree
We do not need that code to make bugs visible.
|
Isn't the default program locale the system locale? If the first question is true (the default program locale is the system locale), then we already have many user locales tested in release and debug builds, and the removed code does not make any sense. |
"ISO C says that all programs start by default in the standard ‘C’ locale." (see https://www.gnu.org/software/libc/manual/html_node/Setting-the-Locale.html). Therefore we won't notice any problems caused by other locales unless we explicitly set a different locale. Java or Python programs which use the Tesseract library would be affected by such problems. A lot of functions depend on locale settings, not only the C++ streams but also the |
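The point about the startup locale can be checked with a few lines of standard C++. This is a minimal sketch, not Tesseract code: the helper names `current_c_locale` and `format_with_printf` are hypothetical, and only `std::setlocale` and `std::snprintf` from the standard library are used.

```cpp
#include <clocale>
#include <cstdio>
#include <string>

// Hypothetical helper: returns the name of the active global C locale.
// Passing nullptr only queries the locale; it does not change it.
std::string current_c_locale() {
  const char* name = std::setlocale(LC_ALL, nullptr);
  return name ? name : "";
}

// Hypothetical helper: formats a number with snprintf. The decimal
// separator it emits depends on the active LC_NUMERIC category, so the
// result can differ once a host program has switched locales.
std::string format_with_printf(double value) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%.2f", value);
  return buf;
}
```

In a program that never calls `setlocale`, `current_c_locale()` is expected to report `"C"`, which is why locale bugs stay hidden until a host application (or test code) activates another locale.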
I see, thank you. But if we need such a test, why not create a unit test that iterates over many locales and exercises several Tesseract facilities that use locales? The next question is why we have such issues at all. If we do locale-dependent C++ I/O, we should imbue a locale on the streams explicitly (the C locale, to get locale-independent results). If we use the C print/scan functions, each use should be checked for whether locale-dependent I/O is acceptable there. I've already fixed a C++ I/O issue with streams during the unittest CI setup, apparently caused by the locale set in the removed code or in the unittest startup code. I don't think that bugs should be caught in such a way. Upd.: |
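Imbuing the C locale on a stream, as suggested in the comment above, could look like the following sketch. The function names are hypothetical and not part of Tesseract; `std::locale::classic()` is the standard way to obtain the "C" locale for C++ streams.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: writes a number using the classic ("C") locale,
// regardless of what the global locale is set to.
std::string format_locale_independent(double value) {
  std::ostringstream out;
  out.imbue(std::locale::classic());
  out << value;
  return out.str();
}

// Hypothetical helper: parses a number that was written with '.' as the
// decimal separator, again independent of the global locale.
double parse_locale_independent(const std::string& text) {
  std::istringstream in(text);
  in.imbue(std::locale::classic());
  double value = 0.0;
  in >> value;
  return value;
}
```

Because the locale is attached to the stream itself, this approach does not touch the global locale and so does not affect the host application.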
A locale has to be installed before it can be activated. Typical Linux installations only install a single locale depending on the language settings (in my case de_DE.UTF-8). The main contributors would at least cover some representative locales (RTL languages, Indic, Russian, Western European). A typical example which can cause problems: training output from
|
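Since a locale has to be installed before it can be activated, a test that iterates over locales would need to skip the ones that are missing. A minimal sketch of such a check, assuming a hypothetical helper named `locale_is_installed`, relies on `std::setlocale` returning a null pointer when the requested locale is unavailable:

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: reports whether `name` can be activated on this
// system. The previous global locale is restored before returning.
bool locale_is_installed(const char* name) {
  const char* current = std::setlocale(LC_ALL, nullptr);
  // Copy the name: the buffer returned by setlocale may be overwritten
  // by the next setlocale call.
  const std::string saved = current ? current : "C";
  const bool ok = std::setlocale(LC_ALL, name) != nullptr;
  std::setlocale(LC_ALL, saved.c_str());
  return ok;
}
```

A locale-iterating unit test could call this for each candidate (for example `de_DE.UTF-8`) and skip the ones that return `false`, so the test still passes on machines with only one locale installed.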
I assume it must be fixed then? It should always print C locale numbers. If any script cannot parse them (because it works in some other locale), it is that script's problem, not ours. |
I had not considered this while creating the plotting scripts in the tesstrain repo. The script uses |
Environment
Current Behavior:
Tesseract modifies the global locale of any consumer that links the API, if Tesseract was built as a Debug build.
Expected Behavior:
Tesseract does not modify the locale.
Suggested Fix:
The following code in TessBaseAPI::TessBaseAPI() and its preceding code in 4.1.1 modify the global locale if a debug Tesseract is linked to a program. That would be the case when consuming Tesseract through Conan with -s build_type=Debug.
In our case, it's changing the number formatting of one of our products. We are already careful to change the locale to "C" and restore the locale when calling Tesseract, but this resulted in a difficult-to-understand bug that we initially thought was in our code.
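The caller-side workaround described above (switch to "C", call Tesseract, restore) can be sketched as an RAII guard. This is an illustration under assumptions, not the reporter's actual code: `ScopedCLocale` and `locale_inside_guard` are hypothetical names, and the guard only helps if the library does not change the locale again behind the caller's back.

```cpp
#include <clocale>
#include <string>

// Hypothetical guard: switches the global locale to "C" for its lifetime
// and restores the previous locale on destruction.
class ScopedCLocale {
 public:
  ScopedCLocale() {
    const char* current = std::setlocale(LC_ALL, nullptr);
    // Copy the name: the returned buffer may be overwritten later.
    if (current) saved_ = current;
    std::setlocale(LC_ALL, "C");
  }
  ~ScopedCLocale() {
    if (!saved_.empty()) std::setlocale(LC_ALL, saved_.c_str());
  }
  ScopedCLocale(const ScopedCLocale&) = delete;
  ScopedCLocale& operator=(const ScopedCLocale&) = delete;

 private:
  std::string saved_;
};

// Hypothetical usage: the body stands in for a call into the library.
std::string locale_inside_guard() {
  ScopedCLocale guard;
  return std::setlocale(LC_ALL, nullptr);  // "C" while the guard is alive
}
```

Note that `setlocale` acts on the whole process and is not safe to call concurrently with other locale-dependent code, which is one more reason for a library not to modify the global locale itself.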