pixConvertToPdf output invalid due to LC_NUMERIC #591
Comments
Note: this is not covered by |
Leptonica uses ASCII characters in strings, not UTF-8. And this is the first mention I've seen of issues with decimal points (German commas) in numeric output. |
Leptonica uses C code, and Tesseract uses C++. Like most C or C++ programs, Tesseract uses the default locale. This is different for other programming languages, especially for scripting languages. Python uses the locale settings of the user, so like @bertsky I get the
tesseract-ocr/tesseract#1670 (comment) lists the problematic functions. Those were fixed in the Tesseract code, but I simply forgot that the Leptonica code requires similar fixes. Problematic functions (strikethrough for those which are not used in the Leptonica library):
|
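To illustrate the point about default locales, here is a minimal C sketch. A C program starts in the "C" locale and only picks up the user's settings once something calls setlocale(LC_ALL, ""), which an embedding host may already have done before the library is entered. The helper names (current_decimal_point, format_naive, adopt_user_locale) are hypothetical and not part of Leptonica or Tesseract.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: the decimal separator that printf("%f") uses
 * under the LC_NUMERIC setting currently in effect. */
const char *current_decimal_point(void)
{
    return localeconv()->decimal_point;
}

/* Hypothetical helper: format a value the way a naive writer might.
 * The separator in the result depends on LC_NUMERIC. */
int format_naive(char *buf, size_t size, double v)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "%.2f", v);
}

/* Hypothetical helper: opt in to the user's locale, as some hosts do
 * at startup. Returns 1 on success, 0 if the locale is unavailable. */
int adopt_user_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}
```

In a fresh process current_decimal_point() returns "." because no locale has been adopted yet. After adopt_user_locale() under de_DE.UTF-8, the same format_naive() call would be expected to emit a comma, assuming that locale is installed on the system.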
Thanks @stweil for your analysis! Just two points:
|
The |
The CLI does have that problem, see above. |
How do you run |
For PDF generation, the problem is |
Run with |
I did. Running |
@bertsky, are you using an older debug build of |
Yes, I think I am (I usually just switch the source files underneath the builddir). What setting are you referring to specifically? |
See commit tesseract-ocr/tesseract@7c975a0. |
Yes, I definitely still have that (and the build was a debug config). |
Lines 1919 to 1920 in a49f60a
|
Just a meta-comment. The purpose of standards is to avoid variances, simplify implementation and allow interoperability, such as making sure all railway lines have the same gauge, or that all functions in a standard library are supported with the same interface. Perhaps the PDF standard did not specify the ASCII representation of a decimal point? |
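On the decimal point question: PDF syntax writes real numbers with a period, so a writer has to produce '.' no matter what LC_NUMERIC says. One possible approach, sketched below, avoids floating-point printf conversions entirely and formats the integer and fractional parts as integers, which LC_NUMERIC does not alter. The function name format_pdf_real and the fixed three fractional digits are assumptions for illustration, not Leptonica's actual code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write v with exactly three fractional digits and
 * a '.' separator, independent of the current locale. Assumes v is finite
 * and small enough that v * 1000 fits in a long long. Returns the number
 * of characters snprintf would write, excluding the terminator. */
int format_pdf_real(char *buf, size_t size, double v)
{
    const char *sign = (v < 0.0) ? "-" : "";
    double mag = (v < 0.0) ? -v : v;
    long long scaled;

    assert(buf != NULL && size > 0);

    /* Round to the nearest thousandth using integer arithmetic. */
    scaled = (long long)(mag * 1000.0 + 0.5);
    if (scaled == 0)
        sign = "";  /* avoid printing "-0.000" */

    /* Integer conversions are not affected by LC_NUMERIC. */
    return snprintf(buf, size, "%s%lld.%03lld",
                    sign, scaled / 1000, scaled % 1000);
}
```

Unlike temporarily switching the locale, this does not touch global state, so it should also be safe when other threads depend on the process locale.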
"A real value is written as one or more decimal digits with an optional sign and a |
Dan, about half of the C files in |
The 145 or so prog/*_reg.c programs call regTestSetup() and regTestCleanup(). The other approximately 150 programs do not call these functions, but just about all of them read some data. |
Is setlocale() recognized on all the platforms we support? |
Ideally all test programs would call a common startup code from |
If there is a test program for the PDF writing functionality, it won't detect the issue unless it tries to validate the created PDF file, like @bertsky did with ghostscript. |
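Short of running a full validator, a test could at least scan the uncompressed numeric parts of its output for comma decimals. The following is only a heuristic sketch: count_comma_decimals is a hypothetical name, and a digit-comma-digit sequence can also occur legitimately inside text strings such as a title, so a hit would need a closer look rather than counting as proof.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count occurrences of digit ',' digit in text.
 * PDF operands are separated by whitespace, so such a sequence in a
 * content stream or dictionary hints at locale-formatted numbers. */
int count_comma_decimals(const char *text)
{
    int count = 0;
    size_t i;

    assert(text != NULL);
    if (text[0] == '\0')
        return 0;

    /* Start at 1 so text[i - 1] is valid; stop before the last char
     * so text[i + 1] is valid too. */
    for (i = 1; text[i] != '\0' && text[i + 1] != '\0'; i++) {
        if (text[i] == ',' &&
            isdigit((unsigned char)text[i - 1]) &&
            isdigit((unsigned char)text[i + 1]))
            count++;
    }
    return count;
}
```

A check like this only helps if the test actually runs under a locale with a comma decimal separator, for example after setlocale(LC_ALL, "") with de_DE.UTF-8 in the environment.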
I am trying to pixConvertToPdf a 1bpp image (with cmap I think), but the created PDF file seems to be invalid:
After poking around, I found that I can get the function to produce valid results by running with LC_NUMERIC=C (instead of my default, which is de_DE.UTF-8).
Here's a single hunk of the diff between invalid (localized) and valid (non-localized) output:
(Unlike English, German uses a comma as the decimal separator and a dot as the thousands separator.)
(This is also relevant for Tesseract BTW.)
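For callers that cannot control the environment, the LC_NUMERIC=C workaround can also be applied inside the program. The sketch below wraps an arbitrary callback, forcing LC_NUMERIC to "C" for its duration and restoring the previous setting afterwards. The names run_with_c_numeric and write_scale are hypothetical, and write_scale merely stands in for the real work (for example a call that writes a PDF). Note that setlocale() changes process-wide state, so this is not safe if other threads format numbers at the same time.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: run fn(arg) with LC_NUMERIC set to "C", then
 * restore the caller's setting. Returns 0 on success, -1 on failure. */
int run_with_c_numeric(void (*fn)(void *), void *arg)
{
    char saved[128];
    const char *cur = setlocale(LC_NUMERIC, NULL);  /* query only */

    assert(fn != NULL);
    if (cur == NULL || strlen(cur) >= sizeof saved)
        return -1;
    /* Copy the name: later setlocale() calls may overwrite the string. */
    strcpy(saved, cur);

    if (setlocale(LC_NUMERIC, "C") == NULL)
        return -1;
    fn(arg);
    setlocale(LC_NUMERIC, saved);
    return 0;
}

/* Stand-in for locale-sensitive work; arg must point to >= 32 bytes. */
void write_scale(void *arg)
{
    snprintf((char *)arg, 32, "%.2f 0 0 %.2f 0 0 cm", 0.5, 0.5);
}
```

Calling run_with_c_numeric(write_scale, buffer) fills the buffer with period-separated values even when the surrounding program has adopted a locale such as de_DE.UTF-8.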