pixConvertToPdf output invalid due to LC_NUMERIC #591

Open
bertsky opened this issue Sep 23, 2021 · 24 comments

@bertsky

bertsky commented Sep 23, 2021

I am trying to pixConvertToPdf a 1bpp image (with cmap I think), but the created PDF file seems to be invalid:

GPL Ghostscript 9.26 (2018-11-20)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
   **** Error: Encountered 'obj' while expecting 'endobj'.
               Treating this as a missing 'endobj', output may be incorrect.
Processing pages 1 through 17.
Page 1
   **** Error: Unknown operator: '595,2000'
   **** Error: Unknown operator: '0,0000'
   **** Error: Unknown operator: '0,0000'
   **** Error: Unknown operator: '841,9200'
   **** Error: Unknown operator: '0,0000'
   **** Error: Unknown operator: '0,0000'
   **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.
Error: /undefined in /--pdfshowpage_finish--
Operand stack:
   --dict:7/15(L)--   Annots
Execution stack:
   %interp_exit   .runexec2   --nostringval--   pdfshowpage_finish   --nostringval--   2   %stopped_push   --nostringval--   pdfshowpage_finish   pdfshowpage_finish   false   1   %stopped_push   2045   1   3   %oparray_pop   2044   1   3   %oparray_pop   2025   1   3   %oparray_pop   2026   1   3   %oparray_pop   pdfshowpage_finish   pdfshowpage_finish   2   1   17   pdfshowpage_finish   %for_pos_int_continue   2029   1   7   %oparray_pop   pdfshowpage_finish   pdfshowpage_finish   pdfshowpage_finish
Dictionary stack:
   --dict:968/1684(ro)(G)--   --dict:1/20(G)--   --dict:82/200(L)--   --dict:82/200(L)--   --dict:133/256(ro)(G)--
Current allocation mode is local
Last OS error: Resource temporarily unavailable
GPL Ghostscript 9.26: Unrecoverable error, exit code 1

After poking around, I found that I can get the function to produce valid results by running with LC_NUMERIC=C (instead of my default, which is de_DE.UTF-8).

Here's a single hunk of the diff between invalid (localized) and valid (non-localized) output:

 5 0 obj
 << /Length 61 >>
 stream
-q 595,2000 0,0000 0,0000 841,9200 0,0000 0,0000 cm /Im1 Do Q
+q 595.2000 0.0000 0.0000 841.9200 0.0000 0.0000 cm /Im1 Do Q
 endstream
 endobj
 6 0 obj

(Unlike English, German uses a comma as the decimal separator and a dot as the thousands separator.)

(This is also relevant for Tesseract BTW.)
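
A minimal sketch of the effect, assuming a system where the de_DE.UTF-8 locale is installed. The helper name `format_under_locale` is hypothetical and is not part of Leptonica; it only shows that the same `%.4f` conversion yields a different separator once LC_NUMERIC changes.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Format `value` with "%.4f" while LC_NUMERIC is set to `locale_name`.
 * Returns 0 on success, -1 if that locale is not available. */
int format_under_locale(const char *locale_name, double value,
                        char *buf, size_t bufsize)
{
    assert(locale_name != NULL && buf != NULL && bufsize > 0);

    /* Copy the current setting: setlocale() may reuse its return buffer. */
    char saved[128];
    const char *cur = setlocale(LC_NUMERIC, NULL);
    snprintf(saved, sizeof saved, "%s", cur ? cur : "C");

    if (setlocale(LC_NUMERIC, locale_name) == NULL)
        return -1;                      /* locale not installed */

    snprintf(buf, bufsize, "%.4f", value);

    setlocale(LC_NUMERIC, saved);       /* restore the previous setting */
    return 0;
}
```

Calling it with "C" and 595.2 gives `595.2000`; with "de_DE.UTF-8" (where available) the separator becomes a comma, which matches the invalid content stream in the diff.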

@bertsky
Author

bertsky commented Sep 23, 2021

Note: this is not covered by make check (which succeeds for current master, despite my l10n).

@DanBloomberg
Owner

leptonica uses ascii characters in strings, not utf-8.

And this is the first mention I've seen of issues with decimal points (German commas) in numeric output.

@stweil
Collaborator

stweil commented Sep 24, 2021

Leptonica is written in C, and Tesseract in C++. Like most C or C++ programs, Tesseract runs in the default "C" locale, which stays in effect as long as the program does not set a different locale.

This is different for other programming languages, especially for scripting languages. Python uses the locale settings of the user, so like @bertsky I get the de_DE.UTF-8 locale. Java also uses the user's locale.

So anybody who uses the Tesseract library (which implies also the Leptonica library) with Python or Java will get PDF output which depends on the current locale. That output will be wrong for many locales, for example for German users. Tesseract does not use pixConvertToPdf, but might use other parts of Leptonica which depend on the locale.

tesseract-ocr/tesseract#1670 (comment) lists the problematic functions. Those were fixed in the Tesseract code, but I simply forgot that the Leptonica code requires similar fixes.

Problematic functions (strikethrough for those which are not used in the Leptonica library): atoi, isspace, strtod, strtof, strtol (always), fscanf, sscanf, printf, fprintf, snprintf and other *printf with float and double values.

git grep -l "%[0-9.]*f" src shows a list of 75 files with code which is affected by the locale settings.
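
The parsing side of that list can be sketched as follows, assuming a POSIX.1-2008 system that provides `newlocale` and `uselocale` (on macOS these need `<xlocale.h>`; Windows has no direct equivalent). The name `parse_double_c` is hypothetical and not a Leptonica or Tesseract function. It switches only the calling thread to "C" numeric conventions around `strtod`, so the process-wide locale is left alone.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Parse a double with '.' as the decimal separator, whatever the
 * process locale is. Returns 0 on success, -1 on failure. */
int parse_double_c(const char *text, double *out)
{
    assert(text != NULL && out != NULL);

    locale_t c_loc = newlocale(LC_NUMERIC_MASK, "C", (locale_t)0);
    if (c_loc == (locale_t)0)
        return -1;

    locale_t old = uselocale(c_loc);    /* affects this thread only */
    char *end = NULL;
    double v = strtod(text, &end);
    uselocale(old);                     /* restore before freeing */
    freelocale(c_loc);

    if (end == text)
        return -1;                      /* no digits were consumed */
    *out = v;
    return 0;
}
```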

@bertsky
Author

bertsky commented Sep 24, 2021

Thanks @stweil for your analysis! Just two points:

  • Tesseract uses pixaConvertToPdf in two places for debugging. (The PDF renderer does not seem to use Leptonica directly, but it does call pixGenerateCIData.)
  • The problem exists also with the Tesseract CLI, when no Python is involved.

@stweil
Collaborator

stweil commented Sep 24, 2021

The tesseract CLI does not set the locale and therefore uses the C locale, so it should not have such a problem.

@bertsky
Author

bertsky commented Sep 24, 2021

> The tesseract CLI does not set the locale and therefore uses the C locale, so it should not have such a problem.

The CLI does have that problem, see above.

@stweil
Collaborator

stweil commented Sep 24, 2021

How do you run tesseract to produce a PDF with commas instead of decimal points? I don't see that above.

@stweil
Collaborator

stweil commented Sep 24, 2021

For PDF generation, the problem is snprintf in generateContentStringPdf in l_generatePdf in pixConvertToPdfData and cidConvertToPdfData.
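
One possible way to make such a call locale-independent, shown as a sketch rather than as the fix adopted by the project: format each number separately and rewrite the locale's decimal separator to a period. The names `format_pdf_real` and `make_content_string` and the parameter names are hypothetical; only the output format is taken from the content stream in the diff above.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Format one value as "%.4f", then force '.' as the decimal separator.
 * Returns 0 on success, -1 if the buffer is too small. */
static int format_pdf_real(double value, char *buf, size_t bufsize)
{
    assert(buf != NULL && bufsize > 0);
    int n = snprintf(buf, bufsize, "%.4f", value);
    if (n < 0 || (size_t)n >= bufsize)
        return -1;

    /* decimal_point is "," under de_DE and may be multi-byte elsewhere. */
    const char *dp = localeconv()->decimal_point;
    size_t dplen = strlen(dp);
    if (dplen > 0 && strcmp(dp, ".") != 0) {
        char *hit = strstr(buf, dp);
        if (hit != NULL) {
            *hit = '.';
            memmove(hit + 1, hit + dplen, strlen(hit + dplen) + 1);
        }
    }
    return 0;
}

/* Hypothetical stand-in for the image placement string in the diff. */
int make_content_string(double wpt, double hpt, double xpt, double ypt,
                        int index, char *buf, size_t bufsize)
{
    char w[64], h[64], x[64], y[64], zero[64];
    if (format_pdf_real(wpt, w, sizeof w) != 0 ||
        format_pdf_real(hpt, h, sizeof h) != 0 ||
        format_pdf_real(xpt, x, sizeof x) != 0 ||
        format_pdf_real(ypt, y, sizeof y) != 0 ||
        format_pdf_real(0.0, zero, sizeof zero) != 0)
        return -1;

    int n = snprintf(buf, bufsize, "q %s %s %s %s %s %s cm /Im%d Do Q\n",
                     w, zero, zero, h, x, y, index);
    return (n < 0 || (size_t)n >= bufsize) ? -1 : 0;
}
```

Rewriting the separator per number, instead of on the finished string, avoids touching any comma that belongs to other parts of the output.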

@bertsky
Author

bertsky commented Sep 24, 2021

> How do you run tesseract to produce a PDF with commas instead of decimal points? I don't see that above.

Run with textord_tabfind_show_vlines=1 (which produces ./vhlinefinding.pdf) and/or tessedit_dump_pageseg_images=1 or textord_tabfind_show_images=1 (which produces imagepath+_debug.pdf).

@stweil
Collaborator

stweil commented Sep 24, 2021

I did. Running tesseract test/testing/phototest.tif /tmp/phototest -c textord_tabfind_show_vlines=1 -c tessedit_dump_pageseg_images=1 gives two PDF files, both with decimal points as expected.

@stweil
Collaborator

stweil commented Sep 24, 2021

@bertsky, are you using an older debug build of tesseract? That would explain why you get a different result. I had enabled locale setting for debug builds to find bugs like that, but unfortunately that was removed again later.

@bertsky
Author

bertsky commented Sep 24, 2021

> are you using an older debug build of tesseract? That would explain why you get a different result. I had enabled locale setting for debug builds to find bugs like that, but unluckily that was removed again later.

Yes, I think I am (I usually just switch the source files underneath the builddir). What setting are you referring to specifically?

@stweil
Collaborator

stweil commented Sep 24, 2021

See commit tesseract-ocr/tesseract@7c975a0.

@bertsky
Author

bertsky commented Sep 24, 2021

> See commit tesseract-ocr/tesseract@7c975a0.

Yes, I definitely still have that (and the build was a debug config).

stweil added the bug label Nov 15, 2021
stweil self-assigned this Nov 15, 2021
@amitdo
Contributor

amitdo commented Nov 24, 2022

> For PDF generation, the problem is snprintf in generateContentStringPdf in l_generatePdf in pixConvertToPdfData and cidConvertToPdfData.

leptonica/src/pdfio2.c

Lines 1919 to 1920 in a49f60a

snprintf(buf, bufsize,
"q %.4f %.4f %.4f %.4f %.4f %.4f cm /Im%d Do Q\n",

@DanBloomberg
Owner

Just a meta-comment.

The purpose of standards is to avoid variances, simplify implementation and allow interoperability, such as making sure that all railway lines have the same gauge, or that all functions in a standard library are supported with the same interface.

Perhaps the PDF standard did not specify the ascii representation of a decimal point?

@stweil
Collaborator

stweil commented Nov 24, 2022

"A real value is written as one or more decimal digits with an optional sign and a
leading, trailing, or embedded period (decimal point)" (cited from the old PDF reference 1.7).

@stweil
Collaborator

stweil commented Nov 24, 2022

Dan, about half of the C files in prog call the function regTestSetup. Is there any other function which gets called by all test programs? I'd like to add setlocale(LC_ALL, ""); to all of them. That can be used to detect issues where code depends on locale settings.

@DanBloomberg
Owner

The 145 or so prog/*_reg.c programs call regTestSetup() and regTestCleanup().

The other approximately 150 programs do not call these functions, but just about all of them read some data.
Most will call pixRead() at some point.

@DanBloomberg
Owner

Is setlocale() recognized on all the platforms we support?

@amitdo
Contributor

amitdo commented Nov 25, 2022

@stweil
Collaborator

stweil commented Nov 25, 2022

Ideally all test programs would call common startup code from main before running test code. That startup code could call setlocale, but also enable floating-point exceptions which detect FP division by zero, overflow, calculations with NaN and others.

@amitdo
Contributor

amitdo commented Nov 25, 2022

If there is a test program for the PDF writing functionality, it won't detect the issue unless it tries to validate the created PDF file, like @bertsky did with Ghostscript.
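
A cheap check that a test could run without Ghostscript, sketched under the assumption that the test has access to the uncompressed page content stream as a NUL-terminated string. The name `count_comma_decimals` is hypothetical. It should not be applied to binary image data, where such byte sequences can occur by chance.

```c
#include <assert.h>
#include <stddef.h>

static int is_ascii_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Count occurrences of digit ',' digit, which indicate numbers that
 * were written with a comma as the decimal separator. */
int count_comma_decimals(const char *stream)
{
    assert(stream != NULL);
    if (stream[0] == '\0')
        return 0;

    int count = 0;
    for (size_t i = 1; stream[i] != '\0' && stream[i + 1] != '\0'; i++) {
        if (stream[i] == ',' &&
            is_ascii_digit(stream[i - 1]) &&
            is_ascii_digit(stream[i + 1]))
            count++;
    }
    return count;
}
```

A regression test would run the PDF writer with a locale such as de_DE.UTF-8 active and require the count to be zero.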

@amitdo
Contributor

amitdo commented Nov 27, 2022
