pixConvertToPdf output invalid due to LC_NUMERIC #591

bertsky · 2021-09-23T14:55:03Z

I am trying to pixConvertToPdf a 1bpp image (with cmap I think), but the created PDF file seems to be invalid:

GPL Ghostscript 9.26 (2018-11-20)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
   **** Error: Encountered 'obj' while expecting 'endobj'.
               Treating this as a missing 'endobj', output may be incorrect.
Processing pages 1 through 17.
Page 1
   **** Error: Unknown operator: '595,2000'
   **** Error: Unknown operator: '0,0000'
   **** Error: Unknown operator: '0,0000'
   **** Error: Unknown operator: '841,9200'
   **** Error: Unknown operator: '0,0000'
   **** Error: Unknown operator: '0,0000'
   **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.
Error: /undefined in /--pdfshowpage_finish--
Operand stack:
   --dict:7/15(L)--   Annots
Execution stack:
   %interp_exit   .runexec2   --nostringval--   pdfshowpage_finish   --nostringval--   2   %stopped_push   --nostringval--   pdfshowpage_finish   pdfshowpage_finish   false   1   %stopped_push   2045   1   3   %oparray_pop   2044   1   3   %oparray_pop   2025   1   3   %oparray_pop   2026   1   3   %oparray_pop   pdfshowpage_finish   pdfshowpage_finish   2   1   17   pdfshowpage_finish   %for_pos_int_continue   2029   1   7   %oparray_pop   pdfshowpage_finish   pdfshowpage_finish   pdfshowpage_finish
Dictionary stack:
   --dict:968/1684(ro)(G)--   --dict:1/20(G)--   --dict:82/200(L)--   --dict:82/200(L)--   --dict:133/256(ro)(G)--
Current allocation mode is local
Last OS error: Resource temporarily unavailable
GPL Ghostscript 9.26: Unrecoverable error, exit code 1

After poking around, I found that I can get the function to produce valid results by running with LC_NUMERIC=C (instead of my default, which is de_DE.UTF-8).

Here's a single hunk of the diff between invalid (localized) and valid (non-localized) output:

 5 0 obj
 << /Length 61 >>
 stream
-q 595,2000 0,0000 0,0000 841,9200 0,0000 0,0000 cm /Im1 Do Q
+q 595.2000 0.0000 0.0000 841.9200 0.0000 0.0000 cm /Im1 Do Q
 endstream
 endobj
 6 0 obj

(Unlike English, German uses a comma as the decimal separator and a dot as the thousands separator.)
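The effect can be reproduced outside Leptonica with a few lines of C. The following is a minimal sketch, not Leptonica code: `formatMatrix` and `tryNumericLocale` are hypothetical helper names, the format string mirrors the content stream in the diff above, and it assumes a `de_DE.UTF-8` locale may or may not be installed on the machine.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a page matrix like the content stream above,
 * using whatever LC_NUMERIC locale is currently active. */
int formatMatrix(char *buf, size_t bufsize, double w, double h)
{
    return snprintf(buf, bufsize,
                    "q %.4f %.4f %.4f %.4f %.4f %.4f cm /Im1 Do Q",
                    w, 0.0, 0.0, h, 0.0, 0.0);
}

/* Hypothetical helper: try to switch LC_NUMERIC to `name`.
 * Returns 1 on success, 0 if that locale is not available. */
int tryNumericLocale(const char *name)
{
    return setlocale(LC_NUMERIC, name) != NULL;
}
```

Calling `tryNumericLocale("de_DE.UTF-8")` before `formatMatrix(buf, sizeof buf, 595.2, 841.92)` should yield commas in `buf` when that locale is installed; with `tryNumericLocale("C")` the same call yields periods.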

(This is also relevant for Tesseract BTW.)

bertsky · 2021-09-23T15:01:20Z

Note: this is not covered by make check (which succeeds for current master, despite my l10n).

DanBloomberg · 2021-09-23T22:12:58Z

Leptonica uses ASCII characters in strings, not UTF-8.

And this is the first mention I've seen of issues with decimal points (German commas) in numeric output.

stweil · 2021-09-24T06:28:03Z

Leptonica is written in C, and Tesseract in C++. Like most C or C++ programs, Tesseract runs in the "C" locale, which stays in effect as long as the program does not set a different locale.

This is different for other programming languages, especially scripting languages. Python uses the locale settings of the user, so like @bertsky I get the de_DE.UTF-8 locale. Java also uses the user's locale.

So anybody who uses the Tesseract library (which implies also the Leptonica library) with Python or Java will get PDF output which depends on the current locale. That output will be wrong for many locales, for example for German users. Tesseract does not use pixConvertToPdf, but might use other parts of Leptonica which depend on the locale.

tesseract-ocr/tesseract#1670 (comment) lists the problematic functions. Those were fixed in the Tesseract code, but I simply forgot that the Leptonica code requires similar fixes.

Problematic functions (strikethrough for those which are not used in the Leptonica library): atoi, ~~isspace~~, ~~strtod~~, ~~strtof~~, ~~strtol~~ (always), fscanf, sscanf, ~~printf~~, fprintf, snprintf ~~and other *printf~~ with float and double values.

git grep -l "%[0-9.]*f" src shows a list of 75 files with code which is affected by the locale settings.

bertsky · 2021-09-24T08:31:36Z

Thanks @stweil for your analysis! Just two points:

- Tesseract uses pixaConvertToPdf in two places for debugging. (The PDF renderer does not seem to use Leptonica directly, but it does call pixGenerateCIData.)
- The problem also exists with the Tesseract CLI, when no Python is involved.

stweil · 2021-09-24T08:48:00Z

The tesseract CLI does not set the locale and therefore uses the C locale, so it should not have such a problem.

bertsky · 2021-09-24T08:59:41Z

> The tesseract CLI does not set the locale and therefore uses the C locale, so it should not have such a problem.

The CLI does have that problem, see above.

stweil · 2021-09-24T09:11:23Z

How do you run tesseract to produce a PDF with commas instead of decimal points? I don't see that above.

stweil · 2021-09-24T09:23:06Z

For PDF generation, the problem is snprintf in generateContentStringPdf in l_generatePdf in pixConvertToPdfData and cidConvertToPdfData.
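One possible way to make such a call locale-independent is to format first and then map the locale's decimal separator back to a period. The sketch below illustrates that idea only; it is not the fix applied in Leptonica, `normalizeDecimalPoint` and `formatContentString` are hypothetical names, and it assumes a single-byte decimal separator and a format string that contains no other occurrence of that character.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Replace every occurrence of the separator `sep` with '.', in place.
 * Assumption: `sep` is a single byte and appears only as a decimal point. */
void normalizeDecimalPoint(char *s, char sep)
{
    if (sep == '.' || sep == '\0')
        return;
    for (; *s; s++) {
        if (*s == sep)
            *s = '.';
    }
}

/* Hypothetical stand-in for a content-string generator.
 * Returns the number of bytes written, or -1 if the buffer is too small. */
int formatContentString(char *buf, size_t bufsize,
                        double a, double d, int imageIndex)
{
    int n = snprintf(buf, bufsize,
                     "q %.4f %.4f %.4f %.4f %.4f %.4f cm /Im%d Do Q\n",
                     a, 0.0, 0.0, d, 0.0, 0.0, imageIndex);
    if (n < 0 || (size_t)n >= bufsize)
        return -1;

    /* localeconv() reports the separator that snprintf just used. */
    const struct lconv *lc = localeconv();
    normalizeDecimalPoint(buf, lc->decimal_point[0]);
    return n;
}
```

The post-processing step keeps the call sites unchanged, at the cost of relying on the separator not appearing elsewhere in the output. Formatting the integer and fractional parts separately with `%d` would avoid that assumption.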

bertsky · 2021-09-24T09:38:45Z

> How do you run tesseract to produce a PDF with commas instead of decimal points? I don't see that above.

Run with textord_tabfind_show_vlines=1 (which produces ./vhlinefinding.pdf) and/or tessedit_dump_pageseg_images=1 or textord_tabfind_show_images=1 (which produces imagepath+_debug.pdf).

stweil · 2021-09-24T10:07:17Z

I did. Running tesseract test/testing/phototest.tif /tmp/phototest -c textord_tabfind_show_vlines=1 -c tessedit_dump_pageseg_images=1 gives two PDF files, both with decimal points as expected.

stweil · 2021-09-24T10:10:33Z

@bertsky, are you using an older debug build of tesseract? That would explain why you get a different result. I had enabled locale setting for debug builds to find bugs like that, but unluckily that was removed again later.

bertsky · 2021-09-24T10:26:12Z

> are you using an older debug build of tesseract? That would explain why you get a different result. I had enabled locale setting for debug builds to find bugs like that, but unluckily that was removed again later.

Yes, I think I am (I usually just switch the source files underneath the builddir). What setting are you referring to specifically?

stweil · 2021-09-24T10:34:44Z

See commit tesseract-ocr/tesseract@7c975a0.

bertsky · 2021-09-24T10:40:26Z

> See commit tesseract-ocr/tesseract@7c975a0.

Yes, I definitely still have that (and the build was a debug config).

amitdo · 2022-11-24T09:27:13Z

> For PDF generation, the problem is snprintf in generateContentStringPdf in l_generatePdf in pixConvertToPdfData and cidConvertToPdfData.

leptonica/src/pdfio2.c, lines 1919 to 1920 in a49f60a:

    snprintf(buf, bufsize,
             "q %.4f %.4f %.4f %.4f %.4f %.4f cm /Im%d Do Q\n",

DanBloomberg · 2022-11-24T20:18:11Z

Just a meta-comment.

The purpose of standards is to avoid variances, simplify implementation and allow interoperability. Such as making sure all railway lines have the same gauge. Or all functions in a standard library are supported with the same interface.

Perhaps the PDF standard did not specify the ASCII representation of a decimal point?

stweil · 2022-11-24T20:28:05Z

"A real value is written as one or more decimal digits with an optional sign and a
leading, trailing, or embedded period (decimal point)" (cited from the old PDF reference 1.7).
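Read literally, that rule is easy to check per token. The function below is a small sketch of such a check; `isPdfNumber` is a hypothetical name, and it covers only the quoted rule (optional sign, decimal digits, at most one period), not the full PDF number syntax.

```c
#include <ctype.h>

/* Return 1 if `tok` is an optionally signed run of decimal digits with at
 * most one period (leading, trailing, or embedded), otherwise 0.
 * A comma, as produced under a German LC_NUMERIC, makes the token invalid. */
int isPdfNumber(const char *tok)
{
    int digits = 0;
    int periods = 0;

    if (*tok == '+' || *tok == '-')
        tok++;
    for (; *tok; tok++) {
        if (isdigit((unsigned char)*tok))
            digits++;
        else if (*tok == '.')
            periods++;
        else
            return 0;
    }
    return digits > 0 && periods <= 1;
}
```

With this check, `595.2000` is accepted while `595,2000` is rejected, which matches the Ghostscript "Unknown operator" errors reported at the top of the issue.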

stweil · 2022-11-24T21:01:20Z

Dan, about half of the C files in prog call function regTestSetup. Is there any other function which gets called by all test programs? I'd like to add setlocale(LC_ALL, ""); to all of them. That can be used to detect issues where code depends on locale settings.

DanBloomberg · 2022-11-24T22:20:26Z

The 145 or so prog/*_reg.c programs call regTestSetup() and regTestCleanup().

The other approximately 150 programs do not call these functions, but just about all of them read some data.
Most will call pixRead() at some point.

DanBloomberg · 2022-11-24T22:25:10Z

Is setlocale() recognized on all the platforms we support?

amitdo · 2022-11-25T06:55:45Z

setlocale() is defined in the C standard

stweil · 2022-11-25T07:33:33Z

Ideally all test programs would call common startup code from main before running test code. That startup code could call setlocale, but also enable floating-point exceptions which detect FP division by zero, overflow, calculations with NaN and others.
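A minimal sketch of what such startup code might look like follows. `testStartup` is a hypothetical name and not an existing Leptonica function. Only the locale part is shown, because enabling floating-point traps is platform-specific (for example, `feenableexcept` is a glibc extension) and is left out here.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical common startup for test programs: adopt the locale from the
 * environment so that locale-dependent formatting bugs show up in tests.
 * Returns the name of the locale that is active afterwards. */
const char *testStartup(void)
{
    const char *active = setlocale(LC_ALL, "");
    if (active == NULL) {
        /* The environment names a locale that is not installed;
         * the program stays in its previous locale (normally "C"). */
        fprintf(stderr, "testStartup: environment locale not available\n");
        active = setlocale(LC_ALL, NULL);
    }
    return active;
}
```

A test program would call `testStartup()` as the first statement in `main`. Running the tests with, say, `LC_ALL=de_DE.UTF-8` in the environment would then exercise the comma-as-decimal-point case, provided that locale is installed.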

amitdo · 2022-11-25T08:18:58Z

If there is a test program for the PDF writing functionality, it won't detect the issue unless it also validates the created PDF file, like @bertsky did with Ghostscript.

amitdo · 2022-11-27T20:14:07Z

https://stackoverflow.com/questions/4057319/is-setlocale-thread-safe-function

stweil added the bug label Nov 15, 2021

stweil self-assigned this Nov 15, 2021
