
tess4j does not work with tesseract 4.0.0-beta.3-199-gba757 #106

Closed
6FA3T opened this issue Jul 11, 2018 · 13 comments

Comments

@6FA3T

6FA3T commented Jul 11, 2018

I am using tess4j with the following Maven dependency:
<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>4.0.2</version>
</dependency>
But it doesn't work with tesseract-4.0.0-beta.3-199-gba757.
When I start my service and initialize tess4j, the service crashes.
My service runs on CentOS 7.4, and the tesseract version information is:
tesseract 4.0.0-beta.3-199-gba757
leptonica-1.76.0
libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0
Found AVX2
Found AVX
Found SSE

The error log shows:

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007f11f9d4104a, pid=41510, tid=0x00007f1202fef700

JRE version: Java(TM) SE Runtime Environment (8.0_152-b16) (build 1.8.0_152-b16)
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.152-b16 mixed mode linux-amd64 compressed oops)
Problematic frame:
C [libtesseract.so.4+0x27704a] ERRCODE::error(char const*, TessErrorLogCode, char const*, ...) const+0x16a

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

If you would like to submit a bug report, please visit:
http://bugreport.java.com/bugreport/crash.jsp
The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.


I cloned the latest code and built it myself, but it still does not work. I noticed that the latest code is supposed to support beta.3.
The error info:

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007ff5b0f3304a, pid=95930, tid=0x00007ff5b9ee0700

JRE version: Java(TM) SE Runtime Environment (8.0_152-b16) (build 1.8.0_152-b16)
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.152-b16 mixed mode linux-amd64 compressed oops)
Problematic frame:
C [libtesseract.so.4+0x27704a] ERRCODE::error(char const*, TessErrorLogCode, char const*, ...) const+0x16a

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

If you would like to submit a bug report, please visit:
http://bugreport.java.com/bugreport/crash.jsp
The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.

--------------- T H R E A D ---------------

Current thread (0x00007ff60db13000): JavaThread "http-nio-80-exec-1" daemon [_thread_in_native, id=96070, stack(0x00007ff5b9de0000,0x00007ff5b9ee1000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000

@nguyenq
Owner

nguyenq commented Jul 13, 2018

The current release was tested against tesseract-4.0.0-beta.1 on Ubuntu 18.04. Let me see if I can find time this weekend to compile, install tesseract-4.0.0-beta.3, and test tess4j against it.

@6FA3T
Author

6FA3T commented Jul 14, 2018

@nguyenq I solved the problem with 'export LC_ALL=C'. I found this out from the conversation in #105. I don't know why tess4j-4.1.0-snapshot with tesseract-4.0.0-beta.3 needs this setup when tess4j-4.0.2-snapshot with tesseract-4.0.0-beta.1 did not. If you know anything about this, please tell me. Thank you for your help.

@nguyenq
Owner

nguyenq commented Jul 14, 2018

After successful build and install of tesseract 4.0.0-beta.3, I encountered a hard JVM crash when attempting to execute tess4j's unit tests.

!strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 201

Executing export LC_ALL=C at the terminal allowed the unit tests to complete successfully.

The exception stems from a recent commit in tesseract-4.0.0-beta.3 in an attempt to resolve the locale issue. That fix seems to have caused other issues.

Duplicate of #105.

@nguyenq nguyenq closed this as completed Jul 14, 2018
@nguyenq
Owner

nguyenq commented Jul 14, 2018

$ tesseract -v
tesseract 4.0.0-beta.3-204-gf47da
 leptonica-1.76.0
  libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found SSE


$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=


$ export LC_ALL=C
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=C

$ mvn test

@satishsojitra

satishsojitra commented Nov 1, 2018

Hi All,
Operating System
Ubuntu 16.04 LTS

Tesseract
tesseract 4.0.0-6-g3ac3
leptonica-1.77.0
libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
Found AVX
Found SSE

Tess4j

<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>4.2.2</version>
</dependency>

Exception
!strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 209

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007fe043d13e0d, pid=3926, tid=0x00007fe044a00700

JRE version: OpenJDK Runtime Environment (8.0_181-b13) (build 1.8.0_181-8u181-b13-1ubuntu0.16.04.1-b13)
Java VM: OpenJDK 64-Bit Server VM (25.181-b13 mixed mode linux-amd64 compressed oops)
Problematic frame:
C [libtesseract.so+0x2b8e0d] ERRCODE::error(char const*, TessErrorLogCode, char const*, ...) const+0x16d

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

An error report file with more information is saved as:
/tmp/hs_err_pid3926.log

If you would like to submit a bug report, please visit:
http://bugreport.java.com/bugreport/crash.jsp
The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.

Even after executing export LC_ALL=C, I am facing the same issue, and the terminal no longer opens after restarting the system. What is the correct solution?

@nguyenq
Owner

nguyenq commented Nov 1, 2018

The library has been developed and tested on Ubuntu 18.10 and Oracle Java 11.

Due to Tesseract's locale requirements, export LC_ALL=C is required before running any client programs.

@satishsojitra

@nguyenq I think there should also be a solution for older versions of Ubuntu, because after exporting LC_ALL=C, the terminal stopped working, and so did IntelliJ IDEA.
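
One way to keep LC_ALL=C from affecting the terminal or the IDE is to set it only for the process that runs the OCR code, instead of exporting it for the whole session. The following is a minimal sketch, assuming the OCR application can be started as a child process. The class name ScopedLocaleLauncher and the example command line in the comment are hypothetical and not part of tess4j.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class ScopedLocaleLauncher {

    /**
     * Prepares a child process whose environment contains LC_ALL=C.
     * The environment of the current (parent) process is not modified.
     */
    static ProcessBuilder withCLocale(List<String> command) {
        ProcessBuilder builder = new ProcessBuilder(command);
        // environment() is the child's own map, initialized from the parent's environment
        builder.environment().put("LC_ALL", "C");
        builder.inheritIO(); // forward the child's output to this console
        return builder;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length == 0) {
            // Hypothetical usage: java ScopedLocaleLauncher java -jar ocr-service.jar
            System.out.println("usage: ScopedLocaleLauncher <command> [args...]");
            return;
        }
        ProcessBuilder builder = withCLocale(Arrays.asList(args));
        int exitCode = builder.start().waitFor();
        System.out.println("child exited with " + exitCode);
    }
}
```

The shell equivalent is to prefix a single command, for example LC_ALL=C mvn test, which sets the variable for that command only and leaves the rest of the session unchanged.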

@martin-huber

martin-huber commented Nov 9, 2018

In my opinion, it cannot be right that you have to run a Java application (or a J2EE container, or wherever you run your Java application) with LC_ALL=C!
This will also have side effects on the entire application.

This problem was introduced with the following commit 5 months ago:
tesseract-ocr/tesseract@3292484
The code change was

  locale = std::setlocale(LC_ALL, nullptr);
  ASSERT_HOST(!strcmp(locale, "C"));
  locale = std::setlocale(LC_CTYPE, nullptr);
  ASSERT_HOST(!strcmp(locale, "C"));
  locale = std::setlocale(LC_NUMERIC, nullptr);
  ASSERT_HOST(!strcmp(locale, "C"));

The documentation of std::setlocale says:

The setlocale function installs the specified system locale or its portion as the new C locale. The modifications remain in effect and influences the execution of all locale-sensitive C library functions until the next call to setlocale. If locale is a null pointer, setlocale queries the current C locale without modifying it.

So if I understand it right, tesseract either tries to set the C locale, or it only checks whether it is set and raises an error if it is not.

The commit comment says

Normal C++ programs like those which are built for tesseract automatically
set the locale "C".

There can be different locale settings if the tesseract library is used
in other software.

A wrong locale can cause wrong results from sscanf which is used at
different places in the tesseract code, so make sure that we have the
right locale settings and fail if that is not the case.
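
The concern in this commit message can be illustrated with a Java analogue. This is only a sketch of the general problem (locale-dependent decimal separators), not of Tesseract's own sscanf calls, and the class and method names are made up for the illustration. A number written with German conventions uses a comma, and a parser that expects a dot rejects it:

```java
import java.util.Locale;

public class DecimalSeparatorDemo {

    /** Formats a value with one decimal digit using the given locale's conventions. */
    static String format(Locale locale, double value) {
        return String.format(locale, "%.1f", value);
    }

    /** Returns true if the text parses as a double with '.' as the decimal separator. */
    static boolean parsesWithDot(String text) {
        try {
            Double.parseDouble(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String neutral = format(Locale.ROOT, 12.5);    // "12.5"
        String german = format(Locale.GERMANY, 12.5);  // "12,5"
        System.out.println(neutral + " parses: " + parsesWithDot(neutral));
        System.out.println(german + " parses: " + parsesWithDot(german));
    }
}
```

In C, the floating-point conversions of sscanf follow the LC_NUMERIC category in a similar way, which is why the commit insists on the "C" locale.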

So the core problem is one of two things. Either a dynamic library loaded by JNA is not able to change the locale (if the above code can be read as an attempt to set the locale for tesseract), or, if the above code only verifies that the locale is set, the right way would be to somehow set the "C" locale for all tesseract JNA calls.

I searched for something like this, but found nothing that helped.
I tried setting the default locale in my Java application to Locale.ROOT, but this didn't help either.

Is there anybody who understands exactly what the C++ code is trying to do? Does it try to SET the locale, or does it only QUERY the C locale and check whether it is set?
Could it be an option for tesseract to provide an API method that really SETS the C locale, which tess4j could then use to initialize the locale?

Cheers,
Martin

@martin-huber

I tried several things in my Java code; what finally made my test pass was the following snippet:

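    // Imports used by this snippet. JUnit 4 is assumed here; the post does not say which version was used.
    import static org.junit.Assert.assertEquals;

    import com.sun.jna.Library;
    import com.sun.jna.Native;
    import com.sun.jna.Platform;
    import java.awt.image.BufferedImage;
    import java.io.IOException;
    import java.net.URISyntaxException;
    import java.util.Locale;
    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.TesseractException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.rendering.ImageType;
    import org.apache.pdfbox.rendering.PDFRenderer;
    import org.junit.Test;

    // Minimal JNA binding to the C runtime, used only to call setlocale()
    public interface CLibrary extends Library {
        CLibrary INSTANCE = Native.loadLibrary((Platform.isWindows() ? "msvcrt" : "c"), CLibrary.class);

        // Category values as defined by glibc on Linux; other platforms
        // (for example macOS and Windows) define different values in <locale.h>
        int LC_CTYPE=0;
        int LC_NUMERIC=1;
        int LC_ALL=6;

        // char *setlocale(int category, const char *locale);
        String setlocale(int category, String locale);
    }

    @Test
    public void testTess4J() throws IOException, TesseractException, URISyntaxException {
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath(getTessdataPath().getAbsolutePath());
        tesseract.setLanguage("deu+eng+fra+ita");

        Locale defaultLocale = Locale.getDefault();

        // Set the native C locale before the first OCR call
        CLibrary.INSTANCE.setlocale(CLibrary.LC_ALL, "C");
        CLibrary.INSTANCE.setlocale(CLibrary.LC_NUMERIC, "C");
        CLibrary.INSTANCE.setlocale(CLibrary.LC_CTYPE, "C");

        // The native call must not change the Java default locale
        assertEquals(defaultLocale, Locale.getDefault());

        tesseract.setTessVariable("page_separator", "");
        tesseract.setTessVariable("preserve_interword_spaces", "1");

        // Render the first page of the test PDF at 750 DPI in grayscale and run OCR on it
        PDFRenderer pdfRenderer = new PDFRenderer(PDDocument.load(OCRTest.class.getResourceAsStream("/ocr/eurotext.pdf")));
        BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(0, 750, ImageType.GRAY);
        String result = tesseract.doOCR(bufferedImage).trim();

        assertEquals("The (quick) [brown] {fox} jumps!\n" +
                "Over the $43,456.78 <lazy> #90 dog\n" +
                "& duck/goose, as 12.5% of E-mail\n" +
                "from [email protected] is spam.\n" +
                "Der „schnelle” braune Fuchs springt\n" +
                "über den faulen Hund. Le renard brun\n" +
                "«rapide» saute par-dessus le chien\n" +
                "paresseux. La volpe marrone rapida\n" +
                "salta sopra il cane pigro. EI zorro\n" +
                "marron rapido salta sobre el perro\n" +
                "perezoso. A raposa marrom rapida\n" +
                "salta sobre o cào preguicoso.", result);
    }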

But I am not sure whether setting the locale to "C" using JNA has side effects ...

At least it does not affect the Java default locale.
But since Java will sooner or later also make native calls to the underlying platform, it may well have an effect.

@Jeff201812

Example code for running Tess4J is given at "http://tess4j.sourceforge.net/codesample.html". However, when I ran it on macOS in IntelliJ, it failed with the error message "!strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 209" followed by the messages below:

A fatal error has been detected by the Java Runtime Environment:

SIGILL (0x4) at pc=0x000000012183ca4f, pid=44026, tid=0x0000000000001903

JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)

Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode bsd-amd64 compressed oops)

Problematic frame:

C [libtesseract.dylib+0x156a4f] ERRCODE::error(char const*, TessErrorLogCode, char const*, ...) const+0x183

From my reading of the posts above, this appears to be related to exporting LC_ALL=C, but I am not sure exactly what code should be added or amended in the given example to make it run.

@nguyenq
Owner

nguyenq commented Dec 17, 2018

@Jeff201812 Open a command line and execute export LC_ALL=C before launching IntelliJ from the same command prompt.

Or try CLibrary.INSTANCE.setlocale given above in your Java code.
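
As for where such a call belongs: the C locale is process-wide state, so it should be enough to set it once before the first Tesseract call, as long as nothing else in the process changes it afterwards. A minimal sketch of a run-once guard follows. NativeLocaleGuard is a hypothetical helper, not part of tess4j, and the Runnable stands in for the CLibrary.INSTANCE.setlocale calls from the earlier comment.

```java
public class NativeLocaleGuard {
    private boolean initialized = false;

    /**
     * Runs the given action the first time it is called and does nothing afterwards.
     * Returns true if the action ran during this call.
     */
    public synchronized boolean ensureInitialized(Runnable setNativeLocale) {
        if (initialized) {
            return false;
        }
        setNativeLocale.run();
        initialized = true;
        return true;
    }

    public static void main(String[] args) {
        NativeLocaleGuard guard = new NativeLocaleGuard();
        // Placeholder action; in a real application this could be, for example,
        // () -> CLibrary.INSTANCE.setlocale(CLibrary.LC_ALL, "C")
        Runnable placeholder = () -> System.out.println("setting native locale to C");

        guard.ensureInitialized(placeholder); // runs the action
        guard.ensureInitialized(placeholder); // does nothing
        // ... create the Tesseract instance and call doOCR after this point
    }
}
```

Calling the guard at the start of every code path that reaches doOCR keeps the setup in one place without repeating the native call.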

@Jeff201812

Jeff201812 commented Dec 17, 2018 via email

@nao-de

nao-de commented Jan 24, 2019

So what's the resolution here? Can we add the export LC_ALL step to the docs since everyone is just going to hit this error and the error message is super cryptic? Isn't there any way tess4j can set the variable before executing calls?
