Problems with setting the "classify_bln_numeric_mode" variable #52

Closed
simonmb opened this issue Dec 5, 2013 · 22 comments

@simonmb

simonmb commented Dec 5, 2013

Hi, I'm trying to read some numbers from a scanned document.
My first test used the BaseApiTester. The only changes I made to the project were setting the path to the image and adding the following line:

bool ret = engine.SetVariable("classify_bln_numeric_mode", 1);

before the line:

using (var page = engine.Process(img)){

Whenever I set the variable, the program crashes. If I don't set it, the program does what it is supposed to.

I checked here (http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version) to see which variable to set.

The error I'm getting is:

System.AccessViolationException
Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

And in the command window:

Process image
first_unichar != NULL:Error:Assert failed:in file .\wordrec\language_model.cpp, line 445

Thanks

@charlesw
Owner

charlesw commented Dec 6, 2013

I've had a look and been able to reproduce the issue. From the looks of it, this is an issue in tesseract 3.02 itself. I'll try to set up a build using the latest tesseract sources over the weekend and see if that resolves it, and if not, file an issue with them. For reference, I found https://code.google.com/p/tesseract-ocr/issues/detail?id=261 from the 3.01 days, with a comment that looks like the same issue; however, it looks like it wasn't resolved (no further comments).

@ghost ghost assigned charlesw Dec 26, 2013
@robmathews

I have the same issue with 3.02. Did you ever try building against the latest svn sources?

@charlesw
Owner

No, not yet. However, I've just noticed they've released version 3.3, so I'll be looking to upgrade soon.

@robmathews

Where did you see that they released 3.3? The latest version I see on the
website is 3.02 ...

@charlesw
Owner

See https://code.google.com/p/tesseract-ocr/source/browse/trunk/ChangeLog. It looks like they haven't produced the binaries yet.

@charlesw
Owner

OK, looks like I got a little ahead of myself; see https://groups.google.com/forum/#!topic/tesseract-dev/UvqR53IfCgA. Anyway, it looks like we're stuck with 3.02 until 3.03 is released, as the language data files are tied to a given version of tesseract.

@zdenop

zdenop commented Jan 14, 2014

@charlesw: Can you test 3.03 from svn? If you need changes in it, you need to send a patch ASAP; the official release should be at the end of January...

charlesw added a commit that referenced this issue Jan 14, 2014
Notes:
* x86 only (no x64 build yet)
* based on rev 987
* Highly likely that tesseract and leptonica have different VS runtimes (tesseract was compiled with VS 2013, leptonica with VS 2012); this will be fixed when I generate the x64 builds.
* Confirmed Issue #52 has been resolved.
* There could be interop incompatibilities (UNTESTED).
@charlesw
Owner

@zdenop I can confirm that this issue has been resolved in the latest 3.03 release (revision 987); no patches are required. Thanks.

@simonmb
Author

simonmb commented Jan 14, 2014

Thanks, works for me as well.

@robmathews

Stupid question: does this parameter help with recognizing only digits? I mean, that's why I got into it in the first place, but now that I have to rebuild against the latest source, I'm wondering if it even helps much with that problem ...

@charlesw
Owner

Yes, I believe that's its purpose. Note that in my very brief tests it also recognised other characters that can appear in numbers, for example ',$.
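
If only the digits themselves are wanted, one option is to post-filter the recognised text. Below is a minimal sketch in C++, assuming the OCR output is already available as a UTF-8 string. `keep_digits` is a hypothetical helper written for illustration; it is not part of tesseract or of this wrapper.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not a tesseract API): keeps only the ASCII digits
// from recognised text and drops everything else, including separators
// and currency symbols such as ',' and '$'.
std::string keep_digits(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (unsigned char c : text) {
        // Cast to unsigned char first: std::isdigit is undefined for
        // negative values, which bytes of multi-byte UTF-8 text would be.
        if (std::isdigit(c)) {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

Calling `keep_digits` on the text returned by the engine leaves only the digit characters. Note that this also discards decimal points, so it is only suitable when the separators carry no meaning.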

@robmathews

FWIW, I compiled the new binaries from revision 998 of svn and it still
crashes in the digit mode like this:

(lldb) frame info
frame #12: 0x00258964 OCR-Example2`tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) + 724

which is less than clear.

@charlesw
Owner

Did you try just using the dev_3.03 branch I created? There's also a test for this issue there. All you should need to do is get the source, run prepare.bat, and then load the solution in VS. I'll do a proper release once 3.03 has officially been released.

@robmathews

No... actually, I discovered that I hadn't cleaned since the last time I built from source, so the build was invalid. Doh.

When I actually build from source, the library builds without errors, but when I link it against my test app, the linker claims it can't find the baseapi class. I get 8 link errors like this:

"tesseract::TessBaseAPI::GetUTF8Text()", referenced from:

Has this class actually been removed? I can see the source for it, but it
doesn't seem to get compiled into the code (I examined the libraries and
dumped the symbols to be sure)

@robmathews

OK, so I missed a build error: it's trying to include "tiff.h", even though it is configured with --no-libtiff. So I guess revision 998 isn't valid.

I looked for your dev_3.03 branch and don't see it anywhere here:
https://code.google.com/p/tesseract-ocr/source/browse/#svn%2Ftags%253Fstate%253Dclosed

Are you looking somewhere else?

Rob.

@robmathews

Well, this diff looks pretty suspicious:
https://code.google.com/p/tesseract-ocr/source/diff?spec=svn990&r=990&format=side&path=/trunk/opencl/openclwrapper.h

And I'm 2 revisions before that .. looks like I need revision 990

@robmathews

And revision 990 doesn't build on OSX, because it tries to link against librt, which isn't available on OSX.

To be precise, I get this error:

ld: library not found for -lrt
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Complete build log here: https://gist.github.com/8448535

Help?

@zdenop

zdenop commented Jan 16, 2014

@robmathews: I can try to fix it, but I need an OSX tester. Please send your e-mail address to zdenop (at) gmail (dot) com.

@charlesw
Owner

@robmathews looks like we're at cross purposes: I was talking about this .NET wrapper, not about compiling / linking the native tesseract-ocr library. The dev_3.03 branch is on THIS repository, not tesseract-ocr's, and only includes the native DLLs I compiled using VS 2013 Express on Windows 8.1. I'm not an expert on C++ / C and haven't really done any work involving C++ for a LONG time, so I'm going to be of limited help here :/

@robmathews

@charles - which is "THIS" repository? Is there some other repository that folks use? Although you are correct that I'm not majorly interested in the .NET version ;)

@zdenop - I have a project here for iOS: https://github.com/robmathews/OCR-iOS-Example. Once we get tesseract building, you might consider referencing it in the release notes as the build for the iPhone project. I took an older iOS project, published a year or so ago at http://tinsuke.wordpress.com/2011/11/01/how-to-compile-and-use-tesseract-3-01-on-ios-sdk-5/, and updated it for the latest Xcode, leptonica, and tesseract. Anyway, compiling for the iPhone is one of those things that's hard.

Rob.

@charlesw
Owner

@robmathews "THIS" repository is the one hosting this issue (https://github.com/charlesw/tesseract) and is an unofficial .NET wrapper for tesseract. It doesn't contain the tesseract sources; for that you'll need the svn repo from the official site (http://tesseract-ocr.googlecode.com/svn/trunk/), which I'm assuming you've got.

@robmathews

@charles. yes, exactly.
