Segmentation fault OCRing a washed out image #1601

konstantin-dzreev · 2018-05-24T21:51:53Z

I'm experimenting with tesseract on poor-quality images (very dark, very light, very low contrast, and so on), and I ran into a file that makes tesseract die with a segmentation fault.

Environment

Tesseract Version: tesseract 4.0.0-beta.1-270-g5a56
Commit Number: 5a56d0c
Platform: Linux <host-name> 4.15.0-22-generic #24-Ubuntu SMP Wed May 16 12:15:17 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux (Ubuntu 18.04)

$ tesseract --version

tesseract 4.0.0-beta.1-270-g5a56
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found SSE

Current Behavior: Segmentation fault:

$ tesseract scan1.grey.png stdout

Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 511
Segmentation fault (core dumped)

Attachment: a file to reproduce the issue


Shreeshrii · 2018-05-25T08:25:38Z

Duplicate issue - please see #1205

@zdenop Please close

amitdo · 2018-05-25T08:37:06Z

Try with this image:

amitdo · 2018-05-25T08:47:56Z

Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

#427 #468

zdenop · 2018-05-25T09:06:03Z

Duplicate

Shreeshrii · 2018-05-25T11:10:49Z

@konstantin-dzreev I am not able to reproduce the unichar_id assertion and core dump that you are getting (pasted below):

Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 511
Segmentation fault (core dumped)

My version info is the same:

 tesseract -v
tesseract 4.0.0-beta.1-270-g5a56
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0

The only difference I see is that you have:

Found AVX2
 Found AVX
 Found SSE

libpng version is also different.

@stweil Can that make a difference?

Shreeshrii · 2018-05-25T11:17:39Z

I found a different issue while processing this image with gdb (hoping to trace the crash).

Edit: made new issue #1603

Shreeshrii · 2018-05-25T11:18:57Z

Related to issue #1603 - using image posted here by OP

Log file attached here
1601.log.txt

Shreeshrii · 2018-05-25T11:30:42Z

@zdenop Please reopen the issue. The title could be edited to say 'psm 6 producing gibberish'. Thanks!

zdenop · 2018-05-25T12:56:39Z

@Shreeshrii: Your observation is different from the original issue report (which duplicates already open issues). Renaming it would just cause chaos...

Shreeshrii · 2018-05-25T12:58:27Z

@zdenop Good point. I will open a different issue for it and delete the comments from here. Thanks.

amitdo · 2018-05-25T15:58:37Z

OK, I tested it.

tesseract ~/Downloads/dark.png - --tessdata-dir ~/Downloads/tessdata/tessdata
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../src/ccutil/unicharset.h, line 511
Segmentation fault

Shreeshrii · 2018-05-25T16:01:10Z

@amitdo tesseract and leptonica version, please!

amitdo · 2018-05-25T16:01:39Z

With the fast, best, and original (lstm+legacy) tessdata, it does not crash.

Shreeshrii · 2018-05-25T16:03:17Z

Which version is in
--tessdata-dir ~/Downloads/tessdata/tessdata?

Does it correspond to the current tessdata?

amitdo · 2018-05-25T16:05:34Z

$ uname -a
Linux debian 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux

$ tesseract -v
tesseract 4.0.0-beta.1-270-g5a56
 leptonica-1.76.1
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.8 : zlib 1.2.8 : libwebp 0.5.2 : libopenjp2 2.1.2
 Found AVX
 Found SSE

amitdo · 2018-05-25T16:07:30Z

The 2 latest commits both crash.

Shreeshrii · 2018-05-25T16:09:12Z

> The 2 latest commits both crash.

Of tessdata?

amitdo · 2018-05-25T16:12:41Z

tessdata repo

v1
Nov 28, 2016
https://github.com/tesseract-ocr/tessdata/blob/4592b8d453889181e01982d22328b5846765eaad/eng.traineddata
Does not crash.

v2
March 22, 2018
https://github.com/tesseract-ocr/tessdata/blob/d87b3cbc75555bd3282e0cadab5e159e2d468396/eng.traineddata
Crash!

v3
May 10, 2018
https://github.com/tesseract-ocr/tessdata/blob/c2b2e0df86272ce11be323f23f96cf656565ed41/eng.traineddata
Crash!

Shreeshrii · 2018-05-25T16:21:39Z

@stweil recently removed the cube components from some traineddata files. It is possible that some components are labelled 'cube' but are actually used by the legacy engine. I just refreshed my tessdata files and will test again.

amitdo · 2018-05-25T16:24:38Z

If I use the manually binarized image I provided earlier with those 2 newer traineddata files from the tessdata repo, there is no crash.
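The thread does not say how the image was binarized. As a rough illustration only, the sketch below applies a global Otsu threshold to a list of 8-bit grey values. Otsu's method is a swapped-in technique chosen for this example, not something the commenter reported using, and the function names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels with value <= t
    sum_dark = 0    # sum of those pixel values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue            # no dark-class pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break               # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels):
    """Map every pixel to 0 (ink) or 255 (paper) using the Otsu threshold."""
    t = otsu_threshold(pixels)
    return [255 if p > t else 0 for p in pixels]


# Made-up "washed out" sample: faint text at 200, paper at 230.
washed_out = [200] * 6 + [230] * 4
print(otsu_threshold(washed_out))
print(binarize(washed_out))
```

On a real scan the pixel list would come from an image library, and an adaptive (local) threshold may work better than a single global one for unevenly lit pages.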

Shreeshrii · 2018-05-25T16:27:42Z

OK. I can reproduce the crash after updating the tessdata files. Here is the backtrace.

Starting program: /usr/local/bin/tesseract 1601.png - --tessdata-dir ../tessdata
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
[New Thread 0x3fffb6c5f100 (LWP 8284)]
[New Thread 0x3fffb645f100 (LWP 8285)]
[New Thread 0x3fffb5c5f100 (LWP 8286)]
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 511

Thread 1 "tesseract" received signal SIGSEGV, Segmentation fault.
0x00003fffb7bea350 in ERRCODE::error (this=<optimized out>, caller=<optimized out>, action=<optimized out>, format=0x3fffb7c045d0 "in file %s, line %d") at errcode.cpp:86
86 if (!*p)
(gdb) backtrace
#0 0x00003fffb7bea350 in ERRCODE::error (this=<optimized out>, caller=<optimized out>, action=<optimized out>, format=0x3fffb7c045d0 "in file %s, line %d") at errcode.cpp:86
#1 0x00003fffb7b3c7bc in UNICHARSET::get_isdigit (unichar_id=112, this=0x102a95c0) at ../../src/ccutil/unicharset.h:511
#2 tesseract::Dict::char_for_dawg (dawg=0x113d11b0, ch=<optimized out>, this=0x1058cb30) at dict.h:434
#3 tesseract::Dict::def_letter_is_okay (this=0x1058cb30, void_dawg_args=0x3fffffffdf50, unichar_id=<optimized out>, word_end=<optimized out>) at dict.cpp:413
#4 0x00003fffb7b3cf9c in tesseract::Dict::valid_word (this=0x1058cb30, word=..., numbers_ok=<optimized out>) at dict.cpp:758
#5 0x00003fffb7af4324 in tesseract::Dict::valid_word (word=..., this=<optimized out>) at ../../src/dict/dict.h:463
#6 tesseract::Wordrec::dict_word (this=<optimized out>, word=...) at tface.cpp:129
#7 0x00003fffb7a37ee8 in tesseract::Tesseract::recog_word (this=0x102442c0, word=0x12418860) at tfacepp.cpp:69
#8 0x00003fffb7a25554 in tesseract::Tesseract::tess_segment_pass_n (this=0x102442c0, pass_n=<optimized out>, word=0x12418860) at tessbox.cpp:49
#9 0x00003fffb79e01c4 in tesseract::Tesseract::match_word_pass_n (this=<optimized out>, pass_n=<optimized out>, word=<optimized out>, row=<optimized out>, block=<optimized out>) at control.cpp:1580
#10 0x00003fffb79e0464 in tesseract::Tesseract::classify_word_pass1 (this=0x102442c0, word_data=..., in_word=<optimized out>, out_words=<optimized out>) at control.cpp:1392
#11 0x00003fffb79e195c in tesseract::Tesseract::RetryWithLanguage (this=0x102442c0, word_data=..., recognizer=<optimized out>, debug=<optimized out>, in_word=0x1177aec0, best_words=0x3fffffffe508) at control.cpp:899
#12 0x00003fffb79e21c0 in tesseract::Tesseract::classify_word_and_language (this=0x102442c0, pass_n=<optimized out>, pr_it=0x3fffffffe710, word_data=0x1177a888) at control.cpp:1315
#13 0x00003fffb79e5974 in tesseract::Tesseract::RecogAllWordsPassN (this=0x102442c0, pass_n=<optimized out>, monitor=0x0, pr_it=0x3fffffffe710, words=0x3fffffffe6f0) at control.cpp:266
#14 0x00003fffb79e7660 in tesseract::Tesseract::recog_all_words (this=0x102442c0, page_res=0x1174fe20, monitor=0x0, target_word_box=0x0, word_config=0x0, dopasses=<optimized out>) at control.cpp:353
#15 0x00003fffb79c9b68 in tesseract::TessBaseAPI::Recognize (this=0x10020270 <main::api>, monitor=0x0) at baseapi.cpp:870
#16 0x00003fffb79ca10c in tesseract::TessBaseAPI::ProcessPage (this=0x10020270 <main::api>, pix=0x1027aa60, page_index=<optimized out>, filename=<optimized out>, retry_config=0x0, timeout_millisec=<optimized out>, renderer=0x1029de40) at baseapi.cpp:1176
#17 0x00003fffb79cdaf0 in tesseract::TessBaseAPI::ProcessPagesInternal (this=0x10020270 <main::api>, filename=0x3ffffffff792 "1601.png", retry_config=0x0, timeout_millisec=<optimized out>, renderer=0x1029de40) at baseapi.cpp:1132
#18 0x00003fffb79ce0d8 in tesseract::TessBaseAPI::ProcessPages (this=<optimized out>, filename=<optimized out>, retry_config=<optimized out>, timeout_millisec=<optimized out>, renderer=<optimized out>) at baseapi.cpp:1032
#19 0x0000000010002d6c in main (argc=<optimized out>, argv=0x3ffffffff4b8) at tesseractmain.cpp:547
(gdb) quit

amitdo · 2018-05-25T16:28:41Z

The version without cube is one of the two that crash.

Shreeshrii · 2018-05-25T16:35:45Z

tesseract-ocr/tessdata@d87b3cb (Mar 22, by Shreeshrii): Update LSTM Models to integerized tessdata_best for files < 25mb

Will have to check that one.

Shreeshrii · 2018-05-25T16:43:22Z

OK, it looks like both --oem 0 and --oem 1 work individually with the current traineddata. However, if --oem 2 is used, it crashes.

tesseract 1601.png - --tessdata-dir ../tessdata --oem 2
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 511
Segmentation fault (core dumped)

There is no eng.config file, so by default tesseract is using both engines.

amitdo · 2018-05-25T16:46:05Z

Yes, with --oem 0 or --oem 1 it does not crash.
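The per-engine comparison above can be scripted. The following is a minimal sketch, assuming `tesseract` is on the PATH and reusing the file name `1601.png` and the directory `../tessdata` from the commands in this thread. The helper names `build_oem_commands` and `run_oem_matrix` are hypothetical; only the `--oem` and `--tessdata-dir` options come from the thread.

```python
import subprocess


def build_oem_commands(image, tessdata_dir, oem_modes=(0, 1, 2)):
    """Return one tesseract argv list per OCR engine mode."""
    return [
        ["tesseract", image, "-", "--tessdata-dir", tessdata_dir, "--oem", str(oem)]
        for oem in oem_modes
    ]


def run_oem_matrix(image, tessdata_dir):
    """Run every command and map each --oem value to the process exit status.

    On POSIX systems a negative status means the process was killed by a
    signal (for example -11 for SIGSEGV).
    """
    results = {}
    for cmd in build_oem_commands(image, tessdata_dir):
        proc = subprocess.run(cmd, capture_output=True)
        results[cmd[-1]] = proc.returncode
    return results


# Show the command lines without executing anything.
for cmd in build_oem_commands("1601.png", "../tessdata"):
    print(" ".join(cmd))
```

Calling `run_oem_matrix("1601.png", "../tessdata")` on an affected setup would be expected to show a signal exit only for mode 2, matching the reports above, though that depends on the installed traineddata.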

Shreeshrii · 2018-05-25T16:48:22Z

The unicharset and lstm-unicharset can be different within the same traineddata file. It is possible that the program is using the wrong unicharset for the dawg files when --oem 2 is used.

Shreeshrii · 2018-05-25T17:16:04Z

Should we disable --oem 2?

That would also take care of other issues related to multi-language processing, where one language may have only an LSTM model (Indic, Arabic) while others have both, which leads to similar problems.

--oem 2 does not necessarily give better or more accurate results.

Shreeshrii · 2018-05-25T17:23:05Z

See #235 (comment) regarding --oem 2 issues with a mix of languages.

stweil · 2018-05-28T21:29:56Z

It indeed looks like the wrong unicharset is used (eng.lstm-unicharset instead of eng.unicharset). If this can be confirmed, it would result in wrong decisions about which text is best and cause the observed assertions (which finally trigger an intentional segmentation fault). Avoiding the crash is easy, but the right fix still needs some time, at least for me.

The crash is not related to the cube removal.
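The suspected failure mode can be shown with a toy model. This is only a sketch under stated assumptions: `ToyUnicharset` is a hypothetical stand-in rather than Tesseract's `UNICHARSET` class, and the character lists are invented. It mirrors just the fact, visible in the backtrace, that `get_isdigit` asserts `contains_unichar_id(unichar_id)` before looking the ID up.

```python
class ToyUnicharset:
    """Hypothetical stand-in: maps integer unichar IDs to characters."""

    def __init__(self, chars):
        self.chars = list(chars)

    def contains_unichar_id(self, unichar_id):
        return 0 <= unichar_id < len(self.chars)

    def get_isdigit(self, unichar_id):
        # Mirrors the assertion named in the reported error message.
        if not self.contains_unichar_id(unichar_id):
            raise AssertionError("contains_unichar_id(unichar_id)")
        return self.chars[unichar_id].isdigit()


# Invented contents: the two sets differ in size and in ordering.
legacy_set = ToyUnicharset(["a", "b", "7"])
lstm_set = ToyUnicharset(["a", "b", "7", "e", "9"])

unichar_id = 4  # produced against the larger (LSTM) set

print(lstm_set.get_isdigit(unichar_id))  # valid lookup in the matching set

try:
    legacy_set.get_isdigit(unichar_id)   # same ID against the wrong set
except AssertionError as err:
    print("Assert failed:", err)
```

An ID that is in range for both sets but refers to different characters would not trip the assertion at all; it would silently give a wrong answer, which fits the remark about wrong decisions on which text is best.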

stweil · 2018-05-28T21:33:43Z

I wonder why we need more than one unicharset and more than one word list. Neither should depend on the OCR engine used, and it should be possible to always use a superset fitting both engines. That would also reduce the traineddata size.

Shreeshrii · 2018-05-29T04:21:39Z

> The crash is not related to the cube removal.

@stweil Yes, you are right. I was trying to recall recent commits from memory. Further testing indicated that the problem might be the unicharsets.

> If this can be confirmed,

If you can indicate what testing will help, I can do it.

> I wonder why we need more than one unicharset and more than one word list.

My guess is that the language models depend on these, especially the LSTM model, which also uses a recoder/unicharcompressor for some languages. Using a different unicharset (even one with the same unichars but in a different order in the file) leads to wrong results.

> it should be possible to always use a superset fitting both engines.

You could give it a try. Use merge_unicharsets and keep the lstm-unicharset first in the list.
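Why the order of the inputs matters can be sketched with a toy merge. This is not the algorithm of the merge_unicharsets tool; it only assumes that a merge appends unseen characters in input order, so the IDs of the first input are left unchanged. The function name `merge_preserving_first` and the character lists are hypothetical.

```python
def merge_preserving_first(*charsets):
    """Merge character lists into a superset; IDs are list positions.

    Characters are appended in input order, so every ID from the first
    list keeps its original position in the merged result.
    """
    merged = []
    seen = set()
    for charset in charsets:
        for ch in charset:
            if ch not in seen:
                seen.add(ch)
                merged.append(ch)
    return merged


# Invented contents for illustration.
lstm_chars = ["a", "e", "b"]
legacy_chars = ["b", "a", "7"]

superset = merge_preserving_first(lstm_chars, legacy_chars)
print(superset)

# IDs assigned by the LSTM list are still valid in the superset.
print(all(superset[i] == ch for i, ch in enumerate(lstm_chars)))
```

With the inputs swapped, the legacy IDs would be the ones preserved instead, and a model trained against the LSTM ordering would then see shifted IDs.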

Shreeshrii mentioned this issue May 25, 2018

some images translated to text using Tesseract 4 throw an error regarding "contains_unichar_id" #1205

Closed

zdenop closed this as completed May 25, 2018

Shreeshrii mentioned this issue May 25, 2018

suggestion: make --oem 1 (LSTM) the default #1604

Open

This was referenced Jul 7, 2020

Tesseract Empty Page #3021

Open

Error in boxClipToRectangle: box outside rectangle #427

Open

amitdo added the boxClipToRectangle label Sep 12, 2022
