Blacklist and whitelist unsupported with LSTM (4.0) #751

nguyenq · 2017-03-08T13:21:26Z

Blacklist and whitelist no longer work in 4.00alpha. They used to work in 3.04.

https://groups.google.com/forum/#!topic/tesseract-ocr/cpcJHTE2xMo

Cryspart · 2017-03-09T19:40:36Z

Same problem for me with 4.00alpha, I tried to set tessedit_char_whitelist by using:

cli with option -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyz
cli with config file
tesserocr python module

But I keep getting non letter results

I can provide Dockerfile + python script + images if needed

yshean · 2017-03-14T09:21:39Z

Same problem for me. Still getting symbols and letters despite setting tessedit_char_whitelist="0123456789".

atefm · 2017-03-28T14:35:59Z

I encountered the same issue today when using --oem 1,2,3. It works fine for --oem 0 (Original Tesseract).
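As a minimal sketch of the legacy-engine route described above, the helper below only assembles the command line; it does not run Tesseract. The function name `build_tesseract_command` and the file name `sample.png` are illustrative assumptions. Note that `--oem 0` needs a traineddata file that contains the legacy model.

```python
import shlex
from typing import List, Optional


def build_tesseract_command(
    image_path: str,
    whitelist: Optional[str] = None,
    blacklist: Optional[str] = None,
    oem: int = 0,
    lang: str = "eng",
) -> List[str]:
    """Build an argv list for the tesseract CLI (hypothetical helper)."""
    # Using "stdout" as the output base prints the recognized text
    # instead of writing it to a file.
    cmd = ["tesseract", image_path, "stdout", "-l", lang, "--oem", str(oem)]
    if whitelist is not None:
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    if blacklist is not None:
        cmd += ["-c", f"tessedit_char_blacklist={blacklist}"]
    return cmd


# Show the command that would be executed for a digits-only whitelist.
cmd = build_tesseract_command("sample.png", whitelist="0123456789")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where Tesseract is installed.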

DanielRieske · 2017-04-12T09:34:03Z

I am encountering the same issue. Is there a solution for this issue yet?

amitdo · 2017-04-12T09:47:53Z

I am encountering the same issue. Is there a solution for this issue yet?

No.

RimacV · 2017-04-13T14:00:04Z

I am facing the same issue. Is it really a bug or is it just not supported for LSTM?

amitdo · 2017-04-13T14:09:38Z

It's currently not supported for LSTM.

People, please do not add another "I have the same issue" comment.

Adrian-at-CrimsonAzure · 2017-06-09T14:08:31Z

Are there plans to support whitelisting on LSTM in the future?

Htarlov · 2017-08-04T17:11:45Z

I also have this problem when using Tesseract 4 from C++

tess->SetVariable("tessedit_char_whitelist", "01234567890abcdefg");

has no effect on the output. The same with blacklist.

Tesseract returns not only ASCII and language-specific characters but also some strange other Unicode characters.

Is there a way to get a full list of all possible characters, whether specific to a language or not? Based on such a list, one could build a workaround that maps wrong characters to the best-fitting expected ones (like EM DASH to a plain ASCII dash, etc.) and removes those without any sensible fit. It would be useful for me in my current circumstances, and maybe for others who need whitelisting.
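The mapping idea above can be sketched without knowing the full character list in advance. This is an illustrative sketch only: `FOLD_TABLE` and `fold_to_allowed` are hypothetical names, and the table holds a few hand-picked look-alikes that would need extending after inspecting real output.

```python
import unicodedata

# Hand-picked look-alike characters mapped to ASCII (an assumption,
# not a list taken from Tesseract).
FOLD_TABLE = str.maketrans({
    "\u2014": "-",  # EM DASH
    "\u2013": "-",  # EN DASH
    "\u2212": "-",  # MINUS SIGN
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
})


def fold_to_allowed(text: str, allowed: str) -> str:
    """Map look-alikes to ASCII, then drop everything outside `allowed`."""
    allowed_set = set(allowed)
    folded = text.translate(FOLD_TABLE)
    # NFKD splits accented letters into base letter + combining mark;
    # the combining mark is removed by the filter below unless allowed.
    folded = unicodedata.normalize("NFKD", folded)
    return "".join(ch for ch in folded if ch in allowed_set)


print(fold_to_allowed("a\u2014b", "ab-"))
```

Characters with no sensible fit are simply dropped, which matches the "remove those without any sensible fit" step described above.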

Shreeshrii · 2017-08-04T17:33:08Z

@theraysmith Are there plans to support this for LSTM?

Shreeshrii · 2017-10-03T16:39:51Z

In response to https://groups.google.com/forum/#!topic/tesseract-ocr/-oeCTcojYfw

You can try the plus-minus type of training if you just want a digits type of traineddata.

Your training_text can contain numbers in the format you need and you can train with a font matching your images.

For proof of concept you can try my experimental version at

https://github.com/Shreeshrii/tessdata_shreetest

ErnstTmp · 2017-10-28T22:12:40Z

I would like to exclude everything except letters and digits from the result. I started from eng.traineddata and trained my font from graphical images (@Shreeshrii: thanks!!). Is there a way to get rid of all the other symbols, especially !"=)() ... ?

I am using --oem 1.

Thank you very much,
Ernst

ghost · 2018-01-18T14:04:41Z

Duplicate issue? "user pattern/dict does not work at all"
#960

I'm on 3.04.01 (from ubuntu 16.04 repos) and it doesn't work in that version either.

smlum · 2018-03-22T17:32:57Z

Has this been resolved, or has anyone found a workaround?

amitdo · 2018-03-22T17:50:29Z

has this been resolved

No.

Htarlov · 2018-03-22T18:18:38Z

Not really - sort-of workaround only.

I ended up iterating through the symbols Tesseract found and post-processing them. By analyzing many cases, I worked out which OCR errors are typical for my type of documents and push characters outside the chosen set. I then map those mistaken characters to the proper ones and filter out everything else that is outside the set. In the end the output contains only the chosen character set, but it is a suboptimal solution.
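A minimal sketch of this kind of post-processing, assuming a digits-only target set. The confusion table `DIGIT_CONFUSIONS` and the function `restrict_to_charset` are hypothetical; the actual mapping used in the comment above is not given in the thread and would come from analyzing your own documents.

```python
from typing import Dict

# Example confusions for a digits-only target set (assumed, not measured).
DIGIT_CONFUSIONS: Dict[str, str] = {
    "O": "0", "o": "0",
    "l": "1", "I": "1", "|": "1",
    "Z": "2", "S": "5", "B": "8",
}


def restrict_to_charset(text: str, allowed: str,
                        confusions: Dict[str, str]) -> str:
    """Keep allowed characters, remap known confusions, drop the rest."""
    allowed_set = set(allowed)
    out = []
    for ch in text:
        if ch in allowed_set:
            out.append(ch)
            continue
        mapped = confusions.get(ch)
        if mapped is not None and mapped in allowed_set:
            out.append(mapped)
        # Characters with no allowed mapping are silently dropped.
    return "".join(out)


print(restrict_to_charset("1O|5 S", "0123456789", DIGIT_CONFUSIONS))
```

This only filters the text after recognition; it does not steer the recognizer toward the allowed set, which is why the comment calls it suboptimal.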

Shreeshrii · 2018-03-23T15:08:51Z

Another experiment with finetuning - minuschar - i.e. removing characters from an existing traineddata.

In my sample I have used upper and lower case alphabet and digits only.

Please see the attached zip file. It has the bash script used, the training text and the resulting traineddata file. You will get better results if you use a font similar to the one you want to recognize and a training text similar to what you need.

I have removed all the wordlists/dawgs so tesseract will give a warning message when doing OCR.

alphanum.zip

smlum · 2018-03-23T15:50:12Z

@Htarlov @Shreeshrii thanks, interesting thoughts. I hadn't run much post-processing or done any training yet, so these should improve things considerably.

amitdo · 2018-04-15T11:49:46Z

tesseract/lstm/recodebeam.cpp

Line 258 in 8f7be2e

void RecodeBeamSearch::ExtractPathAsUnicharIds(

I think that RecodeBeamSearch() is the method that should be modified to make the whitelist/blacklist feature work. get_enabled() should be used.

tesseract/ccutil/unicharset.h

Line 877 in 023e1b3

bool get_enabled(UNICHAR_ID unichar_id) const {

teamcoltra · 2018-04-27T21:33:25Z

tesseract 4.0.0-beta.1 still has this problem.

vivanov879 · 2018-05-21T10:01:45Z

Rebuilt from source -- whitelist still doesn't work.

Shreeshrii · 2018-05-21T11:45:30Z

AFAIK, this will not be addressed for 4.0.0.

williape · 2018-05-23T01:49:08Z

I've posted a bounty to have this resolved: https://www.bountysource.com/issues/42806964-blacklist-and-whitelist-broken-in-4-00alpha

Ungaminga · 2018-07-04T12:38:36Z

Does some trained data for digits exist? I would use it if you have some. The @Shreeshrii links are broken ATM.

ghost · 2018-07-19T23:03:39Z

Use --oem 0 or -oem 0 and it works

thekevshow · 2019-01-26T04:58:33Z

Any updates on how to implement a whitelist without using --oem 0 in version 4?

Thanks!

Is there any update on this? Or should I drop versions?

bertsky · 2019-03-07T00:56:35Z

I think that RecodeBeamSearch() is the method that should be modified to make the whitelist/blacklist feature work. get_enabled() should be used.

@amitdo was right. I was able to get the old behaviour (whitelist, blacklist, unblacklist) back with the LSTM decoder by querying the unicharset's get_enabled for each output in ComputeTopN, ignoring it if disabled.

But it was not so easy (for me) to get the UnicharCompress (recoder) and RecodedCharID (label mapping) right – so that might be the wrong way to do it. Also, one important ingredient was that the unicharset member of the Tesseract class (which SetBlackAndWhiteList operates on) is not the same as lstm_recognizer_->GetUnicharset(). The latter seems to be a stripped down version, so I'll have the whitelisting operate on both. See #2294.
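To illustrate the idea of skipping disabled outputs when collecting the top candidates, here is a toy sketch. It is not Tesseract's code: it uses a greedy per-timestep pick over a plain list of scores, leaves out the CTC blank label, the beam search and the recoder, and the names `top_n_enabled` and `greedy_decode` are hypothetical.

```python
import heapq
from typing import List, Sequence, Set, Tuple


def top_n_enabled(scores: Sequence[float], enabled: Set[int],
                  n: int) -> List[Tuple[int, float]]:
    """Return the n best (label, score) pairs, ignoring disabled labels."""
    candidates = (
        (label, score) for label, score in enumerate(scores)
        if label in enabled
    )
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])


def greedy_decode(timesteps: Sequence[Sequence[float]],
                  labels: Sequence[str], allowed: str) -> str:
    """Pick the best enabled label at every timestep (toy decoder)."""
    allowed_set = set(allowed)
    enabled = {i for i, ch in enumerate(labels) if ch in allowed_set}
    out = []
    for scores in timesteps:
        best = top_n_enabled(scores, enabled, 1)
        if best:
            out.append(labels[best[0][0]])
    return "".join(out)


labels = ["a", "1", "b", "2"]
timesteps = [[0.6, 0.3, 0.05, 0.05], [0.1, 0.2, 0.6, 0.1]]
print(greedy_decode(timesteps, labels, "12"))
```

The point of masking inside the decoder, rather than filtering afterwards, is that the next-best enabled label takes the place of a disabled one instead of leaving a gap.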

jxu · 2019-05-24T21:38:18Z

Any updates on how to implement a whitelist without using --oem 0 in version 4?

Can we use ChoiceIterator to iterate through all possibilities, keeping/rejecting based on whitelist/blacklist, and using the top result left over, if it exists?
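The selection logic proposed here can be sketched independently of the Tesseract API. The input format below, a list of `(text, confidence)` alternatives per symbol, is an assumption made for illustration; in practice the alternatives would have to be collected from a ChoiceIterator, and the thread does not confirm how many alternatives the LSTM engine exposes. `pick_allowed` is a hypothetical name.

```python
from typing import List, Optional, Sequence, Tuple

Choice = Tuple[str, float]  # (candidate text, confidence)


def pick_allowed(symbol_choices: Sequence[Sequence[Choice]],
                 whitelist: Optional[str] = None,
                 blacklist: str = "") -> str:
    """For each symbol keep the most confident candidate that passes the lists."""
    white = set(whitelist) if whitelist is not None else None
    black = set(blacklist)
    result: List[str] = []
    for choices in symbol_choices:
        allowed = [
            (text, conf) for text, conf in choices
            if (white is None or text in white) and text not in black
        ]
        if allowed:
            result.append(max(allowed, key=lambda pair: pair[1])[0])
        # If no candidate survives, the symbol is dropped.
    return "".join(result)


choices = [
    [("O", 0.9), ("0", 0.8)],
    [("l", 0.7), ("1", 0.6)],
    [("!", 0.95)],
]
print(pick_allowed(choices, whitelist="0123456789"))
```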

sinall · 2019-06-18T02:28:56Z

Has this been fixed in 5.0.0-alpha?

axhagemann · 2019-08-13T08:07:08Z

Has anyone tested the 4.1 Version? It's supposed to be fixed now? release notes

thekevshow · 2019-08-15T18:56:01Z

Has anyone tested the 4.1 Version? It's supposed to be fixed now? release notes

@axhagemann

This appears to be working for me, just upgraded my 4.0 version to 4.1. FINALLY! lol been waiting on this.

Jinnrry · 2019-08-16T07:41:10Z

Has anyone tested the 4.1 Version? It's supposed to be fixed now? release notes

Really excited! It works!

stweil · 2019-08-17T12:11:43Z

@nguyenq, can we close this issue?

nguyenq · 2019-08-17T16:43:22Z

Didn't realize I'd opened this issue. It's been so long ago. :)

Shreeshrii · 2019-08-30T16:18:59Z

https://github.com/tesseract-ocr/tesseract/wiki#tesseract-4-packages-with-lstm-engine-and-related-traineddata Install from Alex's ppa

…

On Fri, 30 Aug 2019, 21:18 OllieD3711 wrote:

This appears to be working for me, just upgraded my 4.0 version to 4.1. FINALLY! lol been waiting on this.

@kev2316 <https://github.com/kev2316> I'm having trouble upgrading to 4.1.0. I'm sure I'm doing something stupid (using sudo apt-upgrade, and also tried sudo apt install), but when I try upgrade, I'm told I already have the latest version, 4.00~git2288-10f4998a-2. How can I upgrade to 4.1.0?

Tesseract 4.0's LSTM engine ignores char_whitelist.[1] This is fixed in Tesseract 4.1 [2] but it isn't widely available yet (it'll be in Ubuntu 19.10). [1]: tesseract-ocr/tesseract#751 [2]: https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes#tesseract-release-notes-jul-07-2019---v410

mhellmeier · 2020-03-20T02:55:33Z

Just to conclude and in addition:

To solve the blacklist and whitelist problem in version 4.0, two solutions have already been mentioned:

Update Tesseract to version 4.1 (the future-oriented approach)
Use the legacy mode with the --oem 0 flag

The Ubuntu package sources only contain Tesseract version 4.0.0-beta.1. If you can't upgrade and don't want to use the legacy mode, try to build a simple blacklist or whitelist function if you are using Tesseract with a wrapper in another programming language. One example with PyTesserocr can be found in this blog article: return2 – Python Tesseract 4.0 OCR: Recognize only numbers / digits and exclude all other characters.
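When a script has to run against whatever package a distribution ships, a version check can decide whether the built-in whitelist is usable or a fallback is needed. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the parsing is a simple regular expression over version strings like those quoted in this thread, and treating every version from 4.1 upward as fixed is an assumption based on the 4.1 release notes.

```python
import re
from typing import Optional, Tuple


def parse_major_minor(version_text: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from strings such as 'tesseract 4.1.0'."""
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def lstm_whitelist_supported(version_text: str) -> bool:
    """True if the LSTM engine is expected to honor the whitelist (>= 4.1)."""
    parsed = parse_major_minor(version_text)
    return parsed is not None and parsed >= (4, 1)


for text in ("tesseract 4.1.0", "4.0.0-beta.1", "4.00~git2288-10f4998a-2"):
    print(text, lstm_whitelist_supported(text))
```

If the check fails, the remaining options are the ones listed above: the legacy engine or filtering in the wrapper.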

Shreeshrii · 2020-03-20T03:41:52Z

Alex's ppa can be used on Ubuntu for the latest versions.
Please see https://github.com/tesseract-ocr/tesseract/wiki

v3ss0n · 2022-06-28T08:38:12Z

Just to conclude and in addition:

To solve the blacklist and whitelist problem in version 4.0, two solutions have already been mentioned:

Update Tesseract to version 4.1 (the future-oriented approach)

Use the legacy mode with the --oem 0 flag

The Ubuntu package sources only contain Tesseract version 4.0.0-beta.1. If you can't upgrade and don't want to use the legacy mode, try to build a simple blacklist or whitelist function if you are using Tesseract with a wrapper in another programming language. One example with PyTesserocr can be found in this blog article: return2 – Python Tesseract 4.0 OCR: Recognize only numbers / digits and exclude all other characters.

Is your method 3 a joke? It is not detection of numbers only; it is filtering out text AFTER it has been detected.

amitdo mentioned this issue Jun 19, 2017

whitelist character does not work for 4.0 api #998

Closed

nguyenq mentioned this issue Jun 20, 2017

4.0 instance.setConfigs(configs) not work！ nguyenq/tess4j#56

Closed

wosiu mentioned this issue Nov 10, 2017

when specify the white-list, why show the following effect? #1200

Closed

nguyenq mentioned this issue Nov 25, 2017

How to set a white list of characters to use getWords() nguyenq/tess4j#76

Closed

otiai10 mentioned this issue Feb 23, 2018

Unable to build: requires compiler support for the ISO C++ 2011 standard otiai10/gosseract#114

Closed

jeroen mentioned this issue Jan 9, 2019

Options ropensci/tesseract#18

Closed

amitdo mentioned this issue Jan 15, 2019

Option --psm 10 digits are not taken account. #2159

Open

JessicaYeh mentioned this issue Jan 24, 2019

Reading some numbers outputs symbols iseahound/Vis2#4

Open

dmypstl mentioned this issue Feb 9, 2019

Turning on legacy OCR engine mode ropensci/tesseract#39

Closed

mzeimet mentioned this issue May 28, 2019

How can I set tessedit_char_whitelist sirfz/tesserocr#174

Closed

thiagoalessio mentioned this issue May 29, 2019

Digits() or Whitelist() doesn't work thiagoalessio/tesseract-ocr-for-php#162

Closed

kylefoley76 mentioned this issue Jul 15, 2019

specify the characters tesserect outputs madmaze/pytesseract#212

Closed

nguyenq closed this as completed Aug 17, 2019

drothlis mentioned this issue Sep 4, 2019

stbt.ocr docstring: Note that LSTM engine ignores char_whitelist stb-tester/stb-tester#620

Merged

ghost mentioned this issue Dec 1, 2019

Fatal error: The command did not produce any output. thiagoalessio/tesseract-ocr-for-php#173

Closed

amitdo removed the help wanted label May 12, 2020

iaroslavn mentioned this issue Jun 11, 2020

Simulate ANT+ without raspberry pi? iaroslavn/peloton-bike-metrics-server#2

Open

amitdo added the allowlist / denylist label Jun 28, 2022
