Blacklist and whitelist unsupported with LSTM (4.0) #751

Closed
nguyenq opened this issue Mar 8, 2017 · 61 comments

@nguyenq
Contributor

nguyenq commented Mar 8, 2017

Blacklist and whitelist no longer work in 4.00alpha. They used to work in 3.04.

https://groups.google.com/forum/#!topic/tesseract-ocr/cpcJHTE2xMo

@Cryspart

Cryspart commented Mar 9, 2017

Same problem for me with 4.00alpha, I tried to set tessedit_char_whitelist by using:

  • cli with option -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyz
  • cli with config file
  • tesserocr python module

But I keep getting non letter results

I can provide Dockerfile + python script + images if needed

@yshean

yshean commented Mar 14, 2017

Same problem for me. Still getting symbols and alphabets despite setting tessedit_char_whitelist="0123456789".

@atefm

atefm commented Mar 28, 2017

I encountered the same issue today when using --oem 1,2,3. It works fine for --oem 0 (Original Tesseract).
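As a rough illustration of the setup described above, the following sketch only assembles the command line for a legacy-engine run with a whitelist; it does not call Tesseract itself. The function name and the file name `scan.png` are made up for the example. Note that `--oem 0` needs traineddata that still contains the legacy model.

```python
import shlex

def build_tesseract_command(image_path, whitelist, oem=0, lang="eng"):
    """Return an argv list that runs Tesseract with a character whitelist.

    Hypothetical helper for illustration; only the flags themselves
    (-l, --oem, -c tessedit_char_whitelist=...) come from Tesseract.
    """
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--oem", str(oem),  # 0 selects the legacy engine
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]

cmd = build_tesseract_command("scan.png", "0123456789")
print(shlex.join(cmd))
# To run it for real: subprocess.run(cmd, capture_output=True, text=True, check=True)
```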

@DanielRieske

I am encountering the same issue. Is there a solution for this issue yet?

@amitdo
Collaborator

amitdo commented Apr 12, 2017

> I am encountering the same issue. Is there a solution for this issue yet?

No.

@RimacV

RimacV commented Apr 13, 2017

I am facing the same issue. Is it really a bug or is it just not supported for LSTM?

@amitdo
Collaborator

amitdo commented Apr 13, 2017

It's currently not supported for LSTM.

People, please do not add another "I have the same issue" comment.

@Adrian-at-CrimsonAzure

Adrian-at-CrimsonAzure commented Jun 9, 2017

Are there plans to support whitelisting on LSTM in the future?

@Htarlov

Htarlov commented Aug 4, 2017

I also have this problem when using Tesseract 4 from C++

tess->SetVariable("tessedit_char_whitelist", "01234567890abcdefg");

has no effect on the output. The same with blacklist.

Tesseract returns not only ASCII and language-specific characters but also other, unexpected Unicode characters.

Is there a way to get a full list of all possible characters, whether language-specific or not? Based on such a list, one could build a workaround that maps wrong characters to the best-fitting expected ones (EM DASH to a plain ASCII dash, for example) and removes those with no sensible fit. It would be useful for me in my current circumstances, and it might be useful for others who need whitelisting.

@Shreeshrii
Collaborator

@theraysmith Are there plans to support this for LSTM?

@Shreeshrii
Collaborator

Shreeshrii commented Oct 3, 2017

In response to https://groups.google.com/forum/#!topic/tesseract-ocr/-oeCTcojYfw

You can try the plus-minus type of training if you just want a digits type of traineddata.

Your training_text can contain numbers in the format you need and you can train with a font matching your images.

For proof of concept you can try my experimental version at

https://github.com/Shreeshrii/tessdata_shreetest
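
A digits-only training text of the kind suggested above could be generated along these lines. This is only a sketch: the `#` placeholder patterns and the helper name are invented for the example, and the formats should be replaced with whatever your images actually contain.

```python
import random

def make_digit_training_text(patterns, lines, seed=0):
    """Build training lines by replacing every '#' in a pattern with a random digit.

    Hypothetical helper; the pattern syntax is an assumption of this sketch,
    not something Tesseract's training tools define.
    """
    rng = random.Random(seed)  # fixed seed so the text is reproducible
    out = []
    for _ in range(lines):
        pattern = rng.choice(patterns)
        out.append("".join(rng.choice("0123456789") if c == "#" else c
                           for c in pattern))
    return "\n".join(out)

text = make_digit_training_text(["##/##/####", "#,###.##", "###-####"], lines=5)
print(text)
```

The result would then be saved as the training text and rendered with a font matching the target images.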

@ErnstTmp

I would like to exclude everything except letters and digits from the result. I started from eng.traineddata and trained my font from graphical images (@Shreeshrii: thanks!). Is there a way to get rid of all the other symbols, especially !"=)() ... ?

I am using --oem 1.

Thank you very much,
Ernst

@ghost

ghost commented Jan 18, 2018

Duplicate issue? "user pattern/dict does not work at all"
#960

I'm on 3.04.01 (from ubuntu 16.04 repos) and it doesn't work in that version either.

@smlum

smlum commented Mar 22, 2018

Has this been resolved, or has anyone found a workaround?

@amitdo
Collaborator

amitdo commented Mar 22, 2018

> has this been resolved

No.

@Htarlov

Htarlov commented Mar 22, 2018

Not really, only a sort-of workaround.

I ended up iterating through the symbols found by Tesseract and doing some post-processing. By analysing many cases, I worked out which OCR errors are typical for my type of documents and push characters outside the chosen set. I then mapped those mistaken characters to the proper ones and filtered out everything else outside the set. So the output finally contains only the chosen character set, but it is a suboptimal solution.
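
A minimal sketch of that kind of post-processing, assuming the OCR text is already available as a string. The substitution table here is invented for illustration; a real one would come from analysing the errors typical of your own documents.

```python
def restrict_to_charset(text, allowed, substitutions):
    """Map known confusions into the allowed set, then drop what is still outside it.

    Whitespace is kept so that word and line boundaries survive.
    """
    out = []
    for ch in text:
        ch = substitutions.get(ch, ch)  # e.g. EM DASH -> ASCII dash
        if ch in allowed or ch.isspace():
            out.append(ch)
    return "".join(out)

ALLOWED = set("0123456789-")
# Illustrative mapping only: EM DASH, EN DASH, and two common letter/digit confusions.
SUBSTITUTIONS = {"\u2014": "-", "\u2013": "-", "O": "0", "l": "1"}

print(restrict_to_charset("2O17\u201403\u201422!", ALLOWED, SUBSTITUTIONS))
# prints 2017-03-22
```

This runs after recognition, so it cannot steer the engine toward an allowed character the way a real whitelist does.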

@Shreeshrii
Collaborator

Another experiment with finetuning - minuschar - i.e. removing characters from an existing traineddata.

In my sample I have used upper and lower case alphabet and digits only.

Please see the attached zip file. It contains the bash script used, the training text, and the resulting traineddata file. You will get better results if you use a font similar to the one you want to recognize and a training text similar to what you need.

I have removed all the wordlists/dawgs so tesseract will give a warning message when doing OCR.

alphanum.zip

@smlum

smlum commented Mar 23, 2018

@Htarlov @Shreeshrii thanks, interesting thoughts. I hadn't run much post-processing or done any training yet, so these should improve things considerably.

@amitdo
Collaborator

amitdo commented Apr 15, 2018

void RecodeBeamSearch::ExtractPathAsUnicharIds(

I think that RecodeBeamSearch() is the method that should be modified to make the whitelist/blacklist feature work. get_enabled() should be used.

bool get_enabled(UNICHAR_ID unichar_id) const {

@teamcoltra

tesseract 4.0.0-beta.1 still has this problem.

@vivanov879

Rebuilt from source; the whitelist still doesn't work.

@Shreeshrii
Collaborator

AFAIK, this will not be addressed for 4.0.0.

@williape

I've posted a bounty to have this resolved: https://www.bountysource.com/issues/42806964-blacklist-and-whitelist-broken-in-4-00alpha

@Ungaminga

Does any trained data for digits exist? I would use it if you have some. The @Shreeshrii links are broken ATM.

@ghost

ghost commented Jul 19, 2018

Use --oem 0 or -oem 0 and it works

@thekevshow

Any updates on how to implement a whitelist without using --oem 0 in version 4?

Thanks!

Is there any update on this? Or should I drop versions?

@bertsky
Contributor

bertsky commented Mar 7, 2019

> I think that RecodeBeamSearch() is the method that should be modified to make the whitelist/blacklist feature work. get_enabled() should be used.

@amitdo was right. I was able to get the old behaviour (whitelist, blacklist, unblacklist) back with the LSTM decoder by querying the unicharset's get_enabled for each output in ComputeTopN, ignoring it if disabled.

But it was not so easy (for me) to get the UnicharCompress (recoder) and RecodedCharID (label mapping) right – so that might be the wrong way to do it. Also, one important ingredient was that the unicharset member of the Tesseract class (which SetBlackAndWhiteList operates on) is not the same as lstm_recognizer_->GetUnicharset(). The latter seems to be a stripped down version, so I'll have the whitelisting operate on both. See #2294.
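
The general idea of skipping disabled outputs when picking the top N can be shown with a toy example. This is not Tesseract's code: the real decoder works on recoded labels inside RecodeBeamSearch, whereas here a flat list of scores for one timestep and a parallel `enabled` list stand in for the network output and the unicharset's get_enabled.

```python
import heapq

def top_n_enabled(scores, enabled, n):
    """Return the n best (label, score) pairs, ignoring disabled labels."""
    candidates = ((label, s) for label, s in enumerate(scores) if enabled[label])
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# One timestep of output over a 5-label toy character set (made-up numbers).
labels = ["0", "1", "O", "l", "-"]
scores = [0.30, 0.08, 0.45, 0.15, 0.02]

whitelist = set("01-")
enabled = [ch in whitelist for ch in labels]

best = top_n_enabled(scores, enabled, 2)
print([(labels[i], s) for i, s in best])
# prints [('0', 0.3), ('1', 0.08)]
```

Without the mask the top choice would be "O"; with it, the decoder falls back to the best permitted label.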

@jxu

jxu commented May 24, 2019

> Any updates on how to implement a whitelist without using --oem 0 in version 4?

Can we use ChoiceIterator to iterate through all possibilities, keeping/rejecting based on whitelist/blacklist, and using the top result left over, if it exists?
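
The selection logic proposed here might look like the sketch below. It assumes the per-symbol alternatives have already been collected into plain `(character, confidence)` lists; that data layout is hypothetical, and whether the LSTM engine actually exposes useful alternatives through ChoiceIterator is not established in this thread.

```python
def pick_allowed(symbol_choices, allowed):
    """For each symbol, keep the highest-confidence alternative in the allowed set.

    Symbols with no permitted alternative are dropped.
    """
    result = []
    for choices in symbol_choices:
        permitted = [(ch, conf) for ch, conf in choices if ch in allowed]
        if permitted:
            result.append(max(permitted, key=lambda c: c[1])[0])
    return "".join(result)

# Made-up alternatives for three symbols.
choices = [
    [("O", 91.0), ("0", 88.5)],  # top choice is a letter, runner-up is a digit
    [("7", 99.0)],
    [("!", 60.0)],               # nothing permitted, so the symbol is dropped
]
print(pick_allowed(choices, set("0123456789")))
# prints 07
```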

@sinall

sinall commented Jun 18, 2019

Has this been fixed in 5.0.0-alpha?

@axhagemann

Has anyone tested the 4.1 Version? It's supposed to be fixed now? release notes

@thekevshow

thekevshow commented Aug 15, 2019

> Has anyone tested the 4.1 Version? It's supposed to be fixed now? release notes

@axhagemann

This appears to be working for me, just upgraded my 4.0 version to 4.1. FINALLY! lol been waiting on this.

@Jinnrry

Jinnrry commented Aug 16, 2019

> Has anyone tested the 4.1 Version? It's supposed to be fixed now? release notes

Really excited ! It works!

@stweil
Contributor

stweil commented Aug 17, 2019

@nguyenq, can we close this issue?

@nguyenq
Contributor Author

nguyenq commented Aug 17, 2019

Didn't realize I'd opened this issue. It's been so long ago. :)

@nguyenq nguyenq closed this as completed Aug 17, 2019
@Shreeshrii
Collaborator

Shreeshrii commented Aug 30, 2019 via email

drothlis added a commit to stb-tester/stb-tester that referenced this issue Sep 4, 2019
Tesseract 4.0's LSTM engine ignores char_whitelist.[1]

This is fixed in Tesseract 4.1 [2] but it isn't widely available
yet (it'll be in Ubuntu 19.10).

[1]: tesseract-ocr/tesseract#751
[2]: https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes#tesseract-release-notes-jul-07-2019---v410
@mhellmeier

Just to conclude and in addition:

To solve the black- and whitelist problem in version 4.0, two solutions have already been mentioned:

  1. Update Tesseract to version 4.1 (the future-oriented approach)
  2. Use the legacy mode with the --oem 0 flag

The Ubuntu package sources only contain Tesseract version 4.0.0-beta.1. If you can't upgrade and don't want to use the legacy mode, and you are using Tesseract through a wrapper in another programming language, try building a simple black- or whitelist function yourself. One example with PyTesserocr can be found in this blog article: return2 – Python Tesseract 4.0 OCR: Recognize only numbers / digits and exclude all other characters.
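
The choice between these options depends on the installed version, which could be decided programmatically as in the sketch below. The function and its return labels are invented for the example, and it assumes versions after 4.1 keep the fix, which this thread does not confirm for 5.0.0-alpha.

```python
import re

def whitelist_strategy(version):
    """Suggest how to apply a whitelist for a given Tesseract version string.

    Hypothetical helper; pass the version part of `tesseract --version`,
    e.g. "4.0.0-beta.1".
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if not match:
        raise ValueError(f"unrecognised version: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    if (major, minor) >= (4, 1):
        return "native"            # LSTM engine honours the whitelist
    if major == 4:
        return "legacy-or-filter"  # use --oem 0, or filter the output yourself
    return "native"                # 3.x has only the legacy engine

for v in ("3.04.01", "4.0.0-beta.1", "4.1.0"):
    print(v, "->", whitelist_strategy(v))
```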

@Shreeshrii
Collaborator

Alex's ppa can be used on Ubuntu for the latest versions.
Please see https://github.com/tesseract-ocr/tesseract/wiki

@v3ss0n

v3ss0n commented Jun 28, 2022

> Just to conclude and in addition:
>
> To solve the black- and whitelist problem in version 4.0, two solutions have already been mentioned:
>
>   1. Update Tesseract to version 4.1 (the future-oriented approach)
>   2. Use the legacy mode with --oem flag
>
> The Ubuntu package sources only contain tesseract version 4.0.0-beta.1. If you can't upgrade and don't want to use the legacy mode, try to build a simple black- or whitelist function if you are using tesseract with a wrapper in another programming language. One example with PyTesserocr can be found in this blog article: return2 – Python Tesseract 4.0 OCR: Recognize only numbers / digits and exclude all other characters.

Is your method 3 a joke? It does not detect numbers only; it filters out text AFTER it has been detected.
