Blacklist and whitelist unsupported with LSTM (4.0) #751
Comments
Same problem for me with 4.00alpha. I tried to set
But I keep getting non-letter results. I can provide a Dockerfile, Python script, and images if needed. |
Same problem for me. Still getting symbols and alphabets despite setting |
I encountered the same issue today when using --oem 1,2,3. It works fine for --oem 0 (Original Tesseract). |
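A minimal sketch of the `--oem 0` route described in the comment above, written as a small command builder. `tessedit_char_whitelist` and `--oem` are existing Tesseract options; the helper name `build_tesseract_command` and the file name `digits.png` are made up for illustration. It also assumes the installed traineddata still contains the legacy engine components, otherwise `--oem 0` cannot load.

```python
from typing import List, Optional


def build_tesseract_command(
    image_path: str, oem: int = 0, whitelist: Optional[str] = None
) -> List[str]:
    """Build an argument list for the tesseract CLI.

    oem=0 selects the original (legacy) engine, which honours the whitelist
    in 4.0 according to this thread.
    """
    cmd = ["tesseract", image_path, "stdout", "--oem", str(oem)]
    if whitelist:
        # -c sets a config variable on the command line.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


if __name__ == "__main__":
    # Hypothetical input file; pass the list to subprocess.run() to execute it.
    print(" ".join(build_tesseract_command("digits.png", 0, "0123456789")))
```

Passing the result to `subprocess.run` as a list avoids shell quoting problems when the whitelist contains punctuation.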
I am encountering the same issue. Is there a solution for it yet? |
No. |
I am facing the same issue. Is it really a bug or is it just not supported for LSTM? |
It's currently not supported for LSTM. People, please do not add another "I have the same issue" comment. |
Are there plans to support whitelisting on LSTM in the future? |
I also have this problem when using Tesseract 4 from C++
has no effect on the output. The same goes for the blacklist. Tesseract returns not only ASCII and language-specific characters but also some strange other UTF-8 characters. Is there a way to get a full list of all possible characters, whether specific to a language or not? Based on such a list, one could build a workaround that maps the wrong characters to the best-fitting expected ones (EM DASH to a plain ASCII dash, etc.) and removes those without any sensible fit. It would be useful for me in the current circumstances, and maybe also for others who need whitelisting. |
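The mapping idea in the comment above could look roughly like this. It is only a sketch: the `BEST_FIT` table is a hand-picked assumption, not a list taken from Tesseract, and `fold_to_allowed` is a hypothetical helper name.

```python
import string

# Hand-picked best-fit replacements (an assumption; extend for your documents).
BEST_FIT = {
    "\u2014": "-",  # EM DASH
    "\u2013": "-",  # EN DASH
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
}


def fold_to_allowed(text: str, extra_allowed: str = "") -> str:
    """Keep printable ASCII plus extra_allowed, map known look-alikes, drop the rest."""
    allowed = set(string.printable) | set(extra_allowed)
    out = []
    for ch in text:
        if ch in allowed:
            out.append(ch)
        elif ch in BEST_FIT:
            out.append(BEST_FIT[ch])
        # Anything else has no sensible fit and is dropped.
    return "".join(out)


if __name__ == "__main__":
    print(fold_to_allowed("10\u201420 z\u0142", extra_allowed="\u0142"))
```

Language-specific letters are passed through `extra_allowed`, so they survive while stray symbols are removed.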
@theraysmith Are there plans to support this for LSTM? |
In response to https://groups.google.com/forum/#!topic/tesseract-ocr/-oeCTcojYfw You can try the plus-minus type of training if you just want a digits type of traineddata. Your training_text can contain numbers in the format you need and you can train with a font matching your images. For proof of concept you can try my experimental version at |
I would like to exclude everything except letters and digits from the result. I started from eng.traineddata and trained my font from graphical images (@Shreeshrii: thanks!). Is there a way to get rid of all the other symbols, especially !"=)() ... ? I am using --oem 1. Thank you very much, |
Duplicate issue? "user pattern/dict does not work at all" I'm on 3.04.01 (from ubuntu 16.04 repos) and it doesn't work in that version either. |
Has this been resolved, or has anyone found a workaround? |
No. |
Not really, only a sort-of workaround. I ended up iterating through the symbols found by Tesseract and doing some post-processing. By analysing many cases, I found the usual OCR errors for my type of documents that take the output outside the chosen set, then used a mapping from those mistaken characters to the proper ones (plus filtering of everything outside the set). So in the end I have only the chosen character set in the output, but it is a suboptimal solution. |
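A rough sketch of that kind of post-processing for a digits-only target set. The `CONFUSIONS` table is an assumed example of typical letter/digit mix-ups, not the mapping actually used in the comment above, and `postprocess` is a hypothetical name.

```python
# Assumed examples of characters commonly misread in place of digits.
CONFUSIONS = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}


def postprocess(text: str, allowed: str, confusions: dict = CONFUSIONS) -> str:
    """Map known misreads into the allowed set, then drop whatever is still outside it."""
    allowed_set = set(allowed)
    out = []
    for ch in text:
        if ch not in allowed_set:
            # Try the confusion table; unknown characters become "" and are dropped.
            ch = confusions.get(ch, "")
        if ch in allowed_set:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(postprocess("1O5-l2", "0123456789"))
```

As the comment notes, this only cleans the output after recognition; it does not steer the recognizer toward the allowed set.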
Another experiment with finetuning: minuschar, i.e. removing characters from an existing traineddata. In my sample I have used the upper- and lower-case alphabet and digits only. Please see the attached zip file. It has the bash script used, the training text, and the resulting traineddata file. You will get better results if you use a font similar to the one you want to recognize and a training text similar to what you need. I have removed all the wordlists/dawgs, so tesseract will give a warning message when doing OCR. |
@Htarlov @Shreeshrii thanks, interesting thoughts. I hadn't run much post-processing or done any training yet, so these should improve things considerably. |
tesseract 4.0.0-beta.1 still has this problem. |
Rebuilt from source; the whitelist still doesn't work. |
AFAIK, this will not be addressed for 4.0.0. |
I've posted a bounty to have this resolved: https://www.bountysource.com/issues/42806964-blacklist-and-whitelist-broken-in-4-00alpha |
Does trained data for digits exist? I would use it if you have some. The links from @Shreeshrii are broken at the moment. |
Use --oem 0 or -oem 0 and it works |
Is there any update on this? Or should I drop versions? |
@amitdo was right. I was able to get the old behaviour (whitelist, blacklist, unblacklist) back with the LSTM decoder by querying the unicharset's But it was not so easy (for me) to get the |
Can we use ChoiceIterator to iterate through all possibilities, keeping/rejecting based on whitelist/blacklist, and using the top result left over, if it exists? |
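The selection logic proposed in the comment above can be sketched without touching the Tesseract API. Here each symbol is represented as a plain list of `(character, confidence)` pairs, which is an assumed stand-in for what a ChoiceIterator would yield; `pick_allowed` and `filter_symbols` are hypothetical names.

```python
from typing import Iterable, Optional, Sequence, Tuple

Choice = Tuple[str, float]  # (candidate character, confidence); assumed shape


def pick_allowed(
    choices: Sequence[Choice], whitelist: str = "", blacklist: str = ""
) -> Optional[str]:
    """Return the highest-confidence candidate that passes both lists, or None."""
    white, black = set(whitelist), set(blacklist)
    for ch, _conf in sorted(choices, key=lambda c: c[1], reverse=True):
        if white and ch not in white:
            continue
        if ch in black:
            continue
        return ch
    return None


def filter_symbols(
    symbols: Iterable[Sequence[Choice]], whitelist: str = "", blacklist: str = ""
) -> str:
    """Apply pick_allowed to every symbol and skip symbols with no acceptable candidate."""
    picked = (pick_allowed(c, whitelist, blacklist) for c in symbols)
    return "".join(p for p in picked if p is not None)


if __name__ == "__main__":
    symbols = [[("O", 0.6), ("0", 0.4)], [("7", 0.9)], [("!", 0.8), ("|", 0.2)]]
    print(filter_symbols(symbols, whitelist="0123456789"))
```

An empty whitelist means "allow everything not blacklisted", which mirrors how the two settings are usually combined.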
Has this been fixed in 5.0.0-alpha? |
Has anyone tested version 4.1? It's supposed to be fixed now, according to the release notes. |
This appears to be working for me, just upgraded my 4.0 version to 4.1. FINALLY! lol been waiting on this. |
Really excited ! It works! |
@nguyenq, can we close this issue? |
Didn't realize I'd opened this issue. It's been so long ago. :) |
https://github.com/tesseract-ocr/tesseract/wiki#tesseract-4-packages-with-lstm-engine-and-related-traineddata
Install from Alex's ppa
On Fri, 30 Aug 2019, 21:18 OllieD3711 wrote:
@kev2316
I'm having trouble upgrading to 4.1.0. I'm sure I'm doing something stupid
(using sudo apt-upgrade, and also tried sudo apt install), but when I try
upgrade, I'm told I already have the latest version,
4.00~git2288-10f4998a-2.
How can I upgrade to 4.1.0?
|
Tesseract 4.0's LSTM engine ignores char_whitelist.[1] This is fixed in Tesseract 4.1 [2] but it isn't widely available yet (it'll be in Ubuntu 19.10). [1]: tesseract-ocr/tesseract#751 [2]: https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes#tesseract-release-notes-jul-07-2019---v410
To summarize: to solve the blacklist and whitelist problem in version 4.0, two solutions have already been mentioned:
The Ubuntu package sources only contain tesseract version |
Alex's ppa can be used on Ubuntu for the latest versions. |
Is your method 3 a joke? It is not detection of numbers only; it is filtering out text AFTER it has been detected. |
Blacklist and whitelist no longer work in 4.00alpha. They used to work in 3.04.
https://groups.google.com/forum/#!topic/tesseract-ocr/cpcJHTE2xMo