Tesseract inserting additional alternative characters #1465
Comments
For English, please also try the file from tessdata repo with --oem 0.
That will use the legacy tesseract engine. It is possible that it will be a
better fit for your use case.
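
A minimal sketch of the command this suggests, written as a small Python helper that only builds the argument list. The helper name, the image name, and the tessdata path are hypothetical; `--oem 0` requires a traineddata file that still contains the legacy model, which is why the path points at a checkout of the tessdata repo.

```python
import shlex


def legacy_engine_command(image, output_base, tessdata_dir, lang="eng"):
    """Build a tesseract command line that selects the legacy engine (--oem 0)."""
    return [
        "tesseract", image, output_base,
        "--tessdata-dir", tessdata_dir,  # assumed: directory holding eng.traineddata from the tessdata repo
        "--oem", "0",                    # 0 = legacy engine only
        "-l", lang,
    ]


# Hypothetical file names and paths, for illustration only.
cmd = legacy_engine_command("page.png", "out", "/opt/tessdata")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where tesseract is installed.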
On Tue 10 Apr, 2018, 6:28 PM jghare wrote:
Environment
- *Tesseract Version*: 3.x stable and 4.0 alpha/beta, for English
language text (using Fast and Best trained data)
- *Platform*: Windows (64-bit) and Linux (Ubuntu/CentOS)
Current Behavior:
All versions of tesseract mentioned above tend to insert additional
alternative characters, (probably) whenever it's not very confident. For
example, if there is a "#" in the image file it often spits out "#H" or "A#"
or even "AH". That's 2 characters for 1. Another example: if there's a "$"
in the image then it gives "S$" or "$s" etc. This happens very often for
other characters like 0, O, !, %, ^ etc.
My application is very sensitive to the length of the string, hence an extra
character throws many things off.
I am currently a command-line user and may later use it in Java whenever a
wrapper for 4.0 becomes available.
Expected Behavior:
I expect tesseract to give out only one character for each character in the
image. I should be able to control this behaviour using command line
parameters (assuming there isn't one yet). I have looked into the
parameters, but there are hundreds and most are not self-explanatory. Hence
raising this as an issue. Also, is it possible to get "character-level"
hOCR output? The current one is at word-level granularity.
Suggested Fix:
|
Hi Shreeshrii |
Please fix this.. It's a big problem. |
If it is a big problem, then provide a use case. A description in words is difficult to test, and developers are forced to spend time finding out what your problem is instead of solving problems. |
#1011 might be related |
There are a number of issues regarding this, for different languages etc. Listing them below.
- Incorrect recognotion of specific words - additional letters inserted #1011
- tesseract add similar characters in Japanese text (ambiguity management?) #1063
- German - Characters added to result multiple times (aä / AÄ) #1060
- Tesseract LSTM 4.0: letters repeat in recognized text #884 |
Possibly related: recognizes more characters than present #1362 |
This is still present in the latest master branch. It seems to happen after retraining (fine-tuning) the original tessdata files (eng, in my case) and appears to be a result of ambiguous output from the LSTM, where it provides more than one character for a bounding box (or at least that's how it appears without actually checking). In other words, it is giving its possible or "unconfident" characters as well. More training does seem to balance this out slightly, but it's very hit or miss. |
In that case try to disable the dictionary. |
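
One way to try this from the command line, sketched as a Python helper that builds the argument list. `load_system_dawg` and `load_freq_dawg` are Tesseract config variables that control loading of the main and frequent-word dictionaries; that these are the settings meant by the comment above is an assumption, and the helper and file names are hypothetical.

```python
import shlex


def no_dictionary_command(image, output_base, lang="eng"):
    """Build a tesseract command line with the word dictionaries switched off."""
    return [
        "tesseract", image, output_base,
        "-l", lang,
        "-c", "load_system_dawg=0",  # do not load the main dictionary
        "-c", "load_freq_dawg=0",    # do not load the frequent-words dictionary
    ]


# Hypothetical file names, for illustration only.
print(shlex.join(no_dictionary_command("page.png", "out")))
```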
Yes. It is. Use
|
Hi, |
C:\Program Files (x86)\Tesseract-OCR>tesseract testImage.PNG out -l check -c hocr_char_boxes=1 hocr
It looks like this config is no longer there. I want output at the character level, but it does not seem possible. |
@jghare, can you provide some simple images which show this issue? That would help testing new code which tries to fix it. |
@stweil I encounter this issue on a nearly daily basis, in about 5% of cases (around 2 per day on average). I will try to save the problematic files and the settings if that will be helpful to you.
-- file: stelnum -> text: 1C4BUOOOOKPJ60479 -> extracted text: 1Cc4dBUOOOOKPJ60479 -> config: --oem 1 --psm 1 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQSRTUVWXYZ1234567890 lang=dan
-- file: regnumber -> text: AJ38906 -> extracted text: AIJ38906 -> config: --oem 1 --psm 1 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQSRTUVWXY lang=dan
-- You can find the images attached in an archive; hopefully they will prove useful test material. Please tell me if you would like me to send more. |
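
For anyone collecting similar test cases, here is a small sketch that locates the extra characters by diffing the expected text against the OCR output with Python's standard `difflib`. The function name is hypothetical, and the two string pairs are the ones reported above. It only reports pure insertions; substitutions would show up as `replace` opcodes and are ignored here.

```python
from difflib import SequenceMatcher


def inserted_characters(expected, extracted):
    """Return (index_in_extracted, text) for every run found only in the OCR output."""
    matcher = SequenceMatcher(None, expected, extracted, autojunk=False)
    return [
        (j1, extracted[j1:j2])
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
        if tag == "insert"
    ]


# The two cases reported in this comment.
cases = [
    ("1C4BUOOOOKPJ60479", "1Cc4dBUOOOOKPJ60479"),
    ("AJ38906", "AIJ38906"),
]
for expected, extracted in cases:
    print(expected, "->", extracted, inserted_characters(expected, extracted))
```

In both cases the inserted character sits directly next to a visually similar one ("C"/"c", "I"/"J"), which matches the behaviour described in the original report.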
For random sequences of characters you'll need to:
If you have questions about fine tuning, use the forum. |
allow me to add #1465 and #2738. My diagnosis for this bug was that it is specific to the Tesseract CTC implementation (with its |
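
For readers unfamiliar with CTC, the sketch below shows textbook greedy CTC decoding: consecutive repeats are merged and blank frames are dropped. It is not Tesseract's implementation, and the blank symbol and frame strings are made up for illustration. It only shows how a single frame that flips to a look-alike class is enough to produce an extra character in any decoder of this kind.

```python
BLANK = "_"  # hypothetical blank symbol used only in this sketch


def ctc_greedy_collapse(frames):
    """Textbook CTC collapse: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for symbol in frames:
        # Emit a symbol only when it differs from the previous frame and is not blank.
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return "".join(out)


# Per-frame best classes for a clean read of "AJ3".
print(ctc_greedy_collapse("AA_JJ_3"))
# One frame between A and J flips to the look-alike "I": an extra character appears.
print(ctc_greedy_collapse("AAIJJ_3"))
```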
hello, |
I have just created pull request #4211, which I consider to be an improved solution for diplopia. I encourage everyone on this thread to try it out and test it with as broad a range of cases as possible. Note, by the way, that there are some new configuration values that can only be set in code as things stand. These configuration values are: bool kRemoveDiplopia - if true, enables diplopia removal functionality. If false, my changes have no effect. Obviously, if my diplopia change is of value, then these configuration items should be made into settings. |