Tesseract inserting additional alternative characters #1465

jghare · 2018-04-10T12:58:15Z

Environment

Tesseract Version: 3.x stable and 4.0 alpha/beta, for English-language text (using the Fast and Best trained data), command line
Platform: Windows 64-bit and Linux (Ubuntu/CentOS)

Current Behavior:

All versions of Tesseract mentioned above tend to insert additional alternative characters, probably whenever it is not very confident. For example, if there's a "#" in the image file, it often outputs "#H" or "A#" or even "AH": two characters for one. Another example: if there's a "$" in the image, it gives "S$" or "$s", etc. This happens very often for other characters such as 0, O, !, %, ^, and so on.
My application is very sensitive to the length of the string, so an extra character throws many things off.
I am currently a command-line user and may later use it from Java whenever a wrapper for 4.0 becomes available.

Expected Behavior:

I expect Tesseract to output only one character for each character in the image. I should be able to control this behaviour using command-line parameters (assuming there isn't one yet). I have looked into the parameters, but there are hundreds and most are not self-explanatory, hence raising this as an issue. Also, is it possible to get "character-level" hOCR output? The current one is at word-level granularity.

Suggested Fix:

Shreeshrii · 2018-04-10T13:38:54Z

For English, please also try the file from tessdata repo with --oem 0. That will use the legacy tesseract engine. It is possible that it will be a better fit for your use case.

jghare · 2018-04-10T14:37:56Z

Hi Shreeshrii,
When I say English, I mean the English alphabet and special characters. The words themselves are not dictionary words; they are cryptic, long sequences of mixed characters, 20-30 characters long.
The 4.0 alpha and beta give me far superior OCR results on my images than the legacy engine. Is there no way to tell Tesseract 4.0 not to insert extra alternatives?
It would also be good to be able to give it a whitelist of characters. I see that issue is also open.

mkrolready · 2018-04-20T09:46:26Z

Please fix this.. It's a big problem.

zdenop · 2018-04-20T12:33:29Z

If it is a big problem, then provide a use case. A description in words is difficult to test, and developers are forced to waste time working out what your problem is instead of solving problems.

vidiecan · 2018-04-20T20:32:18Z

#1011 might be related

Shreeshrii · 2018-04-22T15:41:51Z

There are a number of issues regarding this, for different languages etc. Listing them below.

Incorrect recognotion of specific words - additional letters inserted #1011

tesseract add similar characters in Japanese text (ambiguity management?) #1063

German - Characters added to result multiple times (aä / AÄ) #1060

Tesseract LSTM 4.0: letters repeat in recognized text #884

Shreeshrii · 2018-04-24T08:56:08Z

Possibly related:

recognizes more characters than present #1362

talentoscope · 2018-09-16T16:47:17Z

This is still present in the latest master branch. It seems to happen after retraining (finetuning) the original tessdata files - in my case eng - and appears to be a result of ambiguous output from the LSTM, where it is providing more than one character for a bounding box (or at least that's how it appears without actually checking) - i.e. it is giving its possible or "unconfident" characters as well. More training does seem to balance this out slightly, but it's very hit or miss.

amitdo · 2018-09-17T08:52:54Z

> When I say English I mean the English alphabet and special characters. The words themselves are not dictionary words and are cryptic and long sequences of mixed characters... 20-30 characters long.

In that case try to disable the dictionary.
Also try to fine tune the model.

Shreeshrii · 2019-07-19T08:20:59Z

> Also is it possible to get a "Character-level" HOCR output - current one is at word level granularity.

Yes. It is. Use -c hocr_char_boxes=1 hocr in your command line. Output is of the format:

<span class='ocrx_word' id='word_1_1' title='bbox 16 18 206 71; x_wconf 42'>
             <span class='ocrx_cinfo' title='x_bboxes 16 19 42 71; x_conf 99.041275'>B</span>
             <span class='ocrx_cinfo' title='x_bboxes 49 20 76 71; x_conf 99.038635'>A</span>
             <span class='ocrx_cinfo' title='x_bboxes 84 19 107 70; x_conf 98.950821'>S</span>
             <span class='ocrx_cinfo' title='x_bboxes 117 19 139 69; x_conf 91.848969'>O</span>
             <span class='ocrx_cinfo' title='x_bboxes 148 19 174 70; x_conf 99.027092'>B</span>
             <span class='ocrx_cinfo' title='x_bboxes 181 18 206 69; x_conf 98.989304'>C</span>
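For downstream processing, character-level spans like the ones above can be pulled out of the hOCR with Python's standard-library HTML parser. This is a minimal sketch, assuming the output keeps the `ocrx_cinfo` class and the `x_bboxes` / `x_conf` title fields shown in the sample; `CharBoxParser` and `parse_title` are illustrative names, not part of Tesseract.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split a title such as 'x_bboxes 16 19 42 71; x_conf 99.04' into (bbox, conf)."""
    bbox, conf = None, None
    for field in title.split(";"):
        parts = field.split()
        if not parts:
            continue
        if parts[0] == "x_bboxes":
            bbox = tuple(int(v) for v in parts[1:5])
        elif parts[0] == "x_conf":
            conf = float(parts[1])
    return bbox, conf


class CharBoxParser(HTMLParser):
    """Collect (character, bbox, confidence) for every ocrx_cinfo span."""

    def __init__(self):
        super().__init__()
        self.chars = []
        self._pending = None  # title fields of the ocrx_cinfo span currently open

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "span" and attr.get("class") == "ocrx_cinfo":
            self._pending = parse_title(attr.get("title", ""))

    def handle_data(self, data):
        # Only text directly inside an ocrx_cinfo span is recorded.
        if self._pending is not None and data.strip():
            bbox, conf = self._pending
            self.chars.append((data.strip(), bbox, conf))
            self._pending = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._pending = None


# Shortened copy of the sample output, for illustration only.
SAMPLE = """
<span class='ocrx_word' id='word_1_1' title='bbox 16 18 206 71; x_wconf 42'>
  <span class='ocrx_cinfo' title='x_bboxes 16 19 42 71; x_conf 99.041275'>B</span>
  <span class='ocrx_cinfo' title='x_bboxes 49 20 76 71; x_conf 99.038635'>A</span>
  <span class='ocrx_cinfo' title='x_bboxes 117 19 139 69; x_conf 91.848969'>O</span>
</span>
"""

parser = CharBoxParser()
parser.feed(SAMPLE)
for char, bbox, conf in parser.chars:
    print(char, bbox, conf)
```

With one entry per recognized character, a length-sensitive application could compare the number of entries against the expected string length, or flag characters whose confidence is noticeably lower than their neighbours'.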

eravallirao · 2019-11-20T09:27:19Z

Hi,
I tried to use it, but it is not working for me. Any ideas?

Togame-san · 2020-09-30T22:09:31Z

C:\Program Files (x86)\Tesseract-OCR>tesseract testImage.PNG out -l check -c hocr_char_boxes=1 hocr
Could not set option: hocr_char_boxes=1
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
OSD: Weak margin (0.63), horiz textlines, not CJK: Don't rotate.
Detected 3 diacritics

It looks like this config option is no longer there. I want output at the character level, but it does not seem possible.

stweil · 2020-11-09T09:23:35Z

@jghare, can you provide some simple images which show this issue? That would help testing new code which tries to fix it.

Petru-design · 2021-02-16T11:03:42Z

@stweil I encounter this issue on a nearly daily basis, in about 5% of cases (around 2 per day on average). I will try to save the problematic files and the settings if that would be helpful to you.
In the meantime, here are some examples (called from Python):

--

file: stelnum -> text: 1C4BUOOOOKPJ60479 -> extracted text: 1Cc4dBUOOOOKPJ60479 -> config: --oem 1 --psm 1 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQSRTUVWXYZ1234567890 lang=dan

--

file: regnumber -> text: AJ38906 -> extracted text: AIJ38906 -> config: --oem 1 --psm 1 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQSRTUVWXY lang=dan

--

You can find the images attached in an archive; hopefully they will prove useful as test material. Please tell me if you would like me to send more.
images.zip

--
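When the expected text is known, as in the two examples above, the inserted characters can be located automatically, which may help when collecting test cases. The following is a small sketch, not part of Tesseract; `find_insertions` is a hypothetical helper that walks both strings once and reports every character of the OCR output that is not needed to spell the expected text.

```python
def find_insertions(expected, extracted):
    """Return [(index, char)] for the extra characters in `extracted`,
    or None if `expected` cannot be obtained by deleting characters only
    (for example when a character was substituted or dropped)."""
    extras = []
    i = 0  # position of the next unmatched character in `expected`
    for j, ch in enumerate(extracted):
        if i < len(expected) and ch == expected[i]:
            i += 1
        else:
            extras.append((j, ch))
    return extras if i == len(expected) else None


# The two cases reported above.
print(find_insertions("1C4BUOOOOKPJ60479", "1Cc4dBUOOOOKPJ60479"))  # [(2, 'c'), (4, 'd')]
print(find_insertions("AJ38906", "AIJ38906"))                        # [(1, 'I')]
```

In both reported cases the extra character is a visually similar alternative sitting next to the correct one ("C"/"c", "J"/"I"), which matches the pattern described at the top of this issue.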

amitdo · 2021-02-16T15:43:40Z

For random sequences of characters, you'll need to:

Disable the dictionary.
Fine tune the eng model with similar images.

If you have questions about fine tuning, use the forum.
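As a rough illustration of the first step when calling Tesseract from Python, the word lists can be switched off with the `load_system_dawg` and `load_freq_dawg` config variables. The sketch below only builds the command line; `build_tesseract_cmd`, the file names, and the default `--psm 7` (single text line) are assumptions made for the example, and whether a whitelist is honoured by the LSTM engine may depend on the Tesseract version.

```python
import shlex


def build_tesseract_cmd(image, out_base, lang="eng", psm=7, whitelist=None):
    """Build a tesseract command line with the dictionary word lists disabled."""
    cmd = [
        "tesseract", image, out_base,
        "-l", lang,
        "--oem", "1",                 # LSTM engine, as in the examples above
        "--psm", str(psm),
        "-c", "load_system_dawg=0",   # do not load the main dictionary
        "-c", "load_freq_dawg=0",     # do not load the frequent-words list
    ]
    if whitelist:
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


# Hypothetical file names; pass the list to subprocess.run(cmd, check=True) to execute it.
cmd = build_tesseract_cmd("regnumber.png", "out", lang="dan",
                          whitelist="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")
print(shlex.join(cmd))
```

Disabling the dictionary does not replace fine-tuning; it only removes the bias towards dictionary words, which is unhelpful for identifiers such as registration numbers.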

bertsky · 2021-03-16T18:55:31Z

@Shreeshrii

> There are a number of issues regarding this, for different languages etc. Listing them below.
>
> Incorrect recognotion of specific words - additional letters inserted #1011
>
> tesseract add similar characters in Japanese text (ambiguity management?) #1063
>
> German - Characters added to result multiple times (aä / AÄ) #1060
>
> Tesseract LSTM 4.0: letters repeat in recognized text #884
>
> recognizes more characters than present #1362

allow me to add #1465 and #2738.

My diagnosis for this bug was that it is specific to the Tesseract CTC implementation (with its NodeContinuation trick conflating paths to avoid the combinatorial explosion but creating an additional ambiguity of two adjacent nulls). I called these fake CTC duplicates diplopia. Someone definitely needs to work on this.
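For readers unfamiliar with CTC, the collapse rule behind this ambiguity can be sketched as follows. This is textbook best-path CTC decoding, not Tesseract's beam search or its NodeContinuation logic; `ctc_collapse` and the frame sequences are made up for illustration. Consecutive identical labels merge into one character, but a null between two labels keeps them separate, so a label that appears again after a null is emitted as an additional character.

```python
from itertools import groupby

NULL = None  # the CTC "blank" label, called null in this discussion


def ctc_collapse(frames):
    """Merge runs of identical per-timestep labels, then drop nulls."""
    return "".join(label for label, _ in groupby(frames) if label is not NULL)


# One glyph spanning three timesteps: the run collapses to a single character.
print(ctc_collapse(["#", "#", "#"]))         # "#"

# The same label on both sides of a null is read as two separate characters.
print(ctc_collapse(["#", "#", NULL, "#"]))   # "##"

# A similar-looking alternative surfacing after a null yields an extra character.
print(ctc_collapse(["#", NULL, "H", NULL]))  # "#H"
```

In this simplified picture, any decoding step that introduces or misplaces a null can turn one glyph into two output characters, which is the kind of duplicate the diagnosis above attributes to the path conflation.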

seltix5 · 2022-02-07T14:34:43Z

Hello,
I have this problem too. Any idea how I can help fix it?
I have this simple example:

OCR result: 1921.14K
Judging from other tests, the problem is probably in the 9, because I sometimes get wrong results with 2s instead of 9s. In this case I got both.
I'm using this .NET wrapper (https://github.com/charlesw/tesseract/tree/feature/321-Tesseract-4), but I built and updated the Tesseract and Leptonica DLLs to the latest ones (leptonica-1.83.0 & tesseract50), using the "best" eng.traineddata and the character whitelist "0123456789.,KMB".

woodjohndavid · 2024-03-13T22:44:28Z

I have just created pull request #4211 which I consider to be an improved solution for diplopia.

I encourage everyone on this thread to try this out and test it with as broad a range of cases as possible.

Note by the way, there are some new configuration values that can only be set in code as things stand. These configuration values are:

bool kRemoveDiplopia - if true, enables diplopia removal functionality. If false, my changes have no effect
int kMaxDiplopiaGap - maximum number of timesteps apart to be considered diplopia, default 2

Obviously if my diplopia change is of value, then these configuration items should be made into settings.

zdenop added the accuracy label Sep 29, 2018

This was referenced Mar 20, 2019

RFC: Lattice Output #2339

Open

Option --psm 10 digits are not taken account. #2159

Open

bertsky mentioned this issue Dec 2, 2019

Duplicate Characters in Output Stream #2738

Open

stweil mentioned this issue Oct 30, 2020

Character confusion fix suggestion #3144

Open

amitdo added the diplopia label Mar 17, 2021
