Text not Detecting in conversation #883

Open
sairash opened this issue Feb 2, 2024 · 5 comments

@sairash

sairash commented Feb 2, 2024

Tesseract.js version 5.0.4

Describe the bug
For some reason, when I run recognition on a screenshot of a conversation, it does not detect the messages inside the blue containers.

Image

To Reproduce
Steps to reproduce the behavior:
Run recognition on the attached image.
test

Expected behavior
It needs to be
Device Version:

  • Linux + Arch
  • Browser [Brave]
@Kishlay-notabot
Contributor

It works when you crop the image to just the blue part and run detection on that crop.
I am using a web app (built on Tesseract.js) to detect the text.

image
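For anyone who wants to try the cropping approach in code rather than in an image editor, here is a rough sketch. It assumes the `rectangle` recognize option of Tesseract.js; `clampRectangle` and `recognizeRegion` are hypothetical helper names, and the bubble coordinates have to come from you.

```javascript
// Hypothetical helper: move the rectangle's origin inside the image and
// trim its size so it does not extend past the right or bottom edge.
function clampRectangle(rect, imageWidth, imageHeight) {
  const left = Math.max(0, Math.min(rect.left, imageWidth));
  const top = Math.max(0, Math.min(rect.top, imageHeight));
  const width = Math.max(0, Math.min(rect.width, imageWidth - left));
  const height = Math.max(0, Math.min(rect.height, imageHeight - top));
  return { left, top, width, height };
}

// Hypothetical helper: recognize only one region of the image.
// tesseract.js is loaded lazily so the helper above works without it.
async function recognizeRegion(image, rectangle) {
  const { createWorker } = require('tesseract.js');
  const worker = await createWorker('eng');
  try {
    const { data } = await worker.recognize(image, { rectangle });
    return data.text;
  } finally {
    await worker.terminate();
  }
}
```

Calling `recognizeRegion` once per message bubble would mimic the manual crop, at the cost of having to locate the bubbles first.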

I suppose Tesseract sets something like a single threshold for the whole image when it tries to detect text, based on how easy the text is to detect.

The engine probably scans the whole image, and the highest-contrast text in it is the white text on the grey background near the top; maybe it takes that as a relative reference for the rest of the image. In other words, I suppose it looks for the text that is easiest to detect, i.e. the text with the highest contrast against its background. I have no prior experience with or knowledge of the engine's internals, so this is only a hypothesis.

I tried converting the image to grayscale before running Tesseract OCR on it, but the results still aren't what we expect.
Below is the image converted to grayscale and processed as a whole; the words still aren't recognized.

image
grayscale image

When I apply binarization to the grayscale image, the result roughly matches my hypothesis: the text in the blue containers is not visible at all. So maybe Tesseract runs something like a uniformity-inducing or binarization step as preprocessing before running OCR. [I may well be wrong.]
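To make the hypothesis concrete, here is a minimal sketch of a global Otsu threshold in plain JavaScript. It is a textbook version for illustration, not Tesseract's actual implementation, and `luma`, `otsuThreshold` and `textSurvives` are hypothetical names. The point is only that a single cut for the whole image keeps text visible when the text and its bubble land on opposite sides of the cut.

```javascript
// Rec. 601 luma approximation for an [r, g, b] pixel (0-255 per channel).
function luma([r, g, b]) {
  return Math.round(0.299 * r + 0.587 * g + 0.114 * b);
}

// Global Otsu: choose the threshold t that maximizes the between-class
// variance of the "<= t" and "> t" groups. Expects integer greys 0-255.
function otsuThreshold(grays) {
  const hist = new Array(256).fill(0);
  for (const g of grays) hist[g] += 1;
  const total = grays.length;
  let sumAll = 0;
  for (let i = 0; i < 256; i++) sumAll += i * hist[i];

  let wB = 0;      // pixel count at or below t
  let sumB = 0;    // sum of grey values at or below t
  let best = 0;
  let bestVar = -1;
  for (let t = 0; t < 256; t++) {
    wB += hist[t];
    if (wB === 0) continue;
    const wF = total - wB;
    if (wF === 0) break;
    sumB += t * hist[t];
    const mB = sumB / wB;
    const mF = (sumAll - sumB) / wF;
    const between = wB * wF * (mB - mF) ** 2;
    if (between > bestVar) {
      bestVar = between;
      best = t;
    }
  }
  return best;
}

// Text stays visible after thresholding at t only if the text and its
// background fall on opposite sides of t.
function textSurvives(textGray, backgroundGray, t) {
  return (textGray > t) !== (backgroundGray > t);
}
```

Feeding `otsuThreshold` the grey values of a whole screenshot and then checking `textSurvives` for each bubble colour is one way to see whether a single global cut can separate every bubble from its text.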

image
binarized image

@sairash
Author

sairash commented Feb 5, 2024

I also tried different filters and other preprocessing. Nothing seems to work.
But when I use tesseract-wasm, it detects the text.

swappy-20240205_221940

But it is unreliable on low-resolution pictures, and the same problem comes back when the image has any white borders.

swappy-20240205_222121

@Balearica
Member

Tesseract.js includes an output option that allows you to retrieve the actual binarized image recognized by Tesseract. An example site using this option can be found here. Sure enough, as @Kishlay-notabot speculated, the binarized output confirms that the messages in blue are being erased by the binarization process.

download (50)

Using this example code, you should be able to experiment with Tesseract's binarization options. These are not documented in this repo, but you can find them in the main Tesseract project's repo; I pasted the descriptions from the code below. I have not used these options before, so I am not sure which (if any) would improve results with this screenshot. If none of these options work, you would need to either (1) binarize the image properly yourself before sending it to Tesseract or (2) crop the image to specific messages before processing.

    , INT_MEMBER(thresholding_method,
                 static_cast<int>(ThresholdMethod::Otsu),
                 "Thresholding method: 0 = Otsu, 1 = LeptonicaOtsu, 2 = "
                 "Sauvola",
                 this->params())
    , BOOL_MEMBER(thresholding_debug, false,
                  "Debug the thresholding process",
                  this->params())
    , double_MEMBER(thresholding_window_size, 0.33,
                    "Window size for measuring local statistics (to be "
                    "multiplied by image DPI). "
                    "This parameter is used by the Sauvola thresholding method",
                    this->params())
    , double_MEMBER(thresholding_kfactor, 0.34,
                    "Factor for reducing threshold due to variance. "
                    "This parameter is used by the Sauvola thresholding method."
                    " Normal range: 0.2-0.5",
                    this->params())
    , double_MEMBER(thresholding_tile_size, 0.33,
                    "Desired tile size (to be multiplied by image DPI). "
                    "This parameter is used by the LeptonicaOtsu thresholding "
                    "method",
                    this->params())
    , double_MEMBER(thresholding_smooth_kernel_size, 0.0,
                    "Size of convolution kernel applied to threshold array "
                    "(to be multiplied by image DPI). Use 0 for no smoothing. "
                    "This parameter is used by the LeptonicaOtsu thresholding "
                    "method",
                    this->params())
    , double_MEMBER(thresholding_score_fraction, 0.1,
                    "Fraction of the max Otsu score. "
                    "This parameter is used by the LeptonicaOtsu thresholding "
                    "method. "
                    "For standard Otsu use 0.0, otherwise 0.1 is recommended",
                    this->params())
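A rough sketch of how one might experiment with these parameters from Tesseract.js follows. It assumes the parameters can be passed through `worker.setParameters` and that the `imageBinary` output option returns the binarized image; whether each setting actually changes the result for this screenshot is untested. `THRESHOLD_METHODS`, `buildThresholdParams` and `recognizeWithThreshold` are hypothetical names, and the integer codes come from the description quoted above.

```javascript
// Integer codes from the quoted description:
// 0 = Otsu, 1 = LeptonicaOtsu, 2 = Sauvola.
const THRESHOLD_METHODS = { Otsu: 0, LeptonicaOtsu: 1, Sauvola: 2 };

// Hypothetical helper: build the object handed to worker.setParameters().
// Values are converted to strings; `extra` can carry settings such as
// thresholding_kfactor or thresholding_window_size.
function buildThresholdParams(method, extra = {}) {
  if (!Object.prototype.hasOwnProperty.call(THRESHOLD_METHODS, method)) {
    throw new Error(`Unknown thresholding method: ${method}`);
  }
  const params = { thresholding_method: String(THRESHOLD_METHODS[method]) };
  for (const [key, value] of Object.entries(extra)) {
    params[key] = String(value);
  }
  return params;
}

// Hypothetical helper: recognize with a chosen method and also return the
// binarized image so the effect of the setting can be inspected.
// tesseract.js is loaded lazily so the helper above works without it.
async function recognizeWithThreshold(image, method, extra = {}) {
  const { createWorker } = require('tesseract.js');
  const worker = await createWorker('eng');
  try {
    await worker.setParameters(buildThresholdParams(method, extra));
    const { data } = await worker.recognize(image, {}, { text: true, imageBinary: true });
    return { text: data.text, imageBinary: data.imageBinary };
  } finally {
    await worker.terminate();
  }
}
```

For example, `recognizeWithThreshold(file, 'Sauvola', { thresholding_kfactor: 0.3 })` would try the Sauvola method with a k-factor inside the quoted normal range of 0.2-0.5.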

@Balearica
Member

It looks like the image is recognized perfectly, without changing any Tesseract.js settings, when it is inverted first. I don't know how generalizable this is, since messaging apps can differ between white on black, black on white, and mixed; however, inverting the image so that it has black text on a light background solves the problem in this case.

text_1_invert
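For reference, the inversion can be done on raw RGBA pixel data before the image is passed to Tesseract.js. This is a minimal sketch; `invertRgba` is a hypothetical helper name, and the input is assumed to use the layout of a canvas `ImageData.data` array (four bytes per pixel: red, green, blue, alpha).

```javascript
// Invert the colour channels of RGBA pixel data, leaving alpha untouched.
// Returns a new array; the input is not modified.
function invertRgba(data) {
  const out = new Uint8ClampedArray(data.length);
  for (let i = 0; i < data.length; i += 4) {
    out[i] = 255 - data[i];         // red
    out[i + 1] = 255 - data[i + 1]; // green
    out[i + 2] = 255 - data[i + 2]; // blue
    out[i + 3] = data[i + 3];       // alpha stays as-is
  }
  return out;
}
```

In a browser, the input would typically come from `ctx.getImageData(...)`, and the result can be written back with `ctx.putImageData(new ImageData(inverted, width, height), 0, 0)` before handing the canvas to `worker.recognize`.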

@Kishlay-notabot
Contributor

@Balearica
Strange behaviour. Try binarization on the inverted image.
I think that in this case the text is black and the backgrounds are orange and grey respectively, which gives better contrast than the white-on-blue combination in the original image. Contrast is the main thing.

3 participants