Regression in extracting text from Excel TIF image #4014
Comments
This is an issue in version 5.3 as well. We tried both the best and fast trained data and still see this issue.
Hey, just wanted to know the status of this issue. Can I take this up?
I can confirm the issue with the latest code. What about releases between 4.1.1 (working) and 5.2.0 (not working)? Can we narrow down which release introduced the regression?
According to |
…-ocr#4014) "auto" resulted in unsigned numbers, but htext_score and vtest_score can be negative. Fixes: 842cca1 ("Fix more signed/unsigned compiler warnings") Signed-off-by: Stefan Weil <[email protected]>
Pull request #4136 fixes the regression.
@jwmepiq, thank you for reporting this nasty regression.
Now that #4136 has been merged and 5.3.3 released, I assume this can be closed.
Yes, thanks.
Environment
ExcelTest_Bug.zip
ExcelTest_text_TesseractV4.txt
Tesseract Version: 5.2.0 vs. 4.1.1.-rc2-37-gcla5
Ubuntu 20.04.3 LTS
Current Behavior:
With the attached TIF image of an Excel file (in the zip), Tesseract version 5.2.0 extracts only a minimal amount of text (a single line, "hiding rows 15 through 20"). Prior versions of Tesseract, namely the 4.1.1 version noted above and likely others as well, extract significantly more text from the same TIF image (multiple lines, approximately 1K of text over multiple pages). The V4.x output is attached as a separate text file.
Expected Behavior:
Expecting version 5.2+ of Tesseract to at least replicate the behavior of prior versions in extracting text from this sample TIF.
Suggested Fix:
Correct the text extraction to match the output from previous Tesseract versions. We are concerned about this regression in Tesseract's ability to extract text from Excel files.