Regression in extracting text from Excel TIF image #4014

jwmepiq · 2023-02-06T22:15:36Z

Environment

ExcelTest_Bug.zip
ExcelTest_text_TesseractV4.txt

Tesseract Version: 5.2.0 vs. 4.1.1.-rc2-37-gcla5
Ubuntu 20.04.3 LTS

Current Behavior:

With the attached TIF image of an Excel file (in the zip), Tesseract 5.2.0 extracts only a single line of text ("hiding rows 15 through 20"). Earlier versions, namely the 4.1.1 build noted above and likely others as well, extract significantly more text from the same TIF image: multiple lines, approximately 1 KB of text over multiple pages. The V4.x output is attached as a separate text file.

Expected Behavior:

Version 5.2+ of Tesseract should at least replicate the behavior of prior versions when extracting text from this sample TIF.

Suggested Fix:

Correct the text extraction to match the output from previous Tesseract versions. This regression in Tesseract's ability to extract text from Excel files is a concern.

vamsiyadavmolli · 2023-02-15T15:26:21Z

This is an issue in version 5.3 as well. We tried both the best and fast trained data and still see this issue.

dhairyagupta2603 · 2023-10-04T10:29:52Z

Hey, just wanted to know the status of this issue. Can I take this up?

stweil · 2023-10-04T11:16:32Z

I can confirm the issue with the latest code. What about releases between 4.1.1 (working) and 5.2.0 (not working)? Can we narrow down which release introduced the regression?

stweil · 2023-10-04T12:17:10Z

According to git bisect, the regression was introduced by commit 842cca1, so release 5.0.0-beta-20210916 was the last one without this issue.

…-ocr#4014) "auto" resulted in unsigned numbers, but htext_score and vtest_score can be negative. Fixes: 842cca1 ("Fix more signed/unsigned compiler warnings") Signed-off-by: Stefan Weil <[email protected]>
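For readers unfamiliar with this class of bug, here is a minimal sketch of how `auto` can silently turn a score unsigned. It is not the actual Tesseract code: the function names, parameters, and the idea of computing a score as hits minus misses are all assumptions made for illustration. Only the general mechanism (a score that can be negative being deduced as an unsigned type) comes from the commit message above.

```cpp
#include <cassert>
#include <iostream>
#include <type_traits>

// Hypothetical stand-in, NOT the Tesseract implementation.
// Both operands are unsigned, so "auto" deduces unsigned int and a
// negative difference wraps around to a very large positive value.
bool prefers_horizontal_buggy(unsigned h_hits, unsigned h_misses,
                              unsigned v_hits, unsigned v_misses) {
  auto h_score = h_hits - h_misses;  // deduced as unsigned int
  auto v_score = v_hits - v_misses;  // deduced as unsigned int
  static_assert(std::is_unsigned<decltype(h_score)>::value,
                "auto deduced an unsigned type here");
  return h_score > v_score;
}

// Same comparison with explicitly signed scores, so a negative score
// stays negative and compares as expected.
bool prefers_horizontal_fixed(unsigned h_hits, unsigned h_misses,
                              unsigned v_hits, unsigned v_misses) {
  int h_score = static_cast<int>(h_hits) - static_cast<int>(h_misses);
  int v_score = static_cast<int>(v_hits) - static_cast<int>(v_misses);
  return h_score > v_score;
}

// Small demonstration: the horizontal score is 2 - 5 = -3 and the
// vertical score is 4 - 1 = 3, so horizontal should NOT be preferred.
void demo() {
  std::cout << std::boolalpha
            << "buggy: " << prefers_horizontal_buggy(2, 5, 4, 1) << '\n'
            << "fixed: " << prefers_horizontal_fixed(2, 5, 4, 1) << '\n';
  assert(!prefers_horizontal_fixed(2, 5, 4, 1));
}
```

Calling `demo()` from a `main` shows the two variants disagreeing on the same inputs. A flipped comparison of this kind in a layout decision is one plausible way for most of a page's text to be dropped, though the exact path in Tesseract is not described in this thread.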

stweil · 2023-10-04T13:05:14Z

Pull request #4136 fixes the regression.

stweil · 2023-10-04T13:07:42Z

@jwmepiq, thank you for reporting this nasty regression.

tfmorris · 2023-10-23T18:00:22Z

Now that #4136 has been merged and 5.3.3 released, I assume this can be closed.

stweil · 2023-10-23T18:12:18Z

Yes, thanks.

stweil added the layout analysis label Oct 4, 2023

stweil added bug regression labels Oct 4, 2023

stweil mentioned this issue Oct 6, 2023

Update tesseract (new release 5.3.3) OCR-D/ocrd_all#391

Closed

stweil closed this as completed Oct 23, 2023

vivadavid mentioned this issue Feb 5, 2024

Selecting multiple languages for OCR cyanfish/naps2#305

Closed
