API: Different results for the same image depending on the order in which the files are processed #3452
Comments
Can you reproduce this issue with a document written in English or another language written in the Latin script? You say it is related to the diplopia issue, so maybe an image uploaded by other people in one of the diplopia issues can be used to reproduce your issue. Also, try to reproduce the issue in the command line:
In your case it should contain something like this:
|
Steps to reproduce this issue using the command line and ocrd-testset.zip:
% unzip ocrd-testset.zip -d ocrd-testset
% find ocrd-testset -name "*.tif" | sort > list1.txt
% find ocrd-testset -name "*.tif" | shuf --random-source list1.txt > list2.txt
% tesseract list1.txt out1 -l frk --tessdata-dir ../../tessdata_fast/
% tesseract list2.txt out2 -l frk --tessdata-dir ../../tessdata_fast/
% tesseract list1.txt out3 -l frk --tessdata-dir ../../tessdata_fast/
def load_result(list_file, result_file):
    # Pair each listed file with its page of OCR text (pages end with a form feed).
    with open(list_file) as f1, open(result_file) as f2:
        files = f1.read().split("\n")
        results = f2.read().split("\x0c")
    return {fn: ret for fn, ret in zip(files, results)}

def print_diff(result1, result2):
    # Print both texts for every file whose result differs between the two runs.
    for key in result1.keys():
        if result1[key] != result2[key]:
            print(f"---- {key}")
            print(result1[key])
            print(result2[key])

result1 = load_result("list1.txt", "out1.txt")
result2 = load_result("list2.txt", "out2.txt")
result3 = load_result("list1.txt", "out3.txt")
assert set(result1.keys()) == set(result2.keys())
# out1 and out2 use different orders; out1 and out3 use the same order.
print("* out1 x out2")
print_diff(result1, result2)
print("* out1 x out3")
print_diff(result1, result3)
|
I could not find any images in the diplopia issue threads that reproduce this issue. The above
% echo test1.png > list.txt; echo test1.png >> list.txt
% tesseract list.txt stdout -l jpn_vert --psm 5 --tessdata-dir ../../tessdata_fast
Page 0 : test2.png
パ バ イボ
パ バ イボ
パ バイ ポ の
シュ ー リ ン ガ ン
Page 1 : test2.png
パ バ イボ
パ バ イボ
パイ ポ の
シュ ー リ ン ガ ン |
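When the same file appears more than once in the list, a dictionary keyed by file name keeps only the last page, so repeated entries cannot be compared with each other. A minimal sketch that groups pages per file instead is shown below. The function names are hypothetical, and it assumes that the text output ends every page with a form feed, the same separator the comparison script in this thread splits on.

```python
from collections import defaultdict


def group_pages_by_file(list_text, output_text):
    """Map each file name to the list of OCR texts produced for it."""
    files = [line for line in list_text.splitlines() if line.strip()]
    pages = output_text.split("\x0c")
    # Assumption: every page ends with a form feed, so the last split
    # element is an empty trailer rather than a real page.
    if pages and pages[-1].strip() == "":
        pages = pages[:-1]
    if len(files) != len(pages):
        raise ValueError(f"{len(files)} files but {len(pages)} pages")
    grouped = defaultdict(list)
    for name, page in zip(files, pages):
        grouped[name].append(page)
    return dict(grouped)


def unstable_files(grouped):
    """Return the file names whose pages are not all identical."""
    return sorted(name for name, pages in grouped.items() if len(set(pages)) > 1)
```

Reading list.txt and the saved text output into strings and passing them to group_pages_by_file, then calling unstable_files on the result, would list every file that was recognized differently on different passes.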
Is this type of issue added to some test suite? It would be great to be able to catch these types of errors as they happen. |
The tests are here: https://github.com/tesseract-ocr/tesseract/tree/master/unittest |
Are those tests run daily on CI? Could you check whether they no longer fail? |
I don't think that we already have a test which would have detected this issue, otherwise we would have noticed the failure early. Maybe you want to add one? |
@stweil That is a good idea, I will try to write one. |
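A rough idea of what such a check could look like, written in Python only for illustration (the actual tests in the unittest directory are not Python, and nothing here is taken from them). The callable recognize_batch is hypothetical: it stands for any function that takes a list of unique, hashable items such as image paths and returns one result per item in the same order.

```python
import random


def order_dependent_items(recognize_batch, items, trials=5, seed=0):
    """Return the items whose result changes when the batch order changes.

    recognize_batch is a hypothetical callable; items must be unique.
    """
    # Results in the original order serve as the reference.
    baseline = dict(zip(items, recognize_batch(list(items))))
    rng = random.Random(seed)  # fixed seed keeps a failing order reproducible
    changed = set()
    for _ in range(trials):
        shuffled = list(items)
        rng.shuffle(shuffled)
        for item, result in zip(shuffled, recognize_batch(shuffled)):
            if result != baseline[item]:
                changed.add(item)
    return sorted(changed)
```

A test would then assert that the returned list is empty for a fixed set of images. The fixed seed is a design choice: it trades coverage of different orders for a failure that can be replayed exactly.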
Separated from #3200
Environment
Linux mpn1 5.4.0-74-generic #83-Ubuntu SMP Sat May 8 02:35:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Current Behavior:
When processing multiple files with the API, the result for the same image can differ depending on the order in which the files are processed.
Processing the files in the same order always produces the same result.
Expected Behavior:
The same result should be produced for the same image regardless of the processing order.
Note 1: I understand that multithreading and SIMD can cause minor differences in results, but this seems to be a different issue.
Note 2: The differences are mainly the result of the diplopia issue.
Simple reproduction code (with tesserocr)
test1.png:
result:
Note 1: This code uses tesserocr to make it easier to reproduce, but our codebase uses the Tesseract C-API via ctypes (it calls TessBaseAPISetImage, TessBaseAPIGetUTF8Text, TessBaseAPIClear), so I don't think it's a tesserocr issue.
Note 2: test1.png cannot reproduce this problem with tessdata_best, but I have confirmed that the same problem occurs with tessdata_best (float model). I have not succeeded in creating a publishable test image that can reproduce it.