API: Different results for the same image depending on the order in which the files are processed #3452

nagadomi · 2021-06-07T03:05:13Z

Separated from #3200

Environment

Tesseract Version: 5.0.0 Alpha (master branch)
Commit Number: bf979c8
Platform: Linux mpn1 5.4.0-74-generic #83-Ubuntu SMP Sat May 8 02:35:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

When processing multiple files with the API, the same image can produce different results depending on the order in which the files are processed.
When the files are processed in the same order, the results are identical.

Expected Behavior:

The same result should be produced for the same image regardless of the processing order.

Note 1: I understand that multithreading and SIMD can cause minor differences in results, but this seems to be a different issue.
Note 2: The differences are mainly a result of the diplopia issue.

Simple reproduction code (with tesserocr)

from PIL import Image
from tesserocr import PyTessBaseAPI, PSM, tesseract_version

if __name__ == "__main__":
    TESSDATA_DIR="/home/nagadomi/dev/tesseract-git/tessdata_fast" # fill tessdata directory
    test_image = Image.open("test1.png")

    print(tesseract_version(), "\n")

    print("* case1 API re-use")
    with PyTessBaseAPI(path=TESSDATA_DIR, lang="jpn_vert", psm=PSM.SINGLE_BLOCK_VERT_TEXT) as api:
        api.SetVariable("preserve_interword_spaces", "1")
        variants = set()
        for t in range(100):
            api.SetImage(test_image)
            text = api.GetUTF8Text()
            variants.add(text)
        print(f"{len(variants)} different results")
        print("----\n".join(variants))

    print("* case2 API re-create")
    variants = set()
    for t in range(100):
        with PyTessBaseAPI(path=TESSDATA_DIR, lang="jpn_vert", psm=PSM.SINGLE_BLOCK_VERT_TEXT) as api:
            api.SetVariable("preserve_interword_spaces", "1")
            api.SetImage(test_image)
            text = api.GetUTF8Text()
            variants.add(text)
    print(f"{len(variants)} different results")
    print("----\n".join(variants))

test1.png:

result:

% OMP_THREAD_LIMIT=1 python3 variants.py
tesseract 5.0.0-alpha-20210401-108-gbf342
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1 

* case1 API re-use
2 different results
パバイボ
パバイボ
パイポの
シューリンガン
----
パバイボ
パバイボ
パバイポの
シューリンガン

* case2 API re-create
1 different results
パバイボ
パバイボ
パバイポの
シューリンガン

Note 1: This code uses tesserocr to make it easier to reproduce, but our codebase uses the Tesseract C-API via ctypes (it calls TessBaseAPISetImage, TessBaseAPIGetUTF8Text, TessBaseAPIClear), so I don't think it's a tesserocr issue.
Note 2: test1.png does not reproduce this problem with tessdata_best, but I have confirmed that the same problem occurs with tessdata_best (float model) on other images. I have not succeeded in creating a publishable test image that reproduces it.

amitdo · 2021-06-07T09:29:26Z

Can you reproduce this issue with a document written in English or another language that uses the Latin script? You say it is related to the diplopia issue, so maybe an image uploaded by other people in one of the diplopia issues can be used to reproduce your issue.

Also, try to reproduce the issue on the command line:

tesseract images out

images should be a text file containing a list of image files with full paths, one per line. You can give this file any name you want.

In your case it should contain something like this:

/path/to/test1.png
/path/to/test1.png
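
The list-file check can also be scripted. The following is a minimal sketch, not part of Tesseract: write_list_file, group_results and unstable_paths are hypothetical helper names, and it assumes the text output contains one page per listed path, in order, with pages separated by a form feed ("\x0c").

```python
from collections import defaultdict
from pathlib import Path


def write_list_file(list_path, image_path, repeats=2):
    """Write a list file that names the same image `repeats` times."""
    lines = [str(image_path)] * repeats
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def group_results(listed_paths, output_text):
    """Map each listed path to the set of distinct page texts produced for it.

    Assumption: one form-feed-separated page per listed path, in list order.
    """
    pages = output_text.split("\x0c")
    grouped = defaultdict(set)
    for path, page in zip(listed_paths, pages):
        grouped[path].add(page)
    return dict(grouped)


def unstable_paths(grouped):
    """Return the paths that produced more than one distinct result."""
    return sorted(p for p, texts in grouped.items() if len(texts) > 1)
```

After running tesseract on the generated list file, the contents of the resulting text file would be passed to group_results together with the listed paths; any path returned by unstable_paths was recognized differently on a later pass.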

nagadomi · 2021-06-07T11:23:33Z

Steps to reproduce this issue using the command line and ocrd-testset.zip:

% unzip ocrd-testset.zip -d ocrd-testset
% find ocrd-testset -name "*.tif" | sort > list1.txt
% find ocrd-testset -name "*.tif" | shuf --random-source list1.txt > list2.txt
% tesseract list1.txt out1 -l frk --tessdata-dir ../../tessdata_fast/
% tesseract list2.txt out2 -l frk --tessdata-dir ../../tessdata_fast/
% tesseract list1.txt out3 -l frk --tessdata-dir ../../tessdata_fast/

check_diff.py

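def load_result(list_file, result_file):
    # Pair each listed file with its page of OCR output (pages are separated by form feed).
    with open(list_file, encoding="utf-8") as f1, open(result_file, encoding="utf-8") as f2:
        files = f1.read().splitlines()
        results = f2.read().split("\x0c")
        return {fn: ret for fn, ret in zip(files, results)}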

def print_diff(result1, result2):
    for key in result1.keys():
        if result1[key] != result2[key]:
            print(f"---- {key}")
            print(result1[key])
            print(result2[key])

result1 = load_result("list1.txt", "out1.txt")
result2 = load_result("list2.txt", "out2.txt")
result3 = load_result("list1.txt", "out3.txt")
assert(set(result1.keys()) == set(result2.keys()))

print("* out1 x out2")
print_diff(result1, result2)

print("* out1 x out3")
print_diff(result1, result3)

% python3 check_diff.py
* out1 x out2
---- ocrd-testset/bismarck_erinnerungen02_1898_0274_002.tif
obligatur fann durc4 feine Bertragsclaujel außer Kraft gejebt

obligatur fann durc; feine Bertragsclaujel außer Kraft gejebt

---- ocrd-testset/mueller_waldhornist_1821_0051_015.tif
n Sturm und Regen und Schnee,

In Sturm und Regen und Schnee,

* out1 x out3

out1 and out2 were processed in a different order, and the results differ. out1 and out3 were processed in the same order, and there is no difference. This is why I think the results depend on the processing order.
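
The same comparison can be expressed as a small harness. This is only a sketch under assumptions: ocr_batch is a hypothetical callable standing in for whatever runs OCR over a list of paths in one session (for example a tesseract list-file run, or a re-used API object) and returns one text per path. Nothing here calls Tesseract itself.

```python
import random


def order_dependent_files(paths, ocr_batch, trials=3, seed=0):
    """Return the files whose OCR text changes when the processing order changes.

    ocr_batch(paths) must return a list of texts, one per path, in the same order.
    """
    # Baseline: results for the original order.
    baseline = dict(zip(paths, ocr_batch(list(paths))))
    rng = random.Random(seed)  # fixed seed keeps the shuffles reproducible
    unstable = set()
    for _ in range(trials):
        shuffled = list(paths)
        rng.shuffle(shuffled)
        # Compare each file's text in the shuffled run against the baseline.
        for path, text in zip(shuffled, ocr_batch(shuffled)):
            if text != baseline[path]:
                unstable.add(path)
    return sorted(unstable)
```

An empty return value means no order dependence was observed in the sampled orderings; it does not prove that none exists.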

nagadomi · 2021-06-08T01:00:57Z

I could not find any images in the diplopia issue threads that reproduce this issue.
I guess this issue occurs when some of the outputs of the softmax function are close to each other, so it is hard to reproduce for languages with few symbols, like English.
Japanese has many symbols, some of which are very similar to each other, so it is easy to reproduce.
If a lot of images are used, as in the ocrd-testset.zip example above, it can be reproduced, but I think debugging will be more difficult.
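
To illustrate this guess, here is a small numeric sketch. It is not Tesseract code, and the logits are made up. It shows that when two softmax outputs are nearly tied, a very small change in the inputs is enough to flip the top choice, while a clearly separated class is unaffected.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_class(logits):
    """Index of the most probable class."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical logits: classes 0 and 1 are look-alike symbols, class 2 is not.
near_tie = [2.0000, 1.9999, -3.0]
# A tiny drift in the first logit, standing in for any state-dependent difference.
perturbed = [near_tie[0] - 0.0002, near_tie[1], near_tie[2]]

print(top_class(near_tie), top_class(perturbed))  # prints: 0 1
```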

With the above test1.png, the issue can be reproduced in two lines.

% echo test1.png > list.txt; echo test1.png >> list.txt
% tesseract list.txt stdout -l jpn_vert --psm 5 --tessdata-dir ../../tessdata_fast
Page 0 : test2.png
パ バ イボ
パ バ イボ
パ バイ ポ の
シュ ー リ ン ガ ン

Page 1 : test2.png
パ バ イボ
パ バ イボ
パイ ポ の
シュ ー リ ン ガ ン

nagadomi · 2021-06-30T01:29:41Z

This issue seems to be fixed by #3474 (#3473). It no longer reproduces on the latest master branch.
I have a feeling that the diplopia issue occurs more frequently than before, but I have no evidence for that.

nocun · 2021-06-30T09:29:53Z

Is this type of issue added to some test suite? It would be great to be able to catch these types of errors as they happen.

amitdo · 2021-06-30T11:49:19Z

The tests are here:

https://github.com/tesseract-ocr/tesseract/tree/master/unittest

nocun · 2021-06-30T15:43:19Z

Are those tests run daily on CI? Could you check whether they no longer fail?

stweil · 2021-07-03T13:55:45Z

I don't think that we already have a test which would have detected this issue, otherwise we would have noticed the failure early. Maybe you want to add one?

nocun · 2021-07-04T19:39:17Z

@stweil That is a good idea, I will try to write one.

nagadomi closed this as completed Jun 30, 2021

amitdo added the bug label Jul 11, 2021
