Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with swapping GenericVector for std::vector #3473

Closed
nocun opened this issue Jun 29, 2021 · 16 comments
Closed

Issues with swapping GenericVector for std::vector #3473

nocun opened this issue Jun 29, 2021 · 16 comments

Comments

@nocun
Copy link
Contributor

nocun commented Jun 29, 2021

Environment

Problem

Replacing GenericVector with std::vector has resulted in a small execution time performance decrease (about 4% on my simple test). I am not sure whether some other bugs have been introduced this way.

Most of the issues I could see are from changing generic_vector.init_to_size(n, x); directly into vector.resize(n, x);. It is because init_to_size() not only resizes the vector but also resets all of its values (not only the new ones).

Suggested Fix

Recheck the logic and update GenericVector's changed lines:

gv.init_to_size(n, x) and gv.resize(n, x) should become:

v.clear();
v.resize(n, x);

gv.truncate(n) (other than truncate(0) which is just v.clear()) should become:

if (n < v.size()) {
    v.resize(n);
}

gv.resize_no_init(n) (I think this one was ok):

v.resize(n);

I have added an example in comments of a commit where the first performance hit was noted .

As far as I could see the code affected is within commit range 92b6c652..65d882f9. I might be able to provide a PR but this could take some time.

@stweil stweil added the bug label Jun 29, 2021
@stweil stweil added this to the 5.0.0 milestone Jun 29, 2021
@stweil
Copy link
Contributor

stweil commented Jun 29, 2021

Thank you for this valuable hint. What do you mean with the "performance decrease" of 4 %? Increased character error rates? Larger execution time?

@nocun
Copy link
Contributor Author

nocun commented Jun 29, 2021

Larger execution time measured between bb6dbd2c and c583ecef.

@stweil
Copy link
Contributor

stweil commented Jun 29, 2021

I am afraid that the necessary fixes will increase the execution time even further.

@nocun
Copy link
Contributor Author

nocun commented Jun 29, 2021

The difference is not large but measureable. After applying changes mentioned in the comments here execution time went back to normal so I have some hope.

There is yet another performance issue like that introduced in later commits but I have not found it yet.

@egorpugin
Copy link
Contributor

We need some script to find such incorrect fixes in git history and update the places.

stweil added a commit to stweil/tesseract that referenced this issue Jun 29, 2021
@stweil
Copy link
Contributor

stweil commented Jun 29, 2021

@nocun, can you try whether pull request #3474 changes the performance for you?

@nocun
Copy link
Contributor Author

nocun commented Jun 29, 2021

Sure, I'll be happy to test it but from my initial grepping it looks like there were changes in a lot of places.

I just realised I pasted a wrong commit range, the GenericVector changes started in Oct 2020 and finished in Mar 2021, commits: 92b6c652..65d882f9. Updated the issue description, sorry about that.

This is quite a lot of code to go through.

@nocun
Copy link
Contributor Author

nocun commented Jun 29, 2021

If you have other issues to work on I may try and change all the wrong code I can find but it will take me a few days.

stweil added a commit to stweil/tesseract that referenced this issue Jun 29, 2021
stweil added a commit to stweil/tesseract that referenced this issue Jun 29, 2021
egorpugin pushed a commit that referenced this issue Jun 29, 2021
@stweil
Copy link
Contributor

stweil commented Jun 29, 2021

GenericVector changes started in Oct 2020 and finished in Mar 2021, commits: 92b6c65..65d882f

My fix now covers that whole range. I looked for all replacements of init_to_size.

@nocun
Copy link
Contributor Author

nocun commented Jun 29, 2021

That was quick!

I ran your merged code from master and it looks better (runs faster) but still not exactly the same as originally. You can try to find and replace problematic GenericVector::resize(x, n) (resize is just an alias for init_to_size) and truncate(n) code conversions (resize in particular may be difficult to grep for) or I can bisect the other issue tomorrow.

Thank you.

@nocun
Copy link
Contributor Author

nocun commented Jun 30, 2021

Looks like the other thing may be related to some compiler optimization. The time differences are pretty slim but show up consistently.

The commit that introduced additional delays: 2a3682a3

Created a PR that makes the problem go away on my machine: #3481

Is there a CI pipeline that can track and report tesseract performance?

If you see no other problems with GenericVector replacements then this issue can be closed.

@egorpugin
Copy link
Contributor

Is there a CI pipeline that can track and report tesseract performance?

We don't have CI perf tests at the moment.

@egorpugin
Copy link
Contributor

The time differences are pretty slim

Can you also provide some numbers?

From 2a3682a I see copies in loops in two places

  1. 2a3682a#diff-1e9b4b30348fc110f27ca4095a364df6161b8ded2f04a2625674f06b09a1ab64R396
  2. 2a3682a#diff-53cf79bc0219e0046b2c3620c5c8d9aa31c2e45def2ccf11ae343a5e0a02c43aR204

Not sure if this is intended.

@nocun
Copy link
Contributor Author

nocun commented Jun 30, 2021

Can you also provide some numbers?

493620us vs. 479074us on GCC (min of ~300 runs).
Clang was just as fast with at() as with operator[].

I checked all the other code that looked weird, including copies in those 2 fors, vector<bool> and even reserve() + resize() pair which in the new code do initialize memory.

I was surprised to see that none of the above made any measurable difference, only getting the data pointer did.
I don't have any hard proofs of what optimizations are happening behind the scenes but my hunch is that GCC can see that this pointer can never be NULL after Init() and enables some optimizations down the line.

But this is a bit of a guessing game.
If you would like to test it on other platforms and defer the merge then thats ok of course.

We don't have CI perf tests at the moment.

Those would be very cool to have.

@stweil
Copy link
Contributor

stweil commented Nov 10, 2021

If you see no other problems with GenericVector replacements then this issue can be closed.

So can it be closed now, or is there anything still open?

@nocun
Copy link
Contributor Author

nocun commented Nov 12, 2021

@stweil yes, thank you so much for the support!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants