
Improves performance of operator Transpose #5550

Merged · 16 commits · Nov 11, 2020
Conversation

@xadupre (Member) commented Oct 20, 2020

Description:
Improves the implementation of operator Transpose; see issue #5538. The PR merges two functions: the first increments a multi-index (one index per dimension of a tensor), and the second computes the address mapped to this multi-index. The new implementation reduces the number of operations needed to increment the multi-index and update the address.

Performance goes from 10000 ms to 3700 ms for the model mentioned in issue #5538. Transpose is around 40% faster and copies the data more efficiently when the transposition is equivalent to a reshape. The speed-up depends on the tensor shape: more dimensions mean a bigger speed-up, and a big last dimension also means a bigger speed-up.
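
As a rough illustration of the merged function (a minimal sketch with illustrative names and layout, not the PR's exact code): each axis keeps its current index, its upper bound, and its byte stride in the source, so a single pass both advances the multi-index and updates the source address.

#include <cstddef>
#include <cstdint>
#include <vector>

struct MultiIndex {
  size_t index;        // current position along this axis
  size_t upper_bound;  // dimension size along this axis
  int64_t stride;      // byte stride in the source tensor for this axis
};

// Advance the multi-index by one output element and update the source
// pointer in the same pass, instead of recomputing the address from scratch.
inline void IncrementIndexAndComputeOffset(std::vector<MultiIndex>& mindex,
                                           const uint8_t*& src) {
  for (size_t i = mindex.size(); i-- > 0;) {
    MultiIndex& m = mindex[i];
    src += m.stride;
    if (++m.index < m.upper_bound)
      return;  // common case: one addition and one comparison, no carry
    src -= m.stride * static_cast<int64_t>(m.index);  // rewind this axis to zero
    m.index = 0;  // carry into the next outer axis
  }
}

Since most steps only touch the innermost axis, the common case costs one addition and one comparison; the rewind-and-carry path only runs when an axis wraps around, which is where the reduction in operations comes from.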

Motivation and Context
Performance.

@xadupre xadupre requested a review from a team as a code owner October 20, 2020 15:36
@xadupre xadupre changed the title from "[WIP] Improves performance of operator Transpose" to "Improves performance of operator Transpose" Oct 21, 2020
@wangyems (Contributor)

Thanks for the PR! The huggingface/Albert model (which uses many Einsum ops) is ~2x faster because of this change.

@xadupre xadupre requested review from hariharans29 and a team November 2, 2020 13:45
@@ -52,35 +94,38 @@ static void DoTransposeImpl(int64_t num_axes, const std::vector<int64_t>& target
size_t num_blocks, size_t num_elts_in_block, const std::vector<size_t>& stride,
const uint8_t* source, uint8_t* target, size_t element_size) {
size_t blocksize = num_elts_in_block * element_size;
MultiIndex* mindex = (MultiIndex*)alloca(num_axes * sizeof(MultiIndex));
size_t naxes = IncrementIndexAndComputeOffsetSetup(mindex, num_axes, target_dims, stride, element_size);
@skottmckay (Contributor) commented Nov 9, 2020
Do we need to use alloca here? Can we instead have an array on the stack with, say, 10 axes (more than a model should ever need) to keep the code simpler?

@xadupre (Member, Author)

I don't mind, but I'll have to add a check that the number of dimensions does not exceed the buffer, and fall back to this behaviour in that case.

Member
I think we have some implementations that simply reject inputs greater than a certain rank (usually 8 or 10), and maybe we could just do that here too: have an enforce on num_axes and just use a fixed-size array of MultiIndex to avoid alloca?
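
For illustration, the guard being suggested might look like this (a sketch reusing num_axes and MultiIndex from the diff hunk above; the cap of 10, the name kTransposeMaxAxes, and the message are assumptions, not the PR's final code; ORT_ENFORCE is onnxruntime's standard check macro):

constexpr int64_t kTransposeMaxAxes = 10;  // illustrative cap, "more than a model should ever need"

// Reject higher-rank inputs up front, then use a plain fixed-size stack array.
ORT_ENFORCE(num_axes <= kTransposeMaxAxes,
            "Transpose supports at most ", kTransposeMaxAxes, " axes.");
MultiIndex mindex[kTransposeMaxAxes];  // no alloca needed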

Contributor
As this isn't part of the inner loop, why do we need to do a stack allocation vs. just using std::vector?

@xadupre (Member, Author)
I'll replace it with std::vector. I used alloca because the array is not big and alloca is supposed to be faster. Just out of curiosity, is there any reason alloca should be avoided?

Contributor
Mainly that people cut-and-paste code without necessarily understanding the nuances. alloca has the potential to cause a stack overflow that would be difficult to debug in the wild. I'm not aware of any other usages of alloca in the code base, so we need to think carefully before introducing this sort of thing. Whilst it may be 'faster', when it's one call per Transpose I wouldn't expect it to be noticeable even if you were looking very closely. So I don't think the limited benefit warrants the potential cost of introducing alloca usage here.
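
The change that resolves the thread is then a small swap in DoTransposeImpl (a sketch reusing the names from the diff hunk above; the merged code may differ in detail):

// Replace the alloca call with a std::vector: one heap allocation per
// Transpose call, outside the inner copy loop.
std::vector<MultiIndex> mindex(static_cast<size_t>(num_axes));
size_t naxes = IncrementIndexAndComputeOffsetSetup(mindex.data(), num_axes, target_dims, stride, element_size);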



@hariharans29 (Member)

/azp run MacOS CI Pipeline

@skottmckay (Contributor) left a comment

:shipit:

@xadupre xadupre merged commit e5c8040 into microsoft:master Nov 11, 2020
snnn added a commit that referenced this pull request Nov 24, 2020
guoyu-wang pushed a commit that referenced this pull request Nov 24, 2020
xadupre added a commit that referenced this pull request Dec 2, 2020
…ranspose) (#5916)

* Improves implementation of transpose operator
* Fix issue mentioned in #5911
* adding unit test for function DoTransposeImpl
duli2012 pushed a commit that referenced this pull request Dec 8, 2020
…ranspose) (#5916)

* Improves implementation of transpose operator
* Fix issue mentioned in #5911
* adding unit test for function DoTransposeImpl
duli2012 added a commit that referenced this pull request Dec 9, 2020
* Fix PR #5550 reverted in #5911 (performance improvement for operator Transpose) (#5916)

* Improves implementation of transpose operator
* Fix issue mentioned in #5911
* adding unit test for function DoTransposeImpl

* Make operator TreeEnsemble 5x faster for batches of size 100,000 (#5965)

* improves processing time by 10
* extend unit test coverage
* better implementation for the multi-regression case
* better comments; keep parallelization by trees when there are not enough trees

* Initialize a structure in operator ReduceSum (#6005)

* fix initialisation issue

* Fuse MatMulIntegerToFloat only when scales are scalar (#6008)

MatMulIntegerToFloat fusion fuses per-row and per-column MatMulInteger, which is not yet supported by the MatMulIntegerToFloat kernel. Limit the fusion to per-matrix only until per-channel is fully supported.

* Disable Python 3.9 for training Python packaging build. (#6012)

Disable Python 3.9 for training Python packaging build. Python 3.9 is not supported by the PyTorch dependency.

* Fix bugs: 1) Calibrator should check model inputs; 2) quantize_inputs forgot to use parameter initializer_use_weight_qtyp. (#6017)

* Bump highlight.js from 10.2.1 to 10.4.1 in /nodejs

Bumps [highlight.js](https://github.com/highlightjs/highlight.js) from 10.2.1 to 10.4.1.
- [Release notes](https://github.com/highlightjs/highlight.js/releases)
- [Changelog](https://github.com/highlightjs/highlight.js/blob/master/CHANGES.md)
- [Commits](highlightjs/highlight.js@10.2.1...10.4.1)

Signed-off-by: dependabot[bot] <[email protected]>

* Work around the build break on macOS (#6069)

* Fix the build break in the macOS release

* Revert Android change

* Bump up API version for 1.6 release (#6076)

* Update version to 1.6.0 (#6041)

* Update version to 1.6.0

* Add v 1.5.3 info

* Updating WindowsAI and ONNX version

Co-authored-by: Du Li <duli@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* Revert "Fuse MatMulIntegerToFloat only when scales are scalar (#6008)"

This reverts commit beb950e.

Co-authored-by: Xavier Dupré <[email protected]>
Co-authored-by: Yufeng Li <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Zhang Lei <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Pranav Sharma <[email protected]>
Co-authored-by: Du Li <duli@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
@xadupre xadupre deleted the transpose branch September 28, 2021 22:00