Improves performance of operator Transpose #5550
Conversation
Thanks for the PR! The huggingface/Albert model (which uses many Einsum ops) is ~2x faster because of this change.
@@ -52,35 +94,38 @@ static void DoTransposeImpl(int64_t num_axes, const std::vector<int64_t>& target
                            size_t num_blocks, size_t num_elts_in_block, const std::vector<size_t>& stride,
                            const uint8_t* source, uint8_t* target, size_t element_size) {
  size_t blocksize = num_elts_in_block * element_size;
  MultiIndex* mindex = (MultiIndex*)alloca(num_axes * sizeof(MultiIndex));
  size_t naxes = IncrementIndexAndComputeOffsetSetup(mindex, num_axes, target_dims, stride, element_size);
Do we need to use alloca here? Can we instead have an array on the stack with say 10 axes (more than a model should ever need) to keep the code simpler?
I don't mind, but I'll have to add a test to check that the dimension does not exceed the buffer and fall back to this behaviour in that case.
I think we have some implementations that simply reject inputs greater than a certain rank (usually 8 or 10), and maybe we could just do that here too: have an enforce on num_axes and a fixed-size array of MultiIndex to avoid alloca?
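For illustration, a rough sketch of that suggestion. The MultiIndex layout, kMaxTransposeRank, and DoTransposeSketch below are assumptions made up for this example, not the actual ORT code; in ONNX Runtime the rank check would typically be an ORT_ENFORCE.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Assumed rank limit: "more than a model should ever need".
constexpr std::size_t kMaxTransposeRank = 10;

// Assumed per-axis bookkeeping; the real MultiIndex may differ.
struct MultiIndex {
  std::size_t index = 0;    // current position along this axis
  std::size_t upper = 0;    // number of elements along this axis
  std::int64_t stride = 0;  // byte offset added when this axis advances
};

void DoTransposeSketch(std::size_t num_axes) {
  // Reject unsupported ranks up front instead of allocating a variable-length buffer.
  if (num_axes > kMaxTransposeRank)
    throw std::invalid_argument("Transpose: rank exceeds the supported maximum");
  MultiIndex mindex[kMaxTransposeRank];  // fixed-size stack buffer, no alloca
  (void)mindex;
  // ... fill mindex and run the copy loop as before ...
}
```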
As this isn't part of the inner loop, why do we need to do a stack allocation vs. just using std::vector?
I'll replace it with std::vector. I used alloca because the array is not big and alloca is supposed to be faster. Just out of curiosity, is there any reason alloca should be avoided?
Mainly that people cut and paste code without necessarily understanding the nuances. alloca has the potential to cause a stack overflow that would be difficult to debug in the wild. I'm not aware of any other usages of alloca in the code base, so we need to think carefully before introducing this sort of thing. Whilst it may be 'faster', when it's one call per Transpose I wouldn't expect that to be noticeable even if you were looking very, very closely. So based on that, I don't think the potential cost vs. limited benefit warrants introducing alloca usage here.
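For reference, a minimal sketch of the std::vector alternative, reusing the two lines shown in the diff above (MultiIndex, IncrementIndexAndComputeOffsetSetup and the surrounding arguments come from that diff; this is not the final patch):

```cpp
// One allocation per Transpose call, outside the inner copy loop, so the heap
// cost is negligible and there is no risk of stack overflow.
std::vector<MultiIndex> mindex(static_cast<size_t>(num_axes));
size_t naxes = IncrementIndexAndComputeOffsetSetup(mindex.data(), num_axes, target_dims, stride, element_size);
```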
/azp run MacOS CI Pipeline
* Fix PR #5550 reverted in #5911 (performance improvement for operator Transpose) (#5916)
  * Improves implementation of transpose operator
  * Fix issue mentioned in #5911
  * Adding unit test for function DoTransposeImpl
* Make operator TreeEnsemble 5x faster for batches of size 100,000 (#5965)
  * Improves processing time by 10
  * Extend unit test coverage
  * Better implementation for the multi-regression case
  * Better comment, keep parallelization by trees when there are not enough trees
* Initialize a structure in operator ReduceSum (#6005)
  * Fix initialisation issue
* Fuse MatMulIntegerToFloat only when scales are scalar (#6008)
  MatMulIntegerToFloat fusion fuses per-row and per-column MatMulInteger, which is not supported by the MatMulIntegerToFloat kernel yet. Limit the fusion to per-matrix only until per-channel is fully supported.
* Disable Python 3.9 for training Python packaging build (#6012)
  Python 3.9 is not supported by the PyTorch dependency.
* Fix two bugs: 1) Calibrator should check model inputs; 2) quantize_inupts forgot to use parameter initializer_use_weight_qtyp (#6017)
* Bump highlight.js from 10.2.1 to 10.4.1 in /nodejs
  Bumps [highlight.js](https://github.com/highlightjs/highlight.js) from 10.2.1 to 10.4.1: [Release notes](https://github.com/highlightjs/highlight.js/releases), [Changelog](https://github.com/highlightjs/highlight.js/blob/master/CHANGES.md), [Commits](highlightjs/highlight.js@10.2.1...10.4.1). Signed-off-by: dependabot[bot] <[email protected]>
* Work around the build break on Mac (#6069)
  * Fix the build break in the macOS release
  * Revert Android change
* Bump up API version for the 1.6 release (#6076)
* Update version to 1.6.0 (#6041)
  * Update version to 1.6.0
  * Add v1.5.3 info
  * Updating WindowsAI and ONNX version
  Co-authored-by: Du Li <duli@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Revert "Fuse MatMulIntegerToFloat only when scales are scalar (#6008)"
  This reverts commit beb950e.

Co-authored-by: Xavier Dupré <[email protected]>
Co-authored-by: Yufeng Li <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Zhang Lei <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Pranav Sharma <[email protected]>
Co-authored-by: Du Li <duli@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Description:
Improves the implementation of operator Transpose, see issue #5538. The PR merges two functions: the first increments a multi-index (one index per dimension of the tensor), and the second computes the address mapped to this multi-index. The new implementation reduces the number of operations needed to increment the multi-index and to update the address.
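To make the idea concrete, here is a minimal sketch of an increment that updates the offset in the same pass instead of recomputing it from the full multi-index (AxisState, its fields, and IncrementIndexAndOffset are hypothetical names for this illustration, not the functions in the PR):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct AxisState {
  std::size_t index;   // current position along this axis
  std::size_t upper;   // number of elements along this axis
  std::int64_t step;   // offset delta when this axis advances by one
  std::int64_t reset;  // offset delta when this axis wraps, typically -(upper - 1) * step
};

// Advances the multi-index by one position (last axis fastest) and returns the
// updated offset; in the common case this is a single comparison and addition.
inline std::int64_t IncrementIndexAndOffset(std::vector<AxisState>& axes, std::int64_t offset) {
  for (std::size_t i = axes.size(); i-- > 0;) {
    AxisState& a = axes[i];
    if (++a.index < a.upper) {
      return offset + a.step;  // no recomputation of the full index-to-address mapping
    }
    a.index = 0;        // this axis wraps around ...
    offset += a.reset;  // ... rewind its contribution and carry into the next axis
  }
  return offset;  // the whole multi-index wrapped: end of iteration
}
```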
Performance goes from 10000ms to 3700ms for the model mentioned in issue #5538. Transpose is around 40% faster and copies the data more efficiently when the transposition is equivalent to a reshape. The speedup depends on the tensor shape: more dimensions mean a bigger speedup, and a large last dimension also means a bigger speedup.
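On the reshape point: a transposition changes nothing in memory when the permutation keeps all axes of extent greater than one in their original relative order, so the whole tensor can be copied as a single block. A sketch of such a check follows (an illustration of the idea, not the exact test used in the PR):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when the permutation only moves axes of size 1, i.e. the
// transposed tensor has the same memory layout as the input and a single
// contiguous copy is enough.
bool TransposeIsReshape(const std::vector<std::int64_t>& input_dims, const std::vector<std::size_t>& perm) {
  std::vector<std::size_t> kept;  // non-unit axes, in their order after permutation
  for (std::size_t axis : perm) {
    if (input_dims[axis] != 1) kept.push_back(axis);
  }
  for (std::size_t i = 1; i < kept.size(); ++i) {
    if (kept[i] <= kept[i - 1]) return false;  // relative order of non-unit axes changed
  }
  return true;
}
```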
Motivation and Context
Performance.