Improves performance of operator Transpose #5550
Conversation
Thanks for the PR! The huggingface/Albert model (which uses many Einsum ops) is ~2x faster because of this change.
@@ -52,35 +94,38 @@ static void DoTransposeImpl(int64_t num_axes, const std::vector<int64_t>& target
                            size_t num_blocks, size_t num_elts_in_block, const std::vector<size_t>& stride,
                            const uint8_t* source, uint8_t* target, size_t element_size) {
  size_t blocksize = num_elts_in_block * element_size;
  MultiIndex* mindex = (MultiIndex*)alloca(num_axes * sizeof(MultiIndex));
  size_t naxes = IncrementIndexAndComputeOffsetSetup(mindex, num_axes, target_dims, stride, element_size);
Do we need to use alloca here? Can we instead have an array on the stack with say 10 axes (more than a model should ever need) to keep the code simpler?
I don't mind, but I'll have to add a test to check that the dimension does not exceed the buffer and fall back to this behaviour in that case.
I think we have some implementations that simply reject inputs greater than a certain rank (usually 8 or 10), and maybe we could just do that here too: have an enforce on num_axes and a fixed-size array of MultiIndex to avoid alloca?
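For illustration, a rough sketch of that suggestion. The MultiIndex layout, kMaxTransposeRank, and DoTransposeSketch below are assumptions made up for this example, not the actual ORT code; in ONNX Runtime the rank check would typically be an ORT_ENFORCE.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Assumed rank limit: "more than a model should ever need".
constexpr std::size_t kMaxTransposeRank = 10;

// Assumed per-axis bookkeeping; the real MultiIndex may differ.
struct MultiIndex {
  std::size_t index = 0;    // current position along this axis
  std::size_t upper = 0;    // number of elements along this axis
  std::int64_t stride = 0;  // byte offset added when this axis advances
};

void DoTransposeSketch(std::size_t num_axes) {
  // Reject unsupported ranks up front instead of allocating a variable-length buffer.
  if (num_axes > kMaxTransposeRank)
    throw std::invalid_argument("Transpose: rank exceeds the supported maximum");
  MultiIndex mindex[kMaxTransposeRank];  // fixed-size stack buffer, no alloca
  (void)mindex;
  // ... fill mindex and run the copy loop as before ...
}
```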
As this isn't part of the inner loop, why do we need to do a stack allocation vs. just using std::vector?
I'll replace it with std::vector. I used alloca because the array is not big and alloca is supposed to be faster. Just out of curiosity, is there any reason alloca should be avoided?
Mainly that people cut and paste code without necessarily understanding the nuances. alloca has the potential to cause a stack overflow that would be difficult to debug in the wild. I'm not aware of any other usages of alloca in the code base, so we need to think carefully before introducing this sort of thing. Whilst it may be 'faster', when it's one call per Transpose I wouldn't expect that to be noticeable even if you were looking very, very closely. So based on that, I don't think the potential cost vs. limited benefit warrants introducing alloca usage here.
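For reference, a minimal sketch of the std::vector alternative, reusing the two lines shown in the diff above (MultiIndex, IncrementIndexAndComputeOffsetSetup and the surrounding arguments come from that diff; this is not the final patch):

```cpp
// One allocation per Transpose call, outside the inner copy loop, so the heap
// cost is negligible and there is no risk of stack overflow.
std::vector<MultiIndex> mindex(static_cast<size_t>(num_axes));
size_t naxes = IncrementIndexAndComputeOffsetSetup(mindex.data(), num_axes, target_dims, stride, element_size);
```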
/azp run MacOS CI Pipeline
* Fix PR #5550 reverted in #5911 (performance improvement for operator Transpose) (#5916)
  * Improves implementation of transpose operator
  * Fix issue mentioned in #5911
  * Adding unit test for function DoTransposeImpl
* Make operator TreeEnsemble 5x faster for batches of size 100,000 (#5965)
  * Improves processing time by 10
  * Extend unit test coverage
  * Better implementation for the multi-regression case
  * Better comment, keep parallelization by trees when there are not enough trees
* Initialize a structure in operator ReduceSum (#6005)
  * Fix initialisation issue
* Fuse MatMulIntegerToFloat only when scales are scalar (#6008)
  MatMulIntegerToFloat fusion fuses per-row and per-column MatMulInteger, which is not supported by the MatMulIntegerToFloat kernel yet. Limit the fusion to per-matrix only until per-channel is fully supported.
* Disable Python 3.9 for training Python packaging build (#6012)
  Python 3.9 is not supported by the PyTorch dependency.
* Fix two bugs: 1) Calibrator should check model inputs; 2) quantize_inupts forgot to use parameter initializer_use_weight_qtyp (#6017)
* Bump highlight.js from 10.2.1 to 10.4.1 in /nodejs
  Bumps [highlight.js](https://github.com/highlightjs/highlight.js) from 10.2.1 to 10.4.1: [Release notes](https://github.com/highlightjs/highlight.js/releases), [Changelog](https://github.com/highlightjs/highlight.js/blob/master/CHANGES.md), [Commits](highlightjs/highlight.js@10.2.1...10.4.1). Signed-off-by: dependabot[bot] <[email protected]>
* Work around the build break on Mac (#6069)
  * Fix the build break in the macOS release
  * Revert Android change
* Bump up API version for the 1.6 release (#6076)
* Update version to 1.6.0 (#6041)
  * Update version to 1.6.0
  * Add v1.5.3 info
  * Updating WindowsAI and ONNX version
  Co-authored-by: Du Li <duli@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Revert "Fuse MatMulIntegerToFloat only when scales are scalar (#6008)"
  This reverts commit beb950e.

Co-authored-by: Xavier Dupré <[email protected]>
Co-authored-by: Yufeng Li <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Zhang Lei <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Pranav Sharma <[email protected]>
Co-authored-by: Du Li <duli@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Description:
Improves the implementation of operator Transpose, see issue #5538. The PR merges two functions: the first increments a multi-index (one index per dimension of the tensor), and the second computes the address mapped to this multi-index. The new implementation reduces the number of operations needed to increment the multi-index and to update the address.
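To make the idea concrete, here is a minimal sketch of an increment that updates the offset in the same pass instead of recomputing it from the full multi-index (AxisState, its fields, and IncrementIndexAndOffset are hypothetical names for this illustration, not the functions in the PR):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct AxisState {
  std::size_t index;   // current position along this axis
  std::size_t upper;   // number of elements along this axis
  std::int64_t step;   // offset delta when this axis advances by one
  std::int64_t reset;  // offset delta when this axis wraps, typically -(upper - 1) * step
};

// Advances the multi-index by one position (last axis fastest) and returns the
// updated offset; in the common case this is a single comparison and addition.
inline std::int64_t IncrementIndexAndOffset(std::vector<AxisState>& axes, std::int64_t offset) {
  for (std::size_t i = axes.size(); i-- > 0;) {
    AxisState& a = axes[i];
    if (++a.index < a.upper) {
      return offset + a.step;  // no recomputation of the full index-to-address mapping
    }
    a.index = 0;        // this axis wraps around ...
    offset += a.reset;  // ... rewind its contribution and carry into the next axis
  }
  return offset;  // the whole multi-index wrapped: end of iteration
}
```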
Performance goes from 10000ms to 3700ms for the model mentioned in issue #5538. Transpose is around 40% faster and copies the data more efficiently when the transposition is equivalent to a reshape. The speedup depends on the tensor shape: more dimensions mean a bigger speedup, and a large last dimension also means a bigger speedup.
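On the reshape point: a transposition changes nothing in memory when the permutation keeps all axes of extent greater than one in their original relative order, so the whole tensor can be copied as a single block. A sketch of such a check follows (an illustration of the idea, not the exact test used in the PR):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when the permutation only moves axes of size 1, i.e. the
// transposed tensor has the same memory layout as the input and a single
// contiguous copy is enough.
bool TransposeIsReshape(const std::vector<std::int64_t>& input_dims, const std::vector<std::size_t>& perm) {
  std::vector<std::size_t> kept;  // non-unit axes, in their order after permutation
  for (std::size_t axis : perm) {
    if (input_dims[axis] != 1) kept.push_back(axis);
  }
  for (std::size_t i = 1; i < kept.size(); ++i) {
    if (kept[i] <= kept[i - 1]) return false;  // relative order of non-unit axes changed
  }
  return true;
}
```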
Motivation and Context
Performance.