Fix problem bug in MobileBert with vectorization enable. #5211
Conversation
```cpp
  if (!op.getInputIndexingMap(i).isMinorIdentity()) return success();
}
if (!vectorize || outputShape.getShape().back() % lowerTs != 0)
  return success();
```
Don't we want to explicitly set `config.vectorize` to false before returning here?
> explicitly

It is false by default. Do you mean doing it for clarity?
Add it.
Thanks! Yup, it also helps when we later switch vectorization to be on by default.
It looks like this improves the performance of MobileNetV2 on S20... https://mako.dev/run?run_key=6441876564475904&~cpu=1&~vmla=1&~vlk=1 @antiagainst could you help verify whether ToT really hits 15 ms? I'm hitting some weirdness on the S20 in the lab, and I don't have one locally.
I tested on an S10 and I feel there are issues in the benchmarking infrastructure...
The variance is big. There are 8 iterations in one benchmark and the range is 64 ms ~ 90 ms. We probably want to pin the GPU frequency for lab phones. I still feel there is a performance improvement, so it'd be good if you could help check it.
* da64c93 Integrate MLIR-EmitC at iml130/mlir-emitc@dde739f (#5228)
* a8d6c2a Fix doc publication by escaping our IR snippets in markdown. (#5227)
* bd5e535 Introduce RVV VLS code-gen (#5199)
* d496bc9 Avoid allocating temporary buffer for tensors derived from read-only tensors (..
* 1a93557 Tidying up a few pages of documentation. (#5225)
* 46e8331 Add a dedicated iree_c_module CMake module (#5214)
* 603b208 Merge google -> main (#5219)
* c9f3742 Add support for control flow lowering in the VM to emitc target (#5208)
* c8a7b2f Sort descriptors as expected by the native module (#5207)
* 47183fb Revise compiler tracing to enable statistics aggregation. (#5218)
* 500ec7b Set author to match committer for llvm submodule update action. (#5217)
* 881de81 Generate and use the iree_hal_executable_library_t metadata. (#5195)
* cf8180e Increase K tile size for small matrices. (#5213)
* f5804ec Fix problem bug in MobileBert with vectorization enable. (#5211)

COPYBARA_INTEGRATE_REVIEW=#5229 from ScottTodd:main-to-google da64c93
PiperOrigin-RevId: 365076337
ToT is still going through the existing buffer-based path. This PR actually fixed the broken connection so that the existing buffer-based path can access known-good tiling configurations again (for both MobileBERT and MobileNet), so we should see an improvement. But I don't think we can achieve 15 ms yet; it should be around 20 ms. So it looks like we were over-optimistic here.

I think it's likely because we duplicate dispatches without inserting any barriers. That essentially increases the number of warps for each dispatch, helping fill the GPU. After dividing by the repeat count, we get a better number than we should. Certain dispatches in both models work on small problem sizes; increasing the warp count hides and "improves" their latency. Instead of duplicating dispatches, I think we want to submit the command buffer multiple times and average at that level. That keeps the original dispatch structure intact.
I think it still affects the existing buffer-based path because some passes are used, like

Yeah, good point... let me file an issue and move the discussion there. I think you are right; I will prototype the idea and see what's going on. This has bothered me for a while... Thank you sooooo much!