-
Notifications
You must be signed in to change notification settings - Fork 532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARIMA - Kalman loop rewrite: single megakernel instead of host loop #4006
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Louis for this PR, it is exciting to see the speedup offered by these changes!
The implementation looks good in general. My main question is whether the implementation of the block linalg prims in block.cuh
should belong here, or to the raft repository? Tagging @teju85, in case he has some input on these points:
- From ARIMA's point of view they are temporary helper functions, and that would justify having them in cuML.
- One could also argue that it would be a more natural fit to place these in RAFT, even if they are not overly optimized. This fact can be marked in their docstring.
Further question:
- "my version of the loading function seemed to work much faster than the one I adapted from raft" Is this specific to your use case, or do you have any suggestions for improvement for contractions.cuh?
@tfeher Thanks for your review!
The loading function in raft loads first from global memory to registers, then registers to shared memory. This increases the register pressure, which is a bottleneck for my kernel for which occupancy is limited by the number of registers used. So it's not necessarily something to change in raft, just in my case.
These prims can clearly be useful for other algorithms, but as you said they're more of a temporary solution and not very optimized. I don't mind making the API a bit more generic and adding a perf warning, if @teju85 and other PICs think it belongs to raft. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Louis for fixing the issues and adding more tests! The PR looks good to me.
I had an offline discussion with @teju85 about the location of the block linalg prims. Currently ARIMA is the only use case for them, therefore we can keep them at the current location (unless someone else has strong feelings otherwise).
Please update the PR description: the TODO items should preferably go to the ARIMA optimization tracker, instead of the description (which will be part of the commit message). Additionally "part 1" could be removed from the PR title.
@tfeher Thanks for the review. I've shortened the PR description to what we want in the commit message. Tagging @rapidsai/cuml-cmake-codeowners for required approval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMake changes LGTM
Codecov Report
@@ Coverage Diff @@
## branch-21.08 #4006 +/- ##
===============================================
Coverage ? 85.59%
===============================================
Files ? 230
Lines ? 18221
Branches ? 0
===============================================
Hits ? 15596
Misses ? 2625
Partials ? 0
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report at Codecov.
|
@gpucibot merge |
…apidsai#4006) This PR brings **speedups** of the order of **10x** to seasonal ARIMA. It replaces a legacy host loop based on cuBLAS batched operations and RAPIDS prims, with a custom kernel, reducing launch overheads and unnecessary reads and writes in global memory. On top of that, it paves the way for support for missing observations. The PR introduces a set of prims in `linalg/block.cuh` to compute block-local linear algebra operations, and corresponding unit tests in `test/prims/linalg_block.cu`. Authors: - Louis Sugy (https://github.com/Nyrio) Approvers: - Tamas Bela Feher (https://github.com/tfeher) - Robert Maynard (https://github.com/robertmaynard) URL: rapidsai#4006
This PR brings speedups of the order of 10x to seasonal ARIMA. It replaces a legacy host loop based on cuBLAS batched operations and RAPIDS prims, with a custom kernel, reducing launch overheads and unnecessary reads and writes in global memory. On top of that, it paves the way for support for missing observations.
The PR introduces a set of prims in
linalg/block.cuh
to compute block-local linear algebra operations, and corresponding unit tests intest/prims/linalg_block.cu
.