feat: add reduce kernels #3136
Conversation
@ManasviGoyal in trying to implement query 4 from the analysis benchmarks, I found some nasty memory scaling. If you scroll to the bottom of the trace below, you'll see what happens when attempting to execute:

This is processing ~53M rows from the input file all at once; the data fit in the GPU itself with no problem, and the same goes for the histogram being filled. For this test I merged #3123, #3142, and this PR (merged last) on top of awkward main. However, for the ak.sum where the calculation fails, it is attempting to allocate 71 terabytes of RAM on the device. This seems excessive and is indicative of poor memory-scaling properties in the implementation. You'll see that this fails in the

Here's the full stack trace:
Hi, I am still working on these kernels and need to fix a few things. I will update once I am done with this PR. The issue is most likely due to the use of
No worries - just reporting what I'm finding with things as they are. Thanks!
Yes, it's very helpful, since I can only test a limited number of cases, so knowing how it works for actual data helps in identifying the issues. Thanks! I will keep you updated.
The change to an accumulator with atomics certainly fixed the memory issue, though it's a bit slower than I expected for a sum: ~250 MHz throughput for summing bools into int64. As an optimization for sums over the last dimension, couldn't you write this without atomics or any race conditions by having each thread sum over the last dimension into an array of one fewer dimension? Or is the thread divergence too bad, so that atomics are still faster? With the atomic implementation you're guaranteed to have access contention, because every element hits the same output position to make the sum. I don't have good intuition for whether that's better or worse than thread divergence. @jpivarski maybe?
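For concreteness, here is a minimal numba.cuda sketch of the two strategies being discussed, not the PR's actual kernels: it assumes a flattened values buffer with a hypothetical `parents` index per element (for the atomic version) and `offsets` per list (for the per-thread version).

```python
import numpy as np
from numba import cuda


@cuda.jit
def reduce_sum_atomic(values, parents, out):
    # One thread per input element; every element of the same list
    # contends for the same output slot via an atomic add.
    i = cuda.grid(1)
    if i < values.shape[0]:
        cuda.atomic.add(out, parents[i], values[i])


@cuda.jit
def reduce_sum_per_list(values, offsets, out):
    # One thread per output list; each thread walks its own slice,
    # so there are no atomics and no write contention, at the cost of
    # divergence when list lengths differ a lot.
    j = cuda.grid(1)
    if j < out.shape[0]:
        acc = 0
        for k in range(offsets[j], offsets[j + 1]):
            acc += values[k]
        out[j] = acc
```

The per-list version would launch only one thread per output list, e.g. `reduce_sum_per_list[(n_lists + 255) // 256, 256](values, offsets, out)` for an assumed `n_lists` outputs, which trades the guaranteed write contention of the atomic version for possible divergence within a warp.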
In any case - with this latest change I've now got query 4 done on the ADL benchmarks. The rest seem to require combinations, so I'll wait for that!
a98503d to 38d314d
I do not see any changes in this PR that would cause the test failures or cause NumPy to use int32 for sum operations on boolean arrays (that depends on the platform, 32-bit vs 64-bit). We could explicitly cast the result of the NumPy sum to int64 to ensure the test doesn't fail:

`numpy_sum = np.sum(array, axis=-1).astype(np.int64)`
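A small illustration of the platform dependence the cast is guarding against (the array contents are illustrative; the int32 case applies on platforms whose default NumPy integer is 32-bit, such as 64-bit Windows):

```python
import numpy as np

mask = np.array([True, False, True])

# np.sum over booleans accumulates in NumPy's default integer type,
# which is int32 on some platforms and int64 on most 64-bit Linux/macOS
# builds; the explicit cast makes the test comparison platform-independent.
print(np.sum(mask).dtype)                   # platform-dependent
print(np.sum(mask).astype(np.int64).dtype)  # int64 everywhere
```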
I'd say merge it - @jpivarski?
I'm checking it with #3158
This is excellent!!! As I understand it, this enables all `axis=-1` reducers, with tests for crossing block boundaries. As we talked about in our meeting, it could have more tests of block-boundary crossing and integration tests (converted from `tests` to `tests-cuda`, particularly `test_0115_generic_reducer_operation.py`).
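A rough sketch of the kind of `tests-cuda` integration test meant here (the array contents, length, and backend round-trip are illustrative assumptions, not the repository's actual tests; running it requires the CUDA backend and CuPy):

```python
import awkward as ak

# Build a ragged array long enough that the reduction spans many thread
# blocks, move it to the GPU, and compare the axis=-1 sum to the CPU result.
cpu = ak.Array([[1, 2, 3], [], [4, 5]] * 100_000)
gpu = ak.to_backend(cpu, "cuda")

cpu_sums = ak.sum(cpu, axis=-1)
gpu_sums = ak.to_backend(ak.sum(gpu, axis=-1), "cpu")

assert cpu_sums.to_list() == gpu_sums.to_list()
```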
The one failing test is unrelated, and it's failing in `main` also, so it isn't a blocker to merging this PR. (We should never introduce failing tests into `main`, but it's already there. @ianna, if you need elevated permissions to bypass the red warning about merging with a failing test, I can give you those permissions.)

What is a blocker, however, is that these need to be tested on more than one GPU. @ianna will be able to test it on Sunday, and it can be merged if it passes on her GPU. I'll be able to test it on Tuesday, and I'll just test it in `main` (after merging).
@ManasviGoyal - all tests pass except for `tests-cuda-kernels/test_cudaawkward_BitMaskedArray_to_ByteMaskedArray.py`, which relies on NumPy. I've added an import. Please double-check. Thanks!
@ManasviGoyal - all tests pass on my local computer with the updated branch that has the `numpy` import! If it works fine on yours, the PR is good to be merged. Thanks.
@ianna All macOS tests are cancelled in the CI, so I am unable to merge.
@jpivarski - I think we need some other macOS node. Run Tests (macos-11, 3.8, x64, full) reports: "This is a scheduled macOS-11 brownout. The macOS-11 environment is deprecated and will be removed on June 28th, 2024."
@jpivarski #3162 adds all the
@ianna and I have both tested with GPUs and all is good, so I'll merge this now. (By merging before June 28, we avoid complications with macOS > 11.)
Kernels tested for different block sizes