Performance improvements of histogram kernel #152
Conversation
// mask contains all sample bits of the next 32 ids that need to be bin'ed
auto lane_mask = ballot(computeHistogram);

// reverse to use __clz instead of __ffs
What is the benefit of this? Just curious.
The idea is to warp-synchronously jump to the samples that need to be computed. Otherwise we would need to check every id explicitly, which would require 32 __shfl operations.
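For illustration, a minimal sketch of that pattern, assuming CUDA 9+ warp-sync intrinsics and the variable names visible in the diff (computeHistogram, lane_mask, node, sampleId); the actual histogram update in the PR is elided:

__device__ void binWarpItems(int node, int sampleId, bool computeHistogram) {
  const unsigned full_mask = 0xffffffffu;

  // one bit per lane that has a sample to bin
  unsigned lane_mask = __ballot_sync(full_mask, computeHistogram);

  // reverse once so every loop iteration can use __clz instead of __ffs
  lane_mask = __brev(lane_mask);

  while (lane_mask) {
    // after __brev, the lowest active lane is the highest set bit,
    // so __clz returns its lane id directly
    int src_lane = __clz(static_cast<int>(lane_mask));
    lane_mask ^= (0x80000000u >> src_lane);  // clear that bit

    // broadcast that lane's item to the whole warp with one shuffle each
    int n = __shfl_sync(full_mask, node, src_lane);
    int s = __shfl_sync(full_mask, sampleId, src_lane);
    // ... all 32 lanes cooperatively bin sample s into node n's histogram ...
  }
}

One shuffle per active lane replaces the 32 unconditional shuffles mentioned above.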
I mean, why __clz vs __ffs?
I believe __clz is preferable for performance reasons -- I did not confirm it in this scope, though.
// reverse to use __clz instead of __ffs
lane_mask = __brev(lane_mask);

while (lane_mask) {
One way of getting the bit fiddling and intrinsics abstracted out:
template<typename T>
class WarpQueue {
private:
  T x;
  int32_t mask;
public:
  WarpQueue(T x, bool active) ...
  bool Empty() {}
  T Pop() {}
};

WarpQueue warp_items(local_items, active);
while (!warp_items.Empty()) {
  auto [node, sampleId] = warp_items.Pop();
}
Feel free to ignore.
This might work for queuing the offsets, but introducing different storage locations for the actual items would increase the register overhead.
The suggestion is only for the offsets.
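Putting the two ideas together, a hedged sketch of what an offsets-only WarpQueue could look like (names and signatures are illustrative, not from the PR): the queue stores just the reversed ballot mask, and Pop() returns the next active lane index, so the items themselves stay in each lane's registers and are fetched with a shuffle at the call site.

class WarpQueue {
 private:
  unsigned mask_;  // reversed ballot mask of lanes that still have work

 public:
  __device__ explicit WarpQueue(bool active)
      : mask_(__brev(__ballot_sync(0xffffffffu, active))) {}

  __device__ bool Empty() const { return mask_ == 0u; }

  // Returns the lane index of the next queued item and removes it.
  __device__ int Pop() {
    int lane = __clz(static_cast<int>(mask_));
    mask_ ^= (0x80000000u >> lane);
    return lane;
  }
};

// Usage sketch: every lane builds the same queue, pops the same lane ids,
// and fetches the corresponding item with one shuffle per pop.
WarpQueue warp_items(computeHistogram);
while (!warp_items.Empty()) {
  int src_lane = warp_items.Pop();
  int n = __shfl_sync(0xffffffffu, node, src_lane);
  int s = __shfl_sync(0xffffffffu, sampleId, src_lane);
  // ... bin sample s for node n ...
}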
@RAMitchell thanks for the review - I have pushed some minor changes.
I ran this against my extended notebook datasets:
Hmm, it looks to me that the kernel is not quite optimal in combination with the batching reordering at very low levels.
This PR adds an improved histogram kernel.
Not yet: