
Performance improvements of histogram kernel #152

Merged
merged 2 commits on Sep 6, 2024

Conversation

mfoerste4
Contributor

This PR adds an improved histogram kernel:

  • optimized global loads
  • allow large number of features
  • no resorting of samples needed (unless we want to pursue shared memory usage)

Not yet:

  • add shared memory utilization

@mfoerste4 mfoerste4 added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Sep 4, 2024
@mfoerste4 mfoerste4 self-assigned this Sep 4, 2024
src/cpp_utils/cpp_utils.cuh
// mask contains all sample bits of the next 32 ids that need to be binned
auto lane_mask = ballot(computeHistogram);

// reverse to use __clz instead of __ffs
Contributor

What is the benefit of this? Just curious.

Contributor Author

The idea is to warp-synchronously jump to the samples that need to be computed. Otherwise we would need to check every id explicitly, which would require 32 __shfl operations.

Contributor

I mean, why __clz vs. __ffs?

Contributor Author

I believe __clz is preferable for performance reasons -- I did not confirm it in this scope though.
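For context, the trick under discussion is: reverse the ballot mask once with __brev, then each iteration finds the next active lane with a single count-leading-zeros. A host-side sketch of that loop shape (assumptions on my part: __builtin_clz stands in for the device __clz, and bit_reverse for __brev; this is not the kernel's actual code):

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-in for the device __brev intrinsic.
inline uint32_t bit_reverse(uint32_t x) {
  uint32_t r = 0;
  for (int i = 0; i < 32; ++i) r |= ((x >> i) & 1u) << (31 - i);
  return r;
}

// Visits the set bits of a warp ballot in ascending lane order, mirroring
// the kernel's `while (lane_mask)` loop: after reversing the mask once,
// count-leading-zeros of the remaining mask *is* the next lane id.
template <typename F>
void for_each_set_lane(uint32_t ballot, F&& visit) {
  uint32_t mask = bit_reverse(ballot);   // reverse once up front
  while (mask) {
    int lane = __builtin_clz(mask);      // leading zeros == next lane id
    visit(lane);
    mask &= ~(0x80000000u >> lane);      // clear that lane's bit
  }
}
```

With an __ffs-style loop on the unreversed mask the lane id would instead be `__ffs(mask) - 1`; the two are functionally equivalent, the question above is purely about which intrinsic is cheaper.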

src/models/tree/build_tree.cu
// reverse to use __clz instead of __ffs
lane_mask = __brev(lane_mask);

while (lane_mask) {
Contributor

One way of getting the bit fiddling and intrinsics abstracted out:

template <typename T>
class WarpQueue {
 private:
  T x;
  int32_t mask;
 public:
  WarpQueue(T x, bool active) ...
  bool Empty() {}
  T Pop() {}
};

WarpQueue warp_items(local_items, active);
while (!warp_items.Empty()) {
  auto [node, sampleId] = warp_items.Pop();
}

Feel free to ignore.
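For what it's worth, here is a host-side sketch of what the offsets-only version of that queue could look like (the class name, method names, and the __builtin_clz stand-in for the device __clz are all assumptions, not code from this PR; on the device, Pop() would additionally __shfl the item from the source lane):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wrapper around the ballot mask: it hides the __brev/__clz
// bit fiddling so the kernel loop only sees Empty()/Pop().
class WarpQueue {
 public:
  // 'ballot' is the unreversed activity mask, one bit per lane.
  explicit WarpQueue(uint32_t ballot) : mask_(0) {
    for (int i = 0; i < 32; ++i)          // __brev equivalent
      mask_ |= ((ballot >> i) & 1u) << (31 - i);
  }
  bool Empty() const { return mask_ == 0; }
  int Pop() {                             // returns the next active lane id
    int lane = __builtin_clz(mask_);      // __clz equivalent
    mask_ &= ~(0x80000000u >> lane);      // consume that lane's bit
    return lane;
  }
 private:
  uint32_t mask_;
};
```

This keeps only the 32-bit mask in registers, which matches the concern below about not duplicating storage for the actual items.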

Contributor Author

This might work for queuing the offsets, but introducing separate storage locations for the actual items would increase the register overhead.

Contributor

The suggestion is only for the offsets.

src/models/tree/build_tree.cu
@mfoerste4
Contributor Author

@RAMitchell thanks for the review - I have pushed some minor changes.

@RAMitchell
Contributor

RAMitchell commented Sep 6, 2024

I ran this against my extended notebook datasets:

Notebook results (main)

                                      time
dataset              max_depth            
Airline              8            1.224361
                     16           1.716611
Wine Quality (white) 8            1.122985
                     16           1.968594
YearPredictionMSD    8            3.417053
                     16          25.990417
adult                8            2.076839
                     16           3.880266
covtype              8            9.187823
                     16          29.507131
epsilon              8           26.678863
                     16         405.088000
higgs                8            1.899799
                     16           7.837720
kddcup99             8            2.171049
                     16           3.229796
make_classification  8           21.220743
                     16         127.141470
make_regression      8            5.540004
                     16          33.024259
mnist                8           29.125725
                     16         213.332709
satimage             8            2.358462
                     16           7.567248

Notebook results (this PR)

                                      time
dataset              max_depth            
Airline              8            1.240454
                     16           1.699671
Wine Quality (white) 8            1.045785
                     16           1.855366
YearPredictionMSD    8            3.023942
                     16          27.307748
adult                8            2.034721
                     16           3.818123
covtype              8            6.403403
                     16          19.547341
epsilon              8           24.606926
                     16         430.767659
higgs                8            1.503556
                     16           6.907475
kddcup99             8            1.722053
                     16           2.735567
make_classification  8           18.130453
                     16         131.233685
make_regression      8            4.307860
                     16          34.079777
mnist                8           18.576233
                     16         193.952795
satimage             8            1.724287
                     16           6.638160

@mfoerste4
Contributor Author

I ran this against my extended notebook datasets:

Hmm, it looks to me like the kernel is not quite optimal in combination with the batch reordering at very low levels.

@RAMitchell RAMitchell merged commit dd01d09 into rapidsai:main Sep 6, 2024
5 checks passed