
Performance improvements of histogram kernel #152

Merged
merged 2 commits on Sep 6, 2024

Conversation

mfoerste4
Contributor

This PR adds an improved histogram kernel:

  • optimized global loads
  • allow large number of features
  • no resorting of samples needed (unless we want to pursue shared memory usage)

Not yet:

  • add shared memory utilization

@mfoerste4 mfoerste4 added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Sep 4, 2024
@mfoerste4 mfoerste4 self-assigned this Sep 4, 2024
src/cpp_utils/cpp_utils.cuh
// mask contains all sample bits of the next 32 ids that need to be binned
auto lane_mask = ballot(computeHistogram);

// reverse to use __clz instead of __ffs
Contributor

What is the benefit of this? Just curious.

Contributor Author

The idea is to warp-synchronously jump to the samples that need to be computed. Otherwise we would need to check every id explicitly, which would require 32 __shfl operations.

Contributor

I mean, why __clz vs. __ffs?

Contributor Author

I believe __clz is preferable for performance reasons -- I did not confirm it in this scope though.
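For context, the trick under discussion is: reverse the ballot mask once with __brev, then each iteration finds the next active lane with a single count-leading-zeros. A host-side sketch of that loop shape (assumptions on my part: __builtin_clz stands in for the device __clz, and bit_reverse for __brev; this is not the kernel's actual code):

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-in for the device __brev intrinsic.
inline uint32_t bit_reverse(uint32_t x) {
  uint32_t r = 0;
  for (int i = 0; i < 32; ++i) r |= ((x >> i) & 1u) << (31 - i);
  return r;
}

// Visits the set bits of a warp ballot in ascending lane order, mirroring
// the kernel's `while (lane_mask)` loop: after reversing the mask once,
// count-leading-zeros of the remaining mask *is* the next lane id.
template <typename F>
void for_each_set_lane(uint32_t ballot, F&& visit) {
  uint32_t mask = bit_reverse(ballot);   // reverse once up front
  while (mask) {
    int lane = __builtin_clz(mask);      // leading zeros == next lane id
    visit(lane);
    mask &= ~(0x80000000u >> lane);      // clear that lane's bit
  }
}
```

With an __ffs-style loop on the unreversed mask the lane id would instead be `__ffs(mask) - 1`; the two are functionally equivalent, the question above is purely about which intrinsic is cheaper.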

src/models/tree/build_tree.cu
// reverse to use __clz instead of __ffs
lane_mask = __brev(lane_mask);

while (lane_mask) {
Contributor

One way of getting the bit fiddling and intrinsics abstracted out:

template <typename T>
class WarpQueue {
 private:
  T x;
  int32_t mask;
 public:
  WarpQueue(T x, bool active) ...
  bool Empty() {}
  T Pop() {}
};

WarpQueue warp_items(local_items, active);
while (!warp_items.Empty()) {
  auto [node, sampleId] = warp_items.Pop();
}

Feel free to ignore.
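For what it's worth, here is a host-side sketch of what the offsets-only version of that queue could look like (the class name, method names, and the __builtin_clz stand-in for the device __clz are all assumptions, not code from this PR; on the device, Pop() would additionally __shfl the item from the source lane):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wrapper around the ballot mask: it hides the __brev/__clz
// bit fiddling so the kernel loop only sees Empty()/Pop().
class WarpQueue {
 public:
  // 'ballot' is the unreversed activity mask, one bit per lane.
  explicit WarpQueue(uint32_t ballot) : mask_(0) {
    for (int i = 0; i < 32; ++i)          // __brev equivalent
      mask_ |= ((ballot >> i) & 1u) << (31 - i);
  }
  bool Empty() const { return mask_ == 0; }
  int Pop() {                             // returns the next active lane id
    int lane = __builtin_clz(mask_);      // __clz equivalent
    mask_ &= ~(0x80000000u >> lane);      // consume that lane's bit
    return lane;
  }
 private:
  uint32_t mask_;
};
```

This keeps only the 32-bit mask in registers, which matches the concern below about not duplicating storage for the actual items.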

Contributor Author

This might work for queuing the offsets, but introducing separate storage locations for the actual items would increase the register overhead.

Contributor

The suggestion is only for the offsets.

src/models/tree/build_tree.cu
@mfoerste4
Contributor Author

@RAMitchell thanks for the review - I have pushed some minor changes.

@RAMitchell
Contributor

RAMitchell commented Sep 6, 2024

I ran this against my extended notebook datasets:

Notebook results (main)

                                      time
dataset              max_depth            
Airline              8            1.224361
                     16           1.716611
Wine Quality (white) 8            1.122985
                     16           1.968594
YearPredictionMSD    8            3.417053
                     16          25.990417
adult                8            2.076839
                     16           3.880266
covtype              8            9.187823
                     16          29.507131
epsilon              8           26.678863
                     16         405.088000
higgs                8            1.899799
                     16           7.837720
kddcup99             8            2.171049
                     16           3.229796
make_classification  8           21.220743
                     16         127.141470
make_regression      8            5.540004
                     16          33.024259
mnist                8           29.125725
                     16         213.332709
satimage             8            2.358462
                     16           7.567248

Notebook results (this PR)

                                      time
dataset              max_depth            
Airline              8            1.240454
                     16           1.699671
Wine Quality (white) 8            1.045785
                     16           1.855366
YearPredictionMSD    8            3.023942
                     16          27.307748
adult                8            2.034721
                     16           3.818123
covtype              8            6.403403
                     16          19.547341
epsilon              8           24.606926
                     16         430.767659
higgs                8            1.503556
                     16           6.907475
kddcup99             8            1.722053
                     16           2.735567
make_classification  8           18.130453
                     16         131.233685
make_regression      8            4.307860
                     16          34.079777
mnist                8           18.576233
                     16         193.952795
satimage             8            1.724287
                     16           6.638160

@mfoerste4
Contributor Author

I ran this against my extended notebook datasets:

Hmm, it looks to me like the kernel is not quite optimal in combination with the batch reordering at very low levels.

@RAMitchell RAMitchell merged commit dd01d09 into rapidsai:main Sep 6, 2024
5 checks passed