
[REVIEW] Avoid unnecessary split for degenerate case where all labels are identical #3243

Merged · 5 commits merged into rapidsai:branch-0.17 on Dec 4, 2020

Conversation

@hcho3 (Contributor) commented Dec 3, 2020

Closes #3231
Closes #3128
Partially addresses #3188

The degenerate case (all labels identical within a node) is now handled robustly by computing the MSE metric separately for each of the three nodes involved in a split (the parent node, the left child, and the right child). Doing so ensures that the gain is exactly 0 in the degenerate case.

The degenerate case can occur in real-world regression problems, e.g. house price data where the price label is rounded up to the nearest 100k.

As a result, the MSE gain is now computed very similarly to the MAE gain.

Disadvantage: we now always make two passes over the data to compute the gain.
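
For illustration, here is a minimal host-side sketch of the per-node gain formulation described above. This is not the actual CUDA kernel changed by this PR; the function names and the use of std::vector are illustrative only.

#include <cstddef>
#include <vector>

// Hypothetical helper: mean squared error of a node's labels around that node's own mean.
double node_mse(const std::vector<double>& labels) {
  if (labels.empty()) return 0.0;
  double mean = 0.0;
  for (double y : labels) mean += y;
  mean /= labels.size();
  double sq_err = 0.0;
  for (double y : labels) sq_err += (y - mean) * (y - mean);
  return sq_err / labels.size();
}

// Gain of a candidate split, with the MSE computed separately for the parent,
// left child, and right child. When every label in `parent` is identical, all
// three MSE terms are exactly zero, so the gain is exactly zero and the split
// is rejected.
double mse_gain(const std::vector<double>& parent,
                const std::vector<double>& left,
                const std::vector<double>& right) {
  const double n = static_cast<double>(parent.size());
  return node_mse(parent) - (left.size() / n) * node_mse(left)
                          - (right.size() / n) * node_mse(right);
}

Because each node's MSE is measured against that node's own mean, a node whose labels are all identical yields three exact zeros and hence a gain of exactly 0, rather than the tiny floating-point residue (e.g. the 2.38418579e-07 gain visible in the Before example below) that previously allowed spurious splits. The price is the extra pass over the data mentioned above.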

cc @teju85 @vinaydes @JohnZed

@hcho3 hcho3 requested a review from a team as a code owner December 3, 2020 05:37
@GPUtester (Contributor) commented:
Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@hcho3 hcho3 added the labels "2 - In Progress (Currently a work in progress)", "bug (Something isn't working)", "CUDA / C++ (CUDA issue)", and "non-breaking (Non-breaking change)" on Dec 3, 2020
@hcho3 hcho3 requested a review from a team as a code owner December 3, 2020 06:00
@hcho3 hcho3 added the label "3 - Ready for Review (Ready for review by team)" and removed "2 - In Progress (Currently a work in progress)" on Dec 3, 2020
@hcho3 hcho3 changed the title Avoid unnecessary split for degenerate case where all labels are identical [REVIEW] Avoid unnecessary split for degenerate case where all labels are identical Dec 3, 2020
@hcho3 hcho3 requested review from teju85 and JohnZed December 3, 2020 06:01
@hcho3 (Contributor, Author) commented Dec 3, 2020

Example from #3188 (comment)

Before (max_depth=3):

[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.540000081,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "split_feature": 0,
                "split_threshold": -2.97261524,
                "gain": 0.25,
                "yes": 3,
                "no": 4,
                "children": [
                    {
                        "nodeid": 3,
                        "split_feature": 1,
                        "split_threshold": -8.30485821,
                        "gain": 2.38418579e-07,
                        "yes": 5,
                        "no": 6,
                        "children": [
                            {
                                "nodeid": 5,
                                "leaf_value": 2
                            },
                            {
                                "nodeid": 6,
                                "leaf_value": 2
                            }
                        ]
                    },
                    {
                        "nodeid": 4,
                        "split_feature": 0,
                        "split_threshold": 2.9149611,
                        "gain": 5.96046448e-08,
                        "yes": 7,
                        "no": 8,
                        "children": [
                            {
                                "nodeid": 7,
                                "leaf_value": 1
                            },
                            {
                                "nodeid": 8,
                                "leaf_value": 1
                            }
                        ]
                    }
                ]
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]

After this PR (max_depth=3):

[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.539999962,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "split_feature": 0,
                "split_threshold": -2.97261524,
                "gain": 0.25,
                "yes": 3,
                "no": 4,
                "children": [
                    {
                        "nodeid": 3,
                        "leaf_value": 2
                    },
                    {
                        "nodeid": 4,
                        "leaf_value": 1
                    }
                ]
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]

Example from #3188 (comment)

Before:

[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 0,
        "gain": 2.98023224e-08,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "leaf_value": 1
            },
            {
                "nodeid": 2,
                "leaf_value": 1
            }
        ]
    }
]

After:

[
    {
        "nodeid": 0,
        "leaf_value": 1
    }
]

Example from #3128

python -m cuml.benchmark.run_benchmarks --cuml-param-sweep use_experimental_backend=[true,false]  \
  max_depth=[10,30] --n-reps 1 --skip-cpu --num-rows 1000000 --num-features 256  \
  RandomForestRegressor

Before:

                     algo  input     cu_time  cpu_time  cuml_acc  cpu_acc  speedup  n_samples  n_features  use_experimental_backend  max_depth
0   RandomForestRegressor  numpy   11.271957       0.0  0.999997      0.0      0.0    1000000         256                      True         10
1   RandomForestRegressor  numpy  115.319010       0.0  0.991797      0.0      0.0    1000000         256                      True         30
2   RandomForestRegressor  numpy    2.103713       0.0  0.999999      0.0      0.0    1000000         256                     False         10
3   RandomForestRegressor  numpy    3.486071       0.0  0.999999      0.0      0.0    1000000         256                     False         30

After:

                    algo  input    cu_time  cpu_time  cuml_acc  cpu_acc  speedup  n_samples  n_features  use_experimental_backend  max_depth
0  RandomForestRegressor  numpy  13.305948       0.0  1.000000      0.0      0.0    1000000         256                      True         10
1  RandomForestRegressor  numpy  14.542375       0.0  0.999999      0.0      0.0    1000000         256                      True         30
2  RandomForestRegressor  numpy   2.094524       0.0  0.999999      0.0      0.0    1000000         256                     False         10
3  RandomForestRegressor  numpy   3.602167       0.0  0.999999      0.0      0.0    1000000         256                     False         30

@codecov-io commented Dec 3, 2020

Codecov Report

Merging #3243 (290184f) into branch-0.17 (d0cd8c1) will increase coverage by 0.01%.
The diff coverage is n/a.


@@               Coverage Diff               @@
##           branch-0.17    #3243      +/-   ##
===============================================
+ Coverage        71.48%   71.49%   +0.01%     
===============================================
  Files              206      206              
  Lines            16648    16648              
===============================================
+ Hits             11900    11902       +2     
+ Misses            4748     4746       -2     
Impacted Files Coverage Δ
python/cuml/ensemble/randomforest_common.pyx 83.47% <0.00%> (+0.84%) ⬆️


@JohnZed (Contributor) left a comment:

Overall this looks isolated enough to make the release. I found it hard to follow what spred vs. spredP vs. spred2 vs. pred are all supposed to store, and I think a comment describing the variables (and more descriptive names) would be very helpful. However, I think the renaming should wait until 0.18.

gs.sync();
// now, compute the mean value to be used for metric update
for (IdxT i = threadIdx.x; i < nbins; i += blockDim.x) {
scount[i] = count[gcOffset + i];
Contributor:

In the next revision, I would love to have these variable names be a bit more explicit. They all look similar and it was a bit hard to parse the formulas.

Contributor Author (hcho3):

I agree. I had to read the code very carefully to deduce their meaning.

Member:

Agree with John. When this was written initially, readability was not kept as a priority item! @hcho3 can you please file an issue so that we don't forget about this?

@teju85 (Member) left a comment:

Thank you @hcho3 for the fix! Change LGTM.

@rapids-bot rapids-bot bot merged commit a4c8de5 into rapidsai:branch-0.17 Dec 4, 2020
@hcho3 hcho3 deleted the fix_degenerate_regression2 branch December 4, 2020 05:30
Labels: bug (Something isn't working), CUDA / C++ (CUDA issue), non-breaking (Non-breaking change)

Linked issue that merging this pull request may close: [BUG] 10x slowdown with cuML RF regression with new experimental backend