
[REVIEW] Avoid unnecessary split for degenerate case where all labels are identical #3243

Merged · 5 commits merged into rapidsai:branch-0.17 on Dec 4, 2020

Conversation

@hcho3 (Contributor) commented Dec 3, 2020

Closes #3231
Closes #3128
Partially addresses #3188

The degenerate case (all labels identical within a node) is now handled robustly by computing the MSE metric separately for each of the three nodes involved in a split (the parent node, the left child, and the right child). Doing so ensures that the gain is exactly 0 in the degenerate case.

The degenerate case can occur in real-world regression problems, e.g. house price data where the price label is rounded up to the nearest 100k.

As a result, the MSE gain is now computed very similarly to the MAE gain.

Disadvantage: we now always make two passes over the data to compute the gain.
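
For illustration, here is a minimal host-side sketch of the per-node gain formulation described above. This is not the actual CUDA kernel changed by this PR; the function names and the use of std::vector are illustrative only.

#include <cstddef>
#include <vector>

// Hypothetical helper: mean squared error of a node's labels around that node's own mean.
double node_mse(const std::vector<double>& labels) {
  if (labels.empty()) return 0.0;
  double mean = 0.0;
  for (double y : labels) mean += y;
  mean /= labels.size();
  double sq_err = 0.0;
  for (double y : labels) sq_err += (y - mean) * (y - mean);
  return sq_err / labels.size();
}

// Gain of a candidate split, with the MSE computed separately for the parent,
// left child, and right child. When every label in `parent` is identical, all
// three MSE terms are exactly zero, so the gain is exactly zero and the split
// is rejected.
double mse_gain(const std::vector<double>& parent,
                const std::vector<double>& left,
                const std::vector<double>& right) {
  const double n = static_cast<double>(parent.size());
  return node_mse(parent) - (left.size() / n) * node_mse(left)
                          - (right.size() / n) * node_mse(right);
}

Because each node's MSE is measured against that node's own mean, a node whose labels are all identical yields three exact zeros and hence a gain of exactly 0, rather than the tiny floating-point residue (e.g. the 2.38418579e-07 gain visible in the Before example below) that previously allowed spurious splits. The price is the extra pass over the data mentioned above.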

cc @teju85 @vinaydes @JohnZed

@hcho3 hcho3 requested a review from a team as a code owner December 3, 2020 05:37
@GPUtester (Contributor) commented:
Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@hcho3 hcho3 added the labels "2 - In Progress (Currently a work in progress)", "bug (Something isn't working)", "CUDA / C++ (CUDA issue)", and "non-breaking (Non-breaking change)" on Dec 3, 2020
@hcho3 hcho3 requested a review from a team as a code owner December 3, 2020 06:00
@hcho3 hcho3 added the label "3 - Ready for Review (Ready for review by team)" and removed "2 - In Progress (Currently a work in progress)" on Dec 3, 2020
@hcho3 hcho3 changed the title Avoid unnecessary split for degenerate case where all labels are identical [REVIEW] Avoid unnecessary split for degenerate case where all labels are identical Dec 3, 2020
@hcho3 hcho3 requested review from teju85 and JohnZed December 3, 2020 06:01
@hcho3 (Contributor, Author) commented Dec 3, 2020

Example from #3188 (comment)

Before (max_depth=3):

[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.540000081,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "split_feature": 0,
                "split_threshold": -2.97261524,
                "gain": 0.25,
                "yes": 3,
                "no": 4,
                "children": [
                    {
                        "nodeid": 3,
                        "split_feature": 1,
                        "split_threshold": -8.30485821,
                        "gain": 2.38418579e-07,
                        "yes": 5,
                        "no": 6,
                        "children": [
                            {
                                "nodeid": 5,
                                "leaf_value": 2
                            },
                            {
                                "nodeid": 6,
                                "leaf_value": 2
                            }
                        ]
                    },
                    {
                        "nodeid": 4,
                        "split_feature": 0,
                        "split_threshold": 2.9149611,
                        "gain": 5.96046448e-08,
                        "yes": 7,
                        "no": 8,
                        "children": [
                            {
                                "nodeid": 7,
                                "leaf_value": 1
                            },
                            {
                                "nodeid": 8,
                                "leaf_value": 1
                            }
                        ]
                    }
                ]
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]

After this PR (max_depth=3):

[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.539999962,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "split_feature": 0,
                "split_threshold": -2.97261524,
                "gain": 0.25,
                "yes": 3,
                "no": 4,
                "children": [
                    {
                        "nodeid": 3,
                        "leaf_value": 2
                    },
                    {
                        "nodeid": 4,
                        "leaf_value": 1
                    }
                ]
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]

Example from #3188 (comment)

Before:

[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 0,
        "gain": 2.98023224e-08,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "leaf_value": 1
            },
            {
                "nodeid": 2,
                "leaf_value": 1
            }
        ]
    }
]

After:

[
    {
        "nodeid": 0,
        "leaf_value": 1
    }
]

Example from #3128

python -m cuml.benchmark.run_benchmarks --cuml-param-sweep use_experimental_backend=[true,false]  \
  max_depth=[10,30] --n-reps 1 --skip-cpu --num-rows 1000000 --num-features 256  \
  RandomForestRegressor

Before:

                     algo  input     cu_time  cpu_time  cuml_acc  cpu_acc  speedup  n_samples  n_features  use_experimental_backend  max_depth
0   RandomForestRegressor  numpy   11.271957       0.0  0.999997      0.0      0.0    1000000         256                      True         10
1   RandomForestRegressor  numpy  115.319010       0.0  0.991797      0.0      0.0    1000000         256                      True         30
2   RandomForestRegressor  numpy    2.103713       0.0  0.999999      0.0      0.0    1000000         256                     False         10
3   RandomForestRegressor  numpy    3.486071       0.0  0.999999      0.0      0.0    1000000         256                     False         30

After:

                    algo  input    cu_time  cpu_time  cuml_acc  cpu_acc  speedup  n_samples  n_features  use_experimental_backend  max_depth
0  RandomForestRegressor  numpy  13.305948       0.0  1.000000      0.0      0.0    1000000         256                      True         10
1  RandomForestRegressor  numpy  14.542375       0.0  0.999999      0.0      0.0    1000000         256                      True         30
2  RandomForestRegressor  numpy   2.094524       0.0  0.999999      0.0      0.0    1000000         256                     False         10
3  RandomForestRegressor  numpy   3.602167       0.0  0.999999      0.0      0.0    1000000         256                     False         30

@codecov-io commented Dec 3, 2020

Codecov Report

Merging #3243 (290184f) into branch-0.17 (d0cd8c1) will increase coverage by 0.01%.
The diff coverage is n/a.


@@               Coverage Diff               @@
##           branch-0.17    #3243      +/-   ##
===============================================
+ Coverage        71.48%   71.49%   +0.01%     
===============================================
  Files              206      206              
  Lines            16648    16648              
===============================================
+ Hits             11900    11902       +2     
+ Misses            4748     4746       -2     
Impacted Files Coverage Δ
python/cuml/ensemble/randomforest_common.pyx 83.47% <0.00%> (+0.84%) ⬆️


@JohnZed (Contributor) left a comment:

Overall this looks isolated enough to make the release. I found it hard to follow what spred vs. spredP vs. spred2 vs. pred are all supposed to store, and I think a comment describing the variables (and more descriptive names) would be very helpful. However, I think the renaming should wait until 0.18.

gs.sync();
// now, compute the mean value to be used for metric update
for (IdxT i = threadIdx.x; i < nbins; i += blockDim.x) {
scount[i] = count[gcOffset + i];
Contributor:

In the next revision, I would love to have these variable names be a bit more explicit. They all look similar and it was a bit hard to parse the formulas.

Contributor Author (hcho3):

I agree. I had to read the code very carefully to deduce their meaning.

Member:

Agree with John. When this was written initially, readability was not kept as a priority item! @hcho3 can you please file an issue so that we don't forget about this?

@teju85 (Member) left a comment:

Thank you @hcho3 for the fix! Change LGTM.

@rapids-bot rapids-bot bot merged commit a4c8de5 into rapidsai:branch-0.17 Dec 4, 2020
@hcho3 hcho3 deleted the fix_degenerate_regression2 branch December 4, 2020 05:30
Labels: bug (Something isn't working), CUDA / C++ (CUDA issue), non-breaking (Non-breaking change)

Linked issue that merging this pull request may close: [BUG] 10x slowdown with cuML RF regression with new experimental backend