
[BUG] Old and new backends diverge on a 10-row toy example #3188

Closed
hcho3 opened this issue Nov 25, 2020 · 11 comments
Labels
bug (Something isn't working) · CUDA / C++ (CUDA issue) · inactive-30d · inactive-90d

Comments

@hcho3
Contributor

hcho3 commented Nov 25, 2020

The old and new backends of RF produce different models on a toy example (given below).

Minimal reproducer: a toy example with 10 rows and 2 columns, generated by make_blobs.

"""mre.py"""
from cuml.ensemble import RandomForestRegressor as rfr
from sklearn.datasets import make_blobs
from toy_dt import fit as toy_dt_fit
from toy_dt import dump as toy_dt_dump
from toy_dt import NumpyEncoder
import numpy as np
import json

n_row = 10
n_feature = 2

X, y = make_blobs(n_samples=n_row, n_features=n_feature, centers=None,
                  random_state=42)
X, y = X.astype(np.float32), y.astype(np.float32)
print(f'X = {X}')
print(f'y = {y}')

max_depth = 1

kwargs = {'max_features': 1.0, 'n_estimators': 1, 'max_depth': max_depth, 'bootstrap': False}

for use_experimental in [True, False]:
    clf = rfr(use_experimental_backend=use_experimental, **kwargs)
    clf.fit(X, y)
    json_obj = json.loads(clf.dump_as_json())
    print(f'=========use_experimental={use_experimental}=========')
    print(json.dumps(json_obj, indent=4))

tree = toy_dt_fit(X, y, max_depth=max_depth, min_samples_split=2, min_samples_leaf=1)
print('=========Toy implementation=========')
print(json.dumps(toy_dt_dump(tree), indent=4, cls=NumpyEncoder))

The reproducer also includes a toy implementation of the Decision Tree algorithm:

"""toy_dt.py"""
import numpy as np
import json

class NumpyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        return json.JSONEncoder.default(self, obj)

def mse(y):
    if y.shape[0] == 0:
        return 0
    y_pred = np.mean(y)
    return np.mean((y - y_pred)**2)

def initialize_tree(X, y):
    assert X.shape[0] == y.shape[0]
    return {'data': np.arange(X.shape[0]), 'leaf_value': np.mean(y)}

def find_best_split(X, y, node, *, min_samples_split, min_samples_leaf):
    ind = node['data']
    X_node, y_node = X[ind, :], y[ind]

    n_row, n_col = X_node.shape[0], X_node.shape[1]
    assert n_row == y_node.shape[0]
    if n_row < min_samples_split:
        return (None, None, 0.0)  # do not split
    best_gain = 0.0
    best_feature_id = None
    best_threshold = None
    for feature_id in range(n_col):
        feature_values = np.unique(X_node[:, feature_id])
        threshold_candidates = [-np.inf] + feature_values.tolist() + [np.inf]

        for threshold in threshold_candidates:
            left_ind = (X_node[:, feature_id] <= threshold)
            right_ind = (X_node[:, feature_id] > threshold)
            n_left, n_right = np.sum(left_ind), np.sum(right_ind)
            if n_left < min_samples_leaf or n_right < min_samples_leaf:
                continue
            y_left, y_right = y_node[left_ind], y_node[right_ind]
            assert n_left + n_right == n_row
            gain = mse(y_node) - n_left / n_row * mse(y_left) - n_right / n_row * mse(y_right)
            if gain > best_gain:
                best_gain = gain
                best_feature_id = feature_id
                best_threshold = threshold

    return (best_feature_id, best_threshold, best_gain)

def add_split(X, y, node, feature_id, threshold, gain):
    node['base_weight'] = node['leaf_value']
    del node['leaf_value']
    ind = node['data']
    X_node, y_node = X[ind, :], y[ind]
    left_ind = (X_node[:, feature_id] <= threshold)
    right_ind = (X_node[:, feature_id] > threshold)
    y_left, y_right = y_node[left_ind], y_node[right_ind]
    node['left'] = {'data': ind[left_ind], 'leaf_value': np.mean(y_left)}
    node['right'] = {'data': ind[right_ind], 'leaf_value': np.mean(y_right)}
    node['gain'] = gain
    node['feature_id'] = feature_id
    node['threshold'] = threshold

def fit(X, y, *, max_depth, min_samples_split, min_samples_leaf):
    tree = initialize_tree(X, y)
    queue = [tree]
    for i in range(max_depth):
        next_queue = []
        for node in queue:
            feature_id, threshold, gain = find_best_split(X, y, node,
                                                          min_samples_split=min_samples_split,
                                                          min_samples_leaf=min_samples_leaf)
            if gain > 0.0:
                add_split(X, y, node, feature_id, threshold, gain)
                next_queue.append(node['left'])
                next_queue.append(node['right'])
        queue = next_queue
        if not queue:
            break
    return tree

def dump(tree):
    if 'left' in tree:
        return {'base_weight': tree['base_weight'],
                'gain': tree['gain'],
                'feature_id': tree['feature_id'],
                'threshold': tree['threshold'],
                'left': dump(tree['left']),
                'right': dump(tree['right'])}
    else:
        return {'leaf_value': tree['leaf_value']}

Here are the results for varying values of max_depth.

max_depth=1

=========use_experimental=True=========
[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.540000081,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "leaf_value": 1.5
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]
=========use_experimental=False=========
[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.689999998,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "leaf_value": 1.5
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]
=========Toy implementation=========
{
    "base_weight": 0.8999999761581421,
    "gain": 0.5399999976158142,
    "feature_id": 1,
    "threshold": 2.287416934967041,
    "left": {
        "leaf_value": 1.5
    },
    "right": {
        "leaf_value": 0.0
    }
}

max_depth=2

=========use_experimental=True=========
[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.540000081,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "split_feature": 0,
                "split_threshold": -2.97261524,
                "gain": 0.25,
                "yes": 3,
                "no": 4,
                "children": [
                    {
                        "nodeid": 3,
                        "leaf_value": 2
                    },
                    {
                        "nodeid": 4,
                        "leaf_value": 1
                    }
                ]
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]
=========use_experimental=False=========
[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.689999998,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "split_feature": 0,
                "split_threshold": -5.41397858,
                "gain": 1.5,
                "yes": 3,
                "no": 4,
                "children": [
                    {
                        "nodeid": 3,
                        "leaf_value": 2
                    },
                    {
                        "nodeid": 4,
                        "leaf_value": 1
                    }
                ]
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]
=========Toy implementation=========
{
    "base_weight": 0.8999999761581421,
    "gain": 0.5399999976158142,
    "feature_id": 1,
    "threshold": 2.287416934967041,
    "left": {
        "base_weight": 1.5,
        "gain": 0.25,
        "feature_id": 0,
        "threshold": -5.413978576660156,
        "left": {
            "leaf_value": 2.0
        },
        "right": {
            "leaf_value": 1.0
        }
    },
    "right": {
        "leaf_value": 0.0
    }
}

max_depth=3

=========use_experimental=True=========
[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.540000081,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "split_feature": 0,
                "split_threshold": -2.97261524,
                "gain": 0.25,
                "yes": 3,
                "no": 4,
                "children": [
                    {
                        "nodeid": 3,
                        "split_feature": 1,
                        "split_threshold": -8.30485821,
                        "gain": 2.38418579e-07,
                        "yes": 5,
                        "no": 6,
                        "children": [
                            {
                                "nodeid": 5,
                                "leaf_value": 2
                            },
                            {
                                "nodeid": 6,
                                "leaf_value": 2
                            }
                        ]
                    },
                    {
                        "nodeid": 4,
                        "split_feature": 0,
                        "split_threshold": 2.9149611,
                        "gain": 5.96046448e-08,
                        "yes": 7,
                        "no": 8,
                        "children": [
                            {
                                "nodeid": 7,
                                "leaf_value": 1
                            },
                            {
                                "nodeid": 8,
                                "leaf_value": 1
                            }
                        ]
                    }
                ]
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]
=========use_experimental=False=========
[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.689999998,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "split_feature": 0,
                "split_threshold": -5.41397858,
                "gain": 1.5,
                "yes": 3,
                "no": 4,
                "children": [
                    {
                        "nodeid": 3,
                        "leaf_value": 2
                    },
                    {
                        "nodeid": 4,
                        "leaf_value": 1
                    }
                ]
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]
=========Toy implementation=========
{
    "base_weight": 0.8999999761581421,
    "gain": 0.5399999976158142,
    "feature_id": 1,
    "threshold": 2.287416934967041,
    "left": {
        "base_weight": 1.5,
        "gain": 0.25,
        "feature_id": 0,
        "threshold": -5.413978576660156,
        "left": {
            "leaf_value": 2.0
        },
        "right": {
            "leaf_value": 1.0
        }
    },
    "right": {
        "leaf_value": 0.0
    }
}
@hcho3 hcho3 added "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels Nov 25, 2020
@hcho3
Contributor Author

hcho3 commented Nov 25, 2020

Note: make sure to apply patch #3186 to obtain the gain values.

@hcho3
Contributor Author

hcho3 commented Nov 25, 2020

@vinaydes

Observations so far:

  • The old RF backend yields wrong MSE gain values.
  • On the other hand, the new RF backend produces a gratuitous split when max_depth is set to 3. The following node is an unnecessary split, because the three data rows belonging to it all have the same label (2). In addition, the MSE gain should actually be zero, not 2e-7 (see the check after the node dump below).
                    {
                        "nodeid": 3,
                        "split_feature": 1,
                        "split_threshold": -8.30485821,
                        "gain": 2.38418579e-07,
                        "yes": 5,
                        "no": 6,
                        "children": [
                            {
                                "nodeid": 5,
                                "leaf_value": 2
                            },
                            {
                                "nodeid": 6,
                                "leaf_value": 2
                            }
                        ]
                    }
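
A quick check with the toy implementation's mse (a sketch, reusing toy_dt.py from above; the three labels are the 2s shown in the node) confirms that when each node's MSE is computed separately, any split of a constant-label node has a gain of exactly zero:

"""check_zero_gain.py (sketch)"""
import numpy as np
from toy_dt import mse  # per-node MSE from the toy implementation above

y_node = np.array([2, 2, 2], dtype=np.float32)  # all labels identical

# try both ways to split the three rows into non-empty halves
for n_left in [1, 2]:
    y_left, y_right = y_node[:n_left], y_node[n_left:]
    gain = mse(y_node) - n_left / 3 * mse(y_left) - (3 - n_left) / 3 * mse(y_right)
    print(gain)  # exactly 0.0: the mean of a constant array is exact in float32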

@hcho3
Contributor Author

hcho3 commented Nov 25, 2020

Visualization of the toy data:
[image: toy_data — scatter plot of the 10 points]
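
The plot can be regenerated with a minimal sketch (assuming matplotlib is available; it reuses the exact make_blobs call from mre.py):

"""plot_toy_data.py (sketch)"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# same data as mre.py
X, y = make_blobs(n_samples=10, n_features=2, centers=None, random_state=42)
X, y = X.astype(np.float32), y.astype(np.float32)

plt.scatter(X[:, 0], X[:, 1], c=y)  # color each point by its label
plt.xlabel('feature 0')
plt.ylabel('feature 1')
plt.show()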

@hcho3
Contributor Author

hcho3 commented Nov 25, 2020

I just found a silly mistake in my toy decision tree implementation and have fixed it.

@hcho3
Contributor Author

hcho3 commented Nov 25, 2020

Minimal reproducer for the gratuitous split when a node's rows all have the same label:

from cuml.ensemble import RandomForestRegressor as rfr
import numpy as np
import json

X = np.array([[-1, 0], [0, 1], [2, 0], [0, 3], [-2, 0]], dtype=np.float32)
y = np.array([1, 1, 1, 1, 1], dtype=np.float32)

kwargs = {'max_features': 1.0, 'n_estimators': 1, 'max_depth': 1, 'bootstrap': False, 'n_bins': 5}

for use_experimental in [True, False]:
    clf = rfr(use_experimental_backend=use_experimental, **kwargs)
    clf.fit(X, y)
    json_obj = json.loads(clf.dump_as_json())
    print(f'=========use_experimental={use_experimental}=========')
    print(json.dumps(json_obj, indent=4))

Output:

=========use_experimental=True=========
[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 0,
        "gain": 2.98023224e-08,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "leaf_value": 1
            },
            {
                "nodeid": 2,
                "leaf_value": 1
            }
        ]
    }
]
=========use_experimental=False=========
[
    {
        "nodeid": 0,
        "leaf_value": 1
    }
]

@hcho3
Contributor Author

hcho3 commented Nov 25, 2020

I found the reason why the new backend was adding the unnecessary split in the example #3188 (comment). It's because of the order in which three floating-point numbers are added together when computing the gain value:

"""catastrophic_cancellation_in_gain.py"""
import numpy as np

X = np.array([[-1, 0], [0, 1], [2, 0], [0, 3], [-2, 0]], dtype=np.float32)
y = np.array([1, 1, 1, 1, 1], dtype=np.float32)

ind = (X[:, 1] <= 0.0)
y_left, y_right = y[ind], y[~ind]

invlen = np.float32(1.0) / np.float32(5.0)
invLeft = np.float32(1.0) / np.float32(3.0)
invRight = np.float32(1.0) / np.float32(2.0)

valL = np.sum(y_left)
valR = np.sum(y_right)
valP = (valL + valR) * invlen

A = -valP * valP
assert isinstance(A, np.float32)
print(f'A = {A}')

B = valL * invlen * valL * invLeft
assert isinstance(B, np.float32)
print(f'B = {B}')

C = valR * invlen * valR * invRight
assert isinstance(C, np.float32)
print(f'C = {C}')

gain = (A + B) + C

print('=' * 60)
print(f'A + B = {A + B}')
print(f'A + C = {A + C}')
print(f'B + C = {B + C}')

print('=' * 60)
print(f'(A + B) + C = {(A + B) + C}')
print(f'(A + C) + B = {(A + C) + B}')
print(f'(B + C) + A = {(B + C) + A}')

print('=' * 60)
print(f'gain = (A + B) + C = {gain}')

which prints

A = -1.0
B = 0.6000000238418579
C = 0.4000000059604645
============================================================
A + B = -0.3999999761581421
A + C = -0.6000000238418579
B + C = 1.0
============================================================
(A + B) + C = 2.9802322387695312e-08
(A + C) + B = 0.0
(B + C) + A = 0.0
============================================================
gain = (A + B) + C = 2.9802322387695312e-08
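
As a cross-check (a sketch with exact rational arithmetic; valL = 3, valR = 2 are taken from the script above), the three terms sum to exactly zero, so the 2.98e-08 is pure float32 rounding:

"""exact_gain.py (sketch)"""
from fractions import Fraction

valL, valR, n = Fraction(3), Fraction(2), Fraction(5)
valP = (valL + valR) / n
A = -valP * valP                   # -1
B = valL / n * valL / Fraction(3)  # 3/5
C = valR / n * valR / Fraction(2)  # 2/5
print(A + B + C)  # 0 -- the true gain is exactly zero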

@teju85
Member

teju85 commented Nov 27, 2020

This is probably a bit of a digression from the topic of this issue, but I feel what I'm writing here might help (at least partly) fix this issue in our new backend...

@hcho3
During our 1:1, @vinaydes and I found a potential problem with the way the new backend splits nodes. Let me explain using only the MSE metric (but it should apply to the other 3 metrics too). Here is our MSE metric in the new backend. We believe that this function should instead evaluate the gain as follows:

template <typename DataT, typename IdxT>
DI void mseGain(DataT* spred, IdxT* scount, DataT* sbins,
                Split<DataT, IdxT>& sp, IdxT col, IdxT len, IdxT nbins,
                IdxT min_samples_leaf, DataT min_val, DataT min_impurity_decrease) {
  auto invlen = DataT(1.0) / len;
  for (IdxT i = threadIdx.x; i < nbins; i += blockDim.x) {
    auto nLeft = scount[i];
    auto nRight = len - nLeft;
    DataT gain;
    // if there aren't enough samples in this split, don't bother!
    if (nLeft < min_samples_leaf || nRight < min_samples_leaf) {
      gain = min_val;  // IOW, -std::numeric_limits<DataT>::max() (but this constant is not accessible from device methods!)
    } else {
      auto invLeft = DataT(1.0) / nLeft;
      auto invRight = DataT(1.0) / nRight;
      auto valL = spred[i];
      auto valR = spred[nbins + i];
      // parent sum is basically sum of its left and right children
      auto valP = (valL + valR) * invlen;
      gain = -valP * valP;
      gain += valL * invlen * valL * invLeft;
      gain += valR * invlen * valR * invRight;
    }
    // if the gain is not "enough", don't bother!
    if (gain <= min_impurity_decrease) {
      gain = min_val;
    }
    sp.update({sbins[i], col, gain, nLeft});
  }
}

In short, if we find that certain splits are not going to satisfy the min_samples_leaf or min_impurity_decrease hyper-parameters, then we really shouldn't consider such splits at all! Can you retry with this change, please?
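
In plain Python terms (a hypothetical sketch with illustrative names, not the actual device code), the suggested change makes ineligible candidates lose the best-split selection unconditionally:

"""eligible_split.py (sketch)"""
NEG_INF = -float('inf')  # plays the role of min_val in the kernel

def candidate_gain(gain, n_left, n_right, *, min_samples_leaf, min_impurity_decrease):
    # ineligible splits get the sentinel gain, so the argmax over candidates can never pick them
    if n_left < min_samples_leaf or n_right < min_samples_leaf:
        return NEG_INF
    if gain <= min_impurity_decrease:
        return NEG_INF
    return gain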

@hcho3
Contributor Author

hcho3 commented Dec 1, 2020

I filed #3216 according to your suggestion. While it does not fix the issue in this particular toy example, I believe it should be merged on its own merit, namely ignoring ineligible split candidates.

@hcho3 hcho3 added the "CUDA / C++" (CUDA issue) label and removed the "? - Needs Triage" (Need team to review and classify) label Dec 2, 2020
rapids-bot bot pushed a commit that referenced this issue Dec 4, 2020
…tical(#3243)

Closes #3231 
Closes #3128
Partially addresses #3188 

The degenerate case (labels all identical in a node) is now robustly handled by computing the MSE metric separately for each of the three nodes (the parent node, the left child node, and the right child node). Doing so ensures that the gain is 0 in the degenerate case.

The degenerate case may occur in some real-world regression problems, e.g. house price data where the price label is rounded to the nearest 100k.

As a result, the MSE gain is now computed very similarly to the MAE gain.

Disadvantage: we now always make two passes over the data to compute the gain.

cc @teju85 @vinaydes @JohnZed

Authors:
  - Hyunsu Cho <[email protected]>
  - Philip Hyunsu Cho <[email protected]>

Approvers:
  - Thejaswi Rao
  - John Zedlewski

URL: #3243
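
For reference, a sketch of the two-pass per-node MSE described above (illustrative NumPy, not the actual CUDA kernel): the first pass computes the node's mean, the second accumulates squared deviations from it, which is exactly zero for constant labels:

"""two_pass_mse.py (sketch)"""
import numpy as np

def node_mse(y):
    mean = np.sum(y) / y.shape[0]                # pass 1: node mean
    return np.sum((y - mean) ** 2) / y.shape[0]  # pass 2: squared deviations

y = np.full(3, 2.0, dtype=np.float32)  # a node whose labels are all identical
print(node_mse(y))  # exactly 0.0, so any split of this node has zero gain
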
@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@hcho3
Contributor Author

hcho3 commented May 18, 2021

Fixed in #3243

@hcho3 hcho3 closed this as completed May 18, 2021