
[BUG] Old and new backends diverge on a 10-row toy example #3188

Closed
hcho3 opened this issue Nov 25, 2020 · 11 comments
Labels
bug (Something isn't working) · CUDA / C++ (CUDA issue) · inactive-30d · inactive-90d

Comments

@hcho3
Contributor

hcho3 commented Nov 25, 2020

The old and new backends of RF produce different models on a toy example (given below).

Minimal reproducer: a toy example with 10 rows and 2 columns, generated by make_blobs.

"""mre.py"""
from cuml.ensemble import RandomForestRegressor as rfr
from sklearn.datasets import make_blobs
from toy_dt import fit as toy_dt_fit
from toy_dt import dump as toy_dt_dump
from toy_dt import NumpyEncoder
import numpy as np
import json

n_row = 10
n_feature = 2

X, y = make_blobs(n_samples=n_row, n_features=n_feature, centers=None,
                  random_state=42)
X, y = X.astype(np.float32), y.astype(np.float32)
print(f'X = {X}')
print(f'y = {y}')

max_depth = 1

kwargs = {'max_features': 1.0, 'n_estimators': 1, 'max_depth': max_depth, 'bootstrap': False}

for use_experimental in [True, False]:
    clf = rfr(use_experimental_backend=use_experimental, **kwargs)
    clf.fit(X, y)
    json_obj = json.loads(clf.dump_as_json())
    print(f'=========use_experimental={use_experimental}=========')
    print(json.dumps(json_obj, indent=4))

tree = toy_dt_fit(X, y, max_depth=max_depth, min_samples_split=2, min_samples_leaf=1)
print('=========Toy implementation=========')
print(json.dumps(toy_dt_dump(tree), indent=4, cls=NumpyEncoder))

The reproducer also includes a toy implementation of the Decision Tree algorithm:

"""toy_dt.py"""
import numpy as np
import json

class NumpyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        return json.JSONEncoder.default(self, obj)

def mse(y):
    if y.shape[0] == 0:
        return 0
    y_pred = np.mean(y)
    return np.mean((y - y_pred)**2)

def initialize_tree(X, y):
    assert X.shape[0] == y.shape[0]
    return {'data': np.arange(X.shape[0]), 'leaf_value': np.mean(y)}

def find_best_split(X, y, node, *, min_samples_split, min_samples_leaf):
    ind = node['data']
    X_node, y_node = X[ind, :], y[ind]

    n_row, n_col = X_node.shape[0], X_node.shape[1]
    assert n_row == y_node.shape[0]
    if n_row < min_samples_split:
        return (None, None, 0.0)  # do not split
    best_gain = 0.0
    best_feature_id = None
    best_threshold = None
    for feature_id in range(n_col):
        feature_values = np.unique(X_node[:, feature_id])
        threshold_candidates = [-np.inf] + feature_values.tolist() + [np.inf]

        for threshold in threshold_candidates:
            left_ind = (X_node[:, feature_id] <= threshold)
            right_ind = (X_node[:, feature_id] > threshold)
            n_left, n_right = np.sum(left_ind), np.sum(right_ind)
            if n_left < min_samples_leaf or n_right < min_samples_leaf:
                continue
            y_left, y_right = y_node[left_ind], y_node[right_ind]
            assert n_left + n_right == n_row
            gain = mse(y_node) - n_left / n_row * mse(y_left) - n_right / n_row * mse(y_right)
            if gain > best_gain:
                best_gain = gain
                best_feature_id = feature_id
                best_threshold = threshold

    return (best_feature_id, best_threshold, best_gain)

def add_split(X, y, node, feature_id, threshold, gain):
    node['base_weight'] = node['leaf_value']
    del node['leaf_value']
    ind = node['data']
    X_node, y_node = X[ind, :], y[ind]
    left_ind = (X_node[:, feature_id] <= threshold)
    right_ind = (X_node[:, feature_id] > threshold)
    y_left, y_right = y_node[left_ind], y_node[right_ind]
    node['left'] = {'data': ind[left_ind], 'leaf_value': np.mean(y_left)}
    node['right'] = {'data': ind[right_ind], 'leaf_value': np.mean(y_right)}
    node['gain'] = gain
    node['feature_id'] = feature_id
    node['threshold'] = threshold

def fit(X, y, *, max_depth, min_samples_split, min_samples_leaf):
    tree = initialize_tree(X, y)
    queue = [tree]
    for i in range(max_depth):
        next_queue = []
        for node in queue:
            feature_id, threshold, gain = find_best_split(X, y, node,
                                                          min_samples_split=min_samples_split,
                                                          min_samples_leaf=min_samples_leaf)
            if gain > 0.0:
                add_split(X, y, node, feature_id, threshold, gain)
                next_queue.append(node['left'])
                next_queue.append(node['right'])
        queue = next_queue
        if not queue:
            break
    return tree

def dump(tree):
    if 'left' in tree:
        return {'base_weight': tree['base_weight'],
                'gain': tree['gain'],
                'feature_id': tree['feature_id'],
                'threshold': tree['threshold'],
                'left': dump(tree['left']),
                'right': dump(tree['right'])}
    else:
        return {'leaf_value': tree['leaf_value']}

Here are the results for varying values of max_depth.

max_depth=1

=========use_experimental=True=========
[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.540000081,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "leaf_value": 1.5
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]
=========use_experimental=False=========
[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.689999998,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "leaf_value": 1.5
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]
=========Toy implementation=========
{
    "base_weight": 0.8999999761581421,
    "gain": 0.5399999976158142,
    "feature_id": 1,
    "threshold": 2.287416934967041,
    "left": {
        "leaf_value": 1.5
    },
    "right": {
        "leaf_value": 0.0
    }
}

max_depth=2

=========use_experimental=True=========
[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.540000081,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "split_feature": 0,
                "split_threshold": -2.97261524,
                "gain": 0.25,
                "yes": 3,
                "no": 4,
                "children": [
                    {
                        "nodeid": 3,
                        "leaf_value": 2
                    },
                    {
                        "nodeid": 4,
                        "leaf_value": 1
                    }
                ]
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]
=========use_experimental=False=========
[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.689999998,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "split_feature": 0,
                "split_threshold": -5.41397858,
                "gain": 1.5,
                "yes": 3,
                "no": 4,
                "children": [
                    {
                        "nodeid": 3,
                        "leaf_value": 2
                    },
                    {
                        "nodeid": 4,
                        "leaf_value": 1
                    }
                ]
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]
=========Toy implementation=========
{
    "base_weight": 0.8999999761581421,
    "gain": 0.5399999976158142,
    "feature_id": 1,
    "threshold": 2.287416934967041,
    "left": {
        "base_weight": 1.5,
        "gain": 0.25,
        "feature_id": 0,
        "threshold": -5.413978576660156,
        "left": {
            "leaf_value": 2.0
        },
        "right": {
            "leaf_value": 1.0
        }
    },
    "right": {
        "leaf_value": 0.0
    }
}

max_depth=3

=========use_experimental=True=========
[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.540000081,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "split_feature": 0,
                "split_threshold": -2.97261524,
                "gain": 0.25,
                "yes": 3,
                "no": 4,
                "children": [
                    {
                        "nodeid": 3,
                        "split_feature": 1,
                        "split_threshold": -8.30485821,
                        "gain": 2.38418579e-07,
                        "yes": 5,
                        "no": 6,
                        "children": [
                            {
                                "nodeid": 5,
                                "leaf_value": 2
                            },
                            {
                                "nodeid": 6,
                                "leaf_value": 2
                            }
                        ]
                    },
                    {
                        "nodeid": 4,
                        "split_feature": 0,
                        "split_threshold": 2.9149611,
                        "gain": 5.96046448e-08,
                        "yes": 7,
                        "no": 8,
                        "children": [
                            {
                                "nodeid": 7,
                                "leaf_value": 1
                            },
                            {
                                "nodeid": 8,
                                "leaf_value": 1
                            }
                        ]
                    }
                ]
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]
=========use_experimental=False=========
[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 2.28741693,
        "gain": 0.689999998,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "split_feature": 0,
                "split_threshold": -5.41397858,
                "gain": 1.5,
                "yes": 3,
                "no": 4,
                "children": [
                    {
                        "nodeid": 3,
                        "leaf_value": 2
                    },
                    {
                        "nodeid": 4,
                        "leaf_value": 1
                    }
                ]
            },
            {
                "nodeid": 2,
                "leaf_value": 0
            }
        ]
    }
]
=========Toy implementation=========
{
    "base_weight": 0.8999999761581421,
    "gain": 0.5399999976158142,
    "feature_id": 1,
    "threshold": 2.287416934967041,
    "left": {
        "base_weight": 1.5,
        "gain": 0.25,
        "feature_id": 0,
        "threshold": -5.413978576660156,
        "left": {
            "leaf_value": 2.0
        },
        "right": {
            "leaf_value": 1.0
        }
    },
    "right": {
        "leaf_value": 0.0
    }
}
@hcho3 hcho3 added "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels Nov 25, 2020
@hcho3
Contributor Author

hcho3 commented Nov 25, 2020

Note: make sure to apply patch #3186 to obtain the gain values.

@hcho3
Contributor Author

hcho3 commented Nov 25, 2020

@vinaydes

Observations so far:

  • The old RF backend yields wrong MSE gain values.
  • On the other hand, the new RF backend produces a gratuitous split when max_depth is set to 3. The following node is an unnecessary split, because the three data rows belonging to it all have the same label (2). In addition, the MSE gain should actually be zero, not 2e-7 (see the check after the node dump below).
                    {
                        "nodeid": 3,
                        "split_feature": 1,
                        "split_threshold": -8.30485821,
                        "gain": 2.38418579e-07,
                        "yes": 5,
                        "no": 6,
                        "children": [
                            {
                                "nodeid": 5,
                                "leaf_value": 2
                            },
                            {
                                "nodeid": 6,
                                "leaf_value": 2
                            }
                        ]
                    }
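
A quick check with the toy implementation's mse (a sketch, reusing toy_dt.py from above; the three labels are the 2s shown in the node) confirms that when each node's MSE is computed separately, any split of a constant-label node has a gain of exactly zero:

"""check_zero_gain.py (sketch)"""
import numpy as np
from toy_dt import mse  # per-node MSE from the toy implementation above

y_node = np.array([2, 2, 2], dtype=np.float32)  # all labels identical

# try both ways to split the three rows into non-empty halves
for n_left in [1, 2]:
    y_left, y_right = y_node[:n_left], y_node[n_left:]
    gain = mse(y_node) - n_left / 3 * mse(y_left) - (3 - n_left) / 3 * mse(y_right)
    print(gain)  # exactly 0.0: the mean of a constant array is exact in float32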

@hcho3
Contributor Author

hcho3 commented Nov 25, 2020

Visualization of the toy data:
[image: toy_data — scatter plot of the 10 points]
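
The plot can be regenerated with a minimal sketch (assuming matplotlib is available; it reuses the exact make_blobs call from mre.py):

"""plot_toy_data.py (sketch)"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# same data as mre.py
X, y = make_blobs(n_samples=10, n_features=2, centers=None, random_state=42)
X, y = X.astype(np.float32), y.astype(np.float32)

plt.scatter(X[:, 0], X[:, 1], c=y)  # color each point by its label
plt.xlabel('feature 0')
plt.ylabel('feature 1')
plt.show()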

@hcho3
Contributor Author

hcho3 commented Nov 25, 2020

I just found a silly mistake in my toy decision tree implementation and have fixed it.

@hcho3
Contributor Author

hcho3 commented Nov 25, 2020

Minimal reproducer for the gratuitous split when a node's rows all have the same label:

from cuml.ensemble import RandomForestRegressor as rfr
import numpy as np
import json

X = np.array([[-1, 0], [0, 1], [2, 0], [0, 3], [-2, 0]], dtype=np.float32)
y = np.array([1, 1, 1, 1, 1], dtype=np.float32)

kwargs = {'max_features': 1.0, 'n_estimators': 1, 'max_depth': 1, 'bootstrap': False, 'n_bins': 5}

for use_experimental in [True, False]:
    clf = rfr(use_experimental_backend=use_experimental, **kwargs)
    clf.fit(X, y)
    json_obj = json.loads(clf.dump_as_json())
    print(f'=========use_experimental={use_experimental}=========')
    print(json.dumps(json_obj, indent=4))

Output:

=========use_experimental=True=========
[
    {
        "nodeid": 0,
        "split_feature": 1,
        "split_threshold": 0,
        "gain": 2.98023224e-08,
        "yes": 1,
        "no": 2,
        "children": [
            {
                "nodeid": 1,
                "leaf_value": 1
            },
            {
                "nodeid": 2,
                "leaf_value": 1
            }
        ]
    }
]
=========use_experimental=False=========
[
    {
        "nodeid": 0,
        "leaf_value": 1
    }
]

@hcho3
Contributor Author

hcho3 commented Nov 25, 2020

I found the reason why the new backend was adding the unnecessary split in the example #3188 (comment). It's because of the order in which three floating-point numbers are added together when computing the gain value:

"""catastrophic_cancellation_in_gain.py"""
import numpy as np

X = np.array([[-1, 0], [0, 1], [2, 0], [0, 3], [-2, 0]], dtype=np.float32)
y = np.array([1, 1, 1, 1, 1], dtype=np.float32)

ind = (X[:, 1] <= 0.0)
y_left, y_right = y[ind], y[~ind]

invlen = np.float32(1.0) / np.float32(5.0)
invLeft = np.float32(1.0) / np.float32(3.0)
invRight = np.float32(1.0) / np.float32(2.0)

valL = np.sum(y_left)
valR = np.sum(y_right)
valP = (valL + valR) * invlen

A = -valP * valP
assert isinstance(A, np.float32)
print(f'A = {A}')

B = valL * invlen * valL * invLeft
assert isinstance(B, np.float32)
print(f'B = {B}')

C = valR * invlen * valR * invRight
assert isinstance(C, np.float32)
print(f'C = {C}')

gain = (A + B) + C

print('=' * 60)
print(f'A + B = {A + B}')
print(f'A + C = {A + C}')
print(f'B + C = {B + C}')

print('=' * 60)
print(f'(A + B) + C = {(A + B) + C}')
print(f'(A + C) + B = {(A + C) + B}')
print(f'(B + C) + A = {(B + C) + A}')

print('=' * 60)
print(f'gain = (A + B) + C = {gain}')

which prints

A = -1.0
B = 0.6000000238418579
C = 0.4000000059604645
============================================================
A + B = -0.3999999761581421
A + C = -0.6000000238418579
B + C = 1.0
============================================================
(A + B) + C = 2.9802322387695312e-08
(A + C) + B = 0.0
(B + C) + A = 0.0
============================================================
gain = (A + B) + C = 2.9802322387695312e-08
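
As a cross-check (a sketch with exact rational arithmetic; valL = 3, valR = 2 are taken from the script above), the three terms sum to exactly zero, so the 2.98e-08 is pure float32 rounding:

"""exact_gain.py (sketch)"""
from fractions import Fraction

valL, valR, n = Fraction(3), Fraction(2), Fraction(5)
valP = (valL + valR) / n
A = -valP * valP                   # -1
B = valL / n * valL / Fraction(3)  # 3/5
C = valR / n * valR / Fraction(2)  # 2/5
print(A + B + C)  # 0 -- the true gain is exactly zero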

@teju85
Member

teju85 commented Nov 27, 2020

This is probably a bit of a digression from the topic of this issue, but I feel what I'm writing here might help (at least partly) fix this issue in our new backend...

@hcho3
During our 1:1, @vinaydes and I found a potential problem with the way the new backend splits nodes. Let me explain using only the MSE metric (but it should apply to the other 3 metrics too). Here is our MSE metric in the new backend. We believe that this function should instead evaluate the gain as follows:

template <typename DataT, typename IdxT>
DI void mseGain(DataT* spred, IdxT* scount, DataT* sbins,
                Split<DataT, IdxT>& sp, IdxT col, IdxT len, IdxT nbins,
                IdxT min_samples_leaf, DataT min_val, DataT min_impurity_decrease) {
  auto invlen = DataT(1.0) / len;
  for (IdxT i = threadIdx.x; i < nbins; i += blockDim.x) {
    auto nLeft = scount[i];
    auto nRight = len - nLeft;
    DataT gain;
    // if there aren't enough samples in this split, don't bother!
    if (nLeft < min_samples_leaf || nRight < min_samples_leaf) {
      gain = min_val;  // IOW, -std::numeric_limits<DataT>::max() (but this constant is not accessible from device methods!)
    } else {
      auto invLeft = DataT(1.0) / nLeft;
      auto invRight = DataT(1.0) / nRight;
      auto valL = spred[i];
      auto valR = spred[nbins + i];
      // parent sum is basically sum of its left and right children
      auto valP = (valL + valR) * invlen;
      gain = -valP * valP;
      gain += valL * invlen * valL * invLeft;
      gain += valR * invlen * valR * invRight;
    }
    // if the gain is not "enough", don't bother!
    if (gain <= min_impurity_decrease) {
      gain = min_val;
    }
    sp.update({sbins[i], col, gain, nLeft});
  }
}

In short, if we find that certain splits are not going to satisfy the min_samples_leaf or min_impurity_decrease hyper-parameters, then we really shouldn't consider such splits at all! Can you retry with this change, please?
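
In plain Python terms (a hypothetical sketch with illustrative names, not the actual device code), the suggested change makes ineligible candidates lose the best-split selection unconditionally:

"""eligible_split.py (sketch)"""
NEG_INF = -float('inf')  # plays the role of min_val in the kernel

def candidate_gain(gain, n_left, n_right, *, min_samples_leaf, min_impurity_decrease):
    # ineligible splits get the sentinel gain, so the argmax over candidates can never pick them
    if n_left < min_samples_leaf or n_right < min_samples_leaf:
        return NEG_INF
    if gain <= min_impurity_decrease:
        return NEG_INF
    return gain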

@hcho3
Contributor Author

hcho3 commented Dec 1, 2020

I filed #3216 according to your suggestion. While it does not fix the issue in this particular toy example, I believe it should be merged on its own merit, namely ignoring ineligible split candidates.

@hcho3 hcho3 added the "CUDA / C++" (CUDA issue) label and removed the "? - Needs Triage" (Need team to review and classify) label Dec 2, 2020
rapids-bot bot pushed a commit that referenced this issue Dec 4, 2020
…tical(#3243)

Closes #3231 
Closes #3128
Partially addresses #3188 

The degenerate case (labels all identical in a node) is now robustly handled by computing the MSE metric separately for each of the three nodes (the parent node, the left child node, and the right child node). Doing so ensures that the gain is 0 in the degenerate case.

The degenerate case may occur in some real-world regression problems, e.g. house price data where the price label is rounded to the nearest 100k.

As a result, the MSE gain is now computed very similarly to the MAE gain.

Disadvantage: we now always make two passes over the data to compute the gain.

cc @teju85 @vinaydes @JohnZed

Authors:
  - Hyunsu Cho <[email protected]>
  - Philip Hyunsu Cho <[email protected]>

Approvers:
  - Thejaswi Rao
  - John Zedlewski

URL: #3243
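
For reference, a sketch of the two-pass per-node MSE described above (illustrative NumPy, not the actual CUDA kernel): the first pass computes the node's mean, the second accumulates squared deviations from it, which is exactly zero for constant labels:

"""two_pass_mse.py (sketch)"""
import numpy as np

def node_mse(y):
    mean = np.sum(y) / y.shape[0]                # pass 1: node mean
    return np.sum((y - mean) ** 2) / y.shape[0]  # pass 2: squared deviations

y = np.full(3, 2.0, dtype=np.float32)  # a node whose labels are all identical
print(node_mse(y))  # exactly 0.0, so any split of this node has zero gain
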
@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@hcho3
Contributor Author

hcho3 commented May 18, 2021

Fixed in #3243

@hcho3 hcho3 closed this as completed May 18, 2021