feature_fraction RNG looks broken #4371

Closed
DexGroves opened this issue Jun 11, 2021 · 5 comments

@DexGroves

DexGroves commented Jun 11, 2021

Hi!

First up, thanks for maintaining this amazing package!

I noticed something weird while training a model with three predictors and 1/2 < feature_fraction < 5/6. One feature always gets ignored. I think this is because of the per-tree RNG that determines which features are available. As far as I can tell:

  1. NextInt in random.h gives alternating even and odd numbers, which becomes a problem when we take the result modulo an even number, particularly modulo 2 (see the sketch after this list).
  2. There is an off-by-one here? I think it should be r + 1, but I could be wrong.
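
Here is a minimal sketch of that parity behaviour. It is not LightGBM's actual random.h code; the LCG constants and the direct use of the state are assumptions purely for illustration, but they show how a generator whose lowest bit flips on every call makes % 2 completely deterministic:

# Toy LCG with an odd multiplier and odd increment: the lowest bit of the
# state flips on every call, so state % 2 cycles 1, 0, 1, 0, ...
state = 42
a, c, m = 214013, 2531011, 2 ** 31  # constants assumed for illustration only

parities = []
for _ in range(8):
    state = (a * state + c) % m
    parities.append(state % 2)

print(parities)  # [1, 0, 1, 0, 1, 0, 1, 0] -- strictly alternating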

Repro

import lightgbm as lgb
import numpy as np
import pandas as pd

# Make some fake data
N = 1_000
p = 3
seed = 42

np.random.seed(seed)

df = pd.DataFrame(dict(y=np.random.normal(size=N), w=np.random.uniform(size=N)))

features = []
for i in range(p):
    df[str(i)] = np.random.normal(size=N)
    features.append(str(i))

X = lgb.Dataset(df[features], label=df.y)

# Fit a model
params = dict(
    num_iterations=1_000,
    max_depth=3,
    feature_fraction=0.8,
    learning_rate=0.2,
    objective="l2",
    verbosity=-1,
)

regr = lgb.train(params=params, train_set=X)

# One feature is ignored completely: it is featured in no splits
regr.feature_importance()

Why I think this is happening

In the select-two-from-three case, we go around this loop exactly twice per tree, and two bad things happen. On the first go-around, the call rolls NextInt(0, 1), which returns 0 in all cases, so the zeroth variable always gets included. I think the roll should be NextInt(0, 2), so the zeroth variable has a chance to be left out.

The second go-around rolls NextInt(0, 2), which returns either 0 or 1 depending on the seed. This is doing RandInt32() % 2 under the hood. Since this is the 2nd, 4th, 6th, 8th, ... call to the RNG, and RandInt32 alternates between even and odd numbers, it always hits the same value for every tree.

This doesn't just affect the select-two-from-three case, but the other cases get more complicated to think about. Even-indexed variables get interesting if they get to roll something_always_even % another_small_even_number. You can construct other combinations of number of features and feature_fraction where variables in the last or second-to-last position are always excluded, or where the model has a significant bias towards using even-indexed variables, etc. The toy simulation below spells out the two-from-three case.
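
This is only my reading of the sampling loop: the toy RNG (an LCG whose raw state is used directly) and the collision-falls-back-to-r behaviour are assumptions for illustration rather than LightGBM's actual implementation, but the parity pattern matches what is described above.

# Toy model of the per-tree feature sampling described above. Assumes
# NextInt(0, r) == RandInt32() % r and that RandInt32 alternates parity.
class ToyRNG:
    def __init__(self, seed):
        self.state = seed

    def rand_int32(self):
        # Stand-in LCG; only the even/odd alternation of its output matters here.
        self.state = (214013 * self.state + 2531011) % 2 ** 31
        return self.state

    def next_int(self, lower, upper):
        # Returns a value in [lower, upper), matching my reading of NextInt.
        return self.rand_int32() % (upper - lower) + lower

def sample_two_of_three(rng):
    chosen = set()
    for r in (1, 2):
        v = rng.next_int(0, r)  # r=1 -> always 0; r=2 -> RandInt32() % 2
        # Assumed collision handling: if v was already picked, pick r instead.
        chosen.add(r if v in chosen else v)
    return chosen

rng = ToyRNG(seed=42)
subsets = {frozenset(sample_two_of_three(rng)) for _ in range(1000)}
print(subsets)  # a single subset repeats for every tree, so one feature is never used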

Fixes?

  • There are two branches for determining feature fraction, and the other one looks fine, so just use that one every time? I assume there is a good reason to use two branches though (and I'd be interested to understand it).
  • Do something smart with RandInt32 so it doesn't have even/odd alternation? (A rough sketch of one possibility follows this list.)
  • Use RandInt16 for NextInt?
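
As a sketch of what "something smart" could look like (purely illustrative, built on the toy LCG above, and not a claim about how RandInt32 or RandInt16 actually work): take the output from the high bits of the state, so the strictly alternating low bit never reaches the modulo.

def rand_bits_mixed(state):
    # Advance the same toy LCG, but emit the upper bits of the state; the
    # alternating lowest bit is discarded before any modulo is taken.
    state = (214013 * state + 2531011) % 2 ** 31
    return state, (state >> 15) & 0x7FFF

state, parities = 42, []
for _ in range(8):
    state, value = rand_bits_mixed(state)
    parities.append(value % 2)
print(parities)  # no longer forced to alternate between even and odd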

Happy to submit a PR for the first or third one if you think they're OK solutions.

@jameslamb jameslamb added the bug label Jun 11, 2021
@StrikerRUS
Collaborator

Linking #4134 (comment) as related.

@shiyu1994
Collaborator

@DexGroves Thanks for using LightGBM. After a careful check, I think it should be NextInt(0, r + 1) instead of NextInt(0, r) here:

int v = NextInt(0, r);

Otherwise the last element N-1 will always be left out. Thanks for pointing that out.

As for why the second branch in Sample is used, I think it is because the second branch requires a much smaller number of calls to the random number generator, which makes it more efficient.

With NextInt(0, r + 1), I think the "selecting two from three" case won't always leave any element out. But it can still always choose 0 or 1 due to the alternating nature of our pseudo-random generator. Compared with always leaving some elements out, this is a smaller problem. Do you know any approach to avoid this? The sketch below illustrates the remaining behaviour.
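
For reference, here is the toy loop from the issue above with the inclusive upper bound (only an illustration of my reading of the change, reusing the ToyRNG class from that sketch, not the actual patch):

def sample_two_of_three_fixed(rng):
    chosen = set()
    for r in (1, 2):
        v = rng.next_int(0, r + 1)  # r=1 -> 0 or 1; r=2 -> 0, 1 or 2
        # Same assumed collision handling as in the original sketch.
        chosen.add(r if v in chosen else v)
    return chosen

rng = ToyRNG(seed=42)
subsets = {frozenset(sample_two_of_three_fixed(rng)) for _ in range(1000)}
print(subsets)  # typically more than one subset, so nothing is left out forever,
                # but the first draw is still pinned to 0 or 1 by the parity pattern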

@StrikerRUS
Collaborator

Fixed via #4450. I can confirm that the original reproducible example now returns something like the following:

array([ 995, 2118, 1815])

@DexGroves
Author

Thanks for the quick fix @shiyu1994!

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023