feature_fraction RNG looks broken #4371

Closed
DexGroves opened this issue Jun 11, 2021 · 5 comments

@DexGroves

DexGroves commented Jun 11, 2021

Hi!

First up, thanks for maintaining this amazing package!

I noticed something weird while training a model with three predictors and 1/2 < feature_fraction < 5/6. One feature always gets ignored. I think this is because of the per-tree RNG that determines which features are available. As far as I can tell:

  1. NextInt in random.h gives alternating even and odd numbers, which becomes a problem when we take the result modulo an even number, particularly modulo 2 (see the sketch after this list).
  2. There is an off-by-one here? I think it should be r + 1, but I could be wrong.
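
Here is a minimal sketch of that parity behaviour. It is not LightGBM's actual random.h code; the LCG constants and the direct use of the state are assumptions purely for illustration, but they show how a generator whose lowest bit flips on every call makes % 2 completely deterministic:

# Toy LCG with an odd multiplier and odd increment: the lowest bit of the
# state flips on every call, so state % 2 cycles 1, 0, 1, 0, ...
state = 42
a, c, m = 214013, 2531011, 2 ** 31  # constants assumed for illustration only

parities = []
for _ in range(8):
    state = (a * state + c) % m
    parities.append(state % 2)

print(parities)  # [1, 0, 1, 0, 1, 0, 1, 0] -- strictly alternating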

Repro

import lightgbm as lgb
import numpy as np
import pandas as pd

# Make some fake data
N = 1_000
p = 3
seed = 42

np.random.seed(seed)

df = pd.DataFrame(dict(y=np.random.normal(size=N), w=np.random.uniform(size=N)))

features = []
for i in range(p):
    df[str(i)] = np.random.normal(size=N)
    features.append(str(i))

X = lgb.Dataset(df[features], label=df.y)

# Fit a model
params = dict(
    num_iterations=1_000,
    max_depth=3,
    feature_fraction=0.8,
    learning_rate=0.2,
    objective="l2",
    verbosity=-1,
)

regr = lgb.train(params=params, train_set=X)

# One feature is ignored completely: it is featured in no splits
regr.feature_importance()

Why I think this is happening

In the select-two-from-three case, we go around this loop exactly twice per tree, and two bad things happen. On the first go-around, the call rolls NextInt(0, 1), which returns 0 in all cases, so the zeroth variable always gets included. I think the roll should be NextInt(0, 2), so the zeroth variable has a chance to be left out.

The second go-around rolls NextInt(0, 2), which returns either 0 or 1 depending on the seed. This is doing RandInt32() % 2 under the hood. Since this is the 2nd, 4th, 6th, 8th, ... call to the RNG, and RandInt32 alternates between even and odd numbers, it always hits the same value for every tree.

This doesn't just affect the select-two-from-three case, but the other cases get more complicated to think about. Even-indexed variables get interesting if they get to roll something_always_even % another_small_even_number. You can construct other combinations of number of features and feature_fraction where variables in the last or second-to-last position are always excluded, or where the model has a significant bias towards using even-indexed variables, etc. The toy simulation below spells out the two-from-three case.
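
This is only my reading of the sampling loop: the toy RNG (an LCG whose raw state is used directly) and the collision-falls-back-to-r behaviour are assumptions for illustration rather than LightGBM's actual implementation, but the parity pattern matches what is described above.

# Toy model of the per-tree feature sampling described above. Assumes
# NextInt(0, r) == RandInt32() % r and that RandInt32 alternates parity.
class ToyRNG:
    def __init__(self, seed):
        self.state = seed

    def rand_int32(self):
        # Stand-in LCG; only the even/odd alternation of its output matters here.
        self.state = (214013 * self.state + 2531011) % 2 ** 31
        return self.state

    def next_int(self, lower, upper):
        # Returns a value in [lower, upper), matching my reading of NextInt.
        return self.rand_int32() % (upper - lower) + lower

def sample_two_of_three(rng):
    chosen = set()
    for r in (1, 2):
        v = rng.next_int(0, r)  # r=1 -> always 0; r=2 -> RandInt32() % 2
        # Assumed collision handling: if v was already picked, pick r instead.
        chosen.add(r if v in chosen else v)
    return chosen

rng = ToyRNG(seed=42)
subsets = {frozenset(sample_two_of_three(rng)) for _ in range(1000)}
print(subsets)  # a single subset repeats for every tree, so one feature is never used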

Fixes?

  • There are two branches for determining feature fraction, and the other one looks fine, so just use that one every time? I assume there is a good reason to use two branches though (and I'd be interested to understand it).
  • Do something smart with RandInt32 so it doesn't have even/odd alternation? (A rough sketch of one possibility follows this list.)
  • Use RandInt16 for NextInt?
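
As a sketch of what "something smart" could look like (purely illustrative, built on the toy LCG above, and not a claim about how RandInt32 or RandInt16 actually work): take the output from the high bits of the state, so the strictly alternating low bit never reaches the modulo.

def rand_bits_mixed(state):
    # Advance the same toy LCG, but emit the upper bits of the state; the
    # alternating lowest bit is discarded before any modulo is taken.
    state = (214013 * state + 2531011) % 2 ** 31
    return state, (state >> 15) & 0x7FFF

state, parities = 42, []
for _ in range(8):
    state, value = rand_bits_mixed(state)
    parities.append(value % 2)
print(parities)  # no longer forced to alternate between even and odd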

Happy to submit a PR for the first or third one if you think they're OK solutions.

@jameslamb jameslamb added the bug label Jun 11, 2021
@StrikerRUS
Collaborator

Linking #4134 (comment) as related.

@shiyu1994
Collaborator

@DexGroves Thanks for using LightGBM. After a careful check, I think it should be NextInt(0, r + 1) instead of NextInt(0, r) here:

int v = NextInt(0, r);

Otherwise the last element N-1 will always be left out. Thanks for pointing that out.

As for why the second branch in Sample is used, I think it is because the second branch requires a much smaller number of calls to the random number generator, which makes it more efficient.

With NextInt(0, r + 1), I think the "selecting two from three" case won't always leave any element out. But it can still always choose 0 or 1 due to the alternating nature of our pseudo-random generator. Compared with always leaving some elements out, this is a smaller problem. Do you know any approach to avoid this? The sketch below illustrates the remaining behaviour.
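
For reference, here is the toy loop from the issue above with the inclusive upper bound (only an illustration of my reading of the change, reusing the ToyRNG class from that sketch, not the actual patch):

def sample_two_of_three_fixed(rng):
    chosen = set()
    for r in (1, 2):
        v = rng.next_int(0, r + 1)  # r=1 -> 0 or 1; r=2 -> 0, 1 or 2
        # Same assumed collision handling as in the original sketch.
        chosen.add(r if v in chosen else v)
    return chosen

rng = ToyRNG(seed=42)
subsets = {frozenset(sample_two_of_three_fixed(rng)) for _ in range(1000)}
print(subsets)  # typically more than one subset, so nothing is left out forever,
                # but the first draw is still pinned to 0 or 1 by the parity pattern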

@StrikerRUS
Collaborator

Fixed via #4450. I can confirm that the original reproducible example now returns something like the following:

array([ 995, 2118, 1815])

@DexGroves
Author

Thanks for the quick fix @shiyu1994!

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023