[c++] Fix dump_model() information for root node #6569

Open · wants to merge 42 commits into master
Conversation

@neNasko1 (Contributor) commented Jul 24, 2024

This PR corrects the output of dump_model() and other dump-related functions like trees_to_dataframe(). Two fixes are implemented (a minimal sketch illustrating both follows the list):

  1. The current Tree::Split implementation incorrectly saves the old leaf output value in the internal_value_ array when called on the root node. This, in turn, makes inspection of the whole training process from Python incomplete.

Before:

(Pdb) booster_.trees_to_dataframe()
     tree_index  node_depth node_index left_child right_child parent_index  ... decision_type  missing_direction missing_type     value weight count
0             0           1       0-S0       0-S1        0-S2         None  ...            ==              right         None  0.000000      0   200
1             0           2       0-S1       0-S5        0-S4         0-S0  ...            <=               left         None  0.106573    113   113
2             0           3       0-S5       0-L0        0-L6         0-S1  ...            ==              right         None  0.082122     56    56
3             0           4       0-L0       None        None         0-S5  ...          None               None         None  0.064612     26    26
4             0           4       0-L6       None        None         0-S5  ...          None               None         None  0.097297     30    30

After:

(Pdb) booster_.trees_to_dataframe().head()
   tree_index  node_depth node_index left_child right_child parent_index  ... decision_type  missing_direction missing_type     value weight count
0           0           1       0-S0       0-S1        0-S2         None  ...            ==              right         None  0.081757    200   200
1           0           2       0-S1       0-S5        0-S4         0-S0  ...            <=               left         None  0.106573    113   113
2           0           3       0-S5       0-L0        0-L6         0-S1  ...            ==              right         None  0.082122     56    56
3           0           4       0-L0       None        None         0-S5  ...          None               None         None  0.064612     26    26
4           0           4       0-L6       None        None         0-S5  ...          None               None         None  0.097297     30    30
  2. Stump has no leaf_count inside dump_model() output (#5962).
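
To make both fixes concrete, here is a minimal, hypothetical sketch (it is not taken from this PR's tests; it assumes only the standard lightgbm Python API, and the data and parameter values are illustrative):

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = rng.random(200)

# Fix 2: force a stump (a single-leaf tree with no splits) by making the
# minimum leaf size equal to the dataset size, then check that dump_model()
# reports leaf_count for it (issue #5962).
stump = lgb.train(
    {"objective": "regression", "min_data_in_leaf": 200, "verbose": -1},
    lgb.Dataset(X, label=y),
    num_boost_round=1,
)
root = stump.dump_model()["tree_info"][0]["tree_structure"]
print("leaf_count" in root)  # expected to be True once this PR is merged

# Fix 1: the root row of trees_to_dataframe() should carry the real internal
# value and weight instead of the stale leaf output.
bst = lgb.train(
    {"objective": "regression", "verbose": -1},
    lgb.Dataset(X, label=y),
    num_boost_round=1,
)
print(bst.trees_to_dataframe()[["value", "weight", "count"]].head(1))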

@neNasko1 (Contributor, Author)

Currently the CI is not passing, as #6574 is blocking it.

@neNasko1 (Contributor, Author)

I am open to ideas for ways to test the related functionality.

Tests should now be sufficient for the change.

@jameslamb changed the title from "[c++] Root internal_value_ is not calculated properly" to "[c++] Fix calculation of internal_value_ for root node" on Jul 29, 2024
@neNasko1 (Contributor, Author)

@jameslamb
Could you take a look at the PR, now that the CI is passing?

@jameslamb (Collaborator) left a comment

@shiyu1994 or @guolinke could you help with a review of this?

I'm not sure if this will correctly handle these cases:

  • custom init_score provided (via Dataset)
  • boost_from_average=False passed

@neNasko1 could you also look at #5962 and let us know if you think this change would fix the issue @thatlittleboy reported there?

@neNasko1 (Contributor, Author) commented Aug 3, 2024

Thank you for taking the time to look into the PR and linking a relevant issue.

I'm not sure if this will correctly handle these cases:

  • custom init_score provided (via Dataset)
  • boost_from_average=False passed

I think those cases are handled, as the results are consistent with what the leaf values report. I have also reworked the test to boost from average.
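
For reference, here is a hypothetical way those two cases (plus a custom init_score) can be spot-checked; this is not the PR's actual test, and the data and parameters are made up:

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = rng.random(500)

def root_value(init_score=None, **extra_params):
    """Train a single tree and return the root node's reported value."""
    train_set = lgb.Dataset(X, label=y, init_score=init_score)
    booster = lgb.train(
        {"objective": "regression", "verbose": -1, **extra_params},
        train_set,
        num_boost_round=1,
    )
    return booster.trees_to_dataframe()["value"][0]

# In each case, the root value should stay consistent with the leaf values
# reported by the same dump.
print(root_value())                                   # boost_from_average=True (default)
print(root_value(boost_from_average=False))           # averaging disabled
print(root_value(init_score=np.full(500, y.mean())))  # custom init_score via Dataset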

@neNasko1 could you also look at #5962 and let us know if you think this change would fix the issue @thatlittleboy reported there?

I took the liberty of merging @thatlittleboy's WIP code into mine, additionally fixing the issues they reported. I will also change the description of the PR to reflect both fixes.

@neNasko1 changed the title from "[c++] Fix calculation of internal_value_ for root node" to "[c++] Fix dump_model() information for root node" on Aug 3, 2024
@jameslamb (Collaborator) left a comment

Thanks, I left a few questions for your consideration.

Review thread on tests/python_package_test/test_dask.py (outdated, resolved)
@borchero (Collaborator) left a comment

This looks good to me! 🚀

@neNasko1 (Contributor, Author) commented Sep 2, 2024

@jameslamb
Just to recap: the test in tests/python_package_test/test_dask.py seems to have previously been a no-op, since the two models produced with and without init scores are the same in the classifier case. This, however, is not related to the current changes. Can you tell me whether I am missing something?

@jameslamb (Collaborator)

the test in tests/python_package_test/test_dask.py seems to have previously been a no-op, since the two models produced with and without init scores are the same in the classifier case.

I'll investigate this when I can, hopefully in the next few days. In the interim, you can help move this forward by resolving merge conflicts and pulling in the latest changes on master.

@StrikerRUS (Collaborator) left a comment

LGTM!

But I'll keep following the discussion about the Dask Ranker test (#6569 (comment)).

@neNasko1 (Contributor, Author) commented Oct 1, 2024

@jameslamb can you submit a final review on the change, so that we can merge it?

Sorry for any inconvenience caused!

@jameslamb (Collaborator)

I will look when I can. I have spent most of my limited open source time in the last few weeks investigating and fixing multiple difficult, time-sensitive CI issues in this project, and there is yet another one that is still not done and a primary focus for me right now (#6651).

If @StrikerRUS has time to re-review the commits and comments you've pushed since his approval, and if he approves, then my review can be dismissed and this can be merged without another review from me. Otherwise, you will have to be patient a bit longer.

@jameslamb (Collaborator) commented Oct 8, 2024

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/11226953247

Status: success ✔️.

@jameslamb (Collaborator) left a comment

Thanks. I've left some minor suggestions for your consideration, around making the tests stricter and easier to understand.

I've also triggered valgrind checks on this branch, to ensure no new memory-management issues have been introduced by this PR.

Unfortunately, this still needs a bit more investigation before I'm confident in it... I was finally able to investigate your comments in #6569 (comment), and found that the Dask test checking the trees_to_dataframe() output in the presence of init_score really was testing that the init_score wasn't ignored. I am going to try right now to figure out why that was, and whether it has implications for this PR.


By the way, most of the commits you've pushed here are not tied to your GitHub account.

[Screenshot: recent commits shown without a linked GitHub account]

It doesn't really matter in this repo, because if this is merged we'll squash everything into one commit, and that'll be correctly tied to your account. But I'm just making you aware of it, as it might cause problems for you in other GitHub-based projects. You can fix it for future commits like this:

git config --global user.email "${EMAIL}"

replacing ${EMAIL} with an email address tied to your GitHub account.

Review threads (outdated, resolved):
tests/python_package_test/test_dask.py (2)
tests/python_package_test/test_engine.py (7)
@@ -1464,8 +1464,7 @@ def test_init_score(task, output, cluster):
init_scores = dy.map_blocks(lambda x: np.full((x.size, size_factor), init_score))
model = model_factory(client=client, **params)
model.fit(dX, dy, sample_weight=dw, init_score=init_scores, group=dg)
# value of the root node is 0 when init_score is set
assert model.booster_.trees_to_dataframe()["value"][0] == 0
assert model.fitted_
A collaborator left a comment on this diff:

The way the init_score was tested seems like a flawed way to test it, because as we see from this PR the root value was always 0.

I finally was able to test this... this claim is just not true.

I just tested on latest master (0643230), with a patch like this:

diff --git a/tests/python_package_test/test_dask.py b/tests/python_package_test/test_dask.py
index 247f2eb1..e057320b 100644
--- a/tests/python_package_test/test_dask.py
+++ b/tests/python_package_test/test_dask.py
@@ -1463,7 +1463,7 @@ def test_init_score(task, output, cluster):
         else:
             init_scores = dy.map_blocks(lambda x: np.full((x.size, size_factor), init_score))
         model = model_factory(client=client, **params)
-        model.fit(dX, dy, sample_weight=dw, init_score=init_scores, group=dg)
+        model.fit(dX, dy, sample_weight=dw, group=dg)
         # value of the root node is 0 when init_score is set
         assert model.booster_.trees_to_dataframe()["value"][0] == 0
Testing code:
docker run \
    --rm \
    -v $(pwd):/opt/work \
    -w /opt/work \
    -it python:3.10 \
    bash

sh build-python.sh bdist_wheel install

pip install \
    cloudpickle \
    dask \
    distributed \
    numpy \
    pandas \
    pyarrow \
    pytest \
    scikit-learn \
    scipy

pytest 'tests/python_package_test/test_dask.py::test_init_score'

With init_score not provided, the root node's value is non-zero, and the test reliably and consistently fails.

E           assert np.float64(-0.0645985) == 0

tests/python_package_test/test_dask.py:1468: AssertionError

I will try to figure out how that got there, because now it's making me question some of the other changes in this PR.

@neNasko1 (Contributor, Author) commented Oct 8, 2024

I will look when I can. I have spent most of my limited open source time in the last few weeks investigating and fixing multiple difficult, time-sensitive CI issues in this project, and there is yet another one that is still not done and a primary focus for me right now (#6651).

Sorry for making it seem like there is some rush around this. I understand that this project is run mainly by volunteers and I do not want to harass any of the maintainers.

Thanks. I've left some minor suggestions for your consideration, around making the tests stricter and easier to understand.
I've also triggered valgrind checks on this branch, to ensure no new memory-management issues have been introduced by this PR.

Thanks for the suggestions; all of them are reasonable and are now merged. I grouped them into one commit, as I wanted to test that everything is okay.

Again, thanks to everyone for the time spent!

@neNasko1 (Contributor, Author) commented Oct 8, 2024

@jameslamb

With init_score not provided, the root node's value is non-zero, and the test reliably and consistently fails.

Is this related to boost_from_average?

  • When an init_score is provided, boost_from_average does not occur.
  • Otherwise, after training the first tree (iteration=1), the average is added to it.

I retract my earlier comment about the test being a no-op; however, after this fix this test will need to be changed.
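
To make that interplay concrete, here is a hypothetical local (non-Dask) sketch of the behaviour on master before this fix; the data and exact values are illustrative:

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = rng.random(500)
params = {"objective": "regression", "verbose": -1}

# No init_score: boost_from_average applies, and on master the averaged
# output is what ends up stored as the root's value -- hence non-zero.
plain = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=1)
print(plain.trees_to_dataframe()["value"][0])

# With init_score: boost_from_average is skipped, so on master the root
# keeps its initial output of 0 -- the behaviour the Dask test asserted.
# After this PR the root reports its real internal value instead, so the
# test's `== 0` assertion needs updating.
shifted = lgb.train(
    params,
    lgb.Dataset(X, label=y, init_score=np.full(500, y.mean())),
    num_boost_round=1,
)
print(shifted.trees_to_dataframe()["value"][0])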

@jameslamb (Collaborator)

Thanks, changes look great. For your other questions, let's please stay in the thread you're quoting, so the conversation can all be grouped together. I've responded there: #6569 (comment)
