
[train] Updates to support xgboost==2.1.0 #46667

Merged: 20 commits into ray-project:master on Aug 8, 2024

Conversation

justinvyu (Contributor):

Why are these changes needed?

xgboost 2.1.0 was recently released, and it changed some of the distributed setup APIs.

In particular, the RabitTracker setup now passes connection info directly as lowercase keyword arguments instead of through environment variables, and the tracker no longer manages its own background thread (see the review discussion below).

This PR branches the setup logic between pre-2.1.0 and post-2.1.0. We should eventually drop pre-2.1.0 support.
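
A minimal sketch of what that version gate can look like (the constant name and exact comparison are illustrative, not the PR's actual code):

import xgboost
from packaging.version import Version

# Branch the distributed setup logic on the installed xgboost version.
# The pre-2.1.0 path is kept for now but should eventually be dropped.
XGBOOST_2_1_0_OR_NEWER = Version(xgboost.__version__) >= Version("2.1.0")

if XGBOOST_2_1_0_OR_NEWER:
    ...  # new tracker/collective setup (worker_args, wait_for)
else:
    ...  # legacy env-var based rabit setup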

Testing

This PR also updates the tested xgboost version to 2.1.0. Pre-2.1.0 has been tested manually.

Related issue number

Closes #46476

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@hongpeng-guo (Contributor) left a comment:


Thanks for the implementation! Two questions from me:

  1. The RabitTracker class functions worker_args and worker_env return the same type of thing, Dict[str, Union[int, str]]. The only difference is that the keys of worker_env are uppercase while the keys of worker_args are lowercase. Our adaptation to this change is to move from env setup to training context setup, is that correct?
  2. The RabitTracker class doesn't maintain a thread itself; instead we need to create a main-thread kind of thing using its wait_for method to wait on tracker.start ourselves (see the sketch below). My question is: do we also need to distinguish xgboost versions before/after 2.1.0 in the on_shutdown method? My first intuition is to use tracker.thread before 2.1.0 and wait_for after 2.1.0. Currently, it seems we always use the wait_for method.
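
For context on question 2, here is a rough sketch of the post-2.1.0 tracker lifecycle; the RabitTracker constructor arguments and method signatures are recalled from xgboost 2.1 and should be double-checked against the release:

from xgboost.tracker import RabitTracker

# Post-2.1.0 the tracker no longer owns a Python thread; the driver
# starts it, hands the connection args to the workers, and blocks on
# wait_for() until the workers have connected and finished.
tracker = RabitTracker(n_workers=4, host_ip="10.0.0.1")
tracker.start()

worker_args = tracker.worker_args()  # lowercase keys, passed as kwargs

# ... launch the training workers with worker_args ...

tracker.wait_for(timeout=30)  # replaces joining tracker.thread pre-2.1.0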

@justinvyu (Contributor, Author):

> Our adaptation to this change is to move from env setup to training context setup, is that correct?

Yes. The API changed from accepting environment variables to only allowing you to pass the arguments directly as kwargs with those lower-case names.
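
Roughly, the before/after looks like the following; the exact key names are illustrative (the uppercase ones mirror worker_env, the lowercase ones mirror worker_args):

import os
from xgboost.collective import CommunicatorContext

# Pre-2.1.0: workers discovered the tracker through environment variables.
os.environ["DMLC_TRACKER_URI"] = "10.0.0.1"
os.environ["DMLC_TRACKER_PORT"] = "9091"

# Post-2.1.0: the same values are passed directly as lowercase kwargs.
args = {"dmlc_tracker_uri": "10.0.0.1", "dmlc_tracker_port": 9091}
with CommunicatorContext(**args):
    ...  # distributed xgboost training runs inside this context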

def on_training_start(
    self, worker_group: WorkerGroup, backend_config: XGBoostConfig
):
    assert backend_config.xgboost_communicator == "rabit"
Contributor:

nit: it seems XGBoostConfig has a hard-coded xgboost_communicator field set to "rabit"; why do we still need an assertion here?

@justinvyu (Contributor, Author):

Yeah, I can probably remove this field for now, since we don't support the "federated" option.
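
For reference, the field in question looks roughly like this (a sketch, not the exact source):

from dataclasses import dataclass
from ray.train.backend import BackendConfig

@dataclass
class XGBoostConfig(BackendConfig):
    # Hard-coded for now: only "rabit" is supported, so the assertion in
    # on_training_start is redundant; the field could be dropped until a
    # "federated" option is actually supported.
    xgboost_communicator: str = "rabit"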

@hongpeng-guo (Contributor) left a comment:

Left some comments, most of them are nits and should be non-blocking.
It should be good to go if all unit tests look good.

(Three resolved review comment threads on python/ray/train/xgboost/config.py.)
@@ -37,28 +41,93 @@ class XGBoostConfig(BackendConfig):
     def train_func_context(self):
         @contextmanager
         def collective_communication_context():
-            with CommunicatorContext():
+            with CommunicatorContext(**_get_xgboost_args()):
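
For context, _get_xgboost_args presumably reads a module-level holder that the backend populates on each worker before the training function enters the communicator context; a hypothetical sketch of that pattern (only _get_xgboost_args appears in the diff, the rest is assumed):

# Module-level storage, written by the Backend on each worker process.
_xgboost_args: dict = {}

def _set_xgboost_args(**args) -> None:
    global _xgboost_args
    _xgboost_args = args

def _get_xgboost_args() -> dict:
    return _xgboost_args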
Member:

Are we able to save the xgboost_args into the XGBoost config so we can avoid modifying the global variable?

@justinvyu (Contributor, Author):

Hmm, interesting. I actually don't understand why we need both BackendConfig and Backend classes. Any context here, @matthewdeng?

Contributor:

The BackendConfig is the public API that the user could interact with. There is probably a better/cleaner way to organize the two.

Member:

Yeah, currently the dependency between BackendConfig and Backend is unidirectional. It's kind of hard to pass information from Backend -> BackendConfig.

Contributor:

Should train_func_context be part of the Backend instead?

Contributor:

Or at the very least the default one.

Contributor:

Oh hm maybe that won't work because we construct the train loop before the backend...
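
For readers following the thread, the pairing under discussion follows Ray Train's config/backend split, roughly like this sketch (hook bodies elided; names beyond Backend, BackendConfig, backend_cls, and on_training_start are illustrative):

from dataclasses import dataclass
from ray.train.backend import Backend, BackendConfig

@dataclass
class MyConfig(BackendConfig):
    # Public, user-facing configuration object.
    @property
    def backend_cls(self):
        return MyBackend

class MyBackend(Backend):
    # Internal setup hooks. The backend receives the config, but the
    # config never sees the backend instance, hence the unidirectional
    # dependency mentioned above.
    def on_training_start(self, worker_group, backend_config):
        ...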

@woshiyyya (Member) left a comment:

lgtm!

@hongpeng-guo (Contributor) left a comment:

LGTM!

justinvyu enabled auto-merge (squash) on August 7, 2024 22:28
github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) on Aug 7, 2024
github-actions bot disabled auto-merge on August 8, 2024 00:33
justinvyu merged commit c634872 into ray-project:master on Aug 8, 2024
5 checks passed
justinvyu deleted the xgb210compat branch on August 8, 2024 18:48
dev-goyal pushed a commit to dev-goyal/ray that referenced this pull request Aug 8, 2024
Support xgboost 2.1.0, which was recently released and changed some of the
distributed setup APIs.
---------

Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Dev <[email protected]>
Labels: go (add ONLY when ready to merge, run all tests)
Projects: none yet

Successfully merging this pull request may close these issues:

Ray Train incompatible with XGBoost 2.1.0 (#46476)

4 participants