
[train] Simplify ray.train.xgboost/lightgbm (1/n): Align frequency-based and checkpoint_at_end checkpoint formats #42111

Merged (57 commits) on Feb 13, 2024

Conversation

justinvyu (Contributor) commented Dec 28, 2023

Why are these changes needed?

This PR fixes XGBoostTrainer and LightGBMTrainer checkpointing:

  1. Centralizes the checkpoint save and load implementations around the utility callbacks ray.train.xgboost/lightgbm.RayTrainReportCallback, which now define the standard checkpoint save/load format.
    • Previously, we had 3 separate implementations of xgb/lgbm checkpointing: (1) XGBoostCheckpoint, (2) ray.tune.integration.xgboost.TuneReportCheckpointCallback, and (3) XGBoostTrainer._save_model.
      • (1) was only used by (3) to get the XGBoostCheckpoint.MODEL_FILENAME constant in some places, but the from_model and get_model logic was re-implemented for no clear reason.
      • (2) was used for saving frequency-based checkpoints configured via CheckpointConfig(checkpoint_frequency).
      • (3) was only used for saving the final checkpoint at the end of training.
      • These diverging codepaths are what resulted in the confusing bug described in the linked issue.
    • Now, a single codepath (ray.train.*.RayTrainReportCallback) handles both checkpoint_frequency and checkpoint_at_end, and it standardizes checkpoint saving on the framework-specific checkpoint implementation.
  2. Deletes a lot of the legacy code that was needed during Ray 2.7 to 2.9 for migration purposes.
  3. Hard deprecates legacy APIs (TuneReportCallback). The migration is simple: TuneReportCallback() -> TuneReportCheckpointCallback(frequency=0).
  4. Removes the circular dependency of ray.tune and xgboost_ray/lightgbm_ray.
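The unification described in point 1 can be sketched in plain Python. This is an illustrative sketch only, not Ray's actual implementation: the class name, MODEL_FILENAME value, and method names are hypothetical stand-ins for what ray.train.xgboost.RayTrainReportCallback does.

```python
import json
import os
import tempfile


class ReportCallbackSketch:
    """Sketch of a single callback handling both checkpoint triggers.

    Hypothetical stand-in for ray.train.xgboost.RayTrainReportCallback:
    frequency-based and end-of-training checkpoints share one save routine,
    so every checkpoint has an identical on-disk format.
    """

    MODEL_FILENAME = "model.json"  # hypothetical fixed filename

    def __init__(self, frequency=0, checkpoint_at_end=True):
        self.frequency = frequency
        self.checkpoint_at_end = checkpoint_at_end
        self.saved = []

    def _save(self, model, iteration):
        # Both triggers funnel into the same save routine.
        path = os.path.join(tempfile.mkdtemp(), self.MODEL_FILENAME)
        with open(path, "w") as f:
            json.dump({"model": model, "iteration": iteration}, f)
        self.saved.append(path)

    def after_iteration(self, model, iteration):
        # Frequency-based trigger (CheckpointConfig(checkpoint_frequency)).
        if self.frequency > 0 and (iteration + 1) % self.frequency == 0:
            self._save(model, iteration)

    def after_training(self, model, iteration):
        # End-of-training trigger (checkpoint_at_end).
        if self.checkpoint_at_end:
            self._save(model, iteration)


cb = ReportCallbackSketch(frequency=2, checkpoint_at_end=True)
for i in range(5):
    cb.after_iteration("booster-state", i)
cb.after_training("booster-state", 4)
print(len(cb.saved))  # frequency checkpoints at iterations 1 and 3, plus one at end
```

Because a single codepath writes every checkpoint, a model saved mid-training loads exactly like one saved at the end, which is what the previous three divergent implementations failed to guarantee.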

API Change Summary

  • ray.train.xgboost.RayTrainReportCallback (🆕 new): The utility callback that we recommend users build off of, similar to ray.train.lightning.RayTrainReportCallback. It will be exposed to users who have full control over the training loop in the new simplified XGBoostTrainer.
  • ray.train.xgboost.RayTrainReportCallback.get_model(filename) (🆕 new): A utility method to load a model from a checkpoint saved by this callback. It should be recommended over XGBoostTrainer.get_model going forward.
  • ray.tune.integration.xgboost.TuneReportCheckpointCallback (↔️ changed): Now just an alias of the callback above, usable by Tune users who want the callback without distributed training. We can consider deprecating it in the future.

The same APIs are introduced for the lightgbm counterparts.
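The key property of the standardized format can be sketched in plain Python (filenames and helper names below are hypothetical, not Ray's actual code): because every checkpoint writes the model under the same fixed filename, a single get_model-style loader works for checkpoints produced by either trigger.

```python
import os
import tempfile

MODEL_FILENAME = "model.json"  # hypothetical stand-in for the callback's constant


def save_model_to_checkpoint(checkpoint_dir: str, model_bytes: bytes) -> None:
    # Every save codepath writes the model under the same fixed filename.
    with open(os.path.join(checkpoint_dir, MODEL_FILENAME), "wb") as f:
        f.write(model_bytes)


def get_model(checkpoint_dir: str, filename: str = MODEL_FILENAME) -> bytes:
    # Loading is symmetric with saving, regardless of whether
    # checkpoint_frequency or checkpoint_at_end produced the checkpoint.
    with open(os.path.join(checkpoint_dir, filename), "rb") as f:
        return f.read()


ckpt_dir = tempfile.mkdtemp()
save_model_to_checkpoint(ckpt_dir, b"booster-state")
print(get_model(ckpt_dir))
```

This symmetry is what the new RayTrainReportCallback.get_model relies on, and why centralizing on one callback fixes the resume_from_checkpoint / get_model mismatch.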

TODOs left for followups

  • Revamp release tests, as most of them are just testing xgboost_ray right now.
  • Fix the checkpoint_at_end vs. checkpoint_frequency overlap logic (see the TODO-marked test case in test_xgboost_trainer) after switching to the simplified XGBoost trainer.
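The overlap case behind the second TODO can be illustrated with a hypothetical rule (this is not the PR's actual logic, just a sketch of the problem): when the final iteration already triggered a frequency-based checkpoint, an additional checkpoint_at_end save would be a duplicate.

```python
def end_checkpoint_is_duplicate(total_iters: int, frequency: int) -> bool:
    # Hypothetical overlap rule: the end-of-training checkpoint duplicates a
    # frequency-based one when the last iteration falls on the frequency grid.
    return frequency > 0 and total_iters % frequency == 0


print(end_checkpoint_is_duplicate(10, 5))  # True: iteration 10 already checkpointed
print(end_checkpoint_is_duplicate(10, 3))  # False: last frequency checkpoint was at 9
```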

Related issue number

Closes #41608

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@justinvyu justinvyu changed the title [train] Fix XGBoostTrainer and LightGBMTrainer checkpointing [train] Simplify ray.train.xgboost/lightgbm (1/n): Align frequency-based and checkpoint_at_end checkpoint formats Feb 13, 2024
@@ -23,12 +23,17 @@ def from_model(
booster: lightgbm.Booster,
*,
preprocessor: Optional["Preprocessor"] = None,
path: Optional[str] = None,
Contributor

Do we still need these changes if we're centralizing on the Callbacks?

Contributor Author

Nope, I can get rid of it. If anybody does use this, though, specifying your own temp dir might be useful if you want it cleaned up afterward.

from ray.train.lightgbm import RayTrainReportCallback

# Get a `Checkpoint` object that is saved by the callback during training.
result = trainer.fit()
checkpoint = result.checkpoint
Contributor

nit: For consistency with this, should we update the training example to use the LightGBMTrainer? Same for xgboost.

Contributor Author

I want to add the *Trainer examples once I add in a v2 xgboost/lightgbm trainer, since then it'll actually show the callback usage in the training func. Right now the user doesn't need to create the callback themselves.

python/ray/train/xgboost/xgboost_trainer.py (resolved)
python/ray/train/xgboost/_xgboost_utils.py (outdated, resolved)
python/ray/tune/integration/lightgbm.py (outdated, resolved)
@@ -38,6 +38,7 @@ class RayTrainReportCallback:
independent xgboost trials (without data parallelism within a trial).

.. testcode::
:skipif: True
Member

Are we going to add them back later?

Contributor Author

This used to be a code-block that didn't run 😅 I just wanted to show a mock xgboost.train call with the callback inside, without needing to specify the dataset and everything.

doc/source/tune/examples/tune-xgboost.ipynb (resolved)


@PublicAPI(stability="beta")
class RayTrainReportCallback:
Contributor Author

TuneCallback for lgbm was originally an empty class that wasn't referenced anywhere else, so I just removed it.

Successfully merging this pull request may close these issues.

[Train] XGBoost continue train (resume_from_checkpoint) and get_model failed
3 participants