
[air/tuner] Add checkpoint_frequency/checkpoint_at_end arguments to CheckpointConfig #26661

Merged (24 commits) on Jul 19, 2022

Conversation

krfricke (Contributor)

Why are these changes needed?

Includes/depends on #26656

This PR adds the checkpoint_frequency and checkpoint_at_end arguments to CheckpointConfig (a short usage sketch follows the list below):

  • Adds CheckpointConfig.checkpoint_frequency and checkpoint_at_end
  • Implements the argument for LightGBM and XGBoost
  • Adds tests for LightGBM, XGBoost, and RLTrainer
  • Raises an error if used with an incompatible Trainer (e.g. TorchTrainer)
  • Sets default value for checkpoint_at_end
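
For reference, a minimal usage sketch of the new arguments (assuming the Ray AIR API around the time of this PR; the dataset, parameters, and values are purely illustrative):

import ray
from ray.air.config import CheckpointConfig, RunConfig
from ray.train.xgboost import XGBoostTrainer

# Illustrative toy dataset with a binary "target" label column.
train_ds = ray.data.from_items([{"x": i, "target": i % 2} for i in range(100)])

trainer = XGBoostTrainer(
    label_column="target",
    params={"objective": "binary:logistic"},
    datasets={"train": train_ds},
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            checkpoint_frequency=2,  # save a checkpoint every 2 boosting rounds
            checkpoint_at_end=True,  # always keep the final model
        )
    ),
)
result = trainer.fit()
print(result.checkpoint)  # the latest saved checkpoint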

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 8 commits July 18, 2022 08:36
Kai Fricke added 8 commits July 18, 2022 17:26
_dmatrix_cls: type
_ray_params_cls: type
_tune_callback_cls: type
_tune_callback_report_cls: type
_tune_callback_checkpoint_cls: type
Member

Clearly being clueless here: why do we need to add these separately to all of the GBDT, LightGBM, and XGBoost trainers? Why don't we just save the CheckpointConfig and call save_checkpoint() from the base class?

Also, why doesn't RLTrainer need these changes?

Contributor Author (krfricke)

This is a specific pattern for some of our downstream libraries. XGBoost and LightGBM use callbacks in the framework library to save checkpoints and report to Tune (i.e., the calls to tune.checkpoint_dir() and tune.report(), which will be changed to session.report()).

Because LightGBM and XGBoost are so similar (LightGBM's API was modeled on XGBoost's), we have a GBDTTrainer that handles most of the common logic. We only have to deal with a few framework-specific details, namely the actual callbacks used for saving checkpoints and reporting results, and getting information from the library-native model and saving it to disk.
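
A rough, self-contained sketch of the delegation pattern described above (all class names are stand-ins for illustration, not the actual ray.train or xgboost_ray/lightgbm_ray classes):

class _XGBoostTuneCheckpointCallback:
    """Stand-in for a framework-native callback that saves checkpoints and reports to Tune."""

    def __init__(self, frequency: int = 1):
        self.frequency = frequency


class _LightGBMTuneCheckpointCallback(_XGBoostTuneCheckpointCallback):
    """Stand-in for the LightGBM equivalent of the callback above."""


class GBDTTrainerSketch:
    """Shared gradient-boosting logic; framework specifics come from class attributes."""

    _tune_callback_checkpoint_cls: type

    def __init__(self, checkpoint_frequency: int = 0):
        self.checkpoint_frequency = checkpoint_frequency

    def build_callbacks(self) -> list:
        # The base class decides *when* to checkpoint; the subclass decides *how*.
        if self.checkpoint_frequency > 0:
            return [self._tune_callback_checkpoint_cls(frequency=self.checkpoint_frequency)]
        return []


class XGBoostTrainerSketch(GBDTTrainerSketch):
    _tune_callback_checkpoint_cls = _XGBoostTuneCheckpointCallback


class LightGBMTrainerSketch(GBDTTrainerSketch):
    _tune_callback_checkpoint_cls = _LightGBMTuneCheckpointCallback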

# Conflicts:
#	python/ray/tune/tests/test_tuner.py
@krfricke marked this pull request as ready for review July 18, 2022 18:41
Comment on lines +197 to +207
if not any(
    isinstance(
        cb, (self._tune_callback_report_cls, self._tune_callback_checkpoint_cls)
    )
    for cb in config["callbacks"]
):
    # Only add our own callback if it hasn't been added before
    checkpoint_frequency = (
        self.run_config.checkpoint_config.checkpoint_frequency
    )
    if checkpoint_frequency > 0:
Contributor

can we ban this in the future?

Contributor Author (krfricke)

Users can always add their own callbacks (which could also be non-reporting, or derived from our reporting callbacks), so we should check whether those callbacks already exist. We do the same thing in Tune. But for better readability I can at least put this into a separate function.
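
For what it's worth, the extracted helper could look roughly like this (the function name is hypothetical, not the actual implementation):

def _contains_tune_callbacks(callbacks, report_cls, checkpoint_cls) -> bool:
    """Return True if a Tune report/checkpoint callback (or a subclass) is already present."""
    return any(isinstance(cb, (report_cls, checkpoint_cls)) for cb in callbacks)

# Usage corresponding to the snippet above:
# if not _contains_tune_callbacks(
#     config["callbacks"],
#     self._tune_callback_report_cls,
#     self._tune_callback_checkpoint_cls,
# ):
#     ...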

@richardliaw (Contributor) left a comment

looks good to me.

Kai Fricke added 4 commits July 19, 2022 00:16
# Conflicts:
#	python/ray/train/tests/test_torch_trainer.py
@gjoliver (Member) left a comment

a few quick questions.

@@ -193,6 +223,21 @@ def training_loop(self) -> None:
            **config,
        )

        checkpoint_at_end = self.run_config.checkpoint_config.checkpoint_at_end
        if checkpoint_at_end is None:
            checkpoint_at_end = True
Member

Providing a default in a consumer class seems strange.
Should we just give a True default in CheckpointConfig?

Contributor Author (krfricke)

In this case we can't. We have different default values for checkpoint_at_end depending on the trainable. If we set it to True by default, it will be incompatible with legacy trainables such as regular function trainables and raise an error downstream in tune.run. If we set it to False by default, we won't save checkpoints for most trainers and can't use their models in downstream processing.

I'm working on a FunctionTrainer today that will replace running legacy trainable functions. This will let us ignore the legacy function trainables here and default to True. Does that sound good?

Contributor Author (krfricke)

Note that another reason we can't do it at the moment is that if people specifically pass checkpoint_at_end=True for use with their function trainables, we don't want to silently set it to False. We could do this once we use a FunctionTrainer, though it's not ideal, as users may still pass True and wonder why they don't see any saved checkpoints.

Member

I see, appreciate the detailed explanation.
Navigating through a lot of legacy stuff is hard...

python/ray/tune/impl/tuner_internal.py (resolved)
python/ray/tune/impl/tuner_internal.py (outdated, resolved)
        checkpoint_at_end = False
    # If this is a user-defined trainable, just keep the value
    elif checkpoint_at_end is None:
        # Set default to False for function trainables and True for everything else
Member

Intuitively, why does the type of trainable have anything to do with whether we should checkpoint at the end?
Are these two orthogonal things?

@krfricke (Contributor Author), Jul 19, 2022

Function trainables and generic training-loop trainers like the TorchTrainer don't support driver-based checkpointing at the end of a run. This is because the user defines the training loop and thus decides themselves when to save checkpoints. This is in contrast to e.g. RLlib, where we can just call trainable.save.remote() anytime we want.
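
Roughly, the distinction being drawn here (a hedged sketch of the resolution logic, not the actual tuner_internal.py code):

def resolve_checkpoint_at_end(checkpoint_at_end, is_function_trainable: bool) -> bool:
    """Resolve the user's setting against what the trainable supports.

    Function trainables own their training loop, so the driver cannot save a
    checkpoint for them at the end of a run; class trainables (e.g. RLlib
    algorithms) expose a save method the driver can call at any time.
    """
    if is_function_trainable:
        if checkpoint_at_end:
            raise ValueError(
                "checkpoint_at_end=True is not supported for function trainables."
            )
        return False
    # For class trainables, default to saving a final checkpoint.
    return True if checkpoint_at_end is None else checkpoint_at_end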

Member

Right, I guess I am still not sure why we want to default checkpoint_at_end to True just because we are able to do it for class trainables.

@krfricke requested a review from gjoliver July 19, 2022 13:24
@gjoliver (Member) left a comment

Thanks for adding this.

@richardliaw merged commit 02dded1 into ray-project:master Jul 19, 2022
@krfricke deleted the air/checkpoint-freq branch July 19, 2022 17:23
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022