feat: remove steps from pytorch callbacks [DET-3526] #831
Conversation
    configure a callback implementation to execute on a subset of GPUs, please condition
    your implementation on ``trial.context.distributed.get_rank()``.
    """

-   def on_train_step_start(self, step_id: int) -> None:
+   def on_batch_start(self, batch_idx: int) -> None:
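For context, here is a minimal sketch (not from this PR) of the rank-conditioning pattern the docstring describes; the class name, constructor wiring, and the assumption that the callback is handed the trial's context object are all illustrative:

from determined.pytorch import PyTorchCallback

class RankZeroLogger(PyTorchCallback):
    """Hypothetical callback that only acts on the chief GPU (rank 0)."""

    def __init__(self, context) -> None:
        self.context = context

    def on_batch_start(self, batch_idx: int) -> None:
        # Do nothing on every GPU except rank 0, per the docstring's guidance.
        if self.context.distributed.get_rank() != 0:
            return
        print(f"rank 0: starting batch {batch_idx}")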
This is really different behavior than what we had before, and it doesn't seem like it would be terribly useful to me, since we already call train_batch on every batch.
I'm not sure what use cases we were trying to solve before though? Are those use cases still valid?
Git blame says @yoavz wrote these hooks, maybe @shiyuann or @aaron276h know the answer though?
And if those use cases are still valid, how would we address them after removing steps from the UX?
This was done so that users can make adjustments to the optimizer and model before training. Don't remember the exact use cases. I do think it makes sense to leave this callback in.
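One plausible use of on_batch_start along those lines is a per-batch learning-rate warmup. The sketch below is illustrative only; the optimizer handle and warmup length are assumptions, not from this PR:

from determined.pytorch import PyTorchCallback

class WarmupCallback(PyTorchCallback):
    """Hypothetical callback that linearly warms up the learning rate."""

    def __init__(self, optimizer, base_lr: float = 1e-3, warmup_batches: int = 100) -> None:
        self.optimizer = optimizer
        self.base_lr = base_lr
        self.warmup_batches = warmup_batches

    def on_batch_start(self, batch_idx: int) -> None:
        # Ramp the learning rate from 0 to base_lr over the first warmup_batches.
        if batch_idx < self.warmup_batches:
            scale = (batch_idx + 1) / self.warmup_batches
            for group in self.optimizer.param_groups:
                group["lr"] = self.base_lr * scale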
I chose these changes because they're exactly what's in the ERD; I assumed they were discussed/decided already. I understand on_batch_start/on_batch_end don't seem very useful, though. I think most of the context on the original decision is somewhere in #ml-ag Slack.
+   def on_epoch_end(self, epoch_idx: int, metrics: Dict[str, Any]) -> None:
+       """
+       Run after every epoch ends.
blocking: We should be more descriptive here about the timing. Often epoch_end is considered to mean the end of training and evaluating on a full dataset, which is not the case for us. It might even be worth renaming this callback to on_train_epoch_end().
Hadn't thought that would be the expected behavior. Thanks.
""" | ||
Run after every training step ends. | ||
Run after every batch is trained. |
blocking: you need to add a warning about metrics here. Additionally, we need to decide if we want to average metrics here every batch if optimizations.average_training_metrics is enabled and if on_batch_end is used.
I feel like for on_batch_end we shouldn't, but for on_train_epoch_end we should?
I agree with that, the tricky part here is we may not always have the metrics for the entire epoch (if the training was resumed mid-epoch). I would propose that we just don't provide averaged metrics in the training callbacks for now, and if/when the need for them arises, we can decide on the proper mechanism to do so.
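For a sense of the cost being weighed here: averaging a training metric across GPUs on every batch adds a synchronization step per batch. A rough hand-rolled sketch with torch.distributed (an assumption about the mechanism, not Determined's actual implementation):

import torch
import torch.distributed as dist

def average_metric(value: float) -> float:
    # Assumes the default process group is already initialized.
    # The all_reduce is a per-call synchronization point across all workers.
    t = torch.tensor(value, dtype=torch.float32)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return (t / dist.get_world_size()).item()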
""" | ||
pass | ||
|
||
def on_epoch_end(self, epoch_idx: int, metrics: Dict[str, Any]) -> None: |
blocking: need to be more descriptive about what these metrics are.
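Something along these lines would address that; the docstring wording is an illustrative assumption (following the proposal above that training callbacks get raw, unaveraged metrics), not text from the PR:

from typing import Any, Dict

def on_epoch_end(self, epoch_idx: int, metrics: Dict[str, Any]) -> None:
    """
    Run after the training epoch with index epoch_idx ends.

    metrics contains the training metrics reported by train_batch during
    this epoch; they are raw per-batch values, not averaged across GPUs
    or over the epoch.
    """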
@shiyuann after our discussion, it seems like there are no concerns with breaking these interfaces. The only question is, do you want to try to break it in sync with your pytorch changes?
@stoksc This break in Pytorch callbacks should be orthogonal to my change.
@shiyuann Yeah, @aaron276h was just thinking maybe if we're making breaking changes, we should ship them together.
Doesn't really make sense to me to have
Looks even better the second time around
""" | ||
# TODO(DET-3267): deprecate this when releasing pytorch flexible primitives. | ||
pass | ||
|
||
def on_validation_step_start(self) -> None: |
non-blocking: might be worth adding a comment that this should be removed in the future
Looks good!
Description
As part of #remove-steps, we also need to remove the concept of steps from the PytorchCallbacks API.

Test Plan
Commentary (optional)
Currently, this looks to be the only breaking change of #remove-steps. Everything else has been done by allowing the new and old interfaces to coexist and deprecating the old ones. Open to suggestions on when/how to land this, or on leaving the old interfaces intact and marking them as deprecated for at least a while.
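For illustration, a sketch of the coexist-and-deprecate option mentioned above, where the old step-based hook stays on the base class but warns; the shim and warning text are assumptions (the PR itself removes the old hooks outright), and during a transition the harness would presumably invoke both hooks so existing overrides still fire:

import warnings

class PyTorchCallback:
    def on_batch_start(self, batch_idx: int) -> None:
        """New hook: runs before each batch is trained."""
        pass

    def on_train_step_start(self, step_id: int) -> None:
        """Old step-based hook, kept temporarily so existing trials keep running."""
        warnings.warn(
            "on_train_step_start is deprecated and will be removed; "
            "use on_batch_start instead.",
            FutureWarning,
        )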