[train] refactor callback logdir and results preprocessors #21468

matthewdeng · 2022-01-07T20:38:18Z

Why are these changes needed?

These changes are made in an effort to make Train Callbacks more structured, legible, and extensible.

Implementation

This PR introduces 2 patterns:

Results preprocessing.
Log directory creation.

Results preprocessing

Previously, each Callback implemented its own version of results preprocessing. However, there was no clear pattern for sharing preprocessing across different Callbacks, or even modifying an existing Callback to support different preprocessing.

This PR introduces a more generalized ResultsPreprocessor abstraction so that custom preprocessing can be added to each Callback. At a high level, this breaks the tight coupling between preprocessing logic and the actual application logic owned by the Callback.

TrainingCallback now has a process_results method which is called by the Trainer. This will call _results_preprocessor.preprocess prior to handle_results.
Each Callback (instance) can define its own _results_preprocessor : ResultsPreprocessor in addition to handle_results.
The ResultsPreprocessor abstract class has a single preprocess method that must be defined.
A SequentialResultsPreprocessor is added to help chain together preprocessors.
IndexedResultsPreprocessor and KeyResultsPreprocessor are example preprocessors added to support the existing Callback functionality.
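
The flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code added in this PR: the class names follow the description, but the constructor arguments, method bodies, and the CollectingCallback class are assumptions made for the example.

```python
import abc
from typing import Dict, Iterable, List, Optional


class ResultsPreprocessor(abc.ABC):
    """Abstract preprocessor with a single method to implement."""

    @abc.abstractmethod
    def preprocess(self, results: List[Dict]) -> List[Dict]:
        ...


class IndexedResultsPreprocessor(ResultsPreprocessor):
    """Keeps only the results at the given worker indices (assumed semantics)."""

    def __init__(self, indices: Iterable[int]):
        self._indices = list(indices)

    def preprocess(self, results: List[Dict]) -> List[Dict]:
        return [results[i] for i in self._indices]


class KeyResultsPreprocessor(ResultsPreprocessor):
    """Keeps only the given keys in each result (assumed semantics)."""

    def __init__(self, included_keys: Iterable[str]):
        self._included_keys = set(included_keys)

    def preprocess(self, results: List[Dict]) -> List[Dict]:
        return [
            {k: v for k, v in result.items() if k in self._included_keys}
            for result in results
        ]


class SequentialResultsPreprocessor(ResultsPreprocessor):
    """Chains preprocessors, feeding each one the output of the previous."""

    def __init__(self, preprocessors: Iterable[ResultsPreprocessor]):
        self._preprocessors = list(preprocessors)

    def preprocess(self, results: List[Dict]) -> List[Dict]:
        for preprocessor in self._preprocessors:
            results = preprocessor.preprocess(results)
        return results


class TrainingCallback:
    # Each callback instance may set its own preprocessor.
    _results_preprocessor: Optional[ResultsPreprocessor] = None

    def process_results(self, results: List[Dict]) -> None:
        # Called by the Trainer: preprocess first, then hand off.
        if self._results_preprocessor is not None:
            results = self._results_preprocessor.preprocess(results)
        self.handle_results(results)

    def handle_results(self, results: List[Dict]) -> None:
        pass


class CollectingCallback(TrainingCallback):
    """Hypothetical callback that stores whatever it is handed."""

    def __init__(self):
        self._results_preprocessor = SequentialResultsPreprocessor(
            [IndexedResultsPreprocessor([0]), KeyResultsPreprocessor(["loss"])]
        )
        self.seen: List[List[Dict]] = []

    def handle_results(self, results: List[Dict]) -> None:
        self.seen.append(results)
```

With this structure, handle_results only sees data that has already been filtered, so the same callback logic can be reused with a different preprocessor.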

Log directory creation

In previous iterations, log directory creation was done through a Mixin, and then through a subclass of TrainingCallback. However, this ended up being confusing in terms of the user API, the method resolution order, and the attributes each approach required.

As an alternative, in this PR logdir creation is separated into a TrainCallbackLogdirManager.

The typical pattern is as follows:

TrainCallbackLogdirManager owns general logic for logdir validation and creation.
TrainCallbackLogdirManager is initialized at the start of the Callback, and the user can define a logdir path.
TrainCallbackLogdirManager.setup_logdir should be called in TrainCallback.start_training, which passes in the Trainer's rundir which is used as the default directory.
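
A rough sketch of this pattern is shown below. It is an assumption-laden illustration and not the implementation in this PR: the create_logdir flag, the logdir_path property, the error messages, and the FileLoggingCallback class are all hypothetical.

```python
from pathlib import Path
from typing import Optional, Union


class TrainCallbackLogdirManager:
    """Owns logdir validation and creation for a callback (sketch)."""

    def __init__(
        self,
        logdir: Optional[Union[str, Path]] = None,
        create_logdir: bool = True,
    ):
        # User-defined path, if any; the default arrives later.
        self._logdir = Path(logdir) if logdir is not None else None
        self._default_logdir: Optional[Path] = None
        self._create_logdir = create_logdir

    def setup_logdir(self, default_logdir: Union[str, Path]) -> Path:
        """Records the default directory, then creates and validates the logdir."""
        self._default_logdir = Path(default_logdir)
        if self._create_logdir:
            self.logdir_path.mkdir(parents=True, exist_ok=True)
        if not self.logdir_path.is_dir():
            raise ValueError(f"logdir '{self.logdir_path}' must be a directory.")
        return self.logdir_path

    @property
    def logdir_path(self) -> Path:
        # The user-defined path takes priority over the default.
        if self._logdir is not None:
            return self._logdir
        if self._default_logdir is not None:
            return self._default_logdir
        raise RuntimeError("No logdir set; call setup_logdir first.")


class FileLoggingCallback:
    """Hypothetical callback showing where the manager is used."""

    def __init__(self, logdir: Optional[Union[str, Path]] = None):
        self._logdir_manager = TrainCallbackLogdirManager(logdir=logdir)

    def start_training(self, logdir: str, **info) -> None:
        # `logdir` here stands for the Trainer's run directory.
        self._logdir_manager.setup_logdir(default_logdir=logdir)
```

The manager is constructed in the callback's __init__ and finalized in start_training, which mirrors the two points at which a logdir can be supplied.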

Public Callback API

In this PR, all of the public callback APIs remain the same: each callback still exposes only the flattened args that are directly related to it. These args are internally converted into a ResultsPreprocessor and a TrainCallbackLogdirManager.

Related issue number

Related to: #21066.
Original PR: #21367.

Checks

TODO: Write tests.

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Yard1

Looks good, just some nits. Do you think it would make sense to extend the concept of ResultPreprocessor to start_training as well? That would cover the logdir stuff. Not sure if the effort would be justifiable, though

python/ray/train/callbacks/logging.py

python/ray/train/callbacks/results_prepocessors/keys.py

Co-authored-by: Antoni Baum <[email protected]>

Yard1

LGTM

amogkam

Can we also document the preprocessors in the user guide under the Custom Callbacks section?

python/ray/train/callbacks/callback.py

amogkam · 2022-01-11T00:26:07Z

python/ray/train/callbacks/results_prepocessors/preprocessor.py

+        return results
+
+
+class SequentialResultsPreprocessor():


Hmm this might be an unnecessary abstraction. Can't the result_preprocessors just be a list (or any iterable)?

Technically yes, but I found this to be cleaner and more extensible. Preprocessing is captured as a single preprocess step which can be customized as needed.

Isn't the main purpose of SequentialPreprocessor to contain multiple preprocessors and enforce an ordering? These properties are already captured by any ordered iterable.

The main reasoning for using a list/iterable instead of a wrapper class is for usability. IMO, as a user, if I want to pass in multiple preprocessors to my callback, it would be much easier to just pass in a list instead of wrapping in a SequentialResultsPreprocessor.

I think the only change that would have to be made to support a list would be the process_results method in the base TrainingCallback.

But I'm curious what you had in mind for the extensibility use case?

Yes that is correct!

The difference is whether we want sequential preprocessing to be a first-class idea in process_results API. In my mind it is a secondary utility.

For extensibility:

If we have system level preprocessing (e.g. to support required preprocessing), I think it is cleaner to have the structure as SequentialResultsPreprocessor([SystemPreprocessor, UserPreprocessor]) vs. concatenating 2 flattened lists.

I haven't thought this one through yet, but if we want to split preprocessing into different paths and then merge the results together, the Iterable API may not make sense.

Ok I'm fine with keeping this as an internal concept, but would like to not expose this to users.

What do you think about still allowing users to pass in an iterable as the result preprocessors? I think that would just be adding these 2 lines to process_results

if isinstance(results_preprocessors, Iterable):
    results_preprocessors = SequentialResultsPreprocessor(results_preprocessors)
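
To make the suggestion concrete, here is a small self-contained sketch of that normalization step. The helper name normalize_preprocessors and the ScaleLoss and OffsetLoss classes are hypothetical and exist only to make the ordering observable; the sketch also assumes that a single preprocessor is never itself iterable, which the isinstance check relies on.

```python
from collections.abc import Iterable
from typing import Dict, List


class ResultsPreprocessor:
    """Base class; the default implementation returns results unchanged."""

    def preprocess(self, results: List[Dict]) -> List[Dict]:
        return results


class SequentialResultsPreprocessor(ResultsPreprocessor):
    def __init__(self, preprocessors):
        self._preprocessors = list(preprocessors)

    def preprocess(self, results: List[Dict]) -> List[Dict]:
        # Apply each preprocessor to the output of the previous one.
        for preprocessor in self._preprocessors:
            results = preprocessor.preprocess(results)
        return results


class ScaleLoss(ResultsPreprocessor):
    """Hypothetical: multiplies each result's "loss" by a factor."""

    def __init__(self, factor):
        self._factor = factor

    def preprocess(self, results: List[Dict]) -> List[Dict]:
        return [{**r, "loss": r["loss"] * self._factor} for r in results]


class OffsetLoss(ResultsPreprocessor):
    """Hypothetical: adds an offset to each result's "loss"."""

    def __init__(self, offset):
        self._offset = offset

    def preprocess(self, results: List[Dict]) -> List[Dict]:
        return [{**r, "loss": r["loss"] + self._offset} for r in results]


def normalize_preprocessors(results_preprocessors):
    """Wraps an iterable of preprocessors; passes a single one through."""
    if isinstance(results_preprocessors, Iterable):
        return SequentialResultsPreprocessor(results_preprocessors)
    return results_preprocessors
```

Passing [ScaleLoss(2), OffsetLoss(1)] scales first and offsets second, so ordering follows the order of the iterable.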

Personally I still don't think having the user add SequentialResultsPreprocessor(...) is a significant overhead for the explicit indicator that they will be processed sequentially, and don't think it needs to be privatized.

There are a few things I could nitpick about using the Iterable, but I'm wondering if we could postpone this until a clear use-case arises? The snippet you shared could easily be added if there's a clear indicator that this improves usability, whereas removal is usually less preferred.

Commented this below, but to me the cleanest solution would be that training callbacks accept an optional list of result preprocessors in __init__. If this is provided, it overrides the default preprocessors that are set. Internally we can just convert this into a SequentialResultsPreprocessor. All callbacks should always be calling super anyways.

The benefit here is that this behavior is clearly documented in the docstring, and passing in args via a parameter is cleaner than having to set an attribute (which would be more difficult to maintain backwards compatibility for).

My concern is that it requires the user to define the entire ResultsPreprocessor (single or multiple) in the constructor vs. flattened args that are more understandable to the user (e.g. workers_to_log).

This goes back to the question of whether ResultsPreprocessors should be part of the public API or developer API.

We could accept both args and have one override the other (i.e. pythonic way of method overloading), but I agree if we want to have this be a developer API then it's fine to leave as is.
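
For illustration, accepting both kinds of arguments with one overriding the other might look like the sketch below. This is not what the PR implements: LoggingCallback, its workers_to_log default, and the override rule (an explicit list replaces the default derived from the flattened arg) are assumptions chosen for the example.

```python
from typing import Dict, List, Optional


class ResultsPreprocessor:
    def preprocess(self, results: List[Dict]) -> List[Dict]:
        return results


class IndexedResultsPreprocessor(ResultsPreprocessor):
    def __init__(self, indices: List[int]):
        self._indices = indices

    def preprocess(self, results: List[Dict]) -> List[Dict]:
        return [results[i] for i in self._indices]


class SequentialResultsPreprocessor(ResultsPreprocessor):
    def __init__(self, preprocessors):
        self._preprocessors = list(preprocessors)

    def preprocess(self, results: List[Dict]) -> List[Dict]:
        for preprocessor in self._preprocessors:
            results = preprocessor.preprocess(results)
        return results


class LoggingCallback:
    """Hypothetical callback accepting a flattened arg and an explicit list."""

    def __init__(
        self,
        workers_to_log: int = 0,
        results_preprocessors: Optional[List[ResultsPreprocessor]] = None,
    ):
        if results_preprocessors is None:
            # Default derived from the flattened, user-friendly argument.
            results_preprocessors = [IndexedResultsPreprocessor([workers_to_log])]
        # An explicit list, even an empty one, overrides the default.
        self._results_preprocessor = SequentialResultsPreprocessor(
            results_preprocessors
        )
        self.logged: List[List[Dict]] = []

    def process_results(self, results: List[Dict]) -> None:
        self.logged.append(self._results_preprocessor.preprocess(results))
```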

amogkam · 2022-01-11T19:06:27Z

python/ray/train/callbacks/logging.py

-            logdir_path = Path(self._logdir)
-        else:
-            logdir_path = Path(logdir)
+class TrainCallbackLogdirManager():


Looking at this again, I think this manager class may be unnecessary. Can't this whole thing just be a utility function?

def get_logdir_path(logdir: str, create_logdir_if_not_exists: bool) -> Path: ...

Then in start_training in the callbacks, you would just call this function self._logdir = get_logdir_path(...), and make sure to pass in the correct logdir.
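
A sketch of this alternative, under stated assumptions: the signature of get_logdir_path is taken from the comment above, while its body, the error message, and the ExampleCallback class are hypothetical.

```python
from pathlib import Path
from typing import Optional


def get_logdir_path(logdir: str, create_logdir_if_not_exists: bool) -> Path:
    """Returns logdir as a Path, optionally creating it, and validates it."""
    logdir_path = Path(logdir)
    if create_logdir_if_not_exists:
        logdir_path.mkdir(parents=True, exist_ok=True)
    if not logdir_path.is_dir():
        raise ValueError(f"logdir '{logdir_path}' must be an existing directory.")
    return logdir_path


class ExampleCallback:
    """Hypothetical callback that resolves the logdir itself."""

    def __init__(self, logdir: Optional[str] = None):
        self._user_logdir = logdir
        self._logdir: Optional[Path] = None

    def start_training(self, logdir: str, **info) -> None:
        # The caller picks between the user-defined and default directory.
        self._logdir = get_logdir_path(self._user_logdir or logdir, True)
```

The trade-off discussed in this thread is visible here: the helper stays small, but every callback has to repeat the `self._user_logdir or logdir` choice.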

I thought about this but this requires each caller to essentially define logdir = logdir or default_logdir themselves, in which case this method just contains logic for checking if the logdir exists.

The class structure follows the actual call pattern more closely (__init__ vs. start_training).

Right, but that logic is simple enough that I think it's fine for the caller to do it themselves.

In my opinion, I think that the abstraction of passing in 2 logdirs to a manager is less intuitive than just figuring out the correct logdir in the callback itself, especially since it can literally just be 1 line.

Yeah that's fair, but in that case is there any point in providing any utility method at all?

I will probably need to think about this some more. By changing it from a Mixin/TrainingCallback to this new class, we did lose some of the structure that glued it all together. Not so sure what the right abstraction is anymore...

I think in the long run we may want to end up with something like what @Yard1 suggested:

Do you think it would make sense to extend the concept of ResultPreprocessor to start_training as well? That would cover the logdir stuff. Not sure if the effort would be justifiable, though

While this could fit the pattern quite nicely, the desired interface would be a lot more clear when a second use-case comes around.

python/ray/train/tests/test_callbacks.py

amogkam · 2022-01-18T18:39:28Z

doc/source/train/user_guide.rst

@@ -497,6 +497,35 @@ A simple example for creating a callback that will print out results:
    trainer.shutdown()


+Results Preprocessors


Discussed offline, but I think the benefit of preprocessors here is that it allows you to easily customize existing callbacks, rather than for use when developing custom callbacks.

Instead of the example below, I would change this to

callback = PrintingCallback()
callback.results_preprocessor = IndexedResultsPreprocessor(0)

And I would also move this subsection to the "Built-in callbacks" (and rename to "Advanced: Customizing Built-in Callbacks") instead of the "Custom Callbacks" Section.

Then in the "Custom Callbacks" section I would add a note linking to this section if you want to customize existing callbacks.

Actually looking at the example above, in my opinion, it would be cleaner if callbacks would accept an optional list of result preprocessors in the __init__, rather than having to set an attribute.

If we want to make this a developer API, then I think we can actually not include this information in the user guide, right? (we can still leave it here, but comment it out).

amogkam · 2022-01-18T18:47:55Z

python/ray/train/callbacks/results_prepocessors/preprocessor.py

amogkam

LGTM if we want to leave this as a developer API! I agree, we can always upgrade this to a public API once a use case arises.

What do you think about these remaining TODOs?:

Can you annotate the preprocessors with @DeveloperAPI?
If this is a developer API, then I think we can comment out the recommended usage from the user guide

amogkam · 2022-01-19T02:31:03Z

doc/source/train/user_guide.rst

matthewdeng added 6 commits January 3, 2022 18:41

[train] Add TorchTensorboardProfilerCallback and introduce ResultsPreprocessors

8f63ab9

Merge branch 'master' of github.com:ray-project/ray into torch_tb_profiler

ef3f9ea

simplify profiler

16201a7

read on get_and_clear_profile_traces

883c3d1

refactor callbacks

f2aa677

remove var

275eadb

matthewdeng requested review from amogkam and Yard1 January 7, 2022 20:38

Yard1 reviewed Jan 7, 2022

matthewdeng and others added 3 commits January 10, 2022 11:22

Update python/ray/train/callbacks/logging.py

cad9124

Co-authored-by: Antoni Baum <[email protected]>

Update python/ray/train/callbacks/results_prepocessors/keys.py

f521299

Co-authored-by: Antoni Baum <[email protected]>

address comments; add tests

fab48ba

amogkam self-assigned this Jan 11, 2022

matthewdeng added 2 commits January 10, 2022 16:19

fix test

c173e77

Merge branch 'master' of github.com:ray-project/ray into callback-refactor

d1da4d2

Yard1 approved these changes Jan 11, 2022

amogkam requested changes Jan 11, 2022

matthewdeng added 2 commits January 11, 2022 17:46

address comments

8036dd1

docs

33ea4cc

amogkam reviewed Jan 18, 2022

amogkam approved these changes Jan 19, 2022

matthewdeng added 3 commits January 20, 2022 17:14

address comments'

e4fd1fb

Merge branch 'master' of github.com:ray-project/ray into callback-refactor

618b4c8

fix test

747c5ab

matthewdeng requested a review from amogkam January 21, 2022 16:50

amogkam merged commit 8119b62 into ray-project:master Jan 22, 2022
