Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[train] refactor callback logdir and results preprocessors #21468

Merged
merged 16 commits into from
Jan 22, 2022

Conversation

matthewdeng
Copy link
Contributor

Why are these changes needed?

These changes are made in an effort to make Train Callbacks more structured, legible, and extensible.

Implementation

This PR introduces 2 patterns:

  1. Results preprocessing.
  2. Log directory creation.

Results preprocessing

Previously, each Callback implemented its own version of results preprocessing. However, there was no clear pattern for sharing preprocessing across different Callbacks, or even modifying an existing Callback to support different preprocessing.

This PR introduces a more generalized ResultsPreprocessor abstraction so that custom preprocessing done can be added for each Callback. At a high level, this breaks the tight coupling between preprocessing logic and the actual application logic owned by the Callback.

  1. TrainingCallback now has a process_results method which is called by the Trainer. This will call _results_preprocessor.preprocess prior to handle_results.
  2. Each Callback (instance) can define its own _results_preprocessor : ResultsPreprocessor in addition to handle_results.
  3. The ResultsPreprocessor abstract class has a single preprocess that must defined.
  4. A SequentialResultsPreprocessor is added to help chain together preprocessors.
  5. IndexedResultsPreprocessor and KeyResultsPreprocessor are example preprocessors added to support the existing Callback functionality.

Log directory creation

In previous iterations, log directory creation was done through a Mixin, and then through a subclass of TrainingCallback. However, this ended up being confusing from the user API, method resolution order, and requirement of attributes.

As an alternative, in this PR logdir creation is separated into a TrainCallbackLogdirManager.

The typical pattern is as follows:

  1. TrainCallbackLogdirManager owns general logic for logdir validation and creation.
  2. TrainCallbackLogdirManager is initialized at the start of the Callback, and the user can define a logdir path.
  3. TrainCallbackLogdirManager.setup_logdir should be called in TrainCallback.start_training, which passes in the Trainer's rundir which is used as the default directory.

Public Callback API

In this PR, all of the public callback APIs remain the same, with only the flatten args that are directly related to the callbacks. These args are internally converted into a ResultsPreprocessor and TrainCallbackLogdirManager.

Related issue number

Related to: #21066.
Original PR: #21367.

Checks

TODO: Write tests.

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Member

@Yard1 Yard1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just some nits. Do you think it would make sense to extend the concept of ResultPreprocessor to start_training as well? That would cover the logdir stuff. Not sure if the effort would be justifiable, though

python/ray/train/callbacks/logging.py Outdated Show resolved Hide resolved
python/ray/train/callbacks/logging.py Outdated Show resolved Hide resolved
python/ray/train/callbacks/logging.py Outdated Show resolved Hide resolved
python/ray/train/callbacks/results_prepocessors/keys.py Outdated Show resolved Hide resolved
python/ray/train/callbacks/results_prepocessors/keys.py Outdated Show resolved Hide resolved
@amogkam amogkam self-assigned this Jan 11, 2022
Copy link
Member

@Yard1 Yard1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also document the preprocessors in the user guide under the Custom Callbacks section?

python/ray/train/callbacks/callback.py Outdated Show resolved Hide resolved
return results


class SequentialResultsPreprocessor():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this might be an unnecessary abstraction. Can't the result_preprocessors just be a list (or any iterable)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically yes, but I found this to be cleaner and more extensible. Preprocessing is captured as a single preprocess step which can be customized as needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the main purpose of SequentialPreprocessor to contain multiple preprocessors and enforce an ordering? These properties are already captured by any ordered iterable.

The main reasoning for using a list/iterable instead of a wrapper class is for usability. IMO, as a user, if I want to pass in multiple preprocessors to my callback, it would be much easier to just pass in a list instead of wrapping in a SequentialResultsPreprocessor.

I think the only change that would have to be made to support a list would be the process_results method in the base TrainingCallback.

But I'm curious what you had in mind for the extensibility use case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is correct!

The difference is whether we want sequential preprocessing to be a first-class idea in process_results API. In my mind it is a secondary utility.

For extensibility:

  1. If we have system level preprocessing (e.g. to support required preprocessing), I think it is cleaner to have the structure as SequentialResultsPreprocessor([SystemPreprocessor, UserPreprocessor]) vs. concatenating 2 flattened lists.
  2. I haven't thought this one through yet, but if we want to split preprocessing into different paths and then merge the results together, the Iterable API may not make sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I'm fine with keeping this as an internal concept, but would like to not expose this to users.

What do you think about still allowing users to pass in an iterable as the result preprocessors? I think that would just be adding these 2 lines to process_results

if isinstance(results_preprocessors, Iterable):
    results_preprocessors = SequentialResultsPreprocessor(results_preprocessors)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I still don't think having the user to add SequentialResultsPreprocessor(...) is a significant overhead for the explicit indicator that they will be processed sequentially, and don't think it needs to be privatized.

There are a few of things I could nitpick about using the Iterable, but I'm wondering if we could postpone this until a clear use-case arises? The snippet you shared could easily be added if there's a clear indicator that this improves usability, whereas removal is usually less preferred.

Copy link
Contributor

@amogkam amogkam Jan 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented this below, but to me the cleanest solution would be that training callbacks accept an optional list of result preprocessors in __init__. If this is provided, it overrides the default preprocessors that are set. Internally we can just convert this into a SequentialResultsPreprocessor. All callbacks should always be calling super anyways.

The benefit here is that this behavior is clearly documented in the docstring, and passing in args via a parameter is cleaner than having to set an attribute (which would be more difficult to maintain backwards compatibility for).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that it requires the user to define the entire ResultsPreprocessor (single or multiple) in the constructor vs. flattened args that are more understandable to the user (e.g. workers_to_log).

This goes back to the question of whether ResultsPreprocessors should be part of the public API or developer API.

Copy link
Contributor

@amogkam amogkam Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could accept both args and have one override the other (i.e. pythonic way of method overloading), but I agree if we want to have this be a developer API then it's fine to leave as is.

logdir_path = Path(self._logdir)
else:
logdir_path = Path(logdir)
class TrainCallbackLogdirManager():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this again, I think this manager class may be unnecessary. Can't this whole thing just be a utility function?

def get_logdir_path(logdir: str, create_logdir_if_not_exists: bool) -> Path:
    ...

Then in start_training in the callbacks, you would just call this function self._logdir = get_logdir_path(...), and make sure to pass in the correct logdir.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this but this requires each caller to essentially define logdir = logdir or default_logdir themselves, in which case this method just contains logic for checking if the logdir exists.

The class structure follows the actual call pattern more closely (__init__ vs. start_training).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but that logic is simple enough that I think it's fine for the caller to do it themselves.

In my opinion, I think that the abstraction of passing in 2 logdirs to a manager is less intuitive than just figuring out the correct logdir in the callback itself, especially since it can literally just be 1 line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's fair, but in that case is there any point in providing any utility method at all?

I will probably need to think about this some more. By changing it from a Mixin/TrainingCallback to this new class, we did lose some of structure that glued it all together. Not so sure what the right abstraction is anymore...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the long run we may want to end up with something like what @Yard1 suggested:

Do you think it would make sense to extend the concept of ResultPreprocessor to start_training as well? That would cover the logdir stuff. Not sure if the effort would be justifiable, though

While this could fit the pattern quite nicely, the desired interface would be a lot more clear when a second use-case comes around.

python/ray/train/tests/test_callbacks.py Outdated Show resolved Hide resolved
@@ -497,6 +497,35 @@ A simple example for creating a callback that will print out results:
trainer.shutdown()


Results Preprocessors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline, but I think the benefit of preprocessors here is that it allows you to easily customize existing callbacks, rather than for use when developing custom callbacks.

Instead of the example below, I would change this to

callback = PrintingCallback()
callback.results_preprocessor = IndexedResultsPreprocessor(0)

And I would also move this subsection to the "Built-in callbacks" (and rename to "Advanced: Customizing Built-in Callbacks") instead of the "Custom Callbacks" Section.

Then in the "Custom Callbacks" section I would add a note linking to this section if you want to customize existing callbacks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually looking at the example above, in my opinion, it would be cleaner if callbacks would accept an optional list of result preprocessors in the __init__, rather than having to set an attribute.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to make this a developer API, then I think we can actually not include this information in the user guide, right? (we can still leave it here, but comment it out).

return results


class SequentialResultsPreprocessor():
Copy link
Contributor

@amogkam amogkam Jan 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented this below, but to me the cleanest solution would be that training callbacks accept an optional list of result preprocessors in __init__. If this is provided, it overrides the default preprocessors that are set. Internally we can just convert this into a SequentialResultsPreprocessor. All callbacks should always be calling super anyways.

The benefit here is that this behavior is clearly documented in the docstring, and passing in args via a parameter is cleaner than having to set an attribute (which would be more difficult to maintain backwards compatibility for).

Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if we want to leave this as a developer API! I agree, we can always upgrade this to a public API once a use case arises.

What do you think about these remaining TODOs?:

  • Can you annotate the preprocessors with @DeveloperAPI?
  • If this is a developer API, then I think we can comment out the recommended usage from the user guide

@@ -497,6 +497,35 @@ A simple example for creating a callback that will print out results:
trainer.shutdown()


Results Preprocessors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to make this a developer API, then I think we can actually not include this information in the user guide, right? (we can still leave it here, but comment it out).

@amogkam amogkam merged commit 8119b62 into ray-project:master Jan 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants