
[tune] Make the logging of the function API consistent and predictable #4011

Merged
merged 33 commits into ray-project:master on Mar 19, 2019

Conversation

gehring
Contributor

@gehring gehring commented Feb 11, 2019

What do these changes do?

This is a re-implementation of the FunctionRunner which enforces some synchronicity between the thread running the training function and the thread running the Trainable which logs results. The main purpose is to make logging consistent across APIs in anticipation of a new function API which will be generator based (through yield statements). Without these changes, it will be impossible for the (possibly soon to be) deprecated reporter based API to behave the same as the generator based API.

This new implementation provides additional guarantees to prevent results from being dropped. This makes the logging behavior more intuitive and consistent with how results are handled in custom subclasses of Trainable.

New guarantees for the tune function API (a short usage sketch of the reporter-based API follows this list):

  • Every reported result, i.e., every reporter(**kwargs) call, is forwarded to the appropriate loggers instead of being dropped when not enough time has elapsed since the last result.
  • The wrapped function only runs if the FunctionRunner expects a result, i.e., when FunctionRunner._train() has been called. This removes the possibility that a result will be generated by the function but never logged.
  • The wrapped function is not called until the first _train() call. Currently, the wrapped function is started during the setup phase which could result in dropped results if the trial is cancelled between _setup() and the first _train() call.
  • Exceptions raised by the wrapped function won't be propagated until all results are logged to prevent dropped results.
  • The thread running the wrapped function is explicitly stopped when the FunctionRunner is stopped with _stop().
  • If the wrapped function terminates without reporting done=True, a duplicate result with {"done": True} is reported to explicitly terminate the trial, but it will not be logged.
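
For reference, here is a minimal sketch of the reporter-based function API that these guarantees apply to (illustrative only; the metric names and loop are assumptions, not taken from this PR):

def train(config, reporter):
    for i in range(100):
        # ... one training step ...
        reporter(timesteps_total=i, mean_accuracy=0.9)  # every call is now logged
    # if done=True is never reported, the runner appends a final
    # {"done": True} result itself (see the last bullet above)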

Related issue number

#3956
#3949
#3834

@richardliaw richardliaw changed the title Make the logging of the function API consistent and predictable [tune] Make the logging of the function API consistent and predictable Feb 11, 2019
@gehring
Contributor Author

gehring commented Feb 11, 2019

@richardliaw Would you mind checking this out? I didn't want to go ahead with the generator based API until I could guarantee that both the old and new function runner would behave mostly the same.

This implementation seems to work as expected. Most tests pass at the moment, with the exception of the tests which expect the last result to be repeated when done=True is never set. I'll go through and fix the tests once you've given me your thoughts on the new logging behavior and runner implementation.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11754/
Test FAILed.

@richardliaw
Contributor

Sorry for the wait, I'll get to this today! (And feel free to ping if you don't hear)

@richardliaw
Contributor

jenkins retest this please

@richardliaw
Contributor

richardliaw commented Feb 13, 2019 via email

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11878/
Test FAILed.

@gehring
Contributor Author

gehring commented Feb 13, 2019

What do you think about having an option that toggles asynchronous vs synchronous behavior for the reporter?

This would be fairly straightforward to do, but I would highly recommend against it, as I don't see any advantages in doing so. Adding the option would add complexity to the API (and a source of confusion) while not offering any tangible benefits. In principle, most of the computation time should be spent within the training function, so, in the best case, I don't expect any significant performance gains to offset the added complexity (and source of bugs).

While the old version was asynchronous, it was not parallel (due to the GIL). Making the two threads fully asynchronous again would only provide performance benefits if logging involved especially long blocking calls.
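
As a rough illustration of that synchronous hand-off, a simplified sketch (assumed names, not the actual FunctionRunner code):

import threading
import queue

class SyncReporter(object):
    """Sketch: reporter() blocks until _train() asks for a result,
    so no reported result can ever be dropped."""

    def __init__(self):
        self._results = queue.Queue(maxsize=1)  # at most one in-flight result
        self._continue = threading.Event()      # set when the runner wants a result

    def __call__(self, **result):
        # called from the training thread
        self._continue.wait()   # block until _train() requests the next result
        self._continue.clear()
        self._results.put(result)

    def fetch_result(self):
        # called from the Trainable thread inside _train()
        self._continue.set()        # let the training thread produce one result
        return self._results.get()  # block until it arrives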

Given that this change is only meant to provide consistency between the Trainable API, the new generator function API, and the reporter function API, I'm not keen on allowing users to opt out of the predictable behavior, but I'm open to arguments for asynchronicity. That is even more true if the reporter function API is to be deprecated in favor of the generator function API and kept around only for backwards compatibility.

(As a related aside, any idea how long Python 2.7 will be supported by Ray? An async/await implementation of the function runner would have been very compact and understandable, but I couldn't use it because of the need for backwards compatibility.)

@gehring
Contributor Author

gehring commented Feb 13, 2019

jenkins retest this please

@richardliaw There are several tests that would require changing before they can pass, as they assume some odd corner-case behavior. The new implementation avoids these cases by always reporting done=True. Pretty much all failing tests seem to use this corner case as a way to avoid manually reporting done=True and should be easy to fix. Once I've addressed all your concerns with the FunctionWrapper, I'll go through and add the changes to the tests to this pull request.

Alternatively, if returning a simple result, {"done": True}, is not desirable, we could remove the requirement that the FunctionWrapper returns done. I've only kept it because there was a special block of code in the old version dedicated to preventing this from happening, going as far as repeating the last result to avoid it.

@richardliaw
Contributor

In principle, most of the computation time should be spent within the training function, so, in the best case, I don't expect any significant performance gains to offset the added complexity (and source of bugs).

That's a good point.

any idea how long Python 2.7 will be supported by Ray?

No clue... although I think the broader community would be really interested in this discussion (enough to warrant a post on the dev list I think!)

Alternatively, if returning a simple result, {"done": True}, is not desirable.

I think returning a simple result, {"done": True}, is fine. It would probably be good to add a comment or log something to tell the user that a done=True result is being provided and may affect the result output (specifically for users that have their own logging mechanisms).

We should also add a test that we don't report done=True twice, in case the user sets done=True themselves.
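
To make the two cases concrete, a small sketch of the user-side behavior being discussed (illustrative only; the metric names are assumptions):

def explicit_done(config, reporter):
    for i in range(10):
        # the user sets done=True themselves on the last step; the runner
        # should not report a second done=True result afterwards
        reporter(timesteps_total=i, done=(i == 9))

def implicit_done(config, reporter):
    for i in range(10):
        reporter(timesteps_total=i)
    # done is never reported, so the runner appends a final
    # {"done": True} result when the function returns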

@richardliaw
Contributor

richardliaw left a comment

At a high level, this change of behavior is good; thanks for making this PR!

Overall, I would prefer removing some complexity (i.e., removing code paths like should_stop and cleanups from _stop, as these have not been issues in the past - this is a pretty small edge case and people will probably checkpoint their stuff with some frequency anyways).

Let me know.

python/ray/tune/function_runner.py: outdated review threads (resolved)
@gehring
Contributor Author

gehring commented Feb 16, 2019

At a high level, this change of behavior is good; thanks for making this PR!

Overall, I would prefer removing some complexity (i.e., removing code paths like should_stop and cleanups from _stop, as these have not been issues in the past - this is a pretty small edge case and people will probably checkpoint their stuff with some frequency anyways).

Let me know.

I agree with you. It should all be stripped away now. Let me know if there is anything else! I should be able to get around to fixing the tests in the next couple of days.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12025/
Test FAILed.

@richardliaw
Contributor

Thanks! I think this looks good and would be ready after the tests are added.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12029/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12574/
Test FAILed.

@richardliaw
Contributor

richardliaw commented Mar 13, 2019

I was just debugging a Jenkins issue and I realized that there are two problems if the user doesn't report done=True themselves (e.g., via done = i == 99), which is rather cumbersome and a common failure mode:

  1. One user problem this introduces is that trial.last_result becomes effectively useless.

  2. on_trial_complete no longer receives anything useful, which would break a lot of current search algorithms.

One workaround is for the function_runner to call reporter(**last_result, __done__=True). This result will not be exposed to the user, and will skip the logging, but will be passed into on_trial_complete for the scheduler and search algorithms. Let me know what you think, and I can try to implement it if necessary.
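
A rough sketch of that workaround (the helper below is hypothetical; only the __done__ flag comes from the proposal above):

def finish_trial(reporter, last_result):
    # duplicate the last user-reported result and mark it with the
    # private __done__ flag when the user function returns without
    # ever reporting done=True
    final_result = dict(last_result)
    final_result["__done__"] = True  # skipped by the loggers, but still passed to
                                     # on_trial_complete for schedulers/search algorithms
    reporter(**final_result)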

EDIT: I've made a PR against your fork, let me know what you think

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12822/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12837/
Test FAILed.

@gehring
Contributor Author

gehring commented Mar 13, 2019

I was just debugging a Jenkins issue and I realized that there are two problems if the user doesn't report done=True themselves (e.g., via done = i == 99), which is rather cumbersome and a common failure mode:

  1. One user problem this introduces is that trial.last_result becomes effectively useless.
  2. on_trial_complete no longer receives anything useful, which would break a lot of current search algorithms.

One workaround is for the function_runner to call reporter(**last_result, __done__=True). This result will not be exposed to the user, and will skip the logging, but will be passed into on_trial_complete for the scheduler and search algorithms. Let me know what you think, and I can try to implement it if necessary.

EDIT: I've made a PR against your fork, let me know what you think

@richardliaw Thanks for the PR. I won't be able to look at it closely for a day or two, but here are my high level thoughts.

I think that reporter(..., __done__=True) could work fine, but given the possibility of changing the API to a generator-based one (i.e., without a reporter), we might just be kicking the problem further down the road. Additionally, it seems to me that it would require reimplementing some of the behavior provided by Trial (or some other class) for keeping track of the last result and editing it when necessary. It might be best to try to find a more general solution.

Without having thought through the details, I'm thinking that using the standard StopIteration exception might solve this problem with little effort. If we allow Trainable._train() and Trainable.train() to raise a StopIteration exception, we would only need to modify the class in charge of manipulating the last results (not sure off the top of my head if that is Trainable, Trial, or one of the runner/executors).

The logic to do this would look like this:

class ClassThatHandlesTrainCalls(object):

    ...

    def function_responsible_for_last_result(self, ...):
        # any necessary work to do before the next iteration call
        ...
        try:
            self._last_results = self._trainable.train().copy()
        except StopIteration:
            self._last_results[DONE] = True
        # whatever else this is responsible for updating
        ...

A StopIteration solution would make the implementation of a generator-based API very simple, with the added benefit of allowing custom Trainable classes to implement their own termination mechanics by raising StopIteration. Plus, we avoid having to manage last results in several places in the code base, dodging future bugs and inconsistencies. The only issue I could foresee with this approach would come from the fact that the logic is distributed throughout Ray. What do you think?
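
For illustration, a training function under the proposed generator-based API might look like this (a sketch of the idea being discussed, not an existing Tune API):

def train(config):
    for i in range(100):
        # ... one training step ...
        yield {"timesteps_total": i, "mean_accuracy": 0.9}
    # returning here ends the generator; the runner's next() call then
    # raises StopIteration, which would be translated into done=True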

@richardliaw
Contributor

richardliaw commented Mar 13, 2019 via email

@richardliaw
Contributor

This now looks good to me and I think we can merge, although it'd be great to get your eyes on some of the minor implementation changes that I've made.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12921/
Test FAILed.

@gehring
Contributor Author

gehring commented Mar 16, 2019

@richardliaw

Now that this is blocking another PR, I am happy going with your __duplicate__ approach (edit: just noticed it was already merged) and finally merging this PR. Though I do have some follow-up on the StopIteration discussion; we could open a new issue/PR to continue that discussion and avoid delaying this PR further.

  1. Whatever we do needs to be very simple for users; I would like as much
    as possible to avoid any new concepts or gotchas (i.e., StopIteration,
    remembering to set Done, etc -- this is not great IMO), especially in
    places like Trainable where it doesn't make sense to raise StopIteration.
    My PR against your branch explicitly avoids this without requiring to
    reimplement any of the other parts of Tune.

I wasn't suggesting that the user is meant to raise StopIteration, simply that its use be supported. More generally, I think the logic allowing trials to terminate on their own should be independent of the logic in charge of the results.

To be clear, I think the current {"done": True, ...} API is fine and can definitely stay the encouraged/public way of terminating trials from within. However, fully supporting unexpected trial termination would give a lot more flexibility to both the public and private APIs. In a way, this is already supported, since regular exceptions need to be caught and reported, so all that is needed is a way to trigger this without propagating an error. Support for StopIteration would simply involve treating those particular exceptions the same way Python control flow does.

Supporting StopIteration would allow for a very straightforward function runner:

class FunctionRunnerV2(Trainable):
    def _setup(self, config):
        if is_reporter_required(self._train_func):
            # create a backwards compatible generator
            ...
        else:
            self._train_iter = self._train_func(config)

    def _train(self):
        # raises StopIteration when exhausted, i.e., when the training
        # function terminates/returns
        return next(self._train_iter)

Finally, supporting StopIteration should be quite simple. It would probably only require the following changes to TrialRunner:

class TrialRunner(object):
    # some other methods
    ...
    def _process_trial(self, trial):
        try:
            result = self.trial_executor.fetch_result(trial)
            # process result
            ...
        # extra code for StopIteration
        except StopIteration:
            # trial is exhausted, do any necessary bookkeeping 
            trial.set_last_result_done(True)
            ...
        except Exception:
            # process other exceptions normally
            logger.exception("Error processing event.")
            ...

  1. We can do away with the "reporter" specifically, but keep in mind,
    backwards compat is important at least for a while after introducing a new
    "recommended" API. There is another API that we are considering internally
    that does not require a reporter nor a generator-based API. The high
    level idea is to use some global context within each worker. We will push a
    PR soon, and I'd love to get your feedback then.

Any reason to prefer a global context over an iterator/generator? I'm usually a little apprehensive about relying on global state, but I'm excited to see what you guys are working on! Do you have a design I can check out?
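
(For concreteness, a global-context style of reporting might look roughly like the sketch below; every name here is hypothetical and this is not the design being referenced:)

_current_reporter = None  # installed by the function runner in each worker

def set_reporter(reporter):
    # called by the runner before invoking the user's function
    global _current_reporter
    _current_reporter = reporter

def log(**metrics):
    # user-facing call: forwards to whatever reporter the runner installed
    _current_reporter(**metrics)

def train(config):
    for i in range(100):
        # ... one training step ...
        log(timesteps_total=i, mean_accuracy=0.9)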

As for backwards compat, I am definitely keeping it in mind! I don't think it would be too complicated to maintain it.

@gehring
Contributor Author

gehring commented Mar 16, 2019

This now looks good to me and I think we can merge, although it'd be great to get your eyes on some of the minor implementation changes that I've made.

Just went through it; as long as the __duplicate__ approach hasn't broken any tests, everything looks good to me!

@richardliaw
Contributor

richardliaw commented Mar 17, 2019

I wasn't suggesting that the user is meant to raise StopIteration, simply that its use be supported. More generally, I think the logic allowing trials to terminate on their own should be independent of the logic in charge of the results.

OK, I agree with this and also agree it shouldn't be hard to add. Instead of a FunctionRunnerV2, I'd maybe create an IteratorRunner and have a check for whether the input is an IteratorRunner. Sorry for the miscommunication! Let's open up a new issue in reference to the Generator/Iterator support.
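
(A minimal sketch of what that check could look like; the dispatch helper and both runner classes here are placeholders from this discussion, not existing Tune code:)

import inspect

class FunctionRunner(object):
    """Stand-in for the existing reporter-based runner."""

class IteratorRunner(object):
    """Hypothetical runner for generator-based training functions."""

def wrap_function(train_func):
    # route generator functions to the iterator-based runner,
    # everything else to the reporter-based runner
    if inspect.isgeneratorfunction(train_func):
        return IteratorRunner
    return FunctionRunner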

Any reason to prefer a global context over an iterator/generator? I'm usually a little apprehensive when relying on some global state, but I'm excited to see what you guys are working on! Do you have a design I can check out?

Yes! It's a very light impl actually but maybe @noahgolmant can comment (or I'll open a new Issue as RFC in a bit).

Just went through it; as long as the __duplicate__ approach hasn't broken any tests, everything looks good to me!

Thanks a bunch! I'll fix up the tests and get this merged; I think this will clean up a lot. Thanks!

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12987/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12994/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12995/
Test PASSed.

@richardliaw richardliaw merged commit 7c3274e into ray-project:master Mar 19, 2019
@richardliaw
Contributor

Thanks a bunch for this contribution @gehring!

@gehring
Contributor Author

gehring commented Mar 19, 2019

@richardliaw No problem! Ray/Tune/RLlib is a great set of libraries and I am happy to contribute!
