
[tune] Make the logging of the function API consistent and predictable #4011

Merged
merged 33 commits into ray-project:master on Mar 19, 2019

Conversation

gehring
Contributor

@gehring gehring commented Feb 11, 2019

What do these changes do?

This is a re-implementation of the FunctionRunner which enforces some synchronicity between the thread running the training function and the thread running the Trainable which logs results. The main purpose is to make logging consistent across APIs in anticipation of a new function API which will be generator based (through yield statements). Without these changes, it will be impossible for the (possibly soon to be) deprecated reporter based API to behave the same as the generator based API.

This new implementation provides additional guarantees to prevent results from being dropped. This makes the logging behavior more intuitive and consistent with how results are handled in custom subclasses of Trainable.

New guarantees for the tune function API (a short usage sketch of the reporter-based API follows this list):

  • Every reported result, i.e., every reporter(**kwargs) call, is forwarded to the appropriate loggers instead of being dropped when not enough time has elapsed since the last result.
  • The wrapped function only runs if the FunctionRunner expects a result, i.e., when FunctionRunner._train() has been called. This removes the possibility that a result will be generated by the function but never logged.
  • The wrapped function is not called until the first _train() call. Currently, the wrapped function is started during the setup phase which could result in dropped results if the trial is cancelled between _setup() and the first _train() call.
  • Exceptions raised by the wrapped function won't be propagated until all results are logged to prevent dropped results.
  • The thread running the wrapped function is explicitly stopped when the FunctionRunner is stopped with _stop().
  • If the wrapped function terminates without reporting done=True, a duplicate result with {"done": True} is reported to explicitly terminate the trial, but it will not be logged.
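
For reference, here is a minimal sketch of the reporter-based function API that these guarantees apply to (illustrative only; the metric names and loop are assumptions, not taken from this PR):

def train(config, reporter):
    for i in range(100):
        # ... one training step ...
        reporter(timesteps_total=i, mean_accuracy=0.9)  # every call is now logged
    # if done=True is never reported, the runner appends a final
    # {"done": True} result itself (see the last bullet above)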

Related issue number

#3956
#3949
#3834

@richardliaw richardliaw changed the title Make the logging of the function API consistent and predictable [tune] Make the logging of the function API consistent and predictable Feb 11, 2019
@gehring
Contributor Author

gehring commented Feb 11, 2019

@richardliaw Would you mind checking this out? I didn't want to go ahead with the generator based API until I could guarantee that both the old and new function runner would behave mostly the same.

This implementation seems to work as expected. Most tests pass at the moment, with the exception of the tests which expect the last result to be repeated when done=True is never set. I'll go through and fix the tests once you've given me your thoughts on the new logging behavior and runner implementation.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11754/
Test FAILed.

@richardliaw
Contributor

Sorry for the wait, I'll get to this today! (And feel free to ping if you don't hear)

@richardliaw
Contributor

jenkins retest this please

@richardliaw
Contributor

richardliaw commented Feb 13, 2019 via email

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11878/
Test FAILed.

@gehring
Contributor Author

gehring commented Feb 13, 2019

What do you think about having an option that toggles asynchronous vs synchronous behavior for the reporter?

This would be fairly straightforward to do, but I would highly recommend against it, as I don't see any advantages in doing so. Adding the option would add complexity to the API (and a source of confusion) while not offering any tangible benefits. In principle, most of the computation time should be spent within the training function, so, in the best case, I don't expect any significant performance gains to offset the added complexity (and source of bugs).

While the old version was asynchronous, it was not parallel (due to the GIL). Making the two threads fully asynchronous again would only provide performance benefits if logging involved especially long blocking calls.
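
As a rough illustration of that synchronous hand-off, a simplified sketch (assumed names, not the actual FunctionRunner code):

import threading
import queue

class SyncReporter(object):
    """Sketch: reporter() blocks until _train() asks for a result,
    so no reported result can ever be dropped."""

    def __init__(self):
        self._results = queue.Queue(maxsize=1)  # at most one in-flight result
        self._continue = threading.Event()      # set when the runner wants a result

    def __call__(self, **result):
        # called from the training thread
        self._continue.wait()   # block until _train() requests the next result
        self._continue.clear()
        self._results.put(result)

    def fetch_result(self):
        # called from the Trainable thread inside _train()
        self._continue.set()        # let the training thread produce one result
        return self._results.get()  # block until it arrives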

Given that this change is only meant to provide consistency between the Trainable API, the new generator function API, and the reporter function API, I'm not keen on allowing users to opt out of the predictable behavior, but I'm open to arguments for asynchronicity. That is even more true if the reporter function API is to be deprecated in favor of the generator function API and kept around only for backwards compatibility.

(As a related aside, any idea how long Python 2.7 will be supported by Ray? An async/await implementation of the function runner would have been very compact and understandable, but I couldn't use it because of the need for backwards compatibility.)

@gehring
Contributor Author

gehring commented Feb 13, 2019

jenkins retest this please

@richardliaw There are several tests that would require changing before they can pass, as they assume some odd corner-case behavior. The new implementation avoids these cases by always reporting done=True. Pretty much all failing tests seem to use this corner case as a way to avoid manually reporting done=True and should be easy to fix. Once I've addressed all your concerns with the FunctionWrapper, I'll go through and add the changes to the tests to this pull request.

Alternatively, if returning a simple result, {"done": True}, is not desirable, we could remove the requirement that the FunctionWrapper returns done. I've only kept it because there was a special block of code in the old version dedicated to preventing this from happening, going as far as repeating the last result to avoid it.

@richardliaw
Contributor

In principle, most of the computation time should be spent within the training function, so, in the best case, I don't expect any significant performance gains to offset the added complexity (and source of bugs).

That's a good point.

any idea how long Python 2.7 will be supported by Ray?

No clue... although I think the broader community would be really interested in this discussion (enough to warrant a post on the dev list I think!)

Alternatively, if returning a simple result, {"done": True}, is not desirable.

I think returning a simple result, {"done": True}, is fine. It would probably be good to add a comment or log something to tell the user that a done=True result is being provided and may affect the result output (specifically for users that have their own logging mechanisms).

We should also add a test that we don't report done=True twice, in case the user sets done=True themselves.
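
To make the two cases concrete, a small sketch of the user-side behavior being discussed (illustrative only; the metric names are assumptions):

def explicit_done(config, reporter):
    for i in range(10):
        # the user sets done=True themselves on the last step; the runner
        # should not report a second done=True result afterwards
        reporter(timesteps_total=i, done=(i == 9))

def implicit_done(config, reporter):
    for i in range(10):
        reporter(timesteps_total=i)
    # done is never reported, so the runner appends a final
    # {"done": True} result when the function returns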

@richardliaw
Contributor

richardliaw left a comment

At a high level, this change of behavior is good; thanks for making this PR!

Overall, I would prefer removing some complexity (i.e., removing code paths like should_stop and cleanups from _stop, as these have not been issues in the past - this is a pretty small edge case and people will probably checkpoint their stuff with some frequency anyways).

Let me know.

python/ray/tune/function_runner.py: outdated review threads (resolved)
@gehring
Contributor Author

gehring commented Feb 16, 2019

At a high level, this change of behavior is good; thanks for making this PR!

Overall, I would prefer removing some complexity (i.e., removing code paths like should_stop and cleanups from _stop, as these have not been issues in the past - this is a pretty small edge case and people will probably checkpoint their stuff with some frequency anyways).

Let me know.

I agree with you. It should all be stripped away now. Let me know if there is anything else! I should be able to get around to fixing the tests in the next couple of days.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12025/
Test FAILed.

@richardliaw
Contributor

Thanks! I think this looks good and would be ready after the tests are added.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12029/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12574/
Test FAILed.

@richardliaw
Contributor

richardliaw commented Mar 13, 2019

I was just debugging a Jenkins issue and I realized that there are two problems if the user doesn't report done=True themselves (e.g., via done = i == 99), which is rather cumbersome and a common failure mode:

  1. One user problem this introduces is that trial.last_result becomes effectively useless.

  2. on_trial_complete no longer receives anything useful, which would break a lot of current search algorithms.

One workaround is for the function_runner to call reporter(**last_result, __done__=True). This result will not be exposed to the user, and will skip the logging, but will be passed into on_trial_complete for the scheduler and search algorithms. Let me know what you think, and I can try to implement it if necessary.
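
A rough sketch of that workaround (the helper below is hypothetical; only the __done__ flag comes from the proposal above):

def finish_trial(reporter, last_result):
    # duplicate the last user-reported result and mark it with the
    # private __done__ flag when the user function returns without
    # ever reporting done=True
    final_result = dict(last_result)
    final_result["__done__"] = True  # skipped by the loggers, but still passed to
                                     # on_trial_complete for schedulers/search algorithms
    reporter(**final_result)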

EDIT: I've made a PR against your fork, let me know what you think

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12822/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12837/
Test FAILed.

@gehring
Contributor Author

gehring commented Mar 13, 2019

I was just debugging a Jenkins issue and I realized that there are two problems if the user doesn't report done=True themselves (e.g., via done = i == 99), which is rather cumbersome and a common failure mode:

  1. One user problem this introduces is that trial.last_result becomes effectively useless.
  2. on_trial_complete no longer receives anything useful, which would break a lot of current search algorithms.

One workaround is for the function_runner to call reporter(**last_result, __done__=True). This result will not be exposed to the user, and will skip the logging, but will be passed into on_trial_complete for the scheduler and search algorithms. Let me know what you think, and I can try to implement it if necessary.

EDIT: I've made a PR against your fork, let me know what you think

@richardliaw Thanks for the PR. I won't be able to look at it closely for a day or two, but here are my high level thoughts.

I think that reporter(..., __done__=True) could work fine, but given the possibility of changing the API to a generator-based one (i.e., without a reporter), we might just be kicking the problem further down the road. Additionally, it seems to me that it would require reimplementing some of the behavior provided by Trial (or some other class) for keeping track of the last result and editing it when necessary. It might be best to try to find a more general solution.

Without having thought through the details, I'm thinking that using the standard StopIteration exception might solve this problem with little effort. If we allow Trainable._train() and Trainable.train() to raise a StopIteration exception, we would only need to modify the class in charge of manipulating the last results (not sure off the top of my head if that is Trainable, Trial, or one of the runner/executors).

The logic to do this would look like this:

class ClassThatHandlesTrainCalls(object):

    ...

    def function_responsible_for_last_result(self, ...):
        # any necessary work to do before the next iteration call
        ...
        try:
            self._last_results = self._trainable.train().copy()
        except StopIteration:
            self._last_results[DONE] = True
        # whatever else this is responsible for updating
        ...

A StopIteration solution would make the implementation of a generator-based API very simple, with the added benefit of allowing custom Trainable classes to implement their own termination mechanics by raising StopIteration. Plus, we avoid having to manage last results in several places in the code base, dodging future bugs and inconsistencies. The only issue I could foresee with this approach would come from the fact that the logic is distributed throughout Ray. What do you think?
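
For illustration, a training function under the proposed generator-based API might look like this (a sketch of the idea being discussed, not an existing Tune API):

def train(config):
    for i in range(100):
        # ... one training step ...
        yield {"timesteps_total": i, "mean_accuracy": 0.9}
    # returning here ends the generator; the runner's next() call then
    # raises StopIteration, which would be translated into done=True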

@richardliaw
Contributor

richardliaw commented Mar 13, 2019 via email

@richardliaw
Contributor

This now looks good to me and I think we can merge, although it'd be great to get your eyes on some of the minor implementation changes that I've made.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12921/
Test FAILed.

@gehring
Contributor Author

gehring commented Mar 16, 2019

@richardliaw

Now that this is blocking another PR, I am happy going with your __duplicate__ approach (edit: just noticed it was already merged) and finally merging this PR. Though I do have some follow-up on the StopIteration discussion; we could open a new issue/PR to continue that discussion and avoid delaying this PR further.

  1. Whatever we do needs to be very simple for users; I would like as much
    as possible to avoid any new concepts or gotchas (i.e., StopIteration,
    remembering to set Done, etc -- this is not great IMO), especially in
    places like Trainable where it doesn't make sense to raise StopIteration.
    My PR against your branch explicitly avoids this without requiring to
    reimplement any of the other parts of Tune.

I wasn't suggesting that the user is meant to raise StopIteration, simply that its use be supported. More generally, I think the logic allowing trials to terminate on their own should be independent of the logic in charge of the results.

To be clear, I think the current {"done": True, ...} API is fine and can definitely stay the encouraged/public way of terminating trials from within. However, fully supporting unexpected trial termination would give a lot more flexibility to both the public and private APIs. In a way, this is already supported, since regular exceptions need to be caught and reported, so all that is needed is a way to trigger this without propagating an error. Support for StopIteration would simply involve treating those particular exceptions the same way Python control flow does.

Supporting StopIteration would allow for a very straightforward function runner:

class FunctionRunnerV2(Trainable):
    def _setup(self, config):
        if is_reporter_required(self._train_func):
            # create a backwards compatible generator
            ...
        else:
            self._train_iter = self._train_func(config)

    def _train(self):
        # raises StopIteration when exhausted, i.e., when the training
        # function terminates/returns
        return next(self._train_iter)

Finally, supporting StopIteration should be quite simple. It would probably only require the following changes to TrialRunner:

class TrialRunner(object):
    # some other methods
    ...
    def _process_trial(self, trial):
        try:
            result = self.trial_executor.fetch_result(trial)
            # process result
            ...
        # extra code for StopIteration
        except StopIteration:
            # trial is exhausted, do any necessary bookkeeping 
            trial.set_last_result_done(True)
            ...
        except Exception:
            # process other exceptions normally
            logger.exception("Error processing event.")
            ...

  1. We can do away with the "reporter" specifically, but keep in mind,
    backwards compat is important at least for a while after introducing a new
    "recommended" API. There is another API that we are considering internally
    that does not require a reporter nor a generator-based API. The high
    level idea is to use some global context within each worker. We will push a
    PR soon, and I'd love to get your feedback then.

Any reason to prefer a global context over an iterator/generator? I'm usually a little apprehensive about relying on global state, but I'm excited to see what you guys are working on! Do you have a design I can check out?
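
(For concreteness, a global-context style of reporting might look roughly like the sketch below; every name here is hypothetical and this is not the design being referenced:)

_current_reporter = None  # installed by the function runner in each worker

def set_reporter(reporter):
    # called by the runner before invoking the user's function
    global _current_reporter
    _current_reporter = reporter

def log(**metrics):
    # user-facing call: forwards to whatever reporter the runner installed
    _current_reporter(**metrics)

def train(config):
    for i in range(100):
        # ... one training step ...
        log(timesteps_total=i, mean_accuracy=0.9)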

As for backwards compat, I am definitely keeping it in mind! I don't think it would be too complicated to maintain it.

@gehring
Contributor Author

gehring commented Mar 16, 2019

This now looks good to me and I think we can merge, although it'd be great to get your eyes on some of the minor implementation changes that I've made.

Just went through it; as long as the __duplicate__ approach hasn't broken any tests, everything looks good to me!

@richardliaw
Contributor

richardliaw commented Mar 17, 2019

I wasn't suggesting that the user is meant to raise StopIteration, simply that its use be supported. More generally, I think the logic allowing trials to terminate on their own should be independent of the logic in charge of the results.

OK, I agree with this and also agree it shouldn't be hard to add. Instead of a FunctionRunnerV2, I'd maybe create an IteratorRunner and have a check for whether the input is an IteratorRunner. Sorry for the miscommunication! Let's open up a new issue in reference to the Generator/Iterator support.
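
(A minimal sketch of what that check could look like; the dispatch helper and both runner classes here are placeholders from this discussion, not existing Tune code:)

import inspect

class FunctionRunner(object):
    """Stand-in for the existing reporter-based runner."""

class IteratorRunner(object):
    """Hypothetical runner for generator-based training functions."""

def wrap_function(train_func):
    # route generator functions to the iterator-based runner,
    # everything else to the reporter-based runner
    if inspect.isgeneratorfunction(train_func):
        return IteratorRunner
    return FunctionRunner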

Any reason to prefer a global context over an iterator/generator? I'm usually a little apprehensive when relying on some global state, but I'm excited to see what you guys are working on! Do you have a design I can check out?

Yes! It's a very light impl actually but maybe @noahgolmant can comment (or I'll open a new Issue as RFC in a bit).

Just went through it; as long as the __duplicate__ approach hasn't broken any tests, everything looks good to me!

Thanks a bunch! I'll fix up the tests and get this merged; I think this will clean up a lot. Thanks!

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12987/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12994/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12995/
Test PASSed.

@richardliaw richardliaw merged commit 7c3274e into ray-project:master Mar 19, 2019
@richardliaw
Contributor

Thanks a bunch for this contribution @gehring!

@gehring
Contributor Author

gehring commented Mar 19, 2019

@richardliaw No problem! Ray/Tune/RLlib is a great set of libraries and I am happy to contribute!
