[tune] Make the logging of the function API consistent and predictable #4011
Conversation
…cross both the Trainable class API and the function API.
…how functions are wrapped
@richardliaw Would you mind checking this out? I didn't want to go ahead with the generator based API until I could guarantee that both the old and new function runner would behave mostly the same. This implementation seems to work as expected. Most tests pass at the moment, with the exception of the tests which expect the last result to be repeated when …
Test FAILed.
Sorry for the wait, I'll get to this today! (And feel free to ping if you don't hear)
jenkins retest this please
I think it looks good! What do you think about having an option that toggles asynchronous vs synchronous behavior for the reporter?
Test FAILed.
This would be fairly straightforward to do, but I would highly recommend against it as I don't see any advantages in doing so. Adding the option would add complexity to the API (and a source of confusion) while not offering any tangible benefits. In principle, most of the computation time should be spent within the training function, so, in the best case, I don't expect any significant performance gains to offset the added complexity (and source of bugs). While the old version was asynchronous, it was not parallel (due to the GIL). Making them fully asynchronous again would only provide performance benefits if logging had especially long blocking calls.

Given this change is only to provide consistency between the Trainable API, the new generator function API, and the reporter function API, I'm not keen on allowing users to opt out of the predictable behavior, but I'm open to arguments for asynchronicity. Even more so if the reporter function API is to be deprecated in favor of the generator function API and is kept around only for backwards compatibility purposes.

(As a related aside, any idea how long Python 2.7 will be supported by ray? An …)
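For concreteness, the synchronous behavior under discussion can be pictured as a blocking handoff between the training thread and the logging thread. A minimal sketch, assuming a queue-based design; the class and method names here are made up for illustration, not the PR's actual implementation:

```python
import queue


class SyncReporter(object):
    """Illustrative synchronous reporter: the training thread blocks on
    each call until the logging thread has consumed the result."""

    def __init__(self):
        self._results = queue.Queue(maxsize=1)

    def __call__(self, **kwargs):
        # Called from the training thread: hand over the result, then
        # wait until the logging side marks it as processed.
        self._results.put(kwargs)
        self._results.join()

    def fetch_result(self):
        # Called from the logging thread: take the pending result and
        # unblock the training thread via task_done().
        result = self._results.get()
        self._results.task_done()
        return result
```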
@richardliaw There are several tests that would require changing before they can pass, as they assume some weird corner case behavior. The new implementation avoids these cases by always reporting … Alternatively, if returning a simple result, …
That's a good point.
No clue... although I think the broader community would be really interested in this discussion (enough to warrant a post on the dev list I think!)
I think returning a simple result, … We should also add a test that we don't report …
At a high level, this change of behavior is good; thanks for making this PR!
Overall, I would prefer removing some complexity (i.e., removing code paths like `should_stop` and cleanups from `_stop`), as these have not been issues in the past - this is a pretty small edge case and people will probably checkpoint their stuff with some frequency anyways. Let me know.
Co-Authored-By: gehring <[email protected]>
…redundant _started attribute
I agree with you. It should all be stripped away now. Let me know if there is anything else! I should be able to get around to fixing the tests in the next couple of days.
Test FAILed.
Thanks! I think this looks good and would be ready after the tests are added.
Test FAILed.
Test FAILed.
I was just debugging a Jenkins issue and I realized that there are two problems if the user doesn't report `done = i == 99`, which is rather cumbersome and a common failure mode:

1. one user problem that this introduces is that `trial.last_result` becomes effectively useless
2. `on_trial_complete` no longer receives anything useful, which would break a lot of current search algorithms.

One workaround is for the function_runner to call `reporter(**last_result, __done__=True)`. This result will not be exposed to the user, and will skip the logging, but will be passed into `on_trial_complete` for the scheduler and search algorithms. Let me know what you think, and I can try to implement it if necessary.

EDIT: I've made a PR against your fork, let me know what you think
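A rough sketch of that workaround on the runner side; the helper name `report_final_result` is hypothetical, and the actual PR against the fork may differ:

```python
def report_final_result(reporter, last_result):
    # Hypothetical runner-side helper: re-report the user's last result
    # with a private flag. The duplicate skips the loggers but still
    # reaches on_trial_complete for schedulers and search algorithms.
    final_result = dict(last_result)
    final_result["__done__"] = True
    reporter(**final_result)
```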
Test FAILed.
Test FAILed.
@richardliaw Thanks for the PR. I won't be able to look at it closely for a day or two, but here are my high level thoughts.

I think that a `reporter(..., __done__=True)` could work fine, but given there is the possibility of changing the API to a generator based API (i.e., without a reporter), we might just be kicking the problem further down the road. Additionally, it seems to me that it would require reimplementing some of the behavior provided by Trial (or some other class) for keeping track of the last result, editing it when necessary. It might be best to try to find a more general solution.

Without having thought through the details, I'm thinking using the standard `StopIteration` exception might solve this problem with little effort. If we allow `Trainable._train()` and `Trainable.train()` to raise a `StopIteration` exception, we would only need to modify the class in charge of manipulating the last results (not sure off the top of my head if that is `Trainable`, `Trial`, or one of the runner/executors).

The logic to do this would look like this:

```python
class ClassThatHandlesTrainCalls(object):
    ...

    def function_responsible_for_last_result(self, *args, **kwargs):
        # any necessary work to do before the next iteration call
        ...
        try:
            self._last_results = self._trainable.train().copy()
        except StopIteration:
            # DONE is the "done" result key
            self._last_results[DONE] = True
        # whatever else this is responsible for updating
        ...
```

A `StopIteration` solution would make the implementation of a generator based API very simple, with the added benefit of allowing custom `Trainable` classes to implement their own termination mechanics by raising `StopIteration`. Plus, we avoid having to manage last results in several places in the code base, dodging future issues with bugs and inconsistencies. The only issue with this approach I could foresee would come from the fact that the logic is distributed through ray. What do you think?
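To illustrate the "custom termination mechanics" this would enable, a hypothetical Trainable could end its own trial like this; the class, metric, and stopping rule below are invented for the example:

```python
from ray.tune import Trainable


class EarlyStoppingTrainable(Trainable):
    """Hypothetical Trainable that ends its own trial by raising
    StopIteration, as the proposal above would allow."""

    def _setup(self, config):
        self._accuracy = 0.0
        self._threshold = config.get("threshold", 0.95)

    def _train(self):
        self._accuracy += 0.1  # stand-in for a real training epoch
        if self._accuracy >= self._threshold:
            # Under the proposal, this terminates the trial, and the
            # caller marks the last reported result as done.
            raise StopIteration
        return {"mean_accuracy": self._accuracy}
```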
Let me know when you take a look at the PR! The main notions I'm sticking with are the following:

1. Whatever we do needs to be very simple for users; I would like as much as possible to avoid any new concepts or gotchas (i.e., StopIteration, remembering to set Done, etc. -- this is not great IMO), especially in places like Trainable where it doesn't make sense to raise StopIteration. My PR against your branch explicitly avoids this without requiring us to reimplement any of the other parts of Tune.
2. We can do away with the "reporter" specifically, but keep in mind, backwards compat is important at least for a while after introducing a new "recommended" API. There is another API that we are considering internally that does not *require* `reporter` nor a generator-based API. The high level idea is to use some global context within each worker. We will push a PR soon, and I'd love to get your feedback then.

Thanks,
Richard
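For concreteness, the worker-local "global context" idea could look roughly like the following sketch; the names here (`log_metrics`, `run_with_context`, the `handler` wiring) are speculative, not the API the team was actually preparing:

```python
import threading

# Worker-local state holding the active result handler.
_context = threading.local()


def log_metrics(**metrics):
    # User-facing free function: no reporter argument, no yield.
    _context.handler(metrics)


def run_with_context(train_func, config, handler):
    # Framework side: install the handler for this worker, then run the
    # unmodified user function.
    _context.handler = handler
    train_func(config)


def train_func(config):
    for step in range(100):
        log_metrics(timesteps_total=step)


run_with_context(train_func, {}, print)
```

The trade-off relative to a generator is implicit global state in exchange for leaving user function signatures untouched.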
This now looks good to me and I think we can merge, although it'd be great to get your eyes on some of the minor implementation changes that I've made.
Test FAILed.
Now that this is blocking another PR, I am happy …
I wasn't suggesting that the user is meant to raise `StopIteration` … To be clear, I think the current …

Supporting … could look like this:

```python
class FunctionRunnerV2(Trainable):

    def _setup(self, config):
        if is_reporter_required(self._train_func):
            # create a backwards compatible generator
            ...
        else:
            self._train_iter = self._train_func(config)

    def _train(self):
        # raises StopIteration when exhausted, i.e., when the training
        # function terminates/returns
        return next(self._train_iter)
```

Finally, supporting `StopIteration` in `TrialRunner`:

```python
class TrialRunner(object):
    # some other methods
    ...

    def _process_trial(self, trial):
        try:
            result = self.trial_executor.fetch_result(trial)
            # process result
            ...
        # extra code for StopIteration
        except StopIteration:
            # trial is exhausted, do any necessary bookkeeping
            trial.set_last_result_done(True)
            ...
        except Exception:
            # process other exceptions normally
            logger.exception("Error processing event.")
            ...
```
Any reason to prefer a global context over an iterator/generator? I'm usually a little apprehensive when relying on some global state, but I'm excited to see what you guys are working on! Do you have a design I can check out? As for backwards compat, I am definitely keeping it in mind! I don't think it would be too complicated to maintain it.
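The `# create a backwards compatible generator` branch in the `FunctionRunnerV2` sketch above is left open; one way it could be filled in, as a rough sketch assuming a background thread and a bounded queue (the `reporter_to_generator` name and details are illustrative, not from the PR):

```python
import queue
import threading

_SENTINEL = object()


def reporter_to_generator(train_func, config):
    """Adapt a reporter-based function into an iterator of results."""
    results = queue.Queue(maxsize=1)

    def reporter(**kwargs):
        # Blocks until the runner consumes the previous result, keeping
        # the training thread roughly in lockstep with _train() calls.
        results.put(kwargs)

    def run():
        train_func(config, reporter)
        results.put(_SENTINEL)  # the function returned; signal exhaustion

    threading.Thread(target=run).start()
    while True:
        result = results.get()
        if result is _SENTINEL:
            return  # ends the generator, surfacing as StopIteration
        yield result
```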
Just went through, as long as the …
OK I agree with this and also agree it shouldn't be hard to add. Instead of a FunctionRunnerV2, I'd maybe create an IteratorRunner and have a check if the input is an IteratorRunner. Sorry for the miscommunication! Let's open up a new issue in reference to the Generator/Iterator support.
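The check described here could be as simple as inspecting the user function at registration time; a sketch, where `wrap_function` is a hypothetical name and `IteratorRunner`/`FunctionRunner` stand for the runner classes discussed above:

```python
import inspect


def wrap_function(train_func):
    # Hypothetical registration-time dispatch: generator functions get
    # the iterator-based runner, plain functions keep the reporter-based
    # runner for backwards compatibility.
    if inspect.isgeneratorfunction(train_func):
        return IteratorRunner(train_func)
    return FunctionRunner(train_func)
```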
Yes! It's a very light impl actually but maybe @noahgolmant can comment (or I'll open a new Issue as RFC in a bit).
Thanks a bunch! I'll fix up the tests and get this merged; I think this will clean up a lot. Thanks!
Test FAILed.
Test FAILed.
Test PASSed.
Thanks a bunch for this contribution @gehring!
@richardliaw No problem! Ray/tune/rllib is a great set of libraries and I am happy to contribute!
What do these changes do?
This is a re-implementation of the `FunctionRunner` which enforces some synchronicity between the thread running the training function and the thread running the `Trainable` which logs results. The main purpose is to make logging consistent across APIs in anticipation of a new function API which will be generator based (through `yield` statements). Without these changes, it will be impossible for the (possibly soon to be) deprecated reporter based API to behave the same as the generator based API.

This new implementation provides additional guarantees to prevent results from being dropped. This makes the logging behavior more intuitive and consistent with how results are handled in custom subclasses of `Trainable`.

New guarantees for the tune function API:

- Every result, i.e., all `reporter(**kwargs)` calls, is forwarded to the appropriate loggers instead of being dropped if not enough time has elapsed since the last results.
- The wrapped function only runs when the `FunctionRunner` expects a result, i.e., when `FunctionRunner._train()` has been called. This removes the possibility that a result will be generated by the function but never logged.
- The wrapped function is not started until the first `_train()` call. Currently, the wrapped function is started during the setup phase, which could result in dropped results if the trial is cancelled between `_setup()` and the first `_train()` call.
- The thread running the wrapped function is cleaned up when the `FunctionRunner` is stopped with `_stop()`.
- If the function finishes without reporting `done=True`, a duplicate result with `{"done": True}` is reported to explicitly terminate the trial, but will not be logged.
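For reference, a minimal reporter-based function that these guarantees apply to might look like the following sketch (the metric values are placeholders, and registration details follow the tune API of the time and may differ):

```python
from ray import tune


def train_func(config, reporter):
    for i in range(100):
        accuracy = i / 100.0  # stand-in for real training work
        # Under the new guarantees, every one of these calls reaches the
        # loggers, no matter how quickly the iterations complete.
        reporter(timesteps_total=i, mean_accuracy=accuracy, done=(i == 99))


tune.register_trainable("my_train_func", train_func)
```

If the final `done=True` were omitted, the duplicate `{"done": True}` result described in the last guarantee would be reported on the function's behalf.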
Related issue number
#3956
#3949
#3834