[RLlib] Allow for evaluation to run by timesteps (alternative to episodes) and add auto-setting to make sure train doesn't ever have to wait for eval (e.g. long episodes) to finish. #20757
Conversation
…uple_evaluation_and_training
a few questions. thanks.
doc/source/rllib-training.rst
Outdated
However, you can switch off any exploration behavior for the evaluation workers
via:
of environment configurations). You can activate evaluating policies during training by setting
the ``evaluation_interval`` to an int value (> 0) indicating every how many training calls
Minor: does "training call" basically mean iterations here? Just say 'indicating the number of iterations before an "evaluation step" is run'?
+1
clarified
doc/source/rllib-training.rst
Outdated
For ``evaluation_interval=1``, the sequence is: ``train, eval, train, eval, ...``.
Before each evaluation step, weights from the main model are synchronized to all evaluation workers.
However, it is possible to run evaluation parallel to training via the ``evaluation_parallel_to_training=True``
config flag. In this case, both steps (train and eval) are run at the same time via threading.
I forgot: these eval workers can be remote as well, right?
Yes, eval workers are a completely separate WorkerSet (separate from the "normal" WorkerSet used to collect training data).
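For context, a minimal sketch of the settings discussed in this doc excerpt (values are illustrative, not taken from the diff; the Ray ~1.x `ray.rllib.agents` API is assumed):

import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

# Illustrative values only; key names as used in the docs excerpt above.
trainer = PPOTrainer(
    env="CartPole-v0",
    config={
        # Run an evaluation step every 2nd call to `trainer.train()`.
        "evaluation_interval": 2,
        # Eval runs on its own WorkerSet of (possibly remote) workers.
        "evaluation_num_workers": 2,
        # Run the eval step in a thread, parallel to the ongoing train step.
        "evaluation_parallel_to_training": True,
    },
)

for i in range(2):
    results = trainer.train()
    # Eval results (when present) appear under the "evaluation" key.
    print(i, "evaluation" in results)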
rllib/agents/trainer.py
Outdated
# Set `batch_mode=truncate_episodes` and set
# `rollout_fragment_length` such that desired steps are divided
# equally amongst workers or - in auto duration mode - set it
# to a reasonable small number (10).
Actually, any chance you can expand this comment a little bit, so we understand why a reasonable small number works for auto mode? Thanks :)
done
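For illustration only, a rough sketch of the kind of arithmetic the quoted comment describes (the helper name and exact formula are assumptions, not code from this PR):

def eval_rollout_fragment_length(duration, num_workers, num_envs_per_worker):
    """Split a fixed timestep budget evenly across eval workers (sketch)."""
    if duration == "auto":
        # Auto mode: use a small fragment (e.g. 10) so the eval loop can
        # frequently re-check whether the parallel train step has finished.
        return 10
    return max(1, duration // (num_workers * num_envs_per_worker))

# E.g. evaluation_duration=100 timesteps on 4 workers with 1 env each:
print(eval_rollout_fragment_length(100, 4, 1))    # -> 25 timesteps per sample()
print(eval_rollout_fragment_length("auto", 4, 1))  # -> 10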
@@ -914,19 +952,32 @@ def step_attempt(self) -> ResultDict:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            train_future = executor.submit(
I notice this is still not fully parallelized: our eval() call will overlap with at most one train_exec_impl() call, right?
I imagine the logic may be cleaner if we always have train_exec_impl() run in the main thread, and simply throw self.evaluate() into a ThreadPoolExecutor() if we want evaluation_parallel_to_training.
Something like:

while True:
    evaluate_this_iter = ...
    if evaluate_this_iter:
        if evaluation_parallel_to_training:
            self.thread_executor.submit(self.evaluate)
        else:
            self.evaluate()
    self.train_exec_impl()

Something like this. Just an idea.
This is definitely out of scope for this PR, but I would love to see what you think or what I missed.
I think this is how it used to work, but the problem is that the eval step needs to check whether the train step (which is running in a thread) is done, not the other way around. So the logic is:
- Start the train step in a thread.
- Do eval (10 timesteps as per the rollout_fragment_length setting).
- Check if train is done:
  - No? -> Do another 10 steps.
  - Yes? -> Return eval results immediately, such that the next call to train doesn't have to be blocked by a still-running eval step.

Note that we probably should not use the phrase "complete decoupling", as this is misleading: We still need some form of coupling, as we have to sync weights from the local worker to all evaluation workers before each train+eval step. It's merely the duration of the eval step that we more closely align with the train duration in this PR, such that it feels like neither has to wait for the other anymore.
I'll add some example settings to the docs as well and change the title of this PR to clarify this. I hope this explains the logic.
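A stripped-down sketch of that control flow (hypothetical helper names, not the actual `step_attempt()` code), assuming the train step runs in a thread and the eval loop polls it:

import concurrent.futures

def train_step_with_parallel_eval(train_step, eval_chunk, sync_weights):
    """Run one train step in a thread while eval polls for its completion."""
    # Weights still have to be synched to the eval workers up front.
    sync_weights()
    eval_results = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        train_future = executor.submit(train_step)
        # Always evaluate at least once, even if training finishes very fast.
        while True:
            eval_results.append(eval_chunk())  # e.g. ~10 timesteps per worker
            if train_future.done():
                break
        train_results = train_future.result()
    return train_results, eval_results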
done
Again, totally out of scope for this PR, just discussion at this point. I am curious:
The assumption here seems to be that the eval step is much faster than the train step, so we let eval workers check whether train is done. I kind of feel like that's not necessarily true, actually. Also, why do another 10 steps? Won't that make different eval steps run for different amounts of steps, and introduce variance into the eval result?
I think evaluation_parallel_to_training=False is a simple case: eval simply blocks train, nothing controversial there.
But when evaluation_parallel_to_training=True, shouldn't we make sure eval runs in a separate thread so the main training thread never gets blocked?
Something like:

while True:
    train()
    if eval_is_currently_running:
        # Do nothing.
        continue
    else:
        if is_there_new_eval_result:
            # Report freshly available eval results.
            eval_result = ...
        else:
            eval_result = None
        if should_we_kick_off_a_new_eval:
            sync_weights()
            kick_off_a_new_eval_in_thread_pool()

Again, not nit-picking, just saying: it feels like "train in the main thread and eval on the side" is an easier mental model for users.
It's also more likely to be CORRECT... 😏
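For completeness, a rough, runnable sketch of that alternative (hypothetical names and callables, only to illustrate the suggested control flow, not code from this PR):

import concurrent.futures

def train_with_background_eval(train_step, evaluate, sync_weights, want_eval,
                               num_iterations=10):
    """Train in the main thread; kick off eval on the side when asked to."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    eval_future = None
    for _ in range(num_iterations):
        train_results = train_step()  # The main thread never blocks on eval.
        eval_results = None
        if eval_future is not None and eval_future.done():
            # Report freshly available eval results.
            eval_results = eval_future.result()
            eval_future = None
        if eval_future is None and want_eval():
            sync_weights()  # Weights still need syncing before each eval.
            eval_future = executor.submit(evaluate)
        print(train_results, eval_results)
    executor.shutdown(wait=True)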
rllib/agents/trainer.py
Outdated
# Run at least one `evaluate()`, even if the training
# is very fast.
def duration_fn(remaining_duration):
Should we define this duration_fn outside of all these if clauses, higher up in this function, so it doesn't have to live in such a nested place?
You can partial-bind the train_future instance, for example.
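For example, something along these lines (a sketch of the suggestion, with assumed names; not the code as merged):

import functools

def _duration_fn(num_units_done, *, unit, num_eval_workers, train_future):
    # Auto-mode sketch: keep evaluating in small chunks until training is done.
    # (A non-auto version would use num_units_done to compute the remainder.)
    if train_future.done():
        return 0
    return num_eval_workers if unit == "episodes" else 10

# Defined once near the top of the function; bind train_future once it exists:
# duration_fn = functools.partial(
#     _duration_fn,
#     unit=config["evaluation_duration_unit"],
#     num_eval_workers=config["evaluation_num_workers"],
#     train_future=train_future,
# )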
rllib/agents/trainer.py
Outdated
else:
    # Count by episodes. -> Run n more
    # (n=num eval workers).
elif unit == "episodes":
    return self.config["evaluation_num_workers"]
Actually, I don't quite get the logic here anymore. duration_fn gets called over and over to see whether the evaluation should stop, right? Why do we always return self.config["evaluation_num_workers"] here, which is a positive number?
As written, eval only stops as soon as remaining_duration > 0 and train_future is done?
Correct, it gets called over and over to determine whether there is some eval duration left (> 0) to run.
For evaluation_duration_unit=episodes (and auto duration!):
- Run n more episodes: Each worker will, on a call to its sample(), run exactly one episode (as this is configured via the worker's batch_mode=complete_episodes and rollout_fragment_length=1 settings).
- Then check again whether training is done and we can stop.
For evaluation_duration_unit=timesteps (and auto duration!):
- Run [num-workers * fragment-len * num_envs_per_worker] more timesteps: note that this is the minimum that one sample() call on all workers will do.
In the non-auto case:
The result of the duration_fn is used to determine how many more sample() calls (on which eval workers) we need to make. This is important to reach the exact desired number of timesteps/episodes with n workers.
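Summarizing that auto-mode logic as a sketch (a hypothetical standalone function; the real duration_fn is a closure inside the trainer and may differ):

def auto_duration_fn(num_units_done, *, unit, num_workers, num_envs_per_worker,
                     rollout_fragment_length, train_future):
    """Return how many more units (timesteps or episodes) eval should run."""
    # num_units_done is only needed in the non-auto case (exact budgets).
    if train_future.done():
        # Training finished: stop evaluating and return results right away.
        return 0
    if unit == "episodes":
        # One more episode per eval worker, then check again.
        return num_workers
    # Timesteps: the minimum that one sample() round across all workers does.
    return num_workers * rollout_fragment_length * num_envs_per_worker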
round_ = 0
while True:
    episodes_left_to_do = episodes_left_fn(num_episodes_done)
    if episodes_left_to_do <= 0:
    units_left_to_do = duration_fn(num_units_done)
duration_fn takes a parameter remaining_duration, but we pass num_units_done here.
I feel like one of them is wrong ...
Great catch! Just affected the naming of the function arg. Fixed! :)
Hey @gjoliver, I addressed all your questions and concerns. Could you take another look?
…uple_evaluation_and_training
Otherwise, this looks cool. Thanks.
There are test failures, though.
…uple_evaluation_and_training
Great feature, @sven1977! Should this be expected in the Ray 1.9.2 release?
1.10.0 should come out soon :)
This PR introduces 2 new evaluation control config keys:
- evaluation_duration_unit: "episodes" or "timesteps": Evaluation can now also be configured by timesteps (no more hard requirement of counting in episodes).
- evaluation_duration: Replaces evaluation_num_episodes (which is now soft-deprecated). Either an int indicating the timesteps/episodes to run in each evaluation run, OR "auto" for automatically running evaluation for as long as the parallel training loop runs. "auto" is only supported if evaluation_parallel_to_training=True. Note that using evaluation_duration_unit=timesteps with evaluation_duration=auto results in more accurate control over the evaluation duration by RLlib (as episodes can unexpectedly take a long time, which would make the train loop have to wait for the evaluation loop to finish).

Why are these changes needed?
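A minimal usage sketch of the two new keys together with the existing parallel-eval flag (illustrative values; Ray ~1.x `ray.rllib.agents` API assumed):

import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

trainer = PPOTrainer(
    env="CartPole-v0",
    config={
        "evaluation_interval": 1,
        "evaluation_num_workers": 2,
        # New: count the evaluation duration in timesteps instead of episodes.
        "evaluation_duration_unit": "timesteps",
        # New: "auto" = evaluate for as long as the parallel train step runs
        # (only allowed together with evaluation_parallel_to_training=True).
        "evaluation_duration": "auto",
        "evaluation_parallel_to_training": True,
    },
)

results = trainer.train()
print(results.get("evaluation", {}).get("episode_reward_mean"))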
Related issue number
Checks
scripts/format.sh to lint the changes in this PR.