[RLlib] Allow for evaluation to run by timesteps (alternative to episodes) and add auto-setting to make sure train doesn't ever have to wait for eval (e.g. long episodes) to finish. #20757

Merged

Conversation

@sven1977 (Contributor) commented Nov 29, 2021

This PR introduces two new evaluation control config keys:

  • evaluation_duration_unit ("episodes" or "timesteps"): Evaluation can now also be measured in timesteps; counting in episodes is no longer a hard requirement.
  • evaluation_duration: Replaces evaluation_num_episodes (now soft-deprecated). Either an int giving the number of timesteps or episodes to run per evaluation step, or "auto" to run evaluation for exactly as long as the parallel training step takes. "auto" is only supported with evaluation_parallel_to_training=True. Combining evaluation_duration_unit=timesteps with evaluation_duration=auto gives RLlib the most accurate control over evaluation duration, since episodes can run unexpectedly long and would otherwise force the training loop to wait for evaluation to finish.
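
For illustration, a minimal config sketch using the new keys (a hedged example; the environment, algorithm, and worker counts are arbitrary choices, not defaults):

from ray.rllib.agents.ppo import PPOTrainer  # any algorithm works; PPO is just an example

config = {
    "env": "CartPole-v0",
    # Evaluate every training iteration ...
    "evaluation_interval": 1,
    # ... in parallel to the training step (required for "auto" below), ...
    "evaluation_parallel_to_training": True,
    # ... counting evaluation progress in timesteps ...
    "evaluation_duration_unit": "timesteps",
    # ... and running evaluation for as long as the parallel train step takes.
    "evaluation_duration": "auto",
    "evaluation_num_workers": 2,
}

trainer = PPOTrainer(config=config)
results = trainer.train()  # evaluation metrics show up under results["evaluation"]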

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@gjoliver (Member) left a comment

a few questions. thanks.

However, you can switch off any exploration behavior for the evaluation workers
via:
of environment configurations). You can activate evaluating policies during training by setting
the ``evaluation_interval`` to an int value (> 0) indicating every how many training calls
@gjoliver (Member):

minor. does training call basically mean iterations here? just say 'indicating the number of iterations before an "evaluation step" is run'?

@sven1977 (Contributor, Author):

+1
clarified

For ``evaluation_interval=1``, the sequence is: ``train, eval, train, eval, ...``.
Before each evaluation step, weights from the main model are synchronized to all evaluation workers.
However, it is possible to run evaluation parallel to training via the ``evaluation_parallel_to_training=True``
config flag. In this case, both steps (train and eval) are run at the same time via threading.
@gjoliver (Member):

I forgot, these eval workers can be remote as well right?

@sven1977 (Contributor, Author):

Yes, eval workers are a completely separate WorkerSet (separate from the "normal" WorkerSet used to collect training data).
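
(As an illustration, a hedged sketch of how the two worker sets are exposed on a Trainer; attribute and method names are taken from RLlib's Trainer/WorkerSet API of that era and may differ between versions:)

# Both are ray.rllib.evaluation.worker_set.WorkerSet instances.
training_workers = trainer.workers            # used to collect training samples
eval_workers = trainer.evaluation_workers     # only built if evaluation is configured

# Each WorkerSet holds one local worker plus any number of remote Ray-actor workers,
# which is why evaluation rollouts can run remotely and in parallel to training.
num_remote_eval_workers = len(eval_workers.remote_workers())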

# Set `batch_mode=truncate_episodes` and set
# `rollout_fragment_length` such that desired steps are divided
# equally amongst workers or - in auto duration mode - set it
# to a reasonable small number (10).
@gjoliver (Member):

actually any chance you can expand this comment a little bit, so we understand why a reasonable small number works for auto mode ... thanks :)

@sven1977 (Contributor, Author):

done
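
(For reference, a hedged sketch of the fragment-length logic the expanded comment describes; the function name and exact rounding are illustrative, not the PR's code:)

import math

def eval_rollout_fragment_length(evaluation_duration, num_eval_workers):
    # "auto" duration: keep fragments small (e.g. 10 timesteps), so the eval loop
    # can check frequently whether the parallel train step has finished and stop
    # right after it does.
    if evaluation_duration == "auto":
        return 10
    # Fixed duration: split the requested timesteps evenly across the eval workers,
    # so one round of sample() calls covers the whole duration.
    return int(math.ceil(evaluation_duration / num_eval_workers))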

@@ -914,19 +952,32 @@ def step_attempt(self) -> ResultDict:
with concurrent.futures.ThreadPoolExecutor() as executor:
train_future = executor.submit(
@gjoliver (Member):

I notice this is still not fully parallelized, like our eval() call will overlap with at most 1 train_exec_impl(), right?
I imagine the logic may be cleaner if we always have train_exec_impl() run in the main thread, and simply throw self.evaluate() into a ThreadPoolExecutor() if we want evaluation_parallel_to_training.

Something like:

while True:
    evaluate_this_iter = ...
    if evaluate_this_iter:
        if evaluation_parallel_to_training:
            self.thread_executor.submit(self.evaluate)
        else:
            self.evaluate()
    self.train_exec_impl()

Something like this. Just an idea.
This is definitely out of scope for this PR, but would love to see what you think or what I missed.

@sven1977 (Contributor, Author):

I think this is how it used to work, but the problem is that the eval step needs to check whether the train step (which is running in a thread) is done, not the other way around. So the logic is:

  • start train step in a thread
  • do eval (10 timesteps as per rollout_frag_len setting)
  • check if train is done
    -- no? -> do another 10 steps
    -- yes? -> return eval results immediately, such that the next call to train doesn't have to be blocked by a still running eval step

Note that we probably should not use the phrase "complete decoupling" as this is misleading: We still need some form of coupling, since we have to sync weights from the local worker to all evaluation workers before each train+eval step. It's merely the duration of the eval step that we more closely align with the train duration in this PR, such that it feels like neither step has to wait for the other anymore.

I'll add some example settings to the docs as well and change the title of this PR to clarify this. I hope this explains the logic.
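
(A hedged sketch of that control flow; trainer.train_step() and trainer.sample_eval_chunk() are illustrative names, not the actual step_attempt() internals:)

import concurrent.futures

def parallel_train_and_eval(trainer):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # 1) Kick off the training step in a background thread.
        train_future = executor.submit(trainer.train_step)

        eval_results = []
        while True:
            # 2) Run a small evaluation chunk (e.g. 10 timesteps per eval worker,
            #    as per the rollout_fragment_length setting).
            eval_results.append(trainer.sample_eval_chunk())
            # 3) After each chunk, check whether training has finished ...
            if train_future.done():
                # ... and if so, stop immediately so the next train call
                # is never blocked by a still-running evaluation.
                break

        return train_future.result(), eval_results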

@sven1977 (Contributor, Author):

done

@gjoliver (Member):

again totally out of scope for this PR, just discussion at this point. I am curious:

the assumption here seems to be that the eval step is much faster than the train step. so we let eval workers check whether train is done. kind of feel like that's not necessarily true actually. also why do another 10 steps ... won't that make different eval steps run for different numbers of steps, and introduce variance to the eval result?

I think evaluation_parallel_to_training=False is a simple case: eval simply blocks train, nothing controversial there.
but when evaluation_parallel_to_training=True, we gotta make sure eval runs in a separate thread so the main training thread never gets blocked?
something like:

while True:
    train()

    if eval_is_currently_running:
        # do nothing.
        continue
    else:
        if is_there_new_eval_result:
            # report freshly available eval results.
            eval_result = ...
        else:
            eval_result = None

        if should_we_kick_off_a_new_eval:
            sync_weights
            kick_off_a_new_eval_in_thread_pool.

again, not nitpicking, just saying: it feels like train in the main thread and eval on the side is an easier mental model for users.

It's also more likely to be CORRECT... 😏


# Run at least one `evaluate()`, even if the training
# is very fast.
def duration_fn(remaining_duration):
@gjoliver (Member):

should we define this duration_fn outside of all these if clauses, higher up in this function, so it doesn't have to live in such a nested place?
you can partial bind the train_future instance for example.

else:
# Count by episodes. -> Run n more
# (n=num eval workers).
elif unit == "episodes":
return self.config["evaluation_num_workers"]
@gjoliver (Member):

actually I don't quite get the logic here anymore. duration_fn gets called over and over to see whether the evaluation should stop, right? why do we always return self.config["evaluation_num_workers"] here, which is a positive number?
as written, eval only stops as soon as remaining_duration > 0 and train_future is done?

@sven1977 (Contributor, Author):

Correct, it gets called over and over to determine whether there is some eval duration left (>0) to run.

For evaluation_duration_unit=episodes (and auto duration!):

  • Run n more episodes: Each worker will, on a call to its sample(), run exactly one episode (as configured via the worker's batch_mode=complete_episodes and rollout_fragment_length=1 settings).
  • Then check again whether training is done and we can stop.

For evaluation_duration_unit=timesteps (and auto duration!):

  • Run [num-workers * fragment-len * num_envs_per_worker] more timesteps: note that this is the minimum that one sample() on all workers will do.

In the non-auto case:
The result of the duration_fn is used to determine how many more sample() calls (and on which eval workers) we need to make. This is important for reaching the exact desired number of timesteps/episodes with n workers.
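
(A hedged sketch of a duration_fn for the "auto" case described above; illustrative only, not the exact code from this PR:)

def make_auto_duration_fn(unit, num_eval_workers, eval_config, train_future):
    """Builds a duration_fn that keeps evaluation going until training finishes."""

    def auto_duration_fn(num_units_done):
        # `num_units_done` is ignored in auto mode; it only matters for the
        # fixed-duration case.
        # Training step is done -> request 0 more units, which stops evaluation.
        if train_future.done():
            return 0
        # Otherwise request one more "round" of sampling:
        if unit == "episodes":
            # One episode per eval worker (batch_mode=complete_episodes).
            return num_eval_workers
        # Timesteps: the minimum one sample() round across all workers produces.
        return (
            num_eval_workers
            * eval_config["rollout_fragment_length"]
            * eval_config["num_envs_per_worker"]
        )

    return auto_duration_fn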

round_ = 0
while True:
episodes_left_to_do = episodes_left_fn(num_episodes_done)
if episodes_left_to_do <= 0:
units_left_to_do = duration_fn(num_units_done)
@gjoliver (Member):

duration_fn takes a parameter remaining_duration, but we pass num_units_done here.
I feel like one of them is wrong ...

@sven1977 (Contributor, Author):

Great catch! Just affected the naming of the function arg. Fixed! :)

sven1977 changed the title [RLlib] Allow for complete decoupling of evaluation and training via more finegrained eval config options. [RLlib] Allow for evaluation to run by timesteps (alternative to episodes) and add auto-setting to make sure train doesn't ever have to wait for eval (e.g. long episodes) to finish. Dec 3, 2021
@sven1977 (Contributor, Author) commented Dec 3, 2021

Hey @gjoliver , I addressed all your questions and concerns. Could you take another look?
Thanks! :)

@gjoliver (Member) left a comment

otherwise, this looks cool. thanks.

@gjoliver (Member) commented Dec 3, 2021

there are test failures though.

sven1977 merged commit 60b2219 into ray-project:master Dec 4, 2021
@rfali (Contributor) commented Jan 4, 2022

Great feature @sven1977! Should this be expected in the ray 1.9.2 release?

@gjoliver (Member) commented Jan 4, 2022 via email
