
[Feature][rllib/tune] Deprecate RLLib's rollout/evaluate in favor of tune.run(training=False) #18758

Closed
andras-kth opened this issue Sep 20, 2021 · 19 comments
Labels
enhancement Request for new feature and/or capability rllib RLlib related issues stale The issue is stale. It will be closed within 7 days unless there are further conversation

Comments

@andras-kth

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Currently, there's a major gap, or feature disparity, between running standalone evaluations with RLlib's rollout command vs. doing the same using the various evaluation_* options in tune.run. While the intention of the latter is to allow evaluation during training, with a careful selection of parameters training can be turned off, resulting in a much more complete and flexible approach to evaluation than the somewhat barebones implementation in rollout.py (or, more recently, evaluate.py).

This feature request proposes an explicit boolean flag, ideally named training, to allow disabling training without having to fiddle with other parameters, and to switch RLlib's standalone evaluation over to tune.run.

If implemented, the following hackish way to disable training:

# evaluation ONLY: avoid MultiGPU optimizer, set all relevant sizes to 0
config.update(
    simple_optimizer=True,
    num_workers=0,
    train_batch_size=0,
    rollout_fragment_length=0,
    timesteps_per_iteration=0,
    evaluation_interval=1,
    # evaluation_num_workers=...,
    # evaluation_config=dict(explore=False),
    # evaluation_num_episodes=...,
)

could be replaced by a single parameter

config.update(training=False)

which would, hopefully, be much more robust and would NOT rely on apparently deprecated features (simple_optimizer=True triggers a deprecation warning, but without it the default multi-GPU optimizer breaks on the batch size being 0, which shouldn't matter in this case, since no training is needed).

Finally, having one proper implementation to perform evaluation instead of two would also reduce user confusion about which one should be used when and why (cf. e.g. https://discuss.ray.io/t/recommended-way-to-evaluate-training-results/2502).

Use case

For a fair comparison of different trained policies, evaluating them under identical conditions is a common scenario.

Instantiating agents from checkpoints and running rollout in a loop will result in

WARNING import_thread.py:123 -- The actor 'RolloutWorker' has been exported 100 times. It's possible that this warning is accidental, but this may indicate that the same remote function is being defined repeatedly from within many tasks and exported to all of the workers. This can be a performance issue and can be resolved by defining the remote function on the driver instead. See https://github.com/ray-project/ray/issues/6240 for more discussion.

with the predictable suboptimal performance implications; AND that is only when the implementation happens to work. Some policies/trainers appear to rely on the episodes parameter of their compute_actions method having been properly initialized, which AFAICT doesn't happen when rollout is used, but works fine with tune.run (cf. #13177).

Additionally, spanning parameter ranges can be expressed much more compactly with tune.grid_search than with nested loops, where parallelism is usually lost because the inner-loop rollout "stages" run sequentially; see the sketch below.
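
Roughly, such a Tune-driven evaluation could look like the following sketch (untested; the environment, the env_config key, the checkpoint path, and the grid values are all placeholders, and the config dict stands in for the evaluation-only config.update(...) hack shown above):

from ray import tune

# Sketch only: restore one trained policy and span seeds plus a hypothetical
# env parameter as parallel Tune trials, instead of nesting sequential
# rollout loops.
eval_only_config = {
    "simple_optimizer": True,
    "num_workers": 0,
    "train_batch_size": 0,
    "rollout_fragment_length": 0,
    "timesteps_per_iteration": 0,
    "evaluation_interval": 1,
    "evaluation_num_episodes": 20,
    "evaluation_config": {"explore": False},
}

tune.run(
    "PPO",
    config=dict(
        eval_only_config,
        env="CartPole-v0",
        env_config={"difficulty": tune.grid_search([0, 1, 2])},  # hypothetical env parameter
        seed=tune.grid_search([0, 1, 2, 3]),  # identical seed matrix for every policy compared
    ),
    restore="/path/to/checkpoint/checkpoint-100",  # placeholder checkpoint of one trained policy
    stop={"training_iteration": 1},  # one iteration == one evaluation round
)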

Related issues

As mentioned above:

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@andras-kth andras-kth added the enhancement Request for new feature and/or capability label Sep 20, 2021
@andras-kth andras-kth changed the title [Feature][rllib] Deprecate RLLib's rollout/evaluate in favor of tune.run(training=False) [Feature][rllib/tune] Deprecate RLLib's rollout/evaluate in favor of tune.run(training=False) Sep 20, 2021
@sven1977 sven1977 self-assigned this Sep 23, 2021
@sven1977 sven1977 added the rllib RLlib related issues label Sep 23, 2021
@sven1977
Contributor

Hey @mehes-kth, any reason why this needs to be done in tune? Why shouldn't we just have an RLlib config flag that disables the training loop? Maybe with a check that evaluation, in this case, would have to be properly set up (evaluation_interval != None, etc.).

This is hacky, but a quick workaround for now would be to set in your config:

multiagent:
    policies_to_train: []

# to avoid spending too much time on sampling
rollout_fragment_length: 1
train_batch_size: 1
num_workers: 0

# evaluation setup
evaluation_interval: 1
evaluation_num_episodes: some_value
evaluation_num_workers: some_value
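
In Python config form, the same workaround would look roughly like this (the some_value placeholders are replaced by arbitrary example values):

workaround_config = {
    "multiagent": {"policies_to_train": []},
    # keep the (unused) training-side sampling as cheap as possible
    "rollout_fragment_length": 1,
    "train_batch_size": 1,
    "num_workers": 0,
    # evaluation setup
    "evaluation_interval": 1,
    "evaluation_num_episodes": 10,  # arbitrary example value
    "evaluation_num_workers": 2,    # arbitrary example value
}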

@andras-kth
Author

andras-kth commented Sep 23, 2021

@sven1977

any reason why this needs to be done in tune?

You tell me! 😏

That just seemed more natural, since in RLlib I can already separate agent.train and agent.evaluate.
I'm in no way partial to one or the other. I'll test the multiagent hack and see if that does any better than
my equally (if not more!) hacky solution outlined in the original issue.

BTW, having played around more with rollout and agent.evaluate, it seems to me that

  • most of the checks and alternative rollout implementations in rollout.py/evaluate.py
    appear mostly unnecessary, since agent.evaluate seems to "do the right thing" in all
    cases, whereas the other implementations in rollout do NOT.
  • in the conditional branch of rollout where agent.evaluate is called, the results are not
    actually SAVED. The saver only records rollout stop/start, but no relevant data...
    (AND, eval_result is essentially discarded, except for the episode_reward_mean
    included in the annoying print statements. A sketch of saving it directly follows below.)
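
For example, a minimal sketch (placeholder trainer, config, and checkpoint path; untested) of calling agent.evaluate directly and persisting the result that rollout currently discards:

import json

import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

# Evaluation-only setup; values are arbitrary examples.
trainer = PPOTrainer(
    env="CartPole-v0",
    config={
        "num_workers": 0,
        "evaluation_interval": 1,
        "evaluation_num_episodes": 10,
    },
)
trainer.restore("/path/to/checkpoint/checkpoint-100")  # placeholder checkpoint

# Run one evaluation round and actually save the returned metrics.
eval_result = trainer.evaluate()
with open("eval_result.json", "w") as f:
    json.dump(eval_result, f, default=str)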

@andras-kth
Author

andras-kth commented Sep 23, 2021

BTW, I think I may have confused where the flag was going to be.

If I understand correctly, my suggestion to pass training=False in config is, in fact, an RLlib flag.
Having said that, what I care most about is that tune.run (with all its bells and whistles) takes that flag
into account when generating experiments/trials.

Which brings me to another, probably quite naïve, question regarding these settings in your proposed hack:

# to avoid spending too much time on sampling
rollout_fragment_length: 1
train_batch_size: 1
num_workers: 0

Do rollout_fragment_length and train_batch_size affect evaluation as well as training?
I suppose they must. I just never realized they did.

@ericl @richardliaw
Shouldn't train_batch_size be called sample_batch_size in that case?

Here the inline documentation explicitly talks about learning (should be corrected?):

# combined into a larger batch of `train_batch_size` for learning.

Similarly, at the very definition, training and SGD are mentioned:

ray/rllib/agents/trainer.py

Lines 108 to 111 in 944309c

# Training batch size, if applicable. Should be >= rollout_fragment_length.
# Samples batches will be concatenated together to a batch of this size,
# which is then passed to SGD.
"train_batch_size": 200,

Is this correct? Is @sven1977 wrong? Is train_batch_size not actually used during evaluation?

Or was it just me jumping to the wrong conclusion that training would be completely avoided using @sven1977 's "workaround"? That's apparently not the case...

@gjoliver
Member

If the training=False flag is an RLlib flag, how does Tune take it into consideration?
Large parallelized evaluation sounds like a common and very important use case. Should we create a nice evaluate.py that actually works?
Re-purposing Tune for this feels like a bigger hack, imho.

@andras-kth
Author

andras-kth commented Sep 24, 2021

If the training=False flag is an RLlib flag, how does Tune take it into consideration?

In a similar fashion it appears to respect multiagent.policies_to_train: [], I suppose...

Large parallelized evaluation sounds like a common and very important use case.

Indeed! AND, the current rollout.py/evaluate.py caters mostly/only to one-off evaluations.

Should we create a nice evaluate.py that actually works?

Most definitely, but doing so by re-using what already exists and works in Tune sounds most efficient.

Re-purposing Tune for this feels like a bigger hack, imho.

Given that tune.run already works and can do exactly that, I'm inclined to disagree.

@gjoliver
Member

glad to see we agree on most things :) let me explain my position:

tune is a distributed runner (10% shell) wrapping around a bunch of search algorithms (90% meat).
it's easy to re-purpose the distributed runner. but as soon as you do that, somebody would want reproducible eval, deterministic seed, parameter set input, better analysis support, auto cross-validation, etc.
it's not impossible, but we will have a difficult time justifying them to the Tune team.
as an example, you want a little custom logic to make Tune aware of the fact that there is no training going on :)

To be clear, this is all about re-purposing Tune, I think adding a single flag to disable training altogether sounds like a very reasonable thing.

Should we open a feature request for an actually useful eval.py? I don't think it's hard/slow to do at all. We may just need to copy&paste tune.run code here.

@andras-kth
Author

it's easy to re-purpose the distributed runner. but as soon as you do that, somebody would want reproducible eval, deterministic seed, parameter set input, better analysis support, auto cross-validation, etc.
it's not impossible, but we will have a difficult time justifying them to the Tune team.

I'd certainly want reproducibility and seed control, but AFAIK tune.run handles that already just fine.
And it seems to cater for parameter sets as well; at least tune.grid_search should work in simple settings.

BUT, you certainly have a point: there may, in fact, be a limit to how far this can be stretched.
Or not... I'd be interested to hear what the Tune team thinks; considering that in the general
AI/ML/RL context training and evaluation are both crucially important, they may wish to
take a more balanced approach and support both on equal footing.

as an example, you want a little custom logic to make Tune aware of the fact that there is no training going on :)
To be clear, this is all about re-purposing Tune, I think adding a single flag to disable training altogether sounds like a very reasonable thing.

Right.

Should we open a feature request for an actually useful eval.py? I don't think it's hard/slow to do at all. We may just need to copy&paste tune.run code here.

See... "copy&paste tune.run code" is exactly what I'm afraid of and would vehemently oppose, if asked.
Ray is complex enough already without such duplication, which is virtually guaranteed to diverge in the long run.
One good (well, quite horrible, TBH) example of this is the different levels of Kubernetes support code.
Some parts use the kubernetes module in Python, others need kubectl, and there's quite a bit of overlapping
functionality in different places. A nightmare, in my humble opinion...

@gjoliver
Member

By parameter set input, I meant the case where I have created a large set of starting conditions for my Env, and I want to evaluate multiple policies against this same set of benchmark envs.
Actually this sounds like something that's even more out of Tune's scope. It's not hyper-params. It's eval trial conditioning.

I think we will copy the trial runner to start. The whole point I am trying to make is that over time, these things will diverge enough to make the effort worth it.

@andras-kth
Author

@sven1977

I finally found some time to test your proposed quick workaround.

This is hacky, but a quick workaround for now would be to set in your config:

multiagent:
    policies_to_train: []

# to avoid spending too much time on sampling
rollout_fragment_length: 1
train_batch_size: 1
num_workers: 0

# evaluation setup
evaluation_interval: 1
evaluation_num_episodes: some_value
evaluation_num_workers: some_value

This is not only hacky, but it doesn't actually work, either...

  File ".../lib/python3.8/site-packages/ray/rllib/policy/policy.py", line 426, in learn_on_batch
    grads, grad_info = self.compute_gradients(samples)
  File ".../lib/python3.8/site-packages/ray/rllib/policy/policy.py", line 446, in compute_gradients
    raise NotImplementedError

@sven1977 sven1977 removed their assignment Oct 6, 2021
@krfricke
Contributor

krfricke commented Oct 6, 2021

Ray Tune as a distributed runner does not have a concept of training/evaluation. It is a blackbox optimizer that does not impose any structure on the trainable apart from requiring a point of entry and feedback. This becomes even more apparent when considering function trainables, which only invoke a single function that takes care of running the full training loop, including potential evaluation.

Thus, a training flag in tune.run would only serve a subset of trainables and introduce a meaningless flag for a large portion of workloads.

If this is a common use case for RLLib trainables, it should thus be reflected in an RLLib configuration flag which disables training.

By the way, this would presumably be the way it would be implemented anyway - Trainer.evaluate() is called in the trainer's step() function, so all tune.run() would do is set a config flag to pass to RLLib trainables to disable this.
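
As a rough sketch (not RLlib's actual implementation), such a flag could be prototyped today with a thin Trainer subclass whose step() skips training and only runs evaluation:

from ray import tune
from ray.rllib.agents.ppo import PPOTrainer


class EvalOnlyPPOTrainer(PPOTrainer):
    # Illustration only: a hypothetical `training: False` flag inside RLlib
    # could behave roughly like this override.
    def step(self):
        # Skip the training loop entirely; Tune still drives the iterations
        # and receives the evaluation metrics as the per-iteration result.
        return self.evaluate()


tune.run(
    EvalOnlyPPOTrainer,
    config={
        "env": "CartPole-v0",  # placeholder env
        "evaluation_interval": 1,
        "evaluation_num_episodes": 10,
    },
    restore="/path/to/checkpoint/checkpoint-100",  # placeholder checkpoint
    stop={"training_iteration": 1},
)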

@andras-kth
Author

andras-kth commented Oct 6, 2021

Ray Tune as a distributed runner does not have a concept of training/evaluation. It is a blackbox optimizer

In my mind, the two clauses here are in direct contradiction with each other.

Evaluation does NOT need an optimizer, blackbox or otherwise, but it could make very good use of a distributed runner...
The use case being, as discussed above, "mass" evaluation of different trained policies/algorithms against the same "matrix" of preconditions (environment settings, seeds, etc).

The "hack" outlined in the original issue formulation seems to accomplish this fairly well.

@krfricke
Contributor

krfricke commented Oct 6, 2021

Yes, as I said Ray Tune does not care about the blackbox it is running, as long as it has an entry point and reports some kind of metric (even that could be omitted). If this is an evaluation function, that's fine.

This is however a request specific to certain kinds of trainables - those that do have an evaluation function. We should not generalize from this use case to other blackbox functions.

Introducing a config flag to RLLib seems like a reasonable way to enable your use case without hacks.

@andras-kth
Author

This is however a request specific to certain kinds of trainables - those that do have an evaluation function. We should not generalize from this use case to other blackbox functions.

Right. I'm not even sure I'd call them Trainable:s, more like Experiment:s
or Runnable:s, if we were to generalize to a pure distributed runner,
devoid of any notion of "training".

BUT, I could easily see how that may fall outside the scope of Tune per se.

Introducing a config flag to RLLib seems like a reasonable way to enable your use case without hacks.

I'm fine with that, too. In fact, my proposed code sample does appear to be
an RLlib flag (since it's updating config), albeit my text is a bit confused... 😸

@krfricke
Contributor

krfricke commented Oct 6, 2021

Agreed, the naming is not optimal here - we'll leave it as is for historic reasons for now but might reconsider in the future.

Awesome, an RLLib config parameter it is, then. Would you like to try to contribute this yourself? Otherwise we can put this on our backlog and slot this in sometime (cc @avnishn, maybe something you could look into?)

@andras-kth
Author

andras-kth commented Oct 6, 2021

Would you like to try to contribute this yourself?

I could try, but I'm afraid that I'd end up with something that institutionalizes the "hack" I found,
i.e. setting those parameters when training==True, which is likely far from optimal, but a proper
solution is outside my comfort zone due to my current level of [un]familiarity with Tune's internals.

@rfali
Contributor

rfali commented Jan 4, 2022

@andras-kth thank you for referencing my post on ray discourse (which ironically still has no replies, but a recent rllib feature should resolve most of my questions). I thought to answer (to the best of my knowledge) a few questions you had raised in this issue (since I still consider myself an rllib beginner).
You wrote:

Do rollout_fragment_length and train_batch_size affect evaluation as well as training?
I suppose they must. I just never realized they did.
Shouldn't train_batch_size be called sample_batch_size in that case?

From this issue, it appears that sample_batch_size determines how many steps to sample from the env, and train_batch_size determines how many steps to train on from the replay buffer.

On further digging, I discovered that sample_batch_size is the old name for rollout_fragment_length, as can be seen in this PR.
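
To make the distinction concrete, with arbitrary example values (not taken from any real config in this issue):

# rollout_fragment_length: env steps each worker collects per sample() call.
# train_batch_size: sampled steps concatenated into one batch for SGD, i.e. a
# training-side knob only; evaluation rollouts are instead governed by the
# evaluation_* settings (evaluation_num_episodes, evaluation_config, ...).
config = {
    "num_workers": 2,
    "rollout_fragment_length": 200,  # 200 env steps per worker per fragment
    "train_batch_size": 4000,        # 4000 steps per SGD/learner batch
}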

@andras-kth
Author

thank you for referencing my post on ray discourse (which ironically still has no replies, but a recent rllib feature should resolve most of my questions).

Interesting... That PR is not something I find useful, but I'm glad it helps you. 😺
I'd rather have a proper parallelized standalone implementation of evaluate.
There are many ways to get this wrong, and it seems that the Ray project hasn't exhausted all those options yet.

I thought to answer (to the best of my knowledge) a few questions you had raised in this issue (since I still consider myself an rllib beginner).

Thanks!
So, what's your interpretation of your findings?
In my mind, they imply that train_batch_size should be completely irrelevant for evaluation.
Would you concur?

@stale

stale bot commented May 4, 2022

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label May 4, 2022
@stale

stale bot commented Aug 13, 2022

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

@stale stale bot closed this as completed Aug 13, 2022