[RLlib] Allow for evaluation to run by timesteps (alternative to episodes) and add auto-setting to make sure train doesn't ever have to wait for eval (e.g. long episodes) to finish. #20757

Merged
83 changes: 75 additions & 8 deletions doc/source/rllib-training.rst
@@ -729,19 +729,75 @@ Customized Evaluation During Training

RLlib will report online training rewards, however in some cases you may want to compute
rewards with different settings (e.g., with exploration turned off, or on a specific set
of environment configurations). You can activate evaluation of policies during training (i.e., during
calls to ``Trainer.train()``) by setting ``evaluation_interval`` to an int value (> 0), indicating
after how many ``Trainer.train()`` calls an "evaluation step" should be run:

.. code-block:: python

# Run one evaluation step on every 3rd `Trainer.train()` call.
{
"evaluation_interval": 3,
}


One such evaluation step runs over ``evaluation_duration`` episodes or timesteps, depending
on the ``evaluation_duration_unit`` setting, which can be either "episodes" (default) or "timesteps".


.. code-block:: python

# Each evaluation step runs for exactly 10 episodes.
{
"evaluation_duration": 10,
"evaluation_duration_unit": "episodes",
}
# Each evaluation step runs for roughly 200 timesteps.
{
"evaluation_duration": 200,
"evaluation_duration_unit": "timesteps",
}

Before each evaluation step, weights from the main model are synchronized to all evaluation workers.

Normally, the evaluation step is run right after the respective train step. For example, for
``evaluation_interval=2``, the sequence of steps is: ``train, train, eval, train, train, eval, ...``.
For ``evaluation_interval=1``, the sequence is: ``train, eval, train, eval, ...``.

However, it is possible to run evaluation in parallel to training via the ``evaluation_parallel_to_training=True``
config setting. In this case, both steps (train and eval) are run at the same time via threading.
This can speed up the evaluation process significantly, but leads to a one-iteration delay between
reported training results and evaluation results: the evaluation results lag behind because they were
produced with slightly outdated model weights.
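
For example, the following config (a minimal sketch using only settings described in this section)
runs a 200-timestep evaluation step in parallel to every training step:

.. code-block:: python

    # Run evaluation (for 200 timesteps) in parallel to every `Trainer.train()` call.
    # Note that the reported evaluation results will lag one iteration behind the
    # training results.
    {
        "evaluation_interval": 1,
        "evaluation_parallel_to_training": True,
        "evaluation_duration": 200,
        "evaluation_duration_unit": "timesteps",
    }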

When running with the ``evaluation_parallel_to_training=True`` setting, a special "auto" value
is supported for ``evaluation_duration``. This can be used to make the evaluation step take
roughly as long as the train step:

.. code-block:: python

# Run eval and train at the same time via threading and make sure they take
# roughly the same time, so that the next `Trainer.train()` call can execute
# immediately and does not have to wait for a still ongoing evaluation step
# (e.g. one with a very long episode):
{
"evaluation_interval": 1,
"evaluation_parallel_to_training": True,
"evaluation_duration": "auto", # automatically end evaluation when train step has finished
"evaluation_duration_unit": "timesteps", # <- more fine grained than "episodes"
}


The ``evaluation_config`` key allows you to override any config settings for
the evaluation workers. For example, to switch off exploration in the evaluation steps,
do:

.. code-block:: python

# Switching off exploration behavior for evaluation workers
# (see rllib/agents/trainer.py). Use any keys in this sub-dict that are
# also supported in the main Trainer config.
"evaluation_config": {
"explore": False
}
@@ -752,6 +808,17 @@
policy, even if this is a stochastic one. Setting "explore=False" above
will result in the evaluation workers not using this stochastic policy.

Parallelism for the evaluation step is determined via the ``evaluation_num_workers``
setting. Set this to a larger value if you want the desired evaluation episodes or timesteps to
be collected as much in parallel as possible. For example, with ``evaluation_duration=10``,
``evaluation_duration_unit="episodes"``, and ``evaluation_num_workers=10``, each evaluation worker
only has to run one episode per evaluation step.
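
For example, the setup just described could be configured as follows:

.. code-block:: python

    # Spread the 10 evaluation episodes over 10 evaluation workers, so that
    # each worker only has to run one episode per evaluation step.
    {
        "evaluation_duration": 10,
        "evaluation_duration_unit": "episodes",
        "evaluation_num_workers": 10,
    }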

In case you would like to entirely customize the evaluation step, set ``custom_eval_function`` in your
config to a callable taking the Trainer object and a WorkerSet object (the evaluation WorkerSet)
and returning a metrics dict. See `trainer.py <https://github.com/ray-project/ray/blob/master/rllib/agents/trainer.py>`__
for further documentation.
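
The following is a minimal sketch of such a custom evaluation function (loosely modeled after
``custom_eval.py``; it assumes the ``collect_episodes`` and ``summarize_episodes`` helpers can be
imported from ``ray.rllib.evaluation.metrics``):

.. code-block:: python

    import ray
    from ray.rllib.evaluation.metrics import collect_episodes, summarize_episodes


    def my_custom_eval_function(trainer, eval_workers):
        """Sketch of a custom evaluation function.

        Args:
            trainer: The Trainer object running the evaluation.
            eval_workers: The evaluation WorkerSet to collect samples with.

        Returns:
            A dict of evaluation metrics.
        """
        # Collect a couple of sample batches on each remote evaluation worker.
        for _ in range(2):
            ray.get([w.sample.remote() for w in eval_workers.remote_workers()])
        # Gather the completed episodes from the evaluation workers and summarize
        # them into a metrics dict.
        episodes, _ = collect_episodes(
            remote_workers=eval_workers.remote_workers(), timeout_seconds=600)
        metrics = summarize_episodes(episodes)
        # Add any custom metrics on top of the standard ones.
        metrics["my_custom_metric"] = 1.0
        return metrics

To use it, set ``"custom_eval_function": my_custom_eval_function`` in your Trainer config
(``my_custom_eval_function`` is just a hypothetical name used here).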

There is an end-to-end example of how to set up custom online evaluation in `custom_eval.py <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_eval.py>`__. Note that if you only want to evaluate your policy at the end of training, you can set ``evaluation_interval: N``, where ``N`` is the number of training iterations before stopping.

Below are some examples of how the custom evaluation metrics are reported nested under the ``evaluation`` key of normal training results:
14 changes: 7 additions & 7 deletions rllib/BUILD
@@ -2335,34 +2335,34 @@ py_test(
tags = ["team:ml", "examples", "examples_P"],
size = "medium",
srcs = ["examples/parallel_evaluation_and_training.py"],
args = ["--as-test", "--stop-reward=50.0", "--num-cpus=6", "--evaluation-num-episodes=13"]
args = ["--as-test", "--stop-reward=50.0", "--num-cpus=6", "--evaluation-duration=13"]
)

py_test(
name = "examples/parallel_evaluation_and_training_auto_num_episodes_tf",
name = "examples/parallel_evaluation_and_training_auto_episodes_tf",
main = "examples/parallel_evaluation_and_training.py",
tags = ["team:ml", "examples", "examples_P"],
size = "medium",
srcs = ["examples/parallel_evaluation_and_training.py"],
args = ["--as-test", "--stop-reward=50.0", "--num-cpus=6", "--evaluation-num-episodes=auto"]
args = ["--as-test", "--stop-reward=50.0", "--num-cpus=6", "--evaluation-duration=auto"]
)

py_test(
name = "examples/parallel_evaluation_and_training_11_episodes_tf2",
name = "examples/parallel_evaluation_and_training_211_ts_tf2",
main = "examples/parallel_evaluation_and_training.py",
tags = ["team:ml", "examples", "examples_P"],
size = "medium",
srcs = ["examples/parallel_evaluation_and_training.py"],
args = ["--as-test", "--framework=tf2", "--stop-reward=30.0", "--num-cpus=6", "--evaluation-num-episodes=11"]
args = ["--as-test", "--framework=tf2", "--stop-reward=30.0", "--num-cpus=6", "--evaluation-num-workers=3", "--evaluation-duration=211", "--evaluation-duration-unit=timesteps"]
)

py_test(
name = "examples/parallel_evaluation_and_training_14_episodes_torch",
name = "examples/parallel_evaluation_and_training_auto_ts_torch",
main = "examples/parallel_evaluation_and_training.py",
tags = ["team:ml", "examples", "examples_P"],
size = "medium",
srcs = ["examples/parallel_evaluation_and_training.py"],
args = ["--as-test", "--framework=torch", "--stop-reward=30.0", "--num-cpus=6", "--evaluation-num-episodes=14"]
args = ["--as-test", "--framework=torch", "--stop-reward=30.0", "--num-cpus=6", "--evaluation-num-workers=3", "--evaluation-duration=auto", "--evaluation-duration-unit=timesteps"]
)

py_test(
2 changes: 1 addition & 1 deletion rllib/agents/cql/tests/test_cql.py
@@ -58,7 +58,7 @@ def test_cql_compilation(self):
config["input_evaluation"] = ["is"]

config["evaluation_interval"] = 2
config["evaluation_num_episodes"] = 10
config["evaluation_duration"] = 10
config["evaluation_config"]["input"] = "sampler"
config["evaluation_parallel_to_training"] = False
config["evaluation_num_workers"] = 2
2 changes: 1 addition & 1 deletion rllib/agents/ddpg/ddpg.py
@@ -39,7 +39,7 @@
# metrics are already only reported for the lowest epsilon workers.
"evaluation_interval": None,
# Number of episodes to run per evaluation period.
"evaluation_num_episodes": 10,
"evaluation_duration": 10,

# === Model ===
# Apply a state preprocessor with spec given by the "model" config option
2 changes: 1 addition & 1 deletion rllib/agents/marwil/tests/test_bc.py
@@ -37,7 +37,7 @@ def test_bc_compilation_and_learning_from_offline_file(self):

config["evaluation_interval"] = 3
config["evaluation_num_workers"] = 1
config["evaluation_num_episodes"] = 5
config["evaluation_duration"] = 5
config["evaluation_parallel_to_training"] = True
# Evaluate on actual environment.
config["evaluation_config"] = {"input": "sampler"}
4 changes: 2 additions & 2 deletions rllib/agents/marwil/tests/test_marwil.py
@@ -43,7 +43,7 @@ def test_marwil_compilation_and_learning_from_offline_file(self):
config["num_workers"] = 2
config["evaluation_num_workers"] = 1
config["evaluation_interval"] = 3
config["evaluation_num_episodes"] = 5
config["evaluation_duration"] = 5
config["evaluation_parallel_to_training"] = True
# Evaluate on actual environment.
config["evaluation_config"] = {"input": "sampler"}
@@ -100,7 +100,7 @@ def test_marwil_cont_actions_from_offline_file(self):
config["num_workers"] = 1
config["evaluation_num_workers"] = 1
config["evaluation_interval"] = 3
config["evaluation_num_episodes"] = 5
config["evaluation_duration"] = 5
config["evaluation_parallel_to_training"] = True
# Evaluate on actual environment.
config["evaluation_config"] = {"input": "sampler"}
2 changes: 1 addition & 1 deletion rllib/agents/qmix/qmix.py
@@ -49,7 +49,7 @@
# metrics are already only reported for the lowest epsilon workers.
"evaluation_interval": None,
# Number of episodes to run per evaluation period.
"evaluation_num_episodes": 10,
"evaluation_duration": 10,
# Switch to greedy actions in evaluation workers.
"evaluation_config": {
"explore": False,
8 changes: 4 additions & 4 deletions rllib/agents/tests/test_trainer.py
@@ -12,7 +12,7 @@
from ray.rllib.agents.trainer import COMMON_CONFIG
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole
from ray.rllib.examples.parallel_evaluation_and_training import \
AssertNumEvalEpisodesCallback
AssertEvalCallback
from ray.rllib.utils.metrics.learner_info import LEARNER_INFO
from ray.rllib.utils.test_utils import framework_iterator

@@ -131,13 +131,13 @@ def test_evaluation_option(self):
config.update({
"env": "CartPole-v0",
"evaluation_interval": 2,
"evaluation_num_episodes": 2,
"evaluation_duration": 2,
"evaluation_config": {
"gamma": 0.98,
},
# Use a custom callback that asserts that we are running the
# configured exact number of episodes per evaluation.
"callbacks": AssertNumEvalEpisodesCallback,
"callbacks": AssertEvalCallback,
})

for _ in framework_iterator(config, frameworks=("tf", "torch")):
@@ -169,7 +169,7 @@ def test_evaluation_wo_evaluation_worker_set(self):
"evaluation_interval": None,
# Use a custom callback that asserts that we are running the
# configured exact number of episodes per evaluation.
"callbacks": AssertNumEvalEpisodesCallback,
"callbacks": AssertEvalCallback,
})
for _ in framework_iterator(frameworks=("tf", "torch")):
# Setup trainer w/o evaluation worker set and still call