[RLlib]: Move OPE to evaluation config #25911
Conversation
@@ -841,6 +842,18 @@ def evaluation(
IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal
policy, even if this is a stochastic one. Setting "explore=False" here
will result in the evaluation workers not using this optimal policy!
off_policy_estimation_methods: Specify how to evaluate the current policy,
Could we produce a deprecation error if a user still uses either the old `input_evaluation` OR the new `off_policy_estimation_methods` inside the `offline_data()` method?
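For reference, a minimal sketch of what such a guard could look like, using RLlib's deprecation_warning helper (the method signature and message texts below are illustrative only, not taken from this PR):

from ray.rllib.utils.deprecation import DEPRECATED_VALUE, deprecation_warning


def offline_data(self, *, input_evaluation=DEPRECATED_VALUE, **kwargs):
    # Hypothetical check: hard-error if the old (or misplaced) key is still passed here.
    if input_evaluation != DEPRECATED_VALUE:
        deprecation_warning(
            old="offline_data(input_evaluation=...)",
            new="evaluation(off_policy_estimation_methods=...)",
            error=True,  # raise instead of only logging a warning
        )
    ...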
@@ -45,14 +44,12 @@ def test_cql_compilation(self):
env="Pendulum-v1",
)
.offline_data(
    input_=[data_file],
    input_=data_file,
Just making sure: Is this properly described in the docs?
Which options do the users have here (and did this change recently)?
List of str, str, "datasets", "sampler", etc.?
We currently support str, List of str, Dict of different inputs, dataset, sampler, and a few others. Str and list of str are both backed by JsonReader, so there's no difference for this example, but we should migrate all of these to DatasetReader eventually to avoid the multiple-input-reader bug I mentioned earlier.
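To illustrate the variants mentioned here (the paths, weights, and the `config` builder object below are placeholders, not from this PR):

config.offline_data(input_="/tmp/cartpole-out")  # single dir/path string (JSON files)
config.offline_data(input_=["/tmp/out/a.json", "/tmp/out/b.json"])  # list of str
config.offline_data(input_={"sampler": 0.5, "/tmp/cartpole-out": 0.5})  # dict of inputs with sampling weights
config.offline_data(input_="sampler")  # collect fresh samples from the env instead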
rllib/evaluation/metrics.py
Outdated
@@ -147,8 +147,8 @@ def summarize_episodes(
if new_episodes is None:
    new_episodes = episodes

episodes, estimates = _partition(episodes)
new_episodes, _ = _partition(new_episodes)
episodes, _ = _partition(episodes)
Was this a bug? Can we add a one-line comment on why we are ignoring the estimates on `episodes`, but not on `new_episodes`?
Yes, we should be running OPE on the new episodes from the current training epoch only, not all of the episodes so far. (One of the next PRs removes this entire section of code, though, so I don't think we need to put a comment)
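For context, `_partition()` separates plain rollout metrics from OPE estimates, roughly like this (simplified sketch of the helper in metrics.py, not the exact source; the import path may differ across versions):

from ray.rllib.offline.off_policy_estimator import OffPolicyEstimate


def _partition(episodes):
    # Split a mixed list into rollout metrics and off-policy estimates.
    rollouts, estimates = [], []
    for e in episodes:
        if isinstance(e, OffPolicyEstimate):
            estimates.append(e)
        else:
            rollouts.append(e)
    return rollouts, estimates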
logger.warning(
    "Requested 'simulation' input evaluation method: "
    "will discard all sampler outputs and keep only metrics."
if method_type == "simulation":
Nice!
@@ -761,7 +759,7 @@ def wrap(env):
else:
    raise ValueError(
        f"Unknown off_policy_estimation type: {method_type}! Must be "
        "either `simulation|is|wis|dm|dr` or a sub-class of ray.rllib."
        "either a class path or a sub-class of ray.rllib."
Nice. Better to force explicitness and not allow too many shortcut options for the user.
I still have the class path as a TODO, will fix that after the Trainer PR, but agree
soft_horizon=soft_horizon,
no_done_at_end=no_done_at_end,
observation_fn=observation_fn,
sample_collector_class=policy_config.get("sample_collector"),
render=render,
blackhole_outputs="simulation" in off_policy_estimation_methods,
Add TODO to deprecate this once we completely disallow "simulation".
Already handled in the next PR, since that moves OPE out to the Algorithm from the eval rollout worker.
What is `blackhole_outputs`?
train_test_split_val: float = 0.0,
k: int = 0,
) -> Generator[Tuple[List[SampleBatch]], None, None]:
"""Utility function that returns either a train/test split or
Nit: Our docstrings should always have just a single one-line sentence at the top, then an empty line, then any other information. E.g.:

def xyz(..):
    """Some single sentence description.

    Anything else blabla

    Args:
        ...
    """
`train_test_split_val * n_episodes` episodes and an evaluation batch
with `(1 - train_test_split_val) * n_episodes` episodes. If not
specified, use `k` for k-fold cross validation instead.
k: k-fold cross validation for training model and evaluating OPE.
Returns:
Nit: Empty line before `Returns:`.
return


@DeveloperAPI
@ExperimentalAPI
Actually, sorry, but let's not use ExperimentalAPI anymore. We are being urged by other library teams to move RLlib to the generic Ray API annotations (which don't include `ExperimentalAPI`). Let's leave these as DeveloperAPI for now.
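I.e., the decorator stack on these classes would then be just (sketch):

from ray.rllib.utils.annotations import DeveloperAPI


@DeveloperAPI
class OffPolicyEstimator:
    """Interface for an off policy reward estimator."""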
from ray.rllib.utils.typing import SampleBatchType
from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.utils.numpy import convert_to_numpy
import numpy as np


@DeveloperAPI
@ExperimentalAPI
Same here (see above).
Looks great. Thanks for the cleanup, @Rohan138! Just a few questions and nits to fix.
from ray.rllib.utils.framework import try_import_torch
from ray.rllib.utils.typing import ModelConfigDict, TensorType

torch, nn = try_import_torch()


@DeveloperAPI
@ExperimentalAPI
same
from ray.rllib.utils.typing import SampleBatchType
from typing import List
import numpy as np


@DeveloperAPI
@ExperimentalAPI
same
namedtuple("OffPolicyEstimate", ["estimator_name", "metrics"])
)


@DeveloperAPI
@ExperimentalAPI
same
class OffPolicyEstimator:
    """Interface for an off policy reward estimator."""

@DeveloperAPI
@ExperimentalAPI
same
rllib/tests/test_io.py
Outdated
def test_agent_input_eval_sim(self):
    for fw in framework_iterator():
def test_agent_input_eval_sampler(self):
    for fw in ["torch"]:
Please use
config = ...
for fw in framework_iterator(config, frameworks="torch"):
instead for consistency.
That was supposed to have TF too, fixed now
"off_policy_estimation_methods": { | ||
"simulation": {"type": "simulation"} | ||
}, | ||
"input": self.test_dir + fw, |
Quick question: We still support both a) a dir string (read all JSON files in the dir) and b) a list of filenames, correct?
Yes, we support both.
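I.e., both of the following keep working (illustrative paths):

config = {
    # a) A dir string: every JSON file in the dir is read.
    "input": "/tmp/test-dir/torch",
}
config = {
    # b) An explicit list of filenames.
    "input": ["/tmp/test-dir/torch/output-0.json", "/tmp/test-dir/torch/output-1.json"],
}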
OK, this pretty much looks good to me.
I left quite a few questions; if you could answer those first, then let's move forward.
# Offline RL settings.
input_evaluation = config.get("input_evaluation")
if input_evaluation is not None and input_evaluation is not DEPRECATED_VALUE:
    ope_dict = {str(ope): {"type": ope} for ope in input_evaluation}
I maintain that this is a weird format and we should change it at some point.
Agree, I had to use the same format for the `q_model` as well in the other PR; it's super awkward.
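For reference, the translation the snippet above performs (the method names here are just the usual short-hands from the old API):

# Old, deprecated style: a flat list of estimation methods.
config["input_evaluation"] = ["is", "wis"]

# New style produced by the dict comprehension above: name -> {"type": ...}.
config["off_policy_estimation_methods"] = {
    "is": {"type": "is"},
    "wis": {"type": "wis"},
}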
old='off_policy_estimation_methods={"simulation"}',
new='input="sampler"',
help="The `simulation` estimation method has been deprecated."
"If you want to run online evaluation on your data, use"
If you say simulation or sampler, doesn't the RW just rely on the sampler, which could be backed by a dataset or an env?
`simulation` was backed by AsyncSampler with `blackhole_outputs=True`, but yes. `blackhole_outputs` seems to ignore the samples collected by the RW for everything except computing episode reward metrics in metrics.py.
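In config terms, the substitution the deprecation message asks for looks roughly like this (sketch):

# Old: "simulation" as a pseudo-OPE method on the evaluation workers.
old_eval_config = {
    "off_policy_estimation_methods": {"simulation": {"type": "simulation"}},
}

# New: just run online evaluation through the sampler instead.
new_eval_config = {"input": "sampler"}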
soft_horizon=soft_horizon,
no_done_at_end=no_done_at_end,
observation_fn=observation_fn,
sample_collector_class=policy_config.get("sample_collector"),
render=render,
blackhole_outputs="simulation" in off_policy_estimation_methods,
What is `blackhole_outputs`?
@@ -682,7 +677,7 @@ def valid_module(class_path):
log_level=config["log_level"],
callbacks=config["callbacks"],
input_creator=input_creator,
off_policy_estimation_methods=off_policy_estimation_methods,
off_policy_estimation_methods=config["off_policy_estimation_methods"],
Makes sense to me -- we should be able to use OPE even if the input is a sampler (for debugging purposes, or if working with a separate eval dataset).
Yup, the only thing that probably doesn't work currently is mixed input, e.g. ("sampler" + JSON).
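A sketch of that combination with the config keys shown in the surrounding diffs (the exact nesting may still change in the follow-up Trainer PR):

config = {
    # Evaluate online via the env sampler...
    "input": "sampler",
    # ...while still computing OPE estimates (e.g. importance sampling).
    "off_policy_estimation_methods": {
        "is": {"type": "is"},
    },
}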
@@ -20,23 +20,27 @@
logger = logging.getLogger()


@ExperimentalAPI
@DeveloperAPI
def train_test_split(
So where does this get invoked, now that we will require train and test datasets to be specified ahead of time?
The whole function gets deleted in the next PR; the user should split them ahead of time, or we can make this a utility in Ray Train/Datasets.
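For example, a user could pre-split their offline data with Ray Datasets before training (a sketch assuming JSON offline data and an 80/20 split; not part of this PR):

import ray

# Read the recorded experiences and split them ahead of time.
ds = ray.data.read_json("/tmp/cartpole-out")
n = ds.count()
train_ds, eval_ds = ds.random_shuffle().split_at_indices([int(0.8 * n)])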
Looks like the failing tests are unrelated. This is ok to merge.
This PR moves the OPE methods to the evaluation config (and thus the evaluation workers).
Note that I had to remove the OPE from some of the CQL and MARWIL tests because they use evaluation input = "sampler", so using OPE is redundant. However, user code using `config.input_evaluation` should still work with a deprecation warning.
Previous PR: #25899 (already merged)
Next PR: #26279
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.