
[RLlib]: Move OPE to evaluation config #25911

Merged: 24 commits into ray-project:master on Jul 12, 2022

Conversation

@Rohan138 (Contributor) commented Jun 18, 2022:

This PR moves the OPE methods to the evaluation config (and thus to the evaluation workers).
Note that I had to remove OPE from some of the CQL and MARWIL tests because they use evaluation input="sampler", which makes OPE redundant. However, user code that still sets config.input_evaluation should keep working, with a deprecation warning.

Previous PR: #25899 (already merged)
Next PR: #26279
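
For context, a rough before/after sketch of the change (the estimator classes, import paths, and the MARWIL/CartPole choices are illustrative assumptions and may differ from the final API):

from ray.rllib.algorithms.marwil import MARWILConfig
from ray.rllib.offline.estimators import (
    ImportanceSampling,
    WeightedImportanceSampling,
)

# Old style (still accepted, but now emits a deprecation warning):
# config = {"input": "/path/to/offline_data", "input_evaluation": ["is", "wis"]}

# New style: OPE methods are part of the evaluation config and run on the
# evaluation workers.
config = (
    MARWILConfig()
    .environment(env="CartPole-v1")
    .offline_data(input_="/path/to/offline_data")
    .evaluation(
        evaluation_interval=1,
        off_policy_estimation_methods={
            "is": {"type": ImportanceSampling},
            "wis": {"type": WeightedImportanceSampling},
        },
    )
)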

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -841,6 +842,18 @@ def evaluation(
IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal
policy, even if this is a stochastic one. Setting "explore=False" here
will result in the evaluation workers not using this optimal policy!
off_policy_estimation_methods: Specify how to evaluate the current policy,
Contributor:

Could we produce a deprecation error if a user still uses either the old input_evaluation OR the new off_policy_estimation_methods inside the offline_data() method?
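
A rough sketch of what such a check could look like (illustrative only, not the actual RLlib implementation; the method signature and argument names here are assumptions):

from ray.rllib.utils.deprecation import DEPRECATED_VALUE, deprecation_warning

def offline_data(self, *, input_evaluation=DEPRECATED_VALUE,
                 off_policy_estimation_methods=DEPRECATED_VALUE, **kwargs):
    # Error out if either key is still passed to offline_data(); both now
    # belong in the evaluation config.
    if (input_evaluation != DEPRECATED_VALUE
            or off_policy_estimation_methods != DEPRECATED_VALUE):
        deprecation_warning(
            old="config.offline_data(input_evaluation=..)",
            new="config.evaluation(off_policy_estimation_methods=..)",
            error=True,  # raise instead of only warning
        )
    ...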

@@ -45,14 +44,12 @@ def test_cql_compilation(self):
env="Pendulum-v1",
)
.offline_data(
input_=[data_file],
input_=data_file,
Contributor:

Just making sure: Is this properly described in the docs?
Which options do users have here (and did this change recently)?
List of str, str, "datasets", "sampler", etc.

Contributor Author:

We currently support str, List of str, Dict of different inputs, dataset, sampler, and a few others. Str and List of str are both backed by JsonReader, so there's no difference for this example, but we should migrate all of these to DatasetReader eventually to avoid the multiple-input-reader bug I mentioned earlier.
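
For illustration, the main variants look roughly like this (paths are placeholders, each call simply overwrites the previous input setting, and the input_config keys for the "dataset" case are an assumption):

from ray.rllib.algorithms.marwil import MARWILConfig  # algo choice is arbitrary here

config = MARWILConfig()
config.offline_data(input_="/tmp/offline/")                     # directory of JSON files
config.offline_data(input_=["/tmp/ep1.json", "/tmp/ep2.json"])  # explicit list of files
config.offline_data(input_="sampler")                           # sample live from the env
config.offline_data(
    input_="dataset",                                           # Ray Dataset-backed reader
    input_config={"format": "json", "paths": ["/tmp/offline/"]},
)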

@@ -147,8 +147,8 @@ def summarize_episodes(
if new_episodes is None:
new_episodes = episodes

episodes, estimates = _partition(episodes)
new_episodes, _ = _partition(new_episodes)
episodes, _ = _partition(episodes)
Contributor:

Was this a bug? Can we add a one-line comment on why we are ignoring the estimates on episodes, but not on new_episodes?

Contributor Author:

Yes, we should be running OPE only on the new episodes from the current training epoch, not on all of the episodes so far. (One of the next PRs removes this entire section of code, though, so I don't think we need to add a comment.)

logger.warning(
"Requested 'simulation' input evaluation method: "
"will discard all sampler outputs and keep only metrics."
if method_type == "simulation":
Contributor:

Nice!

@@ -761,7 +759,7 @@ def wrap(env):
else:
raise ValueError(
f"Unknown off_policy_estimation type: {method_type}! Must be "
"either `simulation|is|wis|dm|dr` or a sub-class of ray.rllib."
"either a class path or a sub-class of ray.rllib."
Contributor:

Nice. Better to force explicitness and not allow too many shortcut options for the user.

Contributor Author:

I still have the class-path option as a TODO; I'll fix that after the Trainer PR, but agreed.

soft_horizon=soft_horizon,
no_done_at_end=no_done_at_end,
observation_fn=observation_fn,
sample_collector_class=policy_config.get("sample_collector"),
render=render,
blackhole_outputs="simulation" in off_policy_estimation_methods,
Contributor:

Add TODO to deprecate this once we completely disallow "simulation".

Contributor Author:

Already handled in the next PR, since that moves OPE from the eval rollout worker out to the Algorithm.

Member:

What is blackhole_outputs?

train_test_split_val: float = 0.0,
k: int = 0,
) -> Generator[Tuple[List[SampleBatch]], None, None]:
"""Utility function that returns either a train/test split or
Contributor:

Nit: Our docstrings should always just have a single sentence one-line at the top, then an empty line, then any other information. E.g.:

def xyz(..):
    """Some single sentence description.

    Anything else blabla

    Args:
        ...
    """

`train_test_split_val * n_episodes` episodes and an evaluation batch
with `(1 - train_test_split_val) * n_episodes` episodes. If not
specified, use `k` for k-fold cross validation instead.
k: k-fold cross validation for training model and evaluating OPE.
Returns:
Contributor:

Nit: Empty line before Returns:.

return


@DeveloperAPI
@ExperimentalAPI
sven1977 (Contributor) commented Jun 21, 2022:

Actually, sorry, but let's not use ExperimentalAPI anymore. We are being urged by other library teams to move RLlib to the generic Ray API annotations (which don't include ExperimentalAPI). Let's leave this as DeveloperAPI for now.

from ray.rllib.utils.typing import SampleBatchType
from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.utils.numpy import convert_to_numpy
import numpy as np


@DeveloperAPI
@ExperimentalAPI
Contributor:

Same here (see above).

sven1977 (Contributor) left a comment:

Looks great. Thanks for the cleanup @Rohan138 ! Just a few questions and nits to fix.

from ray.rllib.utils.framework import try_import_torch
from ray.rllib.utils.typing import ModelConfigDict, TensorType

torch, nn = try_import_torch()


@DeveloperAPI
@ExperimentalAPI
Contributor:

same

from ray.rllib.utils.typing import SampleBatchType
from typing import List
import numpy as np


@DeveloperAPI
@ExperimentalAPI
Contributor:

same

namedtuple("OffPolicyEstimate", ["estimator_name", "metrics"])
)


@DeveloperAPI
@ExperimentalAPI
Contributor:

same

class OffPolicyEstimator:
"""Interface for an off policy reward estimator."""

@DeveloperAPI
@ExperimentalAPI
Contributor:

same

def test_agent_input_eval_sim(self):
for fw in framework_iterator():
def test_agent_input_eval_sampler(self):
for fw in ["torch"]:
Contributor:

Please use

config = ...

for fw in framework_iterator(config, frameworks="torch"):

instead for consistency.

Contributor Author:

That was supposed to include TF too; fixed now.

"off_policy_estimation_methods": {
"simulation": {"type": "simulation"}
},
"input": self.test_dir + fw,
Contributor:

Quick question: We still support both a) a dir string (read all JSON files in the dir) and b) a list of filenames, correct?

Contributor Author:

Yes, we support both.

avnishn (Member) left a comment:

OK, this pretty much looks good to me.

I left quite a few questions; if you could answer those first, then let's move forward.

# Offline RL settings.
input_evaluation = config.get("input_evaluation")
if input_evaluation is not None and input_evaluation is not DEPRECATED_VALUE:
ope_dict = {str(ope): {"type": ope} for ope in input_evaluation}
Member:

I maintain that this is a weird format and we should change it at some point.

Contributor Author:

Agreed. I had to use the same format for the q_model in the other PR as well; it's super awkward.
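
For reference, this is the backward-compatibility conversion being discussed (taken from the diff above):

# Old list-style setting ...
input_evaluation = ["is", "wis"]

# ... gets mapped to the new dict format, keyed by an (arbitrary) estimator name:
ope_dict = {str(ope): {"type": ope} for ope in input_evaluation}
assert ope_dict == {"is": {"type": "is"}, "wis": {"type": "wis"}}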

old='off_policy_estimation_methods={"simulation"}',
new='input="sampler"',
help="The `simulation` estimation method has been deprecated."
"If you want to run online evaluation on your data, use"
Member:

If you say simulation or sampler, doesn't the RolloutWorker just rely on the sampler, which could be backed by a dataset or an env?

Rohan138 (Contributor Author) commented Jul 5, 2022:

`simulation` was backed by AsyncSampler with blackhole_outputs=True, but yes. blackhole_outputs seems to discard the samples collected by the RolloutWorker for everything except computing episode reward metrics in metrics.py.
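
A small sketch of the replacement the deprecation message above points to (config keys are assumptions based on the surrounding discussion; the CQL/Pendulum choice is just illustrative):

from ray.rllib.algorithms.cql import CQLConfig

config = (
    CQLConfig()
    .environment(env="Pendulum-v1")
    .offline_data(input_="/tmp/offline/")  # train from offline data ...
    .evaluation(
        evaluation_interval=1,
        evaluation_num_workers=1,
        # ... but evaluate online: the evaluation workers sample from the real
        # env instead of relying on the old "simulation" OPE method.
        evaluation_config={"input": "sampler"},
    )
)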


@@ -682,7 +677,7 @@ def valid_module(class_path):
log_level=config["log_level"],
callbacks=config["callbacks"],
input_creator=input_creator,
off_policy_estimation_methods=off_policy_estimation_methods,
off_policy_estimation_methods=config["off_policy_estimation_methods"],
Member:

Makes sense to me -- we should be able to use OPE even if the input is a sampler (for debugging purposes, or if working with a separate eval dataset).

Contributor Author:

Yup; the only thing that probably doesn't work currently is mixed input, e.g. "sampler" + JSON.

@@ -20,23 +20,27 @@
logger = logging.getLogger()


@ExperimentalAPI
@DeveloperAPI
def train_test_split(
Member:

so where does this get invoked, now that we will require train and test datasets to be specified ahead of time?

Rohan138 (Contributor Author) commented Jul 5, 2022:

The whole function gets deleted in the next PR; the user should split them ahead of time, or we can make this a utility in Ray Train/Datasets.
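
As a rough illustration of splitting ahead of time (a generic sketch, not RLlib code; `episodes` is just a placeholder list of SampleBatches):

import numpy as np

def split_episodes(episodes, train_frac=0.8, seed=0):
    """Shuffle a list of episodes and split it into train/test partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(episodes))
    cut = int(train_frac * len(episodes))
    train = [episodes[i] for i in idx[:cut]]
    test = [episodes[i] for i in idx[cut:]]
    return train, test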

avnishn (Member) commented Jul 8, 2022:

Looks like the failing tests are unrelated. This is ok to merge

@richardliaw merged commit 09ce471 into ray-project:master on Jul 12, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Labels: rllib (RLlib related issues), rllib-offline-rl (Offline RL problems)