Support for RE3 exploration algorithm #19551
Conversation
rllib/agents/trainer.py
Outdated
)
from ray.rllib.policy.sample_batch import SampleBatch

embeds_dim = self.config["exploration_config"].get(
why don't we define all these configurations and the UpdateCallback with the RandomEncoder in that re3.py file under rllib/utils/exploration?
I don't think we need to configure the callback here automatically?
all we need is an example script showing how to construct a trainer with RE3 exploration using the exploration type field and a RE3UpdateCallback callback?
Yes, that would be cleaner. I will make the changes.
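For illustration, a minimal sketch of what the config-driven setup discussed here might look like; the key names (embeds_dim, beta, k_nn, sub_exploration) are assumptions based on this thread, not the final API:

```python
# Hypothetical config sketch: select RE3 purely via the exploration type field.
# All parameter names below are assumptions taken from this discussion.
config = {
    "exploration_config": {
        "type": "RE3",      # class registered under rllib/utils/exploration
        "embeds_dim": 128,  # output size of the random encoder
        "beta": 0.2,        # weight of the intrinsic (state-entropy) reward
        "k_nn": 50,         # k used for the k-nearest-neighbor entropy estimate
        "sub_exploration": {
            "type": "StochasticSampling",  # exploration applied on top of RE3
        },
    },
}
```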
rllib/policy/policy.py
Outdated
@@ -698,6 +698,7 @@ def _get_default_view_requirements(self):
SampleBatch.UNROLL_ID: ViewRequirement(),
SampleBatch.AGENT_INDEX: ViewRequirement(),
"t": ViewRequirement(),
SampleBatch.OBS_EMBEDS: ViewRequirement(),
move above "t" maybe.
updated
@@ -40,6 +40,7 @@ class SampleBatch(dict):
DONES = "dones"
INFOS = "infos"
SEQ_LENS = "seq_lens"
OBS_EMBEDS = "obs_embeds"
please comment that this field is only computed and used when RE3 exploration strategy is enabled.
updated
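Something along these lines, for example (the wording of the comment is only a suggestion):

```python
# Only computed and used when the RE3 exploration strategy is enabled.
OBS_EMBEDS = "obs_embeds"
```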
rllib/agents/trainer.py
Outdated
def on_train_result(self, *, trainer, result: dict,
                    **kwargs) -> None:
    # Keep track of the training iteration for beta decay.
    UpdateCallbacks._step = result["training_iteration"]
I think you can just use policy.global_timestep and there is no need to track this yourself?
another way is to add a MixIn for this beta schedule, much like LearningRateSchedule and EntropyCoeffSchedule. you can then just use the auto-decayed beta value here.
Here, we want to decay based on training_iteration. It seems policy.global_timestep depends on the number of batches processed and is not equal to training_iteration (just looking at the source, not 100% sure).
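For clarity, the pattern under discussion (a class attribute updated from on_train_result, so beta decays with the training iteration rather than with policy.global_timestep) looks roughly like this; a sketch based on the snippet above, not the final code:

```python
from ray.rllib.agents.callbacks import DefaultCallbacks


class IterationTrackingCallbacks(DefaultCallbacks):
    """Sketch: remember the current training iteration for beta decay."""

    # Class attribute, so every instance in this process reads the same value.
    _step = 0

    def on_train_result(self, *, trainer, result: dict, **kwargs) -> None:
        # result["training_iteration"] counts Trainer.train() calls, whereas
        # policy.global_timestep counts sampled environment timesteps/batches.
        IterationTrackingCallbacks._step = result["training_iteration"]
```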
rllib/agents/trainer.py
Outdated
    **kwargs,
):
    states_entropy = compute_states_entropy(
        train_batch[SampleBatch.OBS_EMBEDS], embeds_dim, k_nn)
this feels a bit weird.
I guess to calculate the knn distance, you want to use a batch that's randomly sampled from the ReplayBuffer? just so you can measure how different current sample batch is from all the things you have been seeing (RB)?
as written, this knn thing is calculated from a single batch. so naturally, you are gonna get a lot of similar samples from consecutive steps?
Yes, KNN is computed for a single batch, and we are assuming that RLlib has sampled that batch randomly from the replay buffer.
KNN distance and the randomness are limited to a single batch only, not the full replay buffer.
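To make the per-batch nature concrete, here is a simplified stand-in for compute_states_entropy (not the PR's exact implementation): the k-NN distance is measured only among the embeddings of the current train batch.

```python
import numpy as np


def compute_states_entropy_sketch(obs_embeds, embeds_dim, k_nn):
    """k-NN state-entropy estimate within a single batch of embeddings."""
    embeds = np.asarray(obs_embeds, dtype=np.float32).reshape(-1, embeds_dim)
    # Pairwise Euclidean distances inside the batch only (no replay buffer).
    dists = np.linalg.norm(embeds[:, None, :] - embeds[None, :, :], axis=-1)
    # Column 0 of the sorted distances is the point itself (distance 0), so
    # index k_nn gives the distance to the k-th nearest neighbor.
    knn_dists = np.sort(dists, axis=-1)[:, k_nn]
    # RE3-style intrinsic signal: log(1 + distance to the k-th neighbor).
    return np.log(knn_dists + 1.0)


# Example: 64 embeddings of dimension 128, k = 50 (requires k_nn < batch size).
entropy = compute_states_entropy_sketch(
    np.random.randn(64, 128), embeds_dim=128, k_nn=50)
```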
@gjoliver Could you take another look and indicate if anything else needs to be changed? Thanks!
sorry about the delay, I will take another detailed look.
this looks very exciting. thanks for your awesome work.
I have gone through the PR carefully, and had a bunch of minor / detailed comments. nothing significant actually, except maybe for a TODO for us :)
The biggest blocker right now is to rebase to master, and make sure all the tests pass. there are too many failed CI tests right now.
This PR doesn't introduce any changes to the existing codebase, so I don't expect anything tricky. Probably just need to retry by doing:
fetch upstream in your repo
git checkout master; git pull --rebase
git checkout re3; git rebase master
then git push to re-trigger the tests.
Also, please make sure you run ci/travis/format.sh so the Lint test will pass too.
Thanks again. Let me know if you need any help with the tests actually.
Happy to help on those logistic things.
rllib/examples/re3_exploration.py
Outdated
ray.init()

config = sac.DEFAULT_CONFIG.copy()
beta_schedule = "linear_decay"
this variable doesn't seem to be used.
rllib/examples/re3_exploration.py
Outdated
# Patch user given callbacks with RE3 callbacks for using RE3 exploration
# strategy
class RE3Callbacks(RE3UpdateCallbacks, config["callbacks"]):
sac.DEFAULT_CONFIG doesn't specify callbacks?
it seems like you can simply do:
config["callbacks"] = partial(RE3Callbacks, embeds_dim=128,
beta_schedule="linear_decay", k_nn=50)
here.
This was used for demonstration purposes, in case a user wants to know how it works when a callbacks class is already provided in the config.
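Both variants could be illustrated side by side in the example; a sketch, assuming RE3UpdateCallbacks is importable from where this PR defines it and that it accepts the keyword arguments mentioned in the review comment above:

```python
from functools import partial

from ray.rllib.agents import sac
# RE3UpdateCallbacks is assumed to live in agents/callbacks.py, as in this PR.
from ray.rllib.agents.callbacks import DefaultCallbacks, RE3UpdateCallbacks

config = sac.DEFAULT_CONFIG.copy()

# Variant 1: no user callbacks, so bind the RE3 parameters directly
# (the simpler form suggested in the review comment).
config["callbacks"] = partial(
    RE3UpdateCallbacks, embeds_dim=128, beta_schedule="linear_decay", k_nn=50)


# Variant 2: the user already has their own callbacks class, so mix RE3 in
# front of it so both sets of hooks run (the "patching" shown in the example).
class MyCallbacks(DefaultCallbacks):
    pass


class RE3Callbacks(RE3UpdateCallbacks, MyCallbacks):
    pass


config["callbacks"] = partial(
    RE3Callbacks, embeds_dim=128, beta_schedule="linear_decay", k_nn=50)
```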
rllib/agents/trainer.py
Outdated
@@ -764,6 +764,7 @@ def env_creator_from_classpath(env_context):
"`callbacks` must be a callable method that "
"returns a subclass of DefaultCallbacks, got {}".format(
self.config["callbacks"]))

can you undo this empty line diff too? thanks a lot.
delta = batch_mean - self.mean
tot_count = self.count + batch_count

self.mean = self.mean + delta + batch_count / tot_count
hmm sorry for the stupid question, but I don't quite get the formula for the moving mean here. as written, self.mean basically is:
self.mean + batch_mean - self.mean + batch_count / tot_count
which is:
batch_mean + batch_count / tot_count
why are we taking batch_mean plus a small ratio as the moving mean?
Yes, this construct is adopted from the author's reference code. I have also experimented with the correct moving-mean version ((self.mean * self.count + batch_mean * batch_count) / tot_count); the results are almost the same.
ah ok. it's not uncommon to see questionable reference implementation.
we can keep it as is for now.
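For reference, the standard streaming combination of two means that the algebra above points toward would look like this (a sketch for comparison, not what the reference implementation does):

```python
class RunningMeanSketch:
    """Textbook combination of a running mean with a new batch mean."""

    def __init__(self):
        self.mean = 0.0
        self.count = 0

    def update(self, batch_mean: float, batch_count: int) -> None:
        delta = batch_mean - self.mean
        tot_count = self.count + batch_count
        # Equivalent to (self.mean * self.count + batch_mean * batch_count) / tot_count:
        # the old mean is shifted toward the batch mean by the batch's weight.
        self.mean = self.mean + delta * batch_count / tot_count
        self.count = tot_count
```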
Updated beta as per input schedule.
"""
if beta_schedule == "linear_decay":
    return beta * ((1.0 - rho)**step)
add whitespaces around ** ?
The ci/travis/format.sh script removes the whitespace automatically.
The entropy of a state is considered as intrinsic reward and added to the
environment's extrinsic reward for policy optimization.
"""
should we mention here that entropy is only calculated per batch, and does not take the distribution of the entire replay buffer into consideration?
basically one of your responses, we just put it here.
Args:
    action_space (Space): The action space in which to explore.
    framework (str): One of "tf" or "torch".
update the comment? this implementation doesn't support torch.
if __name__ == "__main__":
    import pytest
move import to top of file?
self.sub_exploration = sub_exploration

# Creates modules/layers inside the actual ModelV2.
nit picking.
it's not really "inside" ModelV2?
more like "Creates ModelV2 embedding module / layers"?
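For context, the "random encoder" this comment refers to is essentially a randomly initialized, frozen embedding network. A minimal tf.keras sketch (layer sizes and names are assumptions, not the PR's exact model):

```python
import numpy as np
import tensorflow as tf


def build_random_encoder(obs_dim: int, embeds_dim: int) -> tf.keras.Model:
    """Randomly initialized encoder whose weights are never trained (RE3)."""
    inputs = tf.keras.Input(shape=(obs_dim,))
    hidden = tf.keras.layers.Dense(256, activation="relu")(inputs)
    embeds = tf.keras.layers.Dense(embeds_dim)(hidden)
    model = tf.keras.Model(inputs, embeds)
    model.trainable = False  # RE3 keeps the encoder fixed at its random init.
    return model


# Embed a batch of observations into the fixed random feature space.
encoder = build_random_encoder(obs_dim=8, embeds_dim=128)
obs_embeds = encoder(np.random.randn(32, 8).astype(np.float32)).numpy()
```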
@gjoliver I have updated the code with the requested changes, please have a look. It would be great if you could help with the failing tests; not sure why they are failing. Thank you.
I can totally help. can you give me edit permission to this repo? thanks.
I've added you to the repository. Let me know if it does not work.
…ate example to use MultiCallbacks.
…actually enabled.
Tests are now all green!
@@ -381,3 +385,59 @@ def on_learn_on_batch(self, *, policy: Policy, train_batch: SampleBatch,
def on_train_result(self, *, trainer, result: dict, **kwargs) -> None:
    for callback in self._callback_list:
        callback.on_train_result(trainer=trainer, result=result, **kwargs)


# This Callback is used by the RE3 exploration strategy.
Could we move this into utils/exploration/random_encoder.py?
It was moved out from here to avoid a circular dependency, in this commit: 89169cc
yeah, this is the best I could do.
I would love to keep everything in random_encoder.py too, but you know :)
@@ -0,0 +1,67 @@
import sys
This is great! Could we actually activate this test by adding it to rllib/BUILD next to the other utils/exploration tests? This will make sure this feature never breaks in the future.
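A hypothetical rllib/BUILD entry for this (Bazel/Starlark; the target name, tags, and path are assumptions based on the test file discussed here):

```python
py_test(
    name = "test_random_encoder",
    tags = ["team:ml", "utils"],
    size = "large",
    srcs = ["utils/exploration/tests/test_random_encoder.py"],
)
```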
Hey @n30111 , thanks for this really great PR! Just 2-3 smaller nits, like activating the test case, etc...
In a follow up PR, could you also add this exploration algo to our algo docs? There is a special section in there describing the different exploration strategies.
Sure, I will update the docs in a follow-up PR. Thanks.
@gjoliver Can you please help again with the tests? Not sure why these tests are failing. Thanks.
can you update your test? the error message looks like:
from ray.rllib.utils.exploration.random_encoder import RE3UpdateCallbacks
ImportError: cannot import name 'RE3UpdateCallbacks' from 'ray.rllib.utils.exploration.random_encoder'
Awesome, thanks! :)
@@ -679,7 +679,8 @@ def _initialize_loss_from_dummy_batch(
key not in [
SampleBatch.EPS_ID, SampleBatch.AGENT_INDEX,
SampleBatch.UNROLL_ID, SampleBatch.DONES,
SampleBatch.REWARDS, SampleBatch.INFOS]:
SampleBatch.REWARDS, SampleBatch.INFOS,
SampleBatch.OBS_EMBEDS]:
No need to do this rn, but in a follow up PR, we should also test-run the callbacks, so that the SampleBatch access detector captures this field and automatically adds it to the view-requirements (instead of always doing so).
Merging this now as all tests look good.
Thanks, great.
Why are these changes needed?
This PR adds support for RE3 (Random Encoders for Efficient Exploration). RE3 is a simple and efficient exploration method for off-policy RL algorithms; it can also be used with on-policy RL algorithms.
This is an implementation of "State Entropy Maximization with Random Encoders for Efficient Exploration" (Seo, Chen, Shin, Lee, Abbeel, & Lee, 2021), arXiv preprint arXiv:2102.09430.
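As a rough usage illustration (not the PR's exact example script; the environment name and parameter values are placeholders, and RE3UpdateCallbacks is imported from where this PR defines it):

```python
from functools import partial

import ray
from ray.rllib.agents import sac
from ray.rllib.agents.callbacks import RE3UpdateCallbacks

ray.init()

config = sac.DEFAULT_CONFIG.copy()
config["env"] = "Pendulum-v1"  # placeholder environment
# Callbacks handle the RE3 bookkeeping (obs-embedding stats, beta decay).
config["callbacks"] = partial(
    RE3UpdateCallbacks, embeds_dim=128, beta_schedule="linear_decay", k_nn=50)
# Switch the exploration strategy to RE3, wrapping a default sub-exploration.
config["exploration_config"] = {
    "type": "RE3",
    "embeds_dim": 128,
    "beta_schedule": "linear_decay",
    "sub_exploration": {"type": "StochasticSampling"},
}

trainer = sac.SACTrainer(config=config)
for _ in range(5):
    result = trainer.train()
    print(result["episode_reward_mean"])
```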
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.