
[RLlib] Make policies_to_train more flexible via callable option. #20735

Merged · 12 commits · Jan 27, 2022

Conversation

@sven1977 (Contributor) commented on Nov 26, 2021:

This PR introduces the option to set config.multiagent.policies_to_train to a callable (as an alternative to providing a list/set of PolicyIDs). The callable takes a PolicyID and an optional SampleBatch|MultiAgentBatch as args and returns a bool (trainable or not?). This allows for more fine-grained control over which policies need to be updated, for example in multi-agent scenarios where policy A should be trained when playing against policy B, but not when playing against policy C.
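For illustration, a minimal sketch of the new callable option; the policy IDs, the agent-to-policy mapping, and the matchup rule below are hypothetical, not taken from the PR:

```python
def is_policy_to_train(policy_id, batch=None):
    # Always train policy_B and policy_C.
    if policy_id != "policy_A":
        return True
    # Train policy_A only when the (multi-agent) batch also contains
    # policy_B experience, i.e. A was playing against B; skip A-vs-C data.
    return batch is None or "policy_B" in getattr(batch, "policy_batches", {})

config = {
    "multiagent": {
        "policies": {"policy_A", "policy_B", "policy_C"},
        "policy_mapping_fn": (
            lambda agent_id, **kwargs: "policy_A"
            if agent_id == "agent_0" else "policy_B"
        ),
        # New: a callable instead of a list/set of PolicyIDs.
        "policies_to_train": is_policy_to_train,
    },
}
```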

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…cies_to_train_by_callable

# Conflicts:
#	rllib/agents/trainer.py
#	rllib/evaluation/rollout_worker.py
#	rllib/evaluation/worker_set.py
@gjoliver (Member) left a comment:

a few questions. thanks.

A multi-agent environment is one which has multiple acting entities per step, e.g., in a traffic simulation, there may be multiple "car"- and "traffic light" agents in the environment. The model for multi-agent in RLlib is as follows: (1) as a user, you define the number of policies available up front, and (2) a function that maps agent ids to policy ids. This is summarized by the below figure:
In a multi-agent environment, there are more than one "agent" acting simultaneously, in a turn-based fashion, or in a combination of these two.

For example, in a traffic simulation, there may be multiple "car"- and "traffic light" agents in the environment, acting simultaneously.
@gjoliver (Member):

typo "car"-

@sven1977 (Contributor, Author):

done

config={"gamma": 0.85}, # use main config plus <- this override here
), # alternatively, simply do: `PolicySpec(config={"gamma": 0.85})`

# Deprecated way: Specify class, obs-/action-spaces, config-overrides
@gjoliver (Member):

sorry, is "class is None" a deprecated way of specifying things or not?

@sven1977 (Contributor, Author):

It should no longer be used. Better to use the PolicySpec namedtuple. Changed the comment.
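For reference, a minimal sketch of the preferred style; the policy IDs are hypothetical, and the gamma override is the one from the diff above:

```python
from ray.rllib.policy.policy import PolicySpec

policies = {
    # All-None fields mean: infer policy class and spaces from the main config.
    "policy_A": PolicySpec(config={"gamma": 0.85}),  # main config plus this override
    "policy_B": PolicySpec(),  # use the main config unchanged
}
```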

# Specifies those policies that should be updated.
# Options are:
# - None for all policies.
# - An iterable of PolicyIDs to be updated.
@gjoliver (Member):

nit nit nit
maybe just "An iterable of PolicyIDs"
when reading this, I am not sure if "to be updated" applies to the list of IDs or to the policies specified via these IDs.

@sven1977 (Contributor, Author):

Fixed and clarified.

"""
if policies_to_train is not None:
self.policies_to_train = policies_to_train
if is_policy_to_train is not None:
@gjoliver (Member):

should None reset self.is_policy_to_train to an empty list?
I also feel like we shouldn't allow is_policy_to_train to have a default value, which does nothing. e.g., why would someone do

self.set_is_policy_to_train()

by itself ... ?

@sven1977 (Contributor, Author):

Hmm, actually, you are right. Not sure why this should be supported. I'll fix.

@sven1977 (Contributor, Author):

Made this non-Optional.
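A sketch of what the resulting non-Optional setter signature could look like; the exact signature in the PR may differ:

```python
from typing import Callable, Container, Optional, Union

from ray.rllib.utils.typing import PolicyID, SampleBatchType

class RolloutWorker:  # abbreviated; only the relevant method shown
    def set_is_policy_to_train(
        self,
        is_policy_to_train: Union[
            Container[PolicyID],
            Callable[[PolicyID, Optional[SampleBatchType]], bool],
        ],
    ) -> None:
        # No default value: callers must explicitly pass a container of
        # PolicyIDs or a trainability predicate.
        ...
```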

# Set of IDs of those policies, which should be trained. This property
# is optional and mainly used for backward compatibility.
self.policies_to_train: Optional[Set[PolicyID]] = None
self.is_policy_to_train: Callable[[PolicyID, SampleBatchType], bool]
@gjoliver (Member):

I feel like self._should_train_policy may be a better name than is_policy_to_train.
also, if these variables are private-ish, we should name them _policies_to_train and _should_train_policy.
last thing: I feel like we may not want to keep a copy of policies_to_train on self. that just confuses things, since self.is_policy_to_train is the real source of truth.

@sven1977 (Contributor, Author):

Yeah, this is just for backward compatibility. Some users may still use this property somewhere, and those users are unlikely to ever change the policies_to_train list. So for them, it wouldn't matter whether self.is_policy_to_train is the actual source of truth.

@sven1977 (Contributor, Author):

But aren't functions that return bools always named "is_...()"? Or "isPolicyToTrain()", etc.?

@gjoliver (Member):

I see. ok, will live with this then.
depending on the style guide :) I found our naming a bit less consistent sometimes, but that's a minor thing.

@@ -843,7 +851,7 @@ def learn_on_batch(self, samples: SampleBatchType) -> Dict:
        builders = {}
        to_fetch = {}
        for pid, batch in samples.policy_batches.items():
-            if pid not in self.policies_to_train:
+            if not self.policies_to_train(pid, samples):
@gjoliver (Member):

we should call self.is_policy_to_train() or self._should_train_policy(), right?
there are quite a few places below that also use self.policies_to_train(), which is a list now.

@sven1977 (Contributor, Author):

Great catch, thanks!
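For clarity, a sketch of how the fixed check would read, per the diff above (surrounding code abbreviated):

```python
for pid, batch in samples.policy_batches.items():
    # Ask the trainability callable, not the former PolicyID list.
    if not self.is_policy_to_train(pid, samples):
        continue
    # ... build and apply the update for `pid` ...
```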

# By default (None), use the set of all policies found in the
# policy_dict.
if policies_to_train is None:
policies_to_train = set(self.policy_dict.keys())
@gjoliver (Member):

should be self.policies_to_train = set(...)?

@gjoliver (Member):

nvm

@@ -1132,7 +1146,7 @@ def add_policy(
                self.observation_filter, new_policy.observation_space.shape)

        self.set_policy_mapping_fn(policy_mapping_fn)
-        self.set_policies_to_train(policies_to_train)
+        self.set_is_policy_to_train(policies_to_train)
@gjoliver (Member):

same comments as above.

@sven1977 (Contributor, Author):

fixed.

self.local_worker = self.policy_ids = None
if local_worker:
self.local_worker = local_worker
else:
@gjoliver (Member):

why else here?
why only set self.policy_ids if local_worker is None?

@sven1977 (Contributor, Author):

Backward compatibility again :) Roughly as sketched below:

  • New: If the local_worker is given -> use its is_policy_to_train() method.
  • Old: Use the given PolicyID list to figure out whether a policy is trainable or not.
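A hedged sketch of that dual path; the helper name here is hypothetical:

```python
def _is_trainable(self, policy_id, batch=None):
    if self.local_worker is not None:
        # New: delegate to the local worker's is_policy_to_train() callable.
        return self.local_worker.is_policy_to_train(policy_id, batch)
    # Old (backward compatible): consult the static PolicyID list.
    return self.policy_ids is None or policy_id in self.policy_ids
```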

@gjoliver (Member):

haha, it will take me forever to adjust ... 😭

…cies_to_train_by_callable

# Conflicts:
#	rllib/agents/trainer.py
#	rllib/evaluation/worker_set.py
@gjoliver (Member) left a comment:

cool. have some random comments left, but this looks good. feel free to merge.



@@ -1150,7 +1165,8 @@ def remove_policy(
        *,
        policy_id: PolicyID = DEFAULT_POLICY_ID,
        policy_mapping_fn: Optional[Callable[[AgentID], PolicyID]] = None,
-        policies_to_train: Optional[List[PolicyID]] = None,
+        policies_to_train: Optional[Union[Container[PolicyID], Callable[
@gjoliver (Member):

I don't know if it's worth it to define some names for the 2 func types here; kinda long, and used multiple times.
totally up to you.
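One way the naming could look, as a sketch (the alias name is hypothetical):

```python
from typing import Callable, Container, Optional, Union

from ray.rllib.utils.typing import PolicyID, SampleBatchType

# "A container of PolicyIDs, or a trainability predicate, or None."
PoliciesToTrain = Optional[Union[
    Container[PolicyID],
    Callable[[PolicyID, Optional[SampleBatchType]], bool],
]]
```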


@sven1977 merged commit 371fbb1 into ray-project:master on Jan 27, 2022