[RLlib] Examples folder: All training_iteration translations. #23712
Conversation
Looks good to me, have some questions.
@@ -922,8 +922,9 @@ def learn_on_batch(self, samples: SampleBatchType) -> Dict:
                summarize(samples)
            )
        )

+       info_out = {}
Why make this change?
This was a bug. There are some cases where info_out would not be defined later in the code. I wanted to make sure it's at least an empty dict :)
yeah, it's a good idea to define info_out here.
but now that we are doing this, we should try to make sure we always and only do info_out.update(...) in the subsequent code, like line 945.
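For illustration, a minimal sketch of the init-then-update pattern being suggested here (the method body and stats values are assumptions, not the actual learn_on_batch code):

    def learn_on_batch(self, samples):
        # Always define info_out up front so no code path can leave it unset.
        info_out = {}

        # Hypothetical branches; in each one, only merge into info_out via
        # update() instead of re-assigning it.
        if isinstance(samples, dict):
            for policy_id, batch in samples.items():
                info_out.update({policy_id: {"num_samples": len(batch)}})
        else:
            info_out.update({"default_policy": {"num_samples": len(samples)}})

        return info_out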
@@ -36,7 +36,7 @@

 @ExperimentalAPI
-def train_one_step(trainer, train_batch) -> Dict:
+def train_one_step(trainer, train_batch, policies_to_train=None) -> Dict:
Where is this change implemented in the code base?
Yeah, actually, let me add a nice docstring for this function. ...
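Not the docstring that was actually added, but a rough sketch of what it could cover (the argument descriptions and return structure are assumptions):

    from typing import Dict

    from ray.rllib.utils.annotations import ExperimentalAPI


    @ExperimentalAPI
    def train_one_step(trainer, train_batch, policies_to_train=None) -> Dict:
        """Runs a single (non-distributed) training step on a train batch.

        Args:
            trainer: The Trainer whose local worker's policies get updated.
            train_batch: The (possibly multi-agent) batch to learn on.
            policies_to_train: Optional list of policy IDs to restrict the
                update to. If None, all policies marked as trainable are
                updated.

        Returns:
            A dict of learner stats, keyed by policy ID.
        """
        ...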
    dqn_train_results = {}
    dqn_train_batch = self.local_replay_buffer.replay()
    if dqn_train_batch is not None:
        dqn_train_results = train_one_step(self, dqn_train_batch, ["dqn_policy"])
Oh it's used here.
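For completeness, a hedged sketch of the PPO-side counterpart in the same training_iteration (the sampling call is an assumption made for illustration; only the policies_to_train restriction is the point):

    # synchronous_parallel_sample lives in ray.rllib.execution.rollout_ops;
    # its exact usage in this example is assumed here.
    ppo_train_batch = synchronous_parallel_sample(worker_set=self.workers)
    # Restrict the update to the PPO policy only.
    ppo_train_results = train_one_step(self, ppo_train_batch, ["ppo_policy"])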
nice pr. I know it's already merged, but I have a few comments ...
    # Create a new Trainer using the Policy and config defined above and a new
    # execution plan.
    # Backward compatibility, just in case users want to use the erroneous old name.
    RandomParametriclPolicy = RandomParametricPolicy
is this really necessary?? :) :)
this is an example script ... folks probably shouldn't import an example script and use it as a library?
-    # Return training metrics.
-    return StandardMetricsReporting(rollouts, workers, config)
+    # Return (empty) training metrics.
+    return {}
why not collect rollout related metrics here?
B/c it's done automatically by RLlib after this. So we always just return the learner stats here.
But yes, we should start thinking about a way to customize this bit of the iteration.
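A minimal sketch of the resulting shape of a training_iteration under this convention (helper imports and sampling details are assumptions; the point is only that learner stats are returned and rollout metrics are attached by RLlib afterwards):

    from ray.rllib.execution.rollout_ops import synchronous_parallel_sample
    from ray.rllib.execution.train_ops import train_one_step


    # Would live as a method on a Trainer subclass.
    def training_iteration(self) -> dict:
        # Sample a train batch from the rollout workers (details assumed).
        train_batch = synchronous_parallel_sample(worker_set=self.workers)

        # Update the policies; train_one_step returns learner stats only.
        train_results = train_one_step(self, train_batch)

        # Rollout metrics (episode rewards, lengths, etc.) are collected and
        # merged by RLlib after this method returns, so nothing else is
        # reported here.
        return train_results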
-        return StandardMetricsReporting(train_op, workers, config)
+        # Combine results for PPO and DQN into one results dict.
+        results = dict(ppo_train_results, **dqn_train_results)
this small example is actually really cool!! 2 trainers train 2 policies for different agents in the same multi-agent env. nice.
it doesn't feel exactly right here though: shouldn't we be properly combining the result dicts? for example, not overwriting ppo steps with dqn steps, but summing them up?
I wonder how Concurrently does it.
No, they are not overridden. The PPO stats and DQN stats will reside under their policy ID keys:

    results = {
        "ppo_policy": [some stats],
        "dqn_policy": [some other stats],
    }
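A small self-contained illustration of the merge semantics in question (the stats values are made up; only the dict(a, **b) behavior matters):

    ppo_train_results = {"ppo_policy": {"learner_stats": {"total_loss": 0.12}}}
    dqn_train_results = {"dqn_policy": {"learner_stats": {"total_loss": 0.34}}}

    # dict(a, **b) copies a and then overlays b's keys. Because the top-level
    # keys are distinct policy IDs, nothing gets overwritten here.
    results = dict(ppo_train_results, **dqn_train_results)
    assert set(results) == {"ppo_policy", "dqn_policy"}

    # Caveat: if both dicts carried a shared top-level counter (e.g. a
    # trained-steps count), the DQN value would silently win, so such
    # counters would need to be summed explicitly.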
All scripts in rllib/examples that use execution_plan are translated to using the new training_iteration API.

Why are these changes needed?

Related issue number

Checks
I've run scripts/format.sh to lint the changes in this PR.