Bam4d added the `bug` (Something that is supposed to be working; but isn't) and `triage` (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on Aug 26, 2021
sven1977 added the `P2` (Important issue, but not time-critical) and `rllib` (RLlib related issues) labels and removed the `triage` label on Sep 24, 2021
Thanks for raising this issue. Great catch! We should add a check of the stats structure to all agent "compilation" tests.
To solve this, I think we should instead fix the `get_learner_stats()` function to handle the multi-GPU case and then use that function in all execution ops: take the existing all-tower-reduce code in `MultiGPUTrainOneStep`, move it into `get_learner_stats`, and then use `get_learner_stats` consistently everywhere.
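A minimal sketch of what such a tower-aware `get_learner_stats()` could look like (the function shape, dict keys, and averaging reduce are assumptions based on this issue, not RLlib's actual implementation):

```python
def get_learner_stats(fetches):
    """Hypothetical sketch: return learner stats whether or not the
    fetches are wrapped in per-GPU "tower_N" keys."""
    tower_keys = sorted(k for k in fetches if k.startswith("tower_"))
    if not tower_keys:
        # Single-GPU / CPU case: stats sit at the top level.
        return fetches.get("learner_stats", {})
    # Multi-GPU case: reduce (here: average) each stat across all towers.
    per_tower = [fetches[k]["learner_stats"] for k in tower_keys]
    return {
        stat: sum(t[stat] for t in per_tower) / len(per_tower)
        for stat in per_tower[0]
    }

multi_gpu_fetches = {
    "tower_0": {"learner_stats": {"policy_loss": 0.4}},
    "tower_1": {"learner_stats": {"policy_loss": 0.6}},
}
print(get_learner_stats(multi_gpu_fetches))  # {'policy_loss': 0.5}
```

Averaging is just one possible reduce; the point is that the same helper would then work for both the wrapped and unwrapped layouts.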
Closing this issue. Please feel free to re-open it should there still be problems.
The above PR makes sure that all results dicts returned by `Trainer.train()` have the same structure (a test for this was added), regardless of the particular setup: multi-GPU, tf/torch, multi-agent, `num_sgd_iters > 1`.
What is the problem?
When multiple GPUs are used, learner stats are gathered in the `learn_on_loaded_batch` method under a per-GPU `tower_X` key that wraps the stats: https://github.com/ray-project/ray/blob/master/rllib/policy/torch_policy.py#L645
This `tower_0` key is not taken into account when `get_learner_stats()` is used: ray/rllib/execution/multi_gpu_learner_thread.py, line 102 in 089dd9b.
This causes the policy's learner stats to be dropped when GPUs are used; the CPU path does not drop these stats.
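For illustration, here is a hypothetical multi-GPU fetch dict (keys and stat values invented for this sketch) showing why a top-level lookup misses the stats:

```python
# Hypothetical multi-GPU fetch structure: learner stats are nested
# under a per-GPU "tower_0" key.
multi_gpu_fetches = {
    "tower_0": {
        "learner_stats": {"policy_loss": 0.12, "entropy": 1.3},
    },
}

# A lookup that ignores the "tower_0" wrapper finds nothing,
# so the stats are silently dropped:
print(multi_gpu_fetches.get("learner_stats", {}))  # {}
```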
This differs from the single-GPU/CPU case, where `learn_on_loaded_batch` returns the stats without the `tower_X` wrapper, so the similar extraction code in ray/rllib/execution/learner_thread.py (line 80 in 089dd9b) works as expected.
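By contrast, a hypothetical single-device fetch dict (values invented for this sketch) keeps the stats at the top level, so the same lookup succeeds:

```python
# Hypothetical single-GPU/CPU fetch structure: no "tower_X" wrapper,
# learner stats sit at the top level.
single_device_fetches = {
    "learner_stats": {"policy_loss": 0.12, "entropy": 1.3},
}

# The same top-level lookup now returns the stats intact:
print(single_device_fetches.get("learner_stats", {}))
# {'policy_loss': 0.12, 'entropy': 1.3}
```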
Ray version and other system information (Python version, TensorFlow version, OS):
version: latest dev (2.0.0)
python: 3.8
OS: macOS and Linux
framework: torch and tensorflow
Reproduction (REQUIRED)
Run anything with GPU learners (specifically, in my case, IMPALA).
If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".
@sven1977