
[rllib] Policy learner_stats get dropped when multi_gpu_learner_thread.py is used (in GPU and multi-GPU use cases). #18116

Closed
1 of 2 tasks
Bam4d opened this issue Aug 26, 2021 · 4 comments
Assignees: sven1977
Labels: bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical), rllib (RLlib related issues)

Comments

@Bam4d (Contributor) commented Aug 26, 2021

What is the problem?

When multiple GPUs are used, learner stats are gathered in the learn_on_loaded_batch method and nested under a "tower_X" key:
https://github.com/ray-project/ray/blob/master/rllib/policy/torch_policy.py#L645

for example:

{'tower_0': {'learner_stats': {'cur_lr': 0.000495184, 'policy_loss': -40.517921447753906, 'entropy': 1.8380180597305298, 'entropy_coeff': 0.0005, 'var_gnorm': 17.741676330566406, 'vf_loss': 0.7066917419433594, 'vf_explained_var': array([0.5089742], dtype=float32), 'mean_rhos': 1.0025060176849365, 'std_rhos': 0.39850014448165894}}}

This 'tower_0' key is not taken into account when get_learner_stats() is used:

self.stats = {DEFAULT_POLICY_ID: get_learner_stats(fetches)}
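To illustrate the mismatch, a minimal sketch (the helper below is a simplified stand-in that assumes get_learner_stats() only looks up LEARNER_STATS_KEY at the top level of the fetch dict; it is not the actual RLlib source):

# Simplified stand-in for get_learner_stats(): assumes it only looks at the
# top level of the fetch dict (illustration only, not the actual RLlib code).
LEARNER_STATS_KEY = "learner_stats"

def get_learner_stats_sketch(fetches):
    return fetches.get(LEARNER_STATS_KEY, {})

# Multi-GPU fetches: stats are nested one level deeper, under "tower_0".
multi_gpu_fetches = {"tower_0": {LEARNER_STATS_KEY: {"policy_loss": -40.5}}}
print(get_learner_stats_sketch(multi_gpu_fetches))      # {}  -> stats dropped

# Single CPU/GPU fetches: stats sit at the top level and are found.
single_device_fetches = {LEARNER_STATS_KEY: {"policy_loss": -40.5}}
print(get_learner_stats_sketch(single_device_fetches))  # {'policy_loss': -40.5}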

This causes the policy learner_stats to get dropped when GPUs are used; the CPU-only path does not drop these stats.

This is different from the single-GPU/CPU case, where the learn_on_loaded_batch function returns:

{'learner_stats': {'cur_lr': 0.000495184, 'policy_loss': -40.517921447753906, 'entropy': 1.8380180597305298, 'entropy_coeff': 0.0005, 'var_gnorm': 17.741676330566406, 'vf_loss': 0.7066917419433594, 'vf_explained_var': array([0.5089742], dtype=float32), 'mean_rhos': 1.0025060176849365, 'std_rhos': 0.39850014448165894}}

(note the lack of a tower_X key)

Similar code can then extract the policy metrics, which works:

self.stats = get_learner_stats(fetches)

Ray version and other system information (Python version, TensorFlow version, OS):

Ray version: latest dev (2.0.0)
Python: 3.8
OS: macOS + Linux
Framework: torch + tensorflow

Reproduction (REQUIRED)

Run anything with GPU learners (specifically, in my case I'm using IMPALA).
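A minimal repro sketch (not my exact script; assumes the ray.rllib.agents.impala trainer API and at least one available GPU):

import ray
from ray.rllib.agents.impala import ImpalaTrainer

ray.init()
trainer = ImpalaTrainer(
    env="CartPole-v0",
    config={
        "framework": "torch",
        "num_gpus": 1,      # GPU learner -> goes through multi_gpu_learner_thread.py
        "num_workers": 1,
    },
)
result = trainer.train()
# With the bug, the per-policy learner_stats are missing from the learner info.
print(result["info"]["learner"])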

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

@sven1977

Bam4d added the bug and triage labels on Aug 26, 2021
@mvindiola1 (Contributor) commented Sep 4, 2021

@Bam4d

Does flipping the keys fix this?

batch_fetches[LEARNER_STATS_KEY] = {}
for i, batch in enumerate(device_batches):
    batch_fetches[LEARNER_STATS_KEY][f"tower_{i}"] = self.extra_grad_info(batch)
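The fetches would then presumably look something like this (a sketch of the resulting structure, values abbreviated, not tested):

batch_fetches = {
    "learner_stats": {
        "tower_0": {"cur_lr": 0.000495184, "policy_loss": -40.52},
        "tower_1": {"cur_lr": 0.000495184, "policy_loss": -40.49},
    }
}

get_learner_stats() would then at least find the learner_stats key, though consumers would still see the extra per-tower level unless it gets reduced somewhere.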

sven1977 self-assigned this on Sep 24, 2021
sven1977 added the P2 and rllib labels and removed the triage label on Sep 24, 2021
@sven1977 (Contributor) commented

Thanks for raising this issue. Great catch! We should add a check of the stats' structure to all agent "compilation" tests.

To solve this: I think we should rather fix the get_learner_stats() function to handle the multi-GPU case and then use that function in all execution ops. We can use the existing all-tower-reduce code in MultiGPUTrainOneStep and move that into get_learner_stats, then use get_learner_stats consistently everywhere.
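A rough sketch of what such a tower-aware get_learner_stats() could look like (just to illustrate the idea; the mean-reduce and the key handling here are assumptions, not the final implementation):

import numpy as np

LEARNER_STATS_KEY = "learner_stats"

def get_learner_stats(fetches):
    # Single CPU/GPU case: stats already sit at the top level.
    if LEARNER_STATS_KEY in fetches:
        return fetches[LEARNER_STATS_KEY]
    # Multi-GPU case: collect the per-tower stats and mean-reduce them.
    tower_stats = [
        v[LEARNER_STATS_KEY]
        for k, v in fetches.items()
        if k.startswith("tower_") and LEARNER_STATS_KEY in v
    ]
    if not tower_stats:
        return {}
    return {
        key: np.mean([stats[key] for stats in tower_stats], axis=0)
        for key in tower_stats[0]
    }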

@sven1977 (Contributor) commented

@Bam4d @mvindiola1 ^

@sven1977 (Contributor) commented Oct 5, 2021

Closing this issue. Please feel free to re-open it should there still be problems.
The above PR makes sure that all results dicts returned by Trainer.train() have the same structure (a test for this was added), regardless of the particular setup: multi-GPU, tf/torch, multi-agent, num_sgd_iters > 1.
