self.minibatch_size #232

Open
1tac11 opened this issue Mar 30, 2023 · 9 comments

1tac11 commented Mar 30, 2023

Hi there,

In a2c_common.py, line 194:
self.minibatch_size = self.config.get('minibatch_size', self.num_actors * self.minibatch_size_per_env)
shouldn't it be
self.minibatch_size = self.config.get('minibatch_size', self.num_envs * self.minibatch_size_per_env)
instead?

1tac11 commented Mar 30, 2023

Mainly asking for clarification, please.

1tac11 commented Mar 30, 2023

Ah OK, I see: num_actors = num_envs.
Sorry to bother again: how is self.seq_len connected to horizon_length?
And shouldn't self.minibatch_size_per_env = self.config.get('minibatch_size_per_env', 0) be self.minibatch_size_per_env = self.config.get('minibatch_size_per_env', self.minibatch_size // self.num_actors) instead (also in a2c_common.py)?
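
For reference, here is a minimal sketch of how the two defaults discussed above interact. It is an illustration only, with made-up config values and not the actual rl_games code path, assuming num_actors == num_envs:

    # Hypothetical illustration of the fallback logic, not the library's code.
    config = {'num_actors': 4096, 'horizon_length': 16}  # made-up values

    num_actors = config['num_actors']  # equals the number of envs
    minibatch_size_per_env = config.get('minibatch_size_per_env', 0)
    # minibatch_size only falls back to num_actors * minibatch_size_per_env
    # when 'minibatch_size' is absent from the config.
    minibatch_size = config.get('minibatch_size', num_actors * minibatch_size_per_env)

    # The alternative default suggested above would derive the per-env value
    # from an already-known minibatch_size instead:
    # minibatch_size_per_env = config.get('minibatch_size_per_env',
    #                                     minibatch_size // num_actors)
    print(minibatch_size)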

1tac11 commented Mar 31, 2023

One more question, please:
How does the parallel computation with torchrun work?
The problem is that when I run Ant on 4 machines in parallel, it does not train four times as fast but only about twice as fast.
As I understand it, during the forward (rollout) phase samples are created on every GPU, while during the backward pass the batches are computed in parallel, right? Then there should be almost no overhead from parallelization.

ViktorM (Collaborator) commented Mar 31, 2023

Hi @1seck! horizon_length must be divisible by self.seq_len, so the maximum value self.seq_len can take equals horizon_length, but it can also be an integer fraction of it.

As for self.minibatch_size_per_env, it is not used anywhere except in the self.minibatch_size calculation when minibatch_size is not set. With the default value of 0 we could, in theory, add some additional checks, but they are not currently used.
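
To make the constraint concrete, here is a tiny sketch of the divisibility rule described above; the values are made up and the explicit assert is only illustrative, not something rl_games necessarily performs:

    # horizon_length must be divisible by seq_len: seq_len can be at most
    # horizon_length, or any integer fraction of it.
    horizon_length = 16  # made-up value
    seq_len = 4          # valid choices here would be 1, 2, 4, 8, 16

    assert horizon_length % seq_len == 0, \
        "horizon_length must be divisible by seq_len"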

ViktorM (Collaborator) commented Mar 31, 2023

How does the parallel computation with torchrun work?
The problem is that when I run Ant on 4 machines in parallel, it does not train four times as fast but only about twice as fast.

What metrics are you talking about? FPS step and step_and_inference should scale almost linearly with the number of GPUs. Total FPS won't scale linearly, since gradients additionally have to be moved between the different GPUs.
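
To illustrate why total FPS saturates, here is a rough back-of-the-envelope model; the communication fraction is a made-up number, not a measurement of rl_games:

    # Toy speedup model: the per-step compute divides across N GPUs, while
    # gradient communication/synchronization adds a cost that does not shrink.
    def total_speedup(n_gpus, comm_fraction=0.25):
        compute_fraction = 1.0 - comm_fraction
        return 1.0 / (compute_fraction / n_gpus + comm_fraction)

    for n in (1, 2, 4, 8):
        print(n, round(total_speedup(n), 2))
    # With comm_fraction=0.25 this prints roughly 1.0, 1.6, 2.29, 2.91:
    # step FPS keeps scaling with the GPU count, but total FPS flattens out.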

ViktorM (Collaborator) commented Mar 31, 2023

And what are the numbers you got?

1tac11 commented Apr 2, 2023

Hi @ViktorM,
Thank you for responding. It seems fine as long as I am on one machine with multiple GPUs, but when I try several machines with the master_addr and port args, the weights are not shared: the worker nodes show the same best-reward output at step n as in single-machine training. I am comparing the best reward at a certain step n.
Even with four GPUs on one machine, the best reward seems to improve only about twice as fast.
I will check again tomorrow to double-check, but that is how the training went last week.
Kind regards

1tac11 commented Apr 5, 2023

8 GPUs: epoch 200: 5900, epoch 500: 8400
1 GPU: epoch 200: 4100, epoch 500: 6637
Regards

1tac11 commented Apr 8, 2023

I mean, I don’t know whether it syncs at all when distributing over several instances.
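
One way to check whether the processes really synchronize across machines is to launch a tiny torch.distributed probe with the same torchrun / master_addr settings as the training run. This is a standalone sketch, not rl_games code; the gloo backend and the file name check_sync.py are assumptions:

    # check_sync.py -- run on every node with e.g.:
    #   torchrun --nnodes=2 --node_rank=<0 or 1> --nproc_per_node=4 \
    #            --master_addr=<head node IP> --master_port=29500 check_sync.py
    import os
    import torch
    import torch.distributed as dist

    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT in the environment.
    dist.init_process_group(backend="gloo")  # use "nccl" for GPU tensors
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Every rank contributes its own id; after all_reduce each rank should
    # hold the same sum 0 + 1 + ... + (world_size - 1).
    t = torch.tensor([float(rank)])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    expected = world_size * (world_size - 1) / 2
    print(f"rank {rank}/{world_size} via {os.environ.get('MASTER_ADDR')}: "
          f"sum = {t.item()} (expected {expected})")

    dist.destroy_process_group()

If each node reports a world_size equal only to its local GPU count, the machines are training as independent groups rather than one distributed job, which would match the symptom described above.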
