self.minibatch_size #232

Open
1tac11 opened this issue Mar 30, 2023 · 9 comments

1tac11 commented Mar 30, 2023

Hi there,

In a2c_common.py, line 194:
self.minibatch_size = self.config.get('minibatch_size', self.num_actors * self.minibatch_size_per_env)
shouldn't it be
self.minibatch_size = self.config.get('minibatch_size', self.num_envs * self.minibatch_size_per_env)
instead?

1tac11 commented Mar 30, 2023

Mainly asking for clarification, please.

1tac11 commented Mar 30, 2023

Ah OK, I see: num_actors = num_envs.
Sorry to bother again: how is self.seq_len connected to horizon_length?
And shouldn't self.minibatch_size_per_env = self.config.get('minibatch_size_per_env', 0) be self.minibatch_size_per_env = self.config.get('minibatch_size_per_env', self.minibatch_size // self.num_actors) instead (also in a2c_common.py)?
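
For reference, here is a minimal sketch of how the two defaults discussed above interact. It is an illustration only, with made-up config values and not the actual rl_games code path, assuming num_actors == num_envs:

    # Hypothetical illustration of the fallback logic, not the library's code.
    config = {'num_actors': 4096, 'horizon_length': 16}  # made-up values

    num_actors = config['num_actors']  # equals the number of envs
    minibatch_size_per_env = config.get('minibatch_size_per_env', 0)
    # minibatch_size only falls back to num_actors * minibatch_size_per_env
    # when 'minibatch_size' is absent from the config.
    minibatch_size = config.get('minibatch_size', num_actors * minibatch_size_per_env)

    # The alternative default suggested above would derive the per-env value
    # from an already-known minibatch_size instead:
    # minibatch_size_per_env = config.get('minibatch_size_per_env',
    #                                     minibatch_size // num_actors)
    print(minibatch_size)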

1tac11 commented Mar 31, 2023

One more question, please:
How does the parallel computation with torchrun work?
The problem is that when I run Ant on 4 machines in parallel, it does not train four times as fast but only about twice as fast.
As I understand it, during the forward (rollout) phase samples are created on every GPU, while during the backward pass the batches are computed in parallel, right? Then there should be almost no overhead from parallelization.

ViktorM (Collaborator) commented Mar 31, 2023

Hi @1seck! horizon_length must be divisible by self.seq_len, so the maximum value self.seq_len can take equals horizon_length, but it can also be an integer fraction of it.

As for self.minibatch_size_per_env, it is not used anywhere except in the self.minibatch_size calculation when minibatch_size is not set. With the default value of 0 we could, in theory, add some additional checks, but they are not currently used.
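
To make the constraint concrete, here is a tiny sketch of the divisibility rule described above; the values are made up and the explicit assert is only illustrative, not something rl_games necessarily performs:

    # horizon_length must be divisible by seq_len: seq_len can be at most
    # horizon_length, or any integer fraction of it.
    horizon_length = 16  # made-up value
    seq_len = 4          # valid choices here would be 1, 2, 4, 8, 16

    assert horizon_length % seq_len == 0, \
        "horizon_length must be divisible by seq_len"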

ViktorM (Collaborator) commented Mar 31, 2023

How does the parallel computation with torchrun work?
The problem is that when I run Ant on 4 machines in parallel, it does not train four times as fast but only about twice as fast.

What metrics are you talking about? FPS step and step_and_inference should scale almost linearly with the number of GPUs. Total FPS won't scale linearly, since gradients additionally have to be moved between the different GPUs.
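
To illustrate why total FPS saturates, here is a rough back-of-the-envelope model; the communication fraction is a made-up number, not a measurement of rl_games:

    # Toy speedup model: the per-step compute divides across N GPUs, while
    # gradient communication/synchronization adds a cost that does not shrink.
    def total_speedup(n_gpus, comm_fraction=0.25):
        compute_fraction = 1.0 - comm_fraction
        return 1.0 / (compute_fraction / n_gpus + comm_fraction)

    for n in (1, 2, 4, 8):
        print(n, round(total_speedup(n), 2))
    # With comm_fraction=0.25 this prints roughly 1.0, 1.6, 2.29, 2.91:
    # step FPS keeps scaling with the GPU count, but total FPS flattens out.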

ViktorM (Collaborator) commented Mar 31, 2023

And what are the numbers you got?

1tac11 commented Apr 2, 2023

Hi @ViktorM,
Thank you for responding. It seems fine as long as I am on one machine with multiple GPUs, but when I try several machines with the master_addr and port args, the weights are not shared: the worker nodes show the same best-reward output at step n as in single-machine training. I am comparing the best reward at a certain step n.
Even with four GPUs on one machine, the best reward seems to improve only about twice as fast.
I will check again tomorrow to double-check, but that is how the training went last week.
Kind regards

1tac11 commented Apr 5, 2023

8 GPUs: epoch 200: 5900, epoch 500: 8400
1 GPU: epoch 200: 4100, epoch 500: 6637
Regards

1tac11 commented Apr 8, 2023

I mean, I don’t know whether it syncs at all when distributing over several instances.
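
One way to check whether the processes really synchronize across machines is to launch a tiny torch.distributed probe with the same torchrun / master_addr settings as the training run. This is a standalone sketch, not rl_games code; the gloo backend and the file name check_sync.py are assumptions:

    # check_sync.py -- run on every node with e.g.:
    #   torchrun --nnodes=2 --node_rank=<0 or 1> --nproc_per_node=4 \
    #            --master_addr=<head node IP> --master_port=29500 check_sync.py
    import os
    import torch
    import torch.distributed as dist

    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT in the environment.
    dist.init_process_group(backend="gloo")  # use "nccl" for GPU tensors
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Every rank contributes its own id; after all_reduce each rank should
    # hold the same sum 0 + 1 + ... + (world_size - 1).
    t = torch.tensor([float(rank)])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    expected = world_size * (world_size - 1) / 2
    print(f"rank {rank}/{world_size} via {os.environ.get('MASTER_ADDR')}: "
          f"sum = {t.item()} (expected {expected})")

    dist.destroy_process_group()

If each node reports a world_size equal only to its local GPU count, the machines are training as independent groups rather than one distributed job, which would match the symptom described above.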
