[RLlib] To improve performance, do not wait for sync weight calls by default. #30509

gjoliver · 2022-11-19T11:34:54Z

Signed-off-by: Jun Gong [email protected]

Why are these changes needed?

This improves throughput by almost 2x for many of our algorithms.
As an example, A3C: [image not preserved]

This was also the default behavior before the elastic training PR.

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- [*] Unit tests
- [*] Release tests
- This PR is not tested :(


kouroshHakha

A couple of questions. Looks good otherwise.

kouroshHakha · 2022-11-19T17:32:43Z

rllib/evaluation/worker_set.py

@@ -397,6 +398,10 @@ def sync_weights(
                weights to. If None (default), sync to all remote workers.
            global_vars: An optional global vars dict to set this
                worker to. If None, do not update the global_vars.
+            timeout_seconds: Timeout in seconds to wait for the sync weights


What would None do then?

It waits indefinitely until all the object_refs are ready.
This is standard ray.wait behavior; see the documentation for ray.wait(timeout=...):
https://docs.ray.io/en/latest/ray-core/package-ref.html#ray-wait
The default timeout for ray.wait is None.
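For readers unfamiliar with these timeout semantics, here is a minimal sketch using the standard library's `concurrent.futures.wait` as a stand-in for `ray.wait`. It is an analogy only, not RLlib's implementation. The names `fake_set_weights` and `sync` are hypothetical, and the assumption is that a timeout of 0 corresponds to the "do not wait" behavior this PR makes the default.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def fake_set_weights(delay_s):
    """Hypothetical stand-in for a remote set_weights call that takes delay_s."""
    time.sleep(delay_s)
    return "synced"


def sync(delays, timeout_seconds):
    """Launch one fake sync call per delay and wait up to timeout_seconds.

    timeout_seconds=None blocks until every call has finished;
    timeout_seconds=0 returns immediately without waiting.
    Returns (number finished, number still pending) at the time wait() returned.
    """
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(fake_set_weights, d) for d in delays]
        done, not_done = wait(futures, timeout=timeout_seconds)
        # Record the counts before the pool shuts down and joins its threads.
        counts = (len(done), len(not_done))
    return counts


if __name__ == "__main__":
    print(sync([0.3, 0.3], timeout_seconds=0))     # fire and forget
    print(sync([0.01, 0.01], timeout_seconds=None))  # block until all are done
```

With a timeout of 0 the caller proceeds while the sync calls are still in flight, which is where the throughput gain comes from; with None the caller pays the full latency of the slowest worker.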

kouroshHakha · 2022-11-19T17:35:41Z

rllib/evaluation/worker_set.py

            def set_weight(w):
-                w.set_weights(ray.get(weights_ref), global_vars)
+                w.set_weights(weights, global_vars)


Would from_worker.get_weights(policies) return an object ref, or the actual weights?

If it's an object ref, would set_weight work for the case where from_worker is a remote worker? If so, do we test this behavior in a unit test?

from_worker has to be a local RolloutWorker, so weights here must be raw weights.

The reason we have from_worker is that, oftentimes, evaluation_worker_set doesn't have a local worker to sync from. In that case you need to sync weights from rollout_workers.local_worker() to evaluation_workers.remote_workers().

Also, the reason I got rid of ray.get/put here is that, when testing everything, I noticed some slight improvements if we don't force ray.put on every single weights dict. It seems Ray core may optimize this case: if all the remote workers are on the same instance, it may skip serialization and simply copy the data over. I still need to confirm this, though.
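The from_worker flow described above can be sketched in plain Python, without Ray. `FakeWorker` and this simplified `sync_weights` are hypothetical stand-ins, not the actual RLlib classes or signatures. They only illustrate reading raw weights from a local source worker and pushing them to every remote worker.

```python
class FakeWorker:
    """Hypothetical stand-in for a RolloutWorker; holds a dict of policy weights."""

    def __init__(self, weights=None):
        self.weights = dict(weights or {})
        self.global_vars = None

    def get_weights(self, policies=None):
        # Returns raw weights (not object refs), optionally filtered by policy ID.
        if policies is None:
            return dict(self.weights)
        return {pid: w for pid, w in self.weights.items() if pid in policies}

    def set_weights(self, weights, global_vars=None):
        self.weights.update(weights)
        if global_vars is not None:
            self.global_vars = global_vars


def sync_weights(local_worker, remote_workers, from_worker=None,
                 policies=None, global_vars=None):
    """Copy weights from from_worker (or local_worker) to all remote workers."""
    source = from_worker if from_worker is not None else local_worker
    if source is None:
        raise ValueError("No local worker and no from_worker to sync from.")
    weights = source.get_weights(policies)

    def set_weight(w):
        # Raw weights are passed directly, mirroring the diff in this PR.
        w.set_weights(weights, global_vars)

    for w in remote_workers:
        set_weight(w)
    return weights


if __name__ == "__main__":
    rollout_local = FakeWorker({"default_policy": [1.0, 2.0]})
    eval_remotes = [FakeWorker(), FakeWorker()]
    # The evaluation set has no local worker, so sync from the rollout set's.
    sync_weights(None, eval_remotes, from_worker=rollout_local,
                 global_vars={"timestep": 100})
    print([w.weights for w in eval_remotes])
```

In the real code the per-worker call is dispatched remotely; the loop here runs inline purely to keep the sketch self-contained.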


kouroshHakha · 2022-11-20T00:30:07Z

I confirmed that this PR fixes the A3C regression issues.

tensorboard link: https://tensorboard.dev/experiment/xx7RMLEjRZqv1hWeR3PkPA/#scalars&_smoothingWeight=0&tagFilter=reward_mean&runSelectionState=eyJhM2MtcG9uZ2RldGVybWluaXN0aWMtdjQtb2xkLXRmL0EzQ19Qb25nRGV0ZXJtaW5pc3RpYy12NF8wNzU2OF8wMDAwMF8wXzIwMjItMTEtMTlfMTEtMjYtMTciOmZhbHNlLCJhM2MtcG9uZ2RldGVybWluaXN0aWMtdjQtd2l0aC13ZWlnaHQtc3luYy0yLXRmL0EzQ19Qb25nRGV0ZXJtaW5pc3RpYy12NF9kNjRiY18wMDAwMF8wXzIwMjItMTEtMTlfMTQtMDItMjQiOmZhbHNlLCJhM2MtcG9uZ2RldGVybWluaXN0aWMtdjQtd2l0aC13ZWlnaHQtc3luYy10Zi9BM0NfUG9uZ0RldGVybWluaXN0aWMtdjRfMjM4OWVfMDAwMDBfMF8yMDIyLTExLTE5XzEyLTI0LTIxIjp0cnVlfQ%3D%3D

…default. (ray-project#30509) Also batch weight sync calls, and skip synching to local worker. Signed-off-by: Jun Gong <[email protected]> Signed-off-by: Weichen Xu <[email protected]>

[RLlib] To improve performance, do not wait for sync weight calls by …

fd0442f

…default. Signed-off-by: Jun Gong <[email protected]>

gjoliver assigned kouroshHakha Nov 19, 2022

gjoliver requested review from sven1977, avnishn, ArturNiederfahrenhorst, smorad, maxpumperla, kouroshHakha and krfricke as code owners November 19, 2022 11:34

kouroshHakha approved these changes Nov 19, 2022


Jun Gong added 2 commits November 19, 2022 10:50

lint

a46b14f

Signed-off-by: Jun Gong <[email protected]>

Batch weight sync calls. Skip synching to local worker.

c8bd405

Signed-off-by: Jun Gong <[email protected]>

kouroshHakha mentioned this pull request Nov 20, 2022

[RLlib] Fix A3C release tests (removing tf2 due to some shape mismatch error for eager mode) #30279

Merged

7 tasks

kouroshHakha added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Nov 20, 2022

gjoliver merged commit 3fa43e8 into ray-project:master Nov 20, 2022
