[RLlib] Enable cloud checkpointing. #47682

simonsays1980 · 2024-09-16T12:23:35Z

Why are these changes needed?

At the actual state RLlib cannot directly use cloud checkpointing or load checkpoints from the cloud, but needs to use ray.tune. However, ray.tune cannot load a checkpoint from the cloud, but can only create it. This PR adds fucntionality for cloud checkpointing and loading checkpoints/components from any filesystem supported by pyarrow.

The main functionality in how checkpointing, saving, restoring, or retrieving checkpoint info works remains as is, but uses instead of a basic filesystem an pyarrow.fs.FileSystem. This

Enables users to load and store checkpoints to cloud storage like S3, GCS, or ABS, even Hadoop systems
Enables users to use their related cloud accounts when storing to or loading from cloud.
Enables users to also load and store old stack checkpoints to/from cloud or other filesystems supported by pyarrow.

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

…tore to any PyArrow filesystem, i.e. epsecially GCS/S3/ABS/NFS. Signed-off-by: simonsays1980 <[email protected]>

…estore from any PyArrow filesystem, i.e. epsecially GCS/S3/ABS/NFS. Signed-off-by: simonsays1980 <[email protected]>

…ring paths before using PyArrow's filesystem detector. Furthermpore, added docstrings. Signed-off-by: simonsays1980 <[email protected]>

Signed-off-by: simonsays1980 <[email protected]>

sven1977 · 2024-09-16T12:43:59Z

rllib/utils/checkpoints.py

@@ -313,10 +345,21 @@ def restore_from_path(
                the subcomponent and thus, only that subcomponent's state is
                restored/loaded. All other state of `self` remains unchanged in this
                case.
+            filesystem: PyArrow FileSystem to use to access data at the path. If not
+                specified, this is inferred from the URI scheme.


Can we add more details here?
In particular: "What URI scheme?" :) The one provided in path?

I will add this in the next iteration. We should also add additional information about how to use cloud storage when the bucket is not public.

sven1977 · 2024-09-16T12:44:19Z

rllib/utils/checkpoints.py

@@ -313,10 +345,21 @@ def restore_from_path(
                the subcomponent and thus, only that subcomponent's state is
                restored/loaded. All other state of `self` remains unchanged in this
                case.
+            filesystem: PyArrow FileSystem to use to access data at the path. If not


Maybe use backticks around path to indicate that we mean the value of the path arg.

Good catch. Maybe I also turn "PyArrow" to pyarrow.

Signed-off-by: simonsays1980 <[email protected]>

sven1977

LGTM! Thanks for this PR @simonsays1980 !

Signed-off-by: simonsays1980 <[email protected]>

…#47670) Signed-off-by: Weichen Xu <[email protected]>

…ay-project#47673) While reading or writing files with Ray Data, S3 might raise a transient SERVICE_UNAVAILABLE error. This PR adds the error to the list of retried transient errors. Signed-off-by: Balaji Veeramani <[email protected]>

…ay-project#47680) Fixes ray-project#47679 --------- Signed-off-by: Balaji Veeramani <[email protected]>

## Why are these changes needed? Add some additional items to replica metadata and request context. --------- Signed-off-by: Cindy Zhang <[email protected]>

There's a regression with buffer size 10. I am going to investigate but I will revert it to buffer size 1 for now until further investigation. With buffer size 1, regression seems to be gone https://buildkite.com/ray-project/release/builds/22594#0191ed4b-5477-45ff-be9e-6e098b5fbb3c. probably some sort of contention or sth like that

After multi ref PR, we cannot just do await on returned value when it is multi ref output

…ect#47686)

``` REGRESSION 12.66%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.204885454613315 to 11.533423619760748 in microbenchmark.json REGRESSION 9.50%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 523.3469473257671 to 473.62862729568997 in microbenchmark.json REGRESSION 6.76%: multi_client_put_gigabytes (THROUGHPUT) regresses from 45.440179854469804 to 42.368678421213005 in microbenchmark.json REGRESSION 4.92%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8803.178389859915 to 8370.014425096557 in microbenchmark.json REGRESSION 3.89%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 2748.863962184806 to 2641.837605625889 in microbenchmark.json REGRESSION 3.45%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1019.3028285821217 to 984.156036006501 in microbenchmark.json REGRESSION 3.06%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1007.6444648899972 to 976.8103650114274 in microbenchmark.json REGRESSION 0.65%: placement_group_create/removal (THROUGHPUT) regresses from 805.1759941825478 to 799.9345402492929 in microbenchmark.json REGRESSION 0.33%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 5273.203424794718 to 5255.898134426729 in microbenchmark.json REGRESSION 0.02%: 1_1_actor_calls_async (THROUGHPUT) regresses from 9012.880467992636 to 9011.034048587637 in microbenchmark.json REGRESSION 0.01%: client__put_gigabytes (THROUGHPUT) regresses from 0.13947664668408546 to 0.13945791828216536 in microbenchmark.json REGRESSION 0.00%: client__put_calls (THROUGHPUT) regresses from 806.1974515278531 to 806.172478450918 in microbenchmark.json REGRESSION 70.55%: dashboard_p50_latency_ms (LATENCY) regresses from 104.211 to 177.731 in benchmarks/many_actors.json REGRESSION 13.13%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 18.961532712000007 to 21.451945214000006 in scalability/object_store.json REGRESSION 4.50%: 3000_returns_time (LATENCY) regresses from 5.680022101000006 to 5.935367576000004 in scalability/single_node.json REGRESSION 3.96%: avg_iteration_time (LATENCY) regresses from 0.9740754842758179 to 1.012664566040039 in stress_tests/stress_test_dead_actors.json REGRESSION 2.75%: stage_2_avg_iteration_time (LATENCY) regresses from 63.694758081436156 to 65.44879236221314 in stress_tests/stress_test_many_tasks.json REGRESSION 1.66%: 10000_args_time (LATENCY) regresses from 17.328640389999997 to 17.61703060299999 in scalability/single_node.json REGRESSION 1.40%: stage_4_spread (LATENCY) regresses from 0.45063567085147194 to 0.4569625792772166 in stress_tests/stress_test_many_tasks.json REGRESSION 0.69%: dashboard_p50_latency_ms (LATENCY) regresses from 3.347 to 3.37 in benchmarks/many_pgs.json REGRESSION 0.19%: 10000_get_time (LATENCY) regresses from 23.896780481999997 to 23.942006032999984 in scalability/single_node.json ``` Signed-off-by: kevin <[email protected]>

…6767) Signed-off-by: prithvi-mac <[email protected]>

Signed-off-by: simonsays1980 <[email protected]>

Signed-off-by: ujjawal-khare <[email protected]>

simonsays1980 added 5 commits September 13, 2024 15:27

Interchanged local filesystem with PyArrow filesystem to be able to s…

dd040de

…tore to any PyArrow filesystem, i.e. epsecially GCS/S3/ABS/NFS. Signed-off-by: simonsays1980 <[email protected]>

Interchanged local filesystem with PyArrow filesystem to be able to r…

1ce7acf

…estore from any PyArrow filesystem, i.e. epsecially GCS/S3/ABS/NFS. Signed-off-by: simonsays1980 <[email protected]>

Merge branch 'master' into enable-cloud-checkpointing

0c51f7e

Added filesystem to all subcomponent calls and added conversion to st…

6416f08

…ring paths before using PyArrow's filesystem detector. Furthermpore, added docstrings. Signed-off-by: simonsays1980 <[email protected]>

Added pyarrow FileSystem to 'from_checkpoint' and 'get_checkpoint_info'.

2367b56

Signed-off-by: simonsays1980 <[email protected]>

simonsays1980 added rllib RLlib related issues rllib-checkpointing-or-recovery An issue related to checkpointing/recovering RLlib Trainers. labels Sep 16, 2024

simonsays1980 marked this pull request as ready for review September 16, 2024 12:23

simonsays1980 requested review from sven1977 and ArturNiederfahrenhorst as code owners September 16, 2024 12:23

sven1977 changed the title ~~[RLlib] - Enable cloud checkpointing.~~ [RLlib] Enable cloud checkpointing. Sep 16, 2024

sven1977 reviewed Sep 16, 2024

View reviewed changes

Added suggestions from @sven1977's review and fixed a small path error.

147f1ea

Signed-off-by: simonsays1980 <[email protected]>

sven1977 approved these changes Sep 16, 2024

View reviewed changes

sven1977 enabled auto-merge (squash) September 16, 2024 16:12

github-actions bot added the go add ONLY when ready to merge, run all tests label Sep 16, 2024

simonsays1980 added 2 commits September 16, 2024 19:53

Fixed a unit test in 'checkpoint_utils'.

39b5619

Signed-off-by: simonsays1980 <[email protected]>

Fixed bug in doctests of 'rllib-learner.rst'.

8013d7b

Signed-off-by: simonsays1980 <[email protected]>

simonsays1980 requested review from maxpumperla and a team as code owners September 17, 2024 09:12

github-actions bot disabled auto-merge September 17, 2024 09:12

WeichenXu123 and others added 8 commits September 17, 2024 13:58

[spark] Refine comment in Starting ray worker spark task (ray-project…

9fa4ed9

…#47670) Signed-off-by: Weichen Xu <[email protected]>

[Data] Fix bug where Ray Data incorrectly emits progress bar warning (r…

05fd902

…ay-project#47680) Fixes ray-project#47679 --------- Signed-off-by: Balaji Veeramani <[email protected]>

[serve] Additional metadata and context (ray-project#47652)

3efd47f

## Why are these changes needed? Add some additional items to replica metadata and request context. --------- Signed-off-by: Cindy Zhang <[email protected]>

[core][aDAG] Fix microbenchmark regression adag 2 (ray-project#47683)

3929ce6

After multi ref PR, we cannot just do await on returned value when it is multi ref output

[RLlib] Fix/remove some CI tests and many_ppo release test. (ray-proj…

b52a38f

…ect#47686)

prithvi081099 and others added 2 commits September 17, 2024 13:58

[Core] Added spaces to disallowed char for working dir (ray-project#4…

05c866c

…6767) Signed-off-by: prithvi-mac <[email protected]>

Indented code in docs as CI tests were raising an error.

75ddea8

Signed-off-by: simonsays1980 <[email protected]>

simonsays1980 requested review from ericl, scv119, c21, amogkam, scottjlee, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners September 17, 2024 11:58

simonsays1980 added 2 commits September 25, 2024 09:53

Merged Master

7d35bf7

Signed-off-by: simonsays1980 <[email protected]>

Removed indentation in 'rllib-learner.rst'.

306df5b

Signed-off-by: simonsays1980 <[email protected]>

sven1977 merged commit a6cf9d7 into ray-project:master Sep 25, 2024
5 checks passed

ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024

[RLlib] Enable cloud checkpointing. (ray-project#47682)

9e25407

Signed-off-by: ujjawal-khare <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RLlib] Enable cloud checkpointing. #47682

[RLlib] Enable cloud checkpointing. #47682

simonsays1980 commented Sep 16, 2024 •

edited

Loading

sven1977 Sep 16, 2024

simonsays1980 Sep 16, 2024

sven1977 Sep 16, 2024

simonsays1980 Sep 16, 2024

sven1977 left a comment

[RLlib] Enable cloud checkpointing. #47682

[RLlib] Enable cloud checkpointing. #47682

Conversation

simonsays1980 commented Sep 16, 2024 • edited Loading

Why are these changes needed?

Related issue number

Checks

sven1977 Sep 16, 2024

Choose a reason for hiding this comment

simonsays1980 Sep 16, 2024

Choose a reason for hiding this comment

sven1977 Sep 16, 2024

Choose a reason for hiding this comment

simonsays1980 Sep 16, 2024

Choose a reason for hiding this comment

sven1977 left a comment

Choose a reason for hiding this comment

simonsays1980 commented Sep 16, 2024 •

edited

Loading