
[RLlib] Enable cloud checkpointing. #47682

Merged
merged 20 commits into ray-project:master on Sep 25, 2024

Conversation

Collaborator

@simonsays1980 simonsays1980 commented Sep 16, 2024

Why are these changes needed?

Currently, RLlib cannot directly use cloud checkpointing or load checkpoints from the cloud; it has to go through ray.tune, and ray.tune can only create cloud checkpoints, not load them. This PR adds functionality for cloud checkpointing and for loading checkpoints/components from any filesystem supported by pyarrow.

The main logic of how checkpointing, saving, restoring, and retrieving checkpoint info works remains as is, but it now operates on a pyarrow.fs.FileSystem instead of a plain local filesystem (see the usage sketch after the list below). This

  • Enables users to load and store checkpoints from/to cloud storage such as S3, GCS, or ABS, and even Hadoop (HDFS).
  • Enables users to use their cloud credentials when storing to or loading from cloud storage.
  • Enables users to also load and store old API stack checkpoints to/from cloud or any other filesystem supported by pyarrow.
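
A minimal usage sketch of the described workflow (not code from this PR): the bucket path is a placeholder, and it is assumed that the Algorithm exposes the Checkpointable API from this diff, with save_to_path as the save-side counterpart of restore_from_path.

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig().environment("CartPole-v1")
algo = config.build()
algo.train()

# Storing to cloud: the URI scheme ("s3://") selects the pyarrow filesystem.
algo.save_to_path("s3://my-bucket/rllib/ppo_cartpole")  # placeholder bucket

# Loading from cloud: a fresh Algorithm restores its full state from the URI.
restored = config.build()
restored.restore_from_path("s3://my-bucket/rllib/ppo_cartpole")
```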

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…tore to any PyArrow filesystem, i.e., especially GCS/S3/ABS/NFS.

Signed-off-by: simonsays1980 <[email protected]>
…estore from any PyArrow filesystem, i.e., especially GCS/S3/ABS/NFS.

Signed-off-by: simonsays1980 <[email protected]>
…ring paths before using PyArrow's filesystem detector. Furthermore, added docstrings.

Signed-off-by: simonsays1980 <[email protected]>
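
For context, a brief sketch (illustration only, not code from this PR) of what pyarrow's filesystem detection does with a URI versus a plain local path; the bucket name is a placeholder and the exact output may vary:

```python
import pyarrow.fs as pafs

# A cloud URI: the scheme selects the filesystem implementation, and the
# returned path is expressed relative to that filesystem.
fs, path = pafs.FileSystem.from_uri("s3://my-bucket/rllib/checkpoints")
print(type(fs).__name__, path)  # e.g. "S3FileSystem my-bucket/rllib/checkpoints"

# A plain local path resolves to the local filesystem.
fs, path = pafs.FileSystem.from_uri("/tmp/rllib/checkpoints")
print(type(fs).__name__, path)  # e.g. "LocalFileSystem /tmp/rllib/checkpoints"
```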
@simonsays1980 simonsays1980 added rllib RLlib related issues rllib-checkpointing-or-recovery An issue related to checkpointing/recovering RLlib Trainers. labels Sep 16, 2024
@simonsays1980 simonsays1980 marked this pull request as ready for review September 16, 2024 12:23
@sven1977 sven1977 changed the title [RLlib] - Enable cloud checkpointing. [RLlib] Enable cloud checkpointing. Sep 16, 2024
@@ -313,10 +345,21 @@ def restore_from_path(
the subcomponent and thus, only that subcomponent's state is
restored/loaded. All other state of `self` remains unchanged in this
case.
filesystem: PyArrow FileSystem to use to access data at the path. If not
specified, this is inferred from the URI scheme.
Contributor

Can we add more details here?
In particular: "What URI scheme?" :) The one provided in path?

Collaborator Author

I will add this in the next iteration. We should also add additional information about how to use cloud storage when the bucket is not public.
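
As a hedged illustration of the non-public-bucket case (not part of this diff): an explicit pyarrow filesystem can carry credentials and be passed via the new filesystem argument, in which case path is given without a URI scheme and resolved relative to that filesystem. Credentials, bucket, and checkpoint path below are placeholders, and it is assumed the Algorithm exposes the Checkpointable API from this diff.

```python
import pyarrow.fs as pafs
from ray.rllib.algorithms.ppo import PPOConfig

# Explicit filesystem carrying credentials for a private bucket (placeholders).
fs = pafs.S3FileSystem(
    access_key="AKIA...",
    secret_key="...",
    region="eu-central-1",
)

algo = PPOConfig().environment("CartPole-v1").build()
# With an explicit `filesystem`, no URI scheme is needed in `path`.
algo.restore_from_path("my-bucket/rllib/checkpoint_000010", filesystem=fs)
```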

@@ -313,10 +345,21 @@ def restore_from_path(
the subcomponent and thus, only that subcomponent's state is
restored/loaded. All other state of `self` remains unchanged in this
case.
filesystem: PyArrow FileSystem to use to access data at the path. If not
Contributor

Maybe use backticks around path to indicate that we mean the value of the path arg.

Collaborator Author

Good catch. Maybe I'll also turn "PyArrow" into pyarrow.

Contributor

@sven1977 sven1977 left a comment

LGTM! Thanks for this PR, @simonsays1980!

@sven1977 sven1977 enabled auto-merge (squash) September 16, 2024 16:12
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Sep 16, 2024
WeichenXu123 and others added 8 commits September 17, 2024 13:58
…ay-project#47673)

While reading or writing files with Ray Data, S3 might raise a transient SERVICE_UNAVAILABLE error. This PR adds the error to the list of retried transient errors.

Signed-off-by: Balaji Veeramani <[email protected]>
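
As a generic illustration (not Ray Data's internal retry code), transient errors such as S3's SERVICE_UNAVAILABLE are typically handled with retries and exponential backoff; the error matching below is deliberately simplified:

```python
import random
import time


def call_with_retries(fn, max_attempts=5, base_delay_s=1.0):
    """Retry `fn` on transient errors, with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except OSError as err:
            if "SERVICE_UNAVAILABLE" not in str(err) or attempt == max_attempts - 1:
                raise  # Non-transient error, or retries exhausted.
            time.sleep(base_delay_s * (2**attempt) + random.random())
```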
## Why are these changes needed?

Add some additional items to replica metadata and request context.

---------

Signed-off-by: Cindy Zhang <[email protected]>
There's a regression with buffer size 10. I will revert to buffer size 1 for now and investigate further.
With buffer size 1, the regression seems to be gone (https://buildkite.com/ray-project/release/builds/22594#0191ed4b-5477-45ff-be9e-6e098b5fbb3c); probably some sort of contention or similar.
After the multi-ref PR, we cannot simply await the returned value when it is a multi-ref output.
```
REGRESSION 12.66%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.204885454613315 to 11.533423619760748 in microbenchmark.json
REGRESSION 9.50%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 523.3469473257671 to 473.62862729568997 in microbenchmark.json
REGRESSION 6.76%: multi_client_put_gigabytes (THROUGHPUT) regresses from 45.440179854469804 to 42.368678421213005 in microbenchmark.json
REGRESSION 4.92%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8803.178389859915 to 8370.014425096557 in microbenchmark.json
REGRESSION 3.89%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 2748.863962184806 to 2641.837605625889 in microbenchmark.json
REGRESSION 3.45%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1019.3028285821217 to 984.156036006501 in microbenchmark.json
REGRESSION 3.06%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1007.6444648899972 to 976.8103650114274 in microbenchmark.json
REGRESSION 0.65%: placement_group_create/removal (THROUGHPUT) regresses from 805.1759941825478 to 799.9345402492929 in microbenchmark.json
REGRESSION 0.33%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 5273.203424794718 to 5255.898134426729 in microbenchmark.json
REGRESSION 0.02%: 1_1_actor_calls_async (THROUGHPUT) regresses from 9012.880467992636 to 9011.034048587637 in microbenchmark.json
REGRESSION 0.01%: client__put_gigabytes (THROUGHPUT) regresses from 0.13947664668408546 to 0.13945791828216536 in microbenchmark.json
REGRESSION 0.00%: client__put_calls (THROUGHPUT) regresses from 806.1974515278531 to 806.172478450918 in microbenchmark.json
REGRESSION 70.55%: dashboard_p50_latency_ms (LATENCY) regresses from 104.211 to 177.731 in benchmarks/many_actors.json
REGRESSION 13.13%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 18.961532712000007 to 21.451945214000006 in scalability/object_store.json
REGRESSION 4.50%: 3000_returns_time (LATENCY) regresses from 5.680022101000006 to 5.935367576000004 in scalability/single_node.json
REGRESSION 3.96%: avg_iteration_time (LATENCY) regresses from 0.9740754842758179 to 1.012664566040039 in stress_tests/stress_test_dead_actors.json
REGRESSION 2.75%: stage_2_avg_iteration_time (LATENCY) regresses from 63.694758081436156 to 65.44879236221314 in stress_tests/stress_test_many_tasks.json
REGRESSION 1.66%: 10000_args_time (LATENCY) regresses from 17.328640389999997 to 17.61703060299999 in scalability/single_node.json
REGRESSION 1.40%: stage_4_spread (LATENCY) regresses from 0.45063567085147194 to 0.4569625792772166 in stress_tests/stress_test_many_tasks.json
REGRESSION 0.69%: dashboard_p50_latency_ms (LATENCY) regresses from 3.347 to 3.37 in benchmarks/many_pgs.json
REGRESSION 0.19%: 10000_get_time (LATENCY) regresses from 23.896780481999997 to 23.942006032999984 in scalability/single_node.json
```

Signed-off-by: kevin <[email protected]>
@sven1977 sven1977 merged commit a6cf9d7 into ray-project:master Sep 25, 2024
5 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Labels
go add ONLY when ready to merge, run all tests rllib RLlib related issues rllib-checkpointing-or-recovery An issue related to checkpointing/recovering RLlib Trainers.