-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RLlib] Enable cloud checkpointing. #47682
[RLlib] Enable cloud checkpointing. #47682
Conversation
…tore to any PyArrow filesystem, i.e. epsecially GCS/S3/ABS/NFS. Signed-off-by: simonsays1980 <[email protected]>
…estore from any PyArrow filesystem, i.e. epsecially GCS/S3/ABS/NFS. Signed-off-by: simonsays1980 <[email protected]>
…ring paths before using PyArrow's filesystem detector. Furthermpore, added docstrings. Signed-off-by: simonsays1980 <[email protected]>
Signed-off-by: simonsays1980 <[email protected]>
rllib/utils/checkpoints.py
Outdated
@@ -313,10 +345,21 @@ def restore_from_path( | |||
the subcomponent and thus, only that subcomponent's state is | |||
restored/loaded. All other state of `self` remains unchanged in this | |||
case. | |||
filesystem: PyArrow FileSystem to use to access data at the path. If not | |||
specified, this is inferred from the URI scheme. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add more details here?
In particular: "What URI scheme?" :) The one provided in path
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add this in the next iteration. We should also add additional information about how to use cloud storage when the bucket is not public.
rllib/utils/checkpoints.py
Outdated
@@ -313,10 +345,21 @@ def restore_from_path( | |||
the subcomponent and thus, only that subcomponent's state is | |||
restored/loaded. All other state of `self` remains unchanged in this | |||
case. | |||
filesystem: PyArrow FileSystem to use to access data at the path. If not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use backticks around path
to indicate that we mean the value of the path
arg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Maybe I also turn "PyArrow" to pyarrow
.
Signed-off-by: simonsays1980 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for this PR @simonsays1980 !
Signed-off-by: simonsays1980 <[email protected]>
Signed-off-by: simonsays1980 <[email protected]>
…#47670) Signed-off-by: Weichen Xu <[email protected]>
…ay-project#47673) While reading or writing files with Ray Data, S3 might raise a transient SERVICE_UNAVAILABLE error. This PR adds the error to the list of retried transient errors. Signed-off-by: Balaji Veeramani <[email protected]>
…ay-project#47680) Fixes ray-project#47679 --------- Signed-off-by: Balaji Veeramani <[email protected]>
## Why are these changes needed? Add some additional items to replica metadata and request context. --------- Signed-off-by: Cindy Zhang <[email protected]>
There's a regression with buffer size 10. I am going to investigate but I will revert it to buffer size 1 for now until further investigation. With buffer size 1, regression seems to be gone https://buildkite.com/ray-project/release/builds/22594#0191ed4b-5477-45ff-be9e-6e098b5fbb3c. probably some sort of contention or sth like that
After multi ref PR, we cannot just do await on returned value when it is multi ref output
``` REGRESSION 12.66%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.204885454613315 to 11.533423619760748 in microbenchmark.json REGRESSION 9.50%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 523.3469473257671 to 473.62862729568997 in microbenchmark.json REGRESSION 6.76%: multi_client_put_gigabytes (THROUGHPUT) regresses from 45.440179854469804 to 42.368678421213005 in microbenchmark.json REGRESSION 4.92%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8803.178389859915 to 8370.014425096557 in microbenchmark.json REGRESSION 3.89%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 2748.863962184806 to 2641.837605625889 in microbenchmark.json REGRESSION 3.45%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1019.3028285821217 to 984.156036006501 in microbenchmark.json REGRESSION 3.06%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1007.6444648899972 to 976.8103650114274 in microbenchmark.json REGRESSION 0.65%: placement_group_create/removal (THROUGHPUT) regresses from 805.1759941825478 to 799.9345402492929 in microbenchmark.json REGRESSION 0.33%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 5273.203424794718 to 5255.898134426729 in microbenchmark.json REGRESSION 0.02%: 1_1_actor_calls_async (THROUGHPUT) regresses from 9012.880467992636 to 9011.034048587637 in microbenchmark.json REGRESSION 0.01%: client__put_gigabytes (THROUGHPUT) regresses from 0.13947664668408546 to 0.13945791828216536 in microbenchmark.json REGRESSION 0.00%: client__put_calls (THROUGHPUT) regresses from 806.1974515278531 to 806.172478450918 in microbenchmark.json REGRESSION 70.55%: dashboard_p50_latency_ms (LATENCY) regresses from 104.211 to 177.731 in benchmarks/many_actors.json REGRESSION 13.13%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 18.961532712000007 to 21.451945214000006 in scalability/object_store.json REGRESSION 4.50%: 3000_returns_time (LATENCY) regresses from 5.680022101000006 to 5.935367576000004 in scalability/single_node.json REGRESSION 3.96%: avg_iteration_time (LATENCY) regresses from 0.9740754842758179 to 1.012664566040039 in stress_tests/stress_test_dead_actors.json REGRESSION 2.75%: stage_2_avg_iteration_time (LATENCY) regresses from 63.694758081436156 to 65.44879236221314 in stress_tests/stress_test_many_tasks.json REGRESSION 1.66%: 10000_args_time (LATENCY) regresses from 17.328640389999997 to 17.61703060299999 in scalability/single_node.json REGRESSION 1.40%: stage_4_spread (LATENCY) regresses from 0.45063567085147194 to 0.4569625792772166 in stress_tests/stress_test_many_tasks.json REGRESSION 0.69%: dashboard_p50_latency_ms (LATENCY) regresses from 3.347 to 3.37 in benchmarks/many_pgs.json REGRESSION 0.19%: 10000_get_time (LATENCY) regresses from 23.896780481999997 to 23.942006032999984 in scalability/single_node.json ``` Signed-off-by: kevin <[email protected]>
…6767) Signed-off-by: prithvi-mac <[email protected]>
Signed-off-by: simonsays1980 <[email protected]>
Signed-off-by: simonsays1980 <[email protected]>
Signed-off-by: simonsays1980 <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
Why are these changes needed?
At the actual state RLlib cannot directly use cloud checkpointing or load checkpoints from the cloud, but needs to use
ray.tune
. However,ray.tune
cannot load a checkpoint from the cloud, but can only create it. This PR adds fucntionality for cloud checkpointing and loading checkpoints/components from any filesystem supported bypyarrow
.The main functionality in how checkpointing, saving, restoring, or retrieving checkpoint info works remains as is, but uses instead of a basic filesystem an
pyarrow.fs.FileSystem
. Thispyarrow
.Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.