[RLlib] Enable cloud checkpointing. #47682

Merged 20 commits on Sep 25, 2024

Commits on Sep 13, 2024

  1. Interchanged local filesystem with PyArrow filesystem to be able to store to any PyArrow filesystem, in particular GCS/S3/ABS/NFS.
    
    Signed-off-by: simonsays1980 <[email protected]>
    simonsays1980 committed Sep 13, 2024
    Commit dd040de
  2. Interchanged local filesystem with PyArrow filesystem to be able to restore from any PyArrow filesystem, in particular GCS/S3/ABS/NFS.
    
    Signed-off-by: simonsays1980 <[email protected]>
    simonsays1980 committed Sep 13, 2024
    Commit 1ce7acf

Commits on Sep 16, 2024

  1. Commit 0c51f7e
  2. Added filesystem to all subcomponent calls and added conversion to string paths before using PyArrow's filesystem detector. Furthermore, added docstrings.
    
    Signed-off-by: simonsays1980 <[email protected]>
    simonsays1980 committed Sep 16, 2024
    Commit 6416f08
  3. Commit 2367b56
  4. Commit 147f1ea
  5. Fixed a unit test in 'checkpoint_utils'.

    Signed-off-by: simonsays1980 <[email protected]>
    simonsays1980 committed Sep 16, 2024
    Commit 39b5619

Commits on Sep 17, 2024

  1. Fixed bug in doctests of 'rllib-learner.rst'.

    Signed-off-by: simonsays1980 <[email protected]>
    simonsays1980 committed Sep 17, 2024
    Commit 8013d7b
  2. Commit 9fa4ed9
  3. [Data] Add SERVICE_UNAVAILABLE to list of retried transient errors (ray-project#47673)
    
    While reading or writing files with Ray Data, S3 might raise a transient SERVICE_UNAVAILABLE error. This PR adds the error to the list of retried transient errors.
    
    Signed-off-by: Balaji Veeramani <[email protected]>
    bveeramani authored and simonsays1980 committed Sep 17, 2024
    Commit 8a4fe7a
  4. Commit 05fd902
  5. [serve] Additional metadata and context (ray-project#47652)

    ## Why are these changes needed?
    
    Add some additional items to replica metadata and request context.
    
    ---------
    
    Signed-off-by: Cindy Zhang <[email protected]>
    zcin authored and simonsays1980 committed Sep 17, 2024
    Commit 3efd47f
  6. [Core][aDAG] Set buffer size to 1 for regression (ray-project#47639)

    There is a regression with buffer size 10. Pending further investigation, this reverts the buffer size to 1.
    With buffer size 1, the regression seems to be gone: https://buildkite.com/ray-project/release/builds/22594#0191ed4b-5477-45ff-be9e-6e098b5fbb3c. The cause is probably some form of contention.
    rkooo567 authored and simonsays1980 committed Sep 17, 2024
    Commit 2216f2d
  7. [core][aDAG] Fix microbenchmark regression adag 2 (ray-project#47683)

    After the multi-ref PR, we can no longer simply await the returned value when it is a multi-ref output.
    rkooo567 authored and simonsays1980 committed Sep 17, 2024
    Commit 3929ce6
  8. Commit b52a38f
  9. Add perf metrics for 2.36.0 (ray-project#47574)

    ```
    REGRESSION 12.66%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.204885454613315 to 11.533423619760748 in microbenchmark.json
    REGRESSION 9.50%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 523.3469473257671 to 473.62862729568997 in microbenchmark.json
    REGRESSION 6.76%: multi_client_put_gigabytes (THROUGHPUT) regresses from 45.440179854469804 to 42.368678421213005 in microbenchmark.json
    REGRESSION 4.92%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8803.178389859915 to 8370.014425096557 in microbenchmark.json
    REGRESSION 3.89%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 2748.863962184806 to 2641.837605625889 in microbenchmark.json
    REGRESSION 3.45%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1019.3028285821217 to 984.156036006501 in microbenchmark.json
    REGRESSION 3.06%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1007.6444648899972 to 976.8103650114274 in microbenchmark.json
    REGRESSION 0.65%: placement_group_create/removal (THROUGHPUT) regresses from 805.1759941825478 to 799.9345402492929 in microbenchmark.json
    REGRESSION 0.33%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 5273.203424794718 to 5255.898134426729 in microbenchmark.json
    REGRESSION 0.02%: 1_1_actor_calls_async (THROUGHPUT) regresses from 9012.880467992636 to 9011.034048587637 in microbenchmark.json
    REGRESSION 0.01%: client__put_gigabytes (THROUGHPUT) regresses from 0.13947664668408546 to 0.13945791828216536 in microbenchmark.json
    REGRESSION 0.00%: client__put_calls (THROUGHPUT) regresses from 806.1974515278531 to 806.172478450918 in microbenchmark.json
    REGRESSION 70.55%: dashboard_p50_latency_ms (LATENCY) regresses from 104.211 to 177.731 in benchmarks/many_actors.json
    REGRESSION 13.13%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 18.961532712000007 to 21.451945214000006 in scalability/object_store.json
    REGRESSION 4.50%: 3000_returns_time (LATENCY) regresses from 5.680022101000006 to 5.935367576000004 in scalability/single_node.json
    REGRESSION 3.96%: avg_iteration_time (LATENCY) regresses from 0.9740754842758179 to 1.012664566040039 in stress_tests/stress_test_dead_actors.json
    REGRESSION 2.75%: stage_2_avg_iteration_time (LATENCY) regresses from 63.694758081436156 to 65.44879236221314 in stress_tests/stress_test_many_tasks.json
    REGRESSION 1.66%: 10000_args_time (LATENCY) regresses from 17.328640389999997 to 17.61703060299999 in scalability/single_node.json
    REGRESSION 1.40%: stage_4_spread (LATENCY) regresses from 0.45063567085147194 to 0.4569625792772166 in stress_tests/stress_test_many_tasks.json
    REGRESSION 0.69%: dashboard_p50_latency_ms (LATENCY) regresses from 3.347 to 3.37 in benchmarks/many_pgs.json
    REGRESSION 0.19%: 10000_get_time (LATENCY) regresses from 23.896780481999997 to 23.942006032999984 in scalability/single_node.json
    ```
    
    Signed-off-by: kevin <[email protected]>
    khluu authored and simonsays1980 committed Sep 17, 2024
    Commit d5f1a01
  10. Commit 05c866c
  11. Indented code in docs as CI tests were raising an error.

    Signed-off-by: simonsays1980 <[email protected]>
    simonsays1980 committed Sep 17, 2024
    Commit 75ddea8
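
The `[Data]` commit in the list above adds `SERVICE_UNAVAILABLE` to a list of transient errors that are retried. As a rough illustration of that pattern only, the sketch below retries a call when the error message contains a known marker. The marker tuple, `call_with_retry`, the backoff values, and the error text are all hypothetical and are not taken from Ray Data.

```python
import time

# Hypothetical markers: an error whose message contains one of these
# substrings is treated as transient and retried.
TRANSIENT_ERROR_MARKERS = ("SERVICE_UNAVAILABLE",)


def call_with_retry(fn, *, max_attempts=4, base_delay_s=0.01, sleep=time.sleep):
    """Call `fn`, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as err:
            transient = any(m in str(err) for m in TRANSIENT_ERROR_MARKERS)
            # Re-raise non-transient errors and the final failed attempt.
            if not transient or attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))


# Simulated read that fails twice with a transient error, then succeeds.
attempts = {"count": 0}


def flaky_read():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise OSError("AWS Error SERVICE_UNAVAILABLE (simulated)")
    return "data"


result = call_with_retry(flaky_read)
```

Matching on message substrings keeps the retry list easy to extend, at the cost of depending on the exact wording of the underlying error.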

Commits on Sep 25, 2024

  1. Merged Master

    Signed-off-by: simonsays1980 <[email protected]>
    simonsays1980 committed Sep 25, 2024
    Commit 7d35bf7
  2. Removed indentation in 'rllib-learner.rst'.

    Signed-off-by: simonsays1980 <[email protected]>
    simonsays1980 committed Sep 25, 2024
    Commit 306df5b