[RLlib] Remove return_info from reset() in pettingzoo_env.py. #33470

Closed
wants to merge 959 commits into from
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Apr 22, 2023

  1. Fix tensorarray to numpy conversion (ray-project#34115)

    * Revert "[Datasets] Revert "Enable streaming executor by default (ray-project#32493)" (ray-project#33485)"
    
    This reverts commit 5c79954.
    
    * Fix tensorarray to numpy conversion
    
    Signed-off-by: elliottower <[email protected]>
    jianoaix authored and elliottower committed Apr 22, 2023
    982a4c0
  2. [data] Fix test failure caused by lack of ordering in default streaming executor (ray-project#34120)
    
    * Revert "[Datasets] Revert "Enable streaming executor by default (ray-project#32493)" (ray-project#33485)"
    
    This reverts commit 5c79954.
    
    * Fix test failure caused by lack of ordering in default streaming executor
    
    Signed-off-by: elliottower <[email protected]>
    jianoaix authored and elliottower committed Apr 22, 2023
    c8c2471
  3. [ci/release] Add more GCE variants for tests (ray-project#34046)

    cluster_tune_scale_up_down
    long_running_horovod_tune_test
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    9f35260
  4. [CI] Migrate many_actors and many_tasks to v2 (ray-project#34123)

    Even though there is a perf regression on the v2 stack, at least the tests can run there. Currently, starting 65 nodes has a very low success rate on the v1 stack.
    
    Signed-off-by: elliottower <[email protected]>
    jjyao authored and elliottower committed Apr 22, 2023
    c15a66f
  5. [Java] Don't load cpp library in dev model. (ray-project#33667)

    Don't load the cpp library in dev mode, because it causes an error when nativeGetSystemConfig is invoked in local mode on the Mac.
    
    Co-authored-by: XiaodongLv <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    2 people authored and elliottower committed Apr 22, 2023
    47ece52
  6. [Serve] Fix standalone3 tests (ray-project#34100)

    The environment is not inherited by subprocesses on Windows, so we need to explicitly inject the environment variables.
    
    We didn't catch this before because the test was inside standalone2.py, which is ignored on Windows.
    
    The test passes on Windows:
    ```
    //python/ray/serve:test_standalone3                                      PASSED in 207.6s
    ```
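
The fix described in this commit comes down to building the child's environment explicitly instead of relying on inheritance. A minimal sketch of that pattern, assuming a hypothetical variable name `SERVE_TEST_FLAG` (the commit message does not say which variables the test injects):

```python
import os
import subprocess
import sys


def run_child_with_env(extra_env):
    """Run a child Python process with an explicitly constructed environment."""
    # Copy the parent's environment and add the variables the child needs,
    # rather than relying on implicit inheritance.
    env = dict(os.environ)
    env.update(extra_env)
    result = subprocess.run(
        # SERVE_TEST_FLAG is a hypothetical name used only for illustration.
        [sys.executable, "-c", "import os; print(os.environ['SERVE_TEST_FLAG'])"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Starting from a copy of `os.environ` keeps variables such as `PATH` available to the child while still making the injected values explicit.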
    
    Signed-off-by: elliottower <[email protected]>
    sihanwang41 authored and elliottower committed Apr 22, 2023
    6969f86
  7. [docs] fix nav (ray-project#34133)

    Signed-off-by: elliottower <[email protected]>
    maxpumperla authored and elliottower committed Apr 22, 2023
    f2c5b1c
  8. [CI][Clean][2] Make s3 function names more agnostic (ray-project#33944)

    Some existing functions work with both S3 and GS but have the word s3 in their names. Refactor those. Also create constants for commonly used values, reduce duplication, etc.
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    5e6ca87
  9. [CI][GCE/5] Add GCE variations of GPU tests (ray-project#33946)

    Add GCE variations of GPU tests. Two key things need to change for GPU tests to work:
    - Better concurrency control for GPU tests in GCE. GCE has a low GPU quota, and between Ray start-up and autoscaling, jobs competing for resources tend to deadlock. With better concurrency control, they can all run successfully.
    - The 'dataset_shuffle_push_based_random_shuffle_100tb' test requires 400TB of storage in the cluster. GCE, however, currently has only 200TB, so I changed this test to run with 50TB of data in GCE (for now).
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    3b4d7cf
  10. [Data] Add output_arrow_format to from_items (ray-project#33837)

    DelegatingBlockBuilder does not have consistent behavior for dict inputs. It attempts to create an Arrow block, but will fall back to SimpleBlock if that fails.
    
    That has led to silent behavior changes such as ray-project#33789.
    
    In this PR, we add a flag to explicitly force Arrow block.
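
The difference between a silent fallback and an explicit flag can be illustrated without Arrow. The sketch below is not the DelegatingBlockBuilder implementation: `build_block`, `force_columnar`, and the "columnar" stand-in format are hypothetical names chosen for illustration.

```python
def _to_columnar(rows):
    """Convert dict rows to a dict of columns; raise if the rows don't line up."""
    rows = list(rows)
    if not rows or not all(isinstance(r, dict) for r in rows):
        raise TypeError("columnar blocks require dict rows")
    keys = list(rows[0])
    if any(set(r) != set(keys) for r in rows):
        raise ValueError("all rows must have the same keys")
    return {k: [r[k] for r in rows] for k in keys}


def build_block(rows, force_columnar=False):
    """Try the columnar format first; fall back to a plain list unless forced."""
    rows = list(rows)
    try:
        return ("columnar", _to_columnar(rows))
    except (TypeError, ValueError):
        if force_columnar:
            # With the flag set, the failure surfaces instead of silently
            # changing the output format.
            raise
        return ("simple", rows)
```

With the flag off, a caller can only detect the fallback by inspecting the result; with the flag on, inconsistent input fails immediately.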
    
    ---------
    
    Signed-off-by: amogkam <[email protected]>
    Signed-off-by: Amog Kamsetty <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    amogkam authored and elliottower committed Apr 22, 2023
    daf546e
  11. [data] Add streaming execution documentation (ray-project#33941)

    * Revert "[Datasets] Revert "Enable streaming executor by default (ray-project#32493)" (ray-project#33485)"
    
    This reverts commit 5c79954.
    
    * Add streaming execution documentation
    
    * fix
    
    * feedback
    
    * remove new file
    
    * fix
    
    * fix
    
    * key concept
    
    * fix
    
    * fix
    
    * fix
    
    * wording
    
    * feedback
    
    Signed-off-by: elliottower <[email protected]>
    jianoaix authored and elliottower committed Apr 22, 2023
    235b314
  12. [data] Remove datasets github workflow (ray-project#34138)

    This was added in ray-project#26127, but never successfully worked due to missing credentials.
    
    Signed-off-by: Matthew Deng <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    matthewdeng authored and elliottower committed Apr 22, 2023
    0d0eed9
  13. [data] Add pydoc for ExecutionOptions (ray-project#34144)

    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    5c25a74
  14. 6728d79
  15. [Datasets] Improve formatting of DatasetStatsSummary, `StageStatsSummary`, `IterStatsSummary` (ray-project#34119)
    
    Similar to the Dataset.repr formatting improvements in ray-project#32722, improve the readability of DatasetStatsSummary, StageStatsSummary, IterStatsSummary when printed. See the included test case for examples.
    
    ---------
    
    Signed-off-by: Scott Lee <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    scottjlee authored and elliottower committed Apr 22, 2023
    3cc91d0
  16. [RLlib] checkpoint learner (ray-project#33598)

    Signed-off-by: Avnish <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    avnishn authored and elliottower committed Apr 22, 2023
    64b500b
  17. [ray.util.spark] Add warning if webui_url is None. optional dependencies for dashboard server might be missing. (ray-project#33521) (ray-project#34026)
    
    Just trying to be helpful and give distracted people like me a potential reason why the dashboard is not available.
    
    Signed-off-by: elliottower <[email protected]>
    ftrifoglio authored and elliottower committed Apr 22, 2023
    976d114
  18. [CI] Remove microbenchmark_staging (ray-project#34154)

    It duplicates microbenchmark.
    
    Signed-off-by: elliottower <[email protected]>
    jjyao authored and elliottower committed Apr 22, 2023
    91bdfb5
  19. [RLlib][Docs] Added RLModule user-guide to the docs (ray-project#33909)

    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    kouroshHakha authored and elliottower committed Apr 22, 2023
    2698ef8
  20. [RLlib] DreamerV3: Catalog enhancements (MLP/CNN encoders/heads completed and unified across DL frameworks). (ray-project#33967)
    
    Signed-off-by: elliottower <[email protected]>
    sven1977 authored and elliottower committed Apr 22, 2023
    279a7d6
  21. [Doc] Add Ray core fault tolerance guide for GCS and node (ray-project#33446)
    
    - Add fault tolerance guide for gcs and ray node
    - Remove dead RAY_num_heartbeats_timeout
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    jjyao authored and elliottower committed Apr 22, 2023
    95e0cca
  22. [RLlib] Change handling of try reset to support ASYNC_RESET_RETURN (ray-project#33874)
    
    If a user is using remote base envs, then when reset/try_reset is called on the
    env, it returns the constant "async_reset_return".
    Our error handler for resets in the env runner v2 didn't catch this because
    it assumes that returns from try_reset are multi-env dicts.
    
    Generally speaking, we don't have good test coverage on the remote base env,
    and we frankly don't plan to, as it isn't an API that we plan on supporting in
    future releases. However, in the meantime we'll patch this bug because
    a user brought it up as an issue affecting them.
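
The failure mode here is a handler that treats every return value as a dict. A simplified sketch of checking for the sentinel first follows; `handle_reset_result` is a hypothetical helper, and the real return shape of try_reset in RLlib is richer than the plain dict assumed here.

```python
# The constant value is taken from the commit message above.
ASYNC_RESET_RETURN = "async_reset_return"


def handle_reset_result(result):
    """Return per-env observations, or None if the reset completes asynchronously."""
    # Check for the sentinel before assuming a multi-env dict.
    if isinstance(result, str) and result == ASYNC_RESET_RETURN:
        return None
    if not isinstance(result, dict):
        raise TypeError(f"unexpected reset result: {result!r}")
    # Surface any per-env failure (assumption: failures arrive as exception values).
    for env_id, obs in result.items():
        if isinstance(obs, Exception):
            raise RuntimeError(f"reset failed for env {env_id}") from obs
    return result
```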
    
    Signed-off-by: Avnish <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    avnishn authored and elliottower committed Apr 22, 2023
    ec3c102
  23. [core] Improve the workflow finding Redis leader. (ray-project#34108)

    The current way of finding the Redis leader sometimes gives erroneous information, mainly because the IP address is not resolved.
    
    If it connects to the master during the initialization stage, it uses the passed-in address as the leader, which might later make Ray pick a follower Redis.
    
    This PR fixes the issue and also uses the same approach redis-cli uses to pick the leader.
    
    The PR makes the checking stricter to give better error messages. The new protocol is as follows:
    
    - use boost to resolve the ip address from domain name.
    - connect to the first ip address
    - if it's cluster mode,
      - make sure it's healthy; make sure only 1 shard
      - send a dummy write and check the return
        - if return OK, use the ip address directly
        - otherwise, use the one mentioned in the error message
    - if not cluster mode, just use the ip address
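
The decision steps in the list above can be sketched in a few lines. This is an illustration rather than the C++ code in the PR: name resolution is assumed to have already happened, `pick_leader` and its parameters are hypothetical, and the error reply is assumed to follow the Redis Cluster `MOVED <slot> <ip>:<port>` format.

```python
def parse_moved_address(error_message):
    """Extract 'ip:port' from a Redis Cluster 'MOVED <slot> <ip>:<port>' reply."""
    parts = error_message.split()
    if len(parts) == 3 and parts[0] == "MOVED":
        return parts[2]
    raise ValueError(f"unexpected reply to dummy write: {error_message!r}")


def pick_leader(resolved_addresses, is_cluster, healthy, shard_count, dummy_write):
    """Apply the protocol above to already-resolved 'ip:port' addresses."""
    if not resolved_addresses:
        raise ValueError("no addresses resolved")
    first = resolved_addresses[0]
    if not is_cluster:
        # Not cluster mode: just use the address.
        return first
    if not healthy:
        raise RuntimeError("Redis cluster is not healthy")
    if shard_count != 1:
        raise RuntimeError(f"expected exactly 1 shard, got {shard_count}")
    # dummy_write is a caller-supplied callable that performs the dummy write
    # and returns either "OK" or the error text.
    reply = dummy_write(first)
    if reply == "OK":
        return first
    return parse_moved_address(reply)
```

Passing the write as a callable keeps the selection logic testable without a running Redis instance.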
    
    This PR also includes refactoring: connection-related information is moved to the Redis context.
    
    Signed-off-by: elliottower <[email protected]>
    fishbone authored and elliottower committed Apr 22, 2023
    8afd070
  24. [release] fix tune_scalability_network_overhead by adding `--smoke-test`. (ray-project#34167)
    
    Signed-off-by: xwjiang2010 <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    xwjiang2010 authored and elliottower committed Apr 22, 2023
    2188ad2
  25. Revert "Global logging format changes" (ray-project#34126)

    After manual bisection, I think this PR may be causing the "Documentation" tests to fail.
    
    The failure was previously masked by an actual failing doctest, but after this commit, actor outputs clutter the doctests and lead to mismatches in expected and actual output.
    
    Let's see if reverting fixes these problems.
    
    Reverts ray-project#32741
    
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    c5bc0a5
  26. [CoreWorker] Partially address Ray child process leaks by killing all child processes in the CoreWorker shutdown sequence. (ray-project#33976)
    
    We kill all child processes when a Ray worker process exits. This addresses process leaks that caused GPU OOM errors in ray-project#31451. There is some risk to this PR, particularly if Ray users rely on Ray's existing behavior of leaking processes. We don't know of any such user, but we provide a new flag RAY_kill_child_processes_on_worker_exit to provide a workaround in case someone is impacted.
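
The general idea of cleaning up children during shutdown can be sketched as follows. This is not how the CoreWorker does it: the hypothetical `Worker` class below only tracks processes it spawned itself, whereas the PR describes killing all child processes of the worker.

```python
import subprocess
import sys


class Worker:
    """Toy worker that kills the child processes it started when it shuts down."""

    def __init__(self):
        self._children = []

    def spawn(self, args):
        child = subprocess.Popen(args)
        self._children.append(child)
        return child

    def shutdown(self, timeout=5.0):
        # Kill every child that is still running, then reap each one so
        # no zombie processes are left behind.
        for child in self._children:
            if child.poll() is None:
                child.kill()
        for child in self._children:
            child.wait(timeout=timeout)
        self._children.clear()
```

A usage example: `Worker().spawn([sys.executable, "-c", "import time; time.sleep(60)"])` followed by `shutdown()` ends the sleeping child instead of leaving it running.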
    
    Signed-off-by: elliottower <[email protected]>
    cadedaniel authored and elliottower committed Apr 22, 2023
    6b6a7a6
  27. [data] Rename .cache() to .materialize() (ray-project#34169)

    Based on discussion with @c21 @jjyao , as well as the new "MaterializedDatastream" class name, materialize makes more sense as an action than cache. Furthermore, we don't need an is_cached method as the new type information suffices.
    
    We will have to pick a variation of this PR into 2.4 as well, which introduces the original fully_executed() -> cache() rename.
    
    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    ff0aa1c
  28. [Dataset] Add FromXXX operators (ray-project#32959)

    Signed-off-by: elliottower <[email protected]>
    scottjlee authored and elliottower committed Apr 22, 2023
    c3d4358
  29. [data] [streaming] Simplify progress bar reporting and integrate with Jupyter (ray-project#34150)
    
    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    518e0cf
  30. [Data] Fix '_unwrap_protocol' for Windows systems (ray-project#31296)

    The `_unwrap_protocol` method uses the `urllib.parse.urlparse` library function to split out the path and protocol. On Windows however this function returns the path with a `/` added before the drive letter. This type of path can't be used by any other functions. The solution is to strip the `/`. The logic is similar to what is used in the `pip` package, see [here](https://github.com/pypa/pip/blob/22.3.1/src/pip/_internal/utils/urls.py#L49).
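
The behavior described above is easy to reproduce on any platform, because `urlparse` itself is platform independent. The following is a sketch of the idea, not Ray's `_unwrap_protocol`: the function name and the `is_windows` parameter are hypothetical and exist so the Windows branch can be exercised anywhere.

```python
import re
from urllib.parse import urlparse

# Matches a path such as "/C:/data": a slash before a Windows drive letter.
_LEADING_SLASH_DRIVE = re.compile(r"^/[A-Za-z]:")


def unwrap_protocol_sketch(uri, is_windows):
    """Strip the scheme from a URI and return the remaining path."""
    parsed = urlparse(uri, allow_fragments=False)
    path = parsed.netloc + parsed.path
    if is_windows and _LEADING_SLASH_DRIVE.match(path):
        # urlparse yields "/C:/..." for "file:///C:/..."; drop the extra slash.
        path = path[1:]
    return path
```

For example, `file:///C:/data/train.csv` yields `/C:/data/train.csv` without the fix and `C:/data/train.csv` with it.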
    
    Signed-off-by: Jeroen Bédorf <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    jbedorf authored and elliottower committed Apr 22, 2023
    54403c6
  31. [Java] Update coding style for RuntimeEnvTest.java (ray-project#34160)

    Update coding style for RuntimeEnvTest.java
    
    Co-authored-by: XiaodongLv <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    2 people authored and elliottower committed Apr 22, 2023
    aa8c5d9
  32. [Java] make java worker log file prefix (default to java-worker) configurable (ray-project#33797)
    
    For the Java worker, the log file is always prefixed with "java-worker", and the Python log_monitor.py hardcodes "java-worker*.log" as the pattern to poll periodically for new log messages. Some configs, like log_to_driver and RAY_BACKEND_LOG_LEVEL, don't prevent the log monitor from polling and publishing logs to GCS. To save CPU cycles and network bandwidth, especially when the JVM produces a large volume of logs, we can add an option, such as a JVM system property, to set the log file prefix for the Java worker instead of hard-coding it to "java-worker".
    
    Signed-off-by: elliottower <[email protected]>
    jiafuzha authored and elliottower committed Apr 22, 2023
    7210c3d
  33. [core] Fix std::move without std namespace (ray-project#34149)

    This is preventing builds on newer Macs.
    Signed-off-by: rickyyx <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    rickyyx authored and elliottower committed Apr 22, 2023
    2b79733
  34. [serve] Document multi-app support (ray-project#33496)

    Documentation for Serve multi-application support.
    
    - Separates the Serve REST API page into V1 (single-application) and V2 (multi-application) REST API
    - Adds API ref pages for all config schemas
      - `ServeDeploySchema` - top level multi-application config
      - `HTTPOptionsSchema` - options to start the HTTP Proxy with
      - `ServeApplicationSchema` - single-application config
      - `DeploymentSchema` - deployment override options
      - `RayActorOptionsSchema` - options to start a replica actor with
    <img width="780" alt="image" src="https://user-images.githubusercontent.com/15851518/228297681-1f777219-8694-44e1-ad85-30a5a993e6e6.png">
    
    - Adds API ref pages for all response schemas returned from GET endpoints
      - `ServeStatusSchema` - response format of old endpoint `GET /api/serve/deployments/status`
      - `ServeInstanceDetails` and all its sub-schemas - response format of new endpoint `GET /api/serve/applications/`
    <img width="786" alt="Screen Shot 2023-03-28 at 8 35 27 AM" src="https://user-images.githubusercontent.com/15851518/228290829-b100b373-9951-4a74-b84b-646f20d7803d.png">
    
    - Adds a user guide called "Deploying Multiple Serve Applications" under "User Guides" that covers using the serve CLI to interact with multiple applications.
    
    Signed-off-by: elliottower <[email protected]>
    zcin authored and elliottower committed Apr 22, 2023
    07cb238
  35. [RLlib] Add a flag to allow disabling initialize_loss_from_dummy_batch logit. (ray-project#34208)
    
    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    kouroshHakha authored and elliottower committed Apr 22, 2023
    746578e
  36. [RLlib][RLModule] Abstract the build stage of RLModule to make them more extendable (ray-project#34205)
    
    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    kouroshHakha authored and elliottower committed Apr 22, 2023
    7d1420f
  37. Update codeowners (ray-project#34214)

    Signed-off-by: Shreyas Krishnaswamy <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    shrekris-anyscale authored and elliottower committed Apr 22, 2023
    6892eb8
  38. [core] Mark raylet unhealthy if GCS can't recognize it. (ray-project#34087)
    
    When GCS can't recognize the raylet, the raylet just hangs and never exits. There is also no way to tell whether this raylet is healthy or not.
    
    This can happen with an incorrect setup, for example when data is lost in the DB. When the raylet detects the issue, it should either exit or mark itself as unhealthy.
    
    This PR makes the raylet mark itself unhealthy, and the upper layer can choose what to do in this case. This is useful for the Serve HA use case because, as long as the raylet is alive, the actors can still serve traffic and the upper layer can perform further operations, such as starting a new cluster and shutting down the current one later.
    
    Signed-off-by: elliottower <[email protected]>
    fishbone authored and elliottower committed Apr 22, 2023
    b9df7dc
  39. [Data] Improve state initialization for ActorPoolMapOperator (ray-project#34037)
    
    ActorPoolMapOperator takes in a Callable class which initializes some state to be reused for every batch.
    
    In the current implementation, this state is initialized on the first batch, rather than during actor init.
    
    In this PR, we separate the state initialization and actually call it during Actor init. This allows state to be initialized for fixed size actor pools, even when tasks are not ready to be dispatched for better pipelining. It also supports using multithreaded actors, so state gets initialized once per actor instead of once per thread.
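
The difference between initializing state on the first batch and initializing it during actor init can be shown with two toy workers. All class names below are hypothetical stand-ins; they are not the ActorPoolMapOperator internals.

```python
class AddOffset:
    """Example callable class whose constructor stands in for expensive setup."""

    instances_created = 0

    def __init__(self):
        AddOffset.instances_created += 1
        self.offset = 10

    def __call__(self, batch):
        return [x + self.offset for x in batch]


class LazyInitWorker:
    """Old behavior: state is built when the first batch arrives."""

    def __init__(self, fn_cls):
        self._fn_cls = fn_cls
        self._fn = None

    def process(self, batch):
        if self._fn is None:
            self._fn = self._fn_cls()
        return self._fn(batch)


class EagerInitWorker:
    """New behavior: state is built once, when the worker itself is created."""

    def __init__(self, fn_cls):
        self._fn = fn_cls()

    def process(self, batch):
        return self._fn(batch)
```

With eager initialization the setup cost is paid before any batch is dispatched, which is what lets a fixed-size pool warm up while upstream tasks are still running.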
    
    ---------
    
    Signed-off-by: amogkam <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    amogkam authored and elliottower committed Apr 22, 2023
    b480937
  40. [ci] No early kickoff by default for workflow test (ray-project#34213)

    Signed-off-by: rickyyx <[email protected]>
    
    We have seen the workflow test fail on PRs with totally unrelated content because of the reuse of the cached Docker image.
    - ray-project#34101
    
    It seems the workflow test does depend on the built wheels.
    
    Signed-off-by: elliottower <[email protected]>
    rickyyx authored and elliottower committed Apr 22, 2023
    d49858e
  41. don't trigger execution in ipython repr (ray-project#34219)

    Signed-off-by: Scott Lee <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    scottjlee authored and elliottower committed Apr 22, 2023
    a41264b
  42. [doc] [data] Fix autosummary issues (ray-project#34220)

    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    98632f1
  43. Revert "[doc] [data] Fix autosummary issues (ray-project#34220)" (ray…

    …-project#34227)
    
    This reverts commit 45b9067.
    
    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    d1ac0c9
  44. cac4fdc
  45. [core] Fix the placement group stress test regression. (ray-project#34192)
    
    Signed-off-by: Yi Cheng <[email protected]>
    
    The regression is caused by enabling the Ray syncer. In the code, whenever a PG is created or deleted, the raylet actively sends a message to GCS; this puts a lot of load on GCS and makes the code run slowly.
    If the Ray syncer is disabled, the raylet neither creates this message nor sends it to GCS.
    
    There is no need to do this, because when a new resource is added to the local node, the Ray syncer notices it and the resource is pushed to GCS after 100ms.
    
    This PR deletes this logic and thus fixes the regression.
    
    ```
    before: placement group create/removal per second 1271.32 +- 8.27
    after: placement group create/removal per second 1282.83 +- 3.99
    ```
    
    For release test:
    ```
    perf_metrics = [{'perf_metric_name': 'pgs_per_second', 'perf_metric_value': 17.061243668170643, 'perf_metric_type': 'THROUGHPUT'}, {'perf_metric_name': 'dashboard_p50_latency_ms', 'perf_metric_value': 3.261, 'perf_metric_type': 'LATENCY'}, {'perf_metric_name': 'dashboard_p95_latency_ms', 'perf_metric_value': 129.682, 'perf_metric_type': 'LATENCY'}, {'perf_metric_name': 'dashboard_p99_latency_ms', 'perf_metric_value': 141.648, 'perf_metric_type': 'LATENCY'}]
    ```
    
    Signed-off-by: elliottower <[email protected]>
    fishbone authored and elliottower committed Apr 22, 2023
    7ce3b0b
  46. [Core] lazy import autoscaler + don't import opentelemetry unless setup hook (ray-project#33964)
    
    This will improve startup time almost 2X and reduce memory usage by 2X. If it is combined with numpy lazy import, it will improve everything 3X.
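
Lazy importing in Python usually means moving an import from module level into the code path that needs it. A minimal sketch of the pattern, using the standard-library `statistics` module as a stand-in for a heavy dependency (the function name is hypothetical, and this is not the change made in the PR):

```python
def summarize_latencies(values):
    """Compute a mean, importing the helper module only when it is needed."""
    # Deferred import: importing the enclosing module stays cheap, and the
    # cost of loading `statistics` is paid on the first call only.
    import statistics

    return statistics.mean(values)
```

After the first call, the module is cached in `sys.modules`, so later calls pay only a dictionary lookup.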
    
    Signed-off-by: elliottower <[email protected]>
    rkooo567 authored and elliottower committed Apr 22, 2023
    2881f2f
  47. [docs][KubeRay] Update KubeRay doc for release v0.5.0 (ray-project#34178)
    
    index.md: No code here to test and verify.
    
     getting-started.ipynb: Test manually.
    
     user-guides.md: No code here to test and verify
    
     k8s-cluster-setup.md: No code here to test and verify
     config.md: No code here to test and verify
     configuring-autoscaling.md: Test manually.
     logging.md: Test manually.
     gpu.rst: I did not verify code snippets, but GPU usage will be verified in gpu-training-example.md.
     experimental.md: No code here to test and verify
     static-ray-cluster-without-kuberay.md: Skip this. This document has no relationship with KubeRay.
     examples.md
    
     ml-example.md: (Will update in [docs][KubeRay] Provide some GKE instructions in KubeRay example ray-project#33339)
     gpu-training-example.md (Will update in [docs][KubeRay] Provide some GKE instructions in KubeRay example ray-project#33339)
     references.md
    
    Ray Serve
     kubernetes.md: Test manually.
    
     fault-tolerance.md: I do not test all serve's recovery procedures. I make sure the RayService can be created as expected.
    
    helm repo add kuberay https://ray-project.github.io/kuberay-helm/
    helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0
    # path: doc/
    kubectl apply -f source/serve/doc_code/fault_tolerance/k8s_config.yaml
    
    # port forward
    kubectl port-forward service/rayservice-sample-serve-svc 8000
    
    # Test the serve deployment
    curl localhost:8000
    
    # Delete a worker Pod
    kubectl delete pod ${WORKER_POD}
    
    # Test the serve deployment again
    curl localhost:8000
     run_gcs_ft_on_k8s.py
    
    Signed-off-by: elliottower <[email protected]>
    kevin85421 authored and elliottower committed Apr 22, 2023
    018abaa
  48. [Actor] [Code Quality] Add Unit Tests for Actors Sorting (ray-project#34058)
    
    Following up on ray-project#33395 (comment), add a component test to improve code quality.
    
    Signed-off-by: elliottower <[email protected]>
    chaowanggg authored and elliottower committed Apr 22, 2023
    896df4d
  49. [Dataset] Fix breaking Data CI tests (ray-project#34195)

    - ray-project#32959 added a good number of tests without changing any timeouts, and as a result some of the tests occasionally time out, making the Data CI tests flaky. Therefore, we should increase the timeout for Bazel targets which recently received additional test cases.
    - In addition, one of the failing tests, `test_from_huggingface_e2e`, was found to have a failure which was not caught in the original PR. `test_stats.test_dataset__repr__` is also sometimes flaky, so I add a fix for these tests.
    - I also added a blank file, `python/ray/data/tests/block_batching/__init__.py`, which is needed to resolve a pytest error (non-unique test filename) for an existing test.
    
    Signed-off-by: Scott Lee <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    scottjlee authored and elliottower committed Apr 22, 2023
    89e4a3b
  50. [RLlib] Change broken link in parameter_noise.py (ray-project#34231)

    Signed-off-by: Avnish <[email protected]>
    
    Change the broken OpenAI blog post link to a working one.
    
    Signed-off-by: elliottower <[email protected]>
    avnishn authored and elliottower committed Apr 22, 2023
    385f0ad
  51. [Serve] [Docs] Clarify that the Serve config only supports remote URIs (ray-project#34212)
    
    The Serve config only supports remote URIs within its `runtime_env` for safety purposes. However, this behavior is poorly documented and only guarded by a pydantic validator with an unclear error message.
    
    This change documents the remote URI requirements and clarifies the error message.
    
    Behavior when you run the following config with an invalid `runtime_env`:
    
    ```yaml
    import_path: fruit:graph
    
    runtime_env: {
      "working_dir": "src"
    }
    ```
    
    1. Without the change:
    
    ```console
    % serve run config.yaml
    ...
    pydantic.error_wrappers.ValidationError: 1 validation error for ServeApplicationSchema
    runtime_env
      Invalid protocol for runtime_env URI src. Supported protocols: ['GCS', 'CONDA', 'PIP', 'HTTPS', 'S3', 'GS', 'FILE']. Original error: '' is not a valid Protocol (type=value_error)
    ```
    
    2. With the change:
    
    ```console
    % serve run config.yaml
    ...
    pydantic.error_wrappers.ValidationError: 1 validation error for ServeApplicationSchema
    runtime_env
      runtime_envs in the Serve config support only remote URIs in working_dir and py_modules. Got error when parsing URI: Invalid protocol for runtime_env URI "src". Supported protocols: ['GCS', 'CONDA', 'PIP', 'HTTPS', 'S3', 'GS', 'FILE']. Original error: '' is not a valid Protocol (type=value_error)
    ```
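
The check described above can be sketched in a few lines. This is a minimal illustration, not Serve's actual pydantic validator: the helper name `validate_remote_uri` and the exact error wording are assumptions, and the protocol list is copied from the error message above.

```python
from urllib.parse import urlparse

# Protocols copied from the error message above.
SUPPORTED_PROTOCOLS = {"gcs", "conda", "pip", "https", "s3", "gs", "file"}


def validate_remote_uri(uri: str) -> str:
    """Return the URI unchanged if it names a supported protocol, else raise.

    Hypothetical helper for illustration; not part of Ray Serve.
    """
    scheme = urlparse(uri).scheme.lower()
    if scheme not in SUPPORTED_PROTOCOLS:
        # A bare local path such as "src" has an empty scheme and lands here.
        raise ValueError(
            "runtime_envs in the Serve config support only remote URIs in "
            f'working_dir and py_modules. Invalid protocol for URI "{uri}".'
        )
    return uri
```

With this sketch, a local path like `"src"` is rejected with a message that names the offending value, while an `https://` or `s3://` URI passes through unchanged.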
    
    Signed-off-by: elliottower <[email protected]>
    shrekris-anyscale authored and elliottower committed Apr 22, 2023
    7231190
  52. [core][ci] Fix test_fault_tolerance_actor_tasks_failed for test_task_…

    …events_2.py (ray-project#34237)
    
    Closes ray-project#34229
    
    Or if we could merge ray-project#33818
    
    Signed-off-by: elliottower <[email protected]>
    rickyyx authored and elliottower committed Apr 22, 2023
    bdaf578
  53. [core] prestart worker on node startup (ray-project#33623)

    Always prestart num_cpu workers when the raylet starts up. Previously, Python workers were started only on driver registration or when another worker submitted a new task to this raylet. This caused the cold start issues described in ray-project#26262.
    
    As part of this change, I also did some cleanup needed to simplify the code and make this work:
    
    - Removed start_initial_python_workers_for_first_job from ray.init(...). It prevented prestart from working, because it defaults to false and is set to true by Ray Client if there is no runtime env.
    - This PR does not change how worker prestart interacts with runtime envs.
    - This doesn't seem like a big enough change to warrant API review: if a customer sets this in Ray Client, they should remove it.
    - To turn off worker prestart, set RAY_enable_worker_prestart to false.
    
    Benchmark: on a prestarted raylet, measure the time to start a driver and num_cpu tasks.
    
    Master: 2.09 sec
    PR: 1.18 sec (56% of original startup time)
    
    We don't measure other cases because of restrictions in today's worker re-use caused by the worker cache key, which needs to be addressed in a follow-up.
    
    Signed-off-by: elliottower <[email protected]>
    clarng authored and elliottower committed Apr 22, 2023
    c8b0c7a
  54. [RLlib] Fixed a bug with kl divergence calculation of torch.Dirichlet…

    … distribution within RLlib (ray-project#34209)
    
    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    kouroshHakha authored and elliottower committed Apr 22, 2023
    13c9059
  55. [Core] Fix ray start command output (ray-project#34081)

    With ray-project#32409, we stopped printing information such as the dashboard URL when creating a single-node Ray cluster on OSX and Windows. This is a regression, and this PR restores the old behavior.
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    jjyao authored and elliottower committed Apr 22, 2023
    9573ee2
  56. [RLlib] Remove infos dict before Json_writer writes sample batches (r…

    …ay-project#33896)
    
    Signed-off-by: Avnish <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    avnishn authored and elliottower committed Apr 22, 2023
    86ae35c
  57. e60cdee
  58. [core] Task backend - Add worker died info to failed tasks when job e…

    …xits. (ray-project#34166)
    
    This adds the additional error_type + error_message info to non-terminal tasks (not finished and not failed) when a job exits.
    
    Signed-off-by: elliottower <[email protected]>
    rickyyx authored and elliottower committed Apr 22, 2023
    cc69ce3
  59. [Data] Update path expansion warning (ray-project#34221)

    The warning for path expansion during metadata fetching is inaccurate with recent changes. This PR updates the warning.
    
    ---------
    
    Signed-off-by: amogkam <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    amogkam authored and elliottower committed Apr 22, 2023
    9c3adec
  60. [docs][KubeRay] Provide some GKE instructions in KubeRay example (ray…

    …-project#33339)
    
    ml-example.md: I used a GKE cluster without Autopilot in this example. Because of dependency issues on my M1 Mac, I slightly modified the reproduction instructions: instead of running the job locally, I used kubectl exec to log in to the head Pod and submit the XGBoost job. This change should not affect the document.
    
    gpu-training-example.md:
    
    - Protobuf issue (ray-ml:2.3.0-gpu): ray-ml docker images - TypeError: Descriptors cannot not be created directly ray-project#31309 (comment) => Choose a Ray 2.2 image.
    - TorchVisionPreprocessor: Ray 2.2 does not support TorchVisionPreprocessor, so I used pytorch_training_e2e.py from the ray-2.2.0 branch.
    
    Signed-off-by: elliottower <[email protected]>
    kevin85421 authored and elliottower committed Apr 22, 2023
    6f74e42
  61. [data] Add take_batch API for collecting data in the same format as i…

    …ter_batches and map_batches (ray-project#34217)
    
    There isn't any convenient way to take just a single batch today, which is confusing. Introduce ds.take_batch(n, batch_format="default"), which returns a batch of n records as next(ds.iter_batches(batch_size=n, batch_format="default")) would.
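
The equivalence stated above can be illustrated without Ray. The sketch below is a toy stand-in over plain Python iterables, not the Ray Data implementation: the free functions `iter_batches` and `take_batch` are hypothetical and only mirror the relationship between the two calls.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to batch_size records (toy stand-in)."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def take_batch(records: Iterable[T], n: int) -> List[T]:
    """Return the first batch of up to n records.

    By construction this is the first item that iter_batches would yield.
    An empty input raises StopIteration in this toy version.
    """
    return next(iter_batches(records, batch_size=n))
```

The point of the design is that a single call replaces the `next(iter(...))` idiom while returning data in the same shape.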
    
    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    47add11
  62. [serve] Log to file on LongPollClient update (ray-project#34204)

    Signed-off-by: Edward Oakes <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    edoakes authored and elliottower committed Apr 22, 2023
    f32b525
  63. [try 2] [doc] [data] Fix autosummary issues (ray-project#34228)

    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    825f5c2
  64. [RLlib] Change occurrences of `"_observation_space_in_preferred_format…

    …"` to `"_obs_space_in_preferred_format"` (ray-project#33907)
    
    Signed-off-by: Avnish <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    avnishn authored and elliottower committed Apr 22, 2023
    f877ee3
  65. [Core] Introduce spill_on_unavailable option for soft NodeAffinitySch…

    …edulingStrategy (ray-project#34224)
    
    Introduce a private _spill_on_unavailable semantic for soft NodeAffinitySchedulingStrategy.
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    jjyao authored and elliottower committed Apr 22, 2023
    73ce168
  66. [Data] Support using concurrent actors for ActorPool (ray-project#3…

    …4253)
    
    Support using concurrent actors for ActorPool. We do this by gating the user UDF in a separate threadpool of max size 1.
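
A minimal sketch of the gating idea, assuming only the standard library: calls may arrive from many threads, but the user function always runs on one worker thread, so invocations never overlap. The class `SerializedUDF` and helper `call_from_threads` are hypothetical names for illustration, not Ray Data code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SerializedUDF:
    """Runs a user function on a single worker thread so calls never overlap.

    Hypothetical wrapper for illustration; not Ray Data's implementation.
    """

    def __init__(self, udf):
        self._udf = udf
        # max_workers=1 means submitted calls execute one at a time.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def __call__(self, batch):
        # Block the calling thread until the single worker has run the UDF.
        return self._pool.submit(self._udf, batch).result()

    def shutdown(self):
        self._pool.shutdown(wait=True)


def call_from_threads(fn, inputs):
    """Call fn(x) from one thread per input and return results in input order."""
    results = [None] * len(inputs)

    def worker(i, x):
        results[i] = fn(x)

    threads = [
        threading.Thread(target=worker, args=(i, x)) for i, x in enumerate(inputs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The concurrent callers can still overlap on everything outside the UDF, which is where the benefit of a concurrent actor would come from in this sketch.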
    
    ---------
    
    Signed-off-by: amogkam <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    amogkam authored and elliottower committed Apr 22, 2023
    89f4193
  67. [Part 2/n] Rename Dataset => Datastream (DataContext, DataIterator, G…

    …roupedDatastream) (ray-project#34186)
    
    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    8d5263c
  68. [ci/release] Migrate GBDT tests (xgboost/lightgbm) to GCE (ray-projec…

    …t#34264)
    
    Continuing the effort to migrate tests to GCE, this introduces variations for xgboost_ and lightgbm_ tests.
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    31ef76b
  69. 5b50270
  70. [data] Make sure the tf and tensor iteration work in dataset pipeline (

    …ray-project#34248)
    
    * Revert "[Datasets] Revert "Enable streaming executor by default (ray-project#32493)" (ray-project#33485)"
    
    This reverts commit 5c79954.
    
    * make sure tf and tensor iteration in datapipeline work
    
    * Fix
    
    * fix
    
    * fix
    
    * fix
    
    * feedback
    
    * feedback
    
    * fix
    
    Signed-off-by: elliottower <[email protected]>
    jianoaix authored and elliottower committed Apr 22, 2023
    46012cc
  71. [Jobs] Fix race condition in supervisor actor creation and add timeou…

    …t for pending jobs (ray-project#34223)
    
    @rkooo567 and @sihanwang41 found a race condition when submitting a job that causes the job to fail. The failure happens when this sequence of events occurs:
    
    1. A job is submitted. Its job_info is put in the internal KV. This happens here, before the JobSupervisor is actually created.
    2. In the constructor of JobManager, we call await self._recover_running_jobs(), which finds the job_info in the internal KV and starts to monitor that job. Because the JobSupervisor actor doesn't exist yet, the JobManager job monitoring loop fails to ping it and sets the status of this job to FAILED in the internal KV.
    3. The JobSupervisor is created. JobSupervisor.run() checks that the status is PENDING, but it's not, so it raises the error "run should only be called once", which is not helpful to the user.
    
    If step 2 happens before step 1, there's no issue. But these are both async, so the order isn't guaranteed.
    
    The solution in this PR is to have the JobManager monitoring loop handle the PENDING case by skipping the ping to the JobSupervisor actor for that iteration of the loop.
    
    This PR adds a unit test that fails with ray-project#34190 (which forces the race condition).
    
    This PR also adds a timeout to fail jobs that have been pending for 15 minutes, configurable via environment variable.
    
    Some questions are still open:
    
    - Why did this only start to fail recently? The only recent change is [Jobs] Fix race condition on submitting multiple jobs with the same id ray-project#33259, but it's not clear how this would matter in the case of a single job.
    - What is a reasonable default timeout for pending jobs, and should we even have one? It should be larger than the existing runtime_env setup timeout (10 minutes) in order to distinguish runtime env setup timeouts from other timeouts. Not sure if there are other existing timeouts that we should consider.
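
The fix described in this commit message can be sketched as one iteration of a monitoring loop. This is a simplified model under stated assumptions, not the JobManager code: `JobStatus`, `monitor_step`, `ping_supervisor`, and `PENDING_TIMEOUT_S` are hypothetical names, and only the 15-minute default comes from the text above.

```python
import time
from enum import Enum
from typing import Callable, Optional


class JobStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    FAILED = "FAILED"


# 15 minutes, the default mentioned above; the constant name is hypothetical.
PENDING_TIMEOUT_S = 15 * 60


def monitor_step(
    status: JobStatus,
    submitted_at: float,
    ping_supervisor: Callable[[], bool],
    now: Optional[float] = None,
    pending_timeout_s: float = PENDING_TIMEOUT_S,
) -> JobStatus:
    """Return the job status after one iteration of a monitoring loop."""
    now = time.time() if now is None else now
    if status is JobStatus.PENDING:
        # The supervisor actor may not exist yet, so skip the ping this
        # iteration instead of treating a failed ping as a job failure.
        if now - submitted_at > pending_timeout_s:
            return JobStatus.FAILED
        return JobStatus.PENDING
    if status is JobStatus.RUNNING and not ping_supervisor():
        # A running job whose supervisor no longer answers is marked failed.
        return JobStatus.FAILED
    return status
```

In this model a pending job can only fail through the timeout, which is what removes the race between job recovery and supervisor creation.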
    
    Signed-off-by: elliottower <[email protected]>
    architkulkarni authored and elliottower committed Apr 22, 2023
    bf671ef
  72. [RLlib] Actually save the optimizer state for tf learners (ray-projec…

    …t#34252)
    
    It turns out you can get the actual optimizer state for tf keras by calling optimizer.variables.
    This PR lets us save the full optimizer state and restore it. To do this, I added a new
    file called optimizer_name_state.txt to the checkpoint. It holds a bytestring-serialized
    representation of the optimizer's state. The optimizer's variable state doesn't appear to include
    things like the learning rate, so I still need to save those in a separate file and
    reconstruct the optimizer before loading the state.
    
    ---------
    
    Signed-off-by: Avnish <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    avnishn authored and elliottower committed Apr 22, 2023
    bbd512d
  73. [RLlib] Change broken doc name: MultiAgentRLModule.build->MultiAgentR…

    …LModule.setup (ray-project#34291)
    
    Signed-off-by: Avnish <[email protected]>
    
    The fix is described in the title. We had an autogenerated doc that was broken because a function was renamed.
    
    Signed-off-by: elliottower <[email protected]>
    avnishn authored and elliottower committed Apr 22, 2023
    0b087b7
  74. Add Cython wrapper for GcsClient (ray-project#33769)

    This is with the eventual goal of removing Python gRPC calls from Ray Core / Python workers. As a first cut, I'm removing the Python GcsClient.
    
    This PR introduces a Cython GcsClient that wraps a simple C++ synchronous GCS client. As a result, the code for the GcsClient moves from `ray._private.gcs_utils` to `ray._raylet`. The existing Python level reconnection logic `_auto_reconnect` is reused almost without changes.
    
    This new Cython client can support the full use cases of the old pure Python `GcsClient` and is (almost) a drop in replacement. To make sure this is indeed the case, this PR also switches over all the uses of the old client and removes the old code.
    
    We also introduce a new exception type `ray.exceptions.RpcError` which is a replacement of `grpc.RpcError` and allows the Python level code that does exception handling to keep working.
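
To show why a library-owned exception type keeps Python-level error handling working, here is a rough sketch of a reconnect-and-retry decorator. Everything in it is a stand-in: the local `RpcError` class only plays the role that `ray.exceptions.RpcError` has above, and `auto_reconnect` and `FlakyClient` are hypothetical, not the `_auto_reconnect` logic from Ray.

```python
import functools
import time


class RpcError(Exception):
    """Stand-in for an RPC failure raised by the client layer."""


def auto_reconnect(max_attempts: int = 3, delay_s: float = 0.0):
    """Retry a client method, calling self.reconnect() after each RpcError."""

    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return method(self, *args, **kwargs)
                except RpcError:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the error
                    time.sleep(delay_s)
                    self.reconnect()

        return wrapper

    return decorator


class FlakyClient:
    """Toy client whose calls fail until reconnect() is called."""

    def __init__(self):
        self.connected = False
        self.reconnects = 0

    def reconnect(self):
        self.connected = True
        self.reconnects += 1

    @auto_reconnect(max_attempts=3)
    def get(self, key):
        if not self.connected:
            raise RpcError("connection lost")
        return f"value-for-{key}"
```

Because the handler catches an exception class defined alongside the client, the retry code has no dependency on the gRPC Python package.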
    
    Signed-off-by: elliottower <[email protected]>
    pcmoritz authored and elliottower committed Apr 22, 2023
    3c9641e
  75. [Doc] Rewrite the placement group documentation (ray-project#33518)

    This PR rewrites the existing placement group documentation, which is confusing (sorry, I wrote the original version).
    
    The new doc starts from the simplest example and then moves on to the advanced concepts. All the concepts are also explained more thoroughly, with examples.
    
    Signed-off-by: elliottower <[email protected]>
    rkooo567 authored and elliottower committed Apr 22, 2023
    4a0100c
  76. [Serve][Doc] Update metrics & log doc (ray-project#34222)

    Update the logging and metrics docs for the 2.4 changes.
    
    Co-authored-by: angelinalg <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    2 people authored and elliottower committed Apr 22, 2023
    3ea1655
  77. [core] fix windows node manager test (ray-project#34304)

    One of the tests uses a command that doesn't work on Windows; disable it for now.
    
    Signed-off-by: Clarence Ng <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    clarng authored and elliottower committed Apr 22, 2023
    67019be
  78. [Doc] Make front page images non clickable (ray-project#32738)

    A Sphinx issue automatically makes images clickable whenever they're scaled (see https://stackoverflow.com/questions/40096251/disable-click-behavior-for-images). Clicking takes you to a full size version of the image.
    
    On the front page of the docs, there are four prominent images that look like buttons. The user would expect clicking them to take you to a docs page, but instead it just takes you to the image. (See the linked issue for details)
    
    Since there's no single docs page corresponding to each of these four images, in this PR we opt to make these images non clickable.
    
    Signed-off-by: elliottower <[email protected]>
    architkulkarni authored and elliottower committed Apr 22, 2023
    9feabc4
  79. [ci/mac] Fix arm64 wheels builds (ray-project#34268)

    The conda setup in test_wheels seems to fail from leftover state from previous python installs. This PR updates the test wheels script to create a new conda environment with the respective Python version which should not interfere with previous virtual envs.
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    9d312c6
  80. [core][state] Add head node flag is_head_node to state API and GcsN…

    …odeInfo (ray-project#34299)
    
    There have been requests for a way to check whether a node is the head node, and which node it is.
    
    Signed-off-by: elliottower <[email protected]>
    rickyyx authored and elliottower committed Apr 22, 2023
    033a3af
  81. [Release] Fix dask dependencies (ray-project#34261)

    Some if not all of the dask release tests are failing because of dependency hell. In short, boto and s3fs do not work well with the boto3 that is installed in the Anyscale dataplane. The good news is that these tests do not need these dependencies anyway (since Anyscale already installed them properly).
    
    Related issue number
    Closes ray-project#19399
    
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    de72c25
  82. [CI][Clean][03] Break run_release_test into smaller functions (ray-pr…

    …oject#33951)
    
    A pure refactoring diff. Break run_release_test in glue.py into smaller functions so they are easier to read, test, and change. This also makes future changes easier.
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    ---------
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    ---------
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    0e40f9c
  83. [CI][Clean][4] Add exception-free functions (ray-project#34099)

    Add exception-free APIs for some classes. This gives client code the option to use them without repeating exception handling, which makes the client code a bit easier to read.
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    ---------
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    6a70c86
  84. [serve] Remove pointless asyncio.Lock (ray-project#34314)

    This is a relic of a forgotten era.
    
    None of the calls it is "guarding" `await` so it is currently a no-op.
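
The reasoning can be checked with a small standard-library experiment: on a single event loop, coroutines only switch at an `await`, so a section with no `await` in it already runs without interleaving. The function names below are illustrative and unrelated to the Serve code.

```python
import asyncio


async def increment_without_lock(counter: dict, times: int) -> None:
    """Update shared state with no await inside the critical section."""
    for _ in range(times):
        # No await between the read and the write, so no other coroutine on
        # this event loop can run in between; a lock here would guard nothing.
        counter["value"] = counter["value"] + 1
    # Yield control only after the whole update is done.
    await asyncio.sleep(0)


async def run_demo(workers: int = 10, times: int = 1000) -> int:
    """Run several updaters concurrently and return the final count."""
    counter = {"value": 0}
    await asyncio.gather(
        *(increment_without_lock(counter, times) for _ in range(workers))
    )
    return counter["value"]
```

Calling `asyncio.run(run_demo())` gives exactly `workers * times`, with no lock involved. A lock would only start to matter if an `await` were added between the read and the write.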
    
    Signed-off-by: elliottower <[email protected]>
    edoakes authored and elliottower committed Apr 22, 2023
    04fbf96
  85. [Doc] Fix typo in Tune restore guide (ray-project#34247)

    Signed-off-by: Justin Yu <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    justinvyu authored and elliottower committed Apr 22, 2023
    a8a1927
  86. [data] Fix pyarrow numpy element issue (ray-project#34215)

    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    3824410
  87. [Doc] update workspace templates (ray-project#34289)

    Signed-off-by: Sofian Hnaide <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    sofianhnaide authored and elliottower committed Apr 22, 2023
    b2cfa6d
  88. [docs] fix build (ray-project#34265)

    * [docs] fix build
    
    Signed-off-by: Max Pumperla <[email protected]>
    
    * fix doctests
    
    Signed-off-by: Max Pumperla <[email protected]>
    
    * last test
    
    Signed-off-by: Max Pumperla <[email protected]>
    
    * lint
    
    Signed-off-by: Max Pumperla <[email protected]>
    
    * Update doc/source/rllib/package_ref/rl_modules.rst
    
    Co-authored-by: kourosh hakhamaneshi <[email protected]>
    Signed-off-by: Max Pumperla <[email protected]>
    
    * fixes
    
    * revert diff
    
    * whitespace
    
    ---------
    
    Signed-off-by: Max Pumperla <[email protected]>
    Signed-off-by: Philipp Moritz <[email protected]>
    Co-authored-by: kourosh hakhamaneshi <[email protected]>
    Co-authored-by: Philipp Moritz <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    3 people authored and elliottower committed Apr 22, 2023
    3b66972
  89. [AIR][Doc] LightningTrainer Advanced Example (ray-project#34082)

    Signed-off-by: elliottower <[email protected]>
    woshiyyya authored and elliottower committed Apr 22, 2023
    05eff03
  90. [RLlib] External env is not compatible with the connectors API. (ray-…

    …project#33945)
    
    Signed-off-by: Avnish <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    avnishn authored and elliottower committed Apr 22, 2023
    36e23fd
  91. [Data] Cosmetic changes to Arrow Tensor __repr__ (ray-project#34286)

    Make it clear what the data type actually is -- a numpy array. Also make the argument ordering consistent between the two types.
    
    Signed-off-by: elliottower <[email protected]>
    pcmoritz authored and elliottower committed Apr 22, 2023
    ea0764a
  92. a236ec9
  93. [Dashboard][Bug fix] When using an nginx proxy, the front-end may mis…

    …spell the URL when accessing the log. (ray-project#34130)
    
    In our use case, we need to access the dashboard in the online cluster through an nginx proxy from the intranet. We found that when accessing the log page under this scenario, the front-end would misspell the URL, resulting in a failure to load.
    ## Related issue number
    
    Closes ray-project#34043
    
    Signed-off-by: elliottower <[email protected]>
    Catch-Bull authored and elliottower committed Apr 22, 2023
    3bd41bf
  94. [Metrics] Fix shared memory is not displayed properly (ray-project#34301

    )
    
    It looks like we recorded shared memory incorrectly and also displayed it incorrectly in the metrics graph (I forgot to add the ray_ prefix).
    
    Signed-off-by: elliottower <[email protected]>
    rkooo567 authored and elliottower committed Apr 22, 2023
    ef51e9f
  95. [Tune] Add support for nested hyperparams in PB2 (ray-project#31502)

    This PR enables passing nested hyperparameters to the PB2 scheduler.
    
    This PR also makes a few minor improvements to PB2 (happy to separate out these changes if needed):
    1. Hyperparameter initialization (if missing from param space) should be sampled uniformly between bounds. Currently, PB2 falls back to PBT for sampling initial hyperparameters, which will just choose between the low/high values.
    2. Allow `custom_explore_fn` to be passed into PB2 to match PBT functionality. This solves a user request here: https://discuss.ray.io/t/pb2-hyper-parameters-as-integers/8822.
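
The first improvement, combined with nesting, can be sketched as follows. This is an illustration only, not PB2's code: the helper `sample_uniform` is hypothetical, and the layout of the bounds dictionary (each leaf is a `[low, high]` pair, and nested dicts group related hyperparameters) is an assumption made for the example.

```python
import random
from typing import Any, Dict, Optional


def sample_uniform(
    bounds: Dict[str, Any], rng: Optional[random.Random] = None
) -> Dict[str, Any]:
    """Sample every [low, high] leaf of a possibly nested bounds dict uniformly."""
    rng = rng or random.Random()
    sampled: Dict[str, Any] = {}
    for name, spec in bounds.items():
        if isinstance(spec, dict):
            # Recurse into nested groups, keeping the same structure.
            sampled[name] = sample_uniform(spec, rng)
        else:
            low, high = spec
            # Uniform over the whole interval, not just the two endpoints.
            sampled[name] = rng.uniform(low, high)
    return sampled
```

For example, `sample_uniform({"lr": [1e-4, 1e-2], "model": {"dropout": [0.0, 0.5]}})` returns a dictionary with the same nesting whose values lie inside the given bounds.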
    
    Signed-off-by: Justin Yu <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    justinvyu authored and elliottower committed Apr 22, 2023
    22aa4b9
  96. 0437cb1
  97. [serve] Revert info log line in LongPollClient (ray-project#34313)

    This is getting spammed to the driver console because it also has a `LongPollClient` :(
    
    Need to add a way to filter these messages before adding it back.
    
    Signed-off-by: elliottower <[email protected]>
    edoakes authored and elliottower committed Apr 22, 2023
    278d89a
  98. [Release Test] Add GCE variation for core release tests [2/n] (ray-pr…

    …oject#34337)
    
    - single_node_oom
    - benchmark_worker_startup
    - Removed worker node types with max_worker = 0
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    jjyao authored and elliottower committed Apr 22, 2023
    120c34b
  99. [Serve] Remove smoke test from gce (ray-project#34319)

    We don't have smoke tests for these release tests.
    
    Signed-off-by: elliottower <[email protected]>
    sihanwang41 authored and elliottower committed Apr 22, 2023
    89b9cff
  100. [build_base] Use bazelisk for better bazel version management. (ray-p…

    …roject#34246)
    
    Upgrading bazel requires a lot of file changes:
    
    - update setup.py
    - update windows bazel fix
    - update workspace
    
    This PR makes Ray use bazelisk instead of bazel to make version management easier.
    
    Signed-off-by: elliottower <[email protected]>
    fishbone authored and elliottower committed Apr 22, 2023
    51bcdb9
  101. [build] Fix build on latest clang (ray-project#34151)

    The latest clang treats some warnings as errors by default. This PR tries to fix that.
    
    More detail in https://discourse.llvm.org/t/configure-script-breakage-with-the-new-werror-implicit-function-declaration/65213/1
    
    Signed-off-by: elliottower <[email protected]>
    fishbone authored and elliottower committed Apr 22, 2023
    0222468
  102. [Data] Remove unnecessary setting of global logging level to INFO whe…

    …n using Ray Data (ray-project#34347)
    
    When initializing Ray Data, the global logging level is set to `INFO`, which causes non-Ray `INFO` logs to be unintentionally emitted (the default level in the `logging` library is `WARNING`, which would normally ignore `INFO`-level logs). We remove an unnecessary setting of the logging level in `DatasetLogger` which resolves this issue.
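
The effect described above can be reproduced with the standard `logging` module alone. In this sketch the logger names `third_party_lib` and `my_library` are placeholders, and `demo` is a hypothetical helper that restores the root level when it finishes.

```python
import logging


def info_enabled(logger_name: str) -> bool:
    """Report whether INFO records from the named logger would be emitted."""
    return logging.getLogger(logger_name).isEnabledFor(logging.INFO)


def demo() -> dict:
    """Compare a global level change with one scoped to a single logger."""
    root = logging.getLogger()
    scoped = logging.getLogger("my_library")
    saved_level = root.level
    try:
        root.setLevel(logging.WARNING)  # the logging library's default
        default = info_enabled("third_party_lib")

        root.setLevel(logging.INFO)  # global change: affects every logger
        global_info = info_enabled("third_party_lib")

        root.setLevel(logging.WARNING)
        scoped.setLevel(logging.INFO)  # scoped change: only this logger
        scoped_result = (info_enabled("my_library"), info_enabled("third_party_lib"))
    finally:
        # Undo the changes so the demo leaves logging as it found it.
        root.setLevel(saved_level)
        scoped.setLevel(logging.NOTSET)
    return {"default": default, "global_info": global_info, "scoped": scoped_result}
```

Setting the level on the root logger turns on INFO output for unrelated libraries, while setting it on one named logger leaves the others at the default.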
    
    Signed-off-by: Scott Lee <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    scottjlee authored and elliottower committed Apr 22, 2023
    75e17e4
  103. [RLlib] Make the KL coefficient traced in appo tf (ray-project#34293)

    Signed-off-by: Avnish <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    avnishn authored and elliottower committed Apr 22, 2023
    dba4144
  104. [air] Move to new storage_path API in tests and examples (ray-project#34263)
    
    Following ray-project#33463, this PR updates our tests, examples, and docs to use the new `storage_path` API.
    
    The only locations where we continue to use the `local_dir` statement are tests where we specify both a local dir and a remote dir. For these tests, we can move to an environment-variable based wrapper in the future.
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    074e976
  105. [AIR] Experiment restore stress tests (ray-project#33706)

    Signed-off-by: Justin Yu <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    justinvyu authored and elliottower committed Apr 22, 2023
    32b4b92
  106. [RLlib] Fix two RL docs examples (ray-project#34353)

    Signed-off-by: Artur Niederfahrenhorst <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    ArturNiederfahrenhorst authored and elliottower committed Apr 22, 2023
    9d2e693
  107. [air] Deflake test_e2e_train_flow.py (ray-project#34308)

    The test_e2e_train_flow test has been flaky. After some investigation this seems to be due to a race condition: The mock train flow would continue from the latest "checkpoint", but an actor restart could resolve before the next iteration finished. This triggers a new continuation, which increases the training iteration, leading to a mismatch.
    
    The fix in this mock flow is to only unset the "restore" instruction after the next round of training results came in.
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    bff2b94
  108. [Datasets] Use read stage name for naming Data-read tasks on Ray Dashboard (ray-project#34341)
    
    This PR updates the naming so that we use the underlying read stage name, if available from the input `LazyBlockList`, as the name of the resulting `MapOperator`; otherwise, we fall back to the existing `DoRead` name.
    
    Signed-off-by: Scott Lee <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    scottjlee authored and elliottower committed Apr 22, 2023
    99fd534
  109. [train] Fix rendering of diff code-blocks (ray-project#34355)

    Signed-off-by: Matthew Deng <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    matthewdeng authored and elliottower committed Apr 22, 2023
    0ef6596
  110. [RLlib] Check that results has learner info appo test (ray-project#34381)
    
    The APPO KL coefficient learner test is flaky because we run training until there are some results. Training can run long enough that evaluation results are available but learner results are not. This PR fixes this by training until learner results are available, not just evaluation results.
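
The loop condition described above might look like the following sketch. The helper `train_until_learner_info`, the fake results, and the `result["info"]["learner"]` layout are assumptions made for illustration; the actual test code may differ.

```python
def train_until_learner_info(train_step, max_iters=100):
    """Call train_step() until its result contains non-empty learner info."""
    for _ in range(max_iters):
        result = train_step()
        # Evaluation results alone are not enough; require learner results.
        if result.get("info", {}).get("learner"):
            return result
    raise RuntimeError("no learner results after %d iterations" % max_iters)


# Hypothetical results: the first has evaluation output but no learner info.
fake_results = iter([
    {"evaluation": {"episode_reward_mean": 1.0}, "info": {"learner": {}}},
    {"info": {"learner": {"default_policy": {"kl": 0.01}}}},
])
result = train_until_learner_info(lambda: next(fake_results))
```

The first fake result is skipped even though it carries evaluation output, so the caller only ever inspects a result that has learner info.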
    
    Signed-off-by: Avnish <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    avnishn authored and elliottower committed Apr 22, 2023
    eb9fbf1
  111. pull out shared deploy code into deploy utils (ray-project#34321)

    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    zcin authored and elliottower committed Apr 22, 2023
    5c71a2c
  112. [serve] Fix get endpoint when autoscaling config is set (ray-project#34377)
    
    If autoscaling config is set for a deployment, we can't set the num replicas when returning the deployment details of that deployment. Otherwise, it breaks the entirety of the get metadata endpoint.
    
    Signed-off-by: elliottower <[email protected]>
    zcin authored and elliottower committed Apr 22, 2023
    5c28386
  113. add main for obod test (ray-project#34311)

    Signed-off-by: Catch-Bull <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    Catch-Bull authored and elliottower committed Apr 22, 2023
    95bf952
  114. [tune] fix a typo in tune/execution/checkpoint_manager state serialization. (ray-project#34368)
    
    Signed-off-by: xwjiang2010 <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    xwjiang2010 authored and elliottower committed Apr 22, 2023
    453cced
  115. [air] DreamBooth example: Fix code for batch size > 1 (ray-project#34398)
    
    The DreamBooth finetuning example currently throws an error when batch size > 1, even when the GPU memory is large enough. This is because the training batches are currently not created correctly.
    
    This PR fixes the batch format and includes in-line comments to explain the new behavior.
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    d5a46e3
  116. [Data] combine_chunks before chunking pyarrow.Table block into batches (ray-project#34352)
    
    `pyarrow.Table.slice` is slow when the table has many chunks, which makes batching pyarrow blocks slow. The fix is to combine the chunks into a single one, which makes `slice` faster at the cost of an extra copy.
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    jjyao authored and elliottower committed Apr 22, 2023
    670686a
  117. [data] [streaming] [part 3/n] Rename Dataset => Datastream in internal files (ray-project#34340)
    
    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    6ea21bd
  118. [Dataset] Validate sort key in Sort LogicalOperator (ray-project#34282)
    
    As a follow-up to ray-project#32133, we should validate the sort key with `_validate_key_fn()` from block.py in `generate_sort_fn()` before sorting.
    
    Signed-off-by: Scott Lee <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    scottjlee authored and elliottower committed Apr 22, 2023
    c860d17
  119. [data] Add usage tag for which block formats are used (ray-project#34384)
    
    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    6294a84
  120. [Dataset] Reset row count when filtering on Dataset reading from Parquet (ray-project#34372)
    
    Previously, if we filtered a Dataset that was read from a Parquet datasource, the row count of the resulting Dataset was the same as that of the unfiltered Dataset (see ray-project#33766 and the modified test for an example). This PR fixes the bug so that the correct row count is returned after applying the filter.
    
    Signed-off-by: elliottower <[email protected]>
    scottjlee authored and elliottower committed Apr 22, 2023
    282424f
  121. Remove python 3.6 support [1/n] (ray-project#34373)

    Python 3.6 support will be removed in Ray 2.5
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    jjyao authored and elliottower committed Apr 22, 2023
    e7dcdc4
  122. [RLlib] Add 2D box example for PPO RL Modules (ray-project#33840)

    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    ArturNiederfahrenhorst authored and elliottower committed Apr 22, 2023
    451150d
  123. c9ccbec
  124. Add GCE variation for core release tests [3/n] (ray-project#34425)

    - microbenchmark_38
    - shuffle_20gb_with_state_api
    - object_store
    - many_actors
    - many_tasks
    - many_pgs
    - chaos_many_tasks_no_object_store
    - chaos_many_actors
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    jjyao authored and elliottower committed Apr 22, 2023
    f0ee586
  125. [train] rename _base_dataset to _base_datastream (ray-project#34423)

    Signed-off-by: Matthew Deng <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    matthewdeng authored and elliottower committed Apr 22, 2023
    420c60c
  126. [CI][Bisect][1] Skeleton for automated bisect of release tests (ray-project#34329)
    
    A script to bisect release test failures. This PR only contains a skeleton and unit tests.
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    a45407f
  127. d4c9bc4
  128. [Dataset] Validate aggregation key in Aggregate LogicalOperator (ray-project#34292)
    
    As a followup of ray-project#32462, we should validate aggregate functions with `AggregateFn._validate`, in `generate_aggregate_fn()` before doing aggregate.
    
    Signed-off-by: Scott Lee <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    scottjlee authored and elliottower committed Apr 22, 2023
    2ef7ec3
  129. [requirements] Add PyArrow to ray[tune] dependencies (ray-project#34397)

    Ray Tune depends on PyArrow for file syncing. However, `ray[tune]` currently does not include pyarrow as a dependency, which means version constraints are not enforced and syncing is not guaranteed to work out of the box.
    
    This surfaced as a problem when a user used poetry with `ray[tune]` as a constraint, but an incompatible version of pyarrow was installed. In this case, syncing to cloud storage was broken.
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    64a5c78
  130. [air] pin deepspeed version for now to unblock ci. (ray-project#34406)

    Deepspeed had a new release yesterday that broke our CI.
    
    Signed-off-by: xwjiang2010 <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    xwjiang2010 authored and elliottower committed Apr 22, 2023
    5796620
  131. Closing issue (ray-project#31926) about unknown windows crash when too many arguments given in the config file (ray-project#32206)
    
    I encountered a crash on Windows caused by a path that was too long for Windows.
    To make users aware of this issue, I added a check that logs a warning when the path is too long.
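
A check of this kind could be sketched as follows. The function name `warn_if_path_too_long` is hypothetical, and using the classic Windows `MAX_PATH` value of 260 characters as the threshold is an assumption; the commit itself does not state which limit it checks.

```python
import logging

logger = logging.getLogger(__name__)

# Classic Windows MAX_PATH limit; using it as the threshold is an assumption.
WINDOWS_MAX_PATH = 260


def warn_if_path_too_long(path, is_windows, limit=WINDOWS_MAX_PATH):
    """Log a warning and return True when path may exceed the Windows limit."""
    if is_windows and len(path) >= limit:
        logger.warning(
            "Path has %d characters, which may exceed the Windows limit of %d: %s",
            len(path), limit, path,
        )
        return True
    return False
```

A caller would typically pass `is_windows=(os.name == "nt")`, so that other platforms never see the warning.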
    
    Signed-off-by: sahar <[email protected]>
    Signed-off-by: Sahar <[email protected]>
    Signed-off-by: Kai Fricke <[email protected]>
    Co-authored-by: sahar <[email protected]>
    Co-authored-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    3 people authored and elliottower committed Apr 22, 2023
    e566859
  132. Serve Dashboard features polish (ray-project#34391)

    Filter out serve system endpoints from grafana dashboards
    Make it more clear when a log file is empty
    
    Signed-off-by: elliottower <[email protected]>
    alanwguo authored and elliottower committed Apr 22, 2023
    8988ded
  133. [core][state] Efficient get/list actors with filters on some high-cardinality fields ray-project#34348
    
    Signed-off-by: rickyyx <[email protected]>
    
    This improves the state API for listing/getting actors: if filtering by id/state/job, filtering is pushed down to the source (GCS).
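
To illustrate why pushing a filter down to the source helps, here is a small sketch. `FakeActorSource` is a hypothetical stand-in for the data source (GCS in this commit) and is not the real state API; it only counts how many rows are sent to the caller.

```python
class FakeActorSource:
    """Hypothetical stand-in for the data source; counts rows it sends."""

    def __init__(self, actors):
        self._actors = actors
        self.rows_sent = 0

    def list_actors(self, filters=None):
        filters = filters or {}
        rows = [a for a in self._actors
                if all(a.get(k) == v for k, v in filters.items())]
        self.rows_sent += len(rows)
        return rows


actors = [{"actor_id": str(i), "state": "ALIVE" if i % 10 == 0 else "DEAD"}
          for i in range(1000)]

# Client-side filtering: fetch every row, then filter locally.
source = FakeActorSource(actors)
client_side = [a for a in source.list_actors() if a["state"] == "ALIVE"]
rows_without_pushdown = source.rows_sent

# Pushdown: the source applies the filter and sends only matching rows.
source = FakeActorSource(actors)
pushed_down = source.list_actors(filters={"state": "ALIVE"})
rows_with_pushdown = source.rows_sent
```

Both approaches return the same actors, but in this toy setup the pushdown variant transfers 100 rows instead of 1000.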
    
    Other state API resources will be implemented in a similar way (e.g. tasks/workers).
    
    Signed-off-by: elliottower <[email protected]>
    rickyyx authored and elliottower committed Apr 22, 2023
    9552120
  134. [CI][Bisect][2] Actually bisect test failures on buildkite (ray-project#34331)
    
    Implement the actual functions to run tests and bisect on buildkite. This first implementation is pretty naive in several ways:
    - It uses a main bisect orchestration step that waits for test steps. This could be made more efficient with sub-bisect orchestration steps
    - It only runs one test at a time, which is less effective when the range gets small
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    e47ba3d
  135. [Doc] Fix linter (ray-project#34474)

    Signed-off-by: Artur Niederfahrenhorst <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    ArturNiederfahrenhorst authored and elliottower committed Apr 22, 2023
    9c6fc49
  136. [RLlib] Try 8gpus_96cpus_gce with n1 and t4 nodes (ray-project#34459)

    Signed-off-by: Artur Niederfahrenhorst <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    ArturNiederfahrenhorst authored and elliottower committed Apr 22, 2023
    edcc30a
  137. [RLlib] fix cartpole lstm string (ray-project#34458)

    Signed-off-by: Artur Niederfahrenhorst <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    ArturNiederfahrenhorst authored and elliottower committed Apr 22, 2023
    12122d5
  138. [Release Test] Add GCE variation for core release tests [4/n] (ray-project#34442)
    
    - dask_on_ray_100gb_sort
    - stress_test_state_api_scale
    - stress_test_many_tasks
    - stress_test_dead_actors
    - threaded_actors_stress_test
    - many_nodes_actor_test_on_v2
    - placement_group_performance_test
    - scheduling_test_many_0s_tasks_many_nodes
    - agent_stress_test
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    jjyao authored and elliottower committed Apr 22, 2023
    9e1c154
  139. [RLlib] Throw meaningful error when trying to run DirectMethod OPE with TF (ray-project#34417)
    
    * Introduce error
    
    Signed-off-by: Artur Niederfahrenhorst <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    ArturNiederfahrenhorst authored and elliottower committed Apr 22, 2023
    986b2d1
  140. [CI][Green-Ray][1] Automated retry of infra-error release tests (ray-project#34057)
    
    This PR is part of my effort to make OSS release test runs greener, starting with reducing infra error rates. Other work such as [this from Lonnie](https://docs.google.com/document/d/1hF7h8F19qFWFxH9WVeT8fWwVuNyUyHLTx-7LP3uxD50/edit#heading=h.i0cvl0u8jbfu) fixes systematic issues such as an unstable Anyscale staging environment. This PR addresses transient issues with Anyscale that are hard to avoid in a distributed system. On a day when Anyscale behaves well, transient issues seem to affect around [2-3%](https://b534fd88.us1a.app.preset.io/superset/dashboard/43/?force=false&native_filters_key=MoYaGptJfGwbkF60A7RSzfoRLL_ypDf_JvNFxp2YGQ8Ls4CNgbAWEBh0WcOkOLsS) of runs, i.e. about 4 random failures for a test suite of 200 tests, which is annoying!
    
    Concretely it will:
    
    - First, classify an infra test run as a transient infra issue
    - Instruct buildkite to automatically retry on transient issues
    - If retries run out, classify the infra test run as an infra issue
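
The three steps above can be sketched as a small classification function. The names `Outcome` and `classify` and the default of one retry are assumptions for illustration, not the actual release-test tooling.

```python
from enum import Enum


class Outcome(Enum):
    PASSED = "passed"
    TRANSIENT_INFRA_ERROR = "transient_infra_error"  # eligible for a retry
    INFRA_ERROR = "infra_error"                      # retries exhausted
    TEST_FAILURE = "test_failure"


def classify(passed, is_infra_error, retry_count, max_retries=1):
    """Classify one test run; max_retries=1 is an assumed default."""
    if passed:
        return Outcome.PASSED
    if not is_infra_error:
        return Outcome.TEST_FAILURE
    if retry_count < max_retries:
        # Treated as transient, so the CI system is asked to retry.
        return Outcome.TRANSIENT_INFRA_ERROR
    return Outcome.INFRA_ERROR
```

A CI step would then request a retry only for `TRANSIENT_INFRA_ERROR` and report every other outcome as final.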
    
    Some other limitations that will be addressed in followup PRs:
    - Move infra-failure retry configuration into LaunchDarkly?
    - Limit auto-retry based on test cost or test runtime
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    3f30c1b
  141. Remove python 3.6 support [2/n] (ray-project#34416)

    Removed some dead code for 3.6
    
    Signed-off-by: elliottower <[email protected]>
    jjyao authored and elliottower committed Apr 22, 2023
    233cd31
  142. Fix backpressure handling of queued actor pool tasks (ray-project#34254)

    There is a bug in the backpressure implementation with regard to actor pools, in that once a task is queued for an actor pool, it is no longer subject to backpressure. This is problematic when the output size of a task is much bigger than the input size. In this situation, the actor pool will keep executing tasks (converting small objects into larger objects), even when this would grossly exceed memory limits.
    
    Put another way: it fixes the issue where the streaming executor queues tasks on an actor pool operator, but later on wants to "take it back" due to unexpectedly high memory usage. This avoids the issue by not queueing tasks that won't be immediately executed (so they won't need to be taken back).
    
    Example:
    1. Suppose there is an actor pool of size 10, where each actor can take 1 active task.
    2. Each input task is size 1GB. The memory limit is 100GB, so we add 100 of these inputs in an actor pool operator.
    3. When the tasks run, they expand into 100GB of output in total (10GB each for the 10 running tasks). Now, the memory usage overall is 200GB (2x over our limit!).
    4. However, since we already added those 100 inputs to the actor pool, there is no way for the streaming scheduler to pause execution of the 90 remaining queued inputs.
    5. Now the 90 queued inputs execute and we end up using 1TB, or 10x our intended memory limit.
    
    We need to check for the memory limit right before executing a task in the actor pool; one way of doing this is to eliminate the internal queue in the actor pool operator and instead always queue work outside the operator.
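
The arithmetic in the example can be checked with a toy model. This is a sketch under stated assumptions, not the streaming executor's logic: `output_memory_gb` is a hypothetical helper, outputs are assumed never to be consumed downstream (worst case), and each task is assumed to produce 10GB of output, the per-task figure that makes the 200GB and 1TB totals in the example add up.

```python
def output_memory_gb(num_inputs, input_gb, output_gb, limit_gb,
                     check_before_dispatch):
    """Memory held by task outputs once the scheduler stops dispatching."""
    if not check_before_dispatch:
        # Old behaviour: admission is decided once, from input sizes only,
        # and every admitted input eventually runs.
        admitted = min(num_inputs, int(limit_gb // input_gb))
        return admitted * output_gb

    # New behaviour: re-check the memory limit right before each dispatch.
    used = 0
    executed = 0
    while executed < num_inputs and used < limit_gb:
        used += output_gb
        executed += 1
    return used


eager = output_memory_gb(100, 1, 10, 100, check_before_dispatch=False)
lazy = output_memory_gb(100, 1, 10, 100, check_before_dispatch=True)
```

In this model the eager variant ends at 1000GB of output, the 10x overshoot from step 5, while the variant that checks before each dispatch stops at the 100GB limit.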
    
    TODO:
    - [x] Performance testing
    - [x] Unit tests
    - [x] Perf test final version
    
    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    56c8673
  143. Deflake gcs_client_test.cc (ray-project#34411)

    The hypothesis is that the `on_subscribe` callback is invoked after the test finishes, when the reference to the stack-allocated atomic counter is no longer valid, causing an ASan failure.
    
    Signed-off-by: elliottower <[email protected]>
    cadedaniel authored and elliottower committed Apr 22, 2023
    9839f6b
  144. release logs for 2.4.0 (ray-project#33905)

    Release logs perf benchmark for 2.4.0
    Also updated tool to sort the regressions
    
    Signed-off-by: Clarence Ng <[email protected]>
    Co-authored-by: Clarence Ng <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    2 people authored and elliottower committed Apr 22, 2023
    6419eab
  145. d32d4b1
  146. [no_early_kickoff][core][state] Make state api return results that are strongly typed (ray-project#34297)
    
    We are now returning strongly typed dataclasses (with type checking enabled by pydantic) from list and get APIs.
    
    Signed-off-by: elliottower <[email protected]>
    rickyyx authored and elliottower committed Apr 22, 2023
    bc75bff
  147. [core][state] Use --err flag to query stderr logs from worker/actors instead of `--suffix=err` (ray-project#34300)
    
    Signed-off-by: elliottower <[email protected]>
    rickyyx authored and elliottower committed Apr 22, 2023
    d0a0ced
  148. 4ded4da
  149. [CI] Fix shellcheck lint (ray-project#34488)

    * [CI] Fix shellcheck lint
    
    Signed-off-by: Antoni Baum <[email protected]>
    
    * More lint fixes
    
    Signed-off-by: Antoni Baum <[email protected]>
    
    * Revert "More lint fixes"
    
    This reverts commit 8d6f316.
    
    ---------
    
    Signed-off-by: Antoni Baum <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    Yard1 authored and elliottower committed Apr 22, 2023
    589d371
  150. [CI] Add GCE variances to Data tests (ray-project#34105)

    This PR configures BuildKite to run Data release tests on GCE. I excluded the parquet_metadata_resolution and shuffle_data_loader release tests because more work is required to migrate those tests.
    
    ---------
    
    Signed-off-by: Balaji Veeramani <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    bveeramani authored and elliottower committed Apr 22, 2023
    7f00d44
  151. [Core] convert gcs port read from env variable from str to int (ray-project#34482)
    
    Convert the variable from str to int. Closes ray-project#33963.
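
Values read from the environment are always strings in Python, so a port must be converted before it is used as a number. A minimal sketch follows; the helper `read_port` and the variable name `EXAMPLE_GCS_PORT` are hypothetical and are not the names Ray uses.

```python
import os


def read_port(env_var, default):
    """Read a port number from the environment and return it as an int."""
    raw = os.environ.get(env_var)
    return int(raw) if raw is not None else default


os.environ["EXAMPLE_GCS_PORT"] = "6379"  # hypothetical variable name
port = read_port("EXAMPLE_GCS_PORT", default=0)
```

Without the `int()` call, `port` would be the string `"6379"`, which breaks any code that compares or formats it as an integer.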
    
    Signed-off-by: elliottower <[email protected]>
    scv119 authored and elliottower committed Apr 22, 2023
    fc569e5
  152. [Serve] gRPC Deployment schema check & minor improvements (ray-project#34210)
    
    Issues found while debugging gRPC; fixes in this PR:
    
    - Fix the options API not being set correctly.
    - Add a deployment attribute check.
    - Remove the notification step in the deployment state.
    
    Signed-off-by: elliottower <[email protected]>
    sihanwang41 authored and elliottower committed Apr 22, 2023
    3d0a89d
  153. Fix mutable dataclass attribute (ray-project#34339)

    This PR fixes an instance where a mutable attribute is used as a dataclass member, which causes an exception. See [this part of the docs](https://docs.python.org/3/library/dataclasses.html#mutable-default-values) for more information.
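
For readers unfamiliar with this pitfall, the sketch below shows the exception and the usual fix with `field(default_factory=...)`. The class names `Broken` and `Fixed` are illustrative and are not the classes changed in this PR.

```python
from dataclasses import dataclass, field
from typing import List

try:
    @dataclass
    class Broken:
        tags: List[str] = []  # mutable default: rejected by dataclasses
except ValueError as exc:
    error = str(exc)


@dataclass
class Fixed:
    tags: List[str] = field(default_factory=list)  # fresh list per instance


a, b = Fixed(), Fixed()
a.tags.append("x")
```

Because `default_factory` builds a new list for every instance, appending to `a.tags` leaves `b.tags` empty.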
    
    Signed-off-by: pdmurray <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    peytondmurray authored and elliottower committed Apr 22, 2023
    ec0e813
  154. [Event] Fix incorrect event timestamp (ray-project#34402)

    We didn't use the correct system clock and always used a UTC timestamp. This PR fixes the issue.
    
    Signed-off-by: elliottower <[email protected]>
    rkooo567 authored and elliottower committed Apr 22, 2023
    cc0bc5e
  155. [core][tests] Harden flaky pytest (ray-project#34480)

    I suspect the flaky cancellation test is due to an expectation that the final log message assumes a particular format. This may not be the last log message, so check backwards from the last message for this string.
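
Checking backwards from the last message, rather than only the last one, can be sketched like this. The helper `contains_from_end` and the log lines are hypothetical; the real test inspects Ray's own log output.

```python
def contains_from_end(lines, needle):
    """Return True if needle occurs in any line, scanning from the end."""
    return any(needle in line for line in reversed(lines))


# Hypothetical log where the expected message is not the final line.
logs = ["starting task", "Task cancelled", "worker shutting down"]
found = contains_from_end(logs, "cancelled")
```

An assertion on `logs[-1]` alone would fail here, while the backwards scan still finds the cancellation message.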
    
    Signed-off-by: elliottower <[email protected]>
    vitsai authored and elliottower committed Apr 22, 2023
    1200b20
  156. [data] Experimental strict schema mode (ray-project#34336)

    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    0b75855
  157. [Datasets] Defer first block computation when reading a Datasource with schema information in metadata (ray-project#34251)
    
    In the current implementation of [ExecutionPlan._get_unified_blocks_schema](https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/plan.py#L418), we force execution to compute the first block when given a `LazyBlockList`. However, when creating a Dataset from a datasource that has schema information available before reading (e.g. Parquet), this unnecessarily forces execution, since we already check for metadata in the subsequent [ensure_metadata_for_first_block](https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/lazy_block_list.py#L379). Therefore, we can remove `blocks.compute_first_block()`.
    
    Signed-off-by: Scott Lee <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    scottjlee authored and elliottower committed Apr 22, 2023
    ac117e2
  158. [core] Task backend - marking tasks failed on worker death (ray-project#33818)
    
    When a parent task fails due to a task execution error, we currently mark its child tasks as failed (incorrectly).
    With this PR, we mark task failure states properly: we rely on worker exits to trigger the failure-marking routine for tasks.
    This also aligns more closely with Ray's actual behaviour: relevant tests are changed to explicitly verify that tasks are not running on the process.
    When a node fails, we rely on other parts of Ray (GCS) to report the worker failures, which triggers the task failure marking for each worker and then marks tasks as failed properly.
    Tests are also added to verify the detached actor behaviour.
    
    Signed-off-by: elliottower <[email protected]>
    rickyyx authored and elliottower committed Apr 22, 2023
    3d902fd
  159. 59be0e7
  160. [Data] Update code owners of Ray Data (ray-project#34506)

    As title, to reflect the latest group actively working on Ray Data module.
    
    Signed-off-by: Cheng Su <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    c21 authored and elliottower committed Apr 22, 2023
    d82f401
  161. [CI][Green-Ray][2] Transient error release test needs to fail fast (ray-project#34110)
    
    In ray-project#34057, I made it so that release tests that fail with an infra error automatically retry once. This PR makes it so that a test is retried only if it fails with an infra error and also ran for less than 30 minutes.
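
The combined retry condition can be written as a single predicate. This is a sketch only: the function `should_auto_retry` and its parameter names are hypothetical, while the single retry and the 30-minute bound come from the description above.

```python
def should_auto_retry(is_infra_error, runtime_minutes, retry_count,
                      max_retries=1, max_runtime_minutes=30):
    """Retry only infra errors that failed fast and still have retries left."""
    return (is_infra_error
            and retry_count < max_retries
            and runtime_minutes < max_runtime_minutes)
```

Under these assumptions, a run that hits an infra error after 45 minutes is not retried, which keeps long and costly tests from being rerun automatically.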
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    95ef21a
  162. [data] Also improve repr of pandas dtype (ray-project#34502)

    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    e182af5
  163. [merge fix] Remove scripts again (ray-project#34513)

    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    6e75358
  164. [ci] Remove scripts duplicates and symlinks except for format.sh (ray…

    …-project#34463)
    
    A year ago, ray-project#23866 moved our CI scripts into a more descriptive folder structure. Files in scripts/ were symlinks to the moved scripts. Even then, CI and documentation did not refer to any scripts in scripts/, with the exception of scripts/format.sh, which is referred to in pull request templates.
    
    Recently, ray-project#34340 overwrote some of the symlinks with their actual files. Since almost all of these scripts are only used in CI and not by users and developers, we should just get rid of the symlinks. The exception is format.sh which is actively used by developers.
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    98afc2b
  165. [ci] Fix further linter errors (ray-project#34517)

    Some shell scripts are still failing. This PR tries to identify and fix the remaining linter errors.
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    093c223
  166. [air] Use Ray storage URI as default storage path, if configured [no_…

    …early_kickoff] (ray-project#34470)
    
    With this PR, we will use the configured Ray storage URI for syncing Ray AIR results if no other remote storage path is set.
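
The fallback described above can be sketched as follows. This is not the Ray AIR implementation; `resolve_storage_path` and both parameter names are hypothetical and only illustrate the precedence stated in the commit message.

```python
from typing import Optional


def resolve_storage_path(
    explicit_remote_path: Optional[str],
    configured_ray_storage_uri: Optional[str],
) -> Optional[str]:
    """Prefer an explicitly set remote path; otherwise use the Ray storage URI."""
    if explicit_remote_path:
        return explicit_remote_path
    # May still be None, in which case no remote syncing target is configured.
    return configured_ray_storage_uri
```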
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    4589793
  167. [Doc] Ray Debugging Doc Part 1 (OOM) (ray-project#34309)

    This doc improves the existing debugging failure documentation.
    
    It adds:
    
    - failure types
    - how to do application level failure debugging
    - out of memory debugging
    - step-by-step memory profiling
    - Rewrite the file descriptor issues (it has very old info that is not correct anymore)
    
    Signed-off-by: elliottower <[email protected]>
    rkooo567 authored and elliottower committed Apr 22, 2023
    ef285ed
  168. [ci] Restore pytest_checker script, but at correct location (ray-proj…

    …ect#34523)
    
    ray-project#34463 removed the scripts under the `scripts/` directory because all of them should have been symlinks. However, `pytest_checker.py` was an actual script that was not symlinked from the `ci/` directory. This PR restores this script at the correct location in `ci/lint` and adjusts all references to it in the codebase.
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    2f3c2e1
  169. [air] Change doc occurrences of ray.data.Dataset to ray.data.Datastre…

    …am (ray-project#34520)
    
    We recently renamed `Dataset` to `Datastream` - this PR changes occurrences of Dataset in the Ray AIR examples to Datastream. This will also fix currently broken examples that still refer to `Dataset` when `Datastream` is imported instead
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    47501e8
  170. [docs] new landing page (ray-project#33520)

    Signed-off-by: elliottower <[email protected]>
    maxpumperla authored and elliottower committed Apr 22, 2023
    9b2810b
  171. [Release test] [Cluster launcher] Add release test for aws `example-f…

    …ull.yaml` (ray-project#34487)
    
    Adds a release test for example-full.yaml on AWS.
    
    Starts the cluster with ray up, runs a simple Ray driver script, and calls ray down.
    
    Also fixes a bug in this YAML file where we were using a string instead of an int for a VolumeSize.
    
    Signed-off-by: elliottower <[email protected]>
    architkulkarni authored and elliottower committed Apr 22, 2023
    ac4ead4
  172. [Train] Fix lightning trainer devices setting (ray-project#34419)

    Signed-off-by: woshiyyya <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    woshiyyya authored and elliottower committed Apr 22, 2023
    ec21094
  173. e6db435
  174. Revert "[docs] new landing page" (ray-project#34533)

    Signed-off-by: elliottower <[email protected]>
    maxpumperla authored and elliottower committed Apr 22, 2023
    7aa85f3
  175. [docs] gentle core walkthrough (ray-project#34134)

    * [docs] gentle core walkthrough
    
    Signed-off-by: Max Pumperla <[email protected]>
    
    * Update gentle_walkthrough.ipynb
    
    Signed-off-by: Max Pumperla <[email protected]>
    
    ---------
    
    Signed-off-by: Max Pumperla <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    maxpumperla authored and elliottower committed Apr 22, 2023
    4f3e9f6
  176. [serve] Remove old deployments upon redeployment of a named app (ray-…

    …project#34451)
    
    When an app with a non-empty name is redeployed, old deployments that are no longer part of the new graph are not cleaned up. This is because a new application state is [created](https://github.com/ray-project/ray/blob/master/python/ray/serve/_private/application_state.py#L350-L355) in the application state manager, so the [logic](https://github.com/ray-project/ray/blob/master/python/ray/serve/_private/application_state.py#L77-L83) that tracks which deployments to delete never actually runs against the previous state.
    
    This was not caught before because the old deployments are no longer tracked in the application state manager, and become "zombie deployments".
    
    This isn't a problem for the single app case (so there is no regression), because the old logic used to clean up old deployments hasn't yet been removed: https://github.com/ray-project/ray/blob/releases/2.3.0/python/ray/serve/_private/client.py#L297-L307.
    
    Reproduction script:
    
    ```
    # script.py
    from ray import serve
    
    @serve.deployment
    def f():
        return "f"
    
    @serve.deployment
    def g():
        return "g"
    
    fn = f.bind()
    gn = g.bind()
    ```
    
    Deploy it:
    ```
    from ray import serve
    from ray.serve.schema import ServeDeploySchema
    
    client = serve.start(detached=True)
    config = {"applications": [{"name": "app1", "import_path": "script.fn"}]}
    client.deploy_apps(ServeDeploySchema.parse_obj(config))
    ```
    Redeploy with a different graph:
    ```
    # Same imports as in the deploy step.
    client = serve.start(detached=True)
    config = {"applications": [{"name": "app1", "import_path": "script.gn"}]}
    client.deploy_apps(ServeDeploySchema.parse_obj(config))
    ```
    
    See that `app1_f` is not deleted:
    ![image](https://user-images.githubusercontent.com/15851518/232265539-c5af44e4-f37a-4305-9419-60744bba9b35.png)
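
The cleanup that the redeploy should trigger amounts to a set difference between the old and new deployment names. The sketch below is only an illustration of that idea, not the Serve controller code; the helper name is hypothetical, and `app1_g` is an assumed name that follows the same prefix pattern as `app1_f`.

```python
from typing import Iterable, Set


def deployments_to_delete(previous: Iterable[str], current: Iterable[str]) -> Set[str]:
    """Deployments present before the redeploy but absent from the new graph."""
    return set(previous) - set(current)


# After redeploying app1 with script.gn, app1_f should be scheduled for deletion.
stale = deployments_to_delete(previous=["app1_f"], current=["app1_g"])
```

The bug described above corresponds to `previous` being empty at comparison time, because the old application state was replaced before the comparison could run.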
    
    Signed-off-by: elliottower <[email protected]>
    zcin authored and elliottower committed Apr 22, 2023
    392f97e
  177. Revert "[CI] Fix shellcheck lint (ray-project#34488)" (ray-project#34529

    )
    
    The shellcheck fix broke the shell script when $use_lstm is empty.
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    1c10e79
  178. [RLlib] Learner group checkpointing (ray-project#34379)

    Implement multinode learner group checkpointing and tests.
    
    ---------
    
    Signed-off-by: Avnish <[email protected]>
    Signed-off-by: avnishn <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    avnishn authored and elliottower committed Apr 22, 2023
    c500638
  179. Revert "Revert "[docs] new landing page"" (ray-project#34534)

    Signed-off-by: elliottower <[email protected]>
    maxpumperla authored and elliottower committed Apr 22, 2023
    5f456ed
  180. [Train] Allow local datasets in HuggingFaceTrainer (ray-project#34485)

    * Allow local datasets in HuggingFaceTrainer
    
    Signed-off-by: Antoni Baum <[email protected]>
    
    * Clarify
    
    Signed-off-by: Antoni Baum <[email protected]>
    
    ---------
    
    Signed-off-by: Antoni Baum <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    Yard1 authored and elliottower committed Apr 22, 2023
    00d7b9b
  181. [air] Add tune frequent pausing release test. (ray-project#34501)

    Signed-off-by: xwjiang2010 <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    xwjiang2010 authored and elliottower committed Apr 22, 2023
    23507c0
  182. [CI][Bisect] Fix bisect due to wrong order of commit list (ray-projec…

    …t#34536)
    
    Why are these changes needed?
    Currently we use git rev-list to get the list of commits. This command returns the commits in the reverse of the order we want, so we reverse the list before passing it to bisect.
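
A minimal sketch of the reordering, assuming the raw `git rev-list` output is already available as a string (by default it lists the newest commit first). The helper name is hypothetical and is not taken from the CI code.

```python
from typing import List


def commits_oldest_first(rev_list_output: str) -> List[str]:
    """Turn newest-first `git rev-list` output into an oldest-first list."""
    newest_first = [line.strip() for line in rev_list_output.splitlines() if line.strip()]
    return list(reversed(newest_first))
```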
    ---------
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    76ccdcf
  183. [UI] Disable null job id jumpable (ray-project#34378)

    Make Ray submission jobs that have no job ID non-clickable. Otherwise, users are navigated to a /jobs/null page where no job info is shown, which is confusing.
    
    Signed-off-by: elliottower <[email protected]>
    chaowanggg authored and elliottower committed Apr 22, 2023
    73d3434
  184. [core][ci] Fix mac test_task_events_2 (ray-project#34538)

    On macOS, we also don't have access to the task name from the psutil Process (just like on Windows).
    
    Closes ray-project#34530
    
    Signed-off-by: elliottower <[email protected]>
    rickyyx authored and elliottower committed Apr 22, 2023
    3ddb1a3
  185. [CI] Add GCE variances for Data chaos tests (ray-project#34519)

    This PR configures BuildKite to run Data release tests on GCE. I excluded the parquet_metadata_resolution and shuffle_data_loader release tests because more work is required to migrate those tests.
    
    ---------
    
    Signed-off-by: Balaji Veeramani <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    bveeramani authored and elliottower committed Apr 22, 2023
    b1196df
  186. [Train] Support FSDP Strategy for LightningTrainer (ray-project#34148)

    Signed-off-by: woshiyyya <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    woshiyyya authored and elliottower committed Apr 22, 2023
    3616ca2
  187. [CI][Bisect][Easy/Urgent] Fix bisect (ray-project#34559)

    Fix a couple of issues:
    - Correct git command to get the list of revs including both boundaries
    - Correct the boundary of the remaining list after each bisect
    
    The previous code had issues with the boundaries. Added a test case that failed with the previous code but passes with the new code.
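
To make the boundary handling concrete, here is a generic bisect sketch, not the code from this PR. It assumes the commit list is ordered oldest first and includes both boundaries, with the first commit known to pass and the last known to fail; `first_bad_commit` and `is_good` are hypothetical names.

```python
from typing import Callable, List


def first_bad_commit(commits: List[str], is_good: Callable[[str], bool]) -> str:
    """Return the first failing commit in an oldest-first, boundary-inclusive list."""
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] passes, commits[hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid  # keep mid as the new passing boundary
        else:
            hi = mid  # keep mid as the new failing boundary
    return commits[hi]
```

Keeping `mid` inside the remaining range on both branches is what preserves the invariant; dropping it on either side is the kind of off-by-one the commit message refers to.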
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    d0d0757
  188. [ci/release] GCE test variants for ml_user tests (ray-project#34465)

    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    9c81eb8
  189. [Core][easy] disable test not suppose to work with ray client ray-pro…

    …ject#34556
    
    this env doesn't work with ray client.
    
    Signed-off-by: elliottower <[email protected]>
    scv119 authored and elliottower committed Apr 22, 2023
    3466113
  190. [ci/release] GCE test variants for air_benchmark and air_examples (ra…

    …y-project#34466)
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    8f34645
  191. Log databricks proxy (ray-project#34088)

    This PR adds standard logging of the Databricks proxy URL for the dashboard when a ray cluster starts.
    
    Currently the HTML link does not render until cell completion so it is difficult to access the dashboard while a ray workload is running.
    
    Signed-off-by: Nathan Azrak <[email protected]>
    Co-authored-by: Nathan Azrak <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    2 people authored and elliottower committed Apr 22, 2023
    829ce6c
  192. [core] add core team to protobuf owner ray-project#34566

    Update the ownership of the relevant folders.
    
    Signed-off-by: elliottower <[email protected]>
    scv119 authored and elliottower committed Apr 22, 2023
    cb4aa67
  193. [Core][pubsub] handle failures when publish failed. (ray-project#33115)

    Why are these changes needed?
    ray-project#32046 indicates that pubsub might lose data, especially when the subscriber is under load. After examining the protocol, it seems one bug is that the publisher fails to handle publish failures: when we push a message in the mailbox, we delete the message being sent regardless of RPC failures.
    
    This PR addresses the problem by adding a monotonically increasing sequence_id to each message and deleting a message only once the subscriber has acknowledged receiving it.
    
    The sequence_id sequence is generated per publisher, regardless of channel. This means that if a publisher has multiple channels, each channel might not see contiguous sequence numbers. This also assumes the invariant that a subscriber object subscribes to only one publisher.
    
    We also rely on the pubsub protocol's guarantee that at most one ongoing push request is in flight at a time.
    
    This also handles GCS failover. We do so by tracking the publisher_id on both the publisher and the subscriber. When the GCS fails over, the publisher_id changes, so both the publisher and the subscriber forget the previous state.
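
The protocol described above can be sketched in a few lines. This is a simplified in-process model, not the C++ pubsub code in Ray; the class names, method names, and the use of a random hex string as `publisher_id` are all assumptions made for illustration.

```python
import itertools
import uuid
from collections import deque


class Publisher:
    def __init__(self):
        self.publisher_id = uuid.uuid4().hex  # changes when the publisher restarts
        self._next_seq = itertools.count(1)   # one sequence per publisher, across channels
        self._mailbox = deque()               # (sequence_id, message), oldest first

    def publish(self, message):
        self._mailbox.append((next(self._next_seq), message))

    def pending(self):
        """Messages to (re)send; nothing is dropped until it is acknowledged."""
        return list(self._mailbox)

    def ack(self, publisher_id, max_processed_seq):
        if publisher_id != self.publisher_id:
            return  # ack addressed to a previous publisher incarnation
        while self._mailbox and self._mailbox[0][0] <= max_processed_seq:
            self._mailbox.popleft()


class Subscriber:
    def __init__(self):
        self.publisher_id = None
        self.max_processed_seq = 0
        self.received = []

    def on_push(self, publisher_id, batch):
        if publisher_id != self.publisher_id:
            # New publisher (for example after failover): forget the old state.
            self.publisher_id = publisher_id
            self.max_processed_seq = 0
        for seq, message in batch:
            if seq <= self.max_processed_seq:
                continue  # duplicate delivered by a retry
            self.received.append(message)
            self.max_processed_seq = seq
        return self.publisher_id, self.max_processed_seq
```

If a push fails, the publisher simply never receives the ack, so `pending()` still returns the same messages on the next attempt, and the subscriber's sequence check discards any duplicates.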
    
    Signed-off-by: elliottower <[email protected]>
    scv119 authored and elliottower committed Apr 22, 2023
    1ac5350
  194. [AIR] Add util to create a torch ddp process group for a list of work…

    …ers. (ray-project#34202)
    
    Signed-off-by: Jun Gong <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    Jun Gong authored and elliottower committed Apr 22, 2023
    6a614fa
  195. [CI][Core] Set some GCE smoke tests to run on manual frequency (ray-p…

    …roject#34516)
    
    I noticed that some GCE smoke versions run nightly. Let's move them to manual frequency instead, since we don't want to pay the cost of running them on an automatic cadence yet.
    
    ---------
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    e6e49fa
  196. [CI] Fix some chaos test configurations (ray-project#34571)

    Some GCE chaos test configurations are using AWS configs. Change them to the GCE equivalents. Also use the more powerful n2 machines instead of e2.
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    3ab751d
  197. [release] Make sure that test code matches the installed wheel. (ray-…

    …project#30156)
    
    Signed-off-by: xwjiang2010 <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    xwjiang2010 authored and elliottower committed Apr 22, 2023
    3c01cb9
  198. [air-output] minor fix to print configuration on start. (ray-project#…

    …34575)
    
    Signed-off-by: xwjiang2010 <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    xwjiang2010 authored and elliottower committed Apr 22, 2023
    0aa2ee8
  199. [Core] Deflake test_advanced_9 (ray-project#34410)

    Looks like gcs server proc doesn't go back to original num_fds; it goes lower.
    
    output from my machine:
    
    >> 222 # before starting worker procs
    (A pid=28851) HELLO
    ['WORLD', 'WORLD', 'WORLD', 'WORLD', 'WORLD', 'WORLD', 'WORLD', 'WORLD', 'WORLD', 'WORLD']
    >> 250 # with worker procs
    >> 217
    >> 216
    >> 213
    >> 212
    >> 207
    >> 206 # after work procs die.
    >> 206
    >> 208 # Not sure why it goes up again
    >> 208 # Remains at 208, times out
    This PR deflakes the test, but I don't know enough about gcs server to say if this is a good fix or not.
    
    Signed-off-by: elliottower <[email protected]>
    cadedaniel authored and elliottower committed Apr 22, 2023
    ca64a29
  200. [data] Standardize on Arrow types for schema() in strict mode

    Signed-off-by: Eric Liang <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    5c8cf49
  201. [ray-data] Add alias parameters to the aggregate function, and add qu…

    …antile fn (ray-project#34358)
    
    Signed-off-by: elliottower <[email protected]>
    yiwei00000 authored and elliottower committed Apr 22, 2023
    3485e52
  202. 41bc627
  203. Disallow format query in strict mode (ray-project#34564)

    Signed-off-by: Eric Liang <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    3a877cc
  204. [data] Log a warning if the batch size is misconfigured in a way that…

    … would grossly reduce parallelism for actor pool. (ray-project#34594)
    
    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    6abf379
  205. [Dashboard] Make loading screen not block out the entire page. (ray-p…

    …roject#34515)
    
    Previously, if a dashboard page was loading, it would grey out the whole screen and buttons would not be press-able. Now, we don't block out the whole page.
    Also don't show loading bar if data is already loaded from in-memory cache.
    
    Signed-off-by: elliottower <[email protected]>
    alanwguo authored and elliottower committed Apr 22, 2023
    b302c52
  206. [data] [docs] Datastream docs rename [5/n] (ray-project#34512)

    Part 5 of ray-project#34235
    
    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    1df0ca1
  207. clarify M1 installation instructions (ray-project#34505)

    A few folks have been confused by the order of the installation instructions for M1, so adding some clarifying language. While I was at it, I made minor improvements to some language in nearby paragraphs.
    
    Signed-off-by: elliottower <[email protected]>
    angelinalg authored and elliottower committed Apr 22, 2023
    6eb23c7
  208. Create LLM section and add examples (ray-project#34614)

    Surface LLM/Generative AI use cases.
    
    Signed-off-by: angelinalg <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    angelinalg authored and elliottower committed Apr 22, 2023
    9636e78
  209. Add driver logs to Jobs page for submission jobs (ray-project#34514)

    - Adds driver logs to the Jobs page for submission jobs.
    - Adds a refresh button to the log viewer to reload the logs.
    - Refactors the log viewer from the logs page into its own component.
    - Updates the look and feel of the jobs page to match the new IA style.
    - Adds user-provided metadata to the job detail page (fixes ray-project#34187, "[Core|Dashboard] Support custom tags for jobs").
    - Updates the table icon.
    - Changes "Tasks" to "Tasks/actor overview".
    - Adds a Node Count card next to the ray status cards.
    
    Signed-off-by: elliottower <[email protected]>
    alanwguo authored and elliottower committed Apr 22, 2023
    5df609b
  210. [air/Doc] Fix unused config building function in lightning MNIST exam…

    …ple.
    
    build_lightning_config_from_existing_code() is not called in the example, and there is duplicated config-building logic below it. This PR uses the function and removes the duplicate.
    
    Signed-off-by: woshiyyya <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    woshiyyya authored and elliottower committed Apr 22, 2023
    42b9a92
  211. 304a0ce
  212. [ci/release] Increase concurrency limit for gpu gce (ray-project#34578)

    We now have 100 T4 machines, so increase the limit. At peak, this limit means that we will use:
    
    8*4 + 4*4 + 2*8 + 32 = 96 machines
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    1c2e6a0
  213. [serve][nit] Fix formatting & verbiage for serve shutdown (ray-proj…

    …ect#34585)
    
    Fixes unnecessary spaces & cleans up wording.
    
    Before:
    ```
    (ray) eoakes@Edwards-MacBook-Pro-2 serve % serve shutdown
    
    This will shutdown the Serve application at address "http://localhost:52365" and delete all deployments there. Do you want to continue? [y/N]: y
    2023-04-19 12:46:12,078 SUCC scripts.py:584 --
    Sent delete request successfully!
    
    ```
    
    After:
    ```
    (ray) eoakes@Edwards-MacBook-Pro-2 serve % serve shutdown
    This will shut down Serve on the cluster at address "http://localhost:52365" and delete all applications there. Do you want to continue? [y/N]: y
    2023-04-19 12:45:52,050 SUCC scripts.py:583 -- Sent shutdown request; applications will be deleted asynchronously.
    ```
    
    Signed-off-by: elliottower <[email protected]>
    edoakes authored and elliottower committed Apr 22, 2023
    ed34037
  214. [ci/release] GCE variants for remaining Tune tests (ray-project#34572)

    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    c8ab61a
  215. 2ff90f1
  216. [air-output] print out worker ip for distributed train workers. (ray-…

    …project#33807)
    
    Signed-off-by: xwjiang2010 <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    xwjiang2010 authored and elliottower committed Apr 22, 2023
    15d11be
  217. Fix download_wheels.sh wheel urls (ray-project#34616)

    Some mac wheel urls are invalid
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    jjyao authored and elliottower committed Apr 22, 2023
    9884c27
  218. [Data] Fix iter_tensor_batches_benchmark_multi_node GCE (ray-projec…

    …t#34598)
    
    The `iter_tensor_batches_benchmark_multi_node` GCE variant was failing because it used the wrong compute config.
    
    Signed-off-by: Balaji Veeramani <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    bveeramani authored and elliottower committed Apr 22, 2023
    9379bdf
  219. [Doc][AIR] Improve visibility of Trainer restore and stateful callbac…

    …k restoration (ray-project#34350)
    
    Signed-off-by: elliottower <[email protected]>
    justinvyu authored and elliottower committed Apr 22, 2023
    374cab9
  220. [Serve] [Docs] Change incorrect Serve app name in Stable Diffusion tu…

    …torial (ray-project#34426)
    
    The ray serve command was not matching the correct object.
    
    Signed-off-by: elliottower <[email protected]>
    robin-anyscale authored and elliottower committed Apr 22, 2023
    3e34002
  221. aac345b
  222. [docs] intro and graphic for LLM (ray-project#34615)

    Follow up to ray-project#34614
    
    Why are these changes needed?
    To match the other use cases, we need a more substantial intro paragraph and graphic.
    
    ---------
    
    Signed-off-by: angelinalg <[email protected]>
    Signed-off-by: Philipp Moritz <[email protected]>
    Co-authored-by: Philipp Moritz <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    2 people authored and elliottower committed Apr 22, 2023
    a548291
  223. Fix typo in node.py (ray-project#34630)

    Fix typo in docstring.
    
    Signed-off-by: JYX <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    jjyyxx authored and elliottower committed Apr 22, 2023
    610a8d8
  224. [CI][Green-Ray][3] Extract error logs from ray logs (ray-project#34193)

    Currently there are many test runs where we fail to acquire logs (especially for infra-failure issues). With this PR, if we fail to query the application logs, we fall back to querying the Ray logs for error patterns.
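
The fallback can be pictured roughly as below. This is a sketch only: the function names and the two regular expressions are hypothetical and are not the patterns used by the release test tooling.

```python
import re
from typing import List, Optional

# Hypothetical error patterns, chosen only for illustration.
ERROR_PATTERNS = [
    re.compile(r"Traceback \(most recent call last\)"),
    re.compile(r"\bERROR\b"),
]


def error_lines(lines: List[str]) -> List[str]:
    """Keep only the lines that match at least one error pattern."""
    return [line for line in lines if any(p.search(line) for p in ERROR_PATTERNS)]


def logs_for_report(app_logs: Optional[List[str]], ray_logs: List[str]) -> List[str]:
    """Use application logs when available; otherwise scan the Ray logs for errors."""
    if app_logs is not None:
        return app_logs
    return error_lines(ray_logs)
```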
    
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    d69eb02
  225. [Data] [strict-mode] Remove internal TableRow abstractions and instead use Dict[str, Any] as the row format
    
    Signed-off-by: Eric Liang <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    c74554a
  226. [train] Add AccelerateTrainer as valid AIR_TRAINER (ray-project#34639)

    Signed-off-by: Matthew Deng <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    matthewdeng authored and elliottower committed Apr 22, 2023
    fa77f89
  227. [data] Configure progress bars via DataContext

    Signed-off-by: elliottower <[email protected]>
    ericl authored and elliottower committed Apr 22, 2023
    f68b4c1
  228. [CI] disable flaky test test_run_on_all_workers (ray-project#34647)

    Signed-off-by: elliottower <[email protected]>
    scv119 authored and elliottower committed Apr 22, 2023
    62a307f
  229. Revert "[core]Turn on light weight resource broadcasting. (ray-project#32625)" (ray-project#34636)
    
    This reverts commit 1bfbc46.
    
    Signed-off-by: elliottower <[email protected]>
    scv119 authored and elliottower committed Apr 22, 2023
    8c77f56
  230. [docs] replace tune.report with session.report (ray-project#34435)

    Signed-off-by: elliottower <[email protected]>
    angelinalg authored and elliottower committed Apr 22, 2023
    b702efe
  231. [Ci] fix pip version to deflake minimal install 3.10

    Pin the pip version to check whether the test failure is caused by a pip upgrade.
    
    Signed-off-by: elliottower <[email protected]>
    scv119 authored and elliottower committed Apr 22, 2023
    3652630
  232. [CI] fix virtualenv version to deflake linux://python/ray/tests:test_runtime_env_complicated (ray-project#34650)
    
    It looks like virtualenv was upgraded between the passing and the failing test runs.
    
    Signed-off-by: elliottower <[email protected]>
    scv119 authored and elliottower committed Apr 22, 2023
    69a5d29
  233. [Syncer] Remove spammy logs. (ray-project#34654)

    Signed-off-by: elliottower <[email protected]>
    rkooo567 authored and elliottower committed Apr 22, 2023
    a3bd535
  234. [ci/release] GCE variants for Alpa, Golden notebooks, Lightning, Horovod, Workspace templates (ray-project#34565)
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    3bb1993
  235. [docs][tune] Fix Tune tutorial (ray-project#34660)

    One line fix for bug introduced in ray-project#34435
    
    Signed-off-by: Kai Fricke <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    krfricke authored and elliottower committed Apr 22, 2023
    4f94951
  236. [Autoscaler][gcp] parallel terminate nodes (ray-project#34455)

    Why are these changes needed?
    `ray down` takes a long time with GCPNodeProvider, as reported in ray-project#26239. GCPNodeProvider inherits the serial implementation of terminate_nodes from its parent class NodeProvider, and its terminate_node holds a coarse lock, which seems to prevent it from running concurrently (I am not fully sure, since I am new to this code).
    
    - Add a ThreadPoolExecutor in GCPNodeProvider.terminate_nodes to run terminate_node calls in parallel.
    - Use fine-grained locks, with one RLock per node_id.
    - Add unit tests.
    
    Why not go with the suggestions mentioned in ray-project#26239 (batch APIs and a non-blocking version of terminate_node)?
    As a novice, I think both solutions would break the Liskov Substitution Principle, and callers that already use terminate_node(s) would need to add await.
    
    Related issue number
    ray-project#26239
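
The approach described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class `ParallelTerminator`, its `terminated` set, and the `max_workers` default are hypothetical, and the real GCPNodeProvider would call the GCP API where the comment indicates.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class ParallelTerminator:
    """Hypothetical stand-in for a node provider (not Ray's GCPNodeProvider)."""

    def __init__(self, max_workers=8):
        self._max_workers = max_workers
        # Guards the lock table and the shared result set.
        self._guard = threading.Lock()
        # One re-entrant lock per node_id instead of one coarse lock.
        self._node_locks = defaultdict(threading.RLock)
        self.terminated = set()

    def _lock_for(self, node_id):
        with self._guard:
            return self._node_locks[node_id]

    def terminate_node(self, node_id):
        # Only calls for the same node serialize against each other.
        with self._lock_for(node_id):
            # A real provider would issue the cloud API call here.
            with self._guard:
                self.terminated.add(node_id)
        return node_id

    def terminate_nodes(self, node_ids):
        if not node_ids:
            return []
        # Run the per-node calls on a thread pool; map() keeps input order.
        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            return list(pool.map(self.terminate_node, node_ids))
```

Because `terminate_nodes` stays a blocking call with the same signature, existing callers do not need to change, which matches the stated reason for not adopting an awaitable API.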
    
    ---------
    
    Signed-off-by: Chen-Chen Yeh <[email protected]>
    Co-authored-by: Chen-Chen Yeh <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    2 people authored and elliottower committed Apr 22, 2023
    8e796fd
  237. [Tune] Enable tune.ExperimentAnalysis to pull experiment checkpoint files from the cloud if needed (ray-project#34461)
    
    For post-experiment analysis of a Tune run that uploaded results and checkpoints to S3, the node where analysis is being done may not contain the experiment directory. In this case, the experiment checkpoint and other files (JSON and CSV result files and the param space) should be pulled to a temp directory in the local filesystem.
    
    While this adds functionality to `ExperimentAnalysis`, it also makes the following work:
    1. `ResultGrid(ExperimentAnalysis("s3://..."))`, which is what happens in `tuner.fit()`
    2. `Tuner.restore("s3://...").get_results()`
    
    Point 2 was the error that flagged this issue in the first place.
    
    This PR also cleans up some confusing trial metadata loading code in `ExperimentAnalysis`.
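
The "pull to a temp directory if the path is remote" idea can be sketched as below. This is an assumption-laden illustration, not Ray's implementation: `resolve_experiment_dir` and the `download` callback are hypothetical names, and the check assumes POSIX-style local paths (no URI scheme).

```python
import tempfile
from urllib.parse import urlparse


def resolve_experiment_dir(path, download):
    """Return a local directory for `path`.

    `download(uri, local_dir)` is a hypothetical callback that copies the
    experiment files from remote storage into `local_dir`.
    """
    if not urlparse(path).scheme:
        # No URI scheme: treat the path as already local.
        return path
    # Remote URI (for example s3://...): fetch into a fresh temp directory.
    local_dir = tempfile.mkdtemp(prefix="experiment_")
    download(path, local_dir)
    return local_dir
```

Analysis code can then read result files from the returned directory regardless of where the experiment was originally stored.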
    
    Signed-off-by: Justin Yu <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    justinvyu authored and elliottower committed Apr 22, 2023
    3469051
  238. [docs] [serve] removed line numbers and fixed file name summary_model.py (ray-project#34617)
    
    Copy and paste button was including line numbers in 3 code examples, which is a bad user experience.
    Fixed error with filename. The command line instructions said `python model.py` but it should be `python summary_model.py`.
    
    This addresses two issues in GH issue 34481, but not all of them.
    
    Signed-off-by: elliottower <[email protected]>
    angelinalg authored and elliottower committed Apr 22, 2023
    6fc9b4b
  239. [CI][Green-Ray][4] Compute and store unique crash pattern from logs (ray-project#34200)
    
    This PR computes and aggregates unique crash patterns from logs, then stores them in Databricks. Later on, this will help us build a dashboard with a heat map of errors from the aggregated logs and prioritize the most impactful errors to fix.
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    a7298ec
  240. [serve] Add support for application builders & arguments (ray-project#34584)
    
    First cut at an implementation for ray-project#34542.
    
    There should be no changes in behavior for existing applications.
    
    Documentation and examples will be updated in a separate PR; I would like to get this merged first to get feedback from others on the API.
    
    Signed-off-by: elliottower <[email protected]>
    edoakes authored and elliottower committed Apr 22, 2023
    e9aa541
  241. [docs] add click events for code blocks (ray-project#34623)

    Signed-off-by: elliottower <[email protected]>
    maxpumperla authored and elliottower committed Apr 22, 2023
    f4a5aac
  242. [Datasets] Support non-shuffle repartitioning in Repartition `LogicalOperator` (ray-project#34547)
    
    This is a follow-up to ray-project#32102 that supports non-shuffle repartitioning in the logical operator, as in _internal/fast_repartition.py.
    
    Signed-off-by: Scott Lee <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    scottjlee authored and elliottower committed Apr 22, 2023
    cca57a3
  243. [docs] Fix broken links (ray-project#34665)

    Signed-off-by: elliottower <[email protected]>
    jjyao authored and elliottower committed Apr 22, 2023
    02299ba
  244. [docs] wrap autogenerated API nav items (ray-project#34047)

    Signed-off-by: elliottower <[email protected]>
    maxpumperla authored and elliottower committed Apr 22, 2023
    e9ec461
  245. [docs] sphinx design 1/n (ray-project#34625)

    Signed-off-by: elliottower <[email protected]>
    maxpumperla authored and elliottower committed Apr 22, 2023
    fa30fff
  246. [CI][Bisect][4] Add pre-sanity check to avoid infra or external change root causes (ray-project#34553)
    
    Why are these changes needed?
    Tests often fail because of issues unrelated to code changes (external or infra issues). Before running a bisect, run a pre-sanity check to make sure that the provided passing and failing revisions are valid. Otherwise, terminate the bisect early and let the user know that the test is flaky.
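
A pre-sanity check of this kind might look like the sketch below. It is only an illustration under stated assumptions: `pre_sanity_check`, `bisect`, and the `run_test` callback are hypothetical names, `revisions` is assumed to be ordered oldest to newest with at least two entries, and `run_test(rev)` is assumed to return `True` when the test passes on that revision.

```python
def pre_sanity_check(run_test, passing_rev, failing_rev):
    """Return True only if both revisions behave as claimed."""
    return run_test(passing_rev) and not run_test(failing_rev)


def bisect(revisions, run_test):
    """Find the first failing revision, or None if the inputs look flaky.

    The first revision is claimed to pass and the last is claimed to fail.
    """
    if not pre_sanity_check(run_test, revisions[0], revisions[-1]):
        # The claimed endpoints do not reproduce: stop early.
        return None
    lo, hi = 0, len(revisions) - 1  # revisions[lo] passes, revisions[hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_test(revisions[mid]):
            lo = mid
        else:
            hi = mid
    return revisions[hi]
```

Running the two endpoint checks first costs two extra test runs but avoids a full bisect whose result would be meaningless when the failure is not tied to a code change.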
    
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    a6deb57
  247. [CI][HotFix] Revert 34499 ray-project#34688

    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: elliottower <[email protected]>
    can-anyscale authored and elliottower committed Apr 22, 2023
    5ba8b4c
  248. [autoscaler v2] Interface between autoscaler and gcs (ray-project#34680)

    Why are these changes needed?
    This PR introduces the interface between GCS and the Autoscaler.
    
    Specifically, it introduces two APIs:
    
    - GetClusterResourceState: the Autoscaler queries this interface to get cluster resource usage, which includes nodes (state and resource utilization) as well as pending requests, which include ResourceRequest, GangResourceRequest, and ClusterResourceConstraint.
      - NodeState includes NodeStatus, which can transition ALIVE -> DEAD, ALIVE -> DRAIN_PENDING -> DRAINING -> DRAINED -> DEAD, or ALIVE -> DRAIN_PENDING -> DRAIN_FAILED. It also includes the instance_id that the autoscaler is aware of, which allows the autoscaler to do reconciliation if available.
      - ResourceRequest comes with a PlacementConstraint, which only supports AntiAffinityConstraint today; its semantics are that the resource request can't be allocated on a node with the same label/value specified in the AntiAffinityConstraint.
      - GangResourceRequest has gang scheduling semantics: the requests in the gang should all be fulfilled atomically.
    - ReportAutoscalingState: the Autoscaler reports its own state back to the cluster using this API, which includes all instances (including those pending launch) as well as infeasible requests.
    
    Instance state can transition QUEUED -> REQUESTED -> BOOTSTRAPPING -> ALIVE -> TERMINATING -> DEAD.
    
    Two special states are TO_BE_PREEMPTED and TO_BE_DRAINED: the former is forced preemption, the latter is collaborative draining (which can be reversed).
    
    It also reports back requests that are infeasible, each associated with a specific request version.
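
The state transitions listed in this commit message can be written down as a small table, sketched below. The state names come from the message, but the Python enum, the helper functions, and the choice to treat every unlisted edge as invalid are assumptions of this sketch; the actual definitions in the PR are not shown here.

```python
from enum import Enum, auto


class NodeStatus(Enum):
    ALIVE = auto()
    DRAIN_PENDING = auto()
    DRAINING = auto()
    DRAINED = auto()
    DRAIN_FAILED = auto()
    DEAD = auto()


# Edges taken from the three paths listed in the commit message.
# Treating all other edges as invalid is an assumption of this sketch.
NODE_TRANSITIONS = {
    NodeStatus.ALIVE: {NodeStatus.DEAD, NodeStatus.DRAIN_PENDING},
    NodeStatus.DRAIN_PENDING: {NodeStatus.DRAINING, NodeStatus.DRAIN_FAILED},
    NodeStatus.DRAINING: {NodeStatus.DRAINED},
    NodeStatus.DRAINED: {NodeStatus.DEAD},
    NodeStatus.DRAIN_FAILED: set(),
    NodeStatus.DEAD: set(),
}

# Linear instance lifecycle from the commit message. TO_BE_PREEMPTED and
# TO_BE_DRAINED are omitted because the message does not say where they attach.
INSTANCE_LIFECYCLE = [
    "QUEUED", "REQUESTED", "BOOTSTRAPPING", "ALIVE", "TERMINATING", "DEAD",
]


def is_valid_node_path(path):
    """Check that each consecutive pair in `path` is a listed transition."""
    return all(nxt in NODE_TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))


def next_instance_state(state):
    """Return the next lifecycle state, or None for the final state."""
    i = INSTANCE_LIFECYCLE.index(state)
    return INSTANCE_LIFECYCLE[i + 1] if i + 1 < len(INSTANCE_LIFECYCLE) else None
```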
    
    Signed-off-by: elliottower <[email protected]>
    scv119 authored and elliottower committed Apr 22, 2023
    4636032
  249. Update gymnasium version to 0.28.1

    Signed-off-by: elliottower <[email protected]>
    elliottower committed Apr 22, 2023
    be632df
  250. Update SuperSuit version

    Signed-off-by: elliottower <[email protected]>
    elliottower committed Apr 22, 2023
    5888c52