[RLlib] Remove `return_info` from `reset()` in `pettingzoo_env.py`. #33470
Commits on Apr 22, 2023
Fix tensorarray to numpy conversion (ray-project#34115)
* Revert "[Datasets] Revert "Enable streaming executor by default (ray-project#32493)" (ray-project#33485)"
  This reverts commit 5c79954.
* Fix tensorarray to numpy conversion
Signed-off-by: elliottower <[email protected]>
Commit 982a4c0
[data] Fix test failure caused by lack of ordering in default streaming executor (ray-project#34120)
* Revert "[Datasets] Revert "Enable streaming executor by default (ray-project#32493)" (ray-project#33485)"
  This reverts commit 5c79954.
* Fix test failure caused by lack of ordering in default streaming executor
Signed-off-by: elliottower <[email protected]>
Commit c8c2471
[ci/release] Add more GCE variants for tests (ray-project#34046)
cluster_tune_scale_up_down long_running_horovod_tune_test Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 9f35260
[CI] Migrate many_actors and many_tasks to v2 (ray-project#34123)
There is a perf regression on the v2 stack, but at least the tests can run there. Starting 65 nodes currently has a very low success rate on the v1 stack. Signed-off-by: elliottower <[email protected]>
Commit c15a66f
[Java] Don't load cpp library in dev model. (ray-project#33667)
Don't load cpp library in dev model, because it causes an error when nativeGetSystemConfig is invoked in local model on the Mac. Co-authored-by: XiaodongLv <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 47ece52
[Serve] Fix standalone3 tests (ray-project#34100)
The environment is not inherited on Windows with subprocess, so we need to explicitly inject env variables. We didn't find this earlier because the test was inside standalone2.py, which is ignored for Windows. The test now passes on Windows:
```
//python/ray/serve:test_standalone3 PASSED in 207.6s
```
Signed-off-by: elliottower <[email protected]>
Commit 6969f86
[docs] fix nav (ray-project#34133)
Signed-off-by: elliottower <[email protected]>
Commit f2c5b1c
[CI][Clean][2] Make s3 function names more agnostic (ray-project#33944)
Some existing functions work with both s3 and gs but have the word s3 in their names. Refactor those. Also create constants for commonly used values, reduce duplication, etc. Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 5e6ca87
[CI][GCE/5] Add GCE variations of GPU tests (ray-project#33946)
Add GCE variations of GPU tests. Two key things need to change for GPU tests to work:
- Better concurrency control for GPU tests in GCE. GCE has a low GPU quota, and between Ray start-up and auto-scaling, jobs competing for resources tend to run into deadlock. With better concurrency control, they can now all run successfully.
- The 'dataset_shuffle_push_based_random_shuffle_100tb' test requires 400TB of storage in the cluster. GCE, however, currently has only 200TB, so this test runs with 50tb of data in GCE (for now).
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: elliottower <[email protected]>
Commit 3b4d7cf
[Data] Add `output_arrow_format` to `from_items` (ray-project#33837)
DelegatingBlockBuilder does not have consistent behavior for dict inputs. It attempts to create an Arrow block, but will fall back to SimpleBlock if that fails. That has led to silent behavior changes such as ray-project#33789. In this PR, we add a flag to explicitly force Arrow block.
Signed-off-by: amogkam <[email protected]>
Signed-off-by: Amog Kamsetty <[email protected]>
Signed-off-by: elliottower <[email protected]>
Commit daf546e
[data] Add streaming execution documentation (ray-project#33941)
* Revert "[Datasets] Revert "Enable streaming executor by default (ray-project#32493)" (ray-project#33485)" This reverts commit 5c79954. * Add streaming execution documentation * fix * feedback * remove new file * fix * fix * key concept * fix * fix * fix * wording * feedback Signed-off-by: elliottower <[email protected]>
Commit 235b314
[data] Remove datasets github workflow (ray-project#34138)
This was added in ray-project#26127, but never successfully worked due to missing credentials. Signed-off-by: Matthew Deng <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 0d0eed9
[data] Add pydoc for ExecutionOptions (ray-project#34144)
Signed-off-by: elliottower <[email protected]>
Commit 5c25a74
[Part 1/n] Rename Dataset => Datastream in top level files and pydocs (ray-project#33779)
Signed-off-by: elliottower <[email protected]>
Commit 6728d79
[Datasets] Improve formatting of `DatasetStatsSummary`, `StageStatsSummary`, `IterStatsSummary` (ray-project#34119)
Similar to the Dataset.repr formatting improvements in ray-project#32722, improve the readability of DatasetStatsSummary, StageStatsSummary, IterStatsSummary when printed. See the included test case for examples.
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: elliottower <[email protected]>
Commit 3cc91d0
[RLlib] checkpoint learner (ray-project#33598)
Signed-off-by: Avnish <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 64b500b
[ray.util.spark] Add warning if webui_url is None. optional dependencies for dashboard server might be missing. (ray-project#33521) (ray-project#34026)
Just trying to be helpful and give distracted people like me a potential reason why the dashboard is not available. Signed-off-by: elliottower <[email protected]>
Commit 976d114
[CI] Remove microbenchmark_staging (ray-project#34154)
It duplicates microbenchmark. Signed-off-by: elliottower <[email protected]>
Commit 91bdfb5
[RLlib][Docs] Added RLModule user-guide to the docs (ray-project#33909)
Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 2698ef8
[RLlib] DreamerV3: Catalog enhancements (MLP/CNN encoders/heads completed and unified across DL frameworks). (ray-project#33967)
Signed-off-by: elliottower <[email protected]>
Commit 279a7d6
[Doc] Add Ray core fault tolerance guide for GCS and node (ray-project#33446)
- Add fault tolerance guide for gcs and ray node
- Remove dead RAY_num_heartbeats_timeout
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: elliottower <[email protected]>
Commit 95e0cca
[RLlib] Change handling of try reset to support ASYNC_RESET_RETURN (ray-project#33874)
If a user is using remote base envs, then calling reset/try_reset on the env returns the constant "async_reset_return". Our error handler for resets in the env runner v2 didn't catch this because it assumes that returns from try reset are multi env dicts. Generally speaking, we don't have good test coverage on the remote base env, and frankly we don't plan to, as it isn't an API that we plan on supporting in future releases. In the meantime, however, we'll patch this bug because a user brought it up as an issue affecting them.
Signed-off-by: Avnish <[email protected]>
Signed-off-by: elliottower <[email protected]>
Commit ec3c102
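The ASYNC_RESET_RETURN commit above describes a reset handler that has to tell a sentinel string apart from a dict of per-env results. The following is a minimal sketch of that check, not RLlib's actual code: the function name `classify_reset_result` and the returned labels are hypothetical, and only the constant value `"async_reset_return"` comes from the commit message.

```python
# Sentinel value quoted in the commit message above.
ASYNC_RESET_RETURN = "async_reset_return"


def classify_reset_result(result):
    """Classify what a (hypothetical) try_reset call returned.

    Returns "pending" for the async sentinel, "error" if any per-env
    entry is an exception, and "ok" otherwise.
    """
    # Check for the sentinel first, so it is never treated as a dict.
    if isinstance(result, str) and result == ASYNC_RESET_RETURN:
        return "pending"
    if isinstance(result, dict):
        has_error = any(isinstance(v, Exception) for v in result.values())
        return "error" if has_error else "ok"
    raise TypeError(f"Unexpected reset result: {result!r}")
```

Checking the sentinel before the dict case is the point of the sketch: a handler that assumes a dict would otherwise fail on the string.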
[core] Improve the workflow finding Redis leader. (ray-project#34108)
The current way of finding the Redis leader sometimes gives erroneous information, mainly because the IP address is not resolved. If Ray connects to the master in the initialization stage, it uses the passed-in address as the leader, which might later make Ray pick a follower Redis.
This PR fixes the issue and picks the leader the same way redis-cli does. It also makes the checks stricter to give better error messages. The new protocol is as follows:
- Use boost to resolve the IP address from the domain name.
- Connect to the first IP address.
- If it's cluster mode:
  - Make sure it's healthy and has only 1 shard.
  - Send a dummy write and check the return.
  - If the return is OK, use the IP address directly.
  - Otherwise, use the one mentioned in the error message.
- If it's not cluster mode, just use the IP address.
This PR also includes refactoring: connection-related information moves to the Redis context.
Signed-off-by: elliottower <[email protected]>
Commit 8afd070
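The leader-selection protocol in the Redis commit above can be sketched as a small decision function. This is an illustration under stated assumptions, not the C++ code from the PR: `ProbeResult` and `pick_leader` are hypothetical names, the probe values would come from real Redis calls, and the sketch assumes the error message is a Redis Cluster redirect of the form `MOVED <slot> <host>:<port>`.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Hypothetical summary of what was learned from the first resolved address."""
    cluster_mode: bool
    healthy: bool = True
    shard_count: int = 1
    write_reply: str = "OK"  # reply to the dummy write


def pick_leader(first_address: str, probe: ProbeResult) -> str:
    """Return the address to use as the Redis leader."""
    if not probe.cluster_mode:
        # Not cluster mode: just use the resolved address.
        return first_address
    if not probe.healthy:
        raise RuntimeError("Redis cluster is not healthy")
    if probe.shard_count != 1:
        raise RuntimeError(f"Expected exactly 1 shard, found {probe.shard_count}")
    if probe.write_reply == "OK":
        # The dummy write succeeded, so this node accepts writes.
        return first_address
    # Assumption: the error is a redirect like "MOVED 3999 10.0.0.2:6379".
    parts = probe.write_reply.split()
    if len(parts) == 3 and parts[0] == "MOVED":
        return parts[2]
    raise RuntimeError(f"Unexpected reply to dummy write: {probe.write_reply}")
```

Raising on every state the protocol does not name mirrors the commit's stated aim of stricter checks with better error messages.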
[release] fix tune_scalability_network_overhead by adding `--smoke-test`. (ray-project#34167)
Signed-off-by: xwjiang2010 <[email protected]>
Signed-off-by: elliottower <[email protected]>
Commit 2188ad2
Revert "Global logging format changes" (ray-project#34126)
After manual bisection, I think this PR may be causing the "Documentation" tests to fail. The failure was previously masked by an actual failing doctest, but after this commit, actor outputs clutter the doctests and lead to mismatches in expected and actual output. Let's see if reverting fixes these problems. Reverts ray-project#32741 Signed-off-by: elliottower <[email protected]>
Commit c5bc0a5
[CoreWorker] Partially address Ray child process leaks by killing all child processes in the CoreWorker shutdown sequence. (ray-project#33976)
We kill all child processes when a Ray worker process exits. This addresses process leaks that caused GPU OOM errors in ray-project#31451. There is some risk to this PR, particularly if Ray users rely on Ray's existing behavior of leaking processes. We don't know of any such user, but we provide a new flag RAY_kill_child_processes_on_worker_exit to provide a workaround in case someone is impacted.
Signed-off-by: elliottower <[email protected]>
Commit 6b6a7a6
[data] Rename .cache() to .materialize() (ray-project#34169)
Based on discussion with @c21 @jjyao , as well as the new "MaterializedDatastream" class name, materialize makes more sense as an action than cache. Furthermore, we don't need an is_cached method as the new type information suffices. We will have to pick a variation of this PR into 2.4 as well, which introduces the original fully_executed() -> cache() rename. Signed-off-by: elliottower <[email protected]>
Commit ff0aa1c
[Dataset] Add `FromXXX` operators (ray-project#32959)
Signed-off-by: elliottower <[email protected]>
Commit c3d4358
[data] [streaming] Simplify progress bar reporting and integrate with Jupyter (ray-project#34150)
Signed-off-by: elliottower <[email protected]>
Commit 518e0cf
[Data] Fix '_unwrap_protocol' for Windows systems (ray-project#31296)
The `_unwrap_protocol` method uses the `urllib.parse.urlparse` library function to split out the path and protocol. On Windows however this function returns the path with a `/` added before the drive letter. This type of path can't be used by any other functions. The solution is to strip the `/`. The logic is similar to what is used in the `pip` package, see [here](https://github.com/pypa/pip/blob/22.3.1/src/pip/_internal/utils/urls.py#L49). Signed-off-by: Jeroen Bédorf <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 54403c6
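The `_unwrap_protocol` commit above turns on how `urllib.parse.urlparse` reports Windows paths. The sketch below reproduces the behavior and one way to strip the extra `/`. It is an illustration, not Ray's implementation: `strip_drive_slash` and `unwrap_protocol_sketch` are hypothetical names, and the regular expression is an assumption about what counts as a drive-letter path.

```python
import re
from urllib.parse import urlparse


def strip_drive_slash(path: str) -> str:
    """Drop the '/' that urlparse leaves in front of a Windows drive letter."""
    # Matches paths such as "/C:/data" or "/c:\\data".
    if re.match(r"^/[A-Za-z]:[/\\]", path):
        return path[1:]
    return path


def unwrap_protocol_sketch(uri: str) -> str:
    """Return the path portion of a URI without its protocol prefix."""
    parsed = urlparse(uri)
    # For "file:///C:/data/file.csv", parsed.path is "/C:/data/file.csv".
    return strip_drive_slash(parsed.netloc + parsed.path)
```

POSIX-style paths such as `/home/user/file.csv` pass through unchanged because they do not match the drive-letter pattern.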
[Java] Update coding style for RuntimeEnvTest.java (ray-project#34160)
Update coding style for RuntimeEnvTest.java Co-authored-by: XiaodongLv <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit aa8c5d9
[Java] make java worker log file prefix (default to java-worker) configurable (ray-project#33797)
For the Java worker, its log file is always prefixed with "java-worker", and the Python log_monitor.py hardcodes "java-worker*.log" as the pattern to poll periodically for new log messages. Some configs, like log_to_driver and RAY_BACKEND_LOG_LEVEL, don't prevent the log monitor from polling and publishing logs to GCS. To save some CPU cycles and network bandwidth, especially if the JVM produces a large amount of logs, we can have an option, like a JVM system property, to set the log file prefix for the Java worker instead of hard-coding it to "java-worker".
Signed-off-by: elliottower <[email protected]>
Commit 7210c3d
[core] Fix std::move without std namespace (ray-project#34149)
This is preventing builds on newer Macs. Signed-off-by: rickyyx <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 2b79733
[serve] Document multi-app support (ray-project#33496)
Documentation for Serve multi-application support.
- Separates the Serve REST API page into V1 (single-application) and V2 (multi-application) REST API
- Adds API ref pages for all config schemas
  - `ServeDeploySchema` - top level multi-application config
  - `HTTPOptionsSchema` - options to start the HTTP Proxy with
  - `ServeApplicationSchema` - single-application config
  - `DeploymentSchema` - deployment override options
  - `RayActorOptionsSchema` - options to start a replica actor with
- Adds API ref pages for all response schemas returned from GET endpoints
  - `ServeStatusSchema` - response format of old endpoint `GET /api/serve/deployments/status`
  - `ServeInstanceDetails` and all its sub-schemas - response format of new endpoint `GET /api/serve/applications/`
- Adds a user guide called "Deploying Multiple Serve Applications" under "User Guides" that covers using the serve CLI to interact with multiple applications.
Signed-off-by: elliottower <[email protected]>
Commit 07cb238
[RLlib] Add a flag to allow disabling initialize_loss_from_dummy_batch logit. (ray-project#34208)
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: elliottower <[email protected]>
Commit 746578e
[RLlib][RLModule] Abstract the build stage of RLModule to make them more extendable (ray-project#34205)
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: elliottower <[email protected]>
Commit 7d1420f
Update codeowners (ray-project#34214)
Signed-off-by: Shreyas Krishnaswamy <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 6892eb8
[core] Mark raylet unhealthy if GCS can't recognize it. (ray-project#34087)
When GCS can't recognize the raylet, the raylet just hangs and never exits. There is also no way to tell whether this raylet is healthy or not. This can happen with an incorrect setup, for example when data is lost in the DB. When the raylet detects the issue, it should either exit or mark itself as unhealthy. This PR makes the raylet mark itself unhealthy, and the upper layer can choose what to do in this case. This is useful for the Serve HA use case because, as long as the raylet is alive, the actors can still serve traffic and the upper layer can do more operations, like starting a new cluster and shutting down the current one later.
Signed-off-by: elliottower <[email protected]>
Commit b9df7dc
[Data] Improve state initialization for `ActorPoolMapOperator` (ray-project#34037)
ActorPoolMapOperator takes in a Callable class which initializes some state to be reused for every batch. In the current implementation, this state is initialized on the first batch, rather than during actor init. In this PR, we separate the state initialization and actually call it during Actor init. This allows state to be initialized for fixed size actor pools, even when tasks are not ready to be dispatched for better pipelining. It also supports using multithreaded actors, so state gets initialized once per actor instead of once per thread.
Signed-off-by: amogkam <[email protected]>
Signed-off-by: elliottower <[email protected]>
Commit b480937
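The `ActorPoolMapOperator` commit above contrasts initializing state on the first batch with initializing it during actor init. The plain-Python sketch below shows the difference in timing. It uses no Ray APIs, and every class name in it (`AddOffset`, `LazyWorker`, `EagerWorker`) is hypothetical.

```python
class AddOffset:
    """Example callable class whose constructor stands in for expensive setup."""
    instances = 0  # counts how many times the state was initialized

    def __init__(self):
        AddOffset.instances += 1
        self.offset = 10

    def __call__(self, batch):
        return [x + self.offset for x in batch]


class LazyWorker:
    """Old behavior in the sketch: state is built when the first batch arrives."""

    def __init__(self, fn_cls):
        self._fn_cls = fn_cls
        self._fn = None

    def process(self, batch):
        if self._fn is None:
            self._fn = self._fn_cls()
        return self._fn(batch)


class EagerWorker:
    """New behavior in the sketch: state is built once, at worker construction."""

    def __init__(self, fn_cls):
        self._fn = fn_cls()

    def process(self, batch):
        return self._fn(batch)
```

With the eager variant, the setup cost is paid before any batch is dispatched, which is the pipelining benefit the commit message describes.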
[ci] No early kickoff by default for workflow test (ray-project#34213)
Signed-off-by: rickyyx <[email protected]>
We have seen the workflow test fail on PRs with totally unrelated content because of the reuse of the cached Docker image.
- ray-project#34101
It seems the workflow test does depend on the wheels built.
Signed-off-by: elliottower <[email protected]>
Commit d49858e
don't trigger execution in ipython repr (ray-project#34219)
Signed-off-by: Scott Lee <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit a41264b
[doc] [data] Fix autosummary issues (ray-project#34220)
Signed-off-by: elliottower <[email protected]>
Commit 98632f1
Revert "[doc] [data] Fix autosummary issues (ray-project#34220)" (ray-project#34227)
This reverts commit 45b9067.
Signed-off-by: elliottower <[email protected]>
Commit d1ac0c9
[Serve][Release][Part1] Enable tests to GCE (ray-project#34163)
Add release tests to GCE.
Long running failure succeed: https://buildkite.com/ray-project/release-tests-pr/builds/34156#01875cb1-619c-4752-9710-26c5b95de5e6
serve_single_deployment_1k_noop_replica succeed: https://buildkite.com/ray-project/release-tests-pr/builds/34271#01876741-acaa-4d2a-8b50-64cadf8b67d4
serve_multi_deployment_1k_noop_replica succeed: https://buildkite.com/ray-project/release-tests-pr/builds/34273#018767c2-f14f-4da6-a5cd-ad04675caa99
serve_autoscaling_single_deployment succeed https://buildkite.com/ray-project/release-tests-pr/builds/34296#018768f2-9202-4590-8536-8454b60eaca5
serve_autoscaling_multi_deployment succeed https://buildkite.com/ray-project/release-tests-pr/builds/34298#0187692f-f93c-4708-9ac9-acd32c9633ab
Signed-off-by: elliottower <[email protected]>
Commit cac4fdc
[core] Fix the placement group stress test regression. (ray-project#34192)
Signed-off-by: Yi Cheng <[email protected]>
The regression is caused by enabling ray syncer. In the code, whenever a pg is created or deleted, raylet actively sends a message to GCS. This introduces a lot of workload on the GCS and makes the code run slowly. If ray syncer is disabled, raylet doesn't create this message or send it to GCS. There is no need to do this: when a new resource is added to the local node, ray syncer notices it and the resource is pushed to GCS after 100ms. This PR deletes this logic and thus fixes the regression.
```
before: placement group create/removal per second 1271.32 +- 8.27
after: placement group create/removal per second 1282.83 +- 3.99
```
For release test:
```
perf_metrics = [{'perf_metric_name': 'pgs_per_second', 'perf_metric_value': 17.061243668170643, 'perf_metric_type': 'THROUGHPUT'}, {'perf_metric_name': 'dashboard_p50_latency_ms', 'perf_metric_value': 3.261, 'perf_metric_type': 'LATENCY'}, {'perf_metric_name': 'dashboard_p95_latency_ms', 'perf_metric_value': 129.682, 'perf_metric_type': 'LATENCY'}, {'perf_metric_name': 'dashboard_p99_latency_ms', 'perf_metric_value': 141.648, 'perf_metric_type': 'LATENCY'}]
```
Signed-off-by: elliottower <[email protected]>
Commit 7ce3b0b
[Core] lazy import autoscaler + don't import opentelemetry unless setup hook (ray-project#33964)
This will improve startup time almost 2X and reduce memory usage by 2X. If it is combined with numpy lazy import, it will improve everything 3X.
Signed-off-by: elliottower <[email protected]>
Commit 2881f2f
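The lazy-import commit above defers loading heavy modules until they are needed. The sketch below illustrates that general technique with a small proxy object. It is not the code from the PR: `LazyModule` is a hypothetical helper, and the standard-library `json` module in the usage note stands in for a heavy dependency.

```python
import importlib


class LazyModule:
    """Proxy that imports the named module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    @property
    def loaded(self):
        """True once the real module has been imported through this proxy."""
        return self._module is not None

    def __getattr__(self, attr):
        # Only called for attributes not found on the proxy itself,
        # so _name, _module and loaded never reach this method.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)
```

For example, `json_mod = LazyModule("json")` costs nothing at creation time, and the import runs only when `json_mod.dumps(...)` is first called.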
[docs][KubeRay] Update KubeRay doc for release v0.5.0 (ray-project#34178)
index.md: No code here to test and verify.
getting-started.ipynb: Test manually.
user-guides.md: No code here to test and verify
k8s-cluster-setup.md: No code here to test and verify
config.md: No code here to test and verify
configuring-autoscaling.md: Test manually.
logging.md: Test manually.
gpu.rst: I did not verify code snippets, but GPU usage will be verified in gpu-training-example.md.
experimental.md: No code here to test and verify
static-ray-cluster-without-kuberay.md: Skip this. This document has no relationship with KubeRay.
examples.md
ml-example.md: (Will update in [docs][KubeRay] Provide some GKE instructions in KubeRay example ray-project#33339)
gpu-training-example.md (Will update in [docs][KubeRay] Provide some GKE instructions in KubeRay example ray-project#33339)
references.md
Ray Serve
kubernetes.md: Test manually.
fault-tolerance.md: I do not test all serve's recovery procedures. I make sure the RayService can be created as expected.
    helm repo add kuberay https://ray-project.github.io/kuberay-helm/
    helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0
    # path: doc/
    kubectl apply -f source/serve/doc_code/fault_tolerance/k8s_config.yaml
    # port forward
    kubectl port-forward service/rayservice-sample-serve-svc 8000
    # Test the serve deployment
    curl localhost:8000
    # Delete a worker Pod
    kubectl delete pod ${WORKER_POD}
    # Test the serve deployment again
    curl localhost:8000
run_gcs_ft_on_k8s.py
Signed-off-by: elliottower <[email protected]>
Commit 018abaa
[Actor] [Code Quality] Add Unit Tests for Actors Sorting (ray-project#34058)
Following with ray-project#33395 (comment), add a component test to improve the code quality
Signed-off-by: elliottower <[email protected]>
Commit 896df4d
[Dataset] Fix breaking Data CI tests (ray-project#34195)
- ray-project#32959 added a good number of tests without changing any timeouts. As a result, some of the tests time out occasionally, making the Data CI tests flaky. Therefore, we should increase the timeout for Bazel targets which recently received additional test cases.
- In addition, one of the failing tests, `test_from_huggingface_e2e`, was found to have a failure which was not caught in the original PR. `test_stats.test_dataset__repr__` is also flaky sometimes, so I add a fix for these tests.
- I also added a blank file, `python/ray/data/tests/block_batching/__init__.py`, which is needed to resolve a pytest error (non-unique test filename) for an existing test.
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: elliottower <[email protected]>
Commit 89e4a3b
[RLlib] Change broken link in parameter_noise.py (ray-project#34231)
Signed-off-by: Avnish <[email protected]> change broken open ai blog post link to a working one Signed-off-by: elliottower <[email protected]>
Commit 385f0ad
[Serve] [Docs] Clarify that the Serve config only supports remote URIs (ray-project#34212)
The Serve config only supports remote URIs within its `runtime_env` for safety purposes. However, this behavior is poorly documented and only guarded by a pydantic validator with an unclear error message. This change documents the remote URI requirements and clarifies the error message.
Behavior when you run the following config with an invalid `runtime_env`:
```yaml
import_path: fruit:graph
runtime_env: { "working_dir": "src" }
```
1. Without the change:
```console
% serve run config.yaml
...
pydantic.error_wrappers.ValidationError: 1 validation error for ServeApplicationSchema
runtime_env
  Invalid protocol for runtime_env URI src. Supported protocols: ['GCS', 'CONDA', 'PIP', 'HTTPS', 'S3', 'GS', 'FILE']. Original error: '' is not a valid Protocol (type=value_error)
```
2. With the change:
```console
% serve run config.yaml
...
pydantic.error_wrappers.ValidationError: 1 validation error for ServeApplicationSchema
runtime_env
  runtime_envs in the Serve config support only remote URIs in working_dir and py_modules. Got error when parsing URI: Invalid protocol for runtime_env URI "src". Supported protocols: ['GCS', 'CONDA', 'PIP', 'HTTPS', 'S3', 'GS', 'FILE']. Original error: '' is not a valid Protocol (type=value_error)
```
Signed-off-by: elliottower <[email protected]>
Commit 7231190
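The remote-URI requirement in the Serve config commit above can be illustrated with a small standalone check. This is a sketch, not the pydantic validator from the PR: `validate_remote_uri` is a hypothetical name, the protocol list is copied from the error message quoted in the commit, and the sketch looks only at the URI scheme.

```python
from urllib.parse import urlparse

# Protocol names as listed in the error message quoted in the commit.
SUPPORTED_PROTOCOLS = ["GCS", "CONDA", "PIP", "HTTPS", "S3", "GS", "FILE"]


def validate_remote_uri(uri: str) -> str:
    """Return the upper-cased protocol of a remote URI, or raise ValueError."""
    # A bare local path such as "src" has an empty scheme.
    protocol = urlparse(uri).scheme.upper()
    if protocol not in SUPPORTED_PROTOCOLS:
        raise ValueError(
            "runtime_envs in the Serve config support only remote URIs in "
            f'working_dir and py_modules. Invalid protocol for URI "{uri}". '
            f"Supported protocols: {SUPPORTED_PROTOCOLS}."
        )
    return protocol
```

Calling `validate_remote_uri("src")` raises, which matches the failing `working_dir: "src"` config in the commit, while `validate_remote_uri("s3://bucket/code.zip")` passes.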
[core][ci] Fix test_fault_tolerance_actor_tasks_failed for test_task_events_2.py (ray-project#34237)
Closes ray-project#34229 Or if we could merge ray-project#33818
Signed-off-by: elliottower <[email protected]>
Commit bdaf578
[core] prestart worker on node startup (ray-project#33623)
Always prestart num_cpu workers when raylet starts up. Previously we only started Python workers on driver registration, or when another worker submitted a new task to this raylet. This caused the cold start issues described in ray-project#26262.
As part of this change, also did some needed cleanup to simplify the code and make this work:
- Removed start_initial_python_workers_for_first_job from ray.init(...). It was causing prestart to not work, since start_initial_python_workers_for_first_job defaults to false and is defaulted to true by ray client if there is no runtime env.
  - There is no behavior change in this PR to how worker prestart interacts with runtime env.
  - This doesn't seem to be a big enough change to warrant API review: if a customer is setting this in ray client, they should remove it.
  - If someone wants to turn off worker prestart, they can do so by setting RAY_enable_worker_prestart to false.
Benchmark: on a prestarted raylet, measure the time to start the driver and start num_cpu tasks.
Master: 2.09 sec
PR: 1.18 sec (56% of original startup time)
We don't measure other cases due to restrictions with today's worker re-use due to the worker cache key, something that needs to be addressed in a follow-up.
Signed-off-by: elliottower <[email protected]>
Commit c8b0c7a
-
[RLlib] Fixed a bug with kl divergence calculation of torch.Dirichlet distribution within RLlib (ray-project#34209)
Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 13c9059
-
[Core] Fix ray start command output (ray-project#34081)
With ray-project#32409, we stopped printing out information like dashboard url when creating a single node ray cluster on OSX and windows. This is a regression and this PR reverts back to the old behavior. Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 9573ee2
-
[RLlib] Remove infos dict before Json_writer writes sample batches (ray-project#33896)
Signed-off-by: Avnish <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 86ae35c
-
[RLlib] Add examples and docs for Catalog. (ray-project#33898)
Signed-off-by: elliottower <[email protected]>
Commit e60cdee
-
[core] Task backend - Add worker died info to failed tasks when job exits. (ray-project#34166)
This adds the additional error_type + error_message info to non-terminal tasks (not finished and not failed) when a job exits. Signed-off-by: elliottower <[email protected]>
Commit cc69ce3
-
[Data] Update path expansion warning (ray-project#34221)
The warning for path expansion during metadata fetching is inaccurate with recent changes. This PR updates the warning. --------- Signed-off-by: amogkam <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 9c3adec
-
[docs][KubeRay] Provide some GKE instructions in KubeRay example (ray-project#33339)
ml-example.md: I used a GKE cluster without Autopilot in this example. Because of some dependency issues on my Mac M1 at the moment, I slightly modified the reproduction instructions: instead of running the job locally, I used kubectl exec to log in to the head Pod and submit the XGBoost job. This change should not have any impact on the document.
gpu-training-example.md:
- Protobuf issue (ray-ml:2.3.0-gpu): ray-ml docker images - TypeError: Descriptors cannot not be created directly ray-project#31309 (comment) => Choose a Ray 2.2 image.
- TorchVisionPreprocessor: Ray 2.2 does not support TorchVisionPreprocessor, so I used pytorch_training_e2e.py from the ray-2.2.0 branch.
Signed-off-by: elliottower <[email protected]>
Commit 6f74e42
-
[data] Add take_batch API for collecting data in the same format as iter_batches and map_batches (ray-project#34217)
There isn't any convenient way to take just a single batch today, which is confusing. Introduce ds.take_batch(n, batch_format="default"), which returns a batch of n records as next(ds.iter_batches(batch_size=n, batch_format="default")) would. Signed-off-by: elliottower <[email protected]>
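The equivalence stated in this message can be illustrated without Ray. The sketch below uses plain Python stand-ins: `iter_batches` and `take_batch` here are hypothetical helper functions over a list of row dicts, not the Ray Data methods, and raising `ValueError` on empty input is a choice made for this example.

```python
from typing import Dict, Iterator, List


def iter_batches(rows: List[dict], batch_size: int) -> Iterator[Dict[str, list]]:
    """Yield column-oriented batches of up to batch_size rows (stand-in)."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {key: [row[key] for row in chunk] for key in chunk[0]}


def take_batch(rows: List[dict], n: int) -> Dict[str, list]:
    """Return the first batch of up to n rows, like next(iter_batches(...))."""
    try:
        return next(iter_batches(rows, batch_size=n))
    except StopIteration:
        # Choice made for this sketch: an empty input is reported as an error.
        raise ValueError("no rows to take a batch from") from None
```

With `rows = [{"id": i} for i in range(5)]`, `take_batch(rows, 3)` and `next(iter_batches(rows, 3))` return the same batch.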
Commit 47add11
-
[serve] Log to file on LongPollClient update (ray-project#34204)
Signed-off-by: Edward Oakes <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit f32b525
-
[try 2] [doc] [data] Fix autosummary issues (ray-project#34228)
Signed-off-by: elliottower <[email protected]>
Commit 825f5c2
-
[RLlib] Change occurrences of `"_observation_space_in_preferred_format"` to `"_obs_space_in_preferred_format"` (ray-project#33907)
Signed-off-by: Avnish <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit f877ee3
-
[Core] Introduce spill_on_unavailable option for soft NodeAffinitySchedulingStrategy (ray-project#34224)
Introduce a private _spill_on_unavailable semantic for soft NodeAffinitySchedulingStrategy. Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 73ce168
-
[Data] Support using concurrent actors for `ActorPool` (ray-project#34253)
Support using concurrent actors for ActorPool. We do this by gating the user UDF in a separate threadpool of max size 1. Signed-off-by: amogkam <[email protected]> Signed-off-by: elliottower <[email protected]>
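The gating idea mentioned in this message can be sketched with the standard library. This is an illustration under assumptions, not the Ray Data implementation: `SerializedUDF` and `run_demo` are hypothetical names, and the demo only shows that a pool with `max_workers=1` keeps the wrapped function from running concurrently even when many threads call it.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SerializedUDF:
    """Run a user function on a single dedicated worker thread (sketch)."""

    def __init__(self, fn):
        self._fn = fn
        # With max_workers=1, submitted calls execute one at a time.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def __call__(self, *args, **kwargs):
        return self._pool.submit(self._fn, *args, **kwargs).result()

    def shutdown(self):
        self._pool.shutdown(wait=True)


def run_demo(num_callers: int = 8) -> int:
    """Call the wrapped function from many threads; return peak concurrency."""
    state = {"active": 0, "peak": 0}
    lock = threading.Lock()

    def udf(x):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)  # simulate work so overlapping calls would be visible
        with lock:
            state["active"] -= 1
        return x * 2

    wrapped = SerializedUDF(udf)
    callers = [threading.Thread(target=wrapped, args=(i,)) for i in range(num_callers)]
    for thread in callers:
        thread.start()
    for thread in callers:
        thread.join()
    wrapped.shutdown()
    return state["peak"]
```

`run_demo()` reports the highest number of simultaneous executions of the wrapped function that it observed.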
Commit 89f4193
-
[Part 2/n] Rename Dataset => Datastream (DataContext, DataIterator, GroupedDatastream) (ray-project#34186)
Signed-off-by: elliottower <[email protected]>
Commit 8d5263c
-
[ci/release] Migrate GBDT tests (xgboost/lightgbm) to GCE (ray-project#34264)
Continuing the effort to migrate tests to GCE, this introduces variations for xgboost_ and lightgbm_ tests. Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 31ef76b
-
[Serve][Release][Part2] Add release tests to GCE (ray-project#34245)
Makes some Serve release tests run on GCE.
- serve_serve_micro_benchmark: succeeded, https://buildkite.com/ray-project/release-tests-pr/builds/34505#01876e9b-b99c-40de-af94-6c7044b401dc
- deployment_graph_long_chain: succeeded, https://buildkite.com/ray-project/release-tests-pr/builds/34507#01876eb9-8727-4a59-b193-ce2ff5e9647e
- deployment_graph_wide_ensemble: succeeded, https://buildkite.com/ray-project/release-tests-pr/builds/34510#01876eca-27ce-4234-b668-eea78767910d
- serve_handle_long_chain: succeeded, https://buildkite.com/ray-project/release-tests-pr/builds/34553#01877135-9e44-4653-ae53-be785ca5574c
- serve_handle_wide_ensemble: succeeded, https://buildkite.com/ray-project/release-tests-pr/builds/34566#0187714b-45ff-4fd9-8a22-8c8bb45b0748
- serve_micro_protocol_grpc_benchmark: succeeded, https://buildkite.com/ray-project/release-tests-pr/builds/34569#0187715b-d78b-4b3a-a0be-4cb556482729
- serve_micro_protocol_http_benchmark: succeeded, https://buildkite.com/ray-project/release-tests-pr/builds/34570#0187716b-e710-4688-b31d-c8cc3f67ab4f
- serve_resnet_benchmark: succeeded, https://buildkite.com/ray-project/release-tests-pr/builds/34604#018771c7-9a7f-42f1-a6ce-766e32e48fae
Signed-off-by: elliottower <[email protected]>
Commit 5b50270
-
[data] Make sure the tf and tensor iteration work in dataset pipeline (ray-project#34248)
* Revert "[Datasets] Revert "Enable streaming executor by default (ray-project#32493)" (ray-project#33485)". This reverts commit 5c79954.
* make sure tf and tensor iteration in datapipeline work
* Fix
* fix
* fix
* fix
* feedback
* feedback
* fix
Signed-off-by: elliottower <[email protected]>
Commit 46012cc
-
[Jobs] Fix race condition in supervisor actor creation and add timeout for pending jobs (ray-project#34223)
@rkooo567 and @sihanwang41 found a race condition when submitting a job that causes the job to fail. The failure happens with this sequence of events:
1. A job is submitted. Its job_info is put to the internal KV. This happens before the JobSupervisor is actually created.
2. In the constructor of JobManager, we call await self._recover_running_jobs(), which finds the job_info in the internal KV and starts to monitor that job.
3. Because the JobSupervisor actor doesn't exist yet, the JobManager job monitoring loop fails to ping it, and puts the status of this job as FAILED in the internal KV.
4. The JobSupervisor is created. JobSupervisor.run() checks that the status is PENDING, but it's not, so it raises the error "run should only be called once", which is not helpful to the user.
If step 2 happens before step 1, there's no issue. But these are both async, so the order isn't guaranteed.
The solution in this PR is to let the JobManager monitoring loop handle the PENDING case by skipping the ping to the JobSupervisor actor for that iteration of the loop.
This PR adds a unit test that fails with ray-project#34190 (which forces the race condition). This PR also adds a timeout to fail jobs that have been pending for 15 minutes, configurable via environment variable.
Some questions are still open:
- Why did this only start to fail recently? The only recent change is "[Jobs] Fix race condition on submitting multiple jobs with the same id" (ray-project#33259), but it's not clear how this would matter in the case of a single job.
- What is a reasonable default timeout for pending jobs, and should we even have one? It should be larger than the existing runtime_env setup timeout (10 minutes) in order to distinguish runtime env setup timeouts from other timeouts. Not sure if there are other existing timeouts that we should consider.
Signed-off-by: elliottower <[email protected]>
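The fix described in this message (skip the ping while a job is PENDING, and fail jobs that stay pending too long) can be sketched as a single loop iteration. This is a simplified illustration, not the JobManager code: `monitor_step`, `JobStatus`, and `PENDING_TIMEOUT_S` are hypothetical names, and the supervisor ping is passed in as a plain callable.

```python
import time
from enum import Enum
from typing import Callable, Optional


class JobStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    FAILED = "FAILED"


# 15 minutes, matching the timeout described above. The constant name is
# hypothetical; the message only says it is configurable via an env variable.
PENDING_TIMEOUT_S = 15 * 60


def monitor_step(
    status: JobStatus,
    submitted_at: float,
    ping_supervisor: Callable[[], None],
    now: Optional[float] = None,
) -> JobStatus:
    """Run one iteration of a monitoring loop and return the status to store."""
    now = time.time() if now is None else now
    if status is JobStatus.PENDING:
        # The supervisor may not exist yet, so skip the ping for this iteration.
        if now - submitted_at > PENDING_TIMEOUT_S:
            return JobStatus.FAILED
        return JobStatus.PENDING
    if status is JobStatus.RUNNING:
        try:
            ping_supervisor()
        except Exception:
            return JobStatus.FAILED
    return status
```

In this sketch a pending job is never marked FAILED just because the supervisor cannot be reached; only the timeout can fail it.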
Commit bf671ef
-
[RLlib] Actually save the optimizer state for tf learners (ray-project#34252)
It turns out you can get the actual optimizer state by calling optimizer.variables for tf keras. This PR enables us to save the full optimizer state and restore it. To do this, I added a new file called optimizer_name_state.txt to the checkpoint. It holds a bytestring serialized representation of the optimizer's state. It looks like the optimizer's variable state doesn't include things like the learning rate, so I still need to save those as a separate file and reconstruct the optimizer first before loading the state. Signed-off-by: Avnish <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit bbd512d
-
[RLlib] Change broken doc name: MultiAgentRLModule.build->MultiAgentRLModule.setup (ray-project#34291)
The fix is in the title. We had an autogenerated doc that was broken because the name of a function changed. Signed-off-by: Avnish <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 0b087b7
-
Add Cython wrapper for GcsClient (ray-project#33769)
This is with the eventual goal of removing Python gRPC calls from Ray Core / Python workers. As a first cut, I'm removing the Python GcsClient. This PR introduces a Cython GcsClient that wraps a simple C++ synchronous GCS client. As a result, the code for the GcsClient moves from `ray._private.gcs_utils` to `ray._raylet`. The existing Python level reconnection logic `_auto_reconnect` is reused almost without changes. This new Cython client can support the full use cases of the old pure Python `GcsClient` and is (almost) a drop in replacement. To make sure this is indeed the case, this PR also switches over all the uses of the old client and removes the old code. We also introduce a new exception type `ray.exceptions.RpcError` which is a replacement of `grpc.RpcError` and allows the Python level code that does exception handling to keep working. Signed-off-by: elliottower <[email protected]>
Commit 3c9641e
-
[Doc] Rewrite the placement group documentation (ray-project#33518)
This PR rewrites the existing placement group documentation, which was confusing (sorry, I wrote the original version). The new doc progresses from the simplest example to the advanced concepts, and all the concepts are explained more thoroughly with examples. Signed-off-by: elliottower <[email protected]>
Commit 4a0100c
-
[Serve][Doc] Update metrics & log doc (ray-project#34222)
Update the logging & metrics docs for the 2.4 change. Co-authored-by: angelinalg <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 3ea1655
-
[core] fix windows node manager test (ray-project#34304)
One of the tests uses a command that doesn't work on Windows; disable it for now. Signed-off-by: Clarence Ng <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 67019be
-
[Doc] Make front page images non clickable (ray-project#32738)
A Sphinx issue automatically makes images clickable whenever they're scaled (see https://stackoverflow.com/questions/40096251/disable-click-behavior-for-images). Clicking takes you to a full size version of the image. On the front page of the docs, there are four prominent images that look like buttons. The user would expect clicking them to take you to a docs page, but instead it just takes you to the image. (See the linked issue for details) Since there's no single docs page corresponding to each of these four images, in this PR we opt to make these images non clickable. Signed-off-by: elliottower <[email protected]>
Commit 9feabc4
-
[ci/mac] Fix arm64 wheels builds (ray-project#34268)
The conda setup in test_wheels seems to fail from leftover state from previous python installs. This PR updates the test wheels script to create a new conda environment with the respective Python version which should not interfere with previous virtual envs. Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 9d312c6
-
[core][state] Add head node flag `is_head_node` to state API and GcsNodeInfo (ray-project#34299)
There have been requests for checking which/if a node is the head node. Signed-off-by: elliottower <[email protected]>
Commit 033a3af
-
[Release] Fix dask dependencies (ray-project#34261)
Some, if not all, of the dask release tests are failing because of dependency hell. In short, boto and s3fs do not work well with the boto3 that is installed in the Anyscale dataplane. The good news is that these tests do not need these dependencies anyway (since Anyscale already installs them properly). Closes ray-project#19399 Signed-off-by: elliottower <[email protected]>
Commit de72c25
-
[CI][Clean][03] Break run_release_test into smaller functions (ray-project#33951)
A pure refactoring diff. Break run_release_test in glue.py into smaller functions so they are easier to read, test, and change. It also makes future changes easier. Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 0e40f9c
-
[CI][Clean][4] Add exception-free functions (ray-project#34099)
Add exception-free APIs for some classes. This gives clients the option to use them without repeating exception handling, which makes the client code a bit easier to read. Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 6a70c86
-
[serve] Remove pointless `asyncio.Lock` (ray-project#34314)
This is a relic of a forgotten era. None of the calls it is "guarding" `await`, so it is currently a no-op. Signed-off-by: elliottower <[email protected]>
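The reasoning in this message rests on a general property of asyncio: tasks on one event loop only switch at an `await`. The sketch below is a generic illustration of that property, not the Serve code; `bump`, `main`, and `run_demo` are hypothetical names.

```python
import asyncio


async def bump(counter: dict, times: int) -> None:
    """Increment a shared counter without a lock and without awaiting."""
    for _ in range(times):
        # There is no await between the read and the write, so the event loop
        # cannot switch to another task in the middle of this update.
        current = counter["value"]
        counter["value"] = current + 1


async def main(num_tasks: int, times: int) -> int:
    counter = {"value": 0}
    await asyncio.gather(*(bump(counter, times) for _ in range(num_tasks)))
    return counter["value"]


def run_demo(num_tasks: int = 10, times: int = 1000) -> int:
    """Run the tasks on a fresh event loop and return the final count."""
    return asyncio.run(main(num_tasks, times))
```

Because no update is ever interrupted, the final count equals `num_tasks * times` without any lock. A lock only matters once the guarded section contains an `await`.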
Commit 04fbf96
-
[Doc] Fix typo in Tune restore guide (ray-project#34247)
Signed-off-by: Justin Yu <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit a8a1927
-
[data] Fix pyarrow numpy element issue (ray-project#34215)
Signed-off-by: elliottower <[email protected]>
Commit 3824410
-
[Doc] update workspace templates (ray-project#34289)
Signed-off-by: Sofian Hnaide <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit b2cfa6d
-
[docs] fix build (ray-project#34265)
* [docs] fix build
* fix doctests
* last test
* lint
* Update doc/source/rllib/package_ref/rl_modules.rst
* fixes
* revert diff
* whitespace
Signed-off-by: Max Pumperla <[email protected]> Signed-off-by: Philipp Moritz <[email protected]> Co-authored-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: Philipp Moritz <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 3b66972
-
[AIR][Doc] LightningTrainer Advanced Example (ray-project#34082)
Signed-off-by: elliottower <[email protected]>
Commit 05eff03
-
[RLlib] External env is not compatible with the connectors API. (ray-project#33945)
Signed-off-by: Avnish <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 36e23fd
-
[Data] Cosmetic changes to Arrow Tensor __repr__ (ray-project#34286)
Make it clear what the data type actually is -- a numpy array. Also make the argument ordering consistent between the two types. Signed-off-by: elliottower <[email protected]>
Commit ea0764a
-
[core][state] Fix list nodes test in test_state_api.py (ray-project#34349)
Signed-off-by: elliottower <[email protected]>
Commit a236ec9
-
[Dashboard][Bug fix] When using an nginx proxy, the front-end may misspell the URL when accessing the log. (ray-project#34130)
In our use case, we need to access the dashboard in the online cluster through an nginx proxy from the intranet. We found that when accessing the log page under this scenario, the front-end would misspell the URL, resulting in a failure to load. Closes ray-project#34043 Signed-off-by: elliottower <[email protected]>
Commit 3bd41bf
-
[Metrics] Fix shared memory is not displayed properly (ray-project#34301)
Looks like we incorrectly recorded shared memory and incorrectly displayed it in the metrics graph (I forgot to append ray_). Signed-off-by: elliottower <[email protected]>
Commit ef51e9f
-
[Tune] Add support for nested hyperparams in PB2 (ray-project#31502)
This PR enables passing nested hyperparameters to the PB2 scheduler. It also makes a few minor improvements to PB2 (happy to separate out these changes if needed):
1. Hyperparameter initialization (if missing from param space) should be sampled uniformly between bounds. Currently, PB2 falls back to PBT for sampling initial hyperparameters, which just chooses between the low and high values.
2. Allow `custom_explore_fn` to be passed into PB2 to match PBT functionality. This solves a user request here: https://discuss.ray.io/t/pb2-hyper-parameters-as-integers/8822.
Signed-off-by: Justin Yu <[email protected]> Signed-off-by: elliottower <[email protected]>
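As an illustration of the first improvement (uniform sampling between bounds for hyperparameters missing from the param space, including nested ones), here is a minimal sketch. It is not the PB2 implementation: `sample_missing` is a hypothetical helper, and the bounds are assumed to be nested dicts whose leaves are `[low, high]` pairs.

```python
import random
from typing import Any, Dict


def sample_missing(
    bounds: Dict[str, Any], config: Dict[str, Any], rng: random.Random
) -> Dict[str, Any]:
    """Return a copy of config with missing hyperparameters sampled in bounds."""
    filled = dict(config)  # keep keys that have no bounds untouched
    for key, bound in bounds.items():
        if isinstance(bound, dict):
            # Nested section: recurse into the matching part of the config.
            filled[key] = sample_missing(bound, config.get(key, {}), rng)
        elif key not in config:
            low, high = bound
            # Uniform over the whole interval, not just a pick of low or high.
            filled[key] = rng.uniform(low, high)
    return filled
```

For example, with bounds `{"lr": [1e-4, 1e-2], "optimizer": {"momentum": [0.8, 0.99]}}` and a config that only sets `lr`, the helper keeps `lr` as given and draws `optimizer.momentum` from its interval.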
Commit 22aa4b9
-
[RLlib] DreamerV3: Add Conv2d-transpose support to new model Catalog. (ray-project#33969)
Signed-off-by: elliottower <[email protected]>
Commit 0437cb1
-
[serve] Revert `info` log line in `LongPollClient` (ray-project#34313)
This is getting spammed to the driver console because it also has a `LongPollClient` :( Need to add a way to filter these messages before adding it back. Signed-off-by: elliottower <[email protected]>
Commit 278d89a
-
[Release Test] Add GCE variation for core release tests [2/n] (ray-project#34337)
- single_node_oom
- benchmark_worker_startup
- Removed worker node types with max_worker = 0
Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 120c34b
-
[Serve] Remove smoke test from gce (ray-project#34319)
We don't have smoke tests for these release tests. Signed-off-by: elliottower <[email protected]>
Commit 89b9cff
-
[build_base] Use bazelisk for better bazel version management. (ray-project#34246)
Upgrading bazel requires a lot of file changes:
- update setup.py
- update windows bazel fix
- update workspace
This PR makes Ray use bazelisk instead of bazel to make version management easier. Signed-off-by: elliottower <[email protected]>
Commit 51bcdb9
-
[build] Fix build on latest clang (ray-project#34151)
The latest clang treats some warnings as errors by default. This PR tries to fix that. More detail in https://discourse.llvm.org/t/configure-script-breakage-with-the-new-werror-implicit-function-declaration/65213/1 Signed-off-by: elliottower <[email protected]>
Commit 0222468
-
[Data] Remove unnecessary setting of global logging level to INFO when using Ray Data (ray-project#34347)
When initializing Ray Data, the global logging level is set to `INFO`, which causes non-Ray `INFO` logs to be unintentionally emitted (the default level in the `logging` library is `WARNING`, which would normally ignore `INFO`-level logs). We remove an unnecessary setting of the logging level in `DatasetLogger` which resolves this issue. Signed-off-by: Scott Lee <[email protected]> Signed-off-by: elliottower <[email protected]>
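The mechanism behind this issue is level inheritance in the standard `logging` module: a logger without its own level uses the level of its nearest configured ancestor. The sketch below demonstrates this with a private parent logger standing in for the global (root) logger; the logger names and the helper `info_enabled_for_child` are made up for this example.

```python
import logging


def info_enabled_for_child(parent_level: int) -> bool:
    """Set a level on a parent logger and report whether its child emits INFO."""
    # "demo_parent" stands in for the global logger in this illustration.
    parent = logging.getLogger("demo_parent")
    child = logging.getLogger("demo_parent.user_code")
    parent.setLevel(parent_level)
    # The child has no level of its own (NOTSET), so it inherits the parent's.
    return child.isEnabledFor(logging.INFO)
```

With the parent at `logging.WARNING`, the child ignores `INFO` records. Raising the parent to `logging.INFO` makes unrelated child loggers start emitting them, which is the side effect this commit removes.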
Commit 75e17e4
-
[RLlib] Make the KL coefficient traced in appo tf (ray-project#34293)
Signed-off-by: Avnish <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit dba4144
-
[air] Move to new storage_path API in tests and examples (ray-project#34263)
Following ray-project#33463, this PR updates our tests, examples, and docs to use the new `storage_path` API. The only locations where we continue to use the `local_dir` statement are tests where we specify both a local dir and a remote dir. For these tests, we can move to an environment-variable based wrapper in the future. Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 074e976
-
[AIR] Experiment restore stress tests (ray-project#33706)
Signed-off-by: Justin Yu <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 32b4b92
-
[RLlib] Fix two RL docs examples (ray-project#34353)
Signed-off-by: Artur Niederfahrenhorst <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 9d2e693
-
[air] Deflake test_e2e_train_flow.py (ray-project#34308)
The test_e2e_train_flow test has been flaky. After some investigation this seems to be due to a race condition: The mock train flow would continue from the latest "checkpoint", but an actor restart could resolve before the next iteration finished. This triggers a new continuation, which increases the training iteration, leading to a mismatch. The fix in this mock flow is to only unset the "restore" instruction after the next round of training results came in. Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit bff2b94
-
[Datasets] Use read stage name for naming Data-read tasks on Ray Dashboard (ray-project#34341)
This PR updates the naming so that we use the underlying read stage name, if available from the input `LazyBlockList`, as the name of the resulting `MapOperator`; otherwise, we fall back to the existing `DoRead` name. Signed-off-by: Scott Lee <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 99fd534
-
[train] Fix rendering of diff code-blocks (ray-project#34355)
Signed-off-by: Matthew Deng <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 0ef6596
-
[RLlib] Check that results has learner info appo test (ray-project#34381)
The APPO KL coefficient learner test is flaky because we run training until there are some results. What can end up happening is that training runs for so long that eval results are available but learner results are not. This PR fixes this by training until learner results are available, not just evaluation results. Signed-off-by: Avnish <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit eb9fbf1
-
pull out shared deploy code into deploy utils (ray-project#34321)
Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 5c71a2c
-
[serve] Fix get endpoint when autoscaling config is set (ray-project#34377)
If autoscaling config is set for a deployment, we can't set the num replicas when returning the deployment details of that deployment. Otherwise, it breaks the entirety of the get metadata endpoint. Signed-off-by: elliottower <[email protected]>
Commit 5c28386
-
add main for obod test (ray-project#34311)
Signed-off-by: Catch-Bull <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 95bf952
-
[tune] fix a typo in `tune/execution/checkpoint_manager` state serialization. (ray-project#34368)
Signed-off-by: xwjiang2010 <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 453cced
-
[air] DreamBooth example: Fix code for batch size > 1 (ray-project#34398)
The DreamBooth finetuning example currently throws an error when batch size > 1, even when the GPU memory is large enough. This is because the training batches are currently not created correctly. This PR fixes the batch format and includes in-line comments to explain the new behavior. Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit d5a46e3
-
[Data] combine_chunks before chunking pyarrow.Table block into batches (ray-project#34352)
pyarrow.Table.slice is slow when the table has many chunks, which makes batching a pyarrow block slow. The fix is to combine the chunks into a single one, which makes slice faster at the cost of an extra copy. Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 670686a
-
[data] [streaming] [part 3/n] Rename Dataset => Datastream in internal files (ray-project#34340)
Signed-off-by: elliottower <[email protected]>
Commit 6ea21bd
-
[Dataset] Validate sort key in `Sort` LogicalOperator (ray-project#34282)
As a followup of ray-project#32133, we should validate key with block.py:_validate_key_fn(), in generate_sort_fn() before doing sort. Signed-off-by: Scott Lee <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit c860d17
-
[data] Add usage tag for which block formats are used (ray-project#34384)
Signed-off-by: elliottower <[email protected]>
Commit 6294a84
-
[Dataset] Reset row count when filtering on Dataset reading from Parquet (ray-project#34372)
Previously, if we filtered a Dataset that was read from a Parquet datasource, the row count of the resulting Dataset was the same as that of the unfiltered Dataset (see ray-project#33766 and the modified test for an example). This PR fixes the bug and gets the correct row count after applying the filter. Signed-off-by: elliottower <[email protected]>
Commit 282424f
-
Remove python 3.6 support [1/n] (ray-project#34373)
Python 3.6 support will be removed in Ray 2.5 Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit e7dcdc4
-
[RLlib] Add 2D box example for PPO RL Modules (ray-project#33840)
Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 451150d
-
Revert "[Metrics] Fix shared memory is not displayed properly (ray-project#34301)" (ray-project#34407)
This reverts commit 688ddf6. Signed-off-by: elliottower <[email protected]>
Commit c9ccbec
-
Add GCE variation for core release tests [3/n] (ray-project#34425)
- microbenchmark_38
- shuffle_20gb_with_state_api
- object_store
- many_actors
- many_tasks
- many_pgs
- chaos_many_tasks_no_object_store
- chaos_many_actors
Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit f0ee586
-
[train] rename _base_dataset to _base_datastream (ray-project#34423)
Signed-off-by: Matthew Deng <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 420c60c
-
[CI][Bisect][1] Skeleton for automated bisect of release tests (ray-project#34329)
A script to bisect release test failures. This PR only contains a skeleton and unit tests. Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit a45407f
-
[RLlib] DreamerV3: Catalog enhancements 04 - LSTM default models. (ray-project#34272)
Signed-off-by: elliottower <[email protected]>
Commit d4c9bc4
-
[Dataset] Validate aggregation key in `Aggregate` LogicalOperator (ray-project#34292)
As a followup of ray-project#32462, we should validate aggregate functions with `AggregateFn._validate`, in `generate_aggregate_fn()` before doing aggregate. Signed-off-by: Scott Lee <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 2ef7ec3
-
[requirements] Add PyArrow to ray[tune] dependencies (ray-project#34397)
Ray Tune depends on PyArrow for filesyncing. However, `ray[tune]` currently does not include pyarrow as a dependency, which means version constraints are not enforced and syncing is not guaranteed to work out of the box. This surfaced as a problem when a user used poetry with `ray[tune]` as a constraint, but an incompatible version of pyarrow was installed. In this case, syncing to cloud storage was broken. Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 64a5c78
-
[air] pin deepspeed version for now to unblock ci. (ray-project#34406)
Deepspeed had a new release yesterday that broke our CI. Signed-off-by: xwjiang2010 <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 5796620
-
Closing issue (ray-project#31926) about unknown windows crash when too many arguments given in the config file (ray-project#32206)
I encountered a crash on Windows. It is caused by the path being too long for Windows. To make users aware of this issue, I added a check that detects when the path is too long and logs a warning message. Signed-off-by: sahar <[email protected]> Signed-off-by: Sahar <[email protected]> Signed-off-by: Kai Fricke <[email protected]> Co-authored-by: sahar <[email protected]> Co-authored-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
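A minimal sketch of this kind of check, not the code from the PR. It assumes the classic Windows `MAX_PATH` limit of 260 characters; the function name `warn_if_path_too_long` and the `is_windows` parameter are hypothetical and exist only so the check can be exercised on any platform.

```python
import logging
import os

logger = logging.getLogger(__name__)

# Assumption: the classic Windows MAX_PATH limit (260 characters, which
# includes the terminating NUL, so usable paths are shorter than this).
WINDOWS_MAX_PATH = 260


def warn_if_path_too_long(path: str, is_windows: bool = (os.name == "nt")) -> bool:
    """Log a warning and return True if `path` may exceed the Windows limit."""
    too_long = is_windows and len(path) >= WINDOWS_MAX_PATH
    if too_long:
        logger.warning(
            "Path is %d characters long, which may exceed the Windows limit "
            "of %d: %s",
            len(path),
            WINDOWS_MAX_PATH,
            path,
        )
    return too_long
```

Warning instead of raising keeps existing setups working (for example, systems where long-path support has been enabled) while still pointing users at the likely cause of a crash.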
Commit e566859
-
Serve Dashboard features polish (ray-project#34391)
- Filter out serve system endpoints from grafana dashboards
- Make it more clear when a log file is empty
Signed-off-by: elliottower <[email protected]>
Commit 8988ded
-
[core][state] Efficient get/list actors with filters on some high-cardinality fields ray-project#34348
Signed-off-by: rickyyx <[email protected]> This improves the state API for listing/getting actors: if filtering by id/state/job, filtering is pushed down to the source (GCS). Other state API resources will be implemented in a similar way (e.g. tasks/workers). Signed-off-by: elliottower <[email protected]>
Commit 9552120
-
[CI][Bisect][2] Actually bisect test failures on buildkite (ray-project#34331)
Implement the actual functions to run test and bisect on buildkite. This first implementation is pretty naive in several ways:
- It uses a main bisect orchestration step that waits for test steps. We can make it more efficient here by sub-bisect orchestration step
- It only runs one test at a time, which is less effective when the range gets small
Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit e47ba3d
-
[Doc] Fix linter (ray-project#34474)
Signed-off-by: Artur Niederfahrenhorst <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 9c6fc49
-
[RLlib] Try 8gpus_96cpus_gce with n1 and t4 nodes (ray-project#34459)
Signed-off-by: Artur Niederfahrenhorst <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit edcc30a
-
[RLlib] fix cartpole lstm string (ray-project#34458)
Signed-off-by: Artur Niederfahrenhorst <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 12122d5
-
[Release Test] Add GCE variation for core release tests [4/n] (ray-project#34442)
- dask_on_ray_100gb_sort
- stress_test_state_api_scale
- stress_test_many_tasks
- stress_test_dead_actors
- threaded_actors_stress_test
- many_nodes_actor_test_on_v2
- placement_group_performance_test
- scheduling_test_many_0s_tasks_many_nodes
- agent_stress_test
Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 9e1c154
-
[RLlib] Throw meaningful error when trying to run DirectMethod OPE with TF (ray-project#34417)
* Introduce error Signed-off-by: Artur Niederfahrenhorst <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 986b2d1
-
[CI][Green-Ray][1] Automated retry of infra-error release tests (ray-project#34057)
This PR is part of my effort to make OSS release test runs greener, starting with reducing infra error rates. Other work such as [this from Lonnie](https://docs.google.com/document/d/1hF7h8F19qFWFxH9WVeT8fWwVuNyUyHLTx-7LP3uxD50/edit#heading=h.i0cvl0u8jbfu) fixes systematic issues such as the unstable Anyscale staging environment. This PR addresses transient issues with Anyscale that are hard to avoid in a distributed system. On a day when Anyscale behaves well, transient issues seem to be around [2-3%](https://b534fd88.us1a.app.preset.io/superset/dashboard/43/?force=false&native_filters_key=MoYaGptJfGwbkF60A7RSzfoRLL_ypDf_JvNFxp2YGQ8Ls4CNgbAWEBh0WcOkOLsS), i.e. about 4 random failures for a test suite of 200 tests, which is annoying!
Concretely it will:
- First, classify an infra test run as a transient infra issue
- Instruct buildkite to automatically retry on transient issue
- If retry runs out, classify the infra test run as infra issue
Some other limitations that will be addressed in followup PRs:
- Move infra-failure retry configuration into LaunchDarkly?
- Limit auto-retry based on test cost or test runtime
Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 3f30c1b
-
Remove python 3.6 support [2/n] (ray-project#34416)
Removed some dead code for 3.6 Signed-off-by: elliottower <[email protected]>
Commit 233cd31
-
Fix backpressure handling of queued actor pool tasks (ray-project#34254)
There is a bug in the backpressure implementation with regard to actor pools, in that once a task is queued for an actor pool, it is no longer subject to backpressure. This is problematic when the output size of a task is much bigger than the input size. In this situation, the actor pool will keep executing tasks (converting small objects into larger objects), even when this would grossly exceed memory limits.
Put another way: it fixes the issue where the streaming executor queues tasks on an actor pool operator, but later on wants to "take it back" due to unexpectedly high memory usage. This avoids the issue by not queueing tasks that won't be immediately executed (so they won't need to be taken back).
Example:
1. Suppose there is an actor pool of size 10, each of which can take 1 active task each.
2. Each input task is size 1GB. The memory limit is 100GB, so we add 100 of these inputs in an actor pool operator.
3. When the tasks run, they expand into 100GB of output each. Now, the memory usage overall is 200GB (2x over our limit!).
4. However, since we already added those 100 inputs to the actor pool, there is no way of the streaming scheduler to pause execution of those 90 remaining queued inputs.
5. Now the 90 queued inputs execute and we end up using 1TB, or 10x our intended memory limit.
We need to check for the memory limit right before executing a task in the actor pool; one way of doing this is to eliminate the internal queue in the actor pool operator and instead always queue work outside the operator.
TODO:
- [x] Performance testing
- [x] Unit tests
- [x] Perf test final version
Signed-off-by: elliottower <[email protected]>
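The idea of queueing work outside the operator and checking the memory limit right before dispatch can be sketched as follows. This is a toy model, not the streaming executor's code: `ToyActorPool`, `dispatch`, and the gigabyte figures passed in are hypothetical names and inputs chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class ToyActorPool:
    """Hypothetical actor pool with no internal queue: one task per actor."""

    num_actors: int
    running: List[int] = field(default_factory=list)

    def has_free_slot(self) -> bool:
        return len(self.running) < self.num_actors


def dispatch(
    pending: Deque[int],
    pool: ToyActorPool,
    memory_used_gb: float,
    memory_limit_gb: float,
) -> int:
    """Move tasks from the external queue into the pool while it is safe.

    A task leaves `pending` only if an actor can start it immediately and the
    memory usage snapshot is under the limit, so there is never queued work
    inside the pool that would have to be "taken back" later.
    Returns the number of tasks dispatched.
    """
    dispatched = 0
    while pending and pool.has_free_slot() and memory_used_gb < memory_limit_gb:
        pool.running.append(pending.popleft())
        dispatched += 1
    return dispatched
```

With 100 pending inputs and a pool of 10 actors, a call dispatches at most 10 tasks; the other 90 stay in the external queue and are re-evaluated against the memory limit on the next call.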
Commit 56c8673
-
Deflake gcs_client_test.cc (ray-project#34411)
Hypothesis is that on_subscribe callback is invoked after test finishes; the reference to the stack-allocated atomic counter is no longer valid, causing asan failure. Signed-off-by: elliottower <[email protected]>
Commit 9839f6b
-
release logs for 2.4.0 (ray-project#33905)
Release logs perf benchmark for 2.4.0 Also updated tool to sort the regressions Signed-off-by: Clarence Ng <[email protected]> Co-authored-by: Clarence Ng <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 6419eab
-
[data] [streaming] Improve handling of KeyboardInterrupt (ray-project#34441)
Signed-off-by: elliottower <[email protected]>
Commit d32d4b1
-
[no_early_kickoff][core][state] Make state api return results that are strongly typed (ray-project#34297)
We are now returning strongly typed dataclasses (with type checking enabled by pydantic) from list and get APIs. Signed-off-by: elliottower <[email protected]>
Commit bc75bff
-
[core][state] Use `--err` flag to query stderr logs from worker/actors instead of `--suffix=err` (ray-project#34300)
Signed-off-by: elliottower <[email protected]>
Commit d0a0ced
-
[data] [streaming] [part 4/n] Rename dataset module files to datastream (ray-project#34413)
Signed-off-by: elliottower <[email protected]>
Commit 4ded4da
-
[CI] Fix shellcheck lint (ray-project#34488)
* [CI] Fix shellcheck lint Signed-off-by: Antoni Baum <[email protected]>
* More lint fixes Signed-off-by: Antoni Baum <[email protected]>
* Revert "More lint fixes" This reverts commit 8d6f316.
---------
Signed-off-by: Antoni Baum <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 589d371
-
[CI] Add GCE variances to Data tests (ray-project#34105)
This PR configures BuildKite to run Data release tests on GCE. I excluded the parquet_metadata_resolution and shuffle_data_loader release tests because more work is required to migrate those tests. --------- Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 7f00d44
-
[Core] convert gcs port read from env variable from str to int (ray-project#34482)
Convert the variable from str to int to close ray-project#33963. Signed-off-by: elliottower <[email protected]>
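Environment variables are always strings, so a port read from one has to be converted before it is used as a number. A minimal sketch of that conversion, assuming a hypothetical variable name `EXAMPLE_GCS_PORT` and an illustrative default; neither is taken from the PR.

```python
import os


def read_port(env_var: str = "EXAMPLE_GCS_PORT", default: int = 6379) -> int:
    """Read a port from the environment and return it as an int.

    `os.environ` only stores strings, so "6380" must be converted explicitly.
    A non-numeric value raises ValueError instead of being passed on silently.
    """
    raw = os.environ.get(env_var)
    if raw is None:
        return default
    return int(raw)
```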
Commit fc569e5
-
[Serve] gRPC Deployment schema check & minor improvements (ray-project#34210)
Issues found while debugging gRPC, fixed in this PR:
- Fix the options API not being set correctly.
- Add deployment attribute check.
- Remove the notification step in the deployment state.
Signed-off-by: elliottower <[email protected]>
Commit 3d0a89d
-
Fix mutable dataclass attribute (ray-project#34339)
This PR fixes an instance where a mutable attribute is used as a dataclass member, which causes an exception. See [this part of the docs](https://docs.python.org/3/library/dataclasses.html#mutable-default-values) for more information. Signed-off-by: pdmurray <[email protected]> Signed-off-by: elliottower <[email protected]>
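The failure mode and its standard fix can be reproduced with the standard library alone. The class names `BrokenConfig` and `Config` below are hypothetical and are not the classes touched by the PR.

```python
from dataclasses import dataclass, field
from typing import List

# A list used directly as a default is rejected when the class is defined.
try:

    @dataclass
    class BrokenConfig:
        tags: List[str] = []  # raises ValueError: mutable default

except ValueError as exc:
    error_message = str(exc)


# The fix: let each instance build its own list via default_factory.
@dataclass
class Config:
    tags: List[str] = field(default_factory=list)
```

Because `default_factory` is called once per instance, appending to one instance's `tags` does not affect any other instance.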
Commit ec0e813
-
[Event] Fix incorrect event timestamp (ray-project#34402)
We didn't use the correct system clock and always used a UTC timestamp, which is bad. This PR fixes the issue. Signed-off-by: elliottower <[email protected]>
Commit cc0bc5e
-
[core][tests] Harden flaky pytest (ray-project#34480)
I suspect the flaky cancellation test is due to an expectation that the final log message assumes a particular format. This may not be the last log message, so check backwards from the last message for this string. Signed-off-by: elliottower <[email protected]>
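Searching backwards from the last log line, instead of asserting on the last line only, might look like the sketch below. The helper name `log_contains_from_end` and its `max_lookback` parameter are hypothetical and not part of the actual test.

```python
from typing import Optional, Sequence


def log_contains_from_end(
    lines: Sequence[str], needle: str, max_lookback: Optional[int] = None
) -> bool:
    """Return True if `needle` appears in any line, scanning from the end.

    `max_lookback` optionally bounds how many trailing lines are inspected,
    so an old, unrelated occurrence does not satisfy the check.
    """
    for checked, line in enumerate(reversed(lines)):
        if max_lookback is not None and checked >= max_lookback:
            return False
        if needle in line:
            return True
    return False
```

A check written this way keeps passing when an extra message, such as a shutdown notice, is logged after the line the test is looking for.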
Commit 1200b20
-
[data] Experimental strict schema mode (ray-project#34336)
Signed-off-by: elliottower <[email protected]>
Commit 0b75855
-
[Datasets] Defer first block computation when reading a Datasource with schema information in metadata (ray-project#34251)
In the current implementation of [ExecutionPlan._get_unified_blocks_schema](https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/plan.py#L418), we force execution to compute the first block when given a `LazyBlockList`. However, when creating a Dataset from a datasource that has schema information available before reading (e.g. Parquet), this unnecessarily forces execution, since we already check for metadata in the subsequent [ensure_metadata_for_first_block](https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/lazy_block_list.py#L379). Therefore, we can remove `blocks.compute_first_block()`. Signed-off-by: Scott Lee <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit ac117e2
-
[core] Task backend - marking tasks failed on worker death (ray-project#33818)
When a parent task fails due to a task execution error, we currently mark its children tasks as failed (incorrectly). With this PR, we mark task failure states properly: we rely on worker exits to trigger the failure marking routine for tasks. This also aligns more correctly with Ray's actual behaviour: relevant tests are changed to explicitly verify that tasks are not running on the process. When a node fails, we rely on other parts of Ray (GCS) to report the worker's failure, which triggers the task failure marking for the worker and then marks tasks as failed properly. Tests are also added to verify the detached actor behaviour. Signed-off-by: elliottower <[email protected]>
Commit 3d902fd
-
Revert "Revert "[Metrics] Fix shared memory is not displayed properly… (ray-project#34460)
Signed-off-by: elliottower <[email protected]>
Commit 59be0e7
-
[Data] Update code owners of Ray Data (ray-project#34506)
As title, to reflect the latest group actively working on Ray Data module. Signed-off-by: Cheng Su <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit d82f401
-
[CI][Green-Ray][2] Transient error release test needs to fail fast (ray-project#34110)
In ray-project#34057, I made it so that release tests that fail with infra-error automatically retry once. With this PR, a test is retried only if it fails with infra-error and also ran for less than 30 minutes. Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
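The retry rule described here (retry once, and only when the infra-error run was short) can be sketched as a small classification function. The function name, the string labels, and the parameter names are hypothetical; only the single retry and the 30-minute threshold come from the commit messages.

```python
def classify_infra_failure(
    runtime_seconds: float,
    retries_used: int,
    max_retries: int = 1,
    fail_fast_seconds: float = 30 * 60,
) -> str:
    """Classify a release-test run that has already failed with an infra error.

    Returns "transient_infra_error" (eligible for an automatic retry) only if
    a retry is still available and the run failed fast; otherwise the run is
    reported as a plain "infra_error".
    """
    failed_fast = runtime_seconds < fail_fast_seconds
    retry_available = retries_used < max_retries
    if failed_fast and retry_available:
        return "transient_infra_error"
    return "infra_error"
```

Bounding the runtime keeps automatic retries cheap: a test that ran for hours before hitting an infra error is not silently rerun.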
Commit 95ef21a
-
[data] Also improve repr of pandas dtype (ray-project#34502)
Signed-off-by: elliottower <[email protected]>
Commit e182af5
-
[merge fix] Remove scripts again (ray-project#34513)
Signed-off-by: elliottower <[email protected]>
Commit 6e75358
-
[ci] Remove scripts duplicates and symlinks except for format.sh (ray-project#34463)
A year ago, ray-project#23866 moved our CI scripts into a more descriptive folder structure. Files in scripts/ were symlinks to the moved scripts. Even then, CI and documentation did not refer to any scripts in scripts/, with the exception of scripts/format.sh, which is referred to in pull request templates. Recently, ray-project#34340 overwrote some of the symlinks with their actual files. Since almost all of these scripts are only used in CI and not by users and developers, we should just get rid of the symlinks. The exception is format.sh which is actively used by developers. Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 98afc2b
-
[ci] Fix further linter errors (ray-project#34517)
Some shell scripts are still failing. This PR tries to identify and fix the remaining linter errors. Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 093c223
-
[air] Use Ray storage URI as default storage path, if configured [no_early_kickoff] (ray-project#34470)
With this PR, we will use the configured Ray storage URI for syncing Ray AIR results if no other remote storage path is set. Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 4589793
-
[Doc] Ray Debugging Doc Part 1 (OOM) (ray-project#34309)
This doc improves the existing debugging failure documentation. It adds:
- failure types
- how to do application level failure debugging
- out of memory debugging
- step-by-step memory profiling
It also rewrites the file descriptor issues section (it has very old info that is not correct anymore). Signed-off-by: elliottower <[email protected]>
Commit ef285ed
-
[ci] Restore pytest_checker script, but at correct location (ray-project#34523)
ray-project#34463 removed the scripts under the `scripts/` directory because all of them should have been symlinks. However, `pytest_checker.py` was an actual script that was not symlinked from the `ci/` directory. This PR restores this script at the correct location in `ci/lint` and adjusts all references to it in the codebase. Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 2f3c2e1
-
[air] Change doc occurrences of ray.data.Dataset to ray.data.Datastream (ray-project#34520)
We recently renamed `Dataset` to `Datastream`. This PR changes occurrences of Dataset in the Ray AIR examples to Datastream. This will also fix currently broken examples that still refer to `Dataset` when `Datastream` is imported instead. Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 47501e8
-
[docs] new landing page (ray-project#33520)
Signed-off-by: elliottower <[email protected]>
Commit 9b2810b
-
[Release test] [Cluster launcher] Add release test for aws `example-full.yaml` (ray-project#34487)
Adds a release test for example-full.yaml on AWS. Starts the cluster with ray up, runs a simple Ray driver script, and calls ray down. Also fixes a bug in this YAML file where we were using a string instead of an int for a VolumeSize. Signed-off-by: elliottower <[email protected]>
Commit ac4ead4
-
[Train] Fix lightning trainer devices setting (ray-project#34419)
Signed-off-by: woshiyyya <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit ec21094
-
[RLlib] DreamerV3: Catalog enhancements 05 - GRU default model support. (ray-project#34284)
Signed-off-by: elliottower <[email protected]>
Commit e6db435
-
Revert "[docs] new landing page" (ray-project#34533)
Signed-off-by: elliottower <[email protected]>
Commit 7aa85f3
-
[docs] gentle core walkthrough (ray-project#34134)
* [docs] gentle core walkthrough Signed-off-by: Max Pumperla <[email protected]>
* Update gentle_walkthrough.ipynb Signed-off-by: Max Pumperla <[email protected]>
---------
Signed-off-by: Max Pumperla <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 4f3e9f6
-
[serve] Remove old deployments upon redeployment of a named app (ray-project#34451)
When an app with non-empty name is redeployed, old deployments that are no longer part of the new graph are not cleaned up. This is because a new application state in application state manager is [created](https://github.com/ray-project/ray/blob/master/python/ray/serve/_private/application_state.py#L350-L355), so the [logic](https://github.com/ray-project/ray/blob/master/python/ray/serve/_private/application_state.py#L77-L83) that tracks which deployments to delete never actually works. This was not caught before because the old deployments are no longer tracked in the application state manager, and become "zombie deployments". This isn't a problem for the single app case (so there is no regression), because the old logic used to clean up old deployments hasn't yet been removed: https://github.com/ray-project/ray/blob/releases/2.3.0/python/ray/serve/_private/client.py#L297-L307.
Reproduction script:
```
# script.py
from ray import serve


@serve.deployment
def f():
    return "f"


@serve.deployment
def g():
    return "g"


fn = f.bind()
gn = g.bind()
```
Deploy it:
```
from ray import serve
from ray.serve.schema import ServeDeploySchema

client = serve.start(detached=True)
# Each application entry maps "import_path" to the bound graph in script.py.
config = {"applications": [{"name": "app1", "import_path": "script.fn"}]}
client.deploy_apps(ServeDeploySchema.parse_obj(config))
```
Redeploy with a different graph:
```
client = serve.start(detached=True)
config = {"applications": [{"name": "app1", "import_path": "script.gn"}]}
client.deploy_apps(ServeDeploySchema.parse_obj(config))
```
See that `app1_f` is not deleted: ![image](https://user-images.githubusercontent.com/15851518/232265539-c5af44e4-f37a-4305-9419-60744bba9b35.png)
Signed-off-by: elliottower <[email protected]>
Commit 392f97e
-
Revert "[CI] Fix shellcheck lint (ray-project#34488)" (ray-project#34529)
The shellcheck fix broke the shell script when $use_lstm is empty. Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 1c10e79
-
[RLlib] Learner group checkpointing (ray-project#34379)
Implement multinode learner group checkpointing and tests. --------- Signed-off-by: Avnish <[email protected]> Signed-off-by: avnishn <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit c500638
-
Revert "Revert "[docs] new landing page"" (ray-project#34534)
Signed-off-by: elliottower <[email protected]>
Commit 5f456ed
-
[Train] Allow local datasets in `HuggingFaceTrainer` (ray-project#34485)
* Allow local datasets in HuggingFaceTrainer Signed-off-by: Antoni Baum <[email protected]> * Clarify Signed-off-by: Antoni Baum <[email protected]> --------- Signed-off-by: Antoni Baum <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 00d7b9b
-
[air] Add tune frequent pausing release test. (ray-project#34501)
Signed-off-by: xwjiang2010 <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 23507c0
-
[CI][Bisect] Fix bisect due to wrong order of commit list (ray-projec…
…t#34536) Why are these changes needed? Currently we use git rev-list to get the list of commits. This command returns the commits in the reverse of the order we want, so reverse the list before passing it to bisect. --------- Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
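The ordering issue described in this commit can be illustrated with a minimal sketch. By default `git rev-list` prints the newest commit first, so a bisect that expects oldest-to-newest input needs the list reversed (or `git rev-list --reverse`). The helper name `oldest_first` is hypothetical and is not taken from the Ray CI scripts.

```python
from typing import List


def oldest_first(rev_list_output: str) -> List[str]:
    """Parse raw `git rev-list` output and return commits oldest first.

    Hypothetical helper: `git rev-list` emits one hash per line, newest
    first, so we drop blank lines and reverse the result.
    """
    commits = [line.strip() for line in rev_list_output.splitlines() if line.strip()]
    commits.reverse()
    return commits
```

The function takes the command output as a string, so it can be exercised without a Git checkout.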
Commit 76ccdcf
-
[UI] Disable null job id jumpable (ray-project#34378)
Make Ray submission jobs that have no job ID non-clickable. Otherwise, we navigate users to the page /jobs/null, where no job info is shown, which confuses them. Signed-off-by: elliottower <[email protected]>
Commit 73d3434
-
[core][ci] Fix mac test_task_events_2 (ray-project#34538)
On macOS, we also don't have access to the task name from the psutil Process (just like on Windows). Closes ray-project#34530 Signed-off-by: elliottower <[email protected]>
Commit 3ddb1a3
-
[CI] Add GCE variances for Data chaos tests (ray-project#34519)
This PR configures BuildKite to run Data release tests on GCE. I excluded the parquet_metadata_resolution and shuffle_data_loader release tests because more work is required to migrate those tests. --------- Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit b1196df
-
[Train] Support FSDP Strategy for LightningTrainer (ray-project#34148)
Signed-off-by: woshiyyya <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 3616ca2
-
[CI][Bisect][Easy/Urgent] Fix bisect (ray-project#34559)
Fix a couple of issues:
- Correct the git command to get the list of revs, including both boundaries
- Correct the boundary of the remaining list after each bisect
The previous code had issues with the boundaries. Added a test case that fails with the previous code but passes with the new code. Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
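The boundary handling mentioned above is easy to get wrong, so here is a minimal sketch of one way to do it. It is not the code from the Ray repository: the function name `first_failing_revision` and the `passes` callback are hypothetical, and the sketch assumes the list is ordered oldest to newest, includes both boundaries, starts with a known-passing revision, and ends with a known-failing one.

```python
from typing import Callable, List


def first_failing_revision(revisions: List[str], passes: Callable[[str], bool]) -> str:
    """Return the first revision for which `passes` is False.

    Assumptions (hypothetical contract, see lead-in): `revisions` is ordered
    oldest to newest, revisions[0] passes, and revisions[-1] fails.
    """
    if len(revisions) < 2:
        raise ValueError("need at least a passing and a failing revision")
    lo, hi = 0, len(revisions) - 1  # invariant: revisions[lo] passes, revisions[hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(revisions[mid]):
            lo = mid  # keep mid as the new passing boundary
        else:
            hi = mid  # keep mid as the new failing boundary
    return revisions[hi]
```

Keeping both boundaries inside the remaining range after every step preserves the invariant, so the loop always ends with the passing and failing revisions adjacent.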
Commit d0d0757
-
[ci/release] GCE test variants for ml_user tests (ray-project#34465)
Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 9c81eb8
-
[Core][easy] disable test not suppose to work with ray client ray-pro…
…ject#34556 this env doesn't work with ray client. Signed-off-by: elliottower <[email protected]>
Commit 3466113
-
[ci/release] GCE test variants for air_benchmark and air_examples (ra…
…y-project#34466) Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 8f34645
-
Log databricks proxy (ray-project#34088)
This PR adds standard logging of the Databricks proxy URL for the dashboard when a ray cluster starts. Currently the HTML link does not render until cell completion so it is difficult to access the dashboard while a ray workload is running. Signed-off-by: Nathan Azrak <[email protected]> Co-authored-by: Nathan Azrak <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 829ce6c
-
[core] add core team to protobuf owner ray-project#34566
Update the ownership for the relevant folders. Signed-off-by: elliottower <[email protected]>
Commit cb4aa67
-
[Core][pubsub] handle failures when publish failed. (ray-project#33115)
Why are these changes needed? ray-project#32046 indicates that pubsub might lose data, especially when the subscriber is under load. After examining the protocol, it seems one bug is that the publisher fails to handle publish failures: when we push a message in the mailbox, we delete the message being sent regardless of RPC failures.

This PR tries to address the problem by adding a monotonically increasing sequence_id to each message, and only deleting messages once the subscriber has acknowledged that they were received. The sequence_id sequence is generated per publisher, regardless of channel. This means that if there are multiple channels for the same publisher, each channel might not see contiguous sequence IDs. This also assumes the invariant that a subscriber object only subscribes to one publisher. We also rely on the pubsub protocol guarantee that at most one push request is in flight.

This also handles GCS failover. We do so by tracking the publisher_id on both the publisher and the subscriber. When GCS fails over, the publisher_id will be different, so both the publisher and the subscriber forget the information about the previous state. Signed-off-by: elliottower <[email protected]>
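The acknowledgement scheme described in this commit message can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Ray's C++ pubsub implementation: the class names, the `pending` and `handle_push` methods, and the `max_processed_seq` field are all hypothetical names chosen for illustration.

```python
import uuid
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

Message = Tuple[int, str, str]  # (sequence_id, channel, payload)


@dataclass
class Publisher:
    """Toy publisher: keeps messages until the subscriber acknowledges them."""

    publisher_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    next_seq: int = 1
    mailbox: Deque[Message] = field(default_factory=deque)

    def publish(self, channel: str, payload: str) -> int:
        # One sequence per publisher, shared by all channels.
        seq = self.next_seq
        self.next_seq += 1
        self.mailbox.append((seq, channel, payload))
        return seq

    def pending(self, acked_seq: int) -> List[Message]:
        # Delete only what the subscriber has acknowledged; a failed push
        # leaves the messages in the mailbox so they are sent again.
        while self.mailbox and self.mailbox[0][0] <= acked_seq:
            self.mailbox.popleft()
        return list(self.mailbox)


@dataclass
class Subscriber:
    """Toy subscriber: tracks the highest sequence_id it has processed."""

    publisher_id: str = ""
    max_processed_seq: int = 0
    received: List[str] = field(default_factory=list)

    def handle_push(self, publisher_id: str, batch: List[Message]) -> None:
        if publisher_id != self.publisher_id:
            # A different publisher_id (for example after a failover) means
            # the old sequence numbers no longer apply.
            self.publisher_id = publisher_id
            self.max_processed_seq = 0
        for seq, _channel, payload in batch:
            if seq <= self.max_processed_seq:
                continue  # duplicate caused by a retry
            self.received.append(payload)
            self.max_processed_seq = seq
```

In this model a lost push is harmless: the next call to `pending` returns the same messages, and the subscriber skips anything it has already processed.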
Commit 1ac5350
-
[AIR] Add util to create a torch ddp process group for a list of work…
…ers. (ray-project#34202) Signed-off-by: Jun Gong <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 6a614fa
-
[CI][Core] Set some GCE smoke tests to run on manual frequency (ray-p…
…roject#34516) I noticed that some GCE smoke test variants run nightly. Let's move them to manual frequency instead, since we don't want to pay the cost of running them on an automatic cadence yet. --------- Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit e6e49fa
-
[CI] Fix some chaos test configurations (ray-project#34571)
Some GCE chaos test configurations are using AWS configs. Change them to the GCE equivalents. Also use the more powerful n2 machines instead of e2. Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 3ab751d
-
[release] Make sure that test code matches the installed wheel. (ray-…
…project#30156) Signed-off-by: xwjiang2010 <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 3c01cb9
-
[air-output] minor fix to print configuration on start. (ray-project#…
…34575) Signed-off-by: xwjiang2010 <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 0aa2ee8
-
[Core] Deflake test_advanced_9 (ray-project#34410)
Looks like the gcs server proc doesn't go back to its original num_fds; it goes lower. Output from my machine:
>> 222 # before starting worker procs
(A pid=28851) HELLO
['WORLD', 'WORLD', 'WORLD', 'WORLD', 'WORLD', 'WORLD', 'WORLD', 'WORLD', 'WORLD', 'WORLD']
>> 250 # with worker procs
>> 217
>> 216
>> 213
>> 212
>> 207
>> 206 # after worker procs die.
>> 206
>> 208 # Not sure why it goes up again
>> 208 # Remains at 208, times out
This PR deflakes the test, but I don't know enough about the gcs server to say if this is a good fix or not. Signed-off-by: elliottower <[email protected]>
Commit ca64a29
-
[data] Standardize on Arrow types for schema() in strict mode
Signed-off-by: Eric Liang <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 5c8cf49
-
[ray-data] Add alias parameters to the aggregate function, and add qu…
…antile fn (ray-project#34358) Signed-off-by: elliottower <[email protected]>
Commit 3485e52
-
Revert "[data] Add usage tag for which block formats are used (ray-pr…
…oject#34384)" (ray-project#34569) This reverts commit ffeedbf. [release test passing](https://buildkite.com/ray-project/release-tests-pr/builds/35579) Signed-off-by: xwjiang2010 <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 41bc627
-
Disallow format query in strict mode (ray-project#34564)
Signed-off-by: Eric Liang <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 3a877cc
-
[data] Log a warning if the batch size is misconfigured in a way that…
… would grossly reduce parallelism for actor pool. (ray-project#34594) Signed-off-by: elliottower <[email protected]>
Commit 6abf379
-
[Dashboard] Make loading screen not block out the entire page. (ray-p…
…roject#34515) Previously, if a dashboard page was loading, it would grey out the whole screen and buttons would not be press-able. Now, we don't block out the whole page. Also don't show loading bar if data is already loaded from in-memory cache. Signed-off-by: elliottower <[email protected]>
Commit b302c52
-
[data] [docs] Datastream docs rename [5/n] (ray-project#34512)
Part 5 of ray-project#34235 Signed-off-by: elliottower <[email protected]>
Commit 1df0ca1
-
clarify M1 installation instructions (ray-project#34505)
A few folks have been confused by the order of the installation instructions for M1, so adding some clarifying language. While I was at it, I made minor improvements to some language in nearby paragraphs. Signed-off-by: elliottower <[email protected]>
Commit 6eb23c7
-
Create LLM section and add examples (ray-project#34614)
Surface LLM/Generative AI use cases. Signed-off-by: angelinalg <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 9636e78
-
Add driver logs to Jobs page for submission jobs (ray-project#34514)
Add driver logs to Jobs page for submission jobs
- Adds a refresh button to the log viewer to reload the logs.
- Refactors the log viewer from the logs page into its own component.
- Updates the look and feel of the jobs page to match the new IA style.
- Adds user-provided metadata to the job detail page (fixes "[Core|Dashboard] Support custom tags for jobs", ray-project#34187).
- Updates the table icon.
- Changes "Tasks" to "Tasks/actor overview".
- Adds a Node Count card next to the ray status cards.
Signed-off-by: elliottower <[email protected]>
Commit 5df609b
-
[air/Doc] Fix unused config building function in lightning MNIST exam…
…ple. The build_lightning_config_from_existing_code() function is not called in the example, and there is duplicated config-building logic below it. This PR uses this function and removes the duplicate. Signed-off-by: woshiyyya <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 42b9a92
-
[core][state][nightly] Fix stress_test_state_api_scale (ray-project#3…
…4579) Signed-off-by: elliottower <[email protected]>
Commit 304a0ce
-
[ci/release] Increase concurrency limit for gpu gce (ray-project#34578)
We now have 100 T4 machines, so increase the limit. At peak, this limit means that we will use 8*4 + 4*4 + 2*8 + 32 = 96 machines. Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 1c2e6a0
-
[serve][nit] Fix formatting & verbiage for `serve shutdown` (ray-project#34585)
Fixes unnecessary spaces & cleans up wording.

Before:
```
(ray) eoakes@Edwards-MacBook-Pro-2 serve % serve shutdown
This will shutdown the Serve application at address "http://localhost:52365" and delete all deployments there. Do you want to continue? [y/N]: y
2023-04-19 12:46:12,078 SUCC scripts.py:584 -- Sent delete request successfully!
```
After:
```
(ray) eoakes@Edwards-MacBook-Pro-2 serve % serve shutdown
This will shut down Serve on the cluster at address "http://localhost:52365" and delete all applications there. Do you want to continue? [y/N]: y
2023-04-19 12:45:52,050 SUCC scripts.py:583 -- Sent shutdown request; applications will be deleted asynchronously.
```
Signed-off-by: elliottower <[email protected]>
Commit ed34037
-
[ci/release] GCE variants for remaining Tune tests (ray-project#34572)
Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit c8ab61a
-
[Doc] Fix AIR benchmark configuration link failure. (ray-project#34597)
Signed-off-by: elliottower <[email protected]>
Commit 2ff90f1
-
[air-output] print out worker ip for distributed train workers. (ray-…
…project#33807) Signed-off-by: xwjiang2010 <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 15d11be
-
Fix download_wheels.sh wheel urls (ray-project#34616)
Some mac wheel urls are invalid Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 9884c27
-
[Data] Fix `iter_tensor_batches_benchmark_multi_node` GCE (ray-project#34598)
The `iter_tensor_batches_benchmark_multi_node` GCE variant was failing because it used the wrong compute config. Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 9379bdf
-
[Doc][AIR] Improve visibility of Trainer restore and stateful callbac…
…k restoration (ray-project#34350) Signed-off-by: elliottower <[email protected]>
Commit 374cab9
-
[Serve] [Docs] Change incorrect Serve app name in Stable Diffusion tu…
…torial (ray-project#34426) The ray serve command was not matching the correct object. Signed-off-by: elliottower <[email protected]>
Commit 3e34002
-
[data] [strict-mode] Require compute spec to be explicitly spelled out (ray-project#34610)
Signed-off-by: elliottower <[email protected]>
Commit aac345b
-
[docs] intro and graphic for LLM (ray-project#34615)
Follow up to ray-project#34614 Why are these changes needed? To match the other use cases, we need a more substantial intro paragraph and graphic. --------- Signed-off-by: angelinalg <[email protected]> Signed-off-by: Philipp Moritz <[email protected]> Co-authored-by: Philipp Moritz <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit a548291
-
Fix typo in node.py (ray-project#34630)
Fix typo in docstring. Signed-off-by: JYX <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 610a8d8
-
[CI][Green-Ray][3] Extract error logs from ray logs (ray-project#34193)
Currently there are a lot of test runs where we fail to acquire logs (especially for infra-failure issues). This PR falls back to querying the Ray logs for error patterns if we fail to query the application logs. Signed-off-by: elliottower <[email protected]>
Commit d69eb02
-
[Data] [strict-mode] Remove internal TableRow abstractions and instea…
…d use Dict[str, Any] as the row format Signed-off-by: Eric Liang <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit c74554a
-
[train] Add AccelerateTrainer as valid AIR_TRAINER (ray-project#34639)
Signed-off-by: Matthew Deng <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit fa77f89
-
[data] Configure progress bars via DataContext
Signed-off-by: elliottower <[email protected]>
Commit f68b4c1
-
[CI] disable flaky test test_run_on_all_workers (ray-project#34647)
Signed-off-by: elliottower <[email protected]>
Commit 62a307f
-
Revert "[core]Turn on light weight resource broadcasting. (ray-projec…
…t#32625)" (ray-project#34636) This reverts commit 1bfbc46. Signed-off-by: elliottower <[email protected]>
Commit 8c77f56
-
[docs] replace tune.report with session.report (ray-project#34435)
Signed-off-by: elliottower <[email protected]>
Commit b702efe
-
[Ci] fix pip version to deflake minimal install 3.10
see if the test failure is caused by pip version upgrade Signed-off-by: elliottower <[email protected]>
Commit 3652630
-
[CI] fix virtualenv version to deflake linux://python/ray/tests:test_…
…runtime_env_complicated (ray-project#34650) It looks like virtualenv was upgraded between the passing and the failing test runs. Signed-off-by: elliottower <[email protected]>
Commit 69a5d29
-
[Syncer] Remove spammy logs. (ray-project#34654)
Signed-off-by: elliottower <[email protected]>
Commit a3bd535
-
[ci/release] GCE variants for Alpa, Golden notebooks, Lightning, Horo…
…vod, Workspace templates (ray-project#34565) Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 3bb1993
-
[docs][tune] Fix Tune tutorial (ray-project#34660)
One line fix for bug introduced in ray-project#34435 Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 4f94951
-
[Autoscaler][gcp] parallel terminate nodes (ray-project#34455)
Why are these changes needed? ray down takes a lot of time when using GCPNodeProvider, as stated in ray-project#26239, because GCPNodeProvider uses the serial implementation of terminate_nodes from the parent class NodeProvider and also uses a coarse lock in its terminate_node, which prevents executing it concurrently (not really sure, because I'm new to this).
- Add a ThreadPoolExecutor in GCPNodeProvider.terminate_nodes to parallelize execution of terminate_node
- Use fine-grained locks that assign one RLock per node_id
- Add unit tests
Why not go with the suggestions (batch APIs and a non-blocking version of terminate_node) mentioned in ray-project#26239? As a novice, I think both solutions would break the Liskov Substitution Principle, and those who already use terminate_node(s) would need to add await. Related issue number: ray-project#26239 --------- Signed-off-by: Chen-Chen Yeh <[email protected]> Co-authored-by: Chen-Chen Yeh <[email protected]> Signed-off-by: elliottower <[email protected]>
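The approach described in this commit message (a thread pool plus one lock per node) can be sketched as follows. This is a simplified stand-in, not the actual GCPNodeProvider code: the class `ToyNodeProvider`, its `_lock_for` helper, and the `max_workers` parameter are hypothetical, and the cloud API call is replaced by a comment.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class ToyNodeProvider:
    """Illustrative provider that terminates nodes in parallel."""

    def __init__(self, max_workers: int = 8) -> None:
        self._max_workers = max_workers
        self._locks_guard = threading.Lock()  # protects the lock table only
        self._node_locks: Dict[str, threading.RLock] = {}

    def _lock_for(self, node_id: str) -> threading.RLock:
        # Fine-grained locking: one RLock per node_id, created on demand.
        with self._locks_guard:
            if node_id not in self._node_locks:
                self._node_locks[node_id] = threading.RLock()
            return self._node_locks[node_id]

    def terminate_node(self, node_id: str) -> str:
        with self._lock_for(node_id):
            # A real provider would call the cloud API here. Calls for
            # different nodes no longer block one another.
            return node_id

    def terminate_nodes(self, node_ids: List[str]) -> List[str]:
        if not node_ids:
            return []
        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            # pool.map returns results in input order.
            return list(pool.map(self.terminate_node, node_ids))
```

Both methods stay synchronous, so existing callers of terminate_node and terminate_nodes would not need to change, which matches the compatibility concern raised in the commit message.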
Commit 8e796fd
-
[Tune] Enable `tune.ExperimentAnalysis` to pull experiment checkpoint files from the cloud if needed (ray-project#34461)
For post-experiment analysis of a Tune run that uploaded results and checkpoints to S3, the node where analysis is being done may not contain the experiment directory. In this case, the experiment checkpoint and other files (JSON and CSV result files and the param space) should be pulled to a temp directory in the local filesystem. While this adds functionality to `ExperimentAnalysis`, it also enables:
1. `ResultGrid(ExperimentAnalysis("s3://..."))`, which is what we do in `tuner.fit()`
2. `Tuner.restore("s3://...").get_results()`
Point 2 was the error that flagged this issue in the first place. This PR also cleans up some confusing trial metadata loading code in `ExperimentAnalysis`. Signed-off-by: Justin Yu <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 3469051
-
[docs] [serve] removed line numbers and fixed file name summary_model…
….py (ray-project#34617) The copy-and-paste button was including line numbers in 3 code examples, which is a bad user experience. Also fixed an error with the filename: the command-line instructions said `python model.py`, but it should be `python summary_model.py`. This addresses two of the problems raised in GH issue 34481, but not all of them. Signed-off-by: elliottower <[email protected]>
Commit 6fc9b4b
-
[CI][Green-Ray][4] Compute and store unique crash pattern from logs (r…
…ay-project#34200) This PR computes and aggregates unique crash patterns from logs, then stores them in Databricks. Later on, this will help us build a dashboard with a heat map of errors from aggregated logs and prioritize the most impactful errors to fix. Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit a7298ec
-
[serve] Add support for application builders & arguments (ray-project…
…#34584) First cut at an implementation for ray-project#34542. There should be no changes in behavior for existing applications. Will update documentation & examples in a separate PR, would like to get it merged to get feedback from others on the API. Signed-off-by: elliottower <[email protected]>
Commit e9aa541
-
[docs] add click events for code blocks (ray-project#34623)
Signed-off-by: elliottower <[email protected]>
Commit f4a5aac
-
[Datasets] Support non-shuffle repartitioning in `Repartition` `LogicalOperator` (ray-project#34547)
This is a followup for ray-project#32102, to support non-shuffle repartition in the logical operator, as in _internal/fast_repartition.py. Signed-off-by: Scott Lee <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit cca57a3
-
[docs] Fix broken links (ray-project#34665)
Signed-off-by: elliottower <[email protected]>
Commit 02299ba
-
[docs] wrap autogenerated API nav items (ray-project#34047)
Signed-off-by: elliottower <[email protected]>
Commit e9ec461
-
[docs] sphinx design 1/n (ray-project#34625)
Signed-off-by: elliottower <[email protected]>
Commit fa30fff
-
[CI][Bisect][4] Add pre-sanity check to avoid infra or external chang…
…e root causes (ray-project#34553) Why are these changes needed? Tests often fail because of a non-code-change issue (external or infra issues). Before running a bisect, run a pre-sanity check to make sure that the provided passing and failing revisions are valid. Otherwise, terminate the bisect early and let the users know that the test is flaky. Signed-off-by: elliottower <[email protected]>
Commit a6deb57
-
[CI][HotFix] Revert 34499 ray-project#34688
Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliottower <[email protected]>
Commit 5ba8b4c
-
[autoscaler v2] Interface between autoscaler and gcs (ray-project#34680)
Why are these changes needed? This PR introduces the interface between GCS and the autoscaler. Specifically, it introduces two APIs:
- GetClusterResourceState: The autoscaler queries this interface to get cluster resource usage, which includes nodes (state and resource utilization) as well as pending requests, which include ResourceRequest, GangResourceRequest, and ClusterResourceConstraint.
  - NodeState includes NodeStatus, which can transition ALIVE -> DEAD, or ALIVE -> DRAIN_PENDING -> DRAINING -> DRAINED -> DEAD, or ALIVE -> DRAIN_PENDING -> DRAIN_FAILED. It also includes the instance_id where the autoscaler is aware of it, which allows the autoscaler to do reconciliation if available.
  - ResourceRequest comes with a PlacementConstraint, which only supports AntiAffinityConstraint today: the resource request can't be allocated on a node with the same label/value specified in the AntiAffinityConstraint.
  - GangResourceRequest has gang scheduling semantics: the requests in the gang should all be fulfilled atomically.
- ReportAutoscalingState: The autoscaler also reports its own state back to the cluster using this API, which includes all instances (including those pending launch) as well as infeasible requests. Instance state can transition QUEUED -> REQUESTED -> BOOTSTRAPPING -> ALIVE -> TERMINATING -> DEAD. Two special states are TO_BE_PREEMPTED and TO_BE_DRAINED: one is forced preemption, the other is collaborative draining (which can be reversed). It also reports back requests that are infeasible, associated with a specific request version.
Signed-off-by: elliottower <[email protected]>
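The NodeStatus transitions listed in this commit message can be written down as a small transition table. The sketch below is only an illustration of those three paths: the Python enum, the `ALLOWED_TRANSITIONS` table, and `is_valid_path` are hypothetical and do not mirror the actual protobuf definitions added by the PR.

```python
from enum import Enum, auto
from typing import Dict, Sequence, Set


class NodeStatus(Enum):
    ALIVE = auto()
    DRAIN_PENDING = auto()
    DRAINING = auto()
    DRAINED = auto()
    DRAIN_FAILED = auto()
    DEAD = auto()


# Edges taken directly from the three paths named in the commit message.
ALLOWED_TRANSITIONS: Dict[NodeStatus, Set[NodeStatus]] = {
    NodeStatus.ALIVE: {NodeStatus.DEAD, NodeStatus.DRAIN_PENDING},
    NodeStatus.DRAIN_PENDING: {NodeStatus.DRAINING, NodeStatus.DRAIN_FAILED},
    NodeStatus.DRAINING: {NodeStatus.DRAINED},
    NodeStatus.DRAINED: {NodeStatus.DEAD},
    NodeStatus.DRAIN_FAILED: set(),
    NodeStatus.DEAD: set(),
}


def is_valid_path(path: Sequence[NodeStatus]) -> bool:
    """Return True if every consecutive pair in `path` is an allowed edge."""
    return all(nxt in ALLOWED_TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))
```

The message does not say what follows DRAIN_FAILED, so the table leaves it without outgoing edges. The instance lifecycle (QUEUED through DEAD) could be modeled with a second table in the same way.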
Commit 4636032
-
Update gymnasium version to 0.28.1
Signed-off-by: elliottower <[email protected]>
Commit be632df
-
Signed-off-by: elliottower <[email protected]>
Commit 5888c52