[CI][GCI/3] Add variations attribute to create tests in multiple cluster environment #33718
Commits on Mar 18, 2023
-
Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of the CI/CD builds in our pipeline.
The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that is measured in kilobytes by default (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not exist. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit. Note that other surprises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation of how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further. Signed-off-by: Cuong Nguyen <[email protected]>
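The idea of asking for one exact archive member, rather than streaming the whole archive through a pipe, can be sketched in Python as follows. This is only an illustration under assumptions: the helper name `read_commit_lines` and the member path passed to it are hypothetical, and the actual CI script uses `unzip` rather than Python.

```python
import zipfile
from typing import List


def read_commit_lines(wheel_path: str, member: str) -> List[str]:
    """Return the lines of one archive member that start with __commit__.

    `member` is the exact path inside the archive (an assumption of this
    sketch; pass whatever file actually carries the commit marker).
    """
    with zipfile.ZipFile(wheel_path) as wheel:
        # Only the requested member is decompressed, so the size of the
        # rest of the archive does not matter.
        text = wheel.read(member).decode("utf-8")
    return [line for line in text.splitlines() if line.startswith("__commit__")]
```

If more than one line comes back, the caller can treat that as the "many matches" surprise mentioned above and fail loudly instead of guessing.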
Commit 57d69f8
-
Commit 3184500
Commits on Mar 20, 2023
-
Commit d085526
-
Improve wheel commit validation error message
Signed-off-by: Cuong Nguyen <[email protected]>
Commit 5c17ef9
Commits on Mar 23, 2023
-
Commit 98212de
-
Set up dependencies and credentials for GCE in Buildkite
Signed-off-by: Cuong Nguyen <[email protected]>
Commit c4638a6
-
Add google-cloud-storage package to requirements
Signed-off-by: Cuong Nguyen <[email protected]>
Commit eb1b6a2
Commits on Mar 24, 2023
-
Commit 048545e
-
Signed-off-by: Cuong Nguyen <[email protected]>
Commit d02abc5
Commits on Mar 25, 2023
-
Commit 3e78d18
Commits on Mar 26, 2023
-
Support for gs:// in anyscale job runner
Signed-off-by: Cuong Nguyen <[email protected]>
Commit 1fb1997
-
Signed-off-by: Cuong Nguyen <[email protected]>
Commit 7db91f5
-
Support test definition with multiple flavors
Signed-off-by: Cuong Nguyen <[email protected]>
Commit a37234f
-
Use not in to check key in dict
Signed-off-by: Cuong Nguyen <[email protected]>
Commit fbfcc92
-
Commit 9cd8412
-
Commit 6a8e36b
-
Commit 22792cc
-
Commit accd686
Commits on Mar 27, 2023
-
Signed-off-by: Cuong Nguyen <[email protected]>
Commit 235e877
-
Commit f57b95b
Commits on Mar 28, 2023
-
Commit 3c98173
-
Only initialize gs client on gs host
Signed-off-by: Cuong Nguyen <[email protected]>
Commit dc325a3
-
Commit 65d0577
-
Commit 4121ec4
-
[RLlib] fix preprocessor test (#33719)
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Commit 170ec1c
-
[RLlib] APPO TF with RLModule and Learner API (#33310)
Signed-off-by: Avnish <[email protected]>
Commit a0c8f1e
-
[Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)
This makes Java logging consistent with PR #31772, which seems to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console. Co-authored-by: Qing Wang <[email protected]>
Commit d9e00cd
-
[serve] Fix serve HA test (#33699)
#33597 changed the log statements for adding a replica to a deployment. The assert statement in test_ray_server_basic checks for the exact log statement - we need to update that assert statement.
Commit c019896
-
Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)
<img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">
This reverts commit cb5bb0e.
Commit 99eaefa
-
[tune] add data to CI test dependencies (#33729)
1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).

**Note:** There should probably be a better way of handling dependencies in CI tests...
Commit 8450787
-
Commit b314f31
-
[RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Commit 5ae9abc
-
[RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Commit 090b579
-
[RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Commit e1e36cb
-
[RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Commit d7e87cf
-
[Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)
This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.
Commit eef5240
-
[runtime env] Close schema after loading and continue on error (#33535)
This PR fixes a few things:
* A warning from not closing the file opened with `open()`. (We treat these warnings as errors, and Ray was causing some integration tests to blink.)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
* There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3. a. Save `:` or other invalid JSON as `bad-json.json`
   b. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <[email protected]>
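The two behaviors described above (close the file, and `continue` past a bad schema) can be sketched as below. This is not Ray's loader: `load_schemas` and its return shape are assumptions made only to show the pattern.

```python
import json
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)


def load_schemas(paths: List[str]) -> Dict[str, object]:
    """Load JSON schema files, skipping any that are missing or malformed."""
    schemas: Dict[str, object] = {}
    for path in paths:
        try:
            # The context manager closes the file even if decoding fails,
            # which avoids the unclosed-file warning.
            with open(path) as f:
                schema = json.load(f)
        except (OSError, json.JSONDecodeError) as exc:
            logger.error("Skipping schema %s: %s", path, exc)
            continue  # move on to the next schema instead of failing
        schemas[path] = schema
    return schemas
```

Putting the bad path first in the input is the case the original test missed, since the missing `continue` only matters when no earlier schema has been loaded.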
Commit 6b45157
-
[Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)
In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) checked if the info for the job already existed in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see that the info didn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name. This PR fixes the race condition by making operations (1) and (2) happen atomically in a single call (this is already supported by internal_kv_put(... overwrite=False) -> int, which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). It also adds a unit test which fails without this change.
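A toy version of the put-if-absent pattern is sketched below. `ToyKV` and `submit_job` are hypothetical stand-ins, not Ray's internal KV or Jobs code; only the `overwrite=False` flag and the "number of keys newly added" return value are taken from the description above.

```python
import threading
from typing import Dict


class ToyKV:
    """In-memory stand-in for a KV store with put-if-absent semantics."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._lock = threading.Lock()

    def put(self, key: str, value: str, overwrite: bool = True) -> int:
        """Store value and return the number of keys newly added (0 or 1)."""
        with self._lock:
            existed = key in self._data
            if existed and not overwrite:
                return 0
            self._data[key] = value
            return 0 if existed else 1


def submit_job(kv: ToyKV, submission_id: str, info: str) -> None:
    # The existence check and the write happen in one call, so two racing
    # submitters cannot both believe they were first.
    if kv.put(submission_id, info, overwrite=False) == 0:
        raise ValueError(f"Job with submission_id {submission_id!r} already exists")
```

With a separate check followed by a put, both callers could pass the check before either writes; here the loser gets a clear error immediately.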
Commit e5b6f78
-
Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)
Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message. Signed-off-by: Jiajun Yao <[email protected]>
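The retry behavior can be illustrated with a small Python sketch. The real change is in Ray's C++ `RedisClient`; `next_job_id`, `send_incr`, and the attempt limit below are hypothetical, while the numeric reply types follow hiredis (3 for an integer reply, 6 for an error reply).

```python
import time
from typing import Callable, Tuple

REPLY_INTEGER = 3  # hiredis REDIS_REPLY_INTEGER
REPLY_ERROR = 6    # hiredis REDIS_REPLY_ERROR


def next_job_id(
    send_incr: Callable[[], Tuple[int, object]],
    max_attempts: int = 3,
    delay_s: float = 0.0,
) -> int:
    """Call send_incr until it yields an integer reply, retrying on error replies."""
    for attempt in range(1, max_attempts + 1):
        reply_type, payload = send_incr()
        if reply_type == REPLY_INTEGER:
            return int(payload)
        if reply_type == REPLY_ERROR:
            # Surface the server's message instead of only the type code.
            print(f"attempt {attempt}: Redis error reply: {payload}")
            time.sleep(delay_s)
            continue
        raise TypeError(f"Expected integer, found Redis type {reply_type}")
    raise RuntimeError(f"No integer reply after {max_attempts} attempts")
```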
Commit b5f1c3c
-
Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)
Additionally fix `test_usage_test.py`. Signed-off-by: xwjiang2010 <[email protected]>
Commit 971a9c8
-
Deprecate RuntimeContext.get (#33734)
RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods. Signed-off-by: Jiajun Yao <[email protected]>
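The general shape of such a deprecation can be sketched as follows. `ToyRuntimeContext` is a hypothetical class, not Ray's `RuntimeContext`: it only shows a catch-all getter that warns while the per-field string getters stay silent.

```python
import warnings


class ToyRuntimeContext:
    """Hypothetical context object used only to illustrate the pattern."""

    def __init__(self, job_id: bytes, node_id: bytes) -> None:
        self._job_id = job_id
        self._node_id = node_id

    def get_job_id(self) -> str:
        # Preferred accessor: returns a plain string.
        return self._job_id.hex()

    def get_node_id(self) -> str:
        return self._node_id.hex()

    def get(self) -> dict:
        # Deprecated accessor: still works, but exposes raw id objects.
        warnings.warn(
            "get() is deprecated; use get_job_id() / get_node_id() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return {"job_id": self._job_id, "node_id": self._node_id}
```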
Commit 3e8e902
-
[Serve] Fix the serve.batch api doc (#33588)
Fix the example formatting in the serve batch API doc
Commit 90c60da
-
[infra] increase Build timeout (#33756)
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g. https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2
Commit b8596e8
-
[RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Commit c1aac60
-
[data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)
It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
Commit ed9d773
-
[Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736)
It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).
Commit 561fa53
-
[Test] Fix out of disk error (#33732)
Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.
Commit d62dfe8
-
[Data] Repurpose streaming CI to bulk CI (#33478)
The streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).
Commit 1041e81
-
[Serve] Enable serve metrics lib working in ray actor (#33717)
Make sure the `ray.serve` metrics library works in a Ray actor without a Serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, outside any Serve replica.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        await self.my_actor.test.remote()
        return
```
Commit f6e0028
-
[RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)
Signed-off-by: sven1977 <[email protected]>
Commit 074396a
-
[Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)
If the user provides a collate_fn to iter_torch_batches, the collate_fn is expected to be responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.
Signed-off-by: amogkam <[email protected]>
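The control flow described above can be sketched without torch. `iter_batches` and `move_to_device` here are hypothetical stand-ins for the real `iter_torch_batches` internals; the point is only that a user-supplied `collate_fn` bypasses the automatic transfer.

```python
from typing import Callable, Iterable, Iterator, List, Optional


def iter_batches(
    batches: Iterable[List[float]],
    move_to_device: Callable[[List[float]], object],
    collate_fn: Optional[Callable[[List[float]], object]] = None,
) -> Iterator[object]:
    """Yield batches, transferring them automatically only when no collate_fn is given."""
    for batch in batches:
        if collate_fn is not None:
            # The caller's collate_fn owns conversion and device placement.
            yield collate_fn(batch)
        else:
            # Default path: apply the automatic transfer.
            yield move_to_device(batch)
```

In a torch setting the `collate_fn` would typically build the tensor and call `.to(device)` itself, which is why a second automatic move is unnecessary.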
Commit 3a98259
-
Signed-off-by: Cuong Nguyen <[email protected]>
Commit d487977
-
Signed-off-by: Cuong Nguyen <[email protected]>
Commit d894a97
-
Commit 8053b3b
-
Commit 34ff5d1
-
Change ray to 2.3.1 to work around the #ir-glorious-shape
Signed-off-by: Cuong Nguyen <[email protected]>
Commit abad535
-
Signed-off-by: Cuong Nguyen <[email protected]>
Commit 15ec7f7
-
Commit 8173e4e
-
Merge branch 'master' into gce03
Signed-off-by: Cuong Nguyen <[email protected]>
Commit 1355867
-
[CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)
Add the ability for AnyscaleJobRunner to run on a GCE host. The added logic:
- Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write and delete are all supported.
- Add some sample tests that use GCE as an environment so we can run CI and check that this diff works.

- [X] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825
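The host detection described in the first bullet might look roughly like the sketch below. It is an illustration only: the function names and the `RELEASE_ON_GCE` variable are made up for this example and are not the names used in the release tooling.

```python
import os
from typing import Mapping, Optional, Tuple
from urllib.parse import urlparse


def is_gce_host(storage_url: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Decide whether to use the GCE code path from an env flag or the storage link."""
    env = os.environ if env is None else env
    # "RELEASE_ON_GCE" is a hypothetical variable name used only in this sketch.
    if env.get("RELEASE_ON_GCE") == "1":
        return True
    return urlparse(storage_url).scheme == "gs"


def split_bucket_url(url: str) -> Tuple[str, str]:
    """Split gs://bucket/path/to/file into (bucket, path/to/file)."""
    parsed = urlparse(url)
    return parsed.netloc, parsed.path.lstrip("/")
```

Checking the URL scheme before creating any cloud client also matches the later commit that only initializes the gs client on a gs host.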
Commit a0001d5
-
Commit f21b22f
-
Commit 4d13b7b
-
Set up dependencies and credentials for GCE in Buildkite
Signed-off-by: Cuong Nguyen <[email protected]>
Commit cc1268d
-
Add google-cloud-storage package to requirements
Signed-off-by: Cuong Nguyen <[email protected]>
Commit aa3453e
-
Support for gs:// in anyscale job runner
Signed-off-by: Cuong Nguyen <[email protected]>
Commit 15965b5
-
Signed-off-by: Cuong Nguyen <[email protected]>
Commit f0ea4c4
-
[RLlib] APPO TF with RLModule and Learner API (#33310)
Signed-off-by: Avnish <[email protected]>
Commit 38ce312
-
[data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)
It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
Commit 81472b7
-
Signed-off-by: Cuong Nguyen <[email protected]>
Commit 66c2de8
-
Commit 41f7636
-
Commit e87a6ce
-
Commit 56447ae
-
Commit c68097d
-
Commit b352ffc
Commits on Mar 29, 2023
-
Add unit tests for test definition parser
Signed-off-by: Cuong Nguyen <[email protected]>
Commit cdebd35
-
Commit 30ae2e1
-
Signed-off-by: Cuong Nguyen <[email protected]>
Commit 32f69f8
-
Check that parse_test_definition throws exception on empty variations
Signed-off-by: Cuong Nguyen <[email protected]>
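One way to picture what a variations-aware parser does, including the empty-variations check named in this commit, is the sketch below. It is not the actual `parse_test_definition`: the function names, the `suffix` key, and the use of `ValueError` are assumptions for illustration.

```python
import copy
from typing import Any, Dict, List


def deep_update(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge override into base, in place."""
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            base[key] = value
    return base


def expand_variations(tests: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn each test with a 'variations' list into one test per variation."""
    expanded = []
    for test in tests:
        if "variations" not in test:
            expanded.append(test)
            continue
        if not test["variations"]:
            raise ValueError(f"Test {test['name']!r} has an empty variations list")
        for variation in test["variations"]:
            variant = copy.deepcopy(test)
            del variant["variations"]
            overrides = copy.deepcopy(variation)
            suffix = overrides.pop("suffix")  # hypothetical key naming the variation
            deep_update(variant, overrides)
            variant["name"] = f"{test['name']}.{suffix}"
            expanded.append(variant)
    return expanded
```

Each variation only lists the fields that differ (for example a different compute config per cloud), and the deep merge keeps everything else from the base definition.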
Commit c34cf96
-
Remove the constant test definition in test.
Signed-off-by: Cuong Nguyen <[email protected]>
Commit b49a078
Commits on Mar 30, 2023
-
[CI] Logic to create test variations in release test configs (#33920)
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of the CI/CD builds in our pipeline.
* Improve wheel commit validation error message
* Set up dependencies and credentials for GCE in Buildkite
* Add google-cloud-storage package to requirements
* Add new lines to some files
* Support for gs:// in anyscale job runner
* Correct adding gce tests
* Support test definition with multiple flavors
* Use not in to check key in dict
* Debugging
* Debugging
* Debugging 2
* Debugging 03
* Remove temporary logs
* -s
* Update flavors
* Only initialize gs client on gs host
* Lint
* Update image for Sematic integration (#33469)
* [RLlib] fix preprocessor test (#33719)
* [RLlib] APPO TF with RLModule and Learner API (#33310)
* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)
* [serve] Fix serve HA test (#33699)
* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)
* [tune] add data to CI test dependencies (#33729)
* [Test] Fix test event test timeout (#33704)
* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)
* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)
* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)
* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)
* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)
* [runtime env] Close schema after loading and continue on error (#33535)
* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)
* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)
* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)
* Deprecate RuntimeContext.get (#33734)
* [Serve] Fix the serve.batch api doc (#33588)
* [infra] increase Build timeout (#33756)
* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching.
Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736) Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. There was an option to make workflow remote task to use 0 CPU, but I think that doesn't really make sense (since user probably just writes regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before streaming executor was enabled (cc @jianoaix if you have good theory. ) * [Test] Fix out of disk error (#33732) Sometimes, there are more than 1 OOD event if test runs more than 10 seconds. I alleviated the assert condition in that case. * [Data] Repurpose streaming CI to bulk CI(#33478) Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now). * [Serve] Enable serve metrics lib working in ray actor (#33717) Make sure ray.serve.lib working with ray.actor without serve context. ``` @ray.remote class MyActor: def __init__(self): self.my_counter = metrics.Counter( "my_ray_actor", description=("The number of requests to this deployment."), tag_keys=("my_tag",), ) def test(self): self.my_counter.inc(tags={"my_tag": "value"}) return "hello" @serve.deployment(num_replicas=2) class Model: def __init__(self, model_name): self.my_actor = MyActor.remote() async def __call__(self, req: starlette.requests.Request): await self.my_actor.test.remote() return ``` * [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. 
(#33648) Signed-off-by: sven1977 <[email protected]> * [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761) If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified. --------- Signed-off-by: amogkam <[email protected]> * @aslonnie's comments Signed-off-by: Cuong Nguyen <[email protected]> * @aslonnie's comments Signed-off-by: Cuong Nguyen <[email protected]> * Run linter Signed-off-by: Cuong Nguyen <[email protected]> * Run linters Signed-off-by: Cuong Nguyen <[email protected]> * Change ray to 2.3.1 to work around the #ir-glorious-shape Signed-off-by: Cuong Nguyen <[email protected]> * Revert to normal ray image Signed-off-by: Cuong Nguyen <[email protected]> * Fix delete_fn Signed-off-by: Cuong Nguyen <[email protected]> * [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772) Add the ability for AnyscaleJobRunner to run on GCE host. The added logic: - Read from ENV variable, or the storage link, to see if this is a GCE host. If it is, has custom logic inside job file manager and runner. Both read, write and delete are supported. - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825 * Run lint Signed-off-by: Cuong Nguyen <[email protected]> * [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). 
(#33814) * Setup dependencies and crendential for GCE in buildkite Signed-off-by: Cuong Nguyen <[email protected]> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <[email protected]> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <[email protected]> * Correct adding gce tests Signed-off-by: Cuong Nguyen <[email protected]> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <[email protected]> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * @aslonnie's comments Signed-off-by: Cuong Nguyen <[email protected]> * Run linter Signed-off-by: Cuong Nguyen <[email protected]> * Run linters Signed-off-by: Cuong Nguyen <[email protected]> * -s * Fix some tests Signed-off-by: Cuong Nguyen <[email protected]> * Add unit tests for test definition parser Signed-off-by: Cuong Nguyen <[email protected]> * Fix lints Signed-off-by: Cuong Nguyen <[email protected]> * @aslonnie's comments Signed-off-by: Cuong Nguyen <[email protected]> * Check that parse_test_definition throws exception on empty variations Signed-off-by: Cuong Nguyen <[email protected]> * Remove the constant test definition in test. 
Signed-off-by: Cuong Nguyen <[email protected]> --------- Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: Artur Niederfahrenhorst <[email protected]> Signed-off-by: Avnish <[email protected]> Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: xwjiang2010 <[email protected]> Signed-off-by: sven1977 <[email protected]> Signed-off-by: amogkam <[email protected]> Signed-off-by: Cuong Nguyen <[email protected]> Co-authored-by: augray <[email protected]> Co-authored-by: Artur Niederfahrenhorst <[email protected]> Co-authored-by: Avnish Narayan <[email protected]> Co-authored-by: jiafu zhang <[email protected]> Co-authored-by: Qing Wang <[email protected]> Co-authored-by: Cindy Zhang <[email protected]> Co-authored-by: SangBin Cho <[email protected]> Co-authored-by: matthewdeng <[email protected]> Co-authored-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: Clark Zinzow <[email protected]> Co-authored-by: James Clark <[email protected]> Co-authored-by: Archit Kulkarni <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Co-authored-by: xwjiang2010 <[email protected]> Co-authored-by: Sihan Wang <[email protected]> Co-authored-by: clarng <[email protected]> Co-authored-by: Eric Liang <[email protected]> Co-authored-by: Jian Xiao <[email protected]> Co-authored-by: Sven Mika <[email protected]> Co-authored-by: Amog Kamsetty <[email protected]>
Commit db0be11
-
The cluster environment name does not allow the character '.', so fix that. Address Lonnie's comments and add more tests.
Signed-off-by: Cuong Nguyen <[email protected]>
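To illustrate the constraint this commit message describes, here is a minimal sketch of stripping '.' from a candidate cluster environment name. The function name `sanitize_cluster_env_name` and the choice of `-` as the replacement character are assumptions made for illustration; the commit message does not say how the actual fix rewrites the name.

```python
def sanitize_cluster_env_name(name: str, replacement: str = "-") -> str:
    """Return `name` with every '.' replaced (hypothetical helper).

    The replacement character is an assumption; the only requirement
    taken from the commit message is that the result contains no '.'.
    """
    if "." in replacement:
        raise ValueError("replacement must not contain '.'")
    return name.replace(".", replacement)


# Example with a made-up name that contains a dotted suffix.
print(sanitize_cluster_env_name("hello_world.gce"))
```

A plain `str.replace` is enough here because only a single character is disallowed; a broader allow-list would call for a regular expression instead.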
Commit c5c3c66
-
Merge branch 'can-gce-01-test-variations' into gce03
Signed-off-by: Cuong Nguyen <[email protected]>
Commit 5fd05d9
-
Remove import copy from test file
Signed-off-by: Cuong Nguyen <[email protected]>
Commit 58e72c3
Commits on Mar 31, 2023
-
Address @krfricke's comments. I keep the `__suffix__` as it is based on our conversation so far, but if anyone has a strong opinion, feel free to let me know @aslonnie
Signed-off-by: Cuong Nguyen <[email protected]>
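For readers unfamiliar with the `variations` attribute this PR adds, the following is a rough sketch of how one test definition might be expanded into one test per variation, with `__suffix__` naming each result. Only the `variations` and `__suffix__` keys and the "raise on empty variations" behaviour come from the commit messages on this page; the function name `expand_variations`, the `.` separator in the generated name, and the one-level dictionary merge are assumptions, not the PR's actual implementation.

```python
import copy
from typing import Any, Dict, List


def expand_variations(definitions: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Expand each definition into one test per variation (hypothetical sketch)."""
    expanded = []
    for definition in definitions:
        if "variations" not in definition:
            expanded.append(copy.deepcopy(definition))
            continue
        variations = definition["variations"]
        if not variations:
            raise ValueError(f"Test {definition['name']!r} has empty variations")
        base = {k: v for k, v in definition.items() if k != "variations"}
        for variation in variations:
            if "__suffix__" not in variation:
                raise ValueError("Each variation needs a __suffix__")
            test = copy.deepcopy(base)
            # Assumed naming scheme: <base name>.<suffix>
            test["name"] = f"{base['name']}.{variation['__suffix__']}"
            for key, value in variation.items():
                if key == "__suffix__":
                    continue
                if isinstance(value, dict) and isinstance(test.get(key), dict):
                    # Assumed behaviour: merge one level deep, variation wins.
                    test[key].update(copy.deepcopy(value))
                else:
                    test[key] = copy.deepcopy(value)
            expanded.append(test)
    return expanded


tests = [
    {
        "name": "hello_world",
        "cluster": {"cluster_env": "env.yaml", "cluster_compute": "aws.yaml"},
        "variations": [
            {"__suffix__": "aws"},
            {"__suffix__": "gce", "cluster": {"cluster_compute": "gce.yaml"}},
        ],
    }
]
for test in expand_variations(tests):
    print(test["name"], test["cluster"])
```

Under this sketch, each variation only states what differs from the base definition, so a second cluster environment can be covered without duplicating the whole test entry.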
Commit b123bd2
-
Fix lint. Thanks @krfricke for catching!
Signed-off-by: Cuong Nguyen <[email protected]>
Commit 690dd26