[RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. #33648

sven1977 · 2023-03-23T22:12:13Z

Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones.

If a (multi-agent) eval worker has a policy_mapping_fn (or is_policy_to_train fn) configured differently from the main workers and then fails, it would be reinstated (in fault-tolerant mode) with the train workers' policy_mapping_fn and policy_to_train functions. The root cause of this behavior was that we use the local worker's state (which includes policy_mapping_fn AND the policy_to_train fn) to sync to the newly re-created eval worker.
Note that the local worker is always from the "main" worker set used for collecting training data. The eval worker set does NOT have a local worker.
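
The failure mode can be illustrated with a toy model. The sketch below does not use RLlib: `ToyWorker`, `train_mapping_fn`, and `eval_mapping_fn` are hypothetical stand-ins that only mimic the fact that a worker's state carries the mapping function.

```python
from typing import Any, Callable, Dict


class ToyWorker:
    """Hypothetical stand-in for a rollout worker (not the RLlib class)."""

    def __init__(self, policy_mapping_fn: Callable[[str], str]) -> None:
        self.policy_mapping_fn = policy_mapping_fn

    def get_state(self) -> Dict[str, Any]:
        # The state includes the mapping fn, as described above.
        return {"policy_mapping_fn": self.policy_mapping_fn}

    def set_state(self, state: Dict[str, Any]) -> None:
        self.policy_mapping_fn = state["policy_mapping_fn"]


def train_mapping_fn(agent_id: str) -> str:
    return "learned_policy"


def eval_mapping_fn(agent_id: str) -> str:
    return "fixed_opponent"


local_train_worker = ToyWorker(train_mapping_fn)
# A re-created eval worker starts out with the eval config's mapping fn ...
recovered_eval_worker = ToyWorker(eval_mapping_fn)
# ... but syncing the local (train) worker's state overwrites it.
recovered_eval_worker.set_state(local_train_worker.get_state())

print(recovered_eval_worker.policy_mapping_fn("agent_0"))
```

After the sync, the recovered eval worker maps agents with the train function, which is the behavior this PR sets out to fix.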


Signed-off-by: sven1977 <[email protected]>

…eval_worker_is_recovered_with_main_config_not_eval_config

Signed-off-by: sven1977 <[email protected]>

sven1977 · 2023-03-23T23:07:46Z

rllib/utils/actor_manager.py

@@ -341,7 +341,7 @@ def num_actors(self) -> int:
    @DeveloperAPI
    def num_healthy_actors(self) -> int:
        """Return the number of healthy remote actors."""
-        return sum([s.is_healthy for s in self.__remote_actor_states.values()])
+        return sum(s.is_healthy for s in self.__remote_actor_states.values())


My LINTer complained :)
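
For context, a small sketch of why the two forms are interchangeable here: `sum()` accepts any iterable, and booleans count as 0 or 1. `ActorState` and the `states` dict below are hypothetical stand-ins for the manager's internal records.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ActorState:
    """Hypothetical stand-in for the per-actor state record."""

    is_healthy: bool


states: Dict[int, ActorState] = {
    1: ActorState(is_healthy=True),
    2: ActorState(is_healthy=False),
    3: ActorState(is_healthy=True),
}

# List comprehension: builds a temporary list before summing.
count_list = sum([s.is_healthy for s in states.values()])
# Generator expression: same result, no intermediate list.
count_gen = sum(s.is_healthy for s in states.values())

print(count_list, count_gen)
```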

sven1977 · 2023-03-23T23:07:52Z

rllib/utils/actor_manager.py

@@ -785,10 +785,12 @@ def probe_unhealthy_actors(
    ) -> List[int]:
        """Ping all unhealthy actors to try bringing them back.

-        Returns:
-            A list of actor ids that are restored.
+        Args:


sven1977 · 2023-03-23T23:08:05Z

rllib/algorithms/tests/test_worker_failures.py

-            self.assertEqual(a.workers.num_healthy_remote_workers(), 1)
-            self.assertEqual(a.evaluation_workers.num_healthy_remote_workers(), 1)
-
+            for _ in range(2):


Why do we have to do this for two iterations now?

We did before, too. Just that the code was completely written out (copied), not as a for loop.

sven1977 · 2023-03-23T23:08:18Z

rllib/algorithms/algorithm.py

-    def validate_config(self, config) -> None:
-        # TODO: Deprecate. All logic has been moved into the AlgorithmConfig classes.
+    @Deprecated(new="AlgorithmConfig.validate()", error=False)
+    def validate_config(self, config):


sven1977 · 2023-03-23T23:08:35Z

rllib/algorithms/algorithm.py

-            local_worker=local_worker,
-            logdir=self.logdir,
-        )
+    @Deprecated(new="construct WorkerSet(...) instance directly", error=True)


Not used anywhere in the codebase anymore. Let's take it out asap.
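
The two hunks above use a soft (`error=False`) and a hard (`error=True`) deprecation. The following is a minimal sketch of that pattern, not RLlib's `Deprecated` implementation: the `deprecated` decorator, the assumed warn-versus-raise semantics, and the function `old_factory` are all hypothetical.

```python
import functools
import warnings
from typing import Any, Callable


def deprecated(new: str, error: bool) -> Callable:
    """Hypothetical decorator; assumed semantics, not RLlib's code."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            msg = f"`{fn.__name__}` is deprecated. Use `{new}` instead."
            if error:
                # Hard deprecation: calling the function is no longer allowed.
                raise ValueError(msg)
            # Soft deprecation: warn, then still run the old code path.
            warnings.warn(msg, DeprecationWarning, stacklevel=2)
            return fn(*args, **kwargs)

        return wrapper

    return decorator


@deprecated(new="AlgorithmConfig.validate()", error=False)
def validate_config(config: dict) -> None:
    return None


@deprecated(new="construct WorkerSet(...) instance directly", error=True)
def old_factory() -> None:  # hypothetical name for the removed helper
    return None
```

With these assumed semantics, `validate_config({})` still runs but emits a warning, while `old_factory()` raises immediately.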

sven1977 · 2023-03-23T23:09:39Z

rllib/algorithms/algorithm.py

@@ -1316,12 +1316,27 @@ def restore_workers(self, workers: WorkerSet):
        restored = workers.probe_unhealthy_workers()

        if restored:
-            from_worker = workers.local_worker() or self.workers.local_worker()
-            state = ray.put(from_worker.get_state())
+            # Figure out whether we are restoring a worker from the eval worker set.


Main logic in this PR:

Figure out whether workers is the eval worker set.

If yes, we need to adjust the state of the local worker to contain the correct (eval) policy_mapping_fn and policy_to_train fn.
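
A rough sketch of that logic, as first proposed in this PR (overriding the two entries with the eval config's values). `adjusted_state` is a hypothetical helper, and how `is_eval_worker_set` is computed is not shown in the hunk, so it is simply passed in as a flag here.

```python
from typing import Any, Callable, Dict, Optional


def adjusted_state(
    local_state: Dict[str, Any],
    is_eval_worker_set: bool,
    eval_policy_mapping_fn: Callable[[str], str],
    eval_is_policy_to_train: Optional[Callable[[str], bool]],
) -> Dict[str, Any]:
    """Return the state to push to restored workers (hypothetical helper)."""
    state = dict(local_state)  # shallow copy; the local state stays untouched
    if is_eval_worker_set:
        # Replace the train-side functions with the eval config's ones.
        state["policy_mapping_fn"] = eval_policy_mapping_fn
        state["is_policy_to_train"] = eval_is_policy_to_train
    return state


def train_fn(agent_id: str) -> str:
    return "learned_policy"


def eval_fn(agent_id: str) -> str:
    return "fixed_opponent"


local_state = {"policy_mapping_fn": train_fn, "is_policy_to_train": None}
eval_state = adjusted_state(local_state, True, eval_fn, None)
train_state = adjusted_state(local_state, False, eval_fn, None)
```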

kouroshHakha

1/2 questions. Otherwise looks great.

kouroshHakha · 2023-03-24T00:27:26Z

rllib/algorithms/algorithm.py

+            # For the evaluation set, we need to adjust the `policy_mapping_fn` and
+            # `is_policy_to_train` fn from the original evaluation config.
+            if is_eval_worker_set:
+                state["policy_mapping_fn"] = self.evaluation_config.policy_mapping_fn


Are these the only two things we have to restore? I feel like we have to do it for a lot more entries? Why not just merge the two configs?

Good question. We are not transferring/updating items from the config into the state.
The config - in the case of RolloutWorkers - is NOT part of the state at all (we might want to change that, but that's a different discussion). Instead, the state of a Worker consists of:

- policy IDs -> list of all IDs
- Mapping from policy ID -> Policy's state
- policy_mapping_fn
- is_policy_to_train fn
- all filters (filters are stateful)

Note that the RolloutWorker - at the time we are doing this re-syncing - has already been constructed, so even if we changed its config here, it would have no effect on the already running worker.

Enhanced it to del those two keys from state. This way, the set_state method will NOT touch these two functions, and they should keep their original [Algorithm].evaluation_config-based values.
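
A minimal sketch of the deletion approach, under the assumption that `set_state` only overwrites entries whose keys are present. `ToyWorker` and its `weights` field are hypothetical and not RLlib code.

```python
from typing import Any, Callable, Dict


class ToyWorker:
    """Hypothetical worker whose set_state() skips keys that are absent."""

    def __init__(self, policy_mapping_fn: Callable[[str], str]) -> None:
        self.policy_mapping_fn = policy_mapping_fn
        self.weights: Dict[str, float] = {}

    def set_state(self, state: Dict[str, Any]) -> None:
        self.weights = state["weights"]
        # Assumed behavior: only overwrite the fn if the key is present.
        if "policy_mapping_fn" in state:
            self.policy_mapping_fn = state["policy_mapping_fn"]


def train_fn(agent_id: str) -> str:
    return "learned_policy"


def eval_fn(agent_id: str) -> str:
    return "fixed_opponent"


# State taken from the local (train) worker.
state = {"weights": {"w": 1.0}, "policy_mapping_fn": train_fn}

is_eval_worker_set = True
if is_eval_worker_set:
    # Drop the key so the eval worker keeps the fn it was constructed with.
    del state["policy_mapping_fn"]

eval_worker = ToyWorker(eval_fn)  # re-created from the eval config
eval_worker.set_state(state)
```

The restored worker still receives the synced weights, but its mapping function is the one it was built with.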

kouroshHakha · 2023-03-24T00:31:13Z

rllib/algorithms/tests/test_worker_failures.py

-            self.assertEqual(a.workers.num_healthy_remote_workers(), 1)
-            self.assertEqual(a.evaluation_workers.num_healthy_remote_workers(), 1)
-
+            for _ in range(2):


Why do we have to do this for two iterations now?

Signed-off-by: sven1977 <[email protected]>

gjoliver · 2023-03-24T18:27:18Z

rllib/algorithms/algorithm.py

+            # `is_policy_to_train` fn from the original evaluation config.
+            if is_eval_worker_set:
+                del state["policy_mapping_fn"]
+                del state["is_policy_to_train"]


what if we simply always del these 2 fields? you then don't have to guess whether we are dealing with an eval worker or not.
given how worker actors are restored, they should have proper mapping_fn and policy_to_train upon recovery?

That would break the case where you have already changed the policy_mapping_fn a couple of times (it's different now than the original one from your config; which is why these are part of the state to begin with) and are now trying to re-start a crashed remote worker and make sure it uses the correct mapping fn.
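
To make that case concrete, here is a toy sketch (hypothetical names throughout, not RLlib code) in which the train-side mapping fn was swapped at runtime and a crashed train worker is re-created from the original config.

```python
from typing import Any, Callable, Dict


class ToyWorker:
    """Hypothetical worker; set_state() skips keys that are absent."""

    def __init__(self, policy_mapping_fn: Callable[[str], str]) -> None:
        self.policy_mapping_fn = policy_mapping_fn

    def set_state(self, state: Dict[str, Any]) -> None:
        if "policy_mapping_fn" in state:
            self.policy_mapping_fn = state["policy_mapping_fn"]


def original_fn(agent_id: str) -> str:
    return "policy_v1"  # value from the original config


def updated_fn(agent_id: str) -> str:
    return "policy_v2"  # value set later, at runtime


# The local train worker already uses the updated fn.
local_state = {"policy_mapping_fn": updated_fn}


def restore(state: Dict[str, Any], always_delete: bool) -> ToyWorker:
    worker = ToyWorker(original_fn)  # re-created from the original config
    state = dict(state)
    if always_delete:
        state.pop("policy_mapping_fn", None)
    worker.set_state(state)
    return worker


stale_worker = restore(local_state, always_delete=True)
synced_worker = restore(local_state, always_delete=False)
```

With unconditional deletion, the restarted train worker falls back to `policy_v1`; keeping the key for train workers lets it pick up `policy_v2`.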

Signed-off-by: sven1977 <[email protected]>

…eval_worker_is_recovered_with_main_config_not_eval_config

Signed-off-by: sven1977 <[email protected]>

…eval_worker_is_recovered_with_main_config_not_eval_config

Signed-off-by: sven1977 <[email protected]>

kouroshHakha

Looks good. Thanks @sven1977 Let's merge conditioned on tests passing.

gjoliver

thanks for the fix.

@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline. The existing code uses pipe to read from a rather large file (>50MB). Pipe however has buffer limit which by default in term of kb (https://man7.org/linux/man-pages/man7/pipe.7.html) so what we look for might not exist. We can fix this by tell unzip the exact file we are looking for. That file is pretty small so we should not hit buffer limit. You might notice other surpises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check goes back to 2 years ago by our veteran Kai (234b015) to sanity check issues with stale artifacts from previous builds or race conditions between builds. Further investigation on how builkite agent multi-tenant is setup might or might not simplify this logic further. Signed-off-by: Cuong Nguyen <[email protected]> * Improve wheel commit validation error message Signed-off-by: Cuong Nguyen <[email protected]> * Setup dependencies and crendential for GCE in buildkite Signed-off-by: Cuong Nguyen <[email protected]> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <[email protected]> * Add new lines to some files Signed-off-by: Cuong Nguyen <[email protected]> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <[email protected]> * Correct adding gce tests Signed-off-by: Cuong Nguyen <[email protected]> * Support test definition with multiple flavors Signed-off-by: Cuong Nguyen <[email protected]> * Use not in to check key in dict Signed-off-by: Cuong Nguyen <[email protected]> * Debugging Signed-off-by: Cuong Nguyen <[email protected]> * Debugging Signed-off-by: Cuong Nguyen <[email protected]> * Debugging 2 Signed-off-by: Cuong Nguyen <[email protected]> * Debugging 03 Signed-off-by: Cuong Nguyen <[email protected]> * Remove temoprary logs Signed-off-by: Cuong Nguyen <[email protected]> * -s * Update flavors Signed-off-by: 
Cuong Nguyen <[email protected]> * Only initialize gs client on gs host Signed-off-by: Cuong Nguyen <[email protected]> * Lint Signed-off-by: Cuong Nguyen <[email protected]> * Update image for Sematic integration (#33469) * [RLlib] fix preprocessor test (#33719) Signed-off-by: Artur Niederfahrenhorst <[email protected]> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <[email protected]> * [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665) To make Java logging consistent with PR #31772 which seems for lazy worker binding. Otherwise, we may print too many logs from different drivers in shell console. Co-authored-by: Qing Wang <[email protected]> * [serve] Fix serve HA test (#33699) #33597 changed the log statements for adding a replica to a deployment. The assert statement in test_ray_server_basic checks for the exact log statement - we need to update that assert statement. * Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731) <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png"> This reverts commit cb5bb0e.   ## Why are these changes needed?  ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. 
Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( * [tune] add data to CI test dependencies (#33729) 1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `:octopus: Tune tests and examples (medium)"`. 2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). 3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). **Note:** There should probably be a better way for handling dependencies in CI tests... * [Test] Fix test event test timeout (#33704) * [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests becauase of LSTMs (#33728) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747) This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code. * [runtime env] Close schema after loading and continue on error (#33535) This PR fixes a few things: * A warning from not closing the file opened with `open()`. 
(We have these warnings as errors and Ray was causing some integration tests to blink) * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist. * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed. **Steps to Reproduce** 1. Save this script as `test.py` ```python import ray @ray.remote(runtime_env={"env_vars": {}}) def my_fn(): return True ray.init() print(ray.get(my_fn.remote())) ``` 2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py` 3. a. save `:` or other invalid JSON as `bad-json.json` b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py` This PR fixes the issue and adds a new test case. Signed-off-by: James Clark <[email protected]> * [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259) In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) check if the info for already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name. This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). Also adds a unit test which fails without this change. 
* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733) Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message. Signed-off-by: Jiajun Yao <[email protected]> * Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740) Additionally fix `test_usage_test.py`. Signed-off-by: xwjiang2010 <[email protected]> * Deprecate RuntimeContext.get (#33734) RuntimeContext.get exposes Cython ids instead of strings so we should deprecate it and in favor of get_xxx_id() methods. Signed-off-by: Jiajun Yao <[email protected]> * [Serve] Fix the serve.batch api doc (#33588) Fix the example formatting in the serve batch API doc * [infra] increase Build timeout (#33756) Why are these changes needed? release test failing due to timeout when building the cluster env. Currently timeout is 30 minutes, but the build could take longer, e.g https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2 * [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736) Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. 
There was an option to make workflow remote task to use 0 CPU, but I think that doesn't really make sense (since user probably just writes regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before streaming executor was enabled (cc @jianoaix if you have good theory. ) * [Test] Fix out of disk error (#33732) Sometimes, there are more than 1 OOD event if test runs more than 10 seconds. I alleviated the assert condition in that case. * [Data] Repurpose streaming CI to bulk CI(#33478) Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now). * [Serve] Enable serve metrics lib working in ray actor (#33717) Make sure ray.serve.lib working with ray.actor without serve context. ``` @ray.remote class MyActor: def __init__(self): self.my_counter = metrics.Counter( "my_ray_actor", description=("The number of requests to this deployment."), tag_keys=("my_tag",), ) def test(self): self.my_counter.inc(tags={"my_tag": "value"}) return "hello" @serve.deployment(num_replicas=2) class Model: def __init__(self, model_name): self.my_actor = MyActor.remote() async def __call__(self, req: starlette.requests.Request): await self.my_actor.test.remote() return ``` * [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648) Signed-off-by: sven1977 <[email protected]> * [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761) If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified. 
--------- Signed-off-by: amogkam <[email protected]> * @aslonnie's comments Signed-off-by: Cuong Nguyen <[email protected]> * @aslonnie's comments Signed-off-by: Cuong Nguyen <[email protected]> * Run linter Signed-off-by: Cuong Nguyen <[email protected]> * Run linters Signed-off-by: Cuong Nguyen <[email protected]> * Change ray to 2.3.1 to work around the #ir-glorious-shape Signed-off-by: Cuong Nguyen <[email protected]> * Revert to normal ray image Signed-off-by: Cuong Nguyen <[email protected]> * Fix delete_fn Signed-off-by: Cuong Nguyen <[email protected]> * [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772) Add the ability for AnyscaleJobRunner to run on GCE host. The added logic: - Read from ENV variable, or the storage link, to see if this is a GCE host. If it is, has custom logic inside job file manager and runner. Both read, write and delete are supported. - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825 * Run lint Signed-off-by: Cuong Nguyen <[email protected]> * [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). 
(#33814) * Setup dependencies and crendential for GCE in buildkite Signed-off-by: Cuong Nguyen <[email protected]> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <[email protected]> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <[email protected]> * Correct adding gce tests Signed-off-by: Cuong Nguyen <[email protected]> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <[email protected]> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * @aslonnie's comments Signed-off-by: Cuong Nguyen <[email protected]> * Run linter Signed-off-by: Cuong Nguyen <[email protected]> * Run linters Signed-off-by: Cuong Nguyen <[email protected]> * -s * Fix some tests Signed-off-by: Cuong Nguyen <[email protected]> * Add unit tests for test definition parser Signed-off-by: Cuong Nguyen <[email protected]> * Fix lints Signed-off-by: Cuong Nguyen <[email protected]> * @aslonnie's comments Signed-off-by: Cuong Nguyen <[email protected]> * Check that parse_test_definition throws exception on empty variations Signed-off-by: Cuong Nguyen <[email protected]> * Remove the constant test definition in test. 
Signed-off-by: Cuong Nguyen <[email protected]> --------- Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: Artur Niederfahrenhorst <[email protected]> Signed-off-by: Avnish <[email protected]> Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: xwjiang2010 <[email protected]> Signed-off-by: sven1977 <[email protected]> Signed-off-by: amogkam <[email protected]> Signed-off-by: Cuong Nguyen <[email protected]> Co-authored-by: augray <[email protected]> Co-authored-by: Artur Niederfahrenhorst <[email protected]> Co-authored-by: Avnish Narayan <[email protected]> Co-authored-by: jiafu zhang <[email protected]> Co-authored-by: Qing Wang <[email protected]> Co-authored-by: Cindy Zhang <[email protected]> Co-authored-by: SangBin Cho <[email protected]> Co-authored-by: matthewdeng <[email protected]> Co-authored-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: Clark Zinzow <[email protected]> Co-authored-by: James Clark <[email protected]> Co-authored-by: Archit Kulkarni <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Co-authored-by: xwjiang2010 <[email protected]> Co-authored-by: Sihan Wang <[email protected]> Co-authored-by: clarng <[email protected]> Co-authored-by: Eric Liang <[email protected]> Co-authored-by: Jian Xiao <[email protected]> Co-authored-by: Sven Mika <[email protected]> Co-authored-by: Amog Kamsetty <[email protected]>

@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline. The existing code uses pipe to read from a rather large file (>50MB). Pipe however has buffer limit which by default in term of kb (https://man7.org/linux/man-pages/man7/pipe.7.html) so what we look for might not exist. We can fix this by tell unzip the exact file we are looking for. That file is pretty small so we should not hit buffer limit. You might notice other surpises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check goes back to 2 years ago by our veteran Kai (234b015) to sanity check issues with stale artifacts from previous builds or race conditions between builds. Further investigation on how builkite agent multi-tenant is setup might or might not simplify this logic further. Signed-off-by: Cuong Nguyen <[email protected]> * Improve wheel commit validation error message Signed-off-by: Cuong Nguyen <[email protected]> * Setup dependencies and crendential for GCE in buildkite Signed-off-by: Cuong Nguyen <[email protected]> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <[email protected]> * Add new lines to some files Signed-off-by: Cuong Nguyen <[email protected]> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <[email protected]> * Correct adding gce tests Signed-off-by: Cuong Nguyen <[email protected]> * Support test definition with multiple flavors Signed-off-by: Cuong Nguyen <[email protected]> * Use not in to check key in dict Signed-off-by: Cuong Nguyen <[email protected]> * Debugging Signed-off-by: Cuong Nguyen <[email protected]> * Debugging Signed-off-by: Cuong Nguyen <[email protected]> * Debugging 2 Signed-off-by: Cuong Nguyen <[email protected]> * Debugging 03 Signed-off-by: Cuong Nguyen <[email protected]> * Remove temoprary logs Signed-off-by: Cuong Nguyen <[email protected]> * -s * Update flavors Signed-off-by: 
Cuong Nguyen <[email protected]> * Only initialize gs client on gs host Signed-off-by: Cuong Nguyen <[email protected]> * Lint Signed-off-by: Cuong Nguyen <[email protected]> * Update image for Sematic integration (#33469) * [RLlib] fix preprocessor test (#33719) Signed-off-by: Artur Niederfahrenhorst <[email protected]> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <[email protected]> * [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665) To make Java logging consistent with PR #31772 which seems for lazy worker binding. Otherwise, we may print too many logs from different drivers in shell console. Co-authored-by: Qing Wang <[email protected]> * [serve] Fix serve HA test (#33699) #33597 changed the log statements for adding a replica to a deployment. The assert statement in test_ray_server_basic checks for the exact log statement - we need to update that assert statement. * Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731) <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png"> This reverts commit cb5bb0e.   ## Why are these changes needed?  ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. 
Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( * [tune] add data to CI test dependencies (#33729) 1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `:octopus: Tune tests and examples (medium)"`. 2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). 3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). **Note:** There should probably be a better way for handling dependencies in CI tests... * [Test] Fix test event test timeout (#33704) * [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests becauase of LSTMs (#33728) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747) This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code. * [runtime env] Close schema after loading and continue on error (#33535) This PR fixes a few things: * A warning from not closing the file opened with `open()`. 
(We have these warnings as errors and Ray was causing some integration tests to blink) * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist. * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed. **Steps to Reproduce** 1. Save this script as `test.py` ```python import ray @ray.remote(runtime_env={"env_vars": {}}) def my_fn(): return True ray.init() print(ray.get(my_fn.remote())) ``` 2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py` 3. a. save `:` or other invalid JSON as `bad-json.json` b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py` This PR fixes the issue and adds a new test case. Signed-off-by: James Clark <[email protected]> * [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259) In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) check if the info for already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name. This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). Also adds a unit test which fails without this change. 
* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733) Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message. Signed-off-by: Jiajun Yao <[email protected]> * Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740) Additionally fix `test_usage_test.py`. Signed-off-by: xwjiang2010 <[email protected]> * Deprecate RuntimeContext.get (#33734) RuntimeContext.get exposes Cython ids instead of strings so we should deprecate it and in favor of get_xxx_id() methods. Signed-off-by: Jiajun Yao <[email protected]> * [Serve] Fix the serve.batch api doc (#33588) Fix the example formatting in the serve batch API doc * [infra] increase Build timeout (#33756) Why are these changes needed? release test failing due to timeout when building the cluster env. Currently timeout is 30 minutes, but the build could take longer, e.g https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2 * [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736) Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. 
  There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).
* [Test] Fix out of disk error (#33732)
  Sometimes there is more than 1 OOD event if the test runs for more than 10 seconds. I relaxed the assert condition in that case.
* [Data] Repurpose streaming CI to bulk CI(#33478)
  Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).
* [Serve] Enable serve metrics lib working in ray actor (#33717)
  Make sure ray.serve.lib works with ray.actor without a serve context.
  ```python
  import ray
  import starlette.requests
  from ray import serve
  from ray.serve import metrics


  @ray.remote
  class MyActor:
      def __init__(self):
          # Serve metric created inside a plain Ray actor (no Serve context).
          self.my_counter = metrics.Counter(
              "my_ray_actor",
              description=("The number of requests to this deployment."),
              tag_keys=("my_tag",),
          )

      def test(self):
          self.my_counter.inc(tags={"my_tag": "value"})
          return "hello"


  @serve.deployment(num_replicas=2)
  class Model:
      def __init__(self, model_name):
          self.my_actor = MyActor.remote()

      async def __call__(self, req: starlette.requests.Request):
          await self.my_actor.test.remote()
          return
  ```
* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)
  Signed-off-by: sven1977 <[email protected]>
* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)
  If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.
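The `collate_fn` rule above amounts to a single branch in the batch iterator. The sketch below shows only that control flow under stated assumptions: `iter_batches` and `move_to_device` are hypothetical stand-ins (no torch involved, and this is not the Ray Data implementation), and a "device transfer" is simulated by tagging the batch with a device name.

```python
def move_to_device(batch, device):
    """Stand-in for a real device transfer: tags the batch with its device."""
    return (device, batch)


def iter_batches(raw_batches, collate_fn=None, device="cpu"):
    """Yield batches; only the default path performs the automatic transfer."""
    for batch in raw_batches:
        if collate_fn is not None:
            # A custom collate_fn owns placement: its output is yielded untouched.
            yield collate_fn(batch)
        else:
            # Default path: convert the batch and move it to the target device.
            yield move_to_device(list(batch), device)
```

A caller who passes `collate_fn` and still wants batches on a specific device would do the transfer inside that function.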
--------- Signed-off-by: amogkam <[email protected]> * @aslonnie's comments Signed-off-by: Cuong Nguyen <[email protected]> * @aslonnie's comments Signed-off-by: Cuong Nguyen <[email protected]> * Run linter Signed-off-by: Cuong Nguyen <[email protected]> * Run linters Signed-off-by: Cuong Nguyen <[email protected]> * Change ray to 2.3.1 to work around the #ir-glorious-shape Signed-off-by: Cuong Nguyen <[email protected]> * Revert to normal ray image Signed-off-by: Cuong Nguyen <[email protected]> * Fix delete_fn Signed-off-by: Cuong Nguyen <[email protected]> * [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772) Add the ability for AnyscaleJobRunner to run on GCE host. The added logic: - Read from ENV variable, or the storage link, to see if this is a GCE host. If it is, has custom logic inside job file manager and runner. Both read, write and delete are supported. - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825 * Run lint Signed-off-by: Cuong Nguyen <[email protected]> * [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). 
(#33814) * Setup dependencies and crendential for GCE in buildkite Signed-off-by: Cuong Nguyen <[email protected]> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <[email protected]> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <[email protected]> * Correct adding gce tests Signed-off-by: Cuong Nguyen <[email protected]> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <[email protected]> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * @aslonnie's comments Signed-off-by: Cuong Nguyen <[email protected]> * Run linter Signed-off-by: Cuong Nguyen <[email protected]> * Run linters Signed-off-by: Cuong Nguyen <[email protected]> * -s * Fix some tests Signed-off-by: Cuong Nguyen <[email protected]> * Add unit tests for test definition parser Signed-off-by: Cuong Nguyen <[email protected]> * Fix lints Signed-off-by: Cuong Nguyen <[email protected]> * @aslonnie's comments Signed-off-by: Cuong Nguyen <[email protected]> * Check that parse_test_definition throws exception on empty variations Signed-off-by: Cuong Nguyen <[email protected]> * Remove the constant test definition in test. Signed-off-by: Cuong Nguyen <[email protected]> * The cluster environment name does not allow the character '.', so fix that. Address Lonnie's comments and add more tests. 
Signed-off-by: Cuong Nguyen <[email protected]> --------- Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: Artur Niederfahrenhorst <[email protected]> Signed-off-by: Avnish <[email protected]> Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: xwjiang2010 <[email protected]> Signed-off-by: sven1977 <[email protected]> Signed-off-by: amogkam <[email protected]> Signed-off-by: Cuong Nguyen <[email protected]> Co-authored-by: augray <[email protected]> Co-authored-by: Artur Niederfahrenhorst <[email protected]> Co-authored-by: Avnish Narayan <[email protected]> Co-authored-by: jiafu zhang <[email protected]> Co-authored-by: Qing Wang <[email protected]> Co-authored-by: Cindy Zhang <[email protected]> Co-authored-by: SangBin Cho <[email protected]> Co-authored-by: matthewdeng <[email protected]> Co-authored-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: Clark Zinzow <[email protected]> Co-authored-by: James Clark <[email protected]> Co-authored-by: Archit Kulkarni <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Co-authored-by: xwjiang2010 <[email protected]> Co-authored-by: Sihan Wang <[email protected]> Co-authored-by: clarng <[email protected]> Co-authored-by: Eric Liang <[email protected]> Co-authored-by: Jian Xiao <[email protected]> Co-authored-by: Sven Mika <[email protected]> Co-authored-by: Amog Kamsetty <[email protected]>

@jianoaix


…pping_fn and policy_to_train fn, not the main train workers' ones. (ray-project#33648) Signed-off-by: sven1977 <[email protected]> Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of the CI/CD builds in our pipeline.
  The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that by default is measured in kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not exist. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit. Other surprises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent's multi-tenant setup works might or might not simplify this logic further.
  Signed-off-by: Cuong Nguyen <[email protected]>
* Improve wheel commit validation error message Signed-off-by: Cuong Nguyen <[email protected]>
* Setup dependencies and crendential for GCE in buildkite Signed-off-by: Cuong Nguyen <[email protected]>
* Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <[email protected]>
* Add new lines to some files Signed-off-by: Cuong Nguyen <[email protected]>
* Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <[email protected]>
* Correct adding gce tests Signed-off-by: Cuong Nguyen <[email protected]>
* Support test definition with multiple flavors Signed-off-by: Cuong Nguyen <[email protected]>
* Use not in to check key in dict Signed-off-by: Cuong Nguyen <[email protected]>
* Debugging Signed-off-by: Cuong Nguyen <[email protected]>
* Debugging Signed-off-by: Cuong Nguyen <[email protected]>
* Debugging 2 Signed-off-by: Cuong Nguyen <[email protected]>
* Debugging 03 Signed-off-by: Cuong Nguyen <[email protected]>
* Remove temoprary logs Signed-off-by: Cuong Nguyen <[email protected]>
* -s
* Update flavors Signed-off-by: 
Cuong Nguyen <[email protected]> * Only initialize gs client on gs host Signed-off-by: Cuong Nguyen <[email protected]> * Lint Signed-off-by: Cuong Nguyen <[email protected]> * Update image for Sematic integration (#33469) * [RLlib] fix preprocessor test (#33719) Signed-off-by: Artur Niederfahrenhorst <[email protected]> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <[email protected]> * [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665) To make Java logging consistent with PR #31772 which seems for lazy worker binding. Otherwise, we may print too many logs from different drivers in shell console. Co-authored-by: Qing Wang <[email protected]> * [serve] Fix serve HA test (#33699) * Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731) <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png"> This reverts commit cb5bb0e.     - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( * [tune] add data to CI test dependencies (#33729) 1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `:octopus: Tune tests and examples (medium)"`. 2. 
#33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). 3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). **Note:** There should probably be a better way for handling dependencies in CI tests... * [Test] Fix test event test timeout (#33704) * [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests becauase of LSTMs (#33728) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747) This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code. * [runtime env] Close schema after loading and continue on error (#33535) This PR fixes a few things: * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink) * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist. * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed. **Steps to Reproduce** 1. 
Save this script as `test.py`:

```python
import ray


@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True


ray.init()
print(ray.get(my_fn.remote()))
```

2. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3. a. Save `:` or other invalid JSON as `bad-json.json`
   b. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.

Signed-off-by: James Clark <[email protected]>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

  In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) checked whether the info already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see that the info doesn't exist yet, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

  This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int, which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). It also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

  Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries on REDIS_REPLY_ERROR, which is 6, and also prints out the error message.

  Signed-off-by: Jiajun Yao <[email protected]>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

  Additionally fix `test_usage_test.py`.

  Signed-off-by: xwjiang2010 <[email protected]>

* Deprecate RuntimeContext.get (#33734)

  RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

  Signed-off-by: Jiajun Yao <[email protected]>

* [Serve] Fix the serve.batch api doc (#33588)

  Fix the example formatting in the serve batch API doc.

* [infra] increase Build timeout (#33756)

  Why are these changes needed? A release test is failing due to a timeout when building the cluster env. Currently the timeout is 30 minutes, but the build could take longer, e.g. https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

  It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736)

  It looks like the workflow starts a 1-CPU cluster and has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

  Sometimes there is more than 1 OOD event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.

* [Data] Repurpose streaming CI to bulk CI (#33478)

  Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

  Make sure ray.serve.lib works with ray.actor without serve context.

  ```
  @ray.remote
  class MyActor:
      def __init__(self):
          self.my_counter = metrics.Counter(
              "my_ray_actor",
              description=("The number of requests to this deployment."),
              tag_keys=("my_tag",),
          )

      def test(self):
          self.my_counter.inc(tags={"my_tag": "value"})
          return "hello"


  @serve.deployment(num_replicas=2)
  class Model:
      def __init__(self, model_name):
          self.my_actor = MyActor.remote()

      async def __call__(self, req: starlette.requests.Request):
          await self.my_actor.test.remote()
          return
  ```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

  Signed-off-by: sven1977 <[email protected]>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

  If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

  ---------

  Signed-off-by: amogkam <[email protected]>

* @aslonnie's comments

  Signed-off-by: Cuong Nguyen <[email protected]>

* @aslonnie's comments

  Signed-off-by: Cuong Nguyen <[email protected]>

* Run linter

  Signed-off-by: Cuong Nguyen <[email protected]>

* Run linters

  Signed-off-by: Cuong Nguyen <[email protected]>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

  Signed-off-by: Cuong Nguyen <[email protected]>

* Revert to normal ray image

  Signed-off-by: Cuong Nguyen <[email protected]>

* Fix delete_fn

  Signed-off-by: Cuong Nguyen <[email protected]>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

  Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:

  - Read from the ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
  - Add some sample tests that use gce as an environment so we can run a CI and check that this diff works.

  - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
  - [X] I've run `scripts/format.sh` to lint the changes in this PR.
  - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

  Signed-off-by: Cuong Nguyen <[email protected]>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

  Signed-off-by: Cuong Nguyen <[email protected]>

* Add google-cloud-storage package to requirements

  Signed-off-by: Cuong Nguyen <[email protected]>

* Support for gs:// in anyscale job runner

  Signed-off-by: Cuong Nguyen <[email protected]>

* Correct adding gce tests

  Signed-off-by: Cuong Nguyen <[email protected]>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

  Signed-off-by: Avnish <[email protected]>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

  It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

  Signed-off-by: Cuong Nguyen <[email protected]>

* Run linter

  Signed-off-by: Cuong Nguyen <[email protected]>

* Run linters

  Signed-off-by: Cuong Nguyen <[email protected]>

* -s

* Fix some tests

  Signed-off-by: Cuong Nguyen <[email protected]>

* Add unit tests for test definition parser

  Signed-off-by: Cuong Nguyen <[email protected]>

* Fix lints

  Signed-off-by: Cuong Nguyen <[email protected]>

* @aslonnie's comments

  Signed-off-by: Cuong Nguyen <[email protected]>

* Check that parse_test_definition throws exception on empty variations

  Signed-off-by: Cuong Nguyen <[email protected]>

* Remove the constant test definition in test.

  Signed-off-by: Cuong Nguyen <[email protected]>

---------

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Signed-off-by: Avnish <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: xwjiang2010 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Co-authored-by: augray <[email protected]>
Co-authored-by: Artur Niederfahrenhorst <[email protected]>
Co-authored-by: Avnish Narayan <[email protected]>
Co-authored-by: jiafu zhang <[email protected]>
Co-authored-by: Qing Wang <[email protected]>
Co-authored-by: Cindy Zhang <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Co-authored-by: kourosh hakhamaneshi <[email protected]>
Co-authored-by: Clark Zinzow <[email protected]>
Co-authored-by: James Clark <[email protected]>
Co-authored-by: Archit Kulkarni <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Co-authored-by: xwjiang2010 <[email protected]>
Co-authored-by: Sihan Wang <[email protected]>
Co-authored-by: clarng <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: Jian Xiao <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
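The `[Jobs]` entry above (#33259) describes replacing a separate "check, then put" with a single put that refuses to overwrite and reports how many keys were newly added. A minimal sketch of that idea follows. `ToyKV`, `submit_job`, and `submit_concurrently` are hypothetical names invented for illustration; this is a toy in-memory store, not Ray's internal KV or the actual Jobs code. The only detail taken from the commit message is the `overwrite=False` flag and the "number of keys newly added" return value.

```python
import threading


class ToyKV:
    """Hypothetical in-memory stand-in for a KV store (NOT Ray's internal KV)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value, overwrite=True):
        """Store value under key; return the number of keys newly added (0 or 1)."""
        with self._lock:
            if key in self._data:
                if overwrite:
                    self._data[key] = value
                return 0
            self._data[key] = value
            return 1


def submit_job(kv, submission_id, info):
    """Check and insert in one step; a duplicate submission_id is rejected."""
    newly_added = kv.put(submission_id, info, overwrite=False)
    if newly_added == 0:
        raise ValueError(f"Job with submission_id {submission_id!r} already exists.")
    return submission_id


def submit_concurrently(kv, submission_id, n_threads=8):
    """Race n_threads submissions of the same id; return how many succeeded."""
    outcomes = []
    barrier = threading.Barrier(n_threads)

    def worker(i):
        barrier.wait()  # release all threads at roughly the same moment
        try:
            submit_job(kv, submission_id, {"entrypoint": f"echo {i}"})
            outcomes.append(True)
        except ValueError:
            outcomes.append(False)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(outcomes)
```

Because the existence check and the write happen under one lock inside `put`, only one of the racing submissions can observe "newly added"; the others get a clear error instead of failing later on a name clash.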

@jianoaix

* Fix the 'Observed wheel commit () is not expected' issue (#32156), which has been creeping through many of the CI/CD builds in our pipeline.

  The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we are looking for might not be found. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

  Other surprises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added about 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

  Signed-off-by: Cuong Nguyen <[email protected]>

* Improve wheel commit validation error message

  Signed-off-by: Cuong Nguyen <[email protected]>

* Set up dependencies and credentials for GCE in Buildkite

  Signed-off-by: Cuong Nguyen <[email protected]>

* Add google-cloud-storage package to requirements

  Signed-off-by: Cuong Nguyen <[email protected]>

* Add new lines to some files

  Signed-off-by: Cuong Nguyen <[email protected]>

* Support for gs:// in anyscale job runner

  Signed-off-by: Cuong Nguyen <[email protected]>

* Correct adding gce tests

  Signed-off-by: Cuong Nguyen <[email protected]>

* Support test definition with multiple flavors

  Signed-off-by: Cuong Nguyen <[email protected]>

* Use `not in` to check key in dict

  Signed-off-by: Cuong Nguyen <[email protected]>

* Debugging

  Signed-off-by: Cuong Nguyen <[email protected]>

* Debugging

  Signed-off-by: Cuong Nguyen <[email protected]>

* Debugging 2

  Signed-off-by: Cuong Nguyen <[email protected]>

* Debugging 03

  Signed-off-by: Cuong Nguyen <[email protected]>

* Remove temporary logs

  Signed-off-by: Cuong Nguyen <[email protected]>

* -s

* Update flavors

  Signed-off-by: Cuong Nguyen <[email protected]>

* Only initialize gs client on gs host

  Signed-off-by: Cuong Nguyen <[email protected]>

* Lint

  Signed-off-by: Cuong Nguyen <[email protected]>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

  Signed-off-by: Artur Niederfahrenhorst <[email protected]>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

  Signed-off-by: Avnish <[email protected]>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

  This makes Java logging consistent with PR #31772, which seems to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

  Co-authored-by: Qing Wang <[email protected]>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

  <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">

  This reverts commit cb5bb0e.

  - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
  - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
  - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
  - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

* [tune] add data to CI test dependencies (#33729)

  1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `":octopus: Tune tests and examples (medium)"`.
  2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
  3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).

  **Note:** There should probably be a better way of handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

  This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

  This PR fixes a few things:

  * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
  * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
  * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

  **Steps to Reproduce**

  1. Save this script as `test.py`:

     ```python
     import ray


     @ray.remote(runtime_env={"env_vars": {}})
     def my_fn():
         return True


     ray.init()
     print(ray.get(my_fn.remote()))
     ```

  2. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
  3. a. Save `:` or other invalid JSON as `bad-json.json`
     b. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

  This PR fixes the issue and adds a new test case.

  Signed-off-by: James Clark <[email protected]>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

  In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) checked whether the info already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see that the info doesn't exist yet, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

  This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int, which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). It also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

  Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries on REDIS_REPLY_ERROR, which is 6, and also prints out the error message.

  Signed-off-by: Jiajun Yao <[email protected]>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

  Additionally fix `test_usage_test.py`.

  Signed-off-by: xwjiang2010 <[email protected]>

* Deprecate RuntimeContext.get (#33734)

  RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

  Signed-off-by: Jiajun Yao <[email protected]>

* [Serve] Fix the serve.batch api doc (#33588)

  Fix the example formatting in the serve batch API doc.

* [infra] increase Build timeout (#33756)

  Why are these changes needed? A release test is failing due to a timeout when building the cluster env. Currently the timeout is 30 minutes, but the build could take longer, e.g. https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

  It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736)

  It looks like the workflow starts a 1-CPU cluster and has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

  Sometimes there is more than 1 OOD event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.

* [Data] Repurpose streaming CI to bulk CI (#33478)

  Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

  Make sure ray.serve.lib works with ray.actor without serve context.

  ```
  @ray.remote
  class MyActor:
      def __init__(self):
          self.my_counter = metrics.Counter(
              "my_ray_actor",
              description=("The number of requests to this deployment."),
              tag_keys=("my_tag",),
          )

      def test(self):
          self.my_counter.inc(tags={"my_tag": "value"})
          return "hello"


  @serve.deployment(num_replicas=2)
  class Model:
      def __init__(self, model_name):
          self.my_actor = MyActor.remote()

      async def __call__(self, req: starlette.requests.Request):
          await self.my_actor.test.remote()
          return
  ```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

  Signed-off-by: sven1977 <[email protected]>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

  If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

  ---------

  Signed-off-by: amogkam <[email protected]>

* @aslonnie's comments

  Signed-off-by: Cuong Nguyen <[email protected]>

* @aslonnie's comments

  Signed-off-by: Cuong Nguyen <[email protected]>

* Run linter

  Signed-off-by: Cuong Nguyen <[email protected]>

* Run linters

  Signed-off-by: Cuong Nguyen <[email protected]>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

  Signed-off-by: Cuong Nguyen <[email protected]>

* Revert to normal ray image

  Signed-off-by: Cuong Nguyen <[email protected]>

* Fix delete_fn

  Signed-off-by: Cuong Nguyen <[email protected]>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

  Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:

  - Read from the ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
  - Add some sample tests that use gce as an environment so we can run a CI and check that this diff works.

  - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
  - [X] I've run `scripts/format.sh` to lint the changes in this PR.
  - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

  Signed-off-by: Cuong Nguyen <[email protected]>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

  Signed-off-by: Cuong Nguyen <[email protected]>

* Add google-cloud-storage package to requirements

  Signed-off-by: Cuong Nguyen <[email protected]>

* Support for gs:// in anyscale job runner

  Signed-off-by: Cuong Nguyen <[email protected]>

* Correct adding gce tests

  Signed-off-by: Cuong Nguyen <[email protected]>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

  Signed-off-by: Avnish <[email protected]>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

  It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

  Signed-off-by: Cuong Nguyen <[email protected]>

* Run linter

  Signed-off-by: Cuong Nguyen <[email protected]>

* Run linters

  Signed-off-by: Cuong Nguyen <[email protected]>

* -s

* Fix some tests

  Signed-off-by: Cuong Nguyen <[email protected]>

* Add unit tests for test definition parser

  Signed-off-by: Cuong Nguyen <[email protected]>

* Fix lints

  Signed-off-by: Cuong Nguyen <[email protected]>

* @aslonnie's comments

  Signed-off-by: Cuong Nguyen <[email protected]>

* Check that parse_test_definition throws exception on empty variations

  Signed-off-by: Cuong Nguyen <[email protected]>

* Remove the constant test definition in test.

  Signed-off-by: Cuong Nguyen <[email protected]>

* The cluster environment name does not allow the character '.', so fix that. Address Lonnie's comments and add more tests.

  Signed-off-by: Cuong Nguyen <[email protected]>

---------

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Signed-off-by: Avnish <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: xwjiang2010 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Co-authored-by: augray <[email protected]>
Co-authored-by: Artur Niederfahrenhorst <[email protected]>
Co-authored-by: Avnish Narayan <[email protected]>
Co-authored-by: jiafu zhang <[email protected]>
Co-authored-by: Qing Wang <[email protected]>
Co-authored-by: Cindy Zhang <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Co-authored-by: kourosh hakhamaneshi <[email protected]>
Co-authored-by: Clark Zinzow <[email protected]>
Co-authored-by: James Clark <[email protected]>
Co-authored-by: Archit Kulkarni <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Co-authored-by: xwjiang2010 <[email protected]>
Co-authored-by: Sihan Wang <[email protected]>
Co-authored-by: clarng <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: Jian Xiao <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
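The `[runtime env]` entry above (#33535) names two fixes: closing the schema file after loading it, and continuing to the next schema when one is missing or is not valid JSON. The sketch below illustrates that pattern only. `load_schemas` and `demo` are hypothetical helpers written for this example, not Ray's plugin-schema loader, and parsing of `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` itself is not modeled.

```python
import json
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def load_schemas(paths):
    """Load JSON schema files, skipping any that are missing or invalid.

    Hypothetical helper for illustration; not Ray's actual loader.
    """
    schemas = {}
    for path in paths:
        try:
            # The context manager closes the file even if json.load raises.
            with open(path) as f:
                schema = json.load(f)
        except (OSError, json.JSONDecodeError) as e:
            logger.error("Skipping schema %s: %s", path, e)
            continue  # keep going so later, valid schemas are still loaded
        schemas[path] = schema
    return schemas


def demo():
    """Mirror the reproduction steps: a missing file, a bad file, a good file."""
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.json")
        bad = os.path.join(tmp, "bad-json.json")
        missing = os.path.join(tmp, "non-exist.json")
        with open(good, "w") as f:
            json.dump({"title": "my_plugin"}, f)
        with open(bad, "w") as f:
            f.write(":")
        loaded = load_schemas([missing, bad, good])
        return [os.path.basename(p) for p in loaded]
```

Putting the failing paths first in `demo` exercises the ordering problem the commit message mentions: a bad schema that runs before a good one must not prevent the good one from loading.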

Signed-off-by: Cuong Nguyen <[email protected]> --------- Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: Artur Niederfahrenhorst <[email protected]> Signed-off-by: Avnish <[email protected]> Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: xwjiang2010 <[email protected]> Signed-off-by: sven1977 <[email protected]> Signed-off-by: amogkam <[email protected]> Signed-off-by: Cuong Nguyen <[email protected]> Co-authored-by: augray <[email protected]> Co-authored-by: Artur Niederfahrenhorst <[email protected]> Co-authored-by: Avnish Narayan <[email protected]> Co-authored-by: jiafu zhang <[email protected]> Co-authored-by: Qing Wang <[email protected]> Co-authored-by: Cindy Zhang <[email protected]> Co-authored-by: SangBin Cho <[email protected]> Co-authored-by: matthewdeng <[email protected]> Co-authored-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: Clark Zinzow <[email protected]> Co-authored-by: James Clark <[email protected]> Co-authored-by: Archit Kulkarni <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Co-authored-by: xwjiang2010 <[email protected]> Co-authored-by: Sihan Wang <[email protected]> Co-authored-by: clarng <[email protected]> Co-authored-by: Eric Liang <[email protected]> Co-authored-by: Jian Xiao <[email protected]> Co-authored-by: Sven Mika <[email protected]> Co-authored-by: Amog Kamsetty <[email protected]>
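
For readers skimming the list above, the idea behind the RLlib fix (#33648) can be sketched in plain Python. This is only an illustration under stated assumptions, not RLlib's actual code: the state-dict keys `policy_mapping_fn` and `is_policy_to_train`, and the helpers `state_for_recovered_worker` and `restore_worker`, are hypothetical names chosen for this sketch. The point it shows is that state copied from the local (training) worker should not overwrite settings owned by the evaluation config.

```python
from typing import Any, Dict

# Hypothetical key names for the entries that the eval config owns.
EVAL_OWNED_KEYS = ("policy_mapping_fn", "is_policy_to_train")


def state_for_recovered_worker(
    local_worker_state: Dict[str, Any], is_eval_worker: bool
) -> Dict[str, Any]:
    """Return the part of the local worker's state that is safe to sync.

    For a recovered eval worker, drop the entries owned by the eval config
    so that the training workers' functions are not copied over.
    """
    state = dict(local_worker_state)
    if is_eval_worker:
        for key in EVAL_OWNED_KEYS:
            state.pop(key, None)
    return state


def restore_worker(
    worker_config: Dict[str, Any], synced_state: Dict[str, Any]
) -> Dict[str, Any]:
    """Build the restored worker's settings: config first, then synced state."""
    restored = dict(worker_config)
    restored.update(synced_state)
    return restored
```

With this shape, a recovered eval worker still receives shared state such as weights, while its mapping and training-filter functions keep coming from its own config.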

@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of the CI/CD builds in our pipeline.
  The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not exist. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit. You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check goes back 2 years to our veteran Kai (234b015) and is meant to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.
  Signed-off-by: Cuong Nguyen <[email protected]>
* Improve wheel commit validation error message
  Signed-off-by: Cuong Nguyen <[email protected]>
* Setup dependencies and credentials for GCE in buildkite
  Signed-off-by: Cuong Nguyen <[email protected]>
* Add google-cloud-storage package to requirements
  Signed-off-by: Cuong Nguyen <[email protected]>
* Add new lines to some files
  Signed-off-by: Cuong Nguyen <[email protected]>
* Support for gs:// in anyscale job runner
  Signed-off-by: Cuong Nguyen <[email protected]>
* Correct adding gce tests
  Signed-off-by: Cuong Nguyen <[email protected]>
* Support test definition with multiple flavors
  Signed-off-by: Cuong Nguyen <[email protected]>
* Use not in to check key in dict
  Signed-off-by: Cuong Nguyen <[email protected]>
* Debugging
  Signed-off-by: Cuong Nguyen <[email protected]>
* Debugging
  Signed-off-by: Cuong Nguyen <[email protected]>
* Debugging 2
  Signed-off-by: Cuong Nguyen <[email protected]>
* Debugging 03
  Signed-off-by: Cuong Nguyen <[email protected]>
* Remove temporary logs
  Signed-off-by: Cuong Nguyen <[email protected]>
* -s
* Update flavors
  Signed-off-by: Cuong Nguyen <[email protected]>
* Only initialize gs client on gs host
  Signed-off-by: Cuong Nguyen <[email protected]>
* Lint
  Signed-off-by: Cuong Nguyen <[email protected]>
* Update image for Sematic integration (#33469)
* [RLlib] fix preprocessor test (#33719)
  Signed-off-by: Artur Niederfahrenhorst <[email protected]>
* [RLlib] APPO TF with RLModule and Learner API (#33310)
  Signed-off-by: Avnish <[email protected]>
* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)
  This makes Java logging consistent with PR #31772, which seems to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.
  Co-authored-by: Qing Wang <[email protected]>
* [serve] Fix serve HA test (#33699)
* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)
  <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">
  This reverts commit cb5bb0e.
  - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
  - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
  - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
  - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(
* [tune] add data to CI test dependencies (#33729)
  1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `:octopus: Tune tests and examples (medium)"`.
  2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
  3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1).

  **Note:** There should probably be a better way of handling dependencies in CI tests...
* [Test] Fix test event test timeout (#33704)
* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)
  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)
  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)
  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)
  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)
  This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.
* [runtime env] Close schema after loading and continue on error (#33535)
  This PR fixes a few things:
  * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
  * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
  * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

  **Steps to Reproduce**
  1. Save this script as `test.py`

     ```python
     import ray


     @ray.remote(runtime_env={"env_vars": {}})
     def my_fn():
         return True


     ray.init()
     print(ray.get(my_fn.remote()))
     ```

  2. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
  3. a. Save `:` or other invalid JSON as `bad-json.json`
     b. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

  This PR fixes the issue and adds a new test case.
  Signed-off-by: James Clark <[email protected]>
* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)
  In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) checked whether the info already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see that the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name. This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int, which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). It also adds a unit test which fails without this change.
* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)
  Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries on REDIS_REPLY_ERROR, which is 6, and also prints out the error message.
  Signed-off-by: Jiajun Yao <[email protected]>
* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)
  Additionally fix `test_usage_test.py`.
  Signed-off-by: xwjiang2010 <[email protected]>
* Deprecate RuntimeContext.get (#33734)
  RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.
  Signed-off-by: Jiajun Yao <[email protected]>
* [Serve] Fix the serve.batch api doc (#33588)
  Fix the example formatting in the serve batch API doc.
* [infra] increase Build timeout (#33756)
  Why are these changes needed? The release test is failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build could take longer, e.g. https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2
* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)
  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)
  It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
* [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736)
  It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).
* [Test] Fix out of disk error (#33732)
  Sometimes there is more than 1 OOD event if the test runs for more than 10 seconds. I relaxed the assert condition in that case.
* [Data] Repurpose streaming CI to bulk CI (#33478)
  The streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).
* [Serve] Enable serve metrics lib working in ray actor (#33717)
  Make sure the `ray.serve` metrics lib works in a Ray actor without a Serve context.

  ```
  # Imports added so the snippet is self-contained.
  import ray
  import starlette.requests
  from ray import serve
  from ray.serve import metrics


  @ray.remote
  class MyActor:
      def __init__(self):
          self.my_counter = metrics.Counter(
              "my_ray_actor",
              description=("The number of requests to this deployment."),
              tag_keys=("my_tag",),
          )

      def test(self):
          self.my_counter.inc(tags={"my_tag": "value"})
          return "hello"


  @serve.deployment(num_replicas=2)
  class Model:
      def __init__(self, model_name):
          self.my_actor = MyActor.remote()

      async def __call__(self, req: starlette.requests.Request):
          await self.my_actor.test.remote()
          return
  ```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)
  Signed-off-by: sven1977 <[email protected]>
* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)
  If the user provides a collate_fn to iter_torch_batches, the collate_fn is expected to be responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.
  ---------
  Signed-off-by: amogkam <[email protected]>
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <[email protected]>
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <[email protected]>
* Run linter
  Signed-off-by: Cuong Nguyen <[email protected]>
* Run linters
  Signed-off-by: Cuong Nguyen <[email protected]>
* Change ray to 2.3.1 to work around the #ir-glorious-shape
  Signed-off-by: Cuong Nguyen <[email protected]>
* Revert to normal ray image
  Signed-off-by: Cuong Nguyen <[email protected]>
* Fix delete_fn
  Signed-off-by: Cuong Nguyen <[email protected]>
* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)
  Add the ability for AnyscaleJobRunner to run on a GCE host. The added logic:
  - Read from the ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write and delete are all supported.
  - Add some sample tests that use gce as an environment so we can run a CI and check that this diff works.

  - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
  - [X] I've run `scripts/format.sh` to lint the changes in this PR.
  - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825
* Run lint
  Signed-off-by: Cuong Nguyen <[email protected]>
* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)
* Setup dependencies and credentials for GCE in buildkite
  Signed-off-by: Cuong Nguyen <[email protected]>
* Add google-cloud-storage package to requirements
  Signed-off-by: Cuong Nguyen <[email protected]>
* Support for gs:// in anyscale job runner
  Signed-off-by: Cuong Nguyen <[email protected]>
* Correct adding gce tests
  Signed-off-by: Cuong Nguyen <[email protected]>
* [RLlib] APPO TF with RLModule and Learner API (#33310)
  Signed-off-by: Avnish <[email protected]>
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)
  It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <[email protected]>
* Run linter
  Signed-off-by: Cuong Nguyen <[email protected]>
* Run linters
  Signed-off-by: Cuong Nguyen <[email protected]>
* -s
* Fix some tests
  Signed-off-by: Cuong Nguyen <[email protected]>
* Add unit tests for test definition parser
  Signed-off-by: Cuong Nguyen <[email protected]>
* Fix lints
  Signed-off-by: Cuong Nguyen <[email protected]>
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <[email protected]>
* Check that parse_test_definition throws exception on empty variations
  Signed-off-by: Cuong Nguyen <[email protected]>
* Remove the constant test definition in test.
  Signed-off-by: Cuong Nguyen <[email protected]>
* The cluster environment name does not allow the character '.', so fix that. Address Lonnie's comments and add more tests.
  Signed-off-by: Cuong Nguyen <[email protected]>

---------

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Signed-off-by: Avnish <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: xwjiang2010 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Co-authored-by: augray <[email protected]>
Co-authored-by: Artur Niederfahrenhorst <[email protected]>
Co-authored-by: Avnish Narayan <[email protected]>
Co-authored-by: jiafu zhang <[email protected]>
Co-authored-by: Qing Wang <[email protected]>
Co-authored-by: Cindy Zhang <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Co-authored-by: kourosh hakhamaneshi <[email protected]>
Co-authored-by: Clark Zinzow <[email protected]>
Co-authored-by: James Clark <[email protected]>
Co-authored-by: Archit Kulkarni <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Co-authored-by: xwjiang2010 <[email protected]>
Co-authored-by: Sihan Wang <[email protected]>
Co-authored-by: clarng <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: Jian Xiao <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
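
The [Jobs] entry above (#33259) describes replacing a check-then-put sequence with a single put that reports whether the key was newly added. A minimal sketch of that pattern follows. It is an assumption-laden stand-in, not Ray's code: `InMemoryKV` and `submit_job` are hypothetical names, and the in-memory store only mimics the `overwrite=False` behavior and integer return value mentioned in the commit message.

```python
import threading
from typing import Dict


class InMemoryKV:
    """Hypothetical stand-in for a KV store with an atomic put-if-absent."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._lock = threading.Lock()

    def put(self, key: str, value: str, overwrite: bool = False) -> int:
        """Store value under key and return the number of keys newly added."""
        with self._lock:
            is_new = key not in self._data
            if is_new or overwrite:
                self._data[key] = value
            return 1 if is_new else 0


def submit_job(kv: InMemoryKV, submission_id: str, info: str) -> bool:
    """Return True only for the caller that actually claimed submission_id.

    The existence check and the write happen inside one put() call, so two
    concurrent submissions cannot both observe "not present yet".
    """
    return kv.put(submission_id, info, overwrite=False) == 1
```

A caller that gets `False` back can reject the duplicate submission with a clear error instead of failing later on a name collision.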


@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of the CI/CD builds in our pipeline.

  The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that is by default on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not exist. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

  You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check goes back 2 years to our veteran Kai (234b015), who added it to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation of how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

  Signed-off-by: Cuong Nguyen <[email protected]>
* Improve wheel commit validation error message
  Signed-off-by: Cuong Nguyen <[email protected]>
* Setup dependencies and credential for GCE in buildkite
  Signed-off-by: Cuong Nguyen <[email protected]>
* Add google-cloud-storage package to requirements
  Signed-off-by: Cuong Nguyen <[email protected]>
* Add new lines to some files
  Signed-off-by: Cuong Nguyen <[email protected]>
* Support for gs:// in anyscale job runner
  Signed-off-by: Cuong Nguyen <[email protected]>
* Correct adding gce tests
  Signed-off-by: Cuong Nguyen <[email protected]>
* Support test definition with multiple flavors
  Signed-off-by: Cuong Nguyen <[email protected]>
* Use not in to check key in dict
  Signed-off-by: Cuong Nguyen <[email protected]>
* Debugging
  Signed-off-by: Cuong Nguyen <[email protected]>
* Debugging
  Signed-off-by: Cuong Nguyen <[email protected]>
* Debugging 2
  Signed-off-by: Cuong Nguyen <[email protected]>
* Debugging 03
  Signed-off-by: Cuong Nguyen <[email protected]>
* Remove temporary logs
  Signed-off-by: Cuong Nguyen <[email protected]>
* -s
* Update flavors
  Signed-off-by: Cuong Nguyen <[email protected]>
* Only initialize gs client on gs host
  Signed-off-by: Cuong Nguyen <[email protected]>
* Lint
  Signed-off-by: Cuong Nguyen <[email protected]>
* Update image for Sematic integration (#33469)
* [RLlib] fix preprocessor test (#33719)
  Signed-off-by: Artur Niederfahrenhorst <[email protected]>
* [RLlib] APPO TF with RLModule and Learner API (#33310)
  Signed-off-by: Avnish <[email protected]>
* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

  This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

  Co-authored-by: Qing Wang <[email protected]>
* [serve] Fix serve HA test (#33699)
* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

  <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">

  This reverts commit cb5bb0e.

  - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
  - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
  - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
  - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(
* [tune] add data to CI test dependencies (#33729)

  1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `":octopus: Tune tests and examples (medium)"`.
  2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
  3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).

  **Note:** There should probably be a better way for handling dependencies in CI tests...
* [Test] Fix test event test timeout (#33704)
* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)
  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)
  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)
  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)
  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

  This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.
* [runtime env] Close schema after loading and continue on error (#33535)

  This PR fixes a few things:
  * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
  * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
  * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

  **Steps to Reproduce**
  1. Save this script as `test.py`
     ```python
     import ray

     @ray.remote(runtime_env={"env_vars": {}})
     def my_fn():
         return True

     ray.init()
     print(ray.get(my_fn.remote()))
     ```
  2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
  3. a. save `:` or other invalid JSON as `bad-json.json`
     b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

  This PR fixes the issue and adds a new test case.

  Signed-off-by: James Clark <[email protected]>
* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

  In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) checked if the info already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see that the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

  This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int, which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

  Also adds a unit test which fails without this change.
* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

  Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR, which is 6, and also prints out the error message.

  Signed-off-by: Jiajun Yao <[email protected]>
* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

  Additionally fix `test_usage_test.py`.

  Signed-off-by: xwjiang2010 <[email protected]>
* Deprecate RuntimeContext.get (#33734)

  RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

  Signed-off-by: Jiajun Yao <[email protected]>
* [Serve] Fix the serve.batch api doc (#33588)

  Fix the example formatting in the serve batch API doc
* [infra] increase Build timeout (#33756)

  Why are these changes needed? The release test is failing due to a timeout when building the cluster env. Currently the timeout is 30 minutes, but the build could take longer, e.g., https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2
* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)
  Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

  It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
* [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736)

  It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

  I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory.)
* [Test] Fix out of disk error (#33732)

  Sometimes there is more than 1 OOD event if the test runs for more than 10 seconds. I relaxed the assert condition in that case.
* [Data] Repurpose streaming CI to bulk CI(#33478)

  Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).
* [Serve] Enable serve metrics lib working in ray actor (#33717)

  Make sure ray.serve.lib works with ray.actor without a serve context.
  ```
  import ray
  import starlette.requests
  from ray import serve
  from ray.serve import metrics


  @ray.remote
  class MyActor:
      def __init__(self):
          self.my_counter = metrics.Counter(
              "my_ray_actor",
              description=("The number of requests to this deployment."),
              tag_keys=("my_tag",),
          )

      def test(self):
          self.my_counter.inc(tags={"my_tag": "value"})
          return "hello"


  @serve.deployment(num_replicas=2)
  class Model:
      def __init__(self, model_name):
          self.my_actor = MyActor.remote()

      async def __call__(self, req: starlette.requests.Request):
          await self.my_actor.test.remote()
          return
  ```
* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)
  Signed-off-by: sven1977 <[email protected]>
* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

  If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

  ---------

  Signed-off-by: amogkam <[email protected]>
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <[email protected]>
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <[email protected]>
* Run linter
  Signed-off-by: Cuong Nguyen <[email protected]>
* Run linters
  Signed-off-by: Cuong Nguyen <[email protected]>
* Change ray to 2.3.1 to work around the #ir-glorious-shape
  Signed-off-by: Cuong Nguyen <[email protected]>
* Revert to normal ray image
  Signed-off-by: Cuong Nguyen <[email protected]>
* Fix delete_fn
  Signed-off-by: Cuong Nguyen <[email protected]>
* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

  Add the ability for AnyscaleJobRunner to run on a GCE host. The added logic:
  - Read from the ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write and delete are all supported.
  - Add some sample tests that use gce as an environment so we can run a CI and check that this diff works

  - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
  - [X] I've run `scripts/format.sh` to lint the changes in this PR.
  - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825
* Run lint
  Signed-off-by: Cuong Nguyen <[email protected]>
* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)
* Setup dependencies and credential for GCE in buildkite
  Signed-off-by: Cuong Nguyen <[email protected]>
* Add google-cloud-storage package to requirements
  Signed-off-by: Cuong Nguyen <[email protected]>
* Support for gs:// in anyscale job runner
  Signed-off-by: Cuong Nguyen <[email protected]>
* Correct adding gce tests
  Signed-off-by: Cuong Nguyen <[email protected]>
* [RLlib] APPO TF with RLModule and Learner API (#33310)
  Signed-off-by: Avnish <[email protected]>
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

  It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <[email protected]>
* Run linter
  Signed-off-by: Cuong Nguyen <[email protected]>
* Run linters
  Signed-off-by: Cuong Nguyen <[email protected]>
* -s
* Fix some tests
  Signed-off-by: Cuong Nguyen <[email protected]>
* Add unit tests for test definition parser
  Signed-off-by: Cuong Nguyen <[email protected]>
* Fix lints
  Signed-off-by: Cuong Nguyen <[email protected]>
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <[email protected]>
* Check that parse_test_definition throws exception on empty variations
  Signed-off-by: Cuong Nguyen <[email protected]>
* Remove the constant test definition in test.
  Signed-off-by: Cuong Nguyen <[email protected]>
* The cluster environment name does not allow the character '.', so fix that. Address Lonnie's comments and add more tests.
  Signed-off-by: Cuong Nguyen <[email protected]>

---------

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Signed-off-by: Avnish <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: xwjiang2010 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Co-authored-by: augray <[email protected]>
Co-authored-by: Artur Niederfahrenhorst <[email protected]>
Co-authored-by: Avnish Narayan <[email protected]>
Co-authored-by: jiafu zhang <[email protected]>
Co-authored-by: Qing Wang <[email protected]>
Co-authored-by: Cindy Zhang <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Co-authored-by: kourosh hakhamaneshi <[email protected]>
Co-authored-by: Clark Zinzow <[email protected]>
Co-authored-by: James Clark <[email protected]>
Co-authored-by: Archit Kulkarni <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Co-authored-by: xwjiang2010 <[email protected]>
Co-authored-by: Sihan Wang <[email protected]>
Co-authored-by: clarng <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: Jian Xiao <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
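
The Jobs race-condition fix (#33259) in the commit message above hinges on replacing a separate "check, then put" with a single put that refuses to overwrite and reports how many keys it added. The following is only a minimal sketch of that idea: `InMemoryKV` and `submit_job` are hypothetical stand-ins written for illustration and are not Ray's internal KV or Jobs code.

```python
import threading


class InMemoryKV:
    """Hypothetical stand-in for an internal KV store (not Ray's API)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value, overwrite=True):
        """Store value and return the number of newly added keys (0 or 1)."""
        with self._lock:
            if key in self._data:
                if overwrite:
                    self._data[key] = value
                return 0
            self._data[key] = value
            return 1


def submit_job(kv, submission_id, info):
    """Claim submission_id in one atomic step; reject duplicates."""
    # The existence check and the write happen inside one locked call,
    # so two concurrent submitters cannot both see "not present".
    added = kv.put(submission_id, info, overwrite=False)
    if added == 0:
        raise ValueError(f"Job with submission_id {submission_id!r} already exists.")
    return submission_id
```

With this shape, the second of two near-simultaneous submissions fails early with a clear message instead of surfacing later as a naming conflict.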

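The runtime env change (#33535) in the commit message above describes two behaviours: closing each schema file after reading it, and continuing to the next schema when one is missing or is not valid JSON. A rough sketch of that loading pattern follows; `load_schemas` is a hypothetical helper named for this example and is not the function Ray uses.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_schemas(paths):
    """Load JSON schema files, skipping any that are missing or malformed."""
    schemas = []
    for path in paths:
        try:
            # The context manager closes the file even if decoding fails.
            with open(path) as f:
                schema = json.load(f)
        except (OSError, json.JSONDecodeError) as exc:
            logger.warning("Skipping schema %s: %s", path, exc)
            continue  # keep going with the remaining schemas
        schemas.append(schema)
    return schemas
```

Putting the bad file first in the list is what exercises the `continue`, which matches the ordering problem the commit message mentions in the earlier test.
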
@jianoaix

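The wheel-commit fix (#32156) listed above works by asking for one small archive member directly instead of streaming the whole archive through a pipe and searching the output. The original fix uses `unzip`; the snippet below is only a Python analogue of the same idea using the standard `zipfile` module, and the member name `ray/__commit__` is an assumption made for illustration.

```python
import zipfile


def read_commit(wheel, member="ray/__commit__"):
    """Read a single small member from a wheel (a zip archive).

    `wheel` may be a path or a binary file-like object. The default
    member name is hypothetical; pass the real one for your archive.
    """
    with zipfile.ZipFile(wheel) as zf:
        # Only the requested member is decompressed; large files in the
        # same archive are never read.
        return zf.read(member).decode().strip()
```

If the member is absent, `ZipFile.read` raises `KeyError`, which gives a natural place to emit a clearer validation error.
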


…pping_fn and policy_to_train fn, not the main train workers' ones. (ray-project#33648) Signed-off-by: sven1977 <[email protected]> Signed-off-by: elliottower <[email protected]>

…pping_fn and policy_to_train fn, not the main train workers' ones. (ray-project#33648) Signed-off-by: sven1977 <[email protected]> Signed-off-by: Jack He <[email protected]>

sven1977 added 3 commits March 23, 2023 23:05

wip.

92ed386

Signed-off-by: sven1977 <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into fix_eval_worker_is_recovered_with_main_config_not_eval_config

9c29d39

LINT.

4d04ec0

Signed-off-by: sven1977 <[email protected]>

sven1977 requested review from gjoliver, avnishn, ArturNiederfahrenhorst, smorad, maxpumperla, kouroshHakha and krfricke as code owners March 23, 2023 22:12

sven1977 assigned kouroshHakha Mar 23, 2023

sven1977 commented Mar 23, 2023


kouroshHakha reviewed Mar 24, 2023


enhancement.

146223f

Signed-off-by: sven1977 <[email protected]>

gjoliver reviewed Mar 24, 2023


sven1977 added 8 commits March 25, 2023 17:01

wip

1e50240

Signed-off-by: sven1977 <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into fix_eval_worker_is_recovered_with_main_config_not_eval_config

b08c669

LINT

6b3466b

Signed-off-by: sven1977 <[email protected]>

test case fix

42940a6

Signed-off-by: sven1977 <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into fix_eval_worker_is_recovered_with_main_config_not_eval_config

21f90ba

wip

36297bb

Signed-off-by: sven1977 <[email protected]>

fix

644faa6

Signed-off-by: sven1977 <[email protected]>

LINT

d5f86e9

Signed-off-by: sven1977 <[email protected]>

kouroshHakha approved these changes Mar 27, 2023


gjoliver approved these changes Mar 27, 2023


gjoliver merged commit d127273 into ray-project:master Mar 27, 2023
