[CI][GCI/3] Add variations attribute to create tests in multiple cluster environment #33718

Merged 85 commits on Mar 31, 2023

Commits on Mar 18, 2023

  1. Fix 'Observed wheel commit () is not expected' issue (#32156) that ha…

    …s been creeping through many of ci/cd builds in our pipeline.
    
    The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.
    
    Other surprises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.
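
The idea above, reading one named archive member instead of streaming the whole archive through a filter, can be sketched in Python. This is only an illustration, not the script changed by this commit: the member name `ray/__init__.py`, the `__commit__ = "..."` line format, and the helper name `read_wheel_commit` are assumptions made for the example.

```python
import re
import zipfile
from typing import Optional

# Assumed format of the commit marker line inside the wheel (hypothetical).
COMMIT_RE = re.compile(r'^__commit__\s*=\s*"([^"]*)"', re.MULTILINE)


def read_wheel_commit(wheel_path: str, member: str = "ray/__init__.py") -> Optional[str]:
    """Return the commit recorded in one member of a wheel, or None.

    Only the named member is decompressed, so the size of the rest of the
    archive does not matter.
    """
    with zipfile.ZipFile(wheel_path) as wheel:
        try:
            text = wheel.read(member).decode("utf-8")
        except KeyError:
            # The member is not in the archive at all.
            return None
    match = COMMIT_RE.search(text)
    return match.group(1) if match else None
```

A caller would compare the returned value with the expected commit and fail the build on a mismatch or on `None`.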
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 18, 2023
    57d69f8
  2. 3184500

Commits on Mar 20, 2023

  1. d085526
  2. Improve wheel commit validation error message

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 20, 2023
    5c17ef9

Commits on Mar 23, 2023

  1. 98212de
  2. Set up dependencies and credentials for GCE in Buildkite

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 23, 2023
    c4638a6
  3. Add google-cloud-storage package to requirements

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 23, 2023
    eb1b6a2

Commits on Mar 24, 2023

  1. 048545e
  2. Add new lines to some files

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 24, 2023
    d02abc5

Commits on Mar 25, 2023

  1. 3e78d18

Commits on Mar 26, 2023

  1. Support for gs:// in anyscale job runner

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 26, 2023
    1fb1997
  2. Correct adding gce tests

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 26, 2023
    7db91f5
  3. Support test definition with multiple flavors

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 26, 2023
    a37234f
  4. Use not in to check key in dict

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 26, 2023
    fbfcc92
  5. Debugging

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 26, 2023
    9cd8412
  6. Debugging

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 26, 2023
    6a8e36b
  7. Debugging 2

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 26, 2023
    22792cc
  8. Debugging 03

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 26, 2023
    accd686

Commits on Mar 27, 2023

  1. Remove temporary logs

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 27, 2023
    235e877
  2. -s

    can-anyscale committed Mar 27, 2023
    f57b95b

Commits on Mar 28, 2023

  1. Update flavors

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    3c98173
  2. Only initialize gs client on gs host

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    dc325a3
  3. Lint

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    65d0577
  4. 4121ec4
  5. [RLlib] fix preprocessor test (#33719)

    Signed-off-by: Artur Niederfahrenhorst <[email protected]>
    ArturNiederfahrenhorst authored and can-anyscale committed Mar 28, 2023
    170ec1c
  6. [RLlib] APPO TF with RLModule and Learner API (#33310)

    Signed-off-by: Avnish <[email protected]>
    avnishn authored and can-anyscale committed Mar 28, 2023
    a0c8f1e
  7. [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to …

    …make Java logging consistent with Python (#33665)
    
    This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.
    
    Co-authored-by: Qing Wang <[email protected]>
    2 people authored and can-anyscale committed Mar 28, 2023
    d9e00cd
  8. [serve] Fix serve HA test (#33699)

    #33597 changed the log statements for adding a replica to a deployment. The assert statement in test_ray_server_basic checks for the exact log statement - we need to update that assert statement.
    zcin authored and can-anyscale committed Mar 28, 2023
    c019896
  9. Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#…

    …33731)
    
    This reverts commit cb5bb0e.
    
    rkooo567 authored and can-anyscale committed Mar 28, 2023
    99eaefa
  10. [tune] add data to CI test dependencies (#33729)

    1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
    2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done before (1) was merged.
    3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done before (1) was merged.
    
    **Note:** There should probably be a better way for handling dependencies in CI tests...
    matthewdeng authored and can-anyscale committed Mar 28, 2023
    8450787
  11. b314f31
  12. [RLlib] Fixed a typo in multi-agent definition using RLModules in tes…

    …t_env_runner_v2::test_guess_the_number_multi_agent (#33723)
    
    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    kouroshHakha authored and can-anyscale committed Mar 28, 2023
    5ae9abc
  13. [RLlib][RLModule] Disabled rl_module in one of the subtests in test_c…

    …uriosity (#33726)
    
    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    kouroshHakha authored and can-anyscale committed Mar 28, 2023
    090b579
  14. e1e36cb
  15. [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu test…

    …s because of LSTMs (#33728)
    
    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    kouroshHakha authored and can-anyscale committed Mar 28, 2023
    d7e87cf
  16. [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#3…

    …2747)
    
    This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.
    clarkzinzow authored and can-anyscale committed Mar 28, 2023
    eef5240
  17. [runtime env] Close schema after loading and continue on error (#33535)

    This PR fixes a few things:
    
    * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
    * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
        * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.
    
    
    **Steps to Reproduce**
    1. Save this script as `test.py`
    ```python
    import ray
    
    @ray.remote(runtime_env={"env_vars": {}})
    def my_fn():
        return True
    
    ray.init()
    print(ray.get(my_fn.remote()))
    ```
    2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
    3.
        a. save `:` or other invalid JSON as `bad-json.json`
        b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`
     
    This PR fixes the issue and adds a new test case.
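
The two fixes described above (closing the file, and continuing past a schema that is missing or undecodable) can be sketched as follows. This is a simplified illustration rather than Ray's actual schema loader: the function name `load_schemas` and the returned dictionary shape are hypothetical.

```python
import json
import logging
from typing import Dict, Iterable

logger = logging.getLogger(__name__)


def load_schemas(paths: Iterable[str]) -> Dict[str, dict]:
    """Load JSON schema files, skipping any that are missing or invalid."""
    schemas: Dict[str, dict] = {}
    for path in paths:
        try:
            # The context manager closes the file even if decoding fails.
            with open(path, encoding="utf-8") as f:
                schema = json.load(f)
        except (OSError, json.JSONDecodeError) as exc:
            logger.warning("Skipping schema %s: %s", path, exc)
            # Without this `continue`, execution would fall through with
            # `schema` unset, or stale from a previous iteration.
            continue
        schemas[path] = schema
    return schemas
```

The stale-value case is why ordering hid the bug in the earlier test: when a valid schema is loaded first, a later failure silently reuses it instead of raising.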
    Signed-off-by: James Clark <[email protected]>
    jamesclark-Zapata authored and can-anyscale committed Mar 28, 2023
    6b45157
  18. [Jobs] Fix race condition on submitting multiple jobs with the same id (

    #33259)
    
    In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).
    
    Previously, when submitting a job, we would (1) check if the info for the job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.
    
    This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.
    
    This PR fixes the race condition by making operations (1) and (2) happen atomically. This is already supported by internal_kv_put(... overwrite=False) -> int, which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value.
    
    Also adds a unit test which fails without this change.
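
The check-then-put race and its fix can be illustrated with a toy store. The sketch below is not Ray's internal KV: `InMemoryKV` and `submit_job` are hypothetical names, and only the `overwrite=False` semantics (return the number of keys newly added) are taken from the description above.

```python
import threading
from typing import Dict


class InMemoryKV:
    """Toy stand-in for a KV store with an atomic put-if-absent."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._lock = threading.Lock()

    def put(self, key: str, value: str, overwrite: bool = True) -> int:
        """Store value; return the number of keys newly added (0 or 1)."""
        with self._lock:
            is_new = key not in self._data
            if is_new or overwrite:
                self._data[key] = value
            return int(is_new)


def submit_job(kv: InMemoryKV, submission_id: str, info: str) -> None:
    # Check and write in one step: for a given id, only one caller gets 1.
    if kv.put(submission_id, info, overwrite=False) == 0:
        raise ValueError(f"Job with submission_id {submission_id} already exists.")
```

Because the existence check and the write happen under one lock, two concurrent submissions with the same ID cannot both proceed; the loser gets a clear error instead of a later named-actor conflict.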
    architkulkarni authored and can-anyscale committed Mar 28, 2023
    e5b6f78
  19. Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

    Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.
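
The retry behavior can be sketched in Python, although the real change is in C++ (`redis_client.cc`). Everything here is an assumption for illustration: replies are modeled as `(type, payload)` tuples, `incr` is a hypothetical callable standing in for the Redis call, and the attempt limit is arbitrary. The numeric reply types follow hiredis, where 3 is an integer reply and 6 is an error reply.

```python
import time
from typing import Callable, Tuple

REDIS_REPLY_INTEGER = 3
REDIS_REPLY_ERROR = 6

# (reply type, integer value or error message)
Reply = Tuple[int, object]


def get_next_job_id(
    incr: Callable[[], Reply], max_attempts: int = 3, delay_s: float = 0.0
) -> int:
    """Call incr() until it yields an integer reply, retrying error replies."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        reply_type, payload = incr()
        if reply_type == REDIS_REPLY_INTEGER:
            return int(payload)
        if reply_type == REDIS_REPLY_ERROR:
            # Log the server's message and retry instead of aborting.
            last_error = payload
            print(f"attempt {attempt}: Redis error reply: {payload}")
            time.sleep(delay_s)
            continue
        # Any other reply type is still treated as a hard failure.
        raise TypeError(f"Expected integer, found Redis type {reply_type}")
    raise RuntimeError(f"Giving up after {max_attempts} attempts: {last_error}")
```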
    
    
    Signed-off-by: Jiajun Yao <[email protected]>
    jjyao authored and can-anyscale committed Mar 28, 2023
    b5f1c3c
  20. Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#…

    …33561)"" (#33740)
    
    Additionally fix `test_usage_test.py`.
    
    Signed-off-by: xwjiang2010 <[email protected]>
    xwjiang2010 authored and can-anyscale committed Mar 28, 2023
    971a9c8
  21. Deprecate RuntimeContext.get (#33734)

    RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.
    
    Signed-off-by: Jiajun Yao <[email protected]>
    jjyao authored and can-anyscale committed Mar 28, 2023
    3e8e902
  22. [Serve] Fix the serve.batch api doc (#33588)

    Fix the example formatting in the serve batch API doc
    sihanwang41 authored and can-anyscale committed Mar 28, 2023
    90c60da
  23. [infra] increase Build timeout (#33756)

    Why are these changes needed?
    Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:
    
    https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2
    clarng authored and can-anyscale committed Mar 28, 2023
    b8596e8
  24. [RLlib][RLModule] Use forward_exploration() inside the unit-test for …

    …test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)
    
    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    kouroshHakha authored and can-anyscale committed Mar 28, 2023
    c1aac60
  25. [data] [streaming] Dataset.cache() doesn't work properly for streamin…

    …g executor (#33713)
    
    It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
    ericl authored and can-anyscale committed Mar 28, 2023
    ed9d773
  26. [Test] Fix the failing workflow test_dataset after streaming executor…

    … is enabled. (#33736)
    
    It looks like the workflow starts a 1-CPU cluster and has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).
    
    I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is unclear why it worked before the streaming executor was enabled (cc @jianoaix, in case you have a good theory).
    rkooo567 authored and can-anyscale committed Mar 28, 2023
    561fa53
  27. [Test] Fix out of disk error (#33732)

    Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.
    rkooo567 authored and can-anyscale committed Mar 28, 2023
    d62dfe8
  28. [Data] Repurpose streaming CI to bulk CI (#33478)

    Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).
    jianoaix authored and can-anyscale committed Mar 28, 2023
    1041e81
  29. [Serve] Enable serve metrics lib working in ray actor (#33717)

    Make sure ray.serve.lib works with ray.actor without a Serve context.
    ```python
    import ray
    import starlette.requests
    from ray import serve
    from ray.serve import metrics

    @ray.remote
    class MyActor:
        def __init__(self):
            # Counter created inside a plain Ray actor, outside any Serve context.
            self.my_counter = metrics.Counter(
                "my_ray_actor",
                description=("The number of requests to this deployment."),
                tag_keys=("my_tag",),
            )
        def test(self):
            self.my_counter.inc(tags={"my_tag": "value"})
            return "hello"
    
    @serve.deployment(num_replicas=2)
    class Model:
        def __init__(self, model_name):
            self.my_actor = MyActor.remote()
    
        async def __call__(self, req: starlette.requests.Request):
            # Each request increments the counter owned by the actor.
            await self.my_actor.test.remote()
            return
    ```
    sihanwang41 authored and can-anyscale committed Mar 28, 2023
    f6e0028
  30. [RLlib] Fix: Recovered eval worker should use eval-config's policy_ma…

    …pping_fn and policy_to_train fn, not the main train workers' ones. (#33648)
    
    Signed-off-by: sven1977 <[email protected]>
    sven1977 authored and can-anyscale committed Mar 28, 2023
    074396a
  31. [Data] Don't automatically move batches to device if collate_fn is …

    …specified (#33761)
    
    If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.
    
    ---------
    
    Signed-off-by: amogkam <[email protected]>
    amogkam authored and can-anyscale committed Mar 28, 2023
    3a98259
  32. @aslonnie's comments

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    d487977
  33. @aslonnie's comments

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    d894a97
  34. Run linter

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    8053b3b
  35. Run linters

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    34ff5d1
  36. Change ray to 2.3.1 to work around the #ir-glorious-shape

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    abad535
  37. Revert to normal ray image

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    15ec7f7
  38. Fix delete_fn

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    8173e4e
  39. Merge branch 'master' into gce03

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale authored Mar 28, 2023
    1355867
  40. [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#…

    …33772)
    
    Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
     - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
     - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works
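
The host detection described in the first bullet can be sketched as below. The environment variable name `RELEASE_TEST_ON_GCE` and the function name `is_gce_host` are hypothetical, since the PR text does not name them; only the two signals (an ENV variable and a `gs://` storage link) come from the description.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse

# Hypothetical environment variable name, used only for this sketch.
GCE_ENV_VAR = "RELEASE_TEST_ON_GCE"


def is_gce_host(
    storage_link: Optional[str], env: Mapping[str, str] = os.environ
) -> bool:
    """Return True if either the environment or the storage link says GCE."""
    if env.get(GCE_ENV_VAR) == "1":
        return True
    # A gs:// link points at Google Cloud Storage.
    if storage_link and urlparse(storage_link).scheme == "gs":
        return True
    return False
```

A file manager could branch on this result to pick a Google Cloud Storage client for read, write, and delete, and fall back to its existing path otherwise.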
    
    - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
    - [X] I've run `scripts/format.sh` to lint the changes in this PR.
    - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825
    can-anyscale committed Mar 28, 2023
    a0001d5
  41. Run lint

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    f21b22f
  42. 4d13b7b
  43. Set up dependencies and credentials for GCE in Buildkite

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    cc1268d
  44. Add google-cloud-storage package to requirements

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    aa3453e
  45. Support for gs:// in anyscale job runner

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    15965b5
  46. Correct adding gce tests

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    f0ea4c4
  47. [RLlib] APPO TF with RLModule and Learner API (#33310)

    Signed-off-by: Avnish <[email protected]>
    avnishn authored and can-anyscale committed Mar 28, 2023
    38ce312
  48. [data] [streaming] Dataset.cache() doesn't work properly for streamin…

    …g executor (#33713)
    
    It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
    ericl authored and can-anyscale committed Mar 28, 2023
    81472b7
  49. @aslonnie's comments

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    66c2de8
  50. Run linter

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    41f7636
  51. Run linters

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    e87a6ce
  52. 56447ae
  53. -s

    can-anyscale committed Mar 28, 2023
    c68097d
  54. Fix some tests

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 28, 2023
    b352ffc

Commits on Mar 29, 2023

  1. Add unit tests for test definition parser

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 29, 2023
    cdebd35
  2. Fix lints

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 29, 2023
    30ae2e1
  3. @aslonnie's comments

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 29, 2023
    32f69f8
  4. c34cf96
  5. Remove the constant test definition in test.

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 29, 2023
    b49a078

Commits on Mar 30, 2023

  1. [CI] Logic to create test variations in release test configs (#33920)

    * Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline.
    
    The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.
    
    Other surprises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Improve wheel commit validation error message
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Set up dependencies and credentials for GCE in Buildkite
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Add google-cloud-storage package to requirements
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Add new lines to some files
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Support for gs:// in anyscale job runner
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Correct adding gce tests
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Support test definition with multiple flavors
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Use not in to check key in dict
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Debugging
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Debugging
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Debugging 2
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Debugging 03
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Remove temporary logs
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * -s
    
    * Update flavors
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Only initialize gs client on gs host
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Lint
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Update image for Sematic integration (#33469)
    
    * [RLlib] fix preprocessor test (#33719)
    
    Signed-off-by: Artur Niederfahrenhorst <[email protected]>
    
    * [RLlib] APPO TF with RLModule and Learner API (#33310)
    
    Signed-off-by: Avnish <[email protected]>
    
    * [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)
    
    This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.
    
    Co-authored-by: Qing Wang <[email protected]>
    
    * [serve] Fix serve HA test (#33699)
    
    #33597 changed the log statements for adding a replica to a deployment. The assert statement in test_ray_server_basic checks for the exact log statement - we need to update that assert statement.
    
    * Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)
    
    This reverts commit cb5bb0e.
    
    * [tune] add data to CI test dependencies (#33729)
    
    1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
    2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
    3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
    
    **Note:** There should probably be a better way for handling dependencies in CI tests...
    
    * [Test] Fix test event test timeout (#33704)
    
    * [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)
    
    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    
    * [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)
    
    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    
    * [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)
    
    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    
    * [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)
    
    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    
    * [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)
    
    This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.
    
    * [runtime env] Close schema after loading and continue on error (#33535)
    
    This PR fixes a few things:
    
    * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
    * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
        * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.
    
    
    **Steps to Reproduce**
    1. Save this script as `test.py`
    ```python
    import ray
    
    @ray.remote(runtime_env={"env_vars": {}})
    def my_fn():
        return True
    
    ray.init()
    print(ray.get(my_fn.remote()))
    ```
    2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
    3.
        a. save `:` or other invalid JSON as `bad-json.json`
        b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`
     
    This PR fixes the issue and adds a new test case.
    Signed-off-by: James Clark <[email protected]>
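
The two fixes described in this message (closing the file, and continuing past a missing or undecodable schema) follow a common pattern that can be sketched as below. This is not Ray's actual loader: `load_schemas` and the file names are hypothetical and only illustrate why the `continue` matters.

```python
import json
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def load_schemas(paths):
    """Load JSON schema files, skipping any that are missing or malformed."""
    schemas = {}
    for path in paths:
        try:
            # The context manager closes the file even if decoding fails.
            with open(path) as handle:
                schema = json.load(handle)
        except (OSError, json.JSONDecodeError) as error:
            logger.warning("Skipping schema %s: %s", path, error)
            # Without this, `schema` would be unbound on the first
            # iteration or silently reuse the previous file's value.
            continue
        schemas[path] = schema
    return schemas


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.json")
    bad = os.path.join(tmp, "bad-json.json")
    missing = os.path.join(tmp, "non-exist.json")
    with open(good, "w") as handle:
        json.dump({"title": "demo"}, handle)
    with open(bad, "w") as handle:
        handle.write(":")
    loaded = load_schemas([good, bad, missing])
    loaded_names = sorted(os.path.basename(p) for p in loaded)
```

The comment on `continue` mirrors the situation described above, where a bad file that happens to follow a good one can hide the bug.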
    
    * [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)
    
    In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).
    
    Previously, when submitting a job, we (1) check if the info for the job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.
    
    This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.
    
    This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).
    
    Also adds a unit test which fails without this change.
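
The check-then-put race and its fix can be illustrated with a small stand-in store. The `InMemoryKV` class and `submit_job` function below are hypothetical; the only behavior borrowed from the description above is a put with `overwrite=False` that returns the number of keys newly added.

```python
import threading


class InMemoryKV:
    """Hypothetical stand-in for a key-value store with an atomic put."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value, overwrite=True):
        """Store `value` and return the number of keys newly added (0 or 1)."""
        with self._lock:
            is_new = key not in self._data
            if is_new or overwrite:
                self._data[key] = value
            return int(is_new)


def submit_job(kv, submission_id, info):
    # Check and write in one step instead of "exists?" followed by "put".
    if kv.put(submission_id, info, overwrite=False) == 0:
        raise ValueError(f"Job with submission_id {submission_id} already exists.")
    return submission_id


kv = InMemoryKV()
results = []


def worker():
    try:
        submit_job(kv, "job-1", {"status": "PENDING"})
        results.append("accepted")
    except ValueError:
        results.append("rejected")


threads = [threading.Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Because existence check and write happen under one lock, exactly one of the concurrent submissions is accepted and the rest get a clear error.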
    
    * Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)
    
    Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.
    
    
    Signed-off-by: Jiajun Yao <[email protected]>
    
    * Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)
    
    Additionally fix `test_usage_test.py`.
    
    Signed-off-by: xwjiang2010 <[email protected]>
    
    * Deprecate RuntimeContext.get (#33734)
    
    RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.
    
    Signed-off-by: Jiajun Yao <[email protected]>
    
    * [Serve] Fix the serve.batch api doc (#33588)
    
    Fix the example formatting in the serve batch API doc
    
    * [infra] increase Build timeout (#33756)
    
    Why are these changes needed?
    Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:
    
    https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2
    
    * [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)
    
    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    
    * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)
    
    It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
    
    * [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)
    
    Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. There was an option to make workflow remote task to use 0 CPU, but I think that doesn't really make sense (since user probably just writes regular function inside).
    
    I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).
    
    * [Test] Fix out of disk error (#33732)
    
    Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.
    
    * [Data] Repurpose streaming CI to bulk CI (#33478)
    
    Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).
    
    * [Serve] Enable serve metrics lib working in ray actor (#33717)
    
    Make sure the Serve metrics lib works inside a Ray actor without a Serve context.
    ```python
    import ray
    import starlette.requests
    from ray import serve
    from ray.serve import metrics
    
    @ray.remote
    class MyActor:
        def __init__(self):
            # Counter created inside a plain Ray actor, outside any Serve replica.
            self.my_counter = metrics.Counter(
                "my_ray_actor",
                description=("The number of requests to this deployment."),
                tag_keys=("my_tag",),
            )
        def test(self):
            self.my_counter.inc(tags={"my_tag": "value"})
            return "hello"
    
    @serve.deployment(num_replicas=2)
    class Model:
        def __init__(self, model_name):
            self.my_actor = MyActor.remote()
    
        async def __call__(self, req: starlette.requests.Request):
            # Each request increments the counter held by the actor.
            await self.my_actor.test.remote()
            return
    ```
    
    * [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)
    
    Signed-off-by: sven1977 <[email protected]>
    
    * [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)
    
    If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.
    
    ---------
    
    Signed-off-by: amogkam <[email protected]>
    
    * @aslonnie's comments
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * @aslonnie's comments
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Run linter
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Run linters
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Change ray to 2.3.1 to work around the #ir-glorious-shape
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Revert to normal ray image
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Fix delete_fn
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)
    
    Add the ability for AnyscaleJobRunner to run on a GCE host. The added logic:
     - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
     - Add some sample tests that use GCE as an environment so we can run CI and check that this diff works
    
    - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
    - [X] I've run `scripts/format.sh` to lint the changes in this PR.
    - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825
    
    * Run lint
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)
    
    * Set up dependencies and credentials for GCE in Buildkite
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Add google-cloud-storage package to requirements
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Support for gs:// in anyscale job runner
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Correct adding gce tests
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * [RLlib] APPO TF with RLModule and Learner API (#33310)
    
    Signed-off-by: Avnish <[email protected]>
    
    * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)
    
    It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
    
    * @aslonnie's comments
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Run linter
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Run linters
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * -s
    
    * Fix some tests
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Add unit tests for test definition parser
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Fix lints
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * @aslonnie's comments
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Check that parse_test_definition throws exception on empty variations
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    * Remove the constant test definition in test.
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    
    ---------
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: Artur Niederfahrenhorst <[email protected]>
    Signed-off-by: Avnish <[email protected]>
    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: xwjiang2010 <[email protected]>
    Signed-off-by: sven1977 <[email protected]>
    Signed-off-by: amogkam <[email protected]>
    Signed-off-by: Cuong Nguyen <[email protected]>
    Co-authored-by: augray <[email protected]>
    Co-authored-by: Artur Niederfahrenhorst <[email protected]>
    Co-authored-by: Avnish Narayan <[email protected]>
    Co-authored-by: jiafu zhang <[email protected]>
    Co-authored-by: Qing Wang <[email protected]>
    Co-authored-by: Cindy Zhang <[email protected]>
    Co-authored-by: SangBin Cho <[email protected]>
    Co-authored-by: matthewdeng <[email protected]>
    Co-authored-by: kourosh hakhamaneshi <[email protected]>
    Co-authored-by: Clark Zinzow <[email protected]>
    Co-authored-by: James Clark <[email protected]>
    Co-authored-by: Archit Kulkarni <[email protected]>
    Co-authored-by: Jiajun Yao <[email protected]>
    Co-authored-by: xwjiang2010 <[email protected]>
    Co-authored-by: Sihan Wang <[email protected]>
    Co-authored-by: clarng <[email protected]>
    Co-authored-by: Eric Liang <[email protected]>
    Co-authored-by: Jian Xiao <[email protected]>
    Co-authored-by: Sven Mika <[email protected]>
    Co-authored-by: Amog Kamsetty <[email protected]>
    db0be11
  2. The cluster environment name does not allow the character '.', so fix that.
    
    Address Lonnie's comments and add more tests.
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 30, 2023
    c5c3c66
  3. Merge branch 'can-gce-01-test-variations' into gce03

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale authored Mar 30, 2023
    5fd05d9
  4. Remove import copy from test file

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 30, 2023
    58e72c3

Commits on Mar 31, 2023

  1. Address @krfricke's comments. I keep the __suffix__ as it is based on our conversation so far, but if anyone has a strong opinion feel free to let me know @aslonnie
    
    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 31, 2023
    Configuration menu
    Copy the full SHA
    b123bd2 View commit details
    Browse the repository at this point in the history
  2. Fix lint. Thanks @krfricke for catching!

    Signed-off-by: Cuong Nguyen <[email protected]>
    can-anyscale committed Mar 31, 2023
    690dd26
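
For readers skimming the commit list, the `variations` idea referenced in these commits can be sketched roughly as follows. This is a hedged illustration, not the code merged in this PR: the function names `expand_variations` and `cluster_env_name`, the `name.suffix` naming scheme, the shallow override of keys, and the choice of `_` as the replacement for `.` are all assumptions. Only the `__suffix__` key, the rejection of empty variations, and the restriction on `.` in cluster environment names come from the commit messages.

```python
import copy


def expand_variations(test_definition):
    """Expand one test definition into one concrete test per variation.

    Hypothetical sketch: each variation must carry a `__suffix__`, which is
    appended to the test name; its remaining keys override the base test.
    """
    if "variations" not in test_definition:
        return [copy.deepcopy(test_definition)]
    base = copy.deepcopy(test_definition)
    variations = base.pop("variations")
    if not variations:
        raise ValueError("variations cannot be empty in a test definition")
    tests = []
    for variation in variations:
        if "__suffix__" not in variation:
            raise ValueError("each variation needs a __suffix__")
        test = copy.deepcopy(base)
        test["name"] = f"{base['name']}.{variation['__suffix__']}"
        # Shallow override; a real implementation may merge nested keys.
        test.update({k: v for k, v in variation.items() if k != "__suffix__"})
        tests.append(test)
    return tests


def cluster_env_name(test_name):
    """Derive a name without '.', which cluster environment names disallow."""
    return test_name.replace(".", "_")


definition = {
    "name": "hello_world",
    "env": "aws",
    "variations": [
        {"__suffix__": "aws"},
        {"__suffix__": "gce", "env": "gce"},
    ],
}
tests = expand_variations(definition)
names = [test["name"] for test in tests]

try:
    expand_variations({"name": "empty", "variations": []})
    empty_rejected = False
except ValueError:
    empty_rejected = True
```

Keeping the suffix as an explicit key makes each generated test name predictable, which matters when the name is later reused for resources such as the cluster environment.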