Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Release Test] Remove runtime env usage from release tests #33288

Merged
merged 11 commits into from
Mar 17, 2023

Conversation

rkooo567
Copy link
Contributor

@rkooo567 rkooo567 commented Mar 14, 2023

Why are these changes needed?

Use SDK commands for all core tests.

It is because there was a big regression after migrating to V2 anyscale job runner.

Related issue number

Closes #32750

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: SangBin Cho <[email protected]>
@rkooo567
Copy link
Contributor Author

rkooo567 commented Mar 14, 2023

cc @scv119 I am verifying if this works now (for some reasons, I couldn't start the cluster. Will ping shomil for this). Can you tell me a list of tests to verify?

@Yard1 my assumption is this should be fine because we anyway sync working dir using the file manager. Is this correct?

Also, do you know exactly when the runtime env is used?

@ollie-iterators
Copy link

Speaking of release tests, why do the release tests show up much less results than the other tests in https://flakey-tests.ray.io/?

Signed-off-by: SangBin Cho <[email protected]>
@Yard1
Copy link
Member

Yard1 commented Mar 14, 2023

This will cause issues with tests that import from other files in the working directory, as those will not be propagated to other nodes by the file manager. Also, we need to ensure that the working directory is set correctly for imports.

What is the reason for this PR in the first place? The runtime envs in the tests take precedence over the job runtime env anyway (this is how we have set it up here, normally it's the other way with Ray).

@Yard1
Copy link
Member

Yard1 commented Mar 14, 2023

Also it's not a trivial change to Anyscale Jobs because we can't start a cluster and upload to it separately. We'd need to add complex logic to first upload the files to S3 and then download them onto the cluster, essentially reimplementing runtime envs ourselves.

@rkooo567
Copy link
Contributor Author

We will revert back to sdk manager. But before that, I'd like to just try if eager_installs=True can help. I think the "proper solution" from the prod (iiuc) is to include all necessary files and code inside cluster env, which probably takes some time to implement.

@rkooo567
Copy link
Contributor Author

eager install = True doesn't seem to fix the issue (not sure if it was actually used) https://buildkite.com/ray-project/release-tests-pr/builds/30942#0186e2af-fdf8-44aa-aa1a-c57d7952f906.

I am trying the regular SDK solution now

Signed-off-by: SangBin Cho <[email protected]>
Signed-off-by: SangBin Cho <[email protected]>
@rkooo567
Copy link
Contributor Author

hmm looks like it is not working (maybe sdk is not working with v2 stack)?

@rkooo567
Copy link
Contributor Author

rkooo567 commented Mar 15, 2023

@shomilj is the sdk_command API still available from the v2 stack?

@Yard1
Copy link
Member

Yard1 commented Mar 15, 2023

@rkooo567 it should be working fine, if you get an infra error, just retry

@rkooo567
Copy link
Contributor Author

@Yard1 it looks like the wait_for_nodes fail with status code 5555 (which is pretty weird). It doesn't seem to be an infra error. Let me investigate a bit

Signed-off-by: SangBin Cho <[email protected]>
@rkooo567
Copy link
Contributor Author

rkooo567 commented Mar 15, 2023

trying v1 + sdk commands now. We will try syncing files using cluster env later (after this PR)

Signed-off-by: SangBin Cho <[email protected]>
Signed-off-by: SangBin Cho <[email protected]>
Signed-off-by: SangBin Cho <[email protected]>
@rkooo567
Copy link
Contributor Author

https://buildkite.com/ray-project/release-tests-pr/builds/31218#0186ead3-1865-4cfc-9443-bb7c7fa9361e

The perf seems to be recovered.

cc @krfricke can you approve this PR as a code owner?

@Yard1
Copy link
Member

Yard1 commented Mar 16, 2023

@rkooo567 can you just add a comment to the code change explaining why this special case is added?

@rkooo567 rkooo567 changed the title [WIP] Remove runtime env usage from release tests [Release Test] Remove runtime env usage from release tests Mar 17, 2023
@rkooo567
Copy link
Contributor Author

The release test result lgtm. Since V1 stack wil be deprecated by end of April we should figure out the root cause of regressions in the new job runner. It looks like it is 4X slower for some reasons (and we verified it doesn't use the runtime env). I will create an issue.

@rkooo567 rkooo567 merged commit 76fd2cd into ray-project:master Mar 17, 2023
@rkooo567
Copy link
Contributor Author

This PR recovers test_per_seconds to 190 (40 in the nightly) and actors_per_second to 800 (240 in nightly) again.

ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023
…ct#33288)

Use SDK commands for all core tests.

It is because there was a big regression after migrating to V2 anyscale job runner.

Signed-off-by: Jack He <[email protected]>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…ct#33288)

Use SDK commands for all core tests.

It is because there was a big regression after migrating to V2 anyscale job runner.

Signed-off-by: Edward Oakes <[email protected]>
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
…ct#33288)

Use SDK commands for all core tests.

It is because there was a big regression after migrating to V2 anyscale job runner.
chaowanggg pushed a commit to chaowanggg/ray-dev that referenced this pull request Apr 4, 2023
…ct#33288)

Use SDK commands for all core tests.

It is because there was a big regression after migrating to V2 anyscale job runner.

Signed-off-by: chaowang <[email protected]>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…ct#33288)

Use SDK commands for all core tests.

It is because there was a big regression after migrating to V2 anyscale job runner.

Signed-off-by: elliottower <[email protected]>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…ct#33288)

Use SDK commands for all core tests.

It is because there was a big regression after migrating to V2 anyscale job runner.

Signed-off-by: Jack He <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[job] If runtime_env isn't specified for job, don't use any runtime_env for the supervisor actor/job driver.
4 participants