[Release Test] Remove runtime env usage from release tests #33288
Signed-off-by: SangBin Cho <[email protected]>
cc @scv119 I am verifying whether this works now (for some reason I couldn't start the cluster; I will ping shomil about this). Can you give me a list of tests to verify? @Yard1 my assumption is that this should be fine because we sync the working dir using the file manager anyway. Is this correct? Also, do you know exactly when the runtime env is used?
Speaking of release tests, why do the release tests show far fewer results than the other tests in https://flakey-tests.ray.io/?
This will cause issues with tests that import from other files in the working directory, as those will not be propagated to other nodes by the file manager. Also, we need to ensure that the working directory is set correctly for imports. What is the reason for this PR in the first place? The runtime envs in the tests take precedence over the job runtime env anyway (this is how we have set it up here; normally it's the other way around with Ray).
Also, it's not a trivial change to Anyscale Jobs because we can't start a cluster and upload to it separately. We'd need to add complex logic to first upload the files to S3 and then download them onto the cluster, essentially reimplementing runtime envs ourselves.
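To make the precedence point above concrete, here is a minimal sketch of "test-level runtime env overrides job-level runtime env". It is only an illustration under stated assumptions: `merge_runtime_envs`, `job_env`, and `test_env` are hypothetical names, the merge is a shallow dict merge, and none of this is taken from the actual release test code.

```python
from typing import Any, Dict


def merge_runtime_envs(
    job_env: Dict[str, Any], test_env: Dict[str, Any]
) -> Dict[str, Any]:
    """Hypothetical helper: keys set by the test override keys set by the job.

    This is a shallow merge, so a nested value such as "env_vars" is replaced
    as a whole rather than merged key by key.
    """
    merged = dict(job_env)   # copy so the job-level env is left untouched
    merged.update(test_env)  # test-level keys win on conflict
    return merged


# Hypothetical example values, not taken from any real release test.
job_env = {"working_dir": ".", "env_vars": {"A": "1"}}
test_env = {"env_vars": {"A": "2"}}

merged = merge_runtime_envs(job_env, test_env)
print(merged)
```

With the opposite precedence (job-level wins), the two arguments to the merge would simply be swapped.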
We will revert to the SDK manager. Before that, I'd like to check whether eager_installs=True helps. I think the "proper solution" from the prod (iiuc) is to include all necessary files and code inside the cluster env, which will probably take some time to implement.
eager_installs=True doesn't seem to fix the issue (not sure whether it was actually used): https://buildkite.com/ray-project/release-tests-pr/builds/30942#0186e2af-fdf8-44aa-aa1a-c57d7952f906. I am trying the regular SDK solution now.
Hmm, it looks like it is not working (maybe the SDK is not working with the v2 stack?).
@shomilj is the sdk_command API still available from the v2 stack?
@rkooo567 it should be working fine. If you get an infra error, just retry.
@Yard1 it looks like wait_for_nodes fails with status code 5555 (which is pretty weird). It doesn't seem to be an infra error. Let me investigate a bit.
Trying v1 + SDK commands now. We will try syncing files using the cluster env later (after this PR).
https://buildkite.com/ray-project/release-tests-pr/builds/31218#0186ead3-1865-4cfc-9443-bb7c7fa9361e The perf seems to have recovered. cc @krfricke can you approve this PR as a code owner?
@rkooo567 can you add a comment to the code change explaining why this special case is added?
The release test result lgtm. Since the V1 stack will be deprecated by the end of April, we should figure out the root cause of the regression in the new job runner. It looks like it is 4X slower for some reason (and we verified it doesn't use the runtime env). I will create an issue.
This PR recovers test_per_seconds to 190 (40 in the nightly) and actors_per_second to 800 (240 in the nightly) again.
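As a quick sanity check on the numbers quoted above, the ratio between the recovered and nightly throughput can be computed directly. This is just arithmetic on the figures from the comment; the `speedup` helper and the `results` dict are illustrative names, not part of the release test tooling.

```python
def speedup(recovered: float, regressed: float) -> float:
    """Ratio of the recovered throughput to the regressed (nightly) throughput."""
    return recovered / regressed


# (value with this PR, value in the nightly), as quoted in the comment above.
results = {
    "test_per_seconds": (190, 40),
    "actors_per_second": (800, 240),
}

ratios = {
    name: round(speedup(recovered, nightly), 2)
    for name, (recovered, nightly) in results.items()
}
print(ratios)
```

Both ratios (4.75 and roughly 3.33) are in the same range as the "about 4X slower" estimate for the new job runner.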
…ct#33288) Use SDK commands for all core tests. It is because there was a big regression after migrating to V2 anyscale job runner. Signed-off-by: Jack He <[email protected]>
Why are these changes needed?
Use SDK commands for all core tests, because there was a large performance regression after migrating to the V2 Anyscale job runner.
Related issue number
Closes #32750
Checks
I've signed off every commit (using git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.