Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RuntimeEnv, Windows] Fix working_dir, pip & conda for windows #28589

Merged
merged 11 commits into from
Sep 28, 2022

Conversation

jbedorf
Copy link
Contributor

@jbedorf jbedorf commented Sep 18, 2022

Why are these changes needed?

Various options of the runtime_env method, when using Windows, are currently broken due to the changes in this PR. That PR removed the use of the command_prefix in the context. This work restores the usage of that. However, due to the different launch methods between Linux and Windows the changes had to be made in multiple locations to ensure that lists are returned instead of strings.

To prevent this breaking in the future various tests have been fixed/enabled. However, there are some lingering issues with the tests:

  • The top level folder of the working_dir is not deleted due to not being able to delete a folder that is in use on Windows. This is accounted for in the tests by whitelisting that particular folder in various tests.
  • For the conda environment option the deleting is not complete. Similar the deleting fails because a number of files and folders are in use. Typically they are in use by the ray.util.client.server process. As such not all tests are enabled for Windows as they would keep failing, and leave behind sizeable temporary files.

Other:

  • Fixed issue where Ray and the temp folder are placed on different drives and as such the change directory method did not work.
  • Fixed a number of tests that had recursion errors on Windows

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jeroen Bédorf <[email protected]>
Signed-off-by: Jeroen Bédorf <[email protected]>
Signed-off-by: Jeroen Bédorf <[email protected]>
@jbedorf jbedorf changed the title [RuntimeEnv, Windows] Fix runtime dir windows [RuntimeEnv, Windows] Fix working_dir, pip & conda for windows Sep 18, 2022
@jbedorf jbedorf marked this pull request as ready for review September 19, 2022 05:37
@jbedorf
Copy link
Contributor Author

jbedorf commented Sep 19, 2022

@mattip Given that you previously worked on Windows related PRs could you help find some reviewers? Thanks!

@jbedorf
Copy link
Contributor Author

jbedorf commented Sep 22, 2022

@architkulkarni Can you help move this forward? Thanks!

Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

Regarding the conda tests:

As such not all tests are enabled for Windows as they would keep failing, and leave behind sizeable temporary files.

Certainly if they're failing we shouldn't enable them in the PR, but for those tests where the only issue is leaving behind temporary files, can we try to enable them in this PR? If the temporary files somehow cause a problem, we'll see it in the CI run for this PR.

else:
context.command_prefix += [
_PathHelper.get_virtualenv_activate_command(target_dir)
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make get_virtualenv_activate_command always return List[str] to avoid this if-else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was thinking more about it. A cleaner solution would to have always add lists to the command_prefix item and then do the combination of the items to string for Linux/Mac and keep it as list items for Windows. So we would have to update it for each of the three plugins.


if set(items) == whitelist:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the term "whitelist" I would expect subset instead of == here, what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, will update.

@jbedorf
Copy link
Contributor Author

jbedorf commented Sep 22, 2022

As such not all tests are enabled for Windows as they would keep failing, and leave behind sizeable temporary files.

Certainly if they're failing we shouldn't enable them in the PR, but for those tests where the only issue is leaving behind temporary files, can we try to enable them in this PR? If the temporary files somehow cause a problem, we'll see it in the CI run for this PR.

The tests will pass when adding the same construct as added in some of the other tests:

        whitelist = get_local_file_whitelist(cluster, option)
        wait_for_condition(lambda: check_local_files_gced(cluster, whitelist))

Basically ignore the top level folder in the conda subfolder. It worked fine, but because the temporary session folders are not cleaned during the tests there will be increased disk usage due to various DLLs and exe files that are left behind in the conda folder. For the workdir folders that was not an issue as those were empty. I can enable the tests and hope the CI servers don't run out of disk space next time the tests run 😬

@architkulkarni architkulkarni added the @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. label Sep 22, 2022
@jbedorf
Copy link
Contributor Author

jbedorf commented Sep 26, 2022

@architkulkarni Has something changed with the buildkite settings? After updating the code and importing the latest master I'm unable to view the build results/details. This makes it impossible to see why some of the steps failed.

Looking at the links it appears the address previously was ray-project/ray-builders-pr/.. and now ray-project/oss-ci-build-pr

@architkulkarni
Copy link
Contributor

@jbedorf Not 100% sure if it's related, but you're right that there was a recent change to our CI pipeline, but it shouldn't require any special action on our part. I would suggest merging the latest master but it looks like you've already done that.

Which parts can you no longer see? From what I can tell at https://buildkite.com/ray-project/oss-ci-build-pr/builds/396
Screen Shot 2022-09-26 at 8 27 56 AM some runtime env tests are failing on linux/mac, which should be reproducible locally.

@jbedorf
Copy link
Contributor Author

jbedorf commented Sep 26, 2022

I see, I guess the project settings have changed regarding anonymous access. When I click on the current links it tells me I need a buildkite account. Whereas previous builds could be accessed by anyone. You can see it by opening the buildkite links in an incognito window.

Anyway, I'll update my Linux environment and take a look at the failing tests.

@pcmoritz
Copy link
Contributor

@jbedorf Thanks for your feedback about the buildkite visibility, this is not intended. We are looking into fixing it!

cc @simon-mo @thomasdesr

@pcmoritz
Copy link
Contributor

I believe this is fixed now -- I tried it out via incognito mode :)

Thanks @simon-mo and @krfricke for fixing!

@jbedorf
Copy link
Contributor Author

jbedorf commented Sep 27, 2022

I believe this is fixed now -- I tried it out via incognito mode :)

Thanks @simon-mo and @krfricke for fixing!

Yep, it works now. Thanks!

Signed-off-by: Jeroen Bédorf <[email protected]>
Signed-off-by: Jeroen Bédorf <[email protected]>
Signed-off-by: Jeroen Bédorf <[email protected]>
@jbedorf
Copy link
Contributor Author

jbedorf commented Sep 28, 2022

@architkulkarni
Please have a another look. The requested changes are made and some more tests are enabled. All related tests do pass, there are some failing tests that appear to either be known issues and/or failing for other PRs as well.

For example:

@architkulkarni
Copy link
Contributor

https://flaky-tests.ray.io/
Windows test_traceback broken on master
Wheels and Jars timeout likely unrelated
Linkcheck failure unrelated (no doc changes)
test_usage_stats broken on master
test_client_proxy broken recently on master

@architkulkarni architkulkarni added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. labels Sep 28, 2022
Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Before I merge this, it would be good to have a review from Windows expert @mattip who should be back from OOO soon.

Copy link
Contributor

@mattip mattip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice. There is only once small nit with changing shell=True but I think that is OK. I like that there are more tests enabled and the use of pytest for the tmpdir fixture.

@@ -82,7 +82,8 @@ def exec_worker(self, passthrough_args: List[str], language: Language):
)
logger.debug(f"Exec'ing worker with command: {command_str}")
if sys.platform == "win32":
subprocess.run([executable, *passthrough_args])
cmd = [*self.command_prefix, executable, *passthrough_args]
subprocess.Popen(cmd, shell=True).wait()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some slight differences when running with shell=True. Is this intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that the tests ran more stable with shell=True. I honestly don't understand exactly why as the tests that fail with shell=False often pass when I run them individually. Maybe it has to do with the way pytest runs/keeps the state between runs, but I couldn't pin it down exactly.

@pytest.mark.skipif(
os.environ.get("CI") and sys.platform != "linux",
reason="This test is only run on linux CI machines.",
)
def test_pip_with_env_vars(start_cluster):
def test_pip_with_env_vars(start_cluster, tmp_path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, this is a welcome change

@architkulkarni architkulkarni merged commit de79e6d into ray-project:master Sep 28, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…roject#28589)

* Fix working_dir, conda and pip options for Windows

Signed-off-by: Jeroen Bédorf <[email protected]>

* More test fixes

Signed-off-by: Jeroen Bédorf <[email protected]>

* More test fixes

Signed-off-by: Jeroen Bédorf <[email protected]>

* More test fixes and style format fixes

Signed-off-by: Jeroen Bédorf <[email protected]>

* Style fixes

Signed-off-by: Jeroen Bédorf <[email protected]>

* Restructure, enable more tests

Signed-off-by: Jeroen Bédorf <[email protected]>

* Initial fixes to tests

Signed-off-by: Jeroen Bédorf <[email protected]>

* Fix lint errors

Signed-off-by: Jeroen Bédorf <[email protected]>

* Fix style

Signed-off-by: Jeroen Bédorf <[email protected]>

Signed-off-by: Jeroen Bédorf <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants