
[tune] Disable pytorch-lightning multiprocessing per default #28335

Merged (9 commits) on Sep 13, 2022

Conversation

krfricke (Contributor) commented Sep 7, 2022

Signed-off-by: Kai Fricke [email protected]

Why are these changes needed?

PyTorch Lightning uses multiprocessing pools by default (e.g. for device lookup), which can lead to hangs (see #28328). This PR sets an environment variable to disable this behavior until #28328 is addressed.
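For context, a minimal sketch of the kind of workaround this PR applies. The exact environment variable and the place it is set in the Ray code base are assumptions here (the diff is not shown in this conversation); the sketch uses PyTorch Lightning 1.7's PL_DISABLE_FORK switch, which tells PTL to skip its fork-based device lookup:

    import os

    # Disable fork-based multiprocessing in PyTorch Lightning (e.g. its
    # device lookup) before pytorch_lightning is imported inside a trial.
    # setdefault keeps any value the user has already configured.
    os.environ.setdefault("PL_DISABLE_FORK", "1")

    import pytorch_lightning as pl  # import only after the variable is set

With the variable set, PTL should take a non-forking code path for device discovery, which is what avoids the hang described in #28328.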

Related issue number

Closes #28197

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

amogkam (Contributor) commented Sep 7, 2022

Thanks @krfricke! It seems this is a problem only with PTL 1.7 (Lightning-AI/pytorch-lightning#14292). Should we update the PTL version used in CI to verify that these changes work?

krfricke (Contributor, Author) commented Sep 8, 2022

Good idea, I'll add that to this PR.

Kai Fricke added 7 commits on September 8, 2022 at 10:33, each signed off (Signed-off-by: Kai Fricke <[email protected]>).
krfricke (Contributor, Author) commented Sep 9, 2022

@amogkam unfortunately we land in a dependency loop here: we can't upgrade to PTL 1.7.x because ray-lightning 0.3.0 is not compatible with it, and upgrading ray-lightning for compatibility requires changes to that library, whose CI won't pass because trials can hang while this fix is not merged.
I've run it manually with the fix and it works for me. Adding another CI pipeline with this specific set of requirements seems like too much overhead, so let's land this fix and then focus on PTL 1.7 compatibility for ray-lightning. OK?

Successfully merging this pull request may close this issue: [tune] PyTorch Lightning 1.7 with Ray Tune hangs