
[tune] Disable pytorch-lightning multiprocessing per default #28335

Merged (9 commits) on Sep 13, 2022

Conversation

krfricke (Contributor) commented Sep 7, 2022

Signed-off-by: Kai Fricke [email protected]

Why are these changes needed?

PyTorch Lightning uses multiprocessing pools by default (e.g. for device lookup), which can lead to hangs (see #28328). This PR sets an environment variable to disable this behavior until #28328 is addressed.
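For context, a minimal sketch of the kind of workaround this PR applies. The exact environment variable and the place it is set in the Ray code base are assumptions here (the diff is not shown in this conversation); the sketch uses PyTorch Lightning 1.7's PL_DISABLE_FORK switch, which tells PTL to skip its fork-based device lookup:

    import os

    # Disable fork-based multiprocessing in PyTorch Lightning (e.g. its
    # device lookup) before pytorch_lightning is imported inside a trial.
    # setdefault keeps any value the user has already configured.
    os.environ.setdefault("PL_DISABLE_FORK", "1")

    import pytorch_lightning as pl  # import only after the variable is set

With the variable set, PTL should take a non-forking code path for device discovery, which is what avoids the hang described in #28328.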

Related issue number

Closes #28197

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

amogkam (Contributor) commented Sep 7, 2022

Thanks @krfricke! It seems this is a problem only with PTL 1.7 (Lightning-AI/pytorch-lightning#14292). Should we update the PTL version used in CI to verify that these changes work?

krfricke (Contributor, Author) commented Sep 8, 2022

Good idea, I'll add that to this PR.

Kai Fricke added 7 commits on September 8, 2022 at 10:33, each signed off (Signed-off-by: Kai Fricke <[email protected]>).
krfricke (Contributor, Author) commented Sep 9, 2022

@amogkam unfortunately we land in a dependency loop here: we can't upgrade to PTL 1.7.x because ray-lightning 0.3.0 is not compatible with it, and upgrading ray-lightning for compatibility requires changes to that library, whose CI won't pass because trials can hang while this fix is not merged.
I've run it manually with the fix and it works for me. Adding another CI pipeline with this specific set of requirements seems like too much overhead, so let's land this fix and then focus on PTL 1.7 compatibility for ray-lightning. OK?

Successfully merging this pull request may close this issue: [tune] PyTorch Lightning 1.7 with Ray Tune hangs