Deadlock with python 3.11 #8616

Closed
epizut opened this issue Apr 11, 2024 · 9 comments · Fixed by #8916
Labels
deadlock The cluster appears to not make any progress

Comments

@epizut

epizut commented Apr 11, 2024

I spent a lot of time digging into a deadlock issue with Dask distributed and Python 3.11.
This issue was created so other people won't have to lose the same amount of time debugging endless dask and CPython stacks.

Thanks to @diegorusso, we now have a pending fix: python/cpython#116969.

Solutions:

  • If you are forced to stay on Python 3.11, wait for #116969. Be aware that you also need either python<3.11.9 or dask>=2024.4.1 because of #11035.
  • Skip 3.11 and move to 3.12
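
A deployment could detect at startup whether it is running on the affected interpreter series. The sketch below is only an illustration: `is_python_311` is a hypothetical helper name, not part of Dask or CPython, and it checks only the major and minor version.

```python
import sys


def is_python_311(version_info=sys.version_info):
    """Return True when the interpreter belongs to the 3.11 series.

    Hypothetical helper for illustration; it only compares (major, minor).
    """
    return tuple(version_info[:2]) == (3, 11)


if __name__ == "__main__":
    if is_python_311():
        print("Running on Python 3.11: the deadlock described in this issue may apply.")
    else:
        print("Not running on Python 3.11.")
```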
@fjetter
Member

fjetter commented Apr 12, 2024

Thanks a lot for tracking this down and reporting it here!

fjetter added the "deadlock" label (The cluster appears to not make any progress) and removed the "needs triage" label on Apr 12, 2024
@diegorusso

diegorusso commented Apr 12, 2024

It really depends on whether they accept the PR. In theory, 3.11.9 is the last bug-fix release of 3.11. After that, there will only be source-only releases with security fixes for the distros to pick up.
I really hope they accept the PR because 3.11 will still be used for quite some time.

@epizut
Author

epizut commented Apr 12, 2024

Sure, thanks again for your work on this.
In the meantime, we could make the Dask GC diagnostic optional or, even better, try to make it non-reentrant on the CPython lock by avoiding the call to _current_frames().
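
To illustrate what an optional GC diagnostic that stays away from `_current_frames()` could look like, here is a minimal sketch built on the standard `gc.callbacks` hook. `GCTimer` is a hypothetical class written for this example. It is not Dask's actual diagnostic, and whether this shape would avoid the deadlock in practice is an assumption.

```python
import gc
import time


class GCTimer:
    """Hypothetical opt-in GC timing hook (not Dask's implementation)."""

    def __init__(self):
        self.total_seconds = 0.0
        self.collections = 0
        self._start = None

    def _callback(self, phase, info):
        # Only update simple counters here. The callback deliberately avoids
        # sys._current_frames() and any other call that inspects threads.
        if phase == "start":
            self._start = time.perf_counter()
        elif phase == "stop" and self._start is not None:
            self.total_seconds += time.perf_counter() - self._start
            self.collections += 1
            self._start = None

    def enable(self):
        # Register the hook only once.
        if self._callback not in gc.callbacks:
            gc.callbacks.append(self._callback)

    def disable(self):
        # Unregister the hook if it is currently installed.
        if self._callback in gc.callbacks:
            gc.callbacks.remove(self._callback)


if __name__ == "__main__":
    timer = GCTimer()
    timer.enable()      # opt in explicitly
    gc.collect()        # force one collection so the hook fires
    timer.disable()
    print(timer.collections, timer.total_seconds)
```

Because the hook is registered only when `enable()` is called, the diagnostic stays off unless a user opts in.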

@diegorusso

Hello, the current sentiment is not to include the fix in Python 3.11 unless the issue affects a considerable number of users. I just wanted to ask what the situation is on your side. Are you able to work around the problem, or do you have a fix for your use case? Feel free to comment directly on the PR python/cpython#117332

@epizut
Author

epizut commented May 28, 2024

I haven't taken the time to find a fix. Unfortunately, I selfishly skipped over version 3.11 and jumped straight to 3.12

@diegorusso

Do you think this could be a good solution for dask distributed users? I mean, could/should this be documented somewhere?

@epizut
Author

epizut commented May 28, 2024

It definitely should be better documented, which was my initial intention when I opened this issue.
Also, it would be interesting to get feedback from a Dask maintainer to decide whether to fix it (or rather work around it, since it's not Dask's fault) or just improve the documentation.

@diegorusso

Thanks @epizut, let's wait for some feedback from a Dask maintainer and take action after that.

@fjetter
Member

fjetter commented Jun 12, 2024

Apologies for the slow reply.

IIUC this deadlock can happen in pretty much every situation since our profiling thread is using sys._current_frames(). Do you know how easy it is to trigger this?

The CPython PR discusses enabling/disabling the GC. Is this a workaround that could work for us? (Not sure about perf implications)
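
For reference, the enable/disable idea could be sketched as follows: automatic garbage collection is paused only for the duration of the `sys._current_frames()` call. The helper names `gc_paused` and `snapshot_frames` are hypothetical. This is a sketch of the idea, not the fix that was merged, and whether it fully avoids the deadlock is an assumption. Note that `gc.disable()` is process-wide, so another thread could re-enable collection in the meantime.

```python
import gc
import sys
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Temporarily turn off automatic GC, restoring the previous state."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Re-enable only if collection was on before we paused it.
        if was_enabled:
            gc.enable()


def snapshot_frames():
    """Return {thread_id: frame} captured while automatic GC is paused."""
    with gc_paused():
        return sys._current_frames()


if __name__ == "__main__":
    frames = snapshot_frames()
    print(f"captured frames for {len(frames)} thread(s)")
```

The performance cost should be small if the pause only spans the call itself, but that would need measuring in a real profiling loop.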

Beyond this, I guess the only workaround we could provide is to disable the profiling thread entirely on python 3.11. Any other advice?


3 participants