Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-115874: Fix segfault in FutureIter_dealloc #117741

Merged
merged 17 commits into from
Apr 19, 2024

Conversation

savannahostrowski
Copy link
Member

@savannahostrowski savannahostrowski commented Apr 11, 2024

@savannahostrowski
Copy link
Member Author

cc: @brandtbucher for his review as well

Copy link
Member

@gvanrossum gvanrossum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave it to @brandtbucher to review this -- I am somewhat disappointed that in order to fix this we have to copy a bunch of subtle code from inside PyType_GetModuleByDef() -- especially since there the code is enclosed in macros BEGIN_TYPE_LOCK() and END_TYPE_LOCK() (though those seem needed only because of the call to lookup_tp_mro() there).

At the same time I understand you don't want to call PyType_GetModuleByDef() and have to deal with the exception it raises in exactly the scenario we're trying to tiptoe around here.

@brandtbucher
Copy link
Member

At the same time I understand you don't want to call PyType_GetModuleByDef() and have to deal with the exception it raises in exactly the scenario we're trying to tiptoe around here.

It looks like it's actually a bit more subtle than that.

If I understand correctly, this isn't being done to avoid an exception, but to avoid a crash due to the type's MRO being NULL (this happens when the type and instance are part of a cycle, and the type is cleared before the instance is deallocated). In general, it seems unsafe to assume types aren't cleared in an instance's tp_dealloc, as we do in the current code.

@brandtbucher
Copy link
Member

@savannahostrowski, do you mind adding a NEWS blurb? No need to get too technical, just explaining that we've fixed a possible crash during garbage-collection of _asyncio.FutureIter objects.

Can you also add a comment or two to the new code referencing the issue number and explaining that:

  • We can't use PyType_GetModuleByDef, since the type might have already been cleared. This is also why we need to check that ht_module isn't NULL.
  • Since it's uncommon to subclass this type, it's fine (and probably faster for the common case!) to just check our type's module and not bother walking the MRO.
  • This means subclasses can't make use of the free list... oh well.

Can you also confirm for me that our old reproducer crashes on 3.12? If so, we can add the test (it's okay if it no longer works on 3.13, probably still good to have) and flag this for backport.

@gvanrossum
Copy link
Member

If I understand correctly, this isn't being done to avoid an exception, but to avoid a crash due to the type's MRO being NULL (this happens when the type and instance are part of a cycle, and the type is cleared before the instance is deallocated). In general, it seems unsafe to assume types aren't cleared in an instance's tp_dealloc, as we do in the current code.

Hm, that would point to a pretty universal problem (the instance being cleared after the type, when both are involved in a cycle). Why isn’t that crashing other code? What’s special about this example?

(I am pushing back on this because the fix requires breaking through a nice abstraction, potentially in many more cases.)

@savannahostrowski
Copy link
Member Author

savannahostrowski commented Apr 13, 2024

Thanks for the feedback, folks. I'll add some comments to the code.

@brandtbucher I can confirm that the segfault still repros on 3.12 so I can add a test here. I know you did some investigation in other modules and that's how you found this issue. That said, is it worth doing a bit more spelunking to understand if this is happening in other places as well? I'm new to this part of the codebase to really understand how prolific this might be 😅 but I want to address concerns here.

@erlend-aasland

This comment was marked as outdated.

@brandtbucher
Copy link
Member

brandtbucher commented Apr 16, 2024

Hm, that would point to a pretty universal problem (the instance being cleared after the type, when both are involved in a cycle). Why isn’t that crashing other code? What’s special about this example?

What's special is that only tp_clear and tp_dealloc functions are susceptible to this fragile situation. Doing anything other than untracking, clearing, and freeing objects in these functions is rare. Clever stuff belongs in finalizers, where you have an object in a known sane state... the obvious exception being freelists like this, which must go in a dealloc function, since it's "freeing" the memory.

(I am pushing back on this because the fix requires breaking through a nice abstraction, potentially in many more cases.)

In the original issue, we found a similar crash in itertools.tee. I was worried too, but after a (very) quick scan of literally every function matching \w+(clear|dealloc)\( in C code, this was the only other one that worried me.

A perhaps more "principled" fix would be to change PyType_GetModuleByDef to raise in the case where tp_mro is NULL and change _asyncio to handle the error instead of asserting that its return value is non-NULL. But that's a lot more invasive, and will slow down the happy path of common operations to work around a really tricky edge case that doesn't affect that much code.

In my opinion, the bug here is assuming that your type hasn't been cleared in a clear or dealloc func. That's incorrect.

@brandtbucher
Copy link
Member

I know you did some investigation in other modules and that's how you found this issue. That said, is it worth doing a bit more spelunking to understand if this is happening in other places as well? I'm new to this part of the codebase to really understand how prolific this might be 😅 but I want to address concerns here.

See my comment above. I'm reasonably confident that there aren't other offenders, but another pair of eyes definitely wouldn't hurt.

Modules/_asynciomodule.c Outdated Show resolved Hide resolved
Modules/_asynciomodule.c Outdated Show resolved Hide resolved
Modules/_asynciomodule.c Outdated Show resolved Hide resolved
Modules/_asynciomodule.c Outdated Show resolved Hide resolved
Modules/_asynciomodule.c Show resolved Hide resolved
@gvanrossum
Copy link
Member

In my opinion, the bug here is assuming that your type hasn't been cleared in a clear or dealloc func. That's incorrect.

Hmm... But it's an easy trap to fall into. Not exactly a bug magnet (only two instances in CPython itself), perhaps, but hard to debug, and hard to reason about: it seems it only happens when the type is cleared first. So isn't the bug that the type is cleared while it still has instances?

Do we understand how exactly this happened? Is module_clear called prematurely?

@brandtbucher
Copy link
Member

So isn't the bug that the type is cleared while it still has instances?

Do we understand how exactly this happened? Is module_clear called prematurely?

My hunch is that we have a cycle: instance -> type -> module -> [...] -> instance. I don't think the GC can be faulted for breaking the cycle at type -> module in this case, since it just chooses an element to clear essentially at random. Here, it just happens to be the type.

@erlend-aasland
Copy link
Contributor

@erlend-aasland I took a look at your suggestion but in this case, even if we stored the state in futureiterobject, I think that the state could still be cleared before the instance is deallocated, making it possibly as unreliable as the module to grab the necessary info for deallocation.

Yes, I noticed that after reading through the discussion again (which is why I marked my comment as outdated); it is an unfortunate issue. Well, thanks a lot for implementing this workaround :)

@brandtbucher brandtbucher enabled auto-merge (squash) April 19, 2024 22:13
@brandtbucher brandtbucher merged commit d8f3503 into python:main Apr 19, 2024
36 of 48 checks passed
@brandtbucher brandtbucher added the needs backport to 3.12 bug and security fixes label Apr 19, 2024
@miss-islington-app
Copy link

Thanks @savannahostrowski for the PR, and @brandtbucher for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Apr 19, 2024
(cherry picked from commit d8f3503)

Co-authored-by: Savannah Ostrowski <[email protected]>
@bedevere-app
Copy link

bedevere-app bot commented Apr 19, 2024

GH-118114 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 bug and security fixes label Apr 19, 2024
brandtbucher pushed a commit that referenced this pull request Apr 19, 2024
GH-115874: Fix segfault in FutureIter_dealloc (GH-117741)
(cherry picked from commit d8f3503)

Co-authored-by: Savannah Ostrowski <[email protected]>
@vstinner
Copy link
Member

Thanks for fixing this crash!

willingc pushed a commit that referenced this pull request Jul 12, 2024
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jul 12, 2024
…Iter_dealloc`) (pythonGH-121638)

Address comments
(cherry picked from commit 65feded)

Co-authored-by: Savannah Ostrowski <[email protected]>
willingc pushed a commit that referenced this pull request Jul 12, 2024
…eIter_dealloc`) (GH-121638) (GH-121642)

Update retroactive comments from GH-117741 (segfault in `FutureIter_dealloc`) (GH-121638)

Address comments
(cherry picked from commit 65feded)

Co-authored-by: Savannah Ostrowski <[email protected]>
estyxx pushed a commit to estyxx/cpython that referenced this pull request Jul 17, 2024
@savannahostrowski savannahostrowski deleted the fix-115874 branch September 27, 2024 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants