-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG]: ABI incompatibility between 2.9.2 and 2.10 #4105
Comments
From Gitter:
|
We need something to work off from, ideally a reproducer, but at least a careful description of the steps and symptoms to reproduce. I think I've seen all changes between 2.9 and 2.10, but I wouldn't know how we may have accidentally broken ABI compatibility. There is also a chance that it only worked by luck before, and some small perturbation makes it fall over now. ABI compatibility is very tricky to be certain about, especially because we don't even have any directly related unit testing (I believe it's impractical). |
Thanks @rwgk . I can put together an example using |
Sorry I sure won't have time to debug this. I'm just hoping for some concrete clues. If that doesn't lead to easy insights, I'd lean toward bumping the internals version. We are holding back an improvement for some time already, behind an ifdef. If we now have an unintentional break, I'd say let's make that a trigger for a version bump rather than spending time on looking back. Sooner or later ABI breaks are unavoidable. |
Honestly, even if you have a massive project, as long as it can compile in finite time, and can be compared with live checkouts of pybind11, I could probably fire and forget a git bisect and detect where it broke. Currently it's a handwavy "something broke in my library". |
Thanks @henryiii - I guess you mean that I can build Project A with pybind11 2.9.2, and Project B with pybind11 on current master and see if the ABI compatibility is resolved? |
Yes. If it's not, if you could provide an ocp example I can test, I can run a got bisect and see where it started failing. |
I have a suspicion that PR #1895 could be the reason. When I merged it I was thinking we're safe, but in the meantime I did some unrelated work (#4043, #4050) and what I learned made me think maybe not. That PR cannot easily be rolled back, even locally ( I wanted to let you know this suspicion before you spend time on a reproducer. I really want to know, too, what the root cause is, but then we will have the question:
What is better overall? I think backtracking could be extremely disruptive. Patching will take effort to figure out & validate. Incrementing the ABI version, after many years, is a clean, simple move. The fact that nobody has given us a reproducer suggests to me that ABI compatibility is important, but not actually critical. Ultimately: We don't have any unit testing for ABI compatibility. That would be a useful contribution to make for someone with a vested interest, protecting us from accidents in the future. |
Just to provide some clarity on my own use case - I don't particularly care if versions 2.9 and 2.10 are ABI compatible. However, it would be nice to know when an ABI-compatibility breaking change has been made. In the conda-forge feedstock for pybind11, the To this end, the unit testing suggested by @rwgk would be more personally useful, rather than attempting to make 2.10 ABI-compatible with 2.9. |
This is the one and only report of ABI breakage. I've worked through the other two issues I thought might be ABI breakages, and both of them were something else. Without any example code or reproducer, I think this one issue of private code with no reproducer is not enough evidence to say 2.10 has broken ABI. We have many major users of our ABI that haven't complained. I am hoping it's because they upgraded and were fine, and not because they haven't upgraded for other reasons! I think we should continue with 2.10.1, bump the ABI for 2.11, which will follow not too far in the future, and move on. We really should add an ABI test in the future.
This is only true if you use the pybind11 conda-forge package, which many packages do not. I checked; SciPy does, while PyTorch does not. Also, Conda-forge is not the only packaging ecosystem. Many packages ship on PyPI and there's no ABI control there. If we bump ABI, every single PyTorch extension must be recompiled when PyTorch picks up our ABI bump. They will not be cross compatible anymore. Something users have been able to rely on since 2.2-2.4 or so. Also, @rwgk wants to allow selectable ABI with one release having several choices, which totally breaks the conda-forge model of a single ABI. So it's nice, but ABI bumps are not trivial things for a large number of pybind11's users just because conda-forge has a nice mechanism for it. ;) |
Thanks for doing that! Do you have the issue numbers handy?
Are you referring to #4218? What's in there is barely more than floating an idea, I defended it a little bit when @jbms suggested we can bundle the GIL-related We already have selectable ABI versions, 4 or 5, since PR #3275 was merged on Sep 20, 2021. Google-internally we're running with 5 since the same day (an automatic patch is applied to internals.h when we're importing the smart_holder branch). This was after the 2.7.1 release on Aug 2021, but before the 2.8.0 release on Oct 4, 2021. I.e. the selectable ABI situation already exists for over a year in released versions. Google cannot wait a year+ for external Coming back to the problem reported here, my suspicion under #4105 (comment) is related to
I'm beginning to think if that's actually behind the problem reported here, it is not an Also note that it is really tricky to get a handle on ODR violations. How those play out is extremely platform dependent. I'm not aware of reliable tools for detecting ODR violations. There is this in ASAN, but I know for sure that it does not detect something as severe as a |
@henryiii could you please let me know the two issue numbers? I'm thinking of working on a demo for the suspected ODR. I expect that it will be a non-trivial effort. I'd like to look at the two issues to see if there is anything that could be related and maybe help me. |
Without any information to reproduce the issue or even a deception of exactly how the two libraries are failing to work together, it is difficult to do much with this report. Even a large reproduction would probably make it easy to identify the problem. The exception types would only create potential abi incompatibility if they are possibly thrown from one extension and caught in another. I haven't yet analyzed the internals data structures enough yet to see if that can happen. If it can, then the exception types would indeed be part of the abi. Going forward, it would be better if the internals data structures didn't involve the pybind11 exception types. |
It's only relatively recently that we added
PR 2999 added a "Note" that makes me think only some environments actually need it, near the bottom of this section: Making a very big jump from here: @whophil could change pybind11/detail/common.h to replace this entire block
with just
Does that resolve your issues? |
I think I believe #2999 unintentionally punched a hole into the ABI protection layer (sloppily speaking). #2999 was merged between pybind11 releases v2.6.2 (Jan 26, 2021) & v2.7.0 (Jul 16, 2021), but I don't expect issues between those because v2.6.2 doesn't have the hole yet. I also don't expect issues between following releases, assuming we didn't touch the definitions of the exported exceptions. But 2.10 changed the definition of Still a lot of guessing here. @whophil if you could help with experiments from your end that would be extremely valuable. |
Regarding #2999 --- the original report in #2847 didn't involve pybind11 exception types at all, and changing the visibility of the pybind11 exception types would have no effect there. However, #3016 changes the original example to inherit from a pybind11 exception type. There are really 2 forms of ABI compatibility that we need to consider:
Case 1 is what I have normally considered when thinking about ABI compatibility --- the only interaction happens through the internals data structures, so as long as they are compatible it is okay if PA and PB use a different version of pybind11 headers. (But note that I have not yet confirmed that exception types never pass through functions registered in the internals data structures.) For case 2, essentially any pybind11 type could be used across an ABI boundary, and therefore it requires a much stronger form of ABI compatibility. I think we should say that in case 2, library A and B must use exactly the same version of pybind11. Unfortunately, I don't know of a way to try to enforce this restriction automatically, so that an error is produced if it is violated. |
They are already linked above on Aug 24: #4150 & #4137. Both turned out to not be ABI breakages, but something else. I don't think they are helpful in debugging. This issue is the only report of ABI breakage we have, and it's for private code we can't see.
It's quite possible this is the problem. Many, I might even postulate most, pybind11 extensions don't use ABI compatibility at all - it's a nice but slightly niche feature (at least between separately compiled extensions - I'm not counting multiple extensions in one package, as those are always compiled with the same version of pybind11 anyway so it doesn't matter in this context). Of those that do use it (like PyTorch and several other ecosystem-like packages - due to pybind11's usage even a few percent is potentially hundreds of packages), it might be quite rare to produce and catch errors across extensions. If this is true, then 2.10 is only an ABI breaking release if you fit into this narrow category. Our ABI support is currently a best-effort project, and it's held up pretty well from 2.4-2.9/2.10. It's extremely useful for our users to be able to upgrade pybind11 without worrying about consequences - for most users, they want to use the latest Python version and such but don't want to have to force their ecosystem to upgrade. For example, PyTorch would have to force for all dependent packages to upgrade pybind11 if they upgraded past an ABI boundary. They've pushed us in the past to make a way to make ABI protection looser, not stronger. And our ABI protection fixes nothing (not counting systems that use it like conda-forge to do migrations). It just replaces a potential for a bad error with an arguably better/arguably worse error. (Specifically, it simply won't work because it won't see the extension as compatible - someone developing an extension might actually prefer a segfault over it simply not seeing it at all - at least then you have a clue it's configured correctly but incompatible). Specifically, "only matching versions" would be useless - someone can do that already by manually forcing matching pybind11 versions. But that means you fully control the stack - which is not the case that we are trying to address with ABI versioning in the first place. Repeat: You only get a different error with the ABI number (outside of conda-forge's ABI pinning & migrations!). Bumping the ABI number doesn't somehow mean you can use 2.4-9 and 2.x together, it just they don't work together at all - avoiding a segfault or similar, but not "fixing" anything. I very much doubt we'll convince PyTorch and OCP to update to 2.x with a new ABI soon and force them to migrate all dependent users on our schedule. So unless we know the ABI is broken beyond repair, then continuing to support 2.10.x will benefit most of our users! In an ideal world, my proposal would be:
Of course, we don't have a developer dedicated to the ABI - Google doesn't care about an ABI since they control the stack. I'm not full time on pybind11, my focus is on scikit-build for the next few years (and teaching for the next few weeks). Etc. So we'll probably do a subset of the above, but I think it's a good plan in general. In summary, I firmly believe 2.10.1 will be the best version of pybind11 we've ever released. It's the culmination of a lot of hard work, and real-world usage and bug reports. There might be some small concern about ABI compatibility, but it's no worse than 2.10.0, which generated only this one bug report. There might be some concern about GIL reentry, but it's no worse then 2.9, 2.8, 2.7, 2.6, 2.5, 2.4, and many versions before that too - it's simply not supported yet (and 2.10 offers a possible workaround, so that's a net improvement). In basically every other way, it is a clear upgrade. Also, I think we should always be willing to provide a patch release if one is needed. Patch releases never promise to every known bug, they are just provide small iterative fixes for whatever bugs we can address. |
Thanks @henryiii for the issue numbers. Good for me to be able to start from a certain place. I looked at #4137 and closed it after confirming conclusively it's a GIL-not-held issue. Hopefully #4246 will stave off such bug reports in the future, after we merge it. Similarly, I closed #4150 after I realized and explained that it's a DECREF-after-interpreter-finalization. Same conclusion as yours: nothing to do with an ABI break, for sure.
That's only true internally. Google has a big OSS presence. We do care. At least I do, in particular because I'm the one who's pushing the PyCLIF-pybind11 integration, with the promise that we connect better to the OSS world. I want to take the
2.7.0 was the release that brought in the suspected compatibility trap, but only 2.10.0 triggered it, via #1895. We don't know that with certainty, but we have a very strong suspicion. I feel it will be wise to not make a public move before we at least hear back from @whophil (trying the empty |
You are welcome to take Releasing 2.10.1 does not mean we promise to fix every known bug (otherwise, we have 400+ open issues someone needs to work on). It means we fix some known bugs. And we have. A Python 3.11 embedding bug, a MSVC 2019 C++14 bug, an NVCC 11.4-11.8 bug, a PyPy fix, a C++23 bug, a vcpkg fix, A Python 3.12 fix, and a pkg-config file added. Combined, these affect far more users than I won't release 2.10.1 without at least other maintainer's approval, and I'd really like @rwgk's approval. |
Why slip into such radical language? I just want to wait a day or two to hopefully hear back from @whophil. If he confirms that the empty |
I'm just quoting you. You have said we need to abandon the 2.10 release series. Multiple times (#4218 (comment), #4125 (comment)). If we could "yank" the 2.10.0 release, that would be an option. But we can't - it's the only 3.11 compatible release, and already in heavy use. I've moved over every project I work on to it, and many others have done the same. This issue is two and a half months old. I've been saying we would make a 2.10.1 release before 3.11 for months. I've let it slip till the last possible second (today is release day). I don't think one issue with one unverified problem should stop the release. We can always make a 2.10.2! Numbers are cheap. Patch releases should be frequent (say monthly).
Our time matters too. I've spent a lot of time building the other issues's code to verify they were not ABI issues. I can't work on this one because there's nothing public to work on. If someone hits this, hopefully they will open an issue with public code that we can work on.
If you can set a date, I disagree that it's necessary to make a patch release with unrelated fixes, but I'll agree to wait till then - I'm not releasing without maintainer sign-offs. I had already set a date and nothing happened before that date. |
"Google doesn't care" and turning "abandon a release series to protect users" into "abandon users" will not get us anywhere good.
I know. I'll focus on doing homework now, similar to what I did with PRs #4022, #4050, and #4072. Please don't try to coerce me into approvals while I figure this out, by suggesting carelessness and bad intentions. Abandoning users is of course irresponsible. Yesterday already I sent a comment that the developments here ("it's not an |
"Abandon a release series to protect users" is abandoning users. If users have already moved to 2.10 (as many of them have), refusing to release important bug fixes because you are afraid they will hit a bug you can't fix and then associate your name to that bug is abandoning them. They can't downgrade to 2.9 without losing Python 3.11 support! All software might have bugs. We have to try to do our best on anything we maintain, and accept there will be things we can't address every release. Even if there was a major ABI incompatibly, a large fraction of our users don't care about ABI compatibility and those users matter too. The ABI compatibly feature of pybind11 is a powerful but rather specialized feature that we currently don't even test. A few users really care and use it. Just like a few(er number of) users really care about and use the GIL dissociation feature. And 2.10.1 wouldn't have been any worse than 2.10.0, it just would have been the same. We've now missed the Python 3.11 release party so I don't care anymore. My work was for nothing. Do what you want. I have to go back to teaching. |
I'm really sorry you feel that way. I did review #4279 as quickly as I could, wrote that I'm not opposed to the release, based on believing we're probably up against an ODR issue that cannot be fixed by bumping the Me and everybody makes mistakes. I believe it's important that everybody acts to the best of their knowledge, we vote, and go with that, averaging out individual mistakes. That process worked out in your favor, I totally respect that, as I wrote before. TBH, I only have a very crude idea of what the Python 3.11 release party is, I'm still not sure, I didn't give that a lot of weight. I'm quite often in a situation where I have to decide: make a move? or more due diligence? I got burned pretty badly a few times moving too fast internally, sending some teams into distress. Each time I decided to go a notch slower in general. That's what I'm still going by w.r.t this bug. I also want to add that I don't feel great about being that one who merged both #2999 and #1895, overlooking the ODR potential arising from the combination of the two. @whophil I made what I was hoping could be a reproducer based on my guessing: but the one time that I'm hoping the CI doesn't pass, it passed very easily:
This has happened to me many times before, guess-constructing a failure is a little bit like predicting the next lottery. If there is any chance you could try out my very simple suggestion, that could be an extremely valuable clue. Reducing from something big often looks daunting at first, but is a much safer (and in aggregate more efficient) way of finding root causes. I'll keep experimenting and asking around until I find a way to tease out observable failures caused by the ODR in #4283, or I convinced myself that the combination of #2999 & #1895 is actually inter-version-ABI-safe on all platforms. |
Hi all. Sorry I haven't been able to follow all the action on this thread! I worked on creating a reproducer over the weekend, but am having trouble producing a minimal one. The real private project is still failing when I build it with pybind 2.10 and try to use it with a pybind 2.9.2-compiled shared library, so there is definitely an issue. I think the jury is out on whether it is actually an ABl issue - I suppose that's what the reproducer aims to find out. I will work on the reproducer as much as I can, but can't make any guarantees on when it will be done. |
@whophil If I understand your situation correctly, you're building from 2.10, could you please try this tiny side-project, with everything else exactly as you have it already? Simply apply this one-line change:
I.e. simply chop off (I do have another idea that is only slightly more involved, but let's try this one first.) |
Have you tested the latest master of pybind11? We fixed a variety of things that could possibly be related. Have you tested building the parent library with 2.10? IMO, this would be the most important thing to test. Showing 2.10 - 2.10 works would verify this is related to ABI (or a related leakage or ODR violation). Both of these checks require no code. :) Also, what sort of error are you getting? |
@henryiii @rwgk I have some rather embarrassing news to report. In building my reproducer, I was encountering really weird and inconsistent behavior. Digging deeper, I found that the issue does not seem to be in pybind11 at all - but rather related to some weird interaction between conda multi-python-version builds, pybind11, and CMake, resulting in CMake building Python extensions that did not match the Python interpreter version. For some reason, these problems only arise when building with pybind11 2.10, but aren't due to an ABI incompatibility between 2.9.2 and 2.10. Building manually, I can confirm that my OCP extension library built with pybind11 2.10 works with a version of OCP built with pybind11 2.9.2. I don't know if this comes as good or bad news to you. But in my specific application, this problem does seem to be resolved for now. I want to say thanks very much for your help in trying to address my issues! |
I believe this was fixed with #3577, which would have been in 3.10.1. This is why I said "We fixed a variety of things that could possibly be related." I didn't know it would be this one, but that's why I want to ship fixes even if not every issue is resolved. |
Thank you so much for working on this! I'm sorry we didn't release 2.10.1 sooner and probably resolve your issue without forcing you to debug, but I am very grateful you did. If you can try latest master, it should work. But don't worry about it, we should release soon. |
Awesome, thanks for letting us know! I was really spooked by having broken something with #1895. A couple small good things came out of this: I believe #2999 went far beyond what was actually needed (see #4284). We can trim it back after the 2.10.1 release. Also, I managed to tease out "real" ODR violation behavior under #4283, so strictly speaking the combination of #1895 & #2999 actually broke something, but only for highly unusual environments (roughly, using |
Background: pybind#2999, pybind#4105, pybind#4283, pybind#4284 In a nutshell: * Only macOS actually needs `PYBIND11_EXPORT_EXCEPTION` (pybind#4284). * Evidently (pybind#4283), under macOS `PYBIND11_EXPORT_EXCEPTION` does not run the risk of introducing ODR violations, * but evidently (pybind#4283) under Linux it does, in the presumably rare/unusual situation that `RTLD_GLOBAL` is used. * Windows does no have the equivalent of `RTLD_GLOBAL`, therefore `PYBIND11_EXPORT_EXCEPTION` has no practical benefit, on the contrary, noisy warning suppression pragmas are needed, therefore it is best left empty.
…#4298) Background: #2999, #4105, #4283, #4284 In a nutshell: * Only macOS actually needs `PYBIND11_EXPORT_EXCEPTION` (#4284). * Evidently (#4283), under macOS `PYBIND11_EXPORT_EXCEPTION` does not run the risk of introducing ODR violations, * but evidently (#4283) under Linux it does, in the presumably rare/unusual situation that `RTLD_GLOBAL` is used. * Windows does no have the equivalent of `RTLD_GLOBAL`, therefore `PYBIND11_EXPORT_EXCEPTION` has no practical benefit, on the contrary, noisy warning suppression pragmas are needed, therefore it is best left empty.
2.10.1 has been released. |
@whophil It's very likely that the deadlocks I was struggling with here have nothing to do with pybind11, but simply threads & fork are an unsafe combination: #4305, #4306 Unfortunately both PRs don't work internally (for interesting reasons unrelated to deadlocks), therefore I cannot easily verify that the deadlocks are gone. (Maybe it's easy, but I don't know how.) |
Update:
I tried really hard to reproduce the flakes on my corp dev workstation (Debian derived), using Python 3.9 and Python 3.10, using current pybind11 master with this tiny tweak only: test_gil_scoped.py: -ALL_BASIC_TESTS_PLUS_INTENTIONAL_DEADLOCK = ALL_BASIC_TESTS + (_intentional_deadlock,)
-SKIP_IF_DEADLOCK = True # See PR #4216
+ALL_BASIC_TESTS_PLUS_INTENTIONAL_DEADLOCK = ALL_BASIC_TESTS # (_intentional_deadlock,)
+SKIP_IF_DEADLOCK = False # See PR #4216 I ran pytest 100 times with these options:
In those 100,000 runs total there isn't a single deadlock (or any other failure). Unfortunately that means I still don't know how to prove that #4305 and/or #4306 resolve the "flakes" we saw in the GitHub CI over the past couple years. Giving up. We need to decide if we want to go with #4305 or #4306. The only tie breaker I can think of: which one makes test_gil_scoped.py run fastest? Method: Run the test twice without doing anything else in between, take the 2nd "real" time. Results:
Small detail for completeness: the "spawn" timings are essentially identical for 1. using @henryiii @Skylion007 "forkserver" is clearly winning this one. Let's go with that and hope that we will never see any flakes or deadlocks again from test_gil_scoped.py |
…#4298) Background: #2999, #4105, #4283, #4284 In a nutshell: * Only macOS actually needs `PYBIND11_EXPORT_EXCEPTION` (#4284). * Evidently (#4283), under macOS `PYBIND11_EXPORT_EXCEPTION` does not run the risk of introducing ODR violations, * but evidently (#4283) under Linux it does, in the presumably rare/unusual situation that `RTLD_GLOBAL` is used. * Windows does no have the equivalent of `RTLD_GLOBAL`, therefore `PYBIND11_EXPORT_EXCEPTION` has no practical benefit, on the contrary, noisy warning suppression pragmas are needed, therefore it is best left empty.
* Use `multiprocessing` `start_method` `"forkserver"` Alternative to PR #4305 * Add link to comment under PR #4105 * Unconditionally `pytest.skip("DEADLOCK")` for PyPy Windows * Remove `SKIP_IF_DEADLOCK` entirely, for simplicity. Hopefully this PR will resolve the deadlocks for good. * Add "In a nutshell" comment, in response to request by @EricCousineau-TRI
FYI: We still seem to have a deadlock problem, even with multiprocessing start_method "forkserver". I'll track what I happen to see under #4373. Note that this first one is with |
Required prerequisites
Problem description
I have two pybind11 extension libraries, A and B, where B uses classes from A.
My library A was compiled with pybind11 2.9 and B was compiled with pybind11 2.10. They both declare the same
PYBIND11_INTERNALS_VERSION
of 4, yet they do not seem to work with each other. They do work if I compile both A and B with pybind11 2.9.Based on discussion with @henryiii on Gitter, these should be ABI compatible.
Reproducible example code
The text was updated successfully, but these errors were encountered: