Sporadic segfaults in TensorFlow during PyObject_GC_UnTrack when used through PyO3 #1623
Ok, it's possible I may have found a fix for it after all. The GC calls led me to some Python dev threads that mentioned errors like this; it seems like Python's GC internals have been going through some churn with some recent additions. I had been using Python 3.8.5 ever since we were discussing some solutions in #1274, but it seems like this might have been a mistake. TensorFlow claims to support Python 3.6-3.8, but I'm not sure if that actually includes 3.8.5, and I'm not sure if they've tested that support against those changes. I had to hack the initialization a bit because of the `module 'sys' has no attribute 'argv'` error in Python 3.6:

```rust
use pyo3::Python;

/// Hack to fix the `module 'sys' has no attribute 'argv'` error in Python 3.6:
/// install a single empty string as the embedded interpreter's argv.
pub fn hack_py_sys_argv() {
    Python::with_gil(|_py| unsafe {
        let s = std::ffi::CString::new("").unwrap();
        // Decode the empty C string into a wide string suitable for argv[0].
        let mut args = [pyo3::ffi::Py_DecodeLocale(s.as_ptr(), std::ptr::null_mut())];
        pyo3::ffi::PySys_SetArgv(1, args.as_mut_ptr());
    });
}
```
After that, the segfaults disappeared, so it looks like this was just a regression caused by the internal Python GC behavior changing a bit, and TensorFlow may need to change its dealloc logic in a future version. It's also possible that this error is racy and has just slipped into a sweet spot temporarily (which has happened before), so I'm hesitant to call it completely fixed. I'll post an update if the error comes back.
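In case it helps anyone else hitting the same error, here is a minimal sketch of how a hack like this could be wired in. It's simplified and not my exact code: the initialization order is an assumption, and `prepare_freethreaded_python` is only needed when embedding without PyO3's auto-initialize feature.

```rust
use pyo3::prelude::*;

fn main() -> PyResult<()> {
    // If embedding without PyO3's auto-initialize feature, the interpreter
    // has to be initialized explicitly before anything else touches it.
    pyo3::prepare_freethreaded_python();

    // Install a fake argv before the first TensorFlow import, so that code
    // reading `sys.argv` doesn't hit the missing-attribute error.
    hack_py_sys_argv();

    Python::with_gil(|py| {
        let tf = py.import("tensorflow")?;
        println!("tensorflow {}", tf.getattr("__version__")?);
        Ok(())
    })
}
```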
That tensorflow code looks correct on inspection; however, the stack trace doesn't yet seem conclusive enough to point the blame at us either. Please do let us know if the problems reoccur.
Btw, note that if this PyO3 code path is being exercised, you're dropping
@davidhewitt unfortunately I think the errors have come back. They're definitely rarer now, so I haven't gotten a proper backtrace yet to verify that they're the same error, but it looks the same from the outside. I've also been seeing new problems since moving back to Python 3.6. Even stranger, it seems like there are multiple different crashes. I just got this one after about 2 hours of running my program:
In some runs, it'll occasionally complain that some of my tensor shapes are off too, which is really strange, because it should be running the same set of training loop operations each time. I'm really not sure which library has the issue at this point (although I'm fairly certain TensorFlow is doing some weird stuff), so I'll keep trying some things. That information about the drop is good to know, though.

Edit:
Just as an update, I have seen new segfaults in different places since rolling back to 3.6.9, with traces like this:
These segfaults happen maybe once every 30 minutes. Also, I'm seeing a litany of new exceptions being thrown around in TensorFlow code that I've never seen before. Most seem distinct from one another, and they only happen every once in a while.

I've been trying to constrain my TensorFlow logic even more by running the PyO3 TensorFlow calls in one single thread, but I'm even having issues with that. It seems to work fine until it reaches a certain point in the code, then it always deadlocks in the same place on the same call. I thought maybe it was because I was creating the thread in Rust, so I tried creating one with the Python threading module instead. Adding a call to

Upgrading the version back to 3.8.5 avoids all of the issues listed above, but I just can't shake the original segfaults in PyObject_GC_UnTrack.

In short, I've just been having rotten luck with this all around. I just can't seem to win with it.
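For reference, the single-thread setup I'm describing is shaped roughly like the sketch below (simplified, not my actual code; the `SCRIPT` constant and the channel plumbing are just stand-ins). One general way to avoid this class of deadlock is to make sure the GIL is released before blocking on the worker, which is what `allow_threads` is doing here.

```rust
use pyo3::prelude::*;
use std::sync::mpsc;
use std::thread;

// Stand-in for whatever the real training-loop call is.
const SCRIPT: &str = "import tensorflow as tf\ntf.reduce_sum(tf.constant([1.0, 2.0, 3.0]))";

fn run_on_dedicated_thread() -> PyResult<()> {
    let (tx, rx) = mpsc::channel::<&'static str>();

    // Dedicated worker thread that owns every Python/TensorFlow call.
    let worker = thread::spawn(move || -> PyResult<()> {
        for code in rx {
            Python::with_gil(|py| py.run(code, None, None))?;
        }
        Ok(())
    });

    tx.send(SCRIPT).unwrap();
    drop(tx); // closing the channel lets the worker's loop finish

    // If the joining thread holds the GIL (e.g. because it is inside
    // `with_gil` further up the stack), it must release it while blocking;
    // otherwise the worker can never acquire the GIL and the join deadlocks.
    Python::with_gil(|py| py.allow_threads(|| worker.join().unwrap()))
}
```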
Yikes. Are you able to share a sample of this code? There are enough issues here that I'd like to run it myself if possible to understand the cause. (If you're not able to share it in the public domain, could you potentially reach out to me via Gitter to arrange something confidential?)
Yeah, it's not a work project, so there's no problem sharing it, although it's not at a point yet where I want to fully open source it. It's around 20k lines of code too, so I'd like to strip it down as much as I can; that should help us narrow the problem down a bit. I'm not sure how much time I'll have to work on it this weekend, but I'll post an update once I have the sample code.
Sure thing, whenever suits you! Thanks!
I did manage to track down the spots where I was deleting those objects. As for the deadlock on that call, I've got a workaround in place as well. These are just hacks though, so I'd still like to get a better understanding of the underlying bugs, since I feel like they're going to come back to bite me at some point. I've reduced the code that reproduces the problem down to ~4k lines so far, but I think I can reduce that further. I'll try to push that sample to GitLab within a few days.
Awesome! The deadlock especially is very interesting, so I'd be curious to investigate that too.
👍 thanks.
Just letting you know, I haven't forgotten about this; I've just been more busy than I expected lately. I'll try to get back to this soon.
No problem, will help debug once it's ready.
It's been a year since the last activity on this issue, so I'm going to close it on the assumption that whatever conditions were causing the problem have changed. If there's still a problem, please ping and we can continue the investigation.
I've been seeing some SEGV errors when calling TensorFlow ops through PyO3. I haven't been able to find a solid pattern for when they occur. As far as I can tell it's pretty random, although sometimes I can find a sweet spot by rearranging or splitting up some calls.
The backtrace consistently starts with the following frames:
For reference, that EagerTensor_dealloc function is defined here.

This may not be the right spot to file this issue, so don't feel obligated to help with this if it doesn't seem related to PyO3. I just thought I'd check with you guys to see if these snippets raise any red flags.
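To give a rough idea of the shape of the calls involved, here is a heavily simplified sketch of how the ops are driven from Rust (not my actual code; the op names and the loop are just placeholders):

```rust
use pyo3::prelude::*;

// Simplified stand-in for the real training loop: drive TensorFlow eager ops
// from Rust through PyO3. Each call creates EagerTensor objects on the Python
// side, and the segfault (when it happens) shows up while those tensors are
// being deallocated (EagerTensor_dealloc -> PyObject_GC_UnTrack).
fn run_ops(iterations: usize) -> PyResult<()> {
    Python::with_gil(|py| {
        let tf = py.import("tensorflow")?;
        for _ in 0..iterations {
            let t = tf.call_method1("constant", (vec![1.0_f64, 2.0, 3.0],))?;
            let _sum = tf.call_method1("reduce_sum", (t,))?;
        }
        Ok(())
    })
}
```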
Environment
Docker image: nvidia/cuda:11.0-base-ubuntu20.04
Python 3.8.5
PyO3 v0.13.2