-
-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Julia 1.8 CVODE_BDF
causes segfaults
#367
Comments
The preconditioners test is also doing weird things on v1.8 (#366) and we haven't figured out why. That might be a lead, but it's hard to say in advance. |
This code doesn't have preconditioners, the only options used (all of which error), are I've got a less stiff problem working with |
Looks like by building Sundials from src (with debug build) and hacking Sundials.jl to use the compiled libraries (instead of Sundials_jll) we can avoid the segfaults. This was built from Sundials 5.2.0 (see tag in git). Unfortunately it seems like the debug build causes the memory space to be different than that of the release build. |
@giordano or @staticfloat could we get some help retriggering the build? |
You want a debug build of Sundials (presumably v5.2.0)? As the official build or something to be used only for testing? |
I don't think it needs to be the debug build? But I'm also confused why there's now a memory issue with the old builds. |
So the current jll that's used with Sundials.jl is Remember, Julia 1.7 works just fine with the Sundials jll, but for Julia 1.8 I get segfaults. |
Well, in Julia v1.7 and v1.8 you should get the same binary. I don't see how rebuilding would help anything here. Maybe the problem is in how the routines from the libraries are called. |
My guess is that a function pointer got garbage collected (i.e. the unknown function message in Linux) and that's how we're segfaulting. I understand that we should get the same Sundials binary between Julia 1.7 and 1.8, so I'm guessing that the way Julia is handling memory has changed. Using a debug build of C++ Sundials would change how memory is managed overall in the computer, which is why re-building Sundials with a debug build removes the segfault issue I experienced. |
It would be helpful for someone to track down what exactly is getting garbage collected too early here. |
If someone can guide me through the process that would help me. Would I build julia from source with assertions to debug this? |
I just retested our application (that uses Kinsol), putting GC.enable(false) and GC.enable(true) around the call to Sundials, and this appears to remove the segfault (an example failure was in PALEOtoolkit/PALEOmodel.jl#25). So this is at least consistent with the suggestion above that the problem is connected to something getting garbage collected in Julia 1.8 (NB: not a definitive test as the failures are not fully reproducable, so this could just be a statistical fluke). Some speculation on a possible cause: looking at the Sundials.jl code, it seems to me that there might be a problem here with the use of Sundials.jl/src/nvector_wrapper.jl Line 61 in c7469aa
The The most obvious case (which explicitly creates temporary Sundials.jl/src/nvector_wrapper.jl Lines 65 to 73 in c7469aa
but in fact this pattern is used widely by autogenerated code, eg Sundials.jl/lib/libsundials_api.jl Lines 1840 to 1849 in c7469aa
where the problem in this case could be that __yout.ref_nv is garbage collected after convert(N_Vector, __yout) but before the ccall ?
I admit I can't understand the documentation for If this is the problem, then given that there is a comment contradicting this in Lines 134 to 138 in c7469aa
and if this comment is in fact incorrect there could I imagine be other similar problems, any fix will need to touch (or at least regenerate) a lot of code. So perhaps the first priority is to understand the intended usage of ccall , unsafe_convert , and cconvert and review the design here?
|
Note that the required vectors are stored in the Julia types for the function and integrator: https://github.com/SciML/Sundials.jl/blob/master/src/common_interface/integrator_types.jl#L63 So in theory the GC would know those finalizers are not hit until after the integrator's reference is gone. At least, that's how it was working before v1.8. |
The explicitly stored NVectors should of course be fine (as well as the explicitly stored memory handles). But this doesn't look like it covers everything: the root cause here I think is that there exists a Some examples: Sundials.jl/src/common_interface/solve.jl Lines 544 to 548 in c7469aa
here u0nv is not subsequently used so could I imagine be garbage collected at any time.
or (perhaps in the vicinity of one of the reported issues with CVODE): Sundials.jl/src/common_interface/solve.jl Line 283 in c7469aa
where u0 here is a Julia Vector, so this will hit the convert path that creates a temporary NVector.
It seems to me the design here is not robust: it is incorrect in providing a |
In fact looking at this again, I'm not sure there isn't another problem? According to the Julia manual Sundials.jl/src/nvector_wrapper.jl Lines 9 to 26 in c7469aa
|
Also, what requirements does the Sundials C API place on the lifetime of C N_Vector arguments to C function calls ? Can they be deleted once the C API call returns (which is what I was assuming above), or is the caller required to manage them ? (seems a bit unlikely, this would provide another whole set of failure modes ...) |
@bolognam Have you found a solution or workaround to this? I'm working on a project and can consistently reproduce a segfault using The beginning of the error message is
Turning off garbage collection avoids the issue, but isn't feasible for all runs. |
Sorry @markowkes I haven't found a solution. I think @sjdaines is onto a solution but nobody (as far as I'm aware) has been working on making a fix in Sundials. |
With Julia1.9.0-beta2 and our application code, CVODE segfaults in more places (and almost every time...) Example stack trace, looks very similar to the initial report above, showing this is consistent with a problem due to N_Vector:
where the first few functions in stack trace are: CVode at C:\Users\sd336.julia\packages\Sundials\Y6XpH\lib\libsundials_api.jl:1841 [inlined]
CVode at C:\Users\sd336.julia\packages\Sundials\Y6XpH\lib\libsundials_api.jl:1848
solver_step at C:\Users\sd336.julia\packages\Sundials\Y6XpH\src\common_interface\solve.jl:1406
|
Can you try the PR branch #380 ? |
oh it's you haha. Did the PR branch not fix your case? |
Yup this was my fix for the case above ! (I just got the order of issue update and submit PR wrong) @bolognam @markowkes this might not be the only problem, but #380 fixes the issue I see here |
Issue #367 This fixes a problem seen with Julia >= 1.8, where a temporary NVector was created and the N_Vector pointer it contains passed to ccall, which is unsafe as the temporary NVector may be garbage collected. The fix is to remove convert(N_Vector, x) and replace with the combination of cconvert and unsafe_convert. An NVector can new be supplied as an argument to ccall with memory management then handled correctly ( and it is now impossible to explicitly create an N_Vector). Changes: - change to cconvert / unsafe_convert is in src/nvector_wrapper.jl . - A global edit (~400 changes) to the autogenerated code in lib/libsundials_api.jl. - small number of changes elsewhere, replacing N_Vector with NVector NB: The correct fix would be to update the code generator in gen/generate.jl, and regenerate the wrappers. NB: As the comments in src/nvector_wrapper.jl show this was an intentional and (now) incorrect use of temporary variables to avoid garbage collection, it is possible there are other similar problems (pointers from other types of temporary variables passed to ccall).
Thanks for digging into this! Good to see this finally all cleaned up to a newer version of Julia. |
I'm consistently getting segfaults in both Windows 10 and Ubuntu 20.04 on Julia 1.8.0 when running my ODE solves compared to Julia 1.7.2. I've tried the latest in Sundials (v4.10.0) as well as the version I was using in Julia 1.7.2 (v4.9.4).
Here's the stack trace in Windows (Sundials v4.9.4):
And in Linux (Sundials v4.10.0):
This is closed source software, but if an NDA is signed then I can provide an rr trace.
The text was updated successfully, but these errors were encountered: