-
-
Notifications
You must be signed in to change notification settings - Fork 482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make libsingular multivariate polynomial rings collectable #13447
Comments
comment:1
On 5.4-beta0 + #715 + #11521, there is a doctest failure on
The segmentation fault happens reliably, but is hard to study because
The segfault happens in the doctest for
Further instrumentation showed that the segfault happens in + sys.stderr.write("before _names allocation\n")
_names = <char**>omAlloc0(sizeof(char*)*(len(names)))
+ sys.stderr.write("after _names allocation\n")
for i from 0 <= i < n:
_name = names[i]
+ sys.stderr.write("calling omStrDup for i=%s with name=%s\n"%(i,names[i])
_names[i] = omStrDup(_name)
+ sys.stderr.write("after omStrDup\n") The call for i from 0 <= i < n:
_name = names[i]
+ sys.stderr.write("calling omStrDup for i=%s with name=%s\n"%(i,names[i]))
- _names[i] = omStrDup(_name)
+ j = 0
+ while <bint> _name[j]:
+ j+=1
+ j+=1 #increment to include the 0
+ sys.stderr.write("string length (including 0) seems to be %s\n"%j)
+ copiedname = <char*>omAlloc(sizeof(char)*(j+perturb))
+ sys.stderr.write("Done reserving memory buffer; got address %x\n"%(<long>copiedname))
+ for 0 <= offset < j:
+ sys.stderr.write("copying character nr %s\n"%offset)
+ copiedname[offset] = _name[offset]
+ _names[i] = copiedname
+ sys.stderr.write("after omStrDup\n") shows that it's actually the
Note the I have tried and the problem seems to persist with the old singular (5.4b0 has a It would help a lot if someone could build singular to use plain malloc See also [#715 comment:295 #715,comment |
comment:3
OK, I did a little experiment and tried to build singular with plain malloc rather than omalloc. In principle, omalloc has an http://sage.math.washington.edu/home/nbruin/singular-3-1-5.malloc.spkg One problem is supplying a The package above is very dirty, but on bsd.math it did provide me with an apparently working libsingular. The singular executable produced didn't seem usable, so keep your old one! The doctest passes! Not exactly what we were hoping for, but it does make a double-free unlikely. That would have been detected. Corruption after freeing could still be possible, since There is of course also the possibility of some routine writing out of bounds, which is less likely to trigger problems with malloc too. Singular people might be interested in incorporating the changes to omalloc (and preferrably extend them a little bit) so that |
Upstream: Reported upstream. No feedback yet. |
comment:5
I have contacted Hans Schönemann. |
comment:6
I think Martin should be Cc for this as well. I am not sure if changing to malloc is an acceptable option for Singular. If I understand correctly, omalloc is very important for having a good performance in Gröbner basis computations. |
comment:7
Replying to @simon-king-jena:
I am sure it is not acceptable for production, but being able to swap out omalloc for debugging can be very useful. That's why I tried. I understand that there are great tools to do memory management debugging and omalloc puts them all out of play because it hides all memory alloc/free activity. It seems omalloc has its own tools but I wasn't able to get them working, I've seen indications that they don't work on 64 bits, and there's a good chance they're not as good as the general ones because they're for a smaller market. I'm sure someone more familiar with the Singular and omalloc code bases can make a more informed decision on whether having the option of straight malloc memory management for debugging is worthwhile. My initial finding is that it quite likely can be done with relatively small modifications. I got it to more or less work in an evening, while being unfamiliar with the code. |
comment:8
At least the problem is a real one. I've found a similar iMac:
and
Both these machines exhibit the same problem that on 5.4b0 + #715 + #11521, the doctest for Have we actually established that this bug does not occur on newer OSX versions? |
comment:9
Replying to @nbruin:
And have we actually established that this problem does not occur with older Singular versions? I am not totally sure, but I think the problem with #715+#11521 first emerged in sage-5.4.beta0, when Singular-3-1-5 was merged. |
comment:10
Replying to @simon-king-jena:
Quoting from comment:1
In the mean time, a bit of googling led me to OSX's "GuardMalloc". While sage+
Since gmalloc is a memory manager that places each allocation on its own page with protected/unmapped memory as close as possible around the block and that unmaps the block as soon as freed (I'm just parroting the manpage), a segfault is likely due to an access-after-free or access-out-of-bounds -- the one that would normally cause the corruption and then the segfault much later. (that's the whole idea of replacing omalloc -- I don't think it's doable to get omalloc to segfault on an access-after-free). This all comes at a significant speed penalty of course, so experiments are painful and I wouldn't even be able to interpret the backtrace/coredump if I got it (I'd hope that the gmalloc-induced segfault would be reproducible in gdb). It would really be useful if the test file would be pared down to an absolute minimum. That's basically just a backtracking search on which elements can be removed while still triggering a segfault. However, I think this is a strong indication that there is a real memory violation at the base of this and that it is tracable. |
comment:11
I tried to track the problem as follows: diff --git a/sage/libs/singular/ring.pyx b/sage/libs/singular/ring.pyx
--- a/sage/libs/singular/ring.pyx
+++ b/sage/libs/singular/ring.pyx
@@ -16,6 +16,8 @@
include "../../ext/stdsage.pxi"
+import sys
+
from sage.libs.gmp.types cimport __mpz_struct
from sage.libs.gmp.mpz cimport mpz_init_set_ui, mpz_init_set
@@ -495,6 +497,8 @@
cdef object r = wrap_ring(existing_ring)
refcount = ring_refcount_dict.pop(r)
ring_refcount_dict[r] = refcount+1
+ sys.stderr.write("reference %d to %d, wrapper %d\n"%(refcount+1,<size_t>existing_ring, id(r)))
+ sys.stderr.flush()
return existing_ring
@@ -536,6 +540,8 @@
cdef ring_wrapper_Py r = wrap_ring(doomed)
refcount = ring_refcount_dict.pop(r)
+ sys.stderr.write("dereference level %d of %d, wrapper %d\n"%(refcount-1,<size_t>doomed, id(r)))
+ sys.stderr.flush()
if refcount > 1:
ring_refcount_dict[r] = refcount-1
return
diff --git a/sage/rings/polynomial/multi_polynomial_libsingular.pyx b/sage/rings/polynomial/multi_polynomial_libsingular.pyx
--- a/sage/rings/polynomial/multi_polynomial_libsingular.pyx
+++ b/sage/rings/polynomial/multi_polynomial_libsingular.pyx
@@ -151,6 +151,7 @@
sage: b-j*c
b - 1728*c
"""
+import sys
# The Singular API is as follows:
#
@@ -242,7 +243,7 @@
import sage.libs.pari.gen
import polynomial_element
-
+from sage.libs.singular.ring cimport wrap_ring
cdef class MPolynomialRing_libsingular(MPolynomialRing_generic):
def __cinit__(self):
@@ -364,6 +365,8 @@
from sage.rings.polynomial.polynomial_element import PolynomialBaseringInjection
base_inject = PolynomialBaseringInjection(base_ring, self)
self.register_coercion(base_inject)
+ sys.stderr.write("At %d, creating %s\n"%(<size_t>self._ring, self))
+ sys.stderr.flush()
def __dealloc__(self):
r"""
@@ -390,6 +393,16 @@
sage: _ = gc.collect()
"""
if self._ring != NULL: # the constructor did not raise an exception
+ from sage.libs.singular.ring import ring_refcount_dict
+ try:
+ level = ring_refcount_dict[wrap_ring(self._ring)]
+ except KeyError:
+ level = -1
+ if level > 1:
+ sys.stderr.write("WARNING: %d\n"%(<size_t>self._ring))
+ else:
+ sys.stderr.write("__dealloc__: %s\n"%(<size_t>self._ring))
+ sys.stderr.flush()
singular_ring_delete(self._ring)
def __copy__(self): Then, I ran In both cases it is
However, I am not totally sure whether this indicates a problem, because in both cases the remaining references are immediately removed. Also, it is always the case that 4 references are set to the libsingular ring before actually creating the polynomial ring in Sage. One last observation: You may notice a libsingular ring at address 4409549416 that is referenced here as well, aparently in the middle of the construction of
Seems like a wild-goose chase to me, though. |
comment:12
Replying to @simon-king-jena:
And this ring is in fact |
comment:13
OK! good progress. Instrumenting
so that indeed seems to be alphabetical order. Now let's run the doctests with singular-using-malloc. Result: No segfault. OSX comes with
Here's a session with
So I think the issue is in
The evidence points absolutely to It looks suspicious to me that Libsingular specialists: Keep in mind that in principle, singular code can get executed in rather awkward moments, possibly as part of clean-ups of circular garbage and call-backs on weakref cleanup, where equality might be tested of objects that are soon to be deallocated themselves. The fickleness of the bug is consistent with this condition arising during a cyclic garbage collection with just the right amount of junk around. That would make the occurrence of the bug depend on just about everything in memory. Or at least if you depend on the corruption leading to a segfault, it depends on which location exactly gets corrupted. I think we might be getting close to a badge for debugging excellence here! |
Attachment: trac_13447-double_refcount.patch.gz take into account both refcount_dict and ring*.ref fields on deletion. |
Attachment: trac_13447-consolidated_refcount.patch.gz Consolidate two refcount systems (cruft not yet removed from patch) |
comment:14
OK, two independent patches. Either prevents the segfault. I may just have removed the symptom, but not the cause. If I'm correctly understanding the problem, attachment: trac_13447-consolidated_refcount.patch should be the preferred solution. However, my unfamiliarity with (lib)singular's intricacies might have caused an oversight somewhere. I think my interpretation is consistent with the use in If people agree, we can clean out the cruft remaining from the refcounting method implemented locally. |
Work Issues: Input from libsingular experts |
comment:16
Replying to @nbruin:
I didn't test the patch yet. However, it seems very straight forward to me: There already is a refcounting, and thus one should use it. I am Cc'ing Volker Braun and Martin von Gagern, the authors of #11339. Does attachment: trac_13447-consolidated_refcount.patch make sense to you as well? Keeping a double refcount (as with attachment: trac_13447-double_refcount.patch seems suspicious to me. Perhaps one should let the patchbots test it? Thus, I'll add this as dependency for #715, and for the patchbot: Apply trac_13447-consolidated_refcount.patch PS: You really deserve a badge for debugging excellence! Do I understand correctly that the bug is not on the side of Singular? I'll inform Hans accordingly. |
This comment has been minimized.
This comment has been minimized.
comment:17
With the new refcounting, I think it could be that
seems to have no ill effect (I put a print statement there that did produce some output, so it does happen), but perhaps I'm overlooking something. Are there other places where references are liable to be lost? |
comment:18
For the record, the following tests fail:
with
So, not all is good, but almost... |
comment:19
I don't know if Upstream plans to get rid of the whole |
comment:20
Two errors mentioned in comment:18 look (again) difficult. The first one:
Hence, the way how one refcounts libsingular rings influences the dimension of Hecke modules. Strange at least... Note, however, that the value returned by the "dimension()" method above is not constant, because it only returns a lower bound (if I recall correctly) that is increased when one learns more about the Hecke module. Hence, it could very well be that The second error is apparently ignored and only printed to stderr:
Since these were parallel tests, I can't tell were the ignored attribute errors actually came from. |
comment:21
PS: When consolidating refcounters, we must not forget #13145, which only got merged in sage-5.4.beta1. |
comment:162
Ticket retargeted after milestone closed |
comment:163
Moving tickets to milestone sage-9.2 based on a review of last modification date, branch status, and severity. |
comment:164
I have merged the latest beta and get the two test failures below. It also seems that indeed this ticket will help with the problem in #29528, which came up in the upgrade of Singular #25993.
The above is equal to
New commits:
|
comment:166
Setting new milestone based on a cursory review of ticket status, priority, and last modification date. |
comment:167
Setting a new milestone for this ticket based on a cursory review. |
Presently, #715 + #11521 help not permanently keeping parent in memory. In the process we uncovered a hard-but-consistently triggerable problem with the collection of
MPolynomialRing_libsingular
. We have only observed the problem onbsd.math.washington.edu
, MacOSX 10.6 on x86_64.The present work-around is to permanently store references to these upon creation, thus preventing collection. It would be nice if we could properly solve the problem (or at least establish that the problem is specific to
bsd.math
)Depends on #11521
CC: @simon-king-jena @malb @vbraun @gagern @robertwb @sagetrac-ylchapuy @jpflori @burcin
Component: memleak
Author: Simon King
Branch/Commit: public/make_libsingular_multivariate_polynomial_rings_collectable @
b4df239
Reviewer: Travis Scrimshaw
Issue created by migration from https://trac.sagemath.org/ticket/13447
The text was updated successfully, but these errors were encountered: