Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make libsingular multivariate polynomial rings collectable #13447

Open
nbruin opened this issue Sep 11, 2012 · 195 comments
Open

Make libsingular multivariate polynomial rings collectable #13447

nbruin opened this issue Sep 11, 2012 · 195 comments

Comments

@nbruin
Copy link
Contributor

nbruin commented Sep 11, 2012

Presently, #715 + #11521 help not permanently keeping parent in memory. In the process we uncovered a hard-but-consistently triggerable problem with the collection of MPolynomialRing_libsingular. We have only observed the problem on bsd.math.washington.edu, MacOSX 10.6 on x86_64.

The present work-around is to permanently store references to these upon creation, thus preventing collection. It would be nice if we could properly solve the problem (or at least establish that the problem is specific to bsd.math)

Depends on #11521

CC: @simon-king-jena @malb @vbraun @gagern @robertwb @sagetrac-ylchapuy @jpflori @burcin

Component: memleak

Author: Simon King

Branch/Commit: public/make_libsingular_multivariate_polynomial_rings_collectable @ b4df239

Reviewer: Travis Scrimshaw

Issue created by migration from https://trac.sagemath.org/ticket/13447

@nbruin nbruin added this to the sage-5.11 milestone Sep 11, 2012
@nbruin
Copy link
Contributor Author

nbruin commented Sep 11, 2012

comment:1

On 5.4-beta0 + #715 + #11521, there is a doctest failure on bsd.math.washington.edu, an x86_64 machine running MacOSX 10.6:

bash-3.2$ ../../sage -t sage/misc/cachefunc.pyx
sage -t  "devel/sage-main/sage/misc/cachefunc.pyx"          
The doctested process was killed by signal 11
         [12.7 s]
 
----------------------------------------------------------------------
The following tests failed:


        sage -t  "devel/sage-main/sage/misc/cachefunc.pyx" # Killed/crashed
Total time for all tests: 12.7 seconds

The segmentation fault happens reliably, but is hard to study because

  • running the same examples in an interactive session does not trigger the problem
  • running with sage -t -gdb (yes, that's possible!) or sage -t --verbose does not trigger the problem.
  • the order of tests in the file seems fairly important. You can change and
    delete some tests but not others. The likely explanation is that a garbage
    collection has to be triggered under the right conditions, so that the memory
    corruption (which likely happens upon deallocate somewhere) happens in the right
    spot.

The segfault happens in the doctest for CachedMethodCaller._instance_call (line 1038 in the sage source; example_27 in the file ~/.sage/tmp/cachefunc_*.py left after doctesting), in the line

            sage: P.<a,b,c,d> = QQ[]

Further instrumentation showed that the segfault happens in
sage/libs/singular/ring. pyx, in singular_ring_new, in the part that copies the strings over.

+    sys.stderr.write("before _names allocation\n")
     _names = <char**>omAlloc0(sizeof(char*)*(len(names)))
+    sys.stderr.write("after _names allocation\n")

     for i from 0 <= i < n:
         _name = names[i]
+        sys.stderr.write("calling omStrDup for i=%s with name=%s\n"%(i,names[i])
        _names[i] = omStrDup(_name)
+        sys.stderr.write("after omStrDup\n")

The call _omStrDup segfaults for i=1. Unwinding the _omStrDup call:

     for i from 0 <= i < n:
         _name = names[i]
+        sys.stderr.write("calling omStrDup for i=%s with name=%s\n"%(i,names[i]))
-        _names[i] = omStrDup(_name)
+        j = 0
+        while <bint> _name[j]:
+            j+=1
+        j+=1     #increment to include the 0
+        sys.stderr.write("string length (including 0) seems to be %s\n"%j)
+        copiedname =  <char*>omAlloc(sizeof(char)*(j+perturb))
+        sys.stderr.write("Done reserving memory buffer; got address %x\n"%(<long>copiedname))
+        for 0 <= offset < j:
+            sys.stderr.write("copying character nr %s\n"%offset)
+            copiedname[offset] = _name[offset]
+        _names[i] = copiedname
+        sys.stderr.write("after omStrDup\n")

shows that it's actually the omAlloc call segfaulting. For perturb=7 or higher, the segfault does not happen. For perturb a lower value it does. Given that the omAlloc addresses returned on earlier calls do not seem close to a page boundary, the only way omAlloc can fail is basically by a corrupted freelist an 8-byte bin. Likely culprits:

  • a double free (although I'd expect that would trigger problems on more architectures)
  • someone writing out-of-bounds in omAlloc-managed memory.
    Perhaps someone claiming an <int>,<int> structure and storing a <void *> in the second one?

Note the <char*> to <long> cast in the print statement. With an <int>, the compiler complains
about loss of precision, but not with <long>. I haven't checked whether
<long> is really 64 bits on this machine, though.

I have tried and the problem seems to persist with the old singular (5.4b0 has a
recently upgraded singular).

It would help a lot if someone could build singular to use plain malloc
throughout and then use valgrind or a similar tool, which should be able to
immediately catch a double free or out-of-bounds error. If the root of the
problem is not OSX-specific, this would even show up on other architectures.

See also [#715 comment:295 #715,comment
295] and below for some more details on how the diagnosis above was obtained.

@nbruin nbruin changed the title Make libsindular multivariate polynomial rings collectable Make libsingular multivariate polynomial rings collectable Sep 11, 2012
@nbruin
Copy link
Contributor Author

nbruin commented Sep 12, 2012

comment:3

OK, I did a little experiment and tried to build singular with plain malloc rather than omalloc. In principle, omalloc has an --with-emulate... flag, but the API offered in that mode is woefully incomplete for singular use. I tried to extend it a little. Very rough result:

http://sage.math.washington.edu/home/nbruin/singular-3-1-5.malloc.spkg

One problem is supplying a memdup, which needs to know the size of an allocated block from its pointer. On BSD, you can use malloc_size for that. On linux one could use malloc_usable_size. The rest is a whole swath of routines that need to be supplied.

The package above is very dirty, but on bsd.math it did provide me with an apparently working libsingular. The singular executable produced didn't seem usable, so keep your old one!

The doctest passes! Not exactly what we were hoping for, but it does make a double-free unlikely. That would have been detected. Corruption after freeing could still be possible, since malloc allocates way bigger blocks, so freelist data is likely missed.

There is of course also the possibility of some routine writing out of bounds, which is less likely to trigger problems with malloc too.

Singular people might be interested in incorporating the changes to omalloc (and preferrably extend them a little bit) so that --with-emulate... becomes a viable option for Singular debugging. Then you can valgrind singular code.

@simon-king-jena
Copy link
Member

Upstream: Reported upstream. No feedback yet.

@simon-king-jena
Copy link
Member

comment:5

I have contacted Hans Schönemann.

@simon-king-jena
Copy link
Member

comment:6

I think Martin should be Cc for this as well.

I am not sure if changing to malloc is an acceptable option for Singular. If I understand correctly, omalloc is very important for having a good performance in Gröbner basis computations.

@nbruin
Copy link
Contributor Author

nbruin commented Sep 12, 2012

comment:7

Replying to @simon-king-jena:

I am not sure if changing to malloc is an acceptable option for Singular.

I am sure it is not acceptable for production, but being able to swap out omalloc for debugging can be very useful. That's why I tried. I understand that there are great tools to do memory management debugging and omalloc puts them all out of play because it hides all memory alloc/free activity.

It seems omalloc has its own tools but I wasn't able to get them working, I've seen indications that they don't work on 64 bits, and there's a good chance they're not as good as the general ones because they're for a smaller market.

I'm sure someone more familiar with the Singular and omalloc code bases can make a more informed decision on whether having the option of straight malloc memory management for debugging is worthwhile. My initial finding is that it quite likely can be done with relatively small modifications. I got it to more or less work in an evening, while being unfamiliar with the code.

@nbruin
Copy link
Contributor Author

nbruin commented Sep 13, 2012

comment:8

At least the problem is a real one. I've found a similar iMac:

    Hardware Overview:

      Model Name: iMac
      Model Identifier: iMac10,1
      Processor Name: Intel Core 2 Duo
      Processor Speed: 3.06 GHz
      Number Of Processors: 1
      Total Number Of Cores: 2
      L2 Cache: 3 MB
      Memory: 4 GB
      Bus Speed: 1.07 GHz
      Boot ROM Version: IM101.00CC.B00
      SMC Version (system): 1.52f9

    System Software Overview:

      System Version: Mac OS X 10.6.8 (10K549)
      Kernel Version: Darwin 10.8.0
      64-bit Kernel and Extensions: No
      Time since boot: 5 days 8:46

and bsd.math.washington.edu:

    Hardware Overview:

      Model Name: Mac Pro
      Model Identifier: MacPro5,1
      Processor Name: Quad-Core Intel Xeon
      Processor Speed: 2.4 GHz
      Number Of Processors: 2
      Total Number Of Cores: 8
      L2 Cache (per core): 256 KB
      L3 Cache (per processor): 12 MB
      Memory: 32 GB
      Processor Interconnect Speed: 5.86 GT/s
      Boot ROM Version: MP51.007F.B03
      SMC Version (system): 1.39f11
      SMC Version (processor tray): 1.39f11

    System Software Overview:

      System Version: Mac OS X 10.6.8 (10K549)
      Kernel Version: Darwin 10.8.0
      64-bit Kernel and Extensions: Yes
      Time since boot: 16 days 6:14

Both these machines exhibit the same problem that on 5.4b0 + #715 + #11521, the doctest for cachefunc.pyx segfaults in the same spot.
Note that the iMac claims to not have a 64-bit kernel. Sage is compiled to be 64-bit on that machine, though (and seems to work).

Have we actually established that this bug does not occur on newer OSX versions?

@simon-king-jena
Copy link
Member

comment:9

Replying to @nbruin:

Have we actually established that this bug does not occur on newer OSX versions?

And have we actually established that this problem does not occur with older Singular versions? I am not totally sure, but I think the problem with #715+#11521 first emerged in sage-5.4.beta0, when Singular-3-1-5 was merged.

@nbruin
Copy link
Contributor Author

nbruin commented Sep 13, 2012

comment:10

Replying to @simon-king-jena:

And have we actually established that this problem does not occur with older Singular versions?

Quoting from comment:1

I have tried and the problem seems to persist with the old singular (5.4b0 has a recently upgraded singular).

In the mean time, a bit of googling led me to OSX's "GuardMalloc". While sage+singular-malloc does not crash on the doctest, it does crash when run with

export DYLD_INSERT_LIBRARIES=/usr/lib/libgmalloc.dylib

Since gmalloc is a memory manager that places each allocation on its own page with protected/unmapped memory as close as possible around the block and that unmaps the block as soon as freed (I'm just parroting the manpage), a segfault is likely due to an access-after-free or access-out-of-bounds -- the one that would normally cause the corruption and then the segfault much later. (that's the whole idea of replacing omalloc -- I don't think it's doable to get omalloc to segfault on an access-after-free). This all comes at a significant speed penalty of course, so experiments are painful and I wouldn't even be able to interpret the backtrace/coredump if I got it (I'd hope that the gmalloc-induced segfault would be reproducible in gdb). It would really be useful if the test file would be pared down to an absolute minimum. That's basically just a backtracking search on which elements can be removed while still triggering a segfault.

However, I think this is a strong indication that there is a real memory violation at the base of this and that it is tracable.

@simon-king-jena
Copy link
Member

comment:11

I tried to track the problem as follows:

diff --git a/sage/libs/singular/ring.pyx b/sage/libs/singular/ring.pyx
--- a/sage/libs/singular/ring.pyx
+++ b/sage/libs/singular/ring.pyx
@@ -16,6 +16,8 @@

 include "../../ext/stdsage.pxi"

+import sys
+
 from sage.libs.gmp.types cimport __mpz_struct
 from sage.libs.gmp.mpz cimport mpz_init_set_ui, mpz_init_set

@@ -495,6 +497,8 @@
     cdef object r = wrap_ring(existing_ring)
     refcount = ring_refcount_dict.pop(r)
     ring_refcount_dict[r] = refcount+1
+    sys.stderr.write("reference %d to %d, wrapper %d\n"%(refcount+1,<size_t>existing_ring, id(r)))
+    sys.stderr.flush()
     return existing_ring


@@ -536,6 +540,8 @@

     cdef ring_wrapper_Py r = wrap_ring(doomed)
     refcount = ring_refcount_dict.pop(r)
+    sys.stderr.write("dereference level %d of %d, wrapper %d\n"%(refcount-1,<size_t>doomed, id(r)))
+    sys.stderr.flush()
     if refcount > 1:
         ring_refcount_dict[r] = refcount-1
         return
diff --git a/sage/rings/polynomial/multi_polynomial_libsingular.pyx b/sage/rings/polynomial/multi_polynomial_libsingular.pyx
--- a/sage/rings/polynomial/multi_polynomial_libsingular.pyx
+++ b/sage/rings/polynomial/multi_polynomial_libsingular.pyx
@@ -151,6 +151,7 @@
     sage: b-j*c
     b - 1728*c
 """
+import sys

 # The Singular API is as follows:
 #
@@ -242,7 +243,7 @@

 import sage.libs.pari.gen
 import polynomial_element
-
+from sage.libs.singular.ring cimport wrap_ring
 cdef class MPolynomialRing_libsingular(MPolynomialRing_generic):

     def __cinit__(self):
@@ -364,6 +365,8 @@
         from sage.rings.polynomial.polynomial_element import PolynomialBaseringInjection
         base_inject = PolynomialBaseringInjection(base_ring, self)
         self.register_coercion(base_inject)
+        sys.stderr.write("At %d, creating %s\n"%(<size_t>self._ring, self))
+        sys.stderr.flush()

     def __dealloc__(self):
         r"""
@@ -390,6 +393,16 @@
             sage: _ = gc.collect()
         """
         if self._ring != NULL:  # the constructor did not raise an exception
+            from sage.libs.singular.ring import ring_refcount_dict
+            try:
+                level = ring_refcount_dict[wrap_ring(self._ring)]
+            except KeyError:
+                level = -1
+            if level > 1:
+                sys.stderr.write("WARNING: %d\n"%(<size_t>self._ring))
+            else:
+                sys.stderr.write("__dealloc__: %s\n"%(<size_t>self._ring))
+            sys.stderr.flush()
             singular_ring_delete(self._ring)

     def __copy__(self):

Then, I ran python -t on the segfaulting test. Observation: It happens precisely twice that "WARNING" is printed, i.e., the __dealloc__ method is called even though there remain multiple references to the underlying libsingular ring.

In both cases it is QQ['a','b','c','d']. Here is a snipped from the output:

reference 2 to 4409548912, wrapper 4302568952
reference 3 to 4409548912, wrapper 4302569000
reference 4 to 4409548912, wrapper 4302568952
At 4409548912, creating Multivariate Polynomial Ring in a, b, c, d over Rational Field
reference 5 to 4409548912, wrapper 4302569000
reference 6 to 4409548912, wrapper 4302568952
reference 7 to 4409548912, wrapper 4302569000
reference 8 to 4409548912, wrapper 4302568952
reference 9 to 4409548912, wrapper 4302569000
reference 10 to 4409548912, wrapper 4302568952
reference 2 to 4409549416, wrapper 4302568928
reference 3 to 4409549416, wrapper 4302569000
dereference level 9 of 4409548912, wrapper 4302568928
dereference level 8 of 4409548912, wrapper 4302568952
dereference level 7 of 4409548912, wrapper 4302568928
dereference level 6 of 4409548912, wrapper 4302568952
dereference level 5 of 4409548912, wrapper 4302568928
dereference level 4 of 4409548912, wrapper 4302568952
dereference level 3 of 4409548912, wrapper 4302568928
WARNING: 4409548912
dereference level 2 of 4409548912, wrapper 4302568952
dereference level 1 of 4409548912, wrapper 4302568928
dereference level 0 of 4409548912, wrapper 4302568952

However, I am not totally sure whether this indicates a problem, because in both cases the remaining references are immediately removed. Also, it is always the case that 4 references are set to the libsingular ring before actually creating the polynomial ring in Sage.

One last observation: You may notice a libsingular ring at address 4409549416 that is referenced here as well, aparently in the middle of the construction of QQ['a','b','c','d']. It is later used for QQ['x','y','z']. The last report before the segfault is

reference 32 to 4409549416, wrapper 4302568952

Seems like a wild-goose chase to me, though.

@simon-king-jena
Copy link
Member

comment:12

Replying to @simon-king-jena:

One last observation: You may notice a libsingular ring at address 4409549416 that is referenced here as well, aparently in the middle of the construction of QQ['a','b','c','d']. It is later used for QQ['x','y','z']. The last report before the segfault is

reference 32 to 4409549416, wrapper 4302568952

And this ring is in fact currRing when it crashes.

@nbruin
Copy link
Contributor Author

nbruin commented Sep 15, 2012

comment:13

OK! good progress. Instrumenting sagedoc.py a little bit we can indeed see the order in which the doctests are executed:

__main__
__main__.change_warning_output
__main__.check_with_tolerance
__main__.example_0
__main__.example_1
__main__.example_10
__main__.example_11
__main__.example_12
__main__.example_13
__main__.example_14
__main__.example_15
__main__.example_16
__main__.example_17
__main__.example_18
__main__.example_19
__main__.example_2
__main__.example_20
__main__.example_21
__main__.example_22
__main__.example_23
__main__.example_24
__main__.example_25
__main__.example_26
__main__.example_27
Unhandled SIGSEGV

so that indeed seems to be alphabetical order.

Now let's run the doctests with singular-using-malloc. Result: No segfault. OSX comes with gmalloc, which is a guarded malloc for debugging purposes. It places every allocation on a separate page and unmaps that page upon freeing. So, any access-after-free leads to a segfault. Now we do get a segfault and it happens a lot sooner than example_27. In fact, now the segfault survives in gdb. The error happens when executing

G = I.groebner_basis()###line 921:_sage_    >>> G = I.groebner_basis()

Here's a session with gdb once the segfault has happened. I think I have been able to extract enough data to point at the probably problem.

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x00000001850dbf44
__pyx_f_4sage_4libs_8singular_8function_call_function (__pyx_v_self=0x190ab8960, __pyx_v_args=0x190a8e810, __pyx_v_R=0x19c39be70, __pyx_optional_args=<value temporarily unavailable, due to optimizations>) at sage/libs/singular/function.cpp:13253
13253       currRingHdl->data.uring->ref = (currRingHdl->data.uring->ref - 1);
####NB: This is line 1410 in sage/libs/singular/function.pyx
(gdb) print currRingHdl
$1 = (idhdl) 0x17c2b5fd0
(gdb) print currRingHdl->data
$2 = {
  i = -2062696816,
  uring = 0x1850dbe90,
  p = 0x1850dbe90,
  n = 0x1850dbe90,
  uideal = 0x1850dbe90,
  umap = 0x1850dbe90,
  umatrix = 0x1850dbe90,
  ustring = 0x1850dbe90 <Address 0x1850dbe90 out of bounds>,
  iv = 0x1850dbe90,
  bim = 0x1850dbe90,
  l = 0x1850dbe90,
  li = 0x1850dbe90,
  pack = 0x1850dbe90,
  pinf = 0x1850dbe90
}
(gdb) print currRingHdl->data.uring
$3 = (ring) 0x1850dbe90
(gdb) print currRingHdl->data.uring->ref
Cannot access memory at address 0x1850dbf44
(gdb) print  *__pyx_v_si_ring
$10 = {
  idroot = 0x0, 
  order = 0x19c3cbff0, 
  block0 = 0x19c3cdff0, 
  block1 = 0x19c3cfff0, 
  parameter = 0x0, 
  minpoly = 0x0, 
  minideal = 0x0, 
  wvhdl = 0x19c3c9fe0, 
  names = 0x19c3bdfe0, 
  ordsgn = 0x19c3ddfe0, 
  typ = 0x19c3dffd0, 
  NegWeightL_Offset = 0x0, 
  VarOffset = 0x19c3d9ff0, 
  qideal = 0x0, 
  firstwv = 0x0, 
  PolyBin = 0x104ee8440, 
  ringtype = 0, 
  ringflaga = 0x0, 
  ringflagb = 0, 
  nr2mModul = 0, 
  nrnModul = 0x0, 
  options = 100663424, 
  ch = 0, 
  ref = 0, 
  float_len = 0, 
  float_len2 = 0, 
  N = 3, 
  P = 0, 
  OrdSgn = 1, 
  firstBlockEnds = 3, 
  real_var_start = 0, 
  real_var_end = 0, 
  isLPring = 0, 
  VectorOut = 0, 
  ShortOut = 0, 
  CanShortOut = 1, 
  LexOrder = 0, 
  MixedOrder = 0, 
  ComponentOrder = -1, 
  ExpL_Size = 3, 
  CmpL_Size = 3, 
  VarL_Size = 1, 
  BitsPerExp = 20, 
  ExpPerLong = 3, 
  pCompIndex = 2, 
  pOrdIndex = 0, 
  OrdSize = 1, 
  VarL_LowIndex = 1, 
  MinExpPerLong = 3, 
  NegWeightL_Size = 0, 
  VarL_Offset = 0x19c3e3ff0, 
  bitmask = 1048575, 
  divmask = 1152922604119523329, 
  p_Procs = 0x19c3e7f80, 
  pFDeg = 0x104a80150 <pDeg(spolyrec*, sip_sring*)>, 
  pLDeg = 0x104a80920 <pLDegb(spolyrec*, int*, sip_sring*)>, 
  pFDegOrig = 0x104a80150 <pDeg(spolyrec*, sip_sring*)>, 
  pLDegOrig = 0x104a80920 <pLDegb(spolyrec*, int*, sip_sring*)>, 
  p_Setm = 0x104a7ff40 <p_Setm_TotalDegree(spolyrec*, sip_sring*)>, 
  cf = 0x11e487e70, 
  algring = 0x0, 
  _nc = 0x0
}
(gdb) print __pyx_v_si_ring
$11 = (ip_sring *) 0x19c3c5e90
(gdb) print ((struct __pyx_obj_4sage_5rings_10polynomial_28multi_polynomial_libsingular_MPolynomialRing_libsingular *)__pyx_v_R)->_ring
$12 = (ip_sring *) 0x19c3c5e90
(gdb) print ((struct __pyx_obj_4sage_5rings_10polynomial_6plural_NCPolynomialRing_plural *)__pyx_v_R)->_ring
$13 = (ip_sring *) 0x10019ff30
####NB: so PY_TYPE_CHECK(R, MPolynomialRing_libsingular) is true
(gdb) print (__pyx_v_si_ring != currRing)
$15 = false
####NB: does this mean that rChangeCurrRing(si_ring) got executed or that si_ring already equalled currRing?
(gdb) print (currRingHdl->data.uring != currRing)
$16 = true
####NB: of course, that's why we segfault on the statement that follows:
####NB:       currRingHdl.data.uring.ref -= 1
(gdb) print *(currRingHdl->data.uring)
Cannot access memory at address 0x1850dbe90
####NB: It looks like currRingHdl.data.uring has been unbound.
####NB: naturally, changing a field on that pointer will corrupt memory (or in this case
####NB: because gmalloc has unmapped the page, cause a segfault)
####NB: Could it be that the code here should really test for uring being still valid?
####NB: (if it can do that at all)?

So I think the issue is in sage.lib.singular.function.call_function:

...
    if currRingHdl.data.uring!= currRing:
        currRingHdl.data.uring.ref -= 1
        currRingHdl.data.uring = currRing # ref counting?
        currRingHdl.data.uring.ref += 1
...

The evidence points absolutely to currRingHdl.data.uring pointing to unallocated (probably freed) memory. The access then of course can have all kinds of effects. At this point it is probably doable for a LibSingular expert to reason about the code whether uring should always be valid at this point (I suspect not).

It looks suspicious to me that sage.libs.singular.ring.singular_ring_delete does do a careful dance to zero out the currRing variable but doesn't seem to care about currRngHdl. I also find it worrying that there apparently is a refcount system right on the ring structures (as you can see above) and yet in singular_ring_delete a separate refcounting dict is used. One would think the same refcounting system should be borrowed by singular_ring_new and singular_ring_delete. It looks to me the code above thinks that by increasing ...uring.ref the reference is protected, but singular_ring_delete doesn't seem to take into account this refcount. It could well be that I'm misinterpreting the code and that this is all perfectly safe, though.

Libsingular specialists: Keep in mind that in principle, singular code can get executed in rather awkward moments, possibly as part of clean-ups of circular garbage and call-backs on weakref cleanup, where equality might be tested of objects that are soon to be deallocated themselves.

The fickleness of the bug is consistent with this condition arising during a cyclic garbage collection with just the right amount of junk around. That would make the occurrence of the bug depend on just about everything in memory. Or at least if you depend on the corruption leading to a segfault, it depends on which location exactly gets corrupted.

I think we might be getting close to a badge for debugging excellence here!

@nbruin
Copy link
Contributor Author

nbruin commented Sep 15, 2012

Attachment: trac_13447-double_refcount.patch.gz

take into account both refcount_dict and ring*.ref fields on deletion.

@nbruin
Copy link
Contributor Author

nbruin commented Sep 15, 2012

Attachment: trac_13447-consolidated_refcount.patch.gz

Consolidate two refcount systems (cruft not yet removed from patch)

@nbruin
Copy link
Contributor Author

nbruin commented Sep 15, 2012

comment:14

OK, two independent patches. Either prevents the segfault. I may just have removed the symptom, but not the cause.

If I'm correctly understanding the problem, attachment: trac_13447-consolidated_refcount.patch should be the preferred solution. However, my unfamiliarity with (lib)singular's intricacies might have caused an oversight somewhere. I think my interpretation is consistent with the use in sage.lib.singular.function.call_function, which is my main source of inspiration.

If people agree, we can clean out the cruft remaining from the refcounting method implemented locally.

@nbruin
Copy link
Contributor Author

nbruin commented Sep 15, 2012

Work Issues: Input from libsingular experts

@simon-king-jena
Copy link
Member

comment:16

Replying to @nbruin:

If I'm correctly understanding the problem, attachment: trac_13447-consolidated_refcount.patch should be the preferred solution.

I didn't test the patch yet. However, it seems very straight forward to me: There already is a refcounting, and thus one should use it. I am Cc'ing Volker Braun and Martin von Gagern, the authors of #11339. Does attachment: trac_13447-consolidated_refcount.patch make sense to you as well?

Keeping a double refcount (as with attachment: trac_13447-double_refcount.patch seems suspicious to me.

Perhaps one should let the patchbots test it? Thus, I'll add this as dependency for #715, and for the patchbot:

Apply trac_13447-consolidated_refcount.patch

PS: You really deserve a badge for debugging excellence! Do I understand correctly that the bug is not on the side of Singular? I'll inform Hans accordingly.

@simon-king-jena

This comment has been minimized.

@nbruin
Copy link
Contributor Author

nbruin commented Sep 15, 2012

comment:17

With the new refcounting, I think it could be that currRingHdl.data.uring holds the last reference to a ring. In fact, it seems that was the source of the segfaults. If that reference is removed in call_function, shouldn't we delete the ring? The naive solution

...
        currRingHdl.data.uring.ref -= 1
        if currRingHdl.data.uring.ref == 0:
            rDelete(currRingHdl.data.uring)
        currRingHdl.data.uring = currRing # ref counting?
        currRingHdl.data.uring.ref += 1
...

seems to have no ill effect (I put a print statement there that did produce some output, so it does happen), but perhaps I'm overlooking something. Are there other places where references are liable to be lost?

@simon-king-jena
Copy link
Member

comment:18

For the record, the following tests fail:

        sage -t  -force_lib devel/sage/sage/libs/singular/ring.pyx # 6 doctests failed
        sage -t  -force_lib devel/sage/sage/modular/modsym/ambient.py # 1 doctests failed
        sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_libsingular.pyx # 1 doctests failed

with

$ hg qa
trac_715_combined.patch
trac_715_local_refcache.patch
trac_715_safer.patch
trac_715_specification.patch
trac_11521_homset_weakcache_combined.patch
trac_11521_callback.patch
trac_13447-consolidated_refcount.patch

So, not all is good, but almost...

@vbraun
Copy link
Member

vbraun commented Sep 15, 2012

comment:19

I don't know if ring.ref has any meaning to Singular. If we are indeed free to use that field for reference counting in Sage then I'm fine with trac_13447-consolidated_refcount.patch.

Upstream plans to get rid of the whole currRing global variable eventually, for the record.

@simon-king-jena
Copy link
Member

comment:20

Two errors mentioned in comment:18 look (again) difficult.

The first one:

sage -t  -force_lib devel/sage/sage/modular/modsym/ambient.py
**********************************************************************
File "/scratch/sking/sage-5.4.beta0/devel/sage-main/sage/modular/modsym/ambient.py", line 1351:
    sage: ModularSymbols(20,2).boundary_space().dimension()
Expected:
    6
Got:
    0

Hence, the way how one refcounts libsingular rings influences the dimension of Hecke modules. Strange at least...

Note, however, that the value returned by the "dimension()" method above is not constant, because it only returns a lower bound (if I recall correctly) that is increased when one learns more about the Hecke module. Hence, it could very well be that ModularSymbols(20,2).boundary_space() used to be cached but is now garbage collected, so that information on the dimension is lost.

The second error is apparently ignored and only printed to stderr:

sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_ring.py
         [2.1 s]
sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial.pyx
         [4.3 s]
sage -t  -force_lib devel/sage/sage/rings/polynomial/groebner_fan.py
         [7.8 s]
sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_libsingular.pyx
Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored
Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored
**********************************************************************
File "/scratch/sking/sage-5.4.beta0/devel/sage-main/sage/rings/polynomial/multi_polynomial_libsingular.pyx", line 409:
    sage: len(ring_refcount_dict) == n + 1
Expected:
    True
Got:
    False
**********************************************************************
1 items had failures:
   1 of  19 in __main__.example_4
***Test Failed*** 1 failures.
For whitespace errors, see the file /Users/SimonKing/.sage/tmp/bsd.math.washington.edu-96119/multi_polynomial_libsingular_8098.py
         [4.2 s]
sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_ring_generic.pyx
         [2.3 s]
sage -t  -force_lib devel/sage/sage/rings/polynomial/polydict.pyx
         [2.0 s]

Since these were parallel tests, I can't tell were the ignored attribute errors actually came from.

@simon-king-jena
Copy link
Member

comment:21

PS: When consolidating refcounters, we must not forget #13145, which only got merged in sage-5.4.beta1.

@embray embray modified the milestones: sage-8.8, sage-8.9 Jun 14, 2019
@embray
Copy link
Contributor

embray commented Dec 30, 2019

comment:162

Ticket retargeted after milestone closed

@embray embray modified the milestones: sage-8.9, sage-9.1 Dec 30, 2019
@mkoeppe
Copy link
Contributor

mkoeppe commented May 1, 2020

comment:163

Moving tickets to milestone sage-9.2 based on a review of last modification date, branch status, and severity.

@mkoeppe mkoeppe modified the milestones: sage-9.1, sage-9.2 May 1, 2020
@mwageringel
Copy link

comment:164

I have merged the latest beta and get the two test failures below. It also seems that indeed this ticket will help with the problem in #29528, which came up in the upgrade of Singular #25993.

File "src/sage/rings/polynomial/multi_polynomial_libsingular.pyx", line 438, in sage.rings.polynomial.multi_polynomial_libsingular.MPolynomialRing_libsingular.__copy__
Failed example:
    total_ring_reference_count() == n + 4
Expected:
    True
Got:
    False

The above is equal to n + 3.

File "src/sage/rings/function_field/element.pyx", line 575, in sage.rings.function_field.element.FunctionFieldElement.valuation
Failed example:
    O = L.maximal_order()
Exception raised:
    Traceback (most recent call last):
      File "sage/misc/cachefunc.pyx", line 996, in sage.misc.cachefunc.CachedFunction.__call__ (build/cythonized/sage/misc/cachefunc.c:5891)
        return self.cache[k]
      File "sage/misc/weak_dict.pyx", line 704, in sage.misc.weak_dict.WeakValueDictionary.__getitem__ (build/cythonized/sage/misc/weak_dict.c:3671)
        raise KeyError(k)
    KeyError: ((<class 'sage.rings.function_field.order.FunctionFieldMaximalOrder_polymod'>, Function field in y defined by y^2 + y + (x^2 + 1)/x), ())

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/amd/compute/mwagerin/git/sage_compute/python3/local/lib/python3.7/site-packages/sage/doctest/forker.py", line 680, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/amd/compute/mwagerin/git/sage_compute/python3/local/lib/python3.7/site-packages/sage/doctest/forker.py", line 1104, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.rings.function_field.element.FunctionFieldElement.valuation[6]>", line 1, in <module>
        O = L.maximal_order()
      File "/amd/compute/mwagerin/git/sage_compute/python3/local/lib/python3.7/site-packages/sage/rings/function_field/function_field.py", line 1884, in maximal_order
        return FunctionFieldMaximalOrder_polymod(self)
      File "sage/misc/classcall_metaclass.pyx", line 322, in sage.misc.classcall_metaclass.ClasscallMetaclass.__call__ (build/cythonized/sage/misc/classcall_metaclass.c:1741)
        return cls.classcall(cls, *args, **kwds)
      File "sage/misc/cachefunc.pyx", line 1001, in sage.misc.cachefunc.CachedFunction.__call__ (build/cythonized/sage/misc/cachefunc.c:6017)
        w = self.f(*args, **kwds)
      File "/amd/compute/mwagerin/git/sage_compute/python3/local/lib/python3.7/site-packages/sage/structure/unique_representation.py", line 1008, in __classcall__
        instance = typecall(cls, *args, **options)
      File "sage/misc/classcall_metaclass.pyx", line 486, in sage.misc.classcall_metaclass.typecall (build/cythonized/sage/misc/classcall_metaclass.c:2191)
        return (<PyTypeObject*>type).tp_call(cls, args, kwds)
      File "/amd/compute/mwagerin/git/sage_compute/python3/local/lib/python3.7/site-packages/sage/rings/function_field/order.py", line 1245, in __init__
        basis = [from_model(g) for g in model._maximal_order_basis()]
      File "/amd/compute/mwagerin/git/sage_compute/python3/local/lib/python3.7/site-packages/sage/rings/function_field/function_field.py", line 3572, in _maximal_order_basis
        pols_in_S = _singular_normal(S.ideal(g))[0]
      File "/amd/compute/mwagerin/git/sage_compute/python3/local/lib/python3.7/site-packages/sage/rings/qqbar_decorators.py", line 68, in wrapper
        return func(*args, **kwds)
      File "/amd/compute/mwagerin/git/sage_compute/python3/local/lib/python3.7/site-packages/sage/rings/function_field/function_field.py", line 3491, in _singular_normal
        lib('normal.lib')
      File "sage/libs/singular/function.pyx", line 1846, in sage.libs.singular.function.lib (build/cythonized/sage/libs/singular/function.cpp:18116)
        sig_on()
    cysignals.signals.SignalError: Segmentation fault

New commits:

b4df23913447: Merge tag '9.2.beta0' into #13447

@mwageringel
Copy link

Changed commit from c2b0bf6 to b4df239

@mkoeppe mkoeppe modified the milestones: sage-9.2, sage-9.3 Sep 5, 2020
@mkoeppe
Copy link
Contributor

mkoeppe commented Feb 13, 2021

comment:166

Setting new milestone based on a cursory review of ticket status, priority, and last modification date.

@mkoeppe mkoeppe modified the milestones: sage-9.3, sage-9.4 Feb 13, 2021
@mkoeppe
Copy link
Contributor

mkoeppe commented Jul 19, 2021

comment:167

Setting a new milestone for this ticket based on a cursory review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants