Iterating over versions leads to id error #162

Closed
ArvidJB opened this issue Feb 17, 2021 · 15 comments

ArvidJB commented Feb 17, 2021

I'm not sure what's going on here; is this perhaps a reference counting problem? It does not fail every time I run it, but it fails about 90% of the time.

In [1]: from versioned_hdf5 import VersionedHDF5File

In [2]: with h5py.File('foo.h5', 'w') as f:
   ...:     vf = VersionedHDF5File(f)
   ...:     with vf.stage_version('0') as sv:
   ...:         sv.create_dataset('bar', data=np.zeros(0, dtype='double'))
   ...:
[PYFLYBY] import h5py
[PYFLYBY] import numpy as np

In [3]: for i in range(1, 100):
   ...:     with h5py.File('foo.h5', 'r+') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version(str(i)) as sv:
   ...:             sv['bar'].resize((i,))
   ...:             sv['bar'][i-1] = i
   ...:

In [4]: with h5py.File('foo.h5', 'r') as f:
   ...:     versions = list(f['_version_data/versions'].keys())
   ...:

In [5]: with h5py.File('foo.h5', 'r') as f:
   ...:     vf = VersionedHDF5File(f)
   ...:     for version in versions:
   ...:         if version != '__first_version__':
   ...:             cv = vf[version]
   ...:             cv['bar'][:]
   ...:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-73795d691ba0> in <module>
      4         if version != '__first_version__':
      5             cv = vf[version]
----> 6             cv['bar'][:]
      7

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

/usr/local/python/python-3.7/std/lib64/python3.7/site-packages/h5py/_hl/files.py in __exit__(self, *args)
    459     def __exit__(self, *args):
    460         if self.id:
--> 461             self.close()
    462
    463     @with_phil

/usr/local/python/python-3.7/std/lib64/python3.7/site-packages/h5py/_hl/files.py in close(self)
    429                 # Close file-resident objects first, then the files.
    430                 # Otherwise we get errors in MPI mode.
--> 431                 id_list = h5f.get_obj_ids(self.id, ~h5f.OBJ_FILE)
    432                 file_list = h5f.get_obj_ids(self.id, h5f.OBJ_FILE)
    433

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.get_obj_ids()

RuntimeError: Can't increment id ref count (can't locate ID)

asmeurer commented

Looks like it's the same error from #125 (comment) and #148 again.

Indeed, the h5py commit 773680edcfd868a32a2b047ad654fb3d28da92b2 that "fixed" the problem on the other issue seems to fix this as well. But the problem reappears in the latest h5py (see the discussions on #148). It might be time for me to dig into this further. I am still not sure whether this is a bug in h5py or an issue with something we are doing.

ArvidJB commented Feb 18, 2021

If you could get to the bottom of this, that would be great.
This reproducer is fairly "complex" in that it reads multiple versions, which seems to be sufficient to trigger the bug. I think increasing the number of versions, datasets, etc. makes it even more likely to reproduce.
Most of our code is much simpler (open file, read a couple of datasets, close file), which is probably why we don't observe this in practice... yet.

@ericdatakelly ericdatakelly added this to the March 2021 milestone Feb 22, 2021
asmeurer added a commit to asmeurer/versioned-hdf5 that referenced this issue Feb 25, 2021
This lets us avoid using a weakref dictionary to have a global reference.

Unfortunately, this does not appear to fix the garbage collection bug with
hdf5 (deshaw#162). That seems to indicate that the issue is not directly related to
weakrefs. Either way, this does seem like it might be a better design for the
registry of groups.

asmeurer commented Mar 4, 2021

Calling gc.collect() right before closing the file appears to make the issue go away, e.g.,

        with h5py.File('foo.h5', 'r') as f:
            vf = VersionedHDF5File(f)
            for version in versions:
                if version != '__first_version__':
                    cv = vf[version]
                    cv['bar'][:]
            gc.collect()

Some more testing is needed to see if this is a fluke, but so far, it makes the error go away in every reproducer I know of. I'm still working out what this means in terms of the actual source of the error, but this may be useful as a workaround.

asmeurer commented Mar 5, 2021

It looks like disabling the garbage collector inside of h5py.File.close also fixes the issue.

diff --git a/h5py/_hl/files.py b/h5py/_hl/files.py
index 73ce5a75..3f605204 100644
--- a/h5py/_hl/files.py
+++ b/h5py/_hl/files.py
@@ -428,22 +428,27 @@ class File(Group):
 
                 # Close file-resident objects first, then the files.
                 # Otherwise we get errors in MPI mode.
-                id_list = h5f.get_obj_ids(self.id, ~h5f.OBJ_FILE)
-                file_list = h5f.get_obj_ids(self.id, h5f.OBJ_FILE)
-
-                id_list = [x for x in id_list if h5i.get_file_id(x).id == self.id.id]
-                file_list = [x for x in file_list if h5i.get_file_id(x).id == self.id.id]
-
-                for id_ in id_list:
-                    while id_.valid:
-                        h5i.dec_ref(id_)
-
-                for id_ in file_list:
-                    while id_.valid:
-                        h5i.dec_ref(id_)
-
-                self.id.close()
-                _objects.nonlocal_close()
+                import gc
+                gc.disable()
+                try:
+                    id_list = h5f.get_obj_ids(self.id, ~h5f.OBJ_FILE)
+                    file_list = h5f.get_obj_ids(self.id, h5f.OBJ_FILE)
+
+                    id_list = [x for x in id_list if h5i.get_file_id(x).id == self.id.id]
+                    file_list = [x for x in file_list if h5i.get_file_id(x).id == self.id.id]
+
+                    for id_ in id_list:
+                        while id_.valid:
+                            h5i.dec_ref(id_)
+
+                    for id_ in file_list:
+                        while id_.valid:
+                            h5i.dec_ref(id_)
+
+                    self.id.close()
+                    _objects.nonlocal_close()
+                finally:
+                    gc.enable()
 
     def flush(self):
         """ Tell the HDF5 library to flush its buffers.

Are either of these workarounds tenable? I don't think they really get to the source of the issue, but I'm not sure how much more time I should spend digging into this to figure out what is going on.

quarl commented Mar 5, 2021

I took a look just now and have some good news.

Adding a call to VersionedHDF5File.close() before the h5py.File.close() makes the problem go away:

    with h5py.File('foo.h5', 'r') as f:
        vf = VersionedHDF5File(f)
        for version in versions:
            vf[version]['bar'][:]
        vf.close()                       # <-- adding this line makes the problem go away

Similarly, instead of calling vf.close(), using del vf or vf._versions = vf._version_data = None also makes the problem go away.

Notes:

  • I think making sure to close the vf before the f should be a totally fine workaround for us. Arvid, can you confirm this makes the problem go away in D. E. Shaw code?
  • We should use it in a context manager, e.g. with closing(VersionedHDF5File(f)) as vf (or a try...finally).
  • Aaron, can we make VersionedHDF5File support __enter__ and __exit__ (calling self.close())? Then as clients we can just directly use with VersionedHDF5File(f) as vf.
  • It's less urgent to fix the underlying bug, now that we have this good workaround, though I would still keep this issue open.
  • This information may help with getting to the bottom of the bug. A necessary condition to reproduce the bug seems to be that there needs to be a python reference to the h5py.Group object at the time h5py.File.close() is called.
  • Possibly making VersionedHDF5File._versions and VersionedHDF5File._version_data be weakrefs could help (I know you changed some things to/from weakrefs, not sure if these were involved)
  • I noticed that if you do something like vf[version] after f.close(), it raises KeyError as if the version doesn't exist. Can we make operations like this raise an exception that indicates the file was closed?
  • I noticed that if you do something like print(vf) after f.close(), that leads to slightly confusing exceptions. Can we make VersionedHDF5File.__repr__ return a string in this situation?
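
The `__enter__`/`__exit__` support suggested above could look roughly like the following. This is a hypothetical sketch, not the actual versioned_hdf5 implementation; the class name `VersionedHDF5FileSketch` and the stand-in attributes are illustrative only (the `_versions`/`_version_data` names are taken from the workaround mentioned above):

```python
class VersionedHDF5FileSketch:
    """Hypothetical sketch of context-manager support. The real
    VersionedHDF5File wraps an h5py.File and holds handles to
    file-resident groups."""

    def __init__(self, f):
        self.f = f
        # Stand-ins for the real h5py group handles:
        self._versions = object()
        self._version_data = object()

    def close(self):
        # Drop references to file-resident objects so nothing
        # outlives the underlying h5py.File.close().
        self._versions = None
        self._version_data = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
```

With something like this in place, clients could nest with VersionedHDF5File(f) as vf: inside the with h5py.File(...) block, guaranteeing close() runs before the file itself is closed.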

asmeurer commented Mar 5, 2021

Similarly, instead of calling vf.close(), using del vf or vf._versions = vf._version_data = None also makes the problem go away.

I can still reproduce the issue with all of those things. Here is the script I'm using:

import h5py
from versioned_hdf5 import VersionedHDF5File

N = 1000
def run():
    with h5py.File('test.hdf5', 'w') as f:
        file = VersionedHDF5File(f)
        for i in range(N):
            print(f'{i}/{N}')
            with file.stage_version(str(i), '') as g:
                g.create_dataset('a/' + str(i), data=list(range(100000)))

if __name__ == '__main__':
    run()

Note that you have to be a little careful when playing with this, because a seemingly innocuous change can change how the garbage collector runs and cause the issue to go away for a specific example. I am able to consistently reproduce the issue by creating a large number of objects (here by making a lot of versions).

I also use this script, which is the script from above run in a loop. It usually takes a few iterations before it fails (up to 36 iterations in some cases that I've tested).

from versioned_hdf5 import VersionedHDF5File
import numpy as np
import h5py

def run():
    for i in range(50):
        print(i)
        with h5py.File('foo.h5', 'w') as f:
            vf = VersionedHDF5File(f)
            with vf.stage_version('0') as sv:
                sv.create_dataset('bar', data=np.zeros(0, dtype='double'))

        for i in range(1, 100):
            with h5py.File('foo.h5', 'r+') as f:
                vf = VersionedHDF5File(f)
                with vf.stage_version(str(i)) as sv:
                    sv['bar'].resize((i,))
                    sv['bar'][i-1] = i

        with h5py.File('foo.h5', 'r') as f:
            versions = list(f['_version_data/versions'].keys())

        with h5py.File('foo.h5', 'r') as f:
            vf = VersionedHDF5File(f)
            for version in versions:
                if version != '__first_version__':
                    cv = vf[version]
                    cv['bar'][:]

if __name__ == '__main__':
    run()

I didn't mention it, but completely disabling the garbage collector also makes the problem go away (obviously not a viable workaround, but it does show that garbage collection is the source of the problem).

I agree with your notes that things should behave better once the file is closed. I've tried to implement proper behavior for version groups that are already committed, but I haven't paid as close attention to what happens once the file itself is closed.

asmeurer commented Mar 6, 2021

Here is some technical context for this issue. The issue occurs when the h5py File object is closed. The way HDF5 works is that every object has an ID. IDs are integers, and work kind of like file descriptors, but apply to all object types (files, groups, datasets, various metadata objects). Every kind of object can be "closed", not just the top-level file object. Object IDs can be reused once they are closed. I'm not sure yet if this fact is relevant to this issue. If it is, then this comment in the h5py source is relevant.

In pure HDF5, closing a file does not close the objects it contains. So h5py works around this by manually closing every open object when a file is closed.

This is what is happening in the h5py code in the traceback:

/usr/local/python/python-3.7/std/lib64/python3.7/site-packages/h5py/_hl/files.py in close(self)
    429                 # Close file-resident objects first, then the files.
    430                 # Otherwise we get errors in MPI mode.
--> 431                 id_list = h5f.get_obj_ids(self.id, ~h5f.OBJ_FILE)
    432                 file_list = h5f.get_obj_ids(self.id, h5f.OBJ_FILE)
    433

It is trying to get a list of every open object ID corresponding to the current file. This fails in Cython code that wraps the HDF5 API. One HDF5 function returns some set of supposedly open object ids, but another function, called right below that, fails on one of those IDs (as if it were actually closed). The only other place that object IDs are closed is in the object deallocation code (i.e., during garbage collection).

This is as far as I've gotten with figuring this out. It smells like an HDF5 issue, because one HDF5 API function gives back an object ID that another rejects (specifically, one function returns an ID which causes the other to return nonzero). The h5py code even has a locking mechanism to prevent multiple threads from causing race conditions here (not that I have any reason to suspect our code is using threads).

Disabling the garbage collector before closing the file seems to fix the issue, suggesting that somehow garbage collection creates some inconsistency in the HDF5/h5py object ids.
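
A reusable form of that workaround would be a small guard that disables the collector around the close path and restores its previous state. This is a sketch, not part of h5py, and `gc_disabled` is a hypothetical name; note it re-enables the collector only if it was enabled on entry, so nested uses behave correctly:

```python
import gc
from contextlib import contextmanager

@contextmanager
def gc_disabled():
    """Temporarily disable automatic garbage collection.

    Restores the collector's previous state on exit, so wrapping
    code that already runs with gc disabled does not re-enable it.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()
```

Note that gc.disable() only suppresses automatic collection passes; explicit gc.collect() calls still work inside the guarded block.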

ArvidJB commented Mar 6, 2021

The garbage collector has some debug output which can be enabled with gc.set_debug. It might be interesting to see what exactly is going on, maybe the logging will reveal which objects are causing this issue?

As a short-term fix for us I could create a context manager wrapping VersionedHDF5File which just calls gc.collect() when it exits. I will try that and confirm whether it fixes all of our issues.
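
Such a wrapper could be just a few lines. This is a sketch under the assumption that one full collection pass on exit is sufficient; `collect_on_exit` is a hypothetical name:

```python
import gc
from contextlib import contextmanager

@contextmanager
def collect_on_exit(vf):
    """Yield the wrapped object, then force a full GC pass so that
    any cyclic references to file-resident h5py objects are freed
    before h5py.File.close() enumerates open object IDs."""
    try:
        yield vf
    finally:
        gc.collect()
```

Usage would nest it inside the file context, e.g. with h5py.File('foo.h5', 'r') as f: followed by with collect_on_exit(VersionedHDF5File(f)) as vf:, so the collection happens before f is closed.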

quarl commented Mar 6, 2021

Confirmed that with Aaron's N=1000 reproducer, calling VersionedHDF5File.close() doesn't make the symptom go away, but gc.collect() does.

How about we add a call to gc.collect() inside (at the end of) VersionedHDF5File.close()? I think it's more reasonable to require that clients of vf = VersionedHDF5File(f) must call vf.close() before f.close() than to require that clients must call gc.collect().

FTR, here is Reproducer1B, based on Arvid's original reproducer. Here, VersionedHDF5File.close() makes the problem go away even without a gc.collect(). I was using this since it runs faster, though I guess different reproducers exhibit symptoms differently.


import gc
import h5py
import numpy as np
import os
from   versioned_hdf5           import VersionedHDF5File

if not os.path.exists('foo.h5'):
    with h5py.File('foo.h5', 'w') as f:
        vf = VersionedHDF5File(f)
        with vf.stage_version('0') as sv:
            sv.create_dataset('bar', data=np.zeros(1, dtype='double'))

    for i in range(1, 100):
        with h5py.File('foo.h5', 'r+') as f:
            vf = VersionedHDF5File(f)
            with vf.stage_version(str(i)) as sv:
                sv['bar'][0] = i


# with h5py.File('foo.h5', 'r') as f:
#     versions = list(f['_version_data/versions'].keys())
# versions.remove("__first_version__")
# versions.sort(key=int)

versions = list(map(str, range(100)))


for i in range(100):
    print("i =",i)
    with h5py.File('foo.h5', 'r') as f:
        vf = VersionedHDF5File(f)
        for version in versions[:100]:
            vf[version]['bar'][:]
        # vf.close()

quarl commented Mar 6, 2021

For debugging, I added a printout of the result of get_obj_ids() before and after a gc.collect() following a bunch of VersionedHDF5File activity.
I see a lot of h5py.h5d.DatasetID and h5py.h5g.GroupID references before the gc.collect(), which are almost all gone after it (just 3 remaining, probably f, f['_version_data'], and f['_version_data/versions']).

Hypotheses:

  • HypothesisA. In the VersionedHDF5 use case, there are some Python references to h5py objects that Python wasn't able to destruct automatically at the time of a regular DECREF; this could be due to cyclic references or weakrefs or whatnot; these references are therefore only cleaned up at the next gc.collect().
  • HypothesisB. If a gc.collect() happens in the middle of h5py.File.close, then some of these references get cleaned up at such a time that the h5py cleanup logic breaks: it holds a reference to something it thinks still needs to be cleaned up, but that object no longer exists.

If this hypothesis is correct, it does indeed seem like a bug that h5py.File.close doesn't account for the possibility that objects can disappear mid-function due to gc getting triggered. In that case h5py.File.close ought to be fixed, either by modifying the "get all outstanding objects, then clean up all outstanding objects" logic, or by simply adding a gc.collect() at the top of h5py.File.close and/or disabling gc temporarily. I.e., Aaron's "workaround" patch above may in fact be a real fix worth getting merged into h5py.

quarl commented Mar 6, 2021

Here's Reproducer4:


import h5py
import versioned_hdf5.wrappers

versions = list(map(str, range(100)))

for i in range(100):
    print("i =",i)
    with h5py.File('foo.h5', 'r') as f:
        refs = []
        for version in versions:
            refs.append(versioned_hdf5.wrappers.InMemoryGroup(f["_version_data/versions/%s" % version].id))
        refs.append(refs)
        del refs

The del refs is necessary to make it reproduce. That's consistent with the hypothesis above.
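
The role of the cycle (refs.append(refs)) can be demonstrated with plain Python objects, independent of h5py: an object caught in a reference cycle is never freed by reference counting alone, only by a pass of the cycle collector, which is exactly what delays the cleanup of the underlying h5py IDs. A minimal sketch:

```python
import gc
import weakref

class Node:
    pass

def make_cycle():
    n = Node()
    n.self_ref = n          # reference cycle: the refcount never drops to zero
    return weakref.ref(n)   # weak reference lets us observe the object's lifetime

gc.disable()                # suppress automatic collection for the demonstration
try:
    ref = make_cycle()
    # The local variable is gone, but the cycle keeps the object alive:
    assert ref() is not None
    gc.collect()            # only a collector pass (explicit or automatic) frees it
    assert ref() is None
finally:
    gc.enable()
```

In the reproducer, the h5py ID objects sit in exactly this situation, so their deallocation is deferred until a collection pass, which may land in the middle of h5py.File.close.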

quarl commented Mar 6, 2021

Reproducer5. I managed to create a totally self-contained reproducer that doesn't use versioned-hdf5 at all.
I think this reproducer ought to be sufficient to take to the h5py folks and ask them to merge a change to add a gc.collect() and/or gc.disable()/gc.enable() to h5py.File.close.

import h5py

N = 100
with h5py.File('foo.h5', 'w') as f:
    for i in range(N):
        f[str(i)] = []

for i in range(100):
    print("i =",i)
    with h5py.File('foo.h5', 'r') as f:
        refs = []
        for i in range(N):
            refs.append(f[str(i)].id)
        refs.append(refs)
        del refs  # <-- important to make it likely that gc of id refs happens during h5py.File.close

@ericdatakelly ericdatakelly modified the milestones: March 2021, April 2021 Mar 8, 2021

asmeurer commented Mar 8, 2021

Nice. I was actually thinking the same thing: we could make a reproducer independent of versioned-hdf5 by forcing some h5py objects into cyclic references so that they have to be cleaned up by the garbage collector. I guess most use cases of h5py don't actually do this, so the garbage collector doesn't usually come into play. I intended to do something similar, but you beat me to it. Your reproducer still works against h5py master.

Here is the upstream h5py issue h5py/h5py#1852. I will work on some of the points from #162 (comment).

@ericdatakelly ericdatakelly modified the milestones: April 2021, May 2021 Apr 5, 2021

ericdatakelly commented

Can this be closed? I think the upstream issue is resolved.

ArvidJB commented Apr 9, 2021

Yes, this can be closed. We have patched our h5py with the upstream patch.

@ArvidJB ArvidJB closed this as completed Apr 9, 2021
asmeurer added a commit to asmeurer/versioned-hdf5 that referenced this issue Apr 15, 2021
This is the issue from deshaw#162.
h5py has a fix, but we won't be able to use it in 2.10.0 unless we manually
backport the patch and custom build h5py. The bug doesn't affect the actual
tests, it just causes them to crash sometimes when the file is closed.
asmeurer added a commit to asmeurer/versioned-hdf5 that referenced this issue Apr 20, 2021
asmeurer added a commit to asmeurer/versioned-hdf5 that referenced this issue Apr 21, 2021
… is closed

This addresses some comments that were brought up in issue deshaw#162.
@ericdatakelly ericdatakelly modified the milestones: May 2021, April 2021 Apr 26, 2021