Iterating over versions leads to id error #162

Closed
ArvidJB opened this issue Feb 17, 2021 · 15 comments

ArvidJB commented Feb 17, 2021

I'm not sure what's going on here; is this perhaps a reference counting problem? It does not fail every time I run it, but it fails about 90% of the time.

In [1]: from versioned_hdf5 import VersionedHDF5File

In [2]: with h5py.File('foo.h5', 'w') as f:
   ...:     vf = VersionedHDF5File(f)
   ...:     with vf.stage_version('0') as sv:
   ...:         sv.create_dataset('bar', data=np.zeros(0, dtype='double'))
   ...:
[PYFLYBY] import h5py
[PYFLYBY] import numpy as np

In [3]: for i in range(1, 100):
   ...:     with h5py.File('foo.h5', 'r+') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version(str(i)) as sv:
   ...:             sv['bar'].resize((i,))
   ...:             sv['bar'][i-1] = i
   ...:

In [4]: with h5py.File('foo.h5', 'r') as f:
   ...:     versions = list(f['_version_data/versions'].keys())
   ...:

In [5]: with h5py.File('foo.h5', 'r') as f:
   ...:     vf = VersionedHDF5File(f)
   ...:     for version in versions:
   ...:         if version != '__first_version__':
   ...:             cv = vf[version]
   ...:             cv['bar'][:]
   ...:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-73795d691ba0> in <module>
      4         if version != '__first_version__':
      5             cv = vf[version]
----> 6             cv['bar'][:]
      7

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

/usr/local/python/python-3.7/std/lib64/python3.7/site-packages/h5py/_hl/files.py in __exit__(self, *args)
    459     def __exit__(self, *args):
    460         if self.id:
--> 461             self.close()
    462
    463     @with_phil

/usr/local/python/python-3.7/std/lib64/python3.7/site-packages/h5py/_hl/files.py in close(self)
    429                 # Close file-resident objects first, then the files.
    430                 # Otherwise we get errors in MPI mode.
--> 431                 id_list = h5f.get_obj_ids(self.id, ~h5f.OBJ_FILE)
    432                 file_list = h5f.get_obj_ids(self.id, h5f.OBJ_FILE)
    433

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.get_obj_ids()

RuntimeError: Can't increment id ref count (can't locate ID)

asmeurer commented

Looks like it's the same error from #125 (comment) and #148 again.

Indeed, the h5py commit 773680edcfd868a32a2b047ad654fb3d28da92b2 that "fixed" the problem on the other issue seems to fix this as well. But the problem reappears in the latest h5py (see the discussions on #148). It might be time for me to dig into this further. I am still not sure whether this is a bug in h5py or an issue with something we are doing.

ArvidJB commented Feb 18, 2021

If you could get to the bottom of this, that would be great.
This reproducer is fairly "complex" in that it reads multiple versions, which seems to be sufficient to trigger the bug. I think increasing the number of versions, datasets, etc. makes it even more likely to reproduce.
Most of our code is much simpler (open file, read a couple of datasets, close file), which is probably why we don't observe this in practice... yet.

@ericdatakelly ericdatakelly added this to the March 2021 milestone Feb 22, 2021
asmeurer added a commit to asmeurer/versioned-hdf5 that referenced this issue Feb 25, 2021
This lets us avoid using a weakref dictionary to have a global reference.

Unfortunately, this does not appear to fix the garbage collection bug with
hdf5 (deshaw#162). That seems to indicate that the issue is not directly related to
weakrefs. Either way, this does seem like it might be a better design for the
registry of groups.

asmeurer commented Mar 4, 2021

Calling gc.collect() right before closing the file appears to make the issue go away, e.g.,

        with h5py.File('foo.h5', 'r') as f:
            vf = VersionedHDF5File(f)
            for version in versions:
                if version != '__first_version__':
                    cv = vf[version]
                    cv['bar'][:]
            gc.collect()

Some more testing is needed to see if this is a fluke, but so far, it makes the error go away in every reproducer I know of. I'm still working out what this means in terms of the actual source of the error, but this may be useful as a workaround.

asmeurer commented Mar 5, 2021

It looks like disabling the garbage collector inside of h5py.File.close also fixes the issue.

diff --git a/h5py/_hl/files.py b/h5py/_hl/files.py
index 73ce5a75..3f605204 100644
--- a/h5py/_hl/files.py
+++ b/h5py/_hl/files.py
@@ -428,22 +428,27 @@ class File(Group):
 
                 # Close file-resident objects first, then the files.
                 # Otherwise we get errors in MPI mode.
-                id_list = h5f.get_obj_ids(self.id, ~h5f.OBJ_FILE)
-                file_list = h5f.get_obj_ids(self.id, h5f.OBJ_FILE)
-
-                id_list = [x for x in id_list if h5i.get_file_id(x).id == self.id.id]
-                file_list = [x for x in file_list if h5i.get_file_id(x).id == self.id.id]
-
-                for id_ in id_list:
-                    while id_.valid:
-                        h5i.dec_ref(id_)
-
-                for id_ in file_list:
-                    while id_.valid:
-                        h5i.dec_ref(id_)
-
-                self.id.close()
-                _objects.nonlocal_close()
+                import gc
+                gc.disable()
+                try:
+                    id_list = h5f.get_obj_ids(self.id, ~h5f.OBJ_FILE)
+                    file_list = h5f.get_obj_ids(self.id, h5f.OBJ_FILE)
+
+                    id_list = [x for x in id_list if h5i.get_file_id(x).id == self.id.id]
+                    file_list = [x for x in file_list if h5i.get_file_id(x).id == self.id.id]
+
+                    for id_ in id_list:
+                        while id_.valid:
+                            h5i.dec_ref(id_)
+
+                    for id_ in file_list:
+                        while id_.valid:
+                            h5i.dec_ref(id_)
+
+                    self.id.close()
+                    _objects.nonlocal_close()
+                finally:
+                    gc.enable()
 
     def flush(self):
         """ Tell the HDF5 library to flush its buffers.

Are either of these workarounds tenable? I don't think they really get to the source of the issue, but I'm not sure how much more time I should spend digging into this to figure out what is going on.

quarl commented Mar 5, 2021

I took a look just now and have some good news.

Adding a call to VersionedHDF5File.close() before the h5py.File.close() makes the problem go away:

    with h5py.File('foo.h5', 'r') as f:
        vf = VersionedHDF5File(f)
        for version in versions:
            vf[version]['bar'][:]
        vf.close()                       # <-- adding this line makes the problem go away

Similarly, instead of calling vf.close(), using del vf or vf._versions = vf._version_data = None also makes the problem go away.

Notes:

  • I think making sure to close the vf before the f should be a totally fine workaround for us. Arvid, can you confirm this makes the problem go away in D. E. Shaw code?
  • We should use it in a context manager, e.g. with closing(VersionedHDF5File(f)) as vf (or a try...finally).
  • Aaron, can we make VersionedHDF5File support __enter__ and __exit__ (calling self.close())? Then as clients we can just directly use with VersionedHDF5File(f) as vf.
  • It's less urgent to fix the underlying bug, now that we have this good workaround, though I would still keep this issue open.
  • This information may help with getting to the bottom of the bug. A necessary condition to reproduce the bug seems to be that there needs to be a python reference to the h5py.Group object at the time h5py.File.close() is called.
  • Possibly making VersionedHDF5File._versions and VersionedHDF5File._version_data be weakrefs could help (I know you changed some things to/from weakrefs, not sure if these were involved)
  • I noticed that if you do something like vf[version] after f.close(), it raises KeyError as if the version doesn't exist. Can we make operations like this raise an exception that indicates the file was closed?
  • I noticed that if you do something like print(vf) after f.close(), that leads to slightly confusing exceptions. Can we make VersionedHDF5File.__repr__ return a string in this situation?
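
The `__enter__`/`__exit__` support suggested above could look roughly like the following. This is a hypothetical sketch, not the actual versioned_hdf5 implementation; the class name `VersionedHDF5FileSketch` and the stand-in attributes are illustrative only (the `_versions`/`_version_data` names are taken from the workaround mentioned above):

```python
class VersionedHDF5FileSketch:
    """Hypothetical sketch of context-manager support. The real
    VersionedHDF5File wraps an h5py.File and holds handles to
    file-resident groups."""

    def __init__(self, f):
        self.f = f
        # Stand-ins for the real h5py group handles:
        self._versions = object()
        self._version_data = object()

    def close(self):
        # Drop references to file-resident objects so nothing
        # outlives the underlying h5py.File.close().
        self._versions = None
        self._version_data = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
```

With something like this in place, clients could nest with VersionedHDF5File(f) as vf: inside the with h5py.File(...) block, guaranteeing close() runs before the file itself is closed.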

asmeurer commented Mar 5, 2021

Similarly, instead of calling vf.close(), using del vf or vf._versions = vf._version_data = None also makes the problem go away.

I can still reproduce the issue with all of those things. Here is the script I'm using:

import h5py
from versioned_hdf5 import VersionedHDF5File

N = 1000
def run():
    with h5py.File('test.hdf5', 'w') as f:
        file = VersionedHDF5File(f)
        for i in range(N):
            print(f'{i}/{N}')
            with file.stage_version(str(i), '') as g:
                g.create_dataset('a/' + str(i), data=list(range(100000)))

if __name__ == '__main__':
    run()

Note that you have to be a little careful when playing with this, because a seemingly innocuous change can change how the garbage collector runs and cause the issue to go away for a specific example. I am able to consistently reproduce the issue by creating a large number of objects (here by making a lot of versions).

I also use this script, which is the script from above run in a loop. It usually takes a few iterations before it fails (up to 36 iterations in some cases that I've tested).

from versioned_hdf5 import VersionedHDF5File
import numpy as np
import h5py

def run():
    for i in range(50):
        print(i)
        with h5py.File('foo.h5', 'w') as f:
            vf = VersionedHDF5File(f)
            with vf.stage_version('0') as sv:
                sv.create_dataset('bar', data=np.zeros(0, dtype='double'))

        for i in range(1, 100):
            with h5py.File('foo.h5', 'r+') as f:
                vf = VersionedHDF5File(f)
                with vf.stage_version(str(i)) as sv:
                    sv['bar'].resize((i,))
                    sv['bar'][i-1] = i

        with h5py.File('foo.h5', 'r') as f:
            versions = list(f['_version_data/versions'].keys())

        with h5py.File('foo.h5', 'r') as f:
            vf = VersionedHDF5File(f)
            for version in versions:
                if version != '__first_version__':
                    cv = vf[version]
                    cv['bar'][:]

if __name__ == '__main__':
    run()

I didn't mention it, but completely disabling the garbage collector also makes the problem go away (obviously not a viable workaround, but it does show that garbage collection is the source of the problem).

I agree with your notes that things should behave better once the file is closed. I've tried to implement proper behavior for version groups that are already committed, but I haven't paid as close attention to what happens once the file itself is closed.

asmeurer commented Mar 6, 2021

Here is some technical context for this issue. The issue occurs when the h5py File object is closed. The way HDF5 works is that every object has an ID. IDs are integers, and work kind of like file descriptors, but apply to all object types (files, groups, datasets, various metadata objects). Every kind of object can be "closed", not just the top-level file object. Object IDs can be reused once they are closed. I'm not sure yet if this fact is relevant to this issue. If it is, then this comment in the h5py source is relevant.

In pure HDF5, closing a file does not close the objects it contains. So h5py works around this by manually closing every open object when a file is closed.

This is what is happening in the h5py code in the traceback:

/usr/local/python/python-3.7/std/lib64/python3.7/site-packages/h5py/_hl/files.py in close(self)
    429                 # Close file-resident objects first, then the files.
    430                 # Otherwise we get errors in MPI mode.
--> 431                 id_list = h5f.get_obj_ids(self.id, ~h5f.OBJ_FILE)
    432                 file_list = h5f.get_obj_ids(self.id, h5f.OBJ_FILE)
    433

It is trying to get a list of every open object ID corresponding to the current file. This fails in Cython code that wraps the HDF5 API. One HDF5 function returns some set of supposedly open object ids, but another function, called right below that, fails on one of those IDs (as if it were actually closed). The only other place that object IDs are closed is in the object deallocation code (i.e., during garbage collection).

This is as far as I've gotten with figuring this out. It smells like an HDF5 issue, because one HDF5 API function gives back an object ID that another rejects (specifically, one function returns an ID which causes the other to return nonzero). The h5py code even has a locking mechanism to prevent multiple threads from causing race conditions here (not that I have any reason to suspect our code is using threads).

Disabling the garbage collector before closing the file seems to fix the issue, suggesting that somehow garbage collection creates some inconsistency in the HDF5/h5py object ids.
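
A reusable form of that workaround would be a small guard that disables the collector around the close path and restores its previous state. This is a sketch, not part of h5py, and `gc_disabled` is a hypothetical name; note it re-enables the collector only if it was enabled on entry, so nested uses behave correctly:

```python
import gc
from contextlib import contextmanager

@contextmanager
def gc_disabled():
    """Temporarily disable automatic garbage collection.

    Restores the collector's previous state on exit, so wrapping
    code that already runs with gc disabled does not re-enable it.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()
```

Note that gc.disable() only suppresses automatic collection passes; explicit gc.collect() calls still work inside the guarded block.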

ArvidJB commented Mar 6, 2021

The garbage collector has some debug output which can be enabled with gc.set_debug. It might be interesting to see what exactly is going on, maybe the logging will reveal which objects are causing this issue?

As a short-term fix for us I could create a context manager wrapping VersionedHDF5File which just calls gc.collect() when it exits. I will try that and confirm whether it fixes all of our issues.
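
Such a wrapper could be just a few lines. This is a sketch under the assumption that one full collection pass on exit is sufficient; `collect_on_exit` is a hypothetical name:

```python
import gc
from contextlib import contextmanager

@contextmanager
def collect_on_exit(vf):
    """Yield the wrapped object, then force a full GC pass so that
    any cyclic references to file-resident h5py objects are freed
    before h5py.File.close() enumerates open object IDs."""
    try:
        yield vf
    finally:
        gc.collect()
```

Usage would nest it inside the file context, e.g. with h5py.File('foo.h5', 'r') as f: followed by with collect_on_exit(VersionedHDF5File(f)) as vf:, so the collection happens before f is closed.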

quarl commented Mar 6, 2021

Confirmed that with Aaron's N=1000 reproducer, calling VersionedHDF5File.close() doesn't make the symptom go away, but gc.collect() does.

How about we add a call to gc.collect() inside (at the end of) VersionedHDF5File.close()? I think it's more reasonable to require that clients of vf = VersionedHDF5File(f) must call vf.close() before f.close() than to require that clients must call gc.collect().

FTR, here is Reproducer1B, based on Arvid's original reproducer. Here, VersionedHDF5File.close() makes the problem go away even without a gc.collect(). I was using this since it runs faster, though I guess different reproducers exhibit symptoms differently.


import gc
import h5py
import numpy as np
import os
from   versioned_hdf5           import VersionedHDF5File

if not os.path.exists('foo.h5'):
    with h5py.File('foo.h5', 'w') as f:
        vf = VersionedHDF5File(f)
        with vf.stage_version('0') as sv:
            sv.create_dataset('bar', data=np.zeros(1, dtype='double'))

    for i in range(1, 100):
        with h5py.File('foo.h5', 'r+') as f:
            vf = VersionedHDF5File(f)
            with vf.stage_version(str(i)) as sv:
                sv['bar'][0] = i


# with h5py.File('foo.h5', 'r') as f:
#     versions = list(f['_version_data/versions'].keys())
# versions.remove("__first_version__")
# versions.sort(key=int)

versions = list(map(str, range(100)))


for i in range(100):
    print("i =",i)
    with h5py.File('foo.h5', 'r') as f:
        vf = VersionedHDF5File(f)
        for version in versions[:100]:
            vf[version]['bar'][:]
        # vf.close()

quarl commented Mar 6, 2021

For debugging, I added a printout of the result of get_obj_ids() before and after a gc.collect() following a bunch of VersionedHDF5File activity.
I see a lot of h5py.h5d.DatasetID and h5py.h5g.GroupID references before the gc.collect(), which are almost all gone after it (just 3 remaining, probably f, f['_version_data'], and f['_version_data/versions']).

Hypotheses:

  • HypothesisA. In the VersionedHDF5 use case, there are some Python references to h5py objects that Python wasn't able to destruct automatically at the time of a regular DECREF; this could be due to cyclic references or weakrefs or whatnot; these references are therefore only cleaned up at the next gc.collect().
  • HypothesisB. If a gc.collect() happens in the middle of h5py.File.close, then some of these references get cleaned up at such a time that the h5py cleanup logic breaks: it holds a reference to something it thinks still needs to be cleaned up, but that object no longer exists.

If this hypothesis is correct, it does indeed seem like a bug that h5py.File.close doesn't account for the possibility that objects can disappear mid-function due to gc getting triggered. In that case h5py.File.close ought to be fixed, either by modifying the "get all outstanding objects, then clean up all outstanding objects" logic, or by simply adding a gc.collect() at the top of h5py.File.close and/or disabling gc temporarily. I.e., Aaron's "workaround" patch above may in fact be a real fix worth getting merged into h5py.

quarl commented Mar 6, 2021

Here's Reproducer4:


import h5py
import versioned_hdf5.wrappers

versions = list(map(str, range(100)))

for i in range(100):
    print("i =",i)
    with h5py.File('foo.h5', 'r') as f:
        refs = []
        for version in versions:
            refs.append(versioned_hdf5.wrappers.InMemoryGroup(f["_version_data/versions/%s" % version].id))
        refs.append(refs)
        del refs

The del refs is necessary to make it reproduce. That's consistent with the hypothesis above.
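
The role of the cycle (refs.append(refs)) can be demonstrated with plain Python objects, independent of h5py: an object caught in a reference cycle is never freed by reference counting alone, only by a pass of the cycle collector, which is exactly what delays the cleanup of the underlying h5py IDs. A minimal sketch:

```python
import gc
import weakref

class Node:
    pass

def make_cycle():
    n = Node()
    n.self_ref = n          # reference cycle: the refcount never drops to zero
    return weakref.ref(n)   # weak reference lets us observe the object's lifetime

gc.disable()                # suppress automatic collection for the demonstration
try:
    ref = make_cycle()
    # The local variable is gone, but the cycle keeps the object alive:
    assert ref() is not None
    gc.collect()            # only a collector pass (explicit or automatic) frees it
    assert ref() is None
finally:
    gc.enable()
```

In the reproducer, the h5py ID objects sit in exactly this situation, so their deallocation is deferred until a collection pass, which may land in the middle of h5py.File.close.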

quarl commented Mar 6, 2021

Reproducer5. I managed to create a totally self-contained reproducer that doesn't use versioned-hdf5 at all.
I think this reproducer ought to be sufficient to take to the h5py folks and ask them to merge a change to add a gc.collect() and/or gc.disable()/gc.enable() to h5py.File.close.

import h5py

N = 100
with h5py.File('foo.h5', 'w') as f:
    for i in range(N):
        f[str(i)] = []

for i in range(100):
    print("i =",i)
    with h5py.File('foo.h5', 'r') as f:
        refs = []
        for i in range(N):
            refs.append(f[str(i)].id)
        refs.append(refs)
        del refs  # <-- important to make it likely that gc of id refs happens during h5py.File.close

@ericdatakelly ericdatakelly modified the milestones: March 2021, April 2021 Mar 8, 2021

asmeurer commented Mar 8, 2021

Nice. I was actually thinking the same thing: we could make a reproducer independent of versioned-hdf5 by forcing some h5py objects into cyclic references so that they have to be cleaned up by the garbage collector. I guess most use cases of h5py don't actually do this, so the garbage collector doesn't usually come into play. I intended to do something similar, but you beat me to it. Your reproducer still works against h5py master.

Here is the upstream h5py issue h5py/h5py#1852. I will work on some of the points from #162 (comment).

@ericdatakelly ericdatakelly modified the milestones: April 2021, May 2021 Apr 5, 2021

ericdatakelly commented

Can this be closed? I think the upstream issue is resolved.

ArvidJB commented Apr 9, 2021

Yes, this can be closed. We have patched our h5py with the upstream patch.

@ArvidJB ArvidJB closed this as completed Apr 9, 2021
asmeurer added a commit to asmeurer/versioned-hdf5 that referenced this issue Apr 15, 2021
This is the issue from deshaw#162.
h5py has a fix, but we won't be able to use it in 2.10.0 unless we manually
backport the patch and custom build h5py. The bug doesn't affect the actual
tests, it just causes them to crash sometimes when the file is closed.
asmeurer added a commit to asmeurer/versioned-hdf5 that referenced this issue Apr 20, 2021
asmeurer added a commit to asmeurer/versioned-hdf5 that referenced this issue Apr 21, 2021
… is closed

This addresses some comments that were brought up in issue deshaw#162.
@ericdatakelly ericdatakelly modified the milestones: May 2021, April 2021 Apr 26, 2021