Iterating over versions leads to id error #162
Looks like it's the same error from #125 (comment) and #148 again. Indeed, the h5py commit 773680edcfd868a32a2b047ad654fb3d28da92b2 that "fixed" the problem on the other issue seems to fix this as well. But the problem reappears in the latest h5py (see the discussions on #148). It might be time for me to dig into this further. I'm still not sure if this is a bug in h5py or an issue with something we are doing.

---
If you could get to the bottom of this, it would be great.

---
This lets us avoid using a weakref dictionary to have a global reference. Unfortunately, this does not appear to fix the garbage collection bug with HDF5 (deshaw#162). That seems to indicate that the issue is not directly related to weakrefs. Either way, this does seem like a better design for the registry of groups.
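As a rough illustration of that design choice (all names here are hypothetical sketches, not the actual versioned-hdf5 API): a global `WeakValueDictionary` loses entries whenever the garbage collector reclaims a group, whereas a per-file registry holds ordinary strong references for the file's lifetime.

```python
import weakref

# Before (sketch): a module-level weakref registry. Entries silently
# disappear once the garbage collector reclaims a group object.
_global_groups = weakref.WeakValueDictionary()

# After (sketch): each file owns a plain dict of its groups, so the
# references are strong and independent of gc timing.
class GroupRegistry:
    def __init__(self):
        self._groups = {}

    def register(self, name, group):
        self._groups[name] = group

    def get(self, name):
        return self._groups.get(name)
```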
Calling `gc.collect()` right before the file is closed seems to avoid the error:

```python
import gc
import h5py
from versioned_hdf5 import VersionedHDF5File

with h5py.File('foo.h5', 'r') as f:
    vf = VersionedHDF5File(f)
    # 'versions' is the list of version names, as read from
    # f['_version_data/versions'] in the reproducer further down.
    for version in versions:
        if version != '__first_version__':
            cv = vf[version]
            cv['bar'][:]
    gc.collect()
```

Some more testing is needed to see if this is a fluke, but so far, it makes the error go away in every reproducer I know of. I'm still working out what this means in terms of the actual source of the error, but this may be useful as a workaround.

---
It looks like disabling the garbage collector inside of `File.close` also makes the error go away:

```diff
diff --git a/h5py/_hl/files.py b/h5py/_hl/files.py
index 73ce5a75..3f605204 100644
--- a/h5py/_hl/files.py
+++ b/h5py/_hl/files.py
@@ -428,22 +428,27 @@ class File(Group):
                 # Close file-resident objects first, then the files.
                 # Otherwise we get errors in MPI mode.
-                id_list = h5f.get_obj_ids(self.id, ~h5f.OBJ_FILE)
-                file_list = h5f.get_obj_ids(self.id, h5f.OBJ_FILE)
-
-                id_list = [x for x in id_list if h5i.get_file_id(x).id == self.id.id]
-                file_list = [x for x in file_list if h5i.get_file_id(x).id == self.id.id]
-
-                for id_ in id_list:
-                    while id_.valid:
-                        h5i.dec_ref(id_)
-
-                for id_ in file_list:
-                    while id_.valid:
-                        h5i.dec_ref(id_)
-
-                self.id.close()
-                _objects.nonlocal_close()
+                import gc
+                gc.disable()
+                try:
+                    id_list = h5f.get_obj_ids(self.id, ~h5f.OBJ_FILE)
+                    file_list = h5f.get_obj_ids(self.id, h5f.OBJ_FILE)
+
+                    id_list = [x for x in id_list if h5i.get_file_id(x).id == self.id.id]
+                    file_list = [x for x in file_list if h5i.get_file_id(x).id == self.id.id]
+
+                    for id_ in id_list:
+                        while id_.valid:
+                            h5i.dec_ref(id_)
+
+                    for id_ in file_list:
+                        while id_.valid:
+                            h5i.dec_ref(id_)
+
+                    self.id.close()
+                    _objects.nonlocal_close()
+                finally:
+                    gc.enable()
 
     def flush(self):
         """ Tell the HDF5 library to flush its buffers.
```

Are either of these workarounds tenable? I don't think they really get to the source of the issue, but I'm not sure how much more time I should spend digging into this to figure out what is going on.

---
I took a look just now and have some good news. Adding a call to `gc.collect()` before the file is closed makes the error go away here as well. Similarly, instead of calling `vf.close()`, using … also seems to help. Notes: …

---
I can still reproduce the issue with all those things. Here is the script I'm using:

```python
import h5py
from versioned_hdf5 import VersionedHDF5File

N = 1000

def run():
    with h5py.File('test.hdf5', 'w') as f:
        file = VersionedHDF5File(f)
        for i in range(N):
            print(f'{i}/{N}')
            with file.stage_version(str(i), '') as g:
                g.create_dataset('a/' + str(i), data=list(range(100000)))

if __name__ == '__main__':
    run()
```

Note that you have to be a little careful when playing with this, because a seemingly innocuous change can change how the garbage collector runs and cause the issue to go away for a specific example. I am able to consistently reproduce the issue by creating a large number of objects (here by making a lot of versions). I also use the following script, which is the reproducer from above but run in a loop. It usually takes a few iterations before it fails (up to 36 iterations in some cases that I've tested):

```python
import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

def run():
    for n in range(50):
        print(n)
        with h5py.File('foo.h5', 'w') as f:
            vf = VersionedHDF5File(f)
            with vf.stage_version('0') as sv:
                sv.create_dataset('bar', data=np.zeros(0, dtype='double'))
        for i in range(1, 100):
            with h5py.File('foo.h5', 'r+') as f:
                vf = VersionedHDF5File(f)
                with vf.stage_version(str(i)) as sv:
                    sv['bar'].resize((i,))
                    sv['bar'][i-1] = i
        with h5py.File('foo.h5', 'r') as f:
            versions = list(f['_version_data/versions'].keys())
        with h5py.File('foo.h5', 'r') as f:
            vf = VersionedHDF5File(f)
            for version in versions:
                if version != '__first_version__':
                    cv = vf[version]
                    cv['bar'][:]

if __name__ == '__main__':
    run()
```

I didn't mention it, but completely disabling the garbage collector also makes the problem go away (obviously that's not a viable workaround, but it does confirm that the garbage collector is the source of the problem). I agree with your notes that things should behave better once the file is closed. I've tried to implement proper behavior for this for version groups that are already committed, but I haven't paid as close attention to what happens once the file itself is closed.

---
Here is some technical context for this issue. The issue occurs when the h5py File object is closed. The way HDF5 works is that every object has an ID. IDs are integers that work kind of like file descriptors, but they apply to all object types (files, groups, datasets, various metadata objects). Every kind of object can be "closed", not just the top-level file object. Object IDs can be reused once they are closed. I'm not sure yet if this fact is relevant to this issue; if it is, then this comment in the h5py source is relevant. In pure HDF5, closing a file does not close the objects contained in it, so h5py works around this by manually closing every open object when a file is closed. This is what is happening in the h5py code in the traceback:

It is trying to get a list of every open object ID corresponding to the current file. This fails in Cython code that wraps the HDF5 API. One HDF5 function returns a set of supposedly open object IDs, but another function, called right below it, fails on one of those IDs (as if it were actually closed). The only other place that object IDs are closed is in the object deallocation code (i.e., during garbage collection). This is as far as I've gotten with figuring this out. It smells like an HDF5 issue, because one HDF5 API function gives an object ID that is rejected by another (specifically, this function returns an ID which causes this function to return nonzero). The h5py code even has a locking mechanism to prevent multiple threads from causing race conditions here (not that I have reason to suspect our code is using threads). Disabling the garbage collector before closing the file seems to fix the issue, suggesting that garbage collection somehow creates an inconsistency in the HDF5/h5py object IDs.

---
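For concreteness, here is a small sketch (not from the issue itself) of the low-level ID machinery being described, using h5py's `h5f`/`h5i` wrappers; the file name is hypothetical:

```python
import h5py
from h5py import h5f, h5i

with h5py.File('demo.h5', 'w') as f:
    dset = f.create_dataset('x', data=[1, 2, 3])  # keep a handle open

    # Ask HDF5 for every open non-file object ID belonging to this file,
    # mirroring what h5py's File.close() does before decrementing refcounts.
    for oid in h5f.get_obj_ids(f.id, ~h5f.OBJ_FILE):
        # An ID that get_obj_ids reports as open but that other API calls
        # reject is exactly the failure mode described in this issue.
        print(oid, h5i.get_type(oid), oid.valid)
```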
The garbage collector has some debug output which can be enabled with `gc.set_debug()`. As a short-term fix for us, I could create a context manager wrapping around `h5py.File` that disables the garbage collector until the file is closed.

---
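A minimal sketch of what that short-term fix could look like (the name `gc_disabled` and the overall shape are hypothetical, not versioned-hdf5 code):

```python
import gc
from contextlib import contextmanager

@contextmanager
def gc_disabled():
    # Hypothetical helper: keep the cyclic garbage collector from running
    # while the h5py File is open and, crucially, while it is being closed.
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

# Usage sketch: the file is opened and closed entirely inside the guard,
# so gc cannot fire in the middle of File.close().
# with gc_disabled():
#     with h5py.File('foo.h5', 'r') as f:
#         ...
```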
Confirmed that with Aaron's N=1000 reproducer, calling `gc.collect()` before the file is closed makes the error go away. FTR, here is Reproducer1B, based on Arvid's original reproducer. Here, …

---
For debugging, I added a printout of the result of `get_obj_ids()` before and after a `gc.collect()`, following a bunch of VersionedHDF5File activity. Hypothesis: the garbage collector can run in the middle of `h5py.File.close`, closing some of the object IDs that `get_obj_ids()` just returned.
If this hypothesis is correct, it does seem like a bug that `h5py.File.close` doesn't take into account the possibility that objects can disappear mid-function because gc gets triggered. In that case `h5py.File.close` ought to be fixed, either by modifying the "get all outstanding objects, then clean up all outstanding objects" logic, or by simply adding a `gc.disable()`/`gc.enable()` guard around it.

---
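A sketch of the kind of debug printout described above (hypothetical, not the actual debugging code from this comment):

```python
import gc
import h5py
from h5py import h5f

def dump_open_ids(f, label):
    # Print how many object IDs HDF5 currently considers open for this
    # file; a drop across gc.collect() means the collector closed some.
    ids = h5f.get_obj_ids(f.id, h5f.OBJ_ALL)
    print(f'{label}: {len(ids)} open ids')

with h5py.File('debug.h5', 'w') as f:
    for i in range(100):
        f.create_dataset(f'd{i}', data=range(10))
    dump_open_ids(f, 'before gc.collect()')
    gc.collect()
    dump_open_ids(f, 'after gc.collect()')
```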
Here's Reproducer4:

…

The …

---
Reproducer5: I managed to create a totally self-contained reproducer that doesn't use versioned-hdf5 at all.

…

---
Nice. I was actually thinking the same thing: we could make a reproducer independent of versioned-hdf5 by forcing some h5py objects into cyclic references so that they have to be cleaned up by the garbage collector. I guess most use cases of h5py don't actually do this, so the garbage collector doesn't normally come into play. I intended to do something similar, but you beat me to it. Your reproducer still works in h5py master. Here is the upstream h5py issue: h5py/h5py#1852. I will work on some of the points from #162 (comment).

---
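Since Reproducer5 itself wasn't preserved above, here is a hypothetical sketch of the mechanism being described: h5py objects caught in reference cycles, so that only the cyclic garbage collector can reclaim them. The `Holder` class and file name are illustrative assumptions, not the original code.

```python
import gc
import h5py
import numpy as np

class Holder:
    # Hypothetical helper: holds an h5py object inside a reference cycle,
    # so the object can only be reclaimed by the cyclic garbage collector.
    def __init__(self, obj):
        self.obj = obj
        self.me = self  # the cycle

def run():
    for _ in range(100):
        with h5py.File('cycle_test.h5', 'w') as f:
            for j in range(100):
                Holder(f.create_dataset(f'data/{j}', data=np.arange(1000)))
        # File.close() runs at the end of the with block; if gc happens to
        # fire while close() walks the open-ID list, the stale-ID error
        # from this issue can surface.

if __name__ == '__main__':
    run()
```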
Can this be closed? I think the upstream issue is resolved.

---
Yes, this can be closed. We have patched our h5py with the upstream patch.

---
This is the issue from deshaw#162. h5py has a fix, but we won't be able to use it in 2.10.0 unless we manually backport the patch and build a custom h5py. The bug doesn't affect the actual tests; it just causes them to crash sometimes when the file is closed.
… is closed

This addresses some comments that were brought up in issue deshaw#162.
Not sure what's going on here. Is this a reference counting problem, perhaps? It doesn't fail every time I run it, but it fails about 90% of the time.