Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Break build_data_dict out into separate python implementation #379

Closed

Conversation

peytondmurray
Copy link
Collaborator

@peytondmurray peytondmurray commented Oct 15, 2024

Fixes #356. This seems to resolve the issue for me. I'm going to rebase #377 onto this to see if benchmarks can complete. Will talk more seriously about merging some version of this if it seems to work.

@peytondmurray peytondmurray changed the title Break build_data_dict out into separate python implementation [WIP] Break build_data_dict out into separate python implementation Oct 15, 2024
@peytondmurray
Copy link
Collaborator Author

After some additional testing, it seems like this case gets resolved:

import h5py
from versioned_hdf5 import VersionedHDF5File
import numpy as np


d = './testdata.h5'

with h5py.File(d, mode="w") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("r0") as sv:
        sv.create_dataset('values', data=np.arange(100), chunks=(10,), maxshape=(None,))

with h5py.File(d, mode="r+") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("r1") as sv:
        values = sv['values']
        values.resize((110,))  # <-- Exception raised here

Here's the case that fails during the asv benchmarks. It also gets resolved:

import os
import h5py
import numpy as np
import tempfile
import shutil
from versioned_hdf5 import VersionedHDF5File, delete_versions

filename = 'delete_versions_bench.h5'
n = 10

if not os.path.exists(filename):
    with h5py.File(filename, 'w') as f:
        vf = VersionedHDF5File(f)
        with vf.stage_version('init') as sv:
            sv.create_dataset('values', shape=(0, 0), dtype='float', fillvalue=np.nan,
                              chunks=(22, 100), maxshape=(None, None), compression='lzf')

    # generate some test data with around 1000 versions
    v = 1
    with h5py.File(filename, 'r+') as f:
        vf = VersionedHDF5File(f)
        for d in range(22):
            with vf.stage_version(str(v)) as sv:
                values_ds = sv['values']
                values_ds.resize((values_ds.shape[0] + 1, values_ds.shape[1] + 5000))
                values_ds[-1, -5000] = np.random.rand()
                v += 1
            for c in range(n):
                with vf.stage_version(str(v)) as sv:
                    values_ds = sv['values']
                    idxs = np.random.choice(values_ds.shape[1], 50, replace=False)
                    values_ds[-1, idxs] = np.random.rand(50)
                    v += 1


tmp_name = tempfile.mktemp('.h5')
shutil.copy2(filename, tmp_name)
try:
    # want to keep only every 10th version
    versions_to_delete = []
    with h5py.File(tmp_name, 'r') as f:
        vf = VersionedHDF5File(f)
        versions = sorted([(v, vf._versions[v].attrs['timestamp']) for v in vf._versions], key=lambda t: t[1])
        for i, v in enumerate(versions):
            if i % 10 != 0:
                versions_to_delete.append(v[0])

    with h5py.File(tmp_name, 'r+') as f:
        delete_versions(f, versions_to_delete)
finally:
    os.remove(tmp_name)

os.remove(filename)

@peytondmurray
Copy link
Collaborator Author

Lint failure unrelated: pre-commit/action#210, apparently something to do with the latest ubuntu runner being changed to 24.04 (?).

Tested this on my fork, but since the previous version is also benchmarked, the failure will only disappear after the next commit: https://github.com/peytondmurray/versioned-hdf5/actions/runs/11340699115/job/31537569085

@crusaderky crusaderky mentioned this pull request Oct 15, 2024
@peytondmurray
Copy link
Collaborator Author

Closing as this is the result of not calling H5Sclose on the dataspace identifiers returned by H5Pget_virtual_vspace and H5Pget_virtual_srcspace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Slicetools platform-specific error could not get virtual filename
1 participant