We observed that high-dimensional datasets are much slower to read when they are virtual (versioned) datasets:
```
In [12]: shape = (19, 36, 26, 1)

In [14]: a = np.random.rand(*shape)
    ...: with TempDirCtx() as d:
    ...:     with h5py.File(d / 'foo.h5', 'w') as f:
    ...:         vf = VersionedHDF5File(f)
    ...:         with vf.stage_version('v0') as sv:
    ...:             sv.create_dataset('bar', data=a, chunks=a.shape)
    ...:     with h5py.File(d / 'foo.h5', 'r') as f:
    ...:         vf = VersionedHDF5File(f)
    ...:         cv = vf[vf.current_version]
    ...:         bar = cv['bar']
    ...:         %time _ = [bar[:] for _ in range(1000)]
    ...:
CPU times: user 2.95 s, sys: 61.8 ms, total: 3.01 s
Wall time: 3.01 s
```
```
In [15]: a = np.random.rand(*shape)
    ...: with TempDirCtx() as d:
    ...:     with h5py.File(d / 'foo.h5', 'w') as f:
    ...:         f.create_dataset('bar', data=a, chunks=a.shape)
    ...:     with h5py.File(d / 'foo.h5', 'r') as f:
    ...:         bar = f['bar']
    ...:         %time _ = [bar[:] for _ in range(1000)]
    ...:
CPU times: user 37.3 ms, sys: 60.2 ms, total: 97.5 ms
Wall time: 97.2 ms
```
A little bit of profiling points to `H5S__hyper_project_intersection` being an expensive function.
Is it possible to speed up this function?
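For what it's worth, the overhead may be reproducible with plain h5py virtual datasets, with no versioned-hdf5 involved at all. Below is a minimal sketch along those lines; the file names (`raw.h5`, `virt.h5`) and the 100-iteration read loop are illustrative, and it assumes the cost lives in HDF5's virtual-dataset read path rather than in versioned-hdf5 itself:

```python
import tempfile
import time
from pathlib import Path

import h5py
import numpy as np

shape = (19, 36, 26, 1)
a = np.random.rand(*shape)

with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)

    # Raw data file that the virtual dataset will map onto.
    with h5py.File(d / 'raw.h5', 'w') as f:
        f.create_dataset('bar', data=a, chunks=shape)

    # Virtual dataset over the raw data -- roughly the kind of mapping
    # versioned-hdf5 builds for each staged version.
    layout = h5py.VirtualLayout(shape=shape, dtype=a.dtype)
    layout[...] = h5py.VirtualSource(str(d / 'raw.h5'), 'bar', shape=shape)
    with h5py.File(d / 'virt.h5', 'w') as f:
        f.create_virtual_dataset('bar', layout)

    # Repeated full reads; each read goes through the virtual-dataset
    # selection machinery, including H5S__hyper_project_intersection.
    with h5py.File(d / 'virt.h5', 'r') as f:
        bar = f['bar']
        t0 = time.perf_counter()
        out = [bar[:] for _ in range(100)]
        print(f'100 virtual reads: {time.perf_counter() - t0:.3f} s')

# Sanity check: the virtual dataset round-trips the original data.
assert np.array_equal(out[0], a)
```

If this sketch shows the same gap versus a non-virtual dataset, it would point at upstream HDF5 rather than the versioned-hdf5 layer.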