
[WIP] Enh. DataChunkIterator, support: 1) h5 dataset, 2) skipping of blocks, 3) arbit. iter dim #132

Merged — 17 commits merged into dev from enh/dci4h5 on Aug 29, 2019

Conversation

@oruebel (Contributor) commented on Aug 7, 2019

Motivation

  • When wrapping an h5py.Dataset object, DataChunkIterator currently uses the dataset's own iterator to read the data one slice at a time. When slices are buffered into a chunk, this is inefficient: we can instead read the whole next block in a single read rather than iteratively. This significantly reduces the number of I/O operations and accordingly improves performance.
  • It is currently not possible to skip empty blocks in DataChunkIterator. This PR allows iterators to return None for empty chunks; DataChunkIterator will then ignore those chunks and fast-forward to the next one. Alternatively, if an empty chunk is encountered while buffering, it will stop the chunk and return the data it has accumulated so far. Skipping chunks is useful for sparse data, as it avoids allocating empty data blocks.
  • DataChunkIterator currently only supports iteration over the first dimension of an nD array. This PR enhances DataChunkIterator to support iteration over an arbitrary, user-defined dimension.
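The three behaviors above can be sketched conceptually in a few lines. This is a hypothetical illustration, not the actual hdmf implementation: `iter_blocks`, its parameter names (`buffer_size`, `iter_axis`), and the all-zero test for "empty" chunks are assumptions made for the example.

```python
# Hypothetical sketch of the behaviors described above: reading whole blocks
# in a single read, skipping empty blocks, and iterating along an arbitrary
# axis. Not the real hdmf code; names and semantics are illustrative only.
import numpy as np

def iter_blocks(data, buffer_size=2, iter_axis=0):
    """Yield (selection, block) pairs of up to `buffer_size` slices along
    `iter_axis`, skipping blocks that are entirely zero (a stand-in for
    the sparse/empty chunks an iterator might signal with None)."""
    n = data.shape[iter_axis]
    for start in range(0, n, buffer_size):
        stop = min(start + buffer_size, n)
        sel = [slice(None)] * data.ndim
        sel[iter_axis] = slice(start, stop)
        block = data[tuple(sel)]   # one read for the whole block
        if not block.any():        # skip empty blocks (sparse data)
            continue
        yield tuple(sel), block

# Toy sparse array: rows 0 and 2 are empty and get skipped.
data = np.zeros((4, 6))
data[1] = 1.0
data[3] = 2.0
blocks = list(iter_blocks(data, buffer_size=1, iter_axis=0))
```

Setting `iter_axis=1` in the same sketch would walk the array column-block by column-block instead, which is the arbitrary-dimension behavior this PR adds.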

Status: Tests still need to be added to confirm that this works.

How to test the behavior?

Show here how to reproduce the new behavior (can be a bug fix or a new feature)

Checklist

  • Have you checked our Contributing document?
  • Have you ensured the PR description clearly describes problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked by running flake8 from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using the #XXX notation, where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close the corresponding issue.

codecov bot commented on Aug 7, 2019

Codecov Report

Merging #132 into dev will increase coverage by 0.31%.
The diff coverage is 98.63%.


@@            Coverage Diff             @@
##              dev     #132      +/-   ##
==========================================
+ Coverage   68.74%   69.06%   +0.31%     
==========================================
  Files          24       24              
  Lines        4918     4968      +50     
  Branches     1126     1137      +11     
==========================================
+ Hits         3381     3431      +50     
+ Misses       1172     1170       -2     
- Partials      365      367       +2
Impacted Files Coverage Δ
src/hdmf/backends/hdf5/h5tools.py 63.6% <ø> (+0.15%) ⬆️
src/hdmf/data_utils.py 83.94% <98.63%> (+2.82%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e13fc0b...ce1d250.

@rly (Contributor) commented on Aug 7, 2019

If the first value returned by the iterator is None, then this error is generated:

Traceback (most recent call last):
  File "c:\users\ryan\documents\nwb\hdmf\src\hdmf\backends\hdf5\h5tools.py", line 875, in __chunked_iter_fill__
    dset = parent.create_dataset(name, **io_settings)
  File "C:\Users\Ryan\Miniconda3\envs\pynwb-hdmf-dev\lib\site-packages\h5py-2.9.0-py3.7-win-amd64.egg\h5py\_hl\group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "C:\Users\Ryan\Miniconda3\envs\pynwb-hdmf-dev\lib\site-packages\h5py-2.9.0-py3.7-win-amd64.egg\h5py\_hl\dataset.py", line 89, in make_new_dset
    raise TypeError("One of data, shape or dtype must be specified")
TypeError: One of data, shape or dtype must be specified

If we can't get around this, it would be useful to raise a warning in the DataChunkIterator.
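One possible workaround for the failure above can be sketched as follows: fast-forward past any leading None chunks so that a real chunk is available for inferring dtype and shape, and warn if the iterator never produces data. This is a hypothetical sketch; `first_real_chunk` and its behavior are assumptions for illustration, not hdmf's actual fix.

```python
# Hypothetical guard for the failure mode above: if the iterator's first
# chunk is None, advance to the first non-None chunk before dtype/shape
# inference; warn if the iterator yields no data at all. Illustrative only.
import warnings

def first_real_chunk(chunk_iter):
    """Return the first non-None chunk from `chunk_iter`, or None with a
    warning if the iterator is exhausted without producing any data."""
    for chunk in chunk_iter:
        if chunk is not None:
            return chunk
    warnings.warn("Iterator yielded no data; cannot infer dtype or shape "
                  "for dataset creation.")
    return None

# Two leading None chunks are skipped; the first real chunk is returned.
chunks = iter([None, None, [1, 2, 3], None, [4, 5]])
first = first_real_chunk(chunks)
```

If even the first real chunk never arrives, the warning fires instead of the opaque h5py TypeError, which is the behavior the comment above asks for.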

@oruebel oruebel changed the title [WIP] Updated DataChunkIterator to read HDF5 datasets in blocks [WIP] Enh. DataChunkIterator, support: 1) h5 dataset, 2) skipping of blocks, 3) arbit. iter dim Aug 10, 2019
@rly (Contributor) commented on Aug 13, 2019

TODO before review:

  • write tests for reading h5 datasets
  • write tests for writing data with datachunkiterator on a different dimension

@oruebel could you take a look at my changes so far? It looks like everything works, but I messed with the internal logic of _read_next_chunk quite a bit.

@rly rly marked this pull request as ready for review August 20, 2019 00:03
@rly rly added this to the HDMF 1.2 milestone Aug 28, 2019
@rly (Contributor) commented on Aug 29, 2019

@oruebel this is finally ready for review

@rly rly merged commit a463d2a into dev Aug 29, 2019
@rly rly deleted the enh/dci4h5 branch August 29, 2019 23:04