
[WIP] Enh. DataChunkIterator, support: 1) h5 dataset, 2) skipping of blocks, 3) arbit. iter dim #132

Merged — 17 commits merged into dev from enh/dci4h5 on Aug 29, 2019

Conversation

@oruebel (Contributor) commented on Aug 7, 2019

Motivation

  • When wrapping an h5py.Dataset object, DataChunkIterator currently uses the dataset's own iterator to read the data one slice at a time. When slices are buffered into a chunk, this is inefficient: we can instead read the whole next block in a single read rather than iteratively. This significantly reduces the number of I/O operations and accordingly improves performance.
  • It is currently not possible to skip empty blocks in DataChunkIterator. This PR allows iterators to return None for empty chunks; DataChunkIterator will then ignore those chunks and fast-forward to the next one. Alternatively, if an empty chunk is encountered while buffering, it will stop the chunk and return the data it has accumulated so far. Skipping chunks is useful for sparse data, as it avoids allocating empty data blocks.
  • DataChunkIterator currently only supports iteration over the first dimension of an nD array. This PR enhances DataChunkIterator to support iteration over an arbitrary, user-defined dimension.
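The three behaviors above can be sketched conceptually in a few lines. This is a hypothetical illustration, not the actual hdmf implementation: `iter_blocks`, its parameter names (`buffer_size`, `iter_axis`), and the all-zero test for "empty" chunks are assumptions made for the example.

```python
# Hypothetical sketch of the behaviors described above: reading whole blocks
# in a single read, skipping empty blocks, and iterating along an arbitrary
# axis. Not the real hdmf code; names and semantics are illustrative only.
import numpy as np

def iter_blocks(data, buffer_size=2, iter_axis=0):
    """Yield (selection, block) pairs of up to `buffer_size` slices along
    `iter_axis`, skipping blocks that are entirely zero (a stand-in for
    the sparse/empty chunks an iterator might signal with None)."""
    n = data.shape[iter_axis]
    for start in range(0, n, buffer_size):
        stop = min(start + buffer_size, n)
        sel = [slice(None)] * data.ndim
        sel[iter_axis] = slice(start, stop)
        block = data[tuple(sel)]   # one read for the whole block
        if not block.any():        # skip empty blocks (sparse data)
            continue
        yield tuple(sel), block

# Toy sparse array: rows 0 and 2 are empty and get skipped.
data = np.zeros((4, 6))
data[1] = 1.0
data[3] = 2.0
blocks = list(iter_blocks(data, buffer_size=1, iter_axis=0))
```

Setting `iter_axis=1` in the same sketch would walk the array column-block by column-block instead, which is the arbitrary-dimension behavior this PR adds.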

Status: Tests still need to be added to confirm that this works.

How to test the behavior?

Show here how to reproduce the new behavior (can be a bug fix or a new feature)

Checklist

  • Have you checked our Contributing document?
  • Have you ensured the PR description clearly describes problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked by running flake8 from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using the #XXX notation, where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close the corresponding issue.

codecov bot commented on Aug 7, 2019

Codecov Report

Merging #132 into dev will increase coverage by 0.31%.
The diff coverage is 98.63%.


@@            Coverage Diff             @@
##              dev     #132      +/-   ##
==========================================
+ Coverage   68.74%   69.06%   +0.31%     
==========================================
  Files          24       24              
  Lines        4918     4968      +50     
  Branches     1126     1137      +11     
==========================================
+ Hits         3381     3431      +50     
+ Misses       1172     1170       -2     
- Partials      365      367       +2
Impacted Files Coverage Δ
src/hdmf/backends/hdf5/h5tools.py 63.6% <ø> (+0.15%) ⬆️
src/hdmf/data_utils.py 83.94% <98.63%> (+2.82%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e13fc0b...ce1d250.

@rly (Contributor) commented on Aug 7, 2019

If the first value returned by the iterator is None, then this error is generated:

Traceback (most recent call last):
  File "c:\users\ryan\documents\nwb\hdmf\src\hdmf\backends\hdf5\h5tools.py", line 875, in __chunked_iter_fill__
    dset = parent.create_dataset(name, **io_settings)
  File "C:\Users\Ryan\Miniconda3\envs\pynwb-hdmf-dev\lib\site-packages\h5py-2.9.0-py3.7-win-amd64.egg\h5py\_hl\group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "C:\Users\Ryan\Miniconda3\envs\pynwb-hdmf-dev\lib\site-packages\h5py-2.9.0-py3.7-win-amd64.egg\h5py\_hl\dataset.py", line 89, in make_new_dset
    raise TypeError("One of data, shape or dtype must be specified")
TypeError: One of data, shape or dtype must be specified

If we can't get around this, it would be useful to raise a warning in the DataChunkIterator.
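One possible workaround for the failure above can be sketched as follows: fast-forward past any leading None chunks so that a real chunk is available for inferring dtype and shape, and warn if the iterator never produces data. This is a hypothetical sketch; `first_real_chunk` and its behavior are assumptions for illustration, not hdmf's actual fix.

```python
# Hypothetical guard for the failure mode above: if the iterator's first
# chunk is None, advance to the first non-None chunk before dtype/shape
# inference; warn if the iterator yields no data at all. Illustrative only.
import warnings

def first_real_chunk(chunk_iter):
    """Return the first non-None chunk from `chunk_iter`, or None with a
    warning if the iterator is exhausted without producing any data."""
    for chunk in chunk_iter:
        if chunk is not None:
            return chunk
    warnings.warn("Iterator yielded no data; cannot infer dtype or shape "
                  "for dataset creation.")
    return None

# Two leading None chunks are skipped; the first real chunk is returned.
chunks = iter([None, None, [1, 2, 3], None, [4, 5]])
first = first_real_chunk(chunks)
```

If even the first real chunk never arrives, the warning fires instead of the opaque h5py TypeError, which is the behavior the comment above asks for.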

@oruebel oruebel changed the title [WIP] Updated DataChunkIterator to read HDF5 datasets in blocks [WIP] Enh. DataChunkIterator, support: 1) h5 dataset, 2) skipping of blocks, 3) arbit. iter dim Aug 10, 2019
@rly (Contributor) commented on Aug 13, 2019

TODO before review:

  • write tests for reading h5 datasets
  • write tests for writing data with datachunkiterator on a different dimension

@oruebel could you take a look at my changes so far? It looks like everything works, but I messed with the internal logic of _read_next_chunk quite a bit.

@rly rly marked this pull request as ready for review August 20, 2019 00:03
@rly rly added this to the HDMF 1.2 milestone Aug 28, 2019
@rly (Contributor) commented on Aug 29, 2019

@oruebel this is finally ready for review

@rly rly merged commit a463d2a into dev Aug 29, 2019
@rly rly deleted the enh/dci4h5 branch August 29, 2019 23:04