Should there be a chunk iterator for writing datasets using 'create_dataset'? #88

jbhatch opened this issue Jun 9, 2020 · 1 comment
jbhatch commented Jun 9, 2020

When writing an HDF5 file to HSDS with H5PYD, it appears that although chunks are created in the final output file, the initial write of the data is done in a contiguous manner. This would sometimes produce HTTP request errors when writing large (~GB-size) HDF5 files with H5PYD to HSDS, despite there being more than enough memory on each of the HSDS data nodes. Writing ~MB-size files was hit and miss, and ~KB-size files had no issues. The 3D datasets in the test files at each size were filled with random 3D numpy arrays.
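For reference, here is a minimal sketch of the kind of write that produced the errors; the domain path, endpoint, array shape, and chunk shape are illustrative placeholders, not values from the tests described above:

    import numpy as np
    import h5pyd

    # roughly 4 GB of float64 test data, analogous to the ~GB-size case above
    arr = np.random.rand(512, 1024, 1024)

    with h5pyd.File("/home/testuser/chunktest.h5", "w",
                    endpoint="http://hsds.example.org:5101") as f:
        # create_dataset initializes the dataset with a single contiguous write,
        # which is where the HTTP request errors showed up
        dset = f.create_dataset("data", data=arr, chunks=(64, 128, 128))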

In order to use the H5PYD ChunkIterator in create_dataset, the following fix is suggested:

Add the line below to the import statements in the group.py file in h5pyd/_hl:

from h5pyd._apps.chunkiter import ChunkIterator

In the group.py file under h5pyd/_hl, change lines 334-336 from this:

    if data is not None:
        self.log.info("initialize data")
        dset[...] = data

to this:

    if data is not None:
        self.log.info("initialize data")
        # dset[...] = data
        it = ChunkIterator(dset)
        for chunk in it:
            dset[chunk] = data[chunk]
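The same idea as a standalone helper, for clarity; the numpy.asarray call is an addition that is not part of the suggested patch, guarding against data being passed as a plain Python sequence that does not support slice-tuple indexing:

    import numpy as np
    from h5pyd._apps.chunkiter import ChunkIterator

    def init_dataset_chunked(dset, data):
        """Write data into dset one chunk-aligned selection at a time."""
        arr = np.asarray(data)              # slice-tuple indexing needs an ndarray
        for chunk in ChunkIterator(dset):
            # each selection covers a single chunk, so each request stays small
            dset[chunk] = arr[chunk]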

jreadey commented Jun 12, 2020

In the h5pyd dataset.py, that's a good solution for initializing the dataset.

There's a max request size limit (defaults to 100 MB), so the server will respond with a 413 error if you try to write more than that much data in one request. I don't know whether that explains the problems you had with writing larger datasets.
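As a back-of-the-envelope check against that default limit (the sizes follow the illustrative sketch earlier in the thread, not measured values from the report):

    import numpy as np

    arr = np.random.rand(512, 1024, 1024)      # float64
    print(arr.nbytes / (1024 ** 2))             # 4096.0 MB -- one request this big exceeds the limit

    chunk_bytes = 64 * 128 * 128 * arr.dtype.itemsize
    print(chunk_bytes / (1024 ** 2))            # 8.0 MB -- a chunk-sized request stays well under 100 MB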

I'd been planning to make changes that would paginate large writes - basically have the code for dset[...] = data send multiple requests to the server if the data is too large. Read operations are already supported this way. Your approach would be easier to implement since it only needs to deal with dataset initialization. Have you tried making this change yourself?
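Until pagination of writes lands, one possible client-side workaround is to create the dataset empty and write it in hyperslabs that stay under the request limit. This is only a sketch under that assumption; the slab size is chosen by hand and nothing here is an existing h5pyd feature:

    import numpy as np
    import h5pyd

    arr = np.random.rand(512, 1024, 1024)

    with h5pyd.File("/home/testuser/chunktest.h5", "w") as f:
        dset = f.create_dataset("data", shape=arr.shape, dtype=arr.dtype,
                                chunks=(64, 128, 128))
        slab = 8  # 8 x 1024 x 1024 float64 == 64 MB per request, under the 100 MB default
        for i in range(0, arr.shape[0], slab):
            dset[i:i + slab] = arr[i:i + slab]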
