Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

np.float64 vs float when compression is on. #140

Closed
r-xue opened this issue Jul 28, 2020 · 4 comments
Closed

np.float64 vs float when compression is on. #140

r-xue opened this issue Jul 28, 2020 · 4 comments
Labels
Milestone

Comments

@r-xue
Copy link
Contributor

r-xue commented Jul 28, 2020

This was back to the discussion at the bottom of #136.
I just found out this...

test={'a':0.,'b':np.ones((1024,1024))}
hkl.dump(test, 'dummy.hkl', mode='w',compression='gzip')

is okay

test={'a':np.float64(0),'b':np.ones((1024,1024))}
hkl.dump(test, 'dummy.hkl', mode='w',compression='gzip')

will fail.

Looks like loaders.load_builtins.create_scalar_dataset() has this line:

# Make sure 'compression' is not in kwargs
kwargs.pop('compression', None)

which I guess let it pass.
Should we do the same in loaders.load_numpy.create_np_scalar_dataset()? though I am not sure if there are side-effects on this.

@1313e Thank you for any insight.

@1313e
Copy link
Collaborator

1313e commented Jul 29, 2020

I knew there was something where I ignored it.
Although both implementations have their advantages and disadvantages, they should definitely be consistent.
I personally think we should remove that line in builtins, for the reason I gave in #136.

@telegraphic What's your opinion on this?

@r-xue r-xue changed the title np.float64 vs float when compression is one. np.float64 vs float when compression is on. Jul 29, 2020
@1313e 1313e added the bug label Jul 30, 2020
@telegraphic
Copy link
Owner

I vote to not pass any kwargs to create_dataset for numpy scalars, except track_times.

I agree it's a bit of a tough one. Compression is hugely useful and I think we should make every attempt to make it work for 'reasonable' cases. As mentioned, there are two straightforward solutions: we keep kwargs.pop, or don't pass kwargs.

Looking at the parameters for create_dataset:

Provide None to create an anonymous dataset, to be linked into the file later.
shape – Shape of new dataset (Tuple).
dtype – Data type for new dataset
data – Initialize dataset to this (NumPy array).
chunks – Chunk shape, or True to enable auto-chunking.
maxshape – Dataset will be resizable up to this shape (Tuple). Automatically enables chunking. Use None for the axes you want to be unlimited.
compression – Compression strategy. See Filter pipeline.
compression_opts – Parameters for compression filter.
scaleoffset – See Scale-Offset filter.
shuffle – Enable shuffle filter (T/F). See Shuffle filter.
fletcher32 – Enable Fletcher32 checksum (T/F). See Fletcher32 filter.
fillvalue – This value will be used when reading uninitialized parts of the dataset.
track_times – Enable dataset creation timestamps (T/F).
track_order – Track attribute creation order if True. Default is h5.get_config().track_order.
external – Store the dataset in one or more external, non-HDF5 files. This should be a list of tuples of (filename[, offset[, size]]), to store data from offset to offset + size in the specified file. The last file in the list may have size h5py.h5s.UNLIMITED to let it grow as needed.

A large number of these kwargs would cause TypeError: Scalar datasets don't support chunk/filter options for scalar datasets. Of these parameters, I think only track_times is supported/useful. So my vote is to only pass track_times, which is a more robust against other errant kwargs, and fewer lines to boot.

@hernot
Copy link
Contributor

hernot commented Aug 1, 2020

I also encountered the very same issue when working on pull-request #138 that compression can cause problems with scalars and strings. The latter up to now are stored as bytes with dtype='S<length>' for which compression is also not possible as for scalars.
This lead to several changes how code proposed by pull-request #138 handles scalars. For numerical scalars (builtin and numpy) and alike compression is temporarily deactivated. This is done decorating kwargs with no_compression generator.

Strings and byte strings are instead mapped to bytearray for which compression is accepted by h5py. The down of bytearray mapping is that they appear in hdf5 file as int8 or uint8 streams and not as readable char arrays. But given the fact that strings can be come quite long and may use plenty of space this should be a fair trade-of.

The no_compression decorating generator filters
compression, compression_opts, shuffle, fletcher_32, chunks, scaleoffset

# the kwargs are not kwargs to no_compression just
# a dictionary which needs the mentioned keys to be
# filtered
def no_compression(kwargs):
    return {
        key:value
        for key,value in kwargs.items()
        if key not in {"compression","shuffle","compression_opts","chunks","fletcher32","scaleoffset"}
    }

The follwowing would have to be added to set of options filtered by no_compression according to documentation pasted by @telegraphic :

maxshape as it implicitly sets chunks=True

That one is not listed by exception raised when compression is on and i also missed it form documentation.

I vote as opposit to you @telegraphic for filtering compression and chunking related keywords on scalars as proposed and convert strings and bytestrings to bytarray instead of disallowing comression of content completely.

Thus i do suggest to timely fix this disable comression globally through filtering in dump method and issuing a corresponding warning for now. A more sensitive solution would than used from 4.1.x versions on which might need some well thought design descision. The above sayed could serve as valuable input to design discussion.

hernot added a commit to hernot/hickle that referenced this issue Sep 1, 2020
Quick fixes issue telegraphic#140 by filtering any hdf5 kwarg related to
compression. A warning is issued that currently compression is not
supported.
@1313e 1313e added this to the h5py v3 milestone Dec 17, 2020
@hernot
Copy link
Contributor

hernot commented Dec 20, 2020

Ok i think i have a solution or better some approach for rerunning all tests (at least those necessary) passing them with compression activated. Requires that systematic unit tests which are introduced in pull request #138 along with implementation of HEP001 #135. Will be included in finalize and clean-up pull request being the last of the in total four planned pull request. The first of which (#138) is currently under review. The other two will provide implementation of H4EP002 (#139) and H4EP003 (#145). Any thing else which can or has to be address by final polishing will be in the finalize and clean-up.

Ah and yes my tests already identified more yet unnoticed spots which need special handling of compression keyword arguments. Locally I already ammended them ;-)

@1313e 1313e closed this as completed in 06d82a7 Jan 5, 2021
hernot added a commit to hernot/hickle that referenced this issue Jan 18, 2021
Compression keywords safety tests:
==================================

In issue telegraphic#140 it was reported that some loaders crash
when 'compression', 'chunks' and related h5py keyword arguments are
specified. By running pytest a second time and thereby specifying
the custom --enable-compression parameter all tests are rerun with

   kwargs={'compression':'gzip', 'compression_opts':6}

All compression sensitive tests especially all 'test_XX_*.py::*' unit test
functions must include the 'compression_kwargs' parameter in their
signature to receive the actual kwargs list to be passed to all 'create_fcn'
function defined by loader module. In case a test function misses to
pass on the 'compression_kwargs' as keyword arguments ('**kwargs') to
   'hickle.dump',
   'hickle._dump',
or any dump_method listed in 'class_register' table of loader module or
specified directly in a 'LoaderManager.register_class' an AssertionError
exception is thrown indicating the name of the test function, the line
in which the affected function is called any function which it calls.
Tests which either test compression related issues explicitly or do not
call any of the dump_functions may be marked accordingly using the
   'pytest.mark.no_compression'
marker to explicitly exclude test function from compression testing.

Tox virtual env manager support:
================================
Adds support for virtualenv manager tox. Tox simplifies local testing of
compatibility for multiple python versions before pushing to github and
creating pullrequest. Travis and Appveyor integration still has to be
tested and verified.

'# Sie sind gerade beim Rebase von Branch 'final_and_cleanup_4.1.0' auf 'ab9a0ee'.
hernot added a commit to hernot/hickle that referenced this issue Feb 17, 2021
Compression keywords safety tests:
==================================

In issue telegraphic#140 it was reported that some loaders crash
when 'compression', 'chunks' and related h5py keyword arguments are
specified. By running pytest a second time and thereby specifying
the custom --enable-compression parameter all tests are rerun with

   kwargs={'compression':'gzip', 'compression_opts':6}

All compression sensitive tests especially all 'test_XX_*.py::*' unit test
functions must include the 'compression_kwargs' parameter in their
signature to receive the actual kwargs list to be passed to all 'create_fcn'
function defined by loader module. In case a test function misses to
pass on the 'compression_kwargs' as keyword arguments ('**kwargs') to
   'hickle.dump',
   'hickle._dump',
or any dump_method listed in 'class_register' table of loader module or
specified directly in a 'LoaderManager.register_class' an AssertionError
exception is thrown indicating the name of the test function, the line
in which the affected function is called any function which it calls.
Tests which either test compression related issues explicitly or do not
call any of the dump_functions may be marked accordingly using the
   'pytest.mark.no_compression'
marker to explicitly exclude test function from compression testing.

Tox virtual env manager support:
================================
Adds support for virtualenv manager tox. Tox simplifies local testing of
compatibility for multiple python versions before pushing to github and
creating pullrequest. Travis and Appveyor integration still has to be
tested and verified.

'# Sie sind gerade beim Rebase von Branch 'final_and_cleanup_4.1.0' auf 'ab9a0ee'.
hernot added a commit to hernot/hickle that referenced this issue Feb 19, 2021
Compression keywords safety tests:
==================================

In issue telegraphic#140 it was reported that some loaders crash
when 'compression', 'chunks' and related h5py keyword arguments are
specified. By running pytest a second time and thereby specifying
the custom --enable-compression parameter all tests are rerun with

   kwargs={'compression':'gzip', 'compression_opts':6}

All compression sensitive tests especially all 'test_XX_*.py::*' unit test
functions must include the 'compression_kwargs' parameter in their
signature to receive the actual kwargs list to be passed to all 'create_fcn'
function defined by loader module. In case a test function misses to
pass on the 'compression_kwargs' as keyword arguments ('**kwargs') to
   'hickle.dump',
   'hickle._dump',
or any dump_method listed in 'class_register' table of loader module or
specified directly in a 'LoaderManager.register_class' an AssertionError
exception is thrown indicating the name of the test function, the line
in which the affected function is called any function which it calls.
Tests which either test compression related issues explicitly or do not
call any of the dump_functions may be marked accordingly using the
   'pytest.mark.no_compression'
marker to explicitly exclude test function from compression testing.

Tox virtual env manager support:
================================
Adds support for virtualenv manager tox. Tox simplifies local testing of
compatibility for multiple python versions before pushing to github and
creating pullrequest. Travis and Appveyor integration still has to be
tested and verified.

'# Sie sind gerade beim Rebase von Branch 'final_and_cleanup_4.1.0' auf 'ab9a0ee'.
hernot added a commit to hernot/hickle that referenced this issue Mar 1, 2021
Compression keywords safety tests:
==================================

In issue telegraphic#140 it was reported that some loaders crash
when 'compression', 'chunks' and related h5py keyword arguments are
specified. By running pytest a second time and thereby specifying
the custom --enable-compression parameter all tests are rerun with

   kwargs={'compression':'gzip', 'compression_opts':6}

All compression sensitive tests especially all 'test_XX_*.py::*' unit test
functions must include the 'compression_kwargs' parameter in their
signature to receive the actual kwargs list to be passed to all 'create_fcn'
function defined by loader module. In case a test function misses to
pass on the 'compression_kwargs' as keyword arguments ('**kwargs') to
   'hickle.dump',
   'hickle._dump',
or any dump_method listed in 'class_register' table of loader module or
specified directly in a 'LoaderManager.register_class' an AssertionError
exception is thrown indicating the name of the test function, the line
in which the affected function is called any function which it calls.
Tests which either test compression related issues explicitly or do not
call any of the dump_functions may be marked accordingly using the
   'pytest.mark.no_compression'
marker to explicitly exclude test function from compression testing.

Tox virtual env manager support:
================================
Adds support for virtualenv manager tox. Tox simplifies local testing of
compatibility for multiple python versions before pushing to github and
creating pullrequest. Travis and Appveyor integration still has to be
tested and verified.

'# Sie sind gerade beim Rebase von Branch 'final_and_cleanup_4.1.0' auf 'ab9a0ee'.
hernot added a commit to hernot/hickle that referenced this issue Mar 2, 2021
Compression keywords safety tests:
==================================

In issue telegraphic#140 it was reported that some loaders crash
when 'compression', 'chunks' and related h5py keyword arguments are
specified. By running pytest a second time and thereby specifying
the custom --enable-compression parameter all tests are rerun with

   kwargs={'compression':'gzip', 'compression_opts':6}

All compression sensitive tests especially all 'test_XX_*.py::*' unit test
functions must include the 'compression_kwargs' parameter in their
signature to receive the actual kwargs list to be passed to all 'create_fcn'
function defined by loader module. In case a test function misses to
pass on the 'compression_kwargs' as keyword arguments ('**kwargs') to
   'hickle.dump',
   'hickle._dump',
or any dump_method listed in 'class_register' table of loader module or
specified directly in a 'LoaderManager.register_class' an AssertionError
exception is thrown indicating the name of the test function, the line
in which the affected function is called any function which it calls.
Tests which either test compression related issues explicitly or do not
call any of the dump_functions may be marked accordingly using the
   'pytest.mark.no_compression'
marker to explicitly exclude test function from compression testing.

Tox virtual env manager support:
================================
Adds support for virtualenv manager tox. Tox simplifies local testing of
compatibility for multiple python versions before pushing to github and
creating pullrequest. Travis and Appveyor integration still has to be
tested and verified.

'# Sie sind gerade beim Rebase von Branch 'final_and_cleanup_4.1.0' auf 'ab9a0ee'.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants