
Hickle not working with h5py 3.0 #143

Closed
cyfra opened this issue Nov 2, 2020 · 12 comments

cyfra commented Nov 2, 2020

Hickle seems to have stopped working with h5py 3.0.0; it works fine with 2.10.

Example code:

import hickle as hkl

hkl.dump([1,2,3], "/tmp/foo.hkl")
hkl.load("/tmp/foo.hkl")

Fails with

ValueError: Provided argument 'file_obj' does not appear to be a valid hickle file! (Cannot load <HDF5 dataset "data": shape (3,), type "<i8"> data type)
1313e (Collaborator) commented Nov 3, 2020

Hi @cyfra,

we need a bit more information than that.
For example, what version of hickle are you using?

1313e (Collaborator) commented Nov 4, 2020

Alright, I found the issue.
In h5py 3.0, strings stored in HDF5 files are now returned as unicode rather than bytes.
Making this work again would require changing every place where strings are read in.

@telegraphic I think the best option for now is to exclude h5py 3.x from the requirements until a few more versions of it have been released.
Short of writing a function that checks which version of h5py is installed, uses the appropriate method of reading strings, and applying it literally everywhere hickle reads strings, we cannot make a compatibility change for this.
Forcing everyone to upgrade to h5py 3.0 right now feels like a bit too much to me.
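
For illustration, a shim of the kind described above might look roughly like this (a sketch only, not hickle's actual code; it relies on the fact that h5py 2.x returns variable-length strings as bytes while h5py 3.x returns them as str):

    def read_hdf5_string(value):
        """Normalise a string read from an HDF5 file to str across h5py versions."""
        # h5py 2.x (and fixed-width byte strings under 3.x) hand back bytes
        if isinstance(value, bytes):
            return value.decode('utf-8')
        # h5py 3.x already returns str for variable-length strings
        return value

An isinstance check like this avoids an explicit version check, but it would still have to be applied at every place hickle reads a string, which is exactly the effort described above.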

cyfra (Author) commented Nov 7, 2020

@1313e - thanks a lot for looking into this so quickly.

hernot added a commit to hernot/hickle that referenced this issue Dec 3, 2020
With hickle 4.0.0 the code for dumping and loading dedicated objects
like scalar values or numpy arrays was moved to dedicated loader
modules. This first step of disentangling the hickle core machinery from
object-specific code covered all objects and structures which were mappable
to h5py.Dataset objects.

This commit provides an implementation of hickle extension proposal
H4EP001 (telegraphic#135). In this
proposal the extension of the loader concept introduced by hickle 4.0.0
towards generic PyContainer-based and mixed loaders is specified.

In addition to the proposed extension this implementation includes
the following extensions to hickle 4.0.0 and H4EP001:

H4EP001:
========
    The PyContainer interface includes a filter method which allows loaders,
    when data is loaded, to adjust, suppress, or insert additional data
    subitems of h5py.Group objects. In order to accomplish the temporary
    modification of h5py.Group and h5py.Dataset objects when the file is
    opened in read-only mode, the H5NodeFilterProxy class is provided. This
    class stores all temporary modifications while the original h5py.Group
    and h5py.Dataset objects stay unchanged.

hickle 4.0.0 / 4.0.1:
=====================
    Strings and arrays of bytes are stored as Python bytearrays and not as
    variable-sized strings and bytes. The benefit is that hdf5 filters
    and hdf5 compression filters can be applied to Python bytearrays.
    The downside is that the data is stored as bytes of int8 datatype.
    This change affects native Python string scalars as well as numpy
    arrays containing strings.

    numpy masked arrays are now stored as an h5py.Group containing a
    dedicated dataset each for data and mask.

    scipy.sparse matrices are now stored as an h5py.Group containing
    the datasets data, indices, indptr and shape.

    Dictionary keys are now used as names for h5py.Dataset and
    h5py.Group objects.

    Only string, bytes, int, float, complex, bool and NoneType keys are
    converted to name strings; for all other keys a key-value-pair group
    is created containing the key and value as its subitems.

    String and bytes keys which contain slashes are converted into key-value
    pairs instead of converting slashes to backslashes. They are distinguished
    from hickle 4.0.0 string and bytes keys with converted slashes by
    enclosing the string value within double quotes instead of the single
    quotes produced by the Python repr function or the !r and %r string
    format specifiers. Consequently, on load all string keys which are
    enclosed in single quotes are subjected to slash conversion while all
    others are used as is.

    h5py.Group and h5py.Dataset objects whose 'base_type' refers to 'pickle'
    automatically get object assigned as their py_obj_type on load; the
    related 'type' attribute is ignored. h5py.Dataset objects which do not
    expose a 'base_type' attribute are assumed to contain a pickle string and
    thus implicitly get assigned the 'pickle' base type. Consequently, on
    dump the 'base_type' and 'type' attributes are omitted for all
    h5py.Dataset objects which contain pickle strings, as their values are
    'pickle' and object respectively.

Other stuff:
============
    Full separation between hickle core and loaders

    Distinct unit tests for individual loaders and hickle core

    Cleanup of functions and classes that are no longer required.

    Simplification of recursion on dump and load through a self-contained
    loader interface.

    Capable of loading hickle 4.0.x files, which do not yet support the
    PyContainer concept beyond list, tuple, dict and set; includes extended
    tests for loading hickle 4.0.x files.

    Contains a fix for the lambda py_obj_type issue on numpy arrays with a
    single non-list/tuple object as content. Python 3.8 refuses to unpickle
    the lambda function string; this was observed while finalizing the pull
    request. The fixes are only activated when a 4.0.x file is to be loaded.

    Exceptions thrown by load now include the triggering exception and its
    stack trace, for better localization of the error in debugging and
    error reporting.

    h5py version limited to <3.x according to issue telegraphic#143
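
For readers unfamiliar with the proxy idea mentioned under H4EP001 in the commit message above, a purely illustrative sketch (not hickle's actual H5NodeFilterProxy) could look like this: attribute edits go to an in-memory copy while the underlying read-only node is never touched.

    class NodeFilterProxySketch:
        """Illustrative only: overlay attribute changes on a read-only h5py node."""

        def __init__(self, h_node):
            self._h_node = h_node              # real h5py.Group or h5py.Dataset
            self._attrs = dict(h_node.attrs)   # editable in-memory copy of its attributes

        @property
        def attrs(self):
            return self._attrs                 # loaders see and may modify the copy

        def __getitem__(self, key):
            return self._h_node[key]           # data access goes to the real node

        def __getattr__(self, name):
            return getattr(self._h_node, name) # shape, dtype, etc. come from the real node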
1313e added this to the h5py v3 milestone Dec 9, 2020
hernot (Contributor) commented Dec 12, 2020

@1313e @telegraphic Just in case it is of any interest or help to you, I wanted to let you know the following:
A while ago I came into contact with the tox package on another project I contribute to, and I have to admit I like it. I was therefore curious whether it could simplify maintaining the production release while, in parallel, working on another branch on e.g. the issues arising from supporting h5py 2.x alongside the recent 3.0 and future 3.x versions. The result of this curiosity can be found in the detached concept_memp_compact_expand branch of my hickle fork. Be warned: the only test working there is test_stateful.py, which requires at least one > 100 MB file that I would have to share with you via a file-sharing service, as GitHub does not allow linking forked repos to Git(Hub) Large File Storage. Further, even though I have edited .travis.yaml and .appveyor.yaml according to the documentation so that tests run through tox instead of calling pytest directly, I do not know whether that works as intended or at all (thanks to the other project I have some idea about Travis, but none about AppVeyor).

1313e (Collaborator) commented Dec 17, 2020

Alright, given that I have had very little time lately to do things, I am just removing h5py 3.x from the requirements for now.

hernot (Contributor) commented Dec 20, 2020

This is the only one for which, as a by-product of the compression safety-proofing, I have an unfinished suggestion that still needs to be discussed and extended; it will be included in the finalize-and-clean-up pull request.

hernot added a commit to hernot/hickle that referenced this issue Dec 21, 2020
hernot (Contributor) commented Jan 1, 2021

OK, one issue which will definitely persist even after the memoisation implementation is the fact that h5py >= 3.0 converts all bytes strings assigned to an attribute, such as 'item_type', to str, assuming that they were created from Python 2 strings or are just manually encoded Python 3 str objects. The only option I have found to enforce creation of a bytes-type attribute is to use the create method of the attrs property, which allows the dtype of the attribute to be specified explicitly:

# force a fixed-width bytes attribute instead of letting h5py pick a str dtype
some_dataset_or_group.attrs.create('item_type', data=b'str', dtype='S{}'.format(len(b'str')))
print(some_dataset_or_group.attrs['item_type'], type(some_dataset_or_group.attrs['item_type']))
print(some_dataset_or_group.attrs.get('item_type'), type(some_dataset_or_group.attrs.get('item_type')))
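
A self-contained version of the snippet above (the file name is hypothetical) shows that the attribute then stays bytes even under h5py >= 3.0:

import h5py

with h5py.File('/tmp/attr_demo.h5', 'w') as f:
    grp = f.create_group('payload')
    # explicit fixed-width bytes dtype instead of h5py's default str mapping
    grp.attrs.create('item_type', data=b'str', dtype='S{}'.format(len(b'str')))

with h5py.File('/tmp/attr_demo.h5', 'r') as f:
    value = f['payload'].attrs['item_type']
    print(value, type(value))   # b'str' and a bytes type, not str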

Alternatively, just support h5py >= 3 in newer versions of hickle and sacrifice support for h5py < 3, though I'm not sure whether that is currently an option. Just as a side note, with type memoisation in place this will not be an issue for the 'type' and 'base_type' attributes. But reading older files will be impossible with hickle versions supporting h5py >= 3, as pickle strings will cause encoding errors when h5py tries to convert them to the utf-8 used by str.

I have not yet seen this issue reported by the unit tests added by PR #138. I will keep you posted, as I would also be interested in using hickle on newer systems where h5py >= 3 is already installed and switching back to h5py < 3 is not an option.

EDIT:

It is mostly harmless. For the vast majority of strings which are intentionally stored as Python bytes, such as base_type or str_type, it is sufficient to replace, for example,

  if str_type == b'str':
     ....

by

  if str_type in (b'str','str'):
     .....

And for the base_type strings it is sufficient if the register_class method simply adds two entries for each loader to the hkl_types_table (hickle <= 4.0.3 as of 4 January 2021):

   # in register_class
   if load_fcn is not None:
       hkl_types_table[hkl_str] = load_fcn
       hkl_types_table[hkl_str.decode('utf8')] = load_fcn

The same holds for key_base_type strings:

    if key_base_type in (b'str','str'):
        ....

dict_key_types_dict = {
    b'float': float,
    'float': float,
    b'int': int,
    'int': int,
    ...
}
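
A hedged alternative to duplicating every table entry would be to normalise the attribute value once at the read site; a sketch under the assumption that the lookup tables are keyed by bytes (not hickle's actual code):

def to_bytes(value, encoding='utf-8'):
    """Return bytes for values that h5py 3.x may hand back as str."""
    return value.encode(encoding) if isinstance(value, str) else bytes(value)

# e.g. base_type = to_bytes(h_node.attrs.get('base_type', b'pickle'))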

A bit more tricky is how h5py >= 3.0 handles strings stored by dump_scalar_dataset. As these strings are stored without explicitly specifying the dtype of the dataset, loading data written by hickle <= 4.0.3 requires special provision:

   # in load_scalar_dataset
   content = h_node[()]
   if py_obj_type is str and 'object' in h_node.dtype.name and h_node.attrs.get('str_type', None) is None:
       if not isinstance(content, str):
           content = content.decode('utf8')

The good news is that pickle strings stored in the 'type' attribute are not affected at all. Likely because they contain bytes which are not part of valid utf-8 encoded strings, they seem to be returned as bytes strings in any case.

As soon as @telegraphic finds time so that we can finalize #138 and work on the already prepared but not yet published pull requests for #139 and #145, I could also provide a production-ready fix for this in the likewise already prepared finalize-and-clean-up pull request. The only thing still to be discussed is whether two distinct pytest runs for h5py 2 and h5py 3 compatibility are required, or whether hickle shall always be tested against the latest h5py version by default, with tests that mock h5py 2 behaviour under the latest h5py version added only if users report issues with h5py 2.

hernot added a commit to hernot/hickle that referenced this issue Jan 5, 2021
hernot added a commit to hernot/hickle that referenced this issue Jan 18, 2021
hernot added a commit to hernot/hickle that referenced this issue Feb 17, 2021
hernot added a commit to hernot/hickle that referenced this issue Feb 17, 2021
hernot added a commit to hernot/hickle that referenced this issue Feb 19, 2021
fabaff pushed a commit to NixOS/nixpkgs that referenced this issue May 13, 2021
looks like this will be broken until
telegraphic/hickle#143 is addressed
peendebak commented

@hernot Any update on this issue? h5py < 3.x does not provide wheels for Python 3.9 on Windows, so this blocks hickle on Python 3.9.

hernot (Contributor) commented Aug 3, 2021

The situation is:

From Python 3.9 on, only h5py >= 3.x is available, and that only as 64-bit builds. In addition, manual compilation and installation via pip will also fail, because h5py >= 3.x requires libhdf5 versions which are likewise available as 64-bit only.

The status of hickle, as far as I'm involved, is:

The forthcoming hickle >= 5.x, currently being reviewed by @1313e and @telegraphic, will support h5py >= 3.x. From Python 3.9 on, 64-bit is the only option. On Python < 3.9 it will fall back to h5py 2.10, especially for 32-bit Windows.

@1313e are there any plans for hickle 4.x to backport the requirements?

Conclusion:

If you need 32-bit Windows, then you are locked in to Python < 3.9 and h5py 2.10.
If you need 64-bit Windows, for now you are locked in to Python <= 3.8 and h5py 2.10 too.
For using h5py >= 3.0 and/or Python >= 3.9 you will have to wait for hickle 5.

Until hickle >= 5.0 is released, explicit installation of h5py 2.10 along with hickle (especially for 32-bit Windows) is possible for Python <= 3.8:

pip install h5py==2.10 hickle
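
For projects that track this in a requirements file, an environment marker keeps the pin from spilling over to Python >= 3.9 (an illustrative sketch, not something shipped with hickle):

# requirements.txt (illustrative)
h5py==2.10.0; python_version < "3.9"
hickle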

The bad news is that for Python >= 3.9 you will likely have to wait for hickle 5 and migrate to 64-bit Windows, unless downgrading to Python 3.8 or older is an option for you.

1313e (Collaborator) commented Aug 4, 2021

@peendebak @hernot Yeah, I think we should backport this to hickle v4.
I am currently in the process of moving back to my home country, but I should have time to do that afterward.
So, I hope you can wait for a few weeks, as it isn't a very trivial process.
Unless you want to give it a go yourself, of course.

hernot (Contributor) commented Aug 4, 2021

@peendebak If you are intending to attempt the backport, I can help and share the knowledge related to this topic that I gained while preparing the hickle 5 release candidate.
@1313e When attempting the backport, should hickle 4 then also be switched from Travis and AppVeyor to GitHub Actions?

1313e (Collaborator) commented Aug 4, 2021

@hernot Probably, yes.
Moving away entirely from Travis CI and AppVeyor would be really great.
