H4EP 001: Container and mixed (dataset + Container) loaders (draft) #135

hernot · 2020-06-20T15:04:18Z

Abstract

By the proposed extension all logic specific to individual datatypes including python list, tuple and dict would be shifted to dedicated loaders, reducing the complexity of the core machinery of hickle. On loading the decision how to restore the object is reduced to the selection of the loader based upon the base_type and object_type attributes and the type of hdf5 entry (dataset or group). Any thing else is handled either by the load_function or PyContainer class provided by the loader. This opens the possibility to support additional container like objects, properly map the state of state full objects to hdf5 file structure etc.

Motivation

With hickle 4.0.0 the concept of loaders was introduced. Each loader links a python object to an exporter, a method or function which converts the content of the object to a hdf5 data set, a base type which allows to identify the loader to be used to restore the Python object from hdf5 file and defines some requirements for properly restoring the order of items within the resulting Python container represented by the data set.
This information is encoded in the class_register table of each loader module. An entry within this table is a list or tuple with the following items:

[<ClassType>,<hkl_str>,<dump_function>,<load_function>,ndarray_check_fn = None,to_sort = True]

The last two items can be omitted if their values read None and True respective as each table entry is passed as list of input arguments to the register_class method on importing each loader module when required.

In case the object to be dumped resembles a more complex structure of lists, tuples and/or dicts which can not one to one be mapped to datasets as their items are again lists, tuples, dicts or numpy arrays and others than hickle core machinery creates an hdf5 group instead. Within this group each sub item is dumped to its own sub group if item is still too complex to be handled by loader or a subdata set otherwise.
When loading such a Group it creates an intermediate PyContainer object as a place holder and handler while restoring the contents of the corresponding list, tuple or dict. Like the dumping of theses container types the PyContainer is hard coded within the hickle core containing complex logic required to distinguish between list, tuple, dict etc. Adding support of other container like structures, for example tuple returned by object.__reduce_ex__ (#125) and objects, or implementing concepts like file based dictionary items (#133) to this system is hardly possible and would increase the complexity of the _load function as well as the complexity of the PyContainer object while at the same time worsen the overall maintainability of hickle.

Specification

Therefore the following extension to the loader machinery is proposed.
The class_register table entries and the parameter list of the register_class method are extended by an additional container_class entry as follows

[<ClassType>,<hkl_str>,<dump_function>,<load_function>,<container_class>=None,<ndarray_check_fn> = None,<to_sort> = True]

Where the last three entries, in case set to their default values None and True respective can be omitted.

The PyContainer class is reduced to an an abstract base class any <container_class> must be derived from. It should only define the fields and methods required by the hickle Core machinery to properly restore a hdf5 group representing the corresponding python object. Any additional field or method required to restore a specific type has to be defined and implemented by the corresponding Container class derived from PyContainer. The PyContainer class shall provide the following elements and if reasonable should map basic fields required by the hickle core machinery to work properly to __slots__ instead of the class __dict__.

class PyContainer():
    __slots__ = ("base_type", "object_type", "_h5_attrs", "_content","__dict__" )

    def __init__(self,h5_attrs, base_type, object_type):
         # the base type used to select this PyContainer
        self.base_type = basetype
         # class of python object represented by this PyContainer
        self.object_type = object_type
        # the h5_attrs structure of the h5_group to load the object_type from
        # can be used by the append and convert methods to obtain more
        # information about the container like object to be restored
        self._h5_attrs = h5_attrs
        # intermediate list, tuple, dict, etc. used to collect and store the sub items
        # when calling the append method
        self._content = []
 
    def append(self,item,h5_attrs):
         """
         adds the passed item (object) to the content of this container.
         The h5_attrs parameter corresponds to the attribute list of the h5_group or
         h5_dataset the item was loaded from. This can be used by derived Container classes
         for example a dict container. Such a container could use a dict instead of the default
         list object as intermediate store and extract the information required to restore the
         corresponding key from the h5_attrs list.
         """
         self._content.append(item)

    def convert(self):
         """
         creates the final object and populates it with the items stored in the _content slot
         must be implemented by the derived Container classes
         """
              raise NotImplementedError("convert method must be implemented")

    def __getattr__(self,attr_name):
         # Optionally PyContainer could provide a __getattr__ and python
         # automatic propery (DataDescriptor) based mechanism to access 
         # h5_group attributes as attributes and properties of this container
         # details not included in this proposal and thus subject to further
         # discussion or another H4EP
         value = self._h5_attrs.get(Name,self):
         if value is self:
              raise AttributeError("'{}' attribute not available for Container".format(attr_name))
         return value

The _load method of the core hickle machinery would just take care that the appropriate load_method of the identified base_type is called or an instance of the dedicated PyContainer class is used to restore the object described by an hdf5 group.

Rational

The proposed extension would allow to cover the following 3 Cases:

primitive datatypes, numpy arrays and python containers (list, tuples etc.) which just contain primitive datatypes mappable to datatypes supported by hdf5 datasets.
For theses a class_register table would read as follows (with the default entries omitted):
[<ClassType>,<hkl_str>,<dump_function>,<load_function>]

The dump function would create the corresponding hdf5 dataset and the load_function would restore the corresponding container object or primitive type as is already implemented in hickle 4.0.0. No change here.

container object with no direct correspondence in hdf5 file format like dict described by the following class_register table entry (default items again omitted):

[<ClassType>,<hkl_str>,<dump_function>,<load_function>= None,<container_class>]

The Container representing a Python dict would likely use a dict as internal intermediate storage for the dict items. This would be returned by its convert method without any further modification. All the logic about how to restore the key would be encoded in its append method. This could inspect the value of a hdf5 key_type attribute. If that reads name the key would be extracted from the hdf5 path-name of the item, whereas if it would read item or value or key-value than it would interpret the the passed item as a two element tuple where the first element represents the key and the second the corresponding value. The dump_method would ensure that the value of the key_type attribute is set to the apropriate value and the item is stored accordingly.

mixed container objects like list and tuple which can be mapped to hdf5 dataset if containing only primitive datatypes and to hdf5 group if their sub items resemble complex objects including list, tuple, etc. The corresponding class_register table entry would read as follows (default items omitted)

[<ClassType>,<hkl_str>,<dump_function>,<load_function>,<container_class>]

In case Container is mapped to hdf5 dataset the load_function would be called for restoring as already implemented. In case Python object is mapped to hdf5 group an instance of the corresponding PyContainer class is used to restore the represented Python object.
For example a ListConatiner and a TupleContainer would implement their restore methods as follows

class ListContainer(PyContainer):
      def restore(self):
            return self._content

class TupleContainer(PyContainer):
     def restore(self):
           return tuple(self._content)

With this three cases it is possible to dump complex Python structures and objects to hdf5 file format and restore them from it based upon the object_type and base_type attributes only. Any additional logic required is encoded by the dump_function, load_function and PyConatiner only. This allows in future to support custom container types not necessarily derived from list, tuple, dict or numpy arrays as well as the Python copy protocol (#125) for dumping state full objects or allow third party to replace individual loaders. The replaced loaders could thereby redefine the structure representing individual object types within the hdf5 file. For example the a custom dict loader could map dict key value pairs to dedicated hdf5 files on file system (#133) using the main hdf5 file as a sort of table of contents, central or main document.
Complex Python structures can contain items which are linked and shared across the object structure multiple times by sharing a reference to the same Python object. A future extension proposal to hickle V4 could uptake this by introducing a hdf5 soft-link based memoisation mechanism. This mechanism could be managed by _dump and _load methods only similar to the memo dictionary structure utilized by copy.deepcopy (compare https://docs.python.org/3.8/library/copy.html#copy.deepcopy ) function for example.

The maintainability of hickle core machinery and loaders is increased by this proposal, as all logic specific to individual object_type - base_type pair is encoded by corresponding loader only. Each loader can be maintained independent of any other unless one extending the other or utilizing functionality of the other.
Profiling and run time performance optimization is simplified as this can be done for hickle core machinery and each loader independent from each other. Optimizing first dump_function, load_function and/or PyContainer for very frequently used loaders whilst exotic loaders would possibly never be considered to be a reasonable target for optimization.

Open Issues

Common look uptable for load_methods and PyConatiners or two dedicated tables one for load_methods and one for PyContainers.
Necessity of __getattr__ and datadescriptor based mapping of hdf5 attributes to PyContainer attributes as it would just be syntactic sugar with low additional value only.
Check if any optional item in the class_register table becomes obsolete as intrinsically handled by loader (dump_function, load_method and container_class) or managed though hdf5 group and dataset attributes by loader.

References

None

Preconditions

hickle 4.0.0 released

The text was updated successfully, but these errors were encountered:

1313e · 2020-06-21T03:10:57Z

@hernot Wow.
I will go and read all of that pretty soon.
Also, I like the name for the enhancement proposal. :)

hernot · 2020-06-21T07:46:37Z

Consider one thing. for now it is just a draft for discussion, during which i will edit and amend it according to our findings. Edit will also occur when i find while rereading it that some wording is unclear or to complicated. Such edits should not have any effect on the content. Thus no need to react on each and every edit and amendment.

1313e · 2020-06-25T04:22:45Z

After reading this entirely carefully, I have to say that I am impressed.
I had a similar idea in mind (separate out and isolate the loaders as much as possible), but it was far out of the scope for v4, so I didn't do it.
You will probably notice in the code that there are several cases where the separation (for example, for dicts) could not be made because the current API cannot handle that.
I however had not written it out as clearly as this yet, so many thanks for that. :)

Yeah, I fully agree with this proposal and I like the solutions as well.
I would definitely be willing to implement something like this, but it would require a serious rewrite of basically the entirety of hickle to make sure that future enhancements do not require significant API changes (again).
Personally, I think we should focus on rewriting hickle into a very modular framework that can be easily extended.
I already made some steps into that direction (as you mentioned) by isolating the loaders more, but this can definitely be improved.

hernot · 2020-06-25T06:26:11Z

I would definitely be willing to implement something like this, but it would require a serious rewrite of basically the entirety of hickle to make sure that future enhancements do not require significant API changes (again).

I'm not sure how far there much changes are required. load_function is responsible for datasets, which are to be considered leaves when viewing hdf5 document as tree, and thus do not require recursively calling _load. Same for the proposed container_class. Its instance objects act as an intermediate proxies for the inner inner Nodes of the tree and receive their leaves and subnodes through a call to their append method. Thus they also do not require recursive calls to _load. The only part of the loader which might require recursively call _dump function of hickle core is the loaders dump_function. But whether it's parameter list or the value returned requires any modification to further reduce interdependence between loader and hickle core machinery I'm not so sure. The only reason which i could think of is to reduce the levels of alternating recursive calls between _dump and loader dump_functions when the latter also are responsible for proper mapping and dumping of container like objects and recursive object structures.

1313e · 2020-06-25T06:43:12Z

Many functions rely on the fact that dumping and loading is done in a recursive fashion.
Therefore, the removal or addition of recursive calls requires a major rewrite of the core API.
Not that I am against that.

PS: hickle v4 is now released. :)

1313e · 2020-06-25T08:23:38Z

So, let me add my own opinions here, from the implementation standpoint.

The current big problem in hickle is that it uses HDF5 groups for iterables, and HDF5 datasets for objects (that might be iterables, but simple ones).
Whenever it encounters an object, it checks if it is a NumPy array-like object or if it can be iterated over.
If the first, it calls the dump-function of the appropriate type.
If the latter, it creates an HDF5 group, saves the type of iterable, and then iterates over the contents by doing the same check again.
It recursively keeps performing this check until everything can be stored in a dataset.
If neither, a dataset is created immediately.

Loading is done in a similar fashion, where only datasets will trigger the call of a specific load-function (with the exception of dicts).
If a group is encountered (except for dicts), loading is done recursively on all groups/datasets in that group until only datasets are found and then the contents are added to a list (which is PyContainer).
This works for simple objects, but not for complex ones that might require several datasets worth of data to be stored.
This is already a problem in hickle when storing NumPy masked arrays or SciPy sparse matrices, and requires some weird 'hacking' to get working properly.

I noticed the above already while writing v4, but I quickly realized that some extensive changes would be required to get this working properly.
As my aim for v4 was to get hickle into a shape where any object can reliably be dumped and loaded (which, as far as I know, it can now), regardless of whether it is pretty or not, it was out of the scope for v4.
I did however implement some features into hickle v4 already that would be required later for this API change, like the loaders and the separation between the true type and the base type of the dumped object.

So, what needs to be done, in my opinion, is that hickle gets rid of the recursive calling of the _dump and _load functions, and preferably the PyContainer class as well.
In order to do that, the ideas proposed in this proposal are required to be implemented properly, with some modifications.

What I am thinking is the following:
When an object is requested to be dumped, hickle checks if there is a loader available for this specific type of object (hickle v4 already does this by iterating over the MRO of an object, and selecting the first match).
This loader is a subclass of a HickleLoader abstract base class, which must supply functions that are used to dump and load a hickled object.
If this loader is available, the appropriate dump-function is called with the object that must be dumped and the path within the HDF5-file that must be used for dumping.
Afterward, hickle checks if the provided path now exists within the HDF5-file and adds some extra metadata to it to ensure that the object can be loaded again (like, adding what loader was used for this object, which is needed later for loading).
The creation of any groups or datasets is fully up to the dump-function of the loader, but it is only allowed to create a dataset with the provided path, or a group with that path and any groups/datasets within it (basically, the provided path sets the root of the structure that the dump-function is allowed to use).

Now, many objects in Python are containers, meaning that they contain other objects.
Instead of requiring a dump-function to be able to deal with all possible object types, it is allowed to call hickle's dump itself, providing it with the object and the group/path that must be used to dump that object.
This will just execute the same process again as explained in the previous paragraph.
Loading is done in the same fashion, where hickle reads the metadata at a specific path, calls the load-function of the appropriate loader on that path and returns whatever object that load-function returns.
The load-function itself is also allowed to call hickle's load to handle any objects that were dumped in deeper levels.

Basically, this structure will solve the problems that there are currently with the core of hickle.
Objects are theoretically speaking still dumped/loaded in a recursive fashion, but the functions in the loader determine where, when and how many times this is done, as opposed to hickle itself (as it works right now).
Providing an abstract base class whose subclasses are used as loaders, would also make registering loaders really easy: You provide the subclass to a register_loader function and that is it.
All information required by hickle is provided by this subclass, and thus you don't have to provide it separately.

This is not the first time I have dealt with writing modular structures.
For example, my Python plotting GUI GuiPy (still under heavy development) is written to be so modular, that basically anything in the entire GUI can be modified, adapted or added by the user.
I use a similar system there as I propose above, where the program only knows about an object down to a certain scope, and everything below that is a black box.
The object itself knows nothing about the outer scopes, creating a very isolated, modular environment.

@hernot Any comments?

hernot · 2020-06-25T12:13:24Z

Ahm, that is exactly what I already understood from hickle v4.0.0 sorces and what this proposal is about. A few thoughts as they just come to my mind.

Keeping some control inside hickle about general document structure and have hickle core guide respective support the loaders on dumping and on loading: Loaders just need to care about how the object they are responsible for to be stored inside the file and how from the content of dataset or group content to restore the pyton object again.
Keeping recursive calls of _load and _dump inside these to functions: would take from loader the responsability of properly calling back into hickle core. Loaders and hickle core can be strictly separated with minimal interface beeing the calling parameters and return value structure.
rather extend returnvalue of dump_functions to return either hdf5 dataset or tuple not just containing the hdf5 group but also an iterable (sequence, iterator, generator,list, etc. ) of tuples containing the subobject to be included into to the hdf5 group as subitem, the hdf5 name to be used as tail part of the hdf5 path of subitem. Have _dump recall it self on every item yielded by sequence and append to hdf5 group returned.
keep attributes hickle requires to control dumping and loading as low as possible and at the same time take the burden from loaders to bother about more than the object stored and its immediate items to be stored along.
With these it should quite simple to implement in future hdf5 softlinks for objects which are shared between multiple parts of complex object structure and others, independent of which object it is and whether loader would be awave of shared items or not.
_load and _dump need not to be part of interface provided to hickle loader modules thus keeping the two strictly separate. In ideal case loader does not require any access to hickle core functions.

To summarize: there are two basic approaches to handling of recursion:

keep all recursion in _dump and _load and have dump_function just return either single dataset or tuple containing group and iterable providing all items which are subject to recursive call to _dump which would be most symetric to current implementation of load in hickle 4.0.0 if Container as proposed are added.
use _dump and _load as trigger for dump_function and load_function of loaders and have them decide whether another call to _dump and _load is necessary for properly dumping or restoring current object. Ensuring that any information required by _dump and _load only is properly passed on by loaders may be tricky especially if in future releases hickle accepts custom loaders not included within released package.

I'd prefer 1) as i feel that it provides a clean separation between loader and hickle core allowing to make future changes in the kickle core without the need to check whether this change would affect any of the loader or even all. The same for loaders which have to conform to a well defined minimal interface to be useable with hickle and beyond that do not need to bother about any inner working of hickle core.

With hickle 4.0.0 the code for dumping and loading dedicated objects like scalar values or numpy arrays was moved to dedicated loader modules. This first step of disentangling hickle core machinery from object specific included all objects and structures which were mappable to h5py.Dataset objects. This commit provides an implementaition of hickle extension proposal H4EP001 (telegraphic#135). In this proposal the extension of the loader concept introduced by hickle 4.0.0 towards generic PyContainer based and mixed loaders specified. In addition to the proposed extension this proposed implementation inludes the following extensions hickle 4.0.0 and H4EP001 H4EP001: ======== PyContainer Interface includes a filter method which allows loaders when data is loaded to adjust, suppress, or insert addtional data subitems of h5py.Group objects. In order to acomplish the temorary modification of h5py.Group and h5py.Dataset object when file is opened in read only mode the H5NodeFilterProxy class is provided. This class will store all temporary modifications while the original h5py.Group and h5py.Dataset object stay unchanged hickle 4.0.0: ============= Strings and arrays of bytes are stored as Python bytearrays and not as variable sized stirngs and bytes. The benefit is that hdf5 filters and hdf5.compression filters can be applied to Python bytearrays. The down is that data is stored as bytes of int8 datatype. This change affects native Python string scalars as well as numpy arrays containing strings. Extends pickle loader create_pickled_dataset function to support Python copy protocol as proposed by issue telegraphic#125 For this a dedicated PickledContainer is implemented to handle all objects which have been stored using Python copy protocol. numpy.masked array is now stored as h5py.Group containin a dedicated dataset for data and mask each. scipy.sparce matrices now are stored as h5py.Group with containing the datasets data, indices, indptr and shape dictionary keys are now used as names for h5py.Dataset and h5py.Group objects. Only string, bytes, int, float, complex, bool and NonType keys are converted to name strings, for all other keys a key-value-pair group is created containg the key and value as its subitems. string and bytes keys which contain slashes are converted into key value pairs instead of converting slashes to backslashes. Distinction between hickle 4.0.0 string and byte keys with converted slashes is made by enclosing sting value within double quotes instead of single qoutes as donw by Python repr function or !r or %r string format specifiers. Consequently on load all string keys which are enclosed in single quotes will be subjected to slash conversion while any others will be used as ar. Other stuff: ============ Full separation between hickle core and loaders Distinct unit tests for individual loaders and hickle core Cleanup of not any more required functions and classes Simplification of recursion on dump and load through self contained loader interface.

With hickle 4.0.0 the code for dumping and loading dedicated objects like scalar values or numpy arrays was moved to dedicated loader modules. This first step of disentangling hickle core machinery from object specific included all objects and structures which were mappable to h5py.Dataset objects. This commit provides an implementaition of hickle extension proposal H4EP001 (telegraphic#135). In this proposal the extension of the loader concept introduced by hickle 4.0.0 towards generic PyContainer based and mixed loaders specified. In addition to the proposed extension this proposed implementation inludes the following extensions hickle 4.0.0 and H4EP001 H4EP001: ======== PyContainer Interface includes a filter method which allows loaders when data is loaded to adjust, suppress, or insert addtional data subitems of h5py.Group objects. In order to acomplish the temorary modification of h5py.Group and h5py.Dataset object when file is opened in read only mode the H5NodeFilterProxy class is provided. This class will store all temporary modifications while the original h5py.Group and h5py.Dataset object stay unchanged hickle 4.0.0: ============= Strings and arrays of bytes are stored as Python bytearrays and not as variable sized stirngs and bytes. The benefit is that hdf5 filters and hdf5.compression filters can be applied to Python bytearrays. The down is that data is stored as bytes of int8 datatype. This change affects native Python string scalars as well as numpy arrays containing strings. Extends pickle loader create_pickled_dataset function to support Python copy protocol as proposed by issue telegraphic#125 For this a dedicated PickledContainer is implemented to handle all objects which have been stored using Python copy protocol. numpy.masked array is now stored as h5py.Group containin a dedicated dataset for data and mask each. scipy.sparce matrices now are stored as h5py.Group with containing the datasets data, indices, indptr and shape dictionary keys are now used as names for h5py.Dataset and h5py.Group objects. Only string, bytes, int, float, complex, bool and NonType keys are converted to name strings, for all other keys a key-value-pair group is created containg the key and value as its subitems. string and bytes keys which contain slashes are converted into key value pairs instead of converting slashes to backslashes. Distinction between hickle 4.0.0 string and byte keys with converted slashes is made by enclosing sting value within double quotes instead of single qoutes as donw by Python repr function or !r or %r string format specifiers. Consequently on load all string keys which are enclosed in single quotes will be subjected to slash conversion while any others will be used as ar. h5py.Group and h5py.Dataset objects the 'base_type' rerfers to 'pickle' are on load automatically get assigned object as their py_object_type. The related 'type' attribute is ignored. h5py.Group and h5py.Dataset objects which do not expose a 'base_type' attribute are assumed to either contain pickle string or conform to copy protocol and thus get implicitly assigned 'pickle' base type. Thus on dump for all h5py.Group and h5py.Dataset objects which contain pickle strings or conform to Python copy protocol 'base_type' and 'type' attributes are ommited as their values are 'pickle' and object respective. Other stuff: ============ Full separation between hickle core and loaders Distinct unit tests for individual loaders and hickle core Cleanup of not any more required functions and classes Simplification of recursion on dump and load through self contained loader interface.

With hickle 4.0.0 the code for dumping and loading dedicated objects like scalar values or numpy arrays was moved to dedicated loader modules. This first step of disentangling hickle core machinery from object specific included all objects and structures which were mappable to h5py.Dataset objects. This commit provides an implementaition of hickle extension proposal H4EP001 (telegraphic#135). In this proposal the extension of the loader concept introduced by hickle 4.0.0 towards generic PyContainer based and mixed loaders specified. In addition to the proposed extension this proposed implementation inludes the following extensions hickle 4.0.0 and H4EP001 H4EP001: ======== PyContainer Interface includes a filter method which allows loaders when data is loaded to adjust, suppress, or insert addtional data subitems of h5py.Group objects. In order to acomplish the temorary modification of h5py.Group and h5py.Dataset object when file is opened in read only mode the H5NodeFilterProxy class is provided. This class will store all temporary modifications while the original h5py.Group and h5py.Dataset object stay unchanged hickle 4.0.0 / 4.0.1: ===================== Strings and arrays of bytes are stored as Python bytearrays and not as variable sized stirngs and bytes. The benefit is that hdf5 filters and hdf5.compression filters can be applied to Python bytearrays. The down is that data is stored as bytes of int8 datatype. This change affects native Python string scalars as well as numpy arrays containing strings. Extends pickle loader create_pickled_dataset function to support Python copy protocol as proposed by issue telegraphic#125 For this a dedicated PickledContainer is implemented to handle all objects which have been stored using Python copy protocol. numpy.masked array is now stored as h5py.Group containin a dedicated dataset for data and mask each. scipy.sparce matrices now are stored as h5py.Group with containing the datasets data, indices, indptr and shape dictionary keys are now used as names for h5py.Dataset and h5py.Group objects. Only string, bytes, int, float, complex, bool and NonType keys are converted to name strings, for all other keys a key-value-pair group is created containg the key and value as its subitems. string and bytes keys which contain slashes are converted into key value pairs instead of converting slashes to backslashes. Distinction between hickle 4.0.0 string and byte keys with converted slashes is made by enclosing sting value within double quotes instead of single qoutes as donw by Python repr function or !r or %r string format specifiers. Consequently on load all string keys which are enclosed in single quotes will be subjected to slash conversion while any others will be used as ar. h5py.Group and h5py.Dataset objects the 'base_type' rerfers to 'pickle' are on load automatically get assigned object as their py_object_type. The related 'type' attribute is ignored. h5py.Group and h5py.Dataset objects which do not expose a 'base_type' attribute are assumed to either contain pickle string or conform to copy protocol and thus get implicitly assigned 'pickle' base type. Thus on dump for all h5py.Group and h5py.Dataset objects which contain pickle strings or conform to Python copy protocol 'base_type' and 'type' attributes are ommited as their values are 'pickle' and object respective. Other stuff: ============ Full separation between hickle core and loaders Distinct unit tests for individual loaders and hickle core Cleanup of not any more required functions and classes Simplification of recursion on dump and load through self contained loader interface. is capbable to load hickle 4.0.0, 4.0.1 files which do not yet support PyContainer concept beyond list, tuple, dict and set

With hickle 4.0.0 the code for dumping and loading dedicated objects like scalar values or numpy arrays was moved to dedicated loader modules. This first step of disentangling hickle core machinery from object specific included all objects and structures which were mappable to h5py.Dataset objects. This commit provides an implementaition of hickle extension proposal H4EP001 (telegraphic#135). In this proposal the extension of the loader concept introduced by hickle 4.0.0 towards generic PyContainer based and mixed loaders specified. In addition to the proposed extension this proposed implementation inludes the following extensions hickle 4.0.0 and H4EP001 H4EP001: ======== PyContainer Interface includes a filter method which allows loaders when data is loaded to adjust, suppress, or insert addtional data subitems of h5py.Group objects. In order to acomplish the temorary modification of h5py.Group and h5py.Dataset object when file is opened in read only mode the H5NodeFilterProxy class is provided. This class will store all temporary modifications while the original h5py.Group and h5py.Dataset object stay unchanged hickle 4.0.0 / 4.0.1: ===================== Strings and arrays of bytes are stored as Python bytearrays and not as variable sized stirngs and bytes. The benefit is that hdf5 filters and hdf5.compression filters can be applied to Python bytearrays. The down is that data is stored as bytes of int8 datatype. This change affects native Python string scalars as well as numpy arrays containing strings. Extends pickle loader create_pickled_dataset function to support Python copy protocol as proposed by issue telegraphic#125 For this a dedicated PickledContainer is implemented to handle all objects which have been stored using Python copy protocol. numpy.masked array is now stored as h5py.Group containin a dedicated dataset for data and mask each. scipy.sparce matrices now are stored as h5py.Group with containing the datasets data, indices, indptr and shape dictionary keys are now used as names for h5py.Dataset and h5py.Group objects. Only string, bytes, int, float, complex, bool and NonType keys are converted to name strings, for all other keys a key-value-pair group is created containg the key and value as its subitems. string and bytes keys which contain slashes are converted into key value pairs instead of converting slashes to backslashes. Distinction between hickle 4.0.0 string and byte keys with converted slashes is made by enclosing sting value within double quotes instead of single qoutes as donw by Python repr function or !r or %r string format specifiers. Consequently on load all string keys which are enclosed in single quotes will be subjected to slash conversion while any others will be used as ar. h5py.Group and h5py.Dataset objects the 'base_type' rerfers to 'pickle' are on load automatically get assigned object as their py_object_type. The related 'type' attribute is ignored. h5py.Group and h5py.Dataset objects which do not expose a 'base_type' attribute are assumed to either contain pickle string or conform to copy protocol and thus get implicitly assigned 'pickle' base type. Thus on dump for all h5py.Group and h5py.Dataset objects which contain pickle strings or conform to Python copy protocol 'base_type' and 'type' attributes are ommited as their values are 'pickle' and object respective. Other stuff: ============ Full separation between hickle core and loaders Distinct unit tests for individual loaders and hickle core Cleanup of not any more required functions and classes Simplification of recursion on dump and load through self contained loader interface. is capbable to load hickle 4.0.x files which do not yet support PyContainer concept beyond list, tuple, dict and set includes extended test of loading hickel 4.0.x files contains fix for labda py_obj_type issue on numpy arrays with single non list/tuple object content. Python 3.8 refuses to unpickle lambda function string. Was observerd during finalizing pullrequest. Fixes are only activated when 4.0.x file is to be loaded Exceptoin thrown by load now includes exception triggering it including stacktrace for better localization of error in debuggin and error reporting.

With hickle 4.0.0 the code for dumping and loading dedicated objects like scalar values or numpy arrays was moved to dedicated loader modules. This first step of disentangling hickle core machinery from object specific included all objects and structures which were mappable to h5py.Dataset objects. This commit provides an implementaition of hickle extension proposal H4EP001 (telegraphic#135). In this proposal the extension of the loader concept introduced by hickle 4.0.0 towards generic PyContainer based and mixed loaders specified. In addition to the proposed extension this proposed implementation inludes the following extensions hickle 4.0.0 and H4EP001 H4EP001: ======== PyContainer Interface includes a filter method which allows loaders when data is loaded to adjust, suppress, or insert addtional data subitems of h5py.Group objects. In order to acomplish the temorary modification of h5py.Group and h5py.Dataset object when file is opened in read only mode the H5NodeFilterProxy class is provided. This class will store all temporary modifications while the original h5py.Group and h5py.Dataset object stay unchanged hickle 4.0.0 / 4.0.1: ===================== Strings and arrays of bytes are stored as Python bytearrays and not as variable sized stirngs and bytes. The benefit is that hdf5 filters and hdf5.compression filters can be applied to Python bytearrays. The down is that data is stored as bytes of int8 datatype. This change affects native Python string scalars as well as numpy arrays containing strings. numpy.masked array is now stored as h5py.Group containin a dedicated dataset for data and mask each. scipy.sparce matrices now are stored as h5py.Group with containing the datasets data, indices, indptr and shape dictionary keys are now used as names for h5py.Dataset and h5py.Group objects. Only string, bytes, int, float, complex, bool and NonType keys are converted to name strings, for all other keys a key-value-pair group is created containg the key and value as its subitems. string and bytes keys which contain slashes are converted into key value pairs instead of converting slashes to backslashes. Distinction between hickle 4.0.0 string and byte keys with converted slashes is made by enclosing sting value within double quotes instead of single qoutes as donw by Python repr function or !r or %r string format specifiers. Consequently on load all string keys which are enclosed in single quotes will be subjected to slash conversion while any others will be used as ar. h5py.Group and h5py.Dataset objects the 'base_type' rerfers to 'pickle' are on load automatically get assigned object as their py_object_type. The related 'type' attribute is ignored. h5py.Dataset objects which do not expose a 'base_type' attribute are assumed to contain pickle string and thus get implicitly assigned 'pickle' base type. Thus on dump for all h5py.Dataset objects which contain pickle strings 'base_type' and 'type' attributes are ommited as their values are 'pickle' and object respective. Other stuff: ============ Full separation between hickle core and loaders Distinct unit tests for individual loaders and hickle core Cleanup of not any more required functions and classes Simplification of recursion on dump and load through self contained loader interface. is capbable to load hickle 4.0.x files which do not yet support PyContainer concept beyond list, tuple, dict and set includes extended test of loading hickel 4.0.x files contains fix for labda py_obj_type issue on numpy arrays with single non list/tuple object content. Python 3.8 refuses to unpickle lambda function string. Was observerd during finalizing pullrequest. Fixes are only activated when 4.0.x file is to be loaded Exceptoin thrown by load now includes exception triggering it including stacktrace for better localization of error in debuggin and error reporting.

With hickle 4.0.0 the code for dumping and loading dedicated objects like scalar values or numpy arrays was moved to dedicated loader modules. This first step of disentangling hickle core machinery from object specific included all objects and structures which were mappable to h5py.Dataset objects. This commit provides an implementaition of hickle extension proposal H4EP001 (telegraphic#135). In this proposal the extension of the loader concept introduced by hickle 4.0.0 towards generic PyContainer based and mixed loaders specified. In addition to the proposed extension this proposed implementation inludes the following extensions hickle 4.0.0 and H4EP001 H4EP001: ======== PyContainer Interface includes a filter method which allows loaders when data is loaded to adjust, suppress, or insert addtional data subitems of h5py.Group objects. In order to acomplish the temorary modification of h5py.Group and h5py.Dataset object when file is opened in read only mode the H5NodeFilterProxy class is provided. This class will store all temporary modifications while the original h5py.Group and h5py.Dataset object stay unchanged hickle 4.0.0 / 4.0.1: ===================== Strings and arrays of bytes are stored as Python bytearrays and not as variable sized stirngs and bytes. The benefit is that hdf5 filters and hdf5.compression filters can be applied to Python bytearrays. The down is that data is stored as bytes of int8 datatype. This change affects native Python string scalars as well as numpy arrays containing strings. numpy.masked array is now stored as h5py.Group containin a dedicated dataset for data and mask each. scipy.sparce matrices now are stored as h5py.Group with containing the datasets data, indices, indptr and shape dictionary keys are now used as names for h5py.Dataset and h5py.Group objects. Only string, bytes, int, float, complex, bool and NonType keys are converted to name strings, for all other keys a key-value-pair group is created containg the key and value as its subitems. string and bytes keys which contain slashes are converted into key value pairs instead of converting slashes to backslashes. Distinction between hickle 4.0.0 string and byte keys with converted slashes is made by enclosing sting value within double quotes instead of single qoutes as donw by Python repr function or !r or %r string format specifiers. Consequently on load all string keys which are enclosed in single quotes will be subjected to slash conversion while any others will be used as ar. h5py.Group and h5py.Dataset objects the 'base_type' rerfers to 'pickle' are on load automatically get assigned object as their py_object_type. The related 'type' attribute is ignored. h5py.Dataset objects which do not expose a 'base_type' attribute are assumed to contain pickle string and thus get implicitly assigned 'pickle' base type. Thus on dump for all h5py.Dataset objects which contain pickle strings 'base_type' and 'type' attributes are ommited as their values are 'pickle' and object respective. Other stuff: ============ Full separation between hickle core and loaders Distinct unit tests for individual loaders and hickle core Cleanup of not any more required functions and classes Simplification of recursion on dump and load through self contained loader interface. is capbable to load hickle 4.0.x files which do not yet support PyContainer concept beyond list, tuple, dict and set includes extended test of loading hickel 4.0.x files contains fix for labda py_obj_type issue on numpy arrays with single non list/tuple object content. Python 3.8 refuses to unpickle lambda function string. Was observerd during finalizing pullrequest. Fixes are only activated when 4.0.x file is to be loaded Exceptoin thrown by load now includes exception triggering it including stacktrace for better localization of error in debuggin and error reporting. h5py version limited to <3.x according to issue telegraphic#143