Releases: p2p-ld/numpydantic

v1.6.4 - Combinatoric testing and public test helpers!

11 Oct 09:33
66ab444

PR: #31

We have rewritten our testing system for more rigorous tests:
where before we were limited to testing dtype and shape cases one at a time,
now we can test all possible combinations together!

This allows us to have better guarantees for behavior that all interfaces
should support, validating it against all possible dtypes and shapes.

We also exposed all the helpers and array testing classes for downstream development,
so that it's easier to test and validate third-party interfaces
that haven't made their way into mainline numpydantic yet -
see the numpydantic.testing module.

See the testing documentation for more details.
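
To give a rough sense of the combinatoric approach (a minimal sketch only -
the real helpers live in numpydantic.testing, and the names below are hypothetical),
stacked pytest parametrization produces the full cartesian product of dtype and shape cases:

import numpy as np
import pytest
from typing import Any
from pydantic import BaseModel
from numpydantic import NDArray

# hypothetical case lists for illustration
DTYPE_CASES = [int, float]
SHAPE_CASES = [(3,), (2, 2)]

# stacked parametrize decorators yield every (shape, dtype) combination,
# rather than testing one axis at a time
@pytest.mark.parametrize("dtype", DTYPE_CASES)
@pytest.mark.parametrize("shape", SHAPE_CASES)
def test_validation(dtype, shape):
    class Model(BaseModel):
        array: NDArray[Any, dtype]

    Model(array=np.zeros(shape, dtype=dtype))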

Bugfix

  • Previously, numpy and dask arrays with a model dtype would fail json roundtripping
    because they wouldn't be correctly cast back to the model type. Now they are.
  • Zarr would not dump the dtype of an array when it roundtripped to json,
    causing every array to be interpreted as an arbitrary integer or float type.
    dtype is now dumped and used when deserializing.
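
A hedged sketch of the model-dtype roundtrip the first fix restores
(assuming a model dtype declared as below; details may differ):

import numpy as np
from typing import Any
from pydantic import BaseModel
from numpydantic import NDArray

class Point(BaseModel):
    x: int

class Container(BaseModel):
    array: NDArray[Any, Point]

instance = Container(array=np.array([Point(x=1)], dtype=object))
dumped = instance.model_dump_json(round_trip=True)

# elements previously came back as plain dicts;
# they are now cast back to Point instances
restored = Container.model_validate_json(dumped)
assert isinstance(restored.array[0], Point)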

v1.6.3 - Reinstate `h5py>=3.12`

27 Sep 03:13
69dbe39

Bugfix
PR: #28

  • h5py v3.12.0 was actually fine, but we did need to change the hdf5 tests
    to not hold the file open during the test - an easy enough change.
    The version cap has been removed from h5py (which is an optional dependency anyway,
    so any version could be installed separately).

v1.6.2

26 Sep 00:48
b7f7140

Very minor bugfix and CI release

PR: #26

Bugfix

  • h5py v3.12.0 broke file locking, so a temporary maximum version cap was added
    until that is resolved. See h5py/h5py#2506
    and #27
  • The _relativize_paths function used in roundtrip dumping was incorrectly
    relativizing paths that are intended to refer to locations within a dataset
    rather than files on disk. This, as well as some Windows-specific bugs,
    was fixed by excluding directories that exist but are directly below the
    filesystem root (like /data) - see the sketch after this list. If this
    becomes a problem, we will have to make the relativization system a bit
    more robust by specifically enumerating which path-like things are not
    intended to be paths.
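
A minimal sketch of the exclusion heuristic described above, using a
hypothetical helper name (_should_relativize is not the actual implementation):

from pathlib import Path

def _should_relativize(value: str) -> bool:
    """Skip absolute paths directly below the filesystem root (like /data),
    which typically name locations within a dataset rather than files on disk."""
    p = Path(value)
    # Path('/data').parts == ('/', 'data'), so such paths have exactly 2 parts
    return not (p.is_absolute() and len(p.parts) == 2)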

CI

  • numpydantic was added as an array range generator in linkml
    (linkml/linkml#2178),
    so tests were added to ensure that changes to numpydantic don't break
    linkml array range generation. numpydantic's tests are naturally a
    superset of the behavior tested in linkml, but this is a good
    paranoia check in case we drift substantially (which shouldn't happen).

v1.6.1 - Union Types

24 Sep 07:28
1cf69eb

It's now possible to do this, like it always should have been:

from typing import Any
from pydantic import BaseModel
from numpydantic import NDArray

class MyModel(BaseModel):
    array: NDArray[Any, int | float]

Features

  • Support for Union Dtypes

Structure

  • New validation module containing shape and dtype convenience methods,
    to declutter the main namespace and group related code
  • Rename all serialized arrays within a container dict to value, so they can
    be identified by convention without long iteration - see perf below.

Perf

  • Avoid iterating over every item in an array trying to convert it to a path,
    a several-orders-of-magnitude perf improvement over 1.6.0 (oops)

Docs

  • Page for dtypes - mostly stubs at the moment, but more explicit documentation
    about what kinds of dtypes we support.

v1.6.0 - Roundtrip Json Serialization

24 Sep 01:22
bd5b937

(as always, please see the changelog in the docs for working links and full information): https://numpydantic.readthedocs.io/en/latest/changelog.html#roundtrip-json-serialization

Roundtrip JSON serialization is here - with serialization to list of lists,
as well as file references that don't require copying the whole array if
used in data modeling, control over path relativization, and stamping of
interface version for the extra provenance conscious.

Please see serialization for narrative documentation :)

Potentially Breaking Changes

  • See development for a statement about API stability
  • An additional {meth}`.Interface.deserialize` method has been added to
    {meth}`.Interface.validate` - downstream users are not intended to override the
    validate method, but if they have, JSON deserialization will not work for them.
  • Interface subclasses now require a name attribute, a short string identifier for that interface,
    and a json_model that inherits from {class}`.interface.JsonDict`. Interfaces without
    these attributes cannot be instantiated.
  • {meth}`.Interface.to_json` is now an abstract method that all interfaces must define.

Features

  • Roundtrip JSON serialization - by default dump to list-of-lists arrays, but
    support the round_trip keyword in model_dump_json for provenance-preserving dumps
    (see the sketch after this list)
  • JSON Schema generation has been separated from core_schema generation in {class}`.NDArray`.
    Downstream interfaces can customize JSON schema generation without compromising the ability to validate.
  • All proxy classes must have an __eq__ dunder method to compare equality -
    in proxy classes, these compare equality of arguments, since the arrays that
    are referenced on disk should be equal by definition. Direct array comparison
    should use {func}`numpy.array_equal`
  • Interfaces previously couldn't be instantiated without explicit shape and dtype arguments;
    these have been given Any defaults.
  • New {mod}`numpydantic.serialization` module to contain serialization logic.
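
A minimal sketch of the default and round-trip dumps, assuming a plain
numpy-backed model (the exact round-trip payload is interface-specific):

import numpy as np
from typing import Any
from pydantic import BaseModel
from numpydantic import NDArray

class MyModel(BaseModel):
    array: NDArray[Any, int]

instance = MyModel(array=np.array([[1, 2], [3, 4]]))

# default dump: a plain list-of-lists
plain = instance.model_dump_json()

# round_trip=True preserves interface metadata so the dump can be
# validated back into an equivalent model
dumped = instance.model_dump_json(round_trip=True)
restored = MyModel.model_validate_json(dumped)
assert np.array_equal(restored.array, instance.array)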

New Classes
See the docstrings for descriptions of each class

  • MarkMismatchError for when an array serialized with mark_interface doesn't match
    the interface that's deserializing it
  • {class}`.interface.InterfaceMark`
  • {class}`.interface.MarkedJson`
  • {class}`.interface.JsonDict`
    • {class}`.dask.DaskJsonDict`
    • {class}`.hdf5.H5JsonDict`
    • {class}`.numpy.NumpyJsonDict`
    • {class}`.video.VideoJsonDict`
    • {class}`.zarr.ZarrJsonDict`

Bugfix

  • #17 - Arrays are re-validated as lists, rather than arrays
  • Some proxy classes would fail to be serialized because they lacked an __array__ method.
    __array__ methods have been added, along with tests for coercing to an array to prevent regression.
  • Some proxy classes lacked a __name__ attribute, which caused failures to serialize
    when the __getattr__ methods attempted to pass it through. These have been added where needed.

Docs

  • Add statement about versioning and API stability to development
  • Add docs for serialization!
  • Remove stranded docs from hooks and monkeypatch
  • Added myst_nb to docs dependencies for direct rendering of code and output

Tests

  • Marks have been added for running subsets of the tests for a given interface,
    package feature, etc.
  • Tests for all the above functionality

v1.5.3 - [bugfix] Validation with empty HDF5 datasets

04 Sep 00:51
99a6571

#16: Empty HDF5 datasets shouldn't break validation
if the NDArray spec allows Any shaped arrays.
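
A hedged sketch of the case this covers, using h5py's h5py.Empty idiom
for null datasets:

import h5py
from pydantic import BaseModel
from numpydantic import NDArray

class MyModel(BaseModel):
    array: NDArray  # bare NDArray: any shape, any dtype

with h5py.File('empty.h5', 'w') as f:
    # a dataset with a null dataspace - no shape, no data
    f.create_dataset('data', data=h5py.Empty('f8'))

# validation should succeed rather than erroring on the empty dataset
instance = MyModel(array=('empty.h5', '/data'))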

v1.5.2 - `datetime` support for HDF5

04 Sep 00:09
2c625e4

#15

HDF5 can't support datetimes natively, but we can fake it with fixed-length 32-byte strings.

This PR allows one to specify a datetime dtype, and encodes datetime objects as strings on storage, and decodes them on access.

We're getting to the point where we need to start making a generalized type conversion/serialization system, because this interface in particular is getting gnarly, but we don't have time for that just yet.

import h5py
from datetime import datetime
import numpy as np
from numpydantic import NDArray
from pydantic import BaseModel
from typing import Any

# datetimes are stored as fixed-length 32-byte ISO 8601 strings
data = np.array([datetime.now().isoformat().encode('utf-8')], dtype="S32")
h5f = h5py.File('test.hdf5', 'w')
h5f.create_dataset('data', data=data)

class MyModel(BaseModel):
    array: NDArray[Any, datetime]

instance = MyModel(array=('test.hdf5', '/data'))
# strings are decoded to datetimes on access...
instance.array[0]
# np.datetime64('2024-09-03T23:50:45.897980')
# ...and datetimes are encoded back to strings on write
instance.array[0] = datetime.now()

v1.5.1 - [bugfix] Allow revalidation with proxied arrays

03 Sep 20:23
c46015d

See: #14

When a proxy object is passed to some validators after having already been validated, validation fails.

This should always succeed:

import numpy as np
from numpydantic import NDArray
from pydantic import BaseModel

class MyModel(BaseModel):
    array: NDArray

# any input a bare NDArray accepts, e.g. a plain numpy array
valid_input = np.zeros((2, 2))

instance = MyModel(array=valid_input)
_ = MyModel(array=instance.array)

but it's currently failing for the proxied interfaces.

This PR

  • adds passthrough checks for h5proxy and videoproxy
  • adds a testing module for tests against all interfaces, and tests that an already-instantiated model can be re-instantiated using the same array field after passing through the interface

v1.5.0 - String support for HDF5

03 Sep 06:01
2ed0be8

Strings in hdf5 are tricky! HDF5 doesn't have native support for unicode,
but it can be persuaded to store data in ASCII or virtualized utf-8 under somewhat obscure conditions.

This PR uses h5py's string methods to expose string datasets (compound or not)
via the h5proxy with the asstr() view method.
This also allows us to set strings with normal python strings,
although hdf5 datasets can only be created with bytes or other non-unicode encodings.

Since numpydantic isn't necessarily a tool for creating hdf5 files
(nobody should be doing that), but rather an interface to them,
tests are included for reading and validating (unskipping the existing string tests)
as well as for setting and getting.

import h5py
import numpy as np
from pydantic import BaseModel
from numpydantic import NDArray
from typing import Any

class MyModel(BaseModel):
    array: NDArray[Any, str]

h5f = h5py.File('my_data.h5', 'w')
# datasets can only be created with bytes or other non-unicode encodings
data = np.random.random((10, 10)).astype(bytes)
_ = h5f.create_dataset('/dataset', data=data)

instance = MyModel(array=('my_data.h5', '/dataset'))
# ...but they can be set and read back as normal python strings
instance.array[0, 0] = 'hey'
assert instance.array[0, 0] == 'hey'

v1.4.1 - Support `len()` and test dunder methods for all interfaces

03 Sep 01:25
f699d2a

It's pretty natural to want to do len(array) as a shorthand for array.shape[0],
but since some of the numpydantic classes are passthrough proxy objects,
they don't implement all the dunder methods of the classes they wrap
(though they should attempt to via __getattr__).
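
The catch is that implicit dunder lookups like len() are resolved on the type
and bypass __getattr__, so proxies must define __len__ explicitly.
A minimal sketch (a stand-in class, not the actual proxy implementations):

class SomeProxy:
    """Stand-in for the proxy classes that were missing __len__."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, item):
        # passthrough for normal attribute access
        return getattr(self._wrapped, item)

    def __len__(self) -> int:
        # len(proxy) as a shorthand for proxy.shape[0];
        # len() bypasses __getattr__, so this must be defined explicitly
        return self.shape[0]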

This PR adds __len__ to the two interfaces that were missing it,
adds fixtures, and makes a testing module specifically for testing dunder methods
that should hold across all interfaces.
Previously we had fixtures that test a set of dtype and shape cases for each interface,
but we had no way of asserting that something should be true for all interfaces.
There is a certain combinatoric explosion when we start testing across all interfaces,
for all input types, for all dtype and all shape cases,
but for now numpydantic is fast enough that this doesn't matter <3.