Merge pull request #24 from p2p-ld/dtype-union
[dtype] Support Unions
sneakers-the-rat authored Sep 24, 2024
2 parents 7f2c79b + 2e7031c commit 1cf69eb
Showing 23 changed files with 443 additions and 120 deletions.
4 changes: 3 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -159,4 +159,6 @@ cython_debug/
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
.pdm-python
ndarray.pyi
ndarray.pyi

prof/
7 changes: 7 additions & 0 deletions docs/api/validation/dtype.md
@@ -0,0 +1,7 @@
# dtype

```{eval-rst}
.. automodule:: numpydantic.validation.dtype
:members:
:undoc-members:
```
6 changes: 6 additions & 0 deletions docs/api/validation/index.md
@@ -0,0 +1,6 @@
# validation

```{toctree}
dtype
shape
```
2 changes: 1 addition & 1 deletion docs/api/shape.md → docs/api/validation/shape.md
@@ -1,7 +1,7 @@
# shape

```{eval-rst}
.. automodule:: numpydantic.shape
.. automodule:: numpydantic.validation.shape
:members:
:undoc-members:
```
27 changes: 27 additions & 0 deletions docs/changelog.md
@@ -4,6 +4,33 @@

### 1.6.*

#### 1.6.1 - 24-09-23 - Support Union Dtypes

It's now possible to do this, like it always should have been:

```python
class MyModel(BaseModel):
    array: NDArray[Any, int | float]
```

**Features**
- Support for Union Dtypes

**Structure**
- New `validation` module containing `shape` and `dtype` convenience methods
to declutter main namespace and make a grouping for related code
- Rename all serialized arrays within a container dict to `value` to be able
to identify them by convention and avoid long iteration - see perf below.

**Perf**
- Avoid iterating over every item in an array trying to convert it to a path, for
  a perf improvement of several orders of magnitude over `1.6.0` (oops)
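
A minimal sketch of why the `value` convention helps, assuming a hypothetical `relativize_paths` helper (not the library's actual implementation) that walks a dumped dict and rewrites absolute paths. Because array data always lives under `value`, the walk can skip that key outright instead of visiting every element:

```python
from pathlib import PurePosixPath
from typing import Any


def relativize_paths(node: Any, base: PurePosixPath) -> Any:
    """Hypothetical helper: rewrite absolute paths under ``base`` as relative ones."""
    if isinstance(node, dict):
        # Array data is stored under ``value`` by convention, so pass it
        # through untouched rather than descending into it.
        return {
            key: item if key == "value" else relativize_paths(item, base)
            for key, item in node.items()
        }
    if isinstance(node, list):
        # Any other list is assumed to be metadata and is still walked.
        return [relativize_paths(item, base) for item in node]
    if isinstance(node, str):
        path = PurePosixPath(node)
        if path.is_absolute():
            try:
                return str(path.relative_to(base))
            except ValueError:
                # Not located under ``base``: leave the string as it is.
                return node
    return node


dumped = {
    "type": "zarr",
    "file": "/data/arrays/a.zarr",
    "value": [[1, 2], [3, 4]],
}
relative = relativize_paths(dumped, PurePosixPath("/data/arrays"))
```

Here `relative["file"]` becomes a path relative to the base directory, while the nested list under `value` is returned as the same object without being iterated.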

**Docs**
- Page for `dtypes`, mostly stubs at the moment, but more explicit documentation
about what kind of dtypes we support.


#### 1.6.0 - 24-09-23 - Roundtrip JSON Serialization

Roundtrip JSON serialization is here - with serialization to list of lists,
98 changes: 98 additions & 0 deletions docs/dtype.md
@@ -0,0 +1,98 @@
# dtype

```{todo}
This section is under construction as of 1.6.1.
Many of the details of dtypes are covered in [syntax](./syntax.md)
and in {mod}`numpydantic.dtype`, but this section will specifically
address how dtypes are handled both generically and by interfaces
as we expand custom dtype handling <3.
For details of support and implementation until the docs have time for some love,
please see the tests, which are the source of truth for the functionality
of the library for now and forever.
```

Recall the general syntax:

```
NDArray[Shape, dtype]
```

These are the docs for what you can do in `dtype`.

## Scalar Dtypes

Python builtin types and numpy types should be handled transparently,
with some exceptions for complex numbers and objects (described below).

### Numbers

#### Complex numbers

```{todo}
Document limitations for complex numbers and strategies for serialization/validation
```

### Datetimes

```{todo}
Datetimes are supported by every interface except {class}`.VideoInterface`,
with the caveat that HDF5 loses timezone information, and thus all timestamps should
be re-encoded to UTC before saving/loading.
More generic datetime support is TODO.
```
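
Until that lands, a minimal sketch of the UTC re-encoding described above, using only the standard library. `to_utc` is a hypothetical helper, not part of numpydantic, and it assumes timestamps are timezone-aware `datetime` objects:

```python
from datetime import datetime, timedelta, timezone


def to_utc(stamp: datetime) -> datetime:
    """Hypothetical helper: re-encode a timezone-aware timestamp as UTC."""
    if stamp.tzinfo is None:
        # A naive timestamp carries no offset, so there is nothing to convert from.
        raise ValueError("timestamp must be timezone-aware")
    return stamp.astimezone(timezone.utc)


# Noon at UTC-7 is the same instant as 19:00 UTC.
local = datetime(2024, 9, 23, 12, 0, tzinfo=timezone(timedelta(hours=-7)))
utc_stamp = to_utc(local)
```

Converting before saving means the stored values stay comparable even though the timezone itself is not preserved.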

### Objects

```{todo}
Generic objects are supported by all interfaces except
{class}`.VideoInterface`, {class}`.HDF5Interface`, and {class}`.ZarrInterface`.
This might be expected, but there is also hope. TODO: fill in serialization plans.
```

### Strings

```{todo}
Strings are supported by all interfaces except {class}`.VideoInterface`.
TODO: fill in the subtleties of how this works.
```

## Generic Dtypes

```{todo}
For now these are handled as tuples of dtypes, see the source of
{ref}`numpydantic.dtype.Float`. They should either be handled as Unions
or as a more prescribed meta-type.
For now, use `int` and `float` to refer to the general concepts of
"any int" or "any float" even if this is a bit mismatched from the numpy usage.
```

## Extended Python Typing Universe

### Union Types

Union types can be used as expected.

Union types are tested recursively -- if any item within a ``Union`` matches
the expected dtype at a given level of recursion, the dtype test passes.

```python
class MyModel(BaseModel):
    array: NDArray[Any, int | float]
```
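
As a rough illustration of the recursive test (a sketch only, not the code in {mod}`numpydantic.validation.dtype`; `is_union` and `matches_dtype` are hypothetical names), a union target passes as soon as any of its members matches:

```python
import types
from typing import Any, Union, get_args, get_origin


def is_union(target: Any) -> bool:
    """Detect both ``Union[int, float]`` and, on Python 3.10+, ``int | float``."""
    origin = get_origin(target)
    if origin is Union:
        return True
    # ``types.UnionType`` only exists on Python 3.10+.
    union_type = getattr(types, "UnionType", None)
    return union_type is not None and origin is union_type


def matches_dtype(dtype: Any, target: Any) -> bool:
    """Hypothetical check: does ``dtype`` satisfy the ``target`` annotation?"""
    if target is Any:
        return True
    if is_union(target):
        # Recurse into each member; one match is enough.
        return any(matches_dtype(dtype, member) for member in get_args(target))
    if isinstance(target, tuple):
        # Tuples of dtypes are treated like unions in this sketch.
        return any(matches_dtype(dtype, member) for member in target)
    try:
        return issubclass(dtype, target)
    except TypeError:
        # Raised when either side is not a class; fall back to equality.
        return dtype == target


float_ok = matches_dtype(float, Union[int, float])
str_ok = matches_dtype(str, Union[int, float])
```

With these definitions `float_ok` is true and `str_ok` is false. A tuple that contains a union, such as `(str, Union[int, float])`, is handled by the same recursion.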

## Compound Dtypes

```{todo}
Compound dtypes are currently unsupported,
though the HDF5 interface supports indexing into compound dtypes
as separable dimensions/arrays using the third "field" parameter in
{class}`.hdf5.H5ArrayPath` .
```


3 changes: 2 additions & 1 deletion docs/index.md
@@ -473,6 +473,7 @@ dumped = instance.model_dump_json(context={'zarr_dump_array': True})
design
syntax
dtype
serialization
interfaces
```
@@ -484,13 +485,13 @@ interfaces
api/index
api/interface/index
api/validation/index
api/dtype
api/ndarray
api/maps
api/meta
api/schema
api/serialization
api/shape
api/types
```
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "numpydantic"
version = "1.6.0"
version = "1.6.1"
description = "Type and shape validation and serialization for arbitrary array types in pydantic models"
authors = [
{name = "sneakers-the-rat", email = "[email protected]"},
@@ -126,7 +126,7 @@ markers = [
]

[tool.ruff]
target-version = "py311"
target-version = "py39"
include = ["src/numpydantic/**/*.py", "pyproject.toml"]
exclude = ["tests"]

2 changes: 1 addition & 1 deletion src/numpydantic/__init__.py
@@ -4,7 +4,7 @@

from numpydantic.ndarray import NDArray
from numpydantic.meta import update_ndarray_stub
from numpydantic.shape import Shape
from numpydantic.validation.shape import Shape

update_ndarray_stub()

3 changes: 3 additions & 0 deletions src/numpydantic/dtype.py
@@ -12,6 +12,9 @@
Some types like `Integer` are compound types - tuples of multiple dtypes.
Check these using ``in`` rather than ``==``. This interface will develop in future
versions to allow a single dtype check.
For internal helper functions for validating dtype,
see :mod:`numpydantic.validation.dtype`
"""

import sys
6 changes: 3 additions & 3 deletions src/numpydantic/interface/dask.py
@@ -33,11 +33,11 @@ class DaskJsonDict(JsonDict):
name: str
chunks: Iterable[tuple[int, ...]]
dtype: str
array: list
value: list

def to_array_input(self) -> DaskArray:
"""Construct a dask array"""
np_array = np.array(self.array, dtype=self.dtype)
np_array = np.array(self.value, dtype=self.dtype)
array = from_array(
np_array,
name=self.name,
@@ -100,7 +100,7 @@ def to_json(
if info.round_trip:
as_json = DaskJsonDict(
type=cls.name,
array=as_json,
value=as_json,
name=array.name,
chunks=array.chunks,
dtype=str(np_array.dtype),
39 changes: 18 additions & 21 deletions src/numpydantic/interface/interface.py
@@ -20,8 +20,8 @@
ShapeError,
TooManyMatchesError,
)
from numpydantic.shape import check_shape
from numpydantic.types import DtypeType, NDArrayType, ShapeType
from numpydantic.validation import validate_dtype, validate_shape

T = TypeVar("T", bound=NDArrayType)
U = TypeVar("U", bound="JsonDict")
@@ -76,6 +76,21 @@ def match_by_name(self) -> Optional[Type["Interface"]]:
class JsonDict(BaseModel):
"""
Representation of array when dumped with round_trip == True.
.. admonition:: Developer's Note
In any JsonDict that contains an actual array, the field holding it should be
named ``value`` rather than ``array`` (or any other name), and nothing but the
array data should be named ``value`` .
During JSON serialization, it becomes ambiguous what contains an array
of data vs. an array of metadata. For the moment we would like to
reserve the ability to have lists of metadata, so until we rule that out,
we would like to be able to avoid iterating over every element of an array
in any context parameter transformation like relativizing/absolutizing paths.
To avoid that, it's good to agree on a single value name -- ``value`` --
and avoid using it for anything else.
"""

type: str
@@ -274,25 +289,7 @@ def validate_dtype(self, dtype: DtypeType) -> bool:
Validate the dtype of the given array, returning
``True`` if valid, ``False`` if not.
"""
if self.dtype is Any:
return True

if isinstance(self.dtype, tuple):
valid = dtype in self.dtype
elif self.dtype is np.str_:
valid = getattr(dtype, "type", None) in (np.str_, str) or dtype in (
np.str_,
str,
)
else:
# try to match as any subclass, if self.dtype is a class
try:
valid = issubclass(dtype, self.dtype)
except TypeError:
# expected, if dtype or self.dtype is not a class
valid = dtype == self.dtype

return valid
return validate_dtype(dtype, self.dtype)

def raise_for_dtype(self, valid: bool, dtype: DtypeType) -> None:
"""
@@ -326,7 +323,7 @@ def validate_shape(self, shape: Tuple[int, ...]) -> bool:
if self.shape is Any:
return True

return check_shape(shape, self.shape)
return validate_shape(shape, self.shape)

def raise_for_shape(self, valid: bool, shape: Tuple[int, ...]) -> None:
"""
6 changes: 3 additions & 3 deletions src/numpydantic/interface/numpy.py
@@ -27,13 +27,13 @@ class NumpyJsonDict(JsonDict):

type: Literal["numpy"]
dtype: str
array: list
value: list

def to_array_input(self) -> ndarray:
"""
Construct a numpy array
"""
return np.array(self.array, dtype=self.dtype)
return np.array(self.value, dtype=self.dtype)


class NumpyInterface(Interface):
@@ -99,6 +99,6 @@ def to_json(

if info.round_trip:
json_array = NumpyJsonDict(
type=cls.name, dtype=str(array.dtype), array=json_array
type=cls.name, dtype=str(array.dtype), value=json_array
)
return json_array
6 changes: 3 additions & 3 deletions src/numpydantic/interface/zarr.py
@@ -63,7 +63,7 @@ class ZarrJsonDict(JsonDict):
type: Literal["zarr"]
file: Optional[str] = None
path: Optional[str] = None
array: Optional[list] = None
value: Optional[list] = None

def to_array_input(self) -> Union[ZarrArray, ZarrArrayPath]:
"""
@@ -73,7 +73,7 @@ def to_array_input(self) -> Union[ZarrArray, ZarrArrayPath]:
if self.file:
array = ZarrArrayPath(file=self.file, path=self.path)
else:
array = zarr.array(self.array)
array = zarr.array(self.value)
return array


@@ -202,7 +202,7 @@ def to_json(
as_json["info"]["hexdigest"] = array.hexdigest()

if dump_array or not is_file:
as_json["array"] = array[:].tolist()
as_json["value"] = array[:].tolist()

as_json = ZarrJsonDict(**as_json)
else: