Add support for variable length data types

Data type sizes are now represented by `DataTypeSize` instead of usize
  Also adds `ArraySize`. Both are `Fixed` only for now. Need to support `Variable` throughout the codebase.
Change codec API in prep for variable sized data types
Enable `{Array,DataType}Size::Variable`
Implement `CowArrayBytes::validate()` and add `CodecError::InvalidVariableSizedArrayOffsets`
Use `CowArrayBytes::validate()`
impl `From` for `CowArrayBytes` for various types
Array `_element` methods now use `T: Element`
Add `vlen` codec metadata
Fix codecs bench
Implement an experimental vlen codec
Use `impl Into<ArrayBytesCow<'a>>` in array methods
Use `RawBytesCow` consistently
Remove various vlen todo's
Cleanup `ArrayBytes`
Use `ArrayError::InvalidElementValue` for invalid string encodings
Add `ArraySubset::contains()`
Add `FillValue::new_empty()`
Add remaining vlen support to array `store_` methods and improve vlen validation
Add remaining vlen support to array `retrieve_` methods
Partial decoding in the vlen filter
Fix async vlen errors
Sharding codec vlen support
Add vlen support to sharding partial decoder
vlen support for sharded_readable_ext
`offsets_u64_to_usize` handle 32-bit system
Minor FillValue doc update
Remove unused ArraySubset methods and add related convenience functions
Add cities test
Add `Arrow32` vlen encoding
Add support for Interleave32 (Zarr V2) vlen encoding
fmt
clippy
Set minimum version for num-complex
Fix `ArrayBytes` from `&[u8; N]` for rust < 1.77
Add `binary` data type
Vlen improve docs and test various encodings. Fix `cities.csv` encoding.
`vlen` change encoding names
Validate `vlen` codec `length32` encoding against `zarr-python` v2
Don't store `zarrs` metadata in cities test output
Split `vlen` into `vlen` and `vlen_interleaved`
Vlen supports separate index/data encoding with full codec chains.
Fix typesize in vlen `index_codecs` metadata
Add support for `String` fill value metadata
Add `FillValueMetadata::Unsupported`
  `ArrayMetadata` can be serialised and deserialised with an unsupported `fill_value`, but `Array` creation will fail.
vlen cleanup
Change vlen codec identifiers given they are experimental
Move duplicate `extract_decoded_regions` fn into `array_bytes` + other minor changes
Minor vlen_partial_decoder cleanup
Add support for `zarr-python` nonconformant `|O` V2 data type
Support conversion of Zarr V2 arrays with `vlen-*` codecs to V3
Update root docs for new vlen related codecs/data types
Cleanup `get_vlen_bytes_and_offsets`
LDeakin · Jul 25, 2024 · 0c114bf
1 parent d54b89d
commit 0c114bf
Showing 179 changed files with 53,040 additions and 2,198 deletions.
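
The commit message refers to offset validation for variable length arrays (`CowArrayBytes::validate()`, `CodecError::InvalidVariableSizedArrayOffsets`) without stating the rule being checked. Below is a minimal sketch of what such a check could look like. It assumes a layout of one contiguous byte buffer plus `num_elements + 1` offsets, which is an assumption and not taken from this commit. `OffsetsError`, `validate_offsets`, and `element` are hypothetical names for illustration and are not the zarrs API.

```rust
/// Hypothetical error type for this sketch (not a zarrs type).
#[derive(Debug, PartialEq, Eq)]
pub enum OffsetsError {
    /// The number of offsets is not `num_elements + 1`.
    WrongCount { expected: usize, actual: usize },
    /// The offset at `index` is smaller than the one before it.
    NotMonotonic { index: usize },
    /// The final offset points past the end of the byte buffer.
    OutOfBounds { last: usize, bytes_len: usize },
}

/// Check that `offsets` describes `num_elements` slices of `bytes`.
/// Assumed layout: element `i` occupies `bytes[offsets[i]..offsets[i + 1]]`.
pub fn validate_offsets(
    bytes: &[u8],
    offsets: &[usize],
    num_elements: usize,
) -> Result<(), OffsetsError> {
    let expected = num_elements + 1;
    if offsets.len() != expected {
        return Err(OffsetsError::WrongCount {
            expected,
            actual: offsets.len(),
        });
    }
    // Offsets must never decrease, otherwise a slice would have negative length.
    if let Some(i) = offsets.windows(2).position(|pair| pair[0] > pair[1]) {
        return Err(OffsetsError::NotMonotonic { index: i + 1 });
    }
    // `offsets` is non-empty here because its length is `num_elements + 1`.
    let last = offsets[offsets.len() - 1];
    if last > bytes.len() {
        return Err(OffsetsError::OutOfBounds {
            last,
            bytes_len: bytes.len(),
        });
    }
    Ok(())
}

/// Borrow element `i`. Only call this after `validate_offsets` succeeded.
pub fn element<'a>(bytes: &'a [u8], offsets: &[usize], i: usize) -> &'a [u8] {
    &bytes[offsets[i]..offsets[i + 1]]
}
```

With the buffer `b"helloworld"` and offsets `[0, 5, 10]`, validation for two elements succeeds and element 1 is `b"world"`. Checking monotonicity together with the final offset is enough to guarantee that every per-element slice is in bounds.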
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,14 +7,46 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Added
+ - Add `ArrayBytes`, `RawBytes`, `RawBytesOffsets`, and `ArrayBytesError`
+    - These can represent array data with fixed and variable length data types
+ - Add `array::Element[Owned]` traits representing array elements
+    - Supports conversion to and from `ArrayBytes`
+ - Add `array::ElementFixedLength` marker trait
+ - Add experimental `vlen` and `vlen_interleaved` codec for variable length data types
+    - `vlen_interleaved` is for legacy support of Zarr V2 `vlen-utf8`/`vlen-bytes`/`vlen-array` codecs
+ - Add `DataType::{String,Binary}` data types
+    - These are likely to become standardised in the future and are not feature gated
+ - Add `ArraySubset::contains()`
+ - Add `FillValueMetadata::{String,Unsupported}`
+   - `ArrayMetadata` can be serialised and deserialised with an unsupported `fill_value`, but `Array` creation will fail.
+ - Implement `From<{[u8; N],&[u8; N],String,&str}>` for `FillValue`
+ - Add `ArraySize` and `DataTypeSize`
+ - Add `DataType::fixed_size()` that returns `Option<usize>`. Returns `None` for variable length data types.
+ - Add `ArrayError::IncompatibleElementType` (replaces `ArrayError::IncompatibleElementSize`)
+ - Add `ArrayError::InvalidElementValue`
+
 ### Changed
  - Use `[async_]retrieve_array_subset_opt` internally in `Array::[async_]retrieve_chunks_opt`
  - **Breaking**: Replace `[Async]ArrayPartialDecoderTraits::element_size()` with `data_type()`
+ - Array `_store` methods now use `impl Into<ArrayBytes<'a>>` instead of `&[u8]` for the input bytes
+ - **Breaking**: Array `_store_{elements,ndarray}` methods now use `T: Element` instead of `T: bytemuck::Pod`
+ - **Breaking**: Array `_retrieve_{elements,ndarray}` methods now use `T: ElementOwned` instead of `T: bytemuck::Pod`
+ - Optimised `Array::[async_]store_array_subset_opt` when the subset is a subset of a single chunk
+ - Make `transmute_to_bytes` public
+ - Relax `ndarray_into_vec` from `T: bytemuck:Pod` to `T: Clone`
+ - **Breaking**: `DataType::size()` now returns a `DataTypeSize` instead of `usize`
+ - **Breaking**: `ArrayCodecTraits::{encode/decode}` have been specialised into `ArrayTo{Array,Bytes}CodecTraits::{encode/decode}`
 
 ### Removed
  - **Breaking**: Remove `into_array_view` array and codec API
    - This was not fully utilised, not applicable to variable sized data types, and quite unsafe for a public API
- - Remove internal `ChunksPerShardError` and just use `CodecError::Other`
+ - **Breaking**: Remove internal `ChunksPerShardError` and just use `CodecError::Other`
+ - **Breaking**: Remove `array_subset::{ArrayExtractBytesError,ArrayStoreBytesError}`
+ - **Breaking**: Remove `ArraySubset::{extract,store}_bytes[_unchecked]`, they are replaced by methods in `ArrayBytes`
+ - **Breaking**: Remove `array::validate_element_size` and `ArrayError::IncompatibleElementSize`
+    - The internal validation in array `_element` methods is now more strict than just matching the element size
+    - Example: `u16` must match `uint16` data type and will not match `int16` or `float16`
 
 ### Fixed
  - Fix an unnecessary copy in `async_store_set_partial_values`
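
The changelog above states that `DataType::fixed_size()` returns `None` for variable length data types, and that element validation is now stricter than a size comparison (`u16` matches `uint16` but not `int16` or `float16`). The sketch below contrasts the two checks. It is a simplified model, not the zarrs implementation: the reduced `DataType` enum, the `TypedElement` trait, and both helper functions are hypothetical names introduced for illustration.

```rust
use std::mem::size_of;

/// Hypothetical, heavily reduced stand-in for a Zarr data type enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataType {
    Int16,
    UInt16,
    Float16,
    String,
}

impl DataType {
    /// `Some(bytes)` for fixed length types, `None` for variable length types.
    pub fn fixed_size(self) -> Option<usize> {
        match self {
            DataType::Int16 | DataType::UInt16 | DataType::Float16 => Some(2),
            DataType::String => None,
        }
    }
}

/// Hypothetical trait: a Rust type that knows which data type it represents.
pub trait TypedElement {
    fn matches(data_type: DataType) -> bool;
}

impl TypedElement for u16 {
    fn matches(data_type: DataType) -> bool {
        data_type == DataType::UInt16
    }
}

impl TypedElement for i16 {
    fn matches(data_type: DataType) -> bool {
        data_type == DataType::Int16
    }
}

/// Size-only check, in the spirit of the removed `validate_element_size`.
pub fn size_matches<T>(data_type: DataType) -> bool {
    data_type.fixed_size() == Some(size_of::<T>())
}

/// Stricter check: the Rust type must correspond to the exact data type.
pub fn type_matches<T: TypedElement>(data_type: DataType) -> bool {
    T::matches(data_type)
}
```

In this model `size_matches::<u16>(DataType::Int16)` is `true` because both are two bytes wide, while `type_matches::<u16>(DataType::Int16)` is `false`. A size-only check also has no sensible answer for `DataType::String`, which is one reason a per-type check fits variable length data better.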

diff --git a/Cargo.toml b/Cargo.toml
@@ -43,7 +43,7 @@ async-lock = { version = "3.2.0", optional = true }
 async-recursion = { version = "1.0.5", optional = true }
 async-trait = { version = "0.1.74", optional = true }
 blosc-sys = { version = "0.3.4", package = "blosc-src", features = ["snappy", "lz4", "zlib", "zstd"], optional = true }
-bytemuck = { version = "1.14.0", features = ["extern_crate_alloc", "must_cast"] }
+bytemuck = { version = "1.14.0", features = ["extern_crate_alloc", "must_cast", "min_const_generics"] }
 bytes = "1.6.0"
 bzip2 = { version = "0.4.4", optional = true, features = ["static"] }
 crc32c = { version = "0.6.5", optional = true }
@@ -75,6 +75,10 @@ zfp-sys = {version = "0.1.15", features = ["static"], optional = true }
 zip = { version = "2.1.3", optional = true }
 zstd = { version = "0.13.1", features = ["zstdmt"], optional = true }
 
+[dependencies.num-complex]
+version = "0.4.3"
+features = ["bytemuck"]
+
 [dev-dependencies]
 chrono = "0.4"
 criterion = "0.5.1"

diff --git a/benches/codecs.rs b/benches/codecs.rs
@@ -8,9 +8,10 @@ use zarrs::array::{
     codec::{
         array_to_bytes::bytes::Endianness,
         bytes_to_bytes::blosc::{BloscCompressor, BloscShuffleMode},
-        ArrayCodecTraits, BloscCodec, BytesCodec, BytesToBytesCodecTraits, CodecOptions,
+        ArrayCodecTraits, ArrayToBytesCodecTraits, BloscCodec, BytesCodec, BytesToBytesCodecTraits,
+        CodecOptions,
     },
-    BytesRepresentation, ChunkRepresentation, DataType,
+    BytesRepresentation, ChunkRepresentation, DataType, Element,
 };
 
 fn codec_bytes(c: &mut Criterion) {
@@ -35,12 +36,13 @@ fn codec_bytes(c: &mut Criterion) {
         .unwrap();
 
         let data = vec![0u8; size3.try_into().unwrap()];
+        let bytes = Element::into_array_bytes(&DataType::UInt8, &data).unwrap();
         group.throughput(Throughput::Bytes(size3));
         // encode and decode have the same implementation
         group.bench_function(BenchmarkId::new("encode_decode", size3), |b| {
             b.iter(|| {
                 codec
-                    .encode(Cow::Borrowed(&data), &rep, &CodecOptions::default())
+                    .encode(bytes.clone(), &rep, &CodecOptions::default())
                     .unwrap()
             });
         });

diff --git a/doc/status/codecs.md b/doc/status/codecs.md
@@ -1,16 +1,18 @@
-| Codec Type     | Codec<sup>†</sup>         | ZEP               | V3      | V2      | Feature Flag* |
-| -------------- | ------------------------- | ----------------- | ------- | ------- | ------------- |
-| Array to Array | [transpose]               | [ZEP0001]         | &check; |         | **transpose** |
-|                | [bitround] (experimental) |                   | &check; |         | bitround      |
-| Array to Bytes | [bytes]                   | [ZEP0001]         | &check; |         |               |
-|                | [sharding_indexed]        | [ZEP0002]         | &check; |         | **sharding**  |
-|                | [zfp] (experimental)      |                   | &check; |         | zfp           |
-|                | [pcodec] (experimental)   |                   | &check; |         | pcodec        |
-| Bytes to Bytes | [blosc]                   | [ZEP0001]         | &check; | &check; | **blosc**     |
-|                | [gzip]                    | [ZEP0001]         | &check; | &check; | **gzip**      |
-|                | [crc32c]                  | [ZEP0002]         | &check; |         | **crc32c**    |
-|                | [zstd]                    | [zarr-specs #256] | &check; |         | zstd          |
-|                | [bz2] (experimental)      |                   | &check; | &check; | bz2           |
+| Codec Type     | Codec<sup>†</sup>                      | ZEP               | V3      | V2      | Feature Flag* |
+| -------------- | -------------------------------------- | ----------------- | ------- | ------- | ------------- |
+| Array to Array | [transpose]                            | [ZEP0001]         | &check; |         | **transpose** |
+|                | [bitround] (experimental)              |                   | &check; |         | bitround      |
+| Array to Bytes | [bytes]                                | [ZEP0001]         | &check; |         |               |
+|                | [sharding_indexed]                     | [ZEP0002]         | &check; |         | **sharding**  |
+|                | [zfp] (experimental)                   |                   | &check; |         | zfp           |
+|                | [pcodec] (experimental)                |                   | &check; |         | pcodec        |
+|                | [vlen] (experimental)                  |                   | &check; |         |               |
+| | V3 [vlen_interleaved] (experimental)<br>V2 vlen-utf8/vlen-bytes/vlen-array | | &check; | &check; | |
+| Bytes to Bytes | [blosc]                                | [ZEP0001]         | &check; | &check; | **blosc**     |
+|                | [gzip]                                 | [ZEP0001]         | &check; | &check; | **gzip**      |
+|                | [crc32c]                               | [ZEP0002]         | &check; |         | **crc32c**    |
+|                | [zstd]                                 | [zarr-specs #256] | &check; |         | zstd          |
+|                | [bz2] (experimental)                   |                   | &check; | &check; | bz2           |
 
 <sup>\* Bolded feature flags are part of the default set of features.</sup>
 <br>
@@ -31,12 +33,16 @@
 [crc32c]: crate::array::codec::bytes_to_bytes::crc32c
 [zstd]: crate::array::codec::bytes_to_bytes::zstd
 [bz2]: crate::array::codec::bytes_to_bytes::bz2
+[vlen]: crate::array::codec::array_to_bytes::vlen
+[vlen_interleaved]: crate::array::codec::array_to_bytes::vlen_interleaved
 
 The `"name"` of of experimental codecs in array metadata links the codec documentation in this crate.
 
-| Experimental Codec | Name / URI                                        |
-| ------------------ | ------------------------------------------------- |
-| `bitround`         | <https://codec.zarrs.dev/array_to_array/bitround> |
-| `zfp`              | <https://codec.zarrs.dev/array_to_bytes/zfp>      |
-| `pcodec`           | <https://codec.zarrs.dev/array_to_bytes/pcodec>   |
-| `bz2`              | <https://codec.zarrs.dev/bytes_to_bytes/bz2>      |
+| Experimental Codec | Name / URI                                               |
+| ------------------ | -------------------------------------------------------- |
+| `bitround`         | <https://codec.zarrs.dev/array_to_array/bitround>        |
+| `zfp`              | <https://codec.zarrs.dev/array_to_bytes/zfp>             |
+| `pcodec`           | <https://codec.zarrs.dev/array_to_bytes/pcodec>          |
+| `bz2`              | <https://codec.zarrs.dev/bytes_to_bytes/bz2>             |
+| `vlen`             | <https://codec.zarrs.dev/array_to_array/vlen>            |
+| `vlen_interleaved` | <https://codec.zarrs.dev/array_to_array/zfp_interleaved> |
diff --git a/doc/status/data_types.md b/doc/status/data_types.md
@@ -1,8 +1,12 @@
-| Data Type | ZEP | V3 | V2 | Feature Flag |
+| Data Type<sup>†</sup> | ZEP | V3 | V2 | Feature Flag |
 | --------- | --- | ----- | -- | ------------ |
 | [bool]<br>[int8] [int16] [int32] [int64] [uint8] [uint16] [uint32] [uint64]<br>[float16] [float32] [float64]<br>[complex64] [complex128] | [ZEP0001] | &check; | &check; | |
 [r* (raw bits)] | [ZEP0001] | &check; | | |
 | [bfloat16] | [zarr-specs #130] | &check; | | |
+| [string] (experimental) | [ZEP0007 (draft)] | &check; | | |
+| [binary] (experimental) | [ZEP0007 (draft)] | &check; | | |
+
+<sup>† Experimental data types are recommended for evaluation only.</sup>
 
 [bool]: crate::array::data_type::DataType::Bool
 [int8]: crate::array::data_type::DataType::Int8
@@ -20,6 +24,9 @@
 [complex128]: crate::array::data_type::DataType::Complex128
 [bfloat16]: crate::array::data_type::DataType::BFloat16
 [r* (raw bits)]: crate::array::data_type::DataType::RawBits
+[string]: crate::array::data_type::DataType::String
+[binary]: crate::array::data_type::DataType::Binary
 
 [ZEP0001]: https://zarr.dev/zeps/accepted/ZEP0001.html
 [zarr-specs #130]: https://github.com/zarr-developers/zarr-specs/issues/130
+[ZEP0007 (draft)]: https://github.com/zarr-developers/zeps/pull/47
diff --git a/examples/sharded_array_write_read.rs b/examples/sharded_array_write_read.rs
@@ -137,15 +137,15 @@ fn sharded_array_write_read() -> Result<(), Box<dyn std::error::Error>> {
         ArraySubset::new_with_start_shape(vec![0, 4], inner_chunk_shape.clone())?,
     ];
     let decoded_inner_chunks_bytes = partial_decoder.partial_decode(&inner_chunks_to_decode)?;
-    let decoded_inner_chunks_ndarray = decoded_inner_chunks_bytes
-        .into_iter()
-        .map(|bytes| bytes_to_ndarray::<u16>(&inner_chunk_shape, bytes.to_vec()))
-        .collect::<Result<Vec<_>, _>>()?;
     println!("Decoded inner chunks:");
     for (inner_chunk_subset, decoded_inner_chunk) in
-        std::iter::zip(inner_chunks_to_decode, decoded_inner_chunks_ndarray)
+        std::iter::zip(inner_chunks_to_decode, decoded_inner_chunks_bytes)
     {
-        println!("{inner_chunk_subset}\n{decoded_inner_chunk}\n");
+        let ndarray = bytes_to_ndarray::<u16>(
+            &inner_chunk_shape,
+            decoded_inner_chunk.into_fixed()?.into_owned(),
+        )?;
+        println!("{inner_chunk_subset}\n{ndarray}\n");
     }
 
     // Show the hierarchy
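
The updated example above calls `into_fixed()?.into_owned()` on each decoded chunk before converting it to an ndarray. The sketch below models why that step can fail: once array bytes may be fixed or variable length, extracting a plain byte buffer is only valid for the fixed case. `ArrayBytesSketch`, `ExpectedFixedLength`, and `bytes_to_u16_le` are hypothetical names, and the two-variant layout is an assumption rather than the definition used in zarrs.

```rust
use std::borrow::Cow;

/// Hypothetical model of array bytes that are either fixed or variable length.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ArrayBytesSketch<'a> {
    /// Elements all have the same size, so the bytes alone are enough.
    Fixed(Cow<'a, [u8]>),
    /// Elements vary in size: bytes plus element offsets into those bytes.
    Variable(Cow<'a, [u8]>, Cow<'a, [usize]>),
}

/// Hypothetical error returned when fixed length bytes were expected.
#[derive(Debug, PartialEq, Eq)]
pub struct ExpectedFixedLength;

impl<'a> ArrayBytesSketch<'a> {
    /// Return the raw bytes if this holds fixed length data.
    pub fn into_fixed(self) -> Result<Cow<'a, [u8]>, ExpectedFixedLength> {
        match self {
            ArrayBytesSketch::Fixed(bytes) => Ok(bytes),
            ArrayBytesSketch::Variable(..) => Err(ExpectedFixedLength),
        }
    }
}

/// Decode little-endian `u16` values (endianness is an assumption here).
/// Any trailing odd byte is ignored.
pub fn bytes_to_u16_le(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}
```

Calling `into_fixed()` on the `Fixed` variant yields the bytes, and `into_owned()` on the returned `Cow` then produces a `Vec<u8>` without copying if the data was already owned. On the `Variable` variant the call returns an error instead of silently discarding the offsets.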

diff --git a/src/array.rs b/src/array.rs
@@ -22,6 +22,7 @@
 //! The documentation for [`Array`] details how to interact with arrays.
 
 mod array_builder;
+mod array_bytes;
 mod array_errors;
 mod array_metadata_options;
 mod array_representation;
@@ -33,6 +34,7 @@ pub mod codec;
 pub mod concurrency;
 pub mod data_type;
 mod dimension_name;
+mod element;
 mod endianness;
 mod fill_value;
 mod nan_representations;
@@ -47,18 +49,20 @@ use std::sync::Arc;
 
 pub use self::{
     array_builder::ArrayBuilder,
+    array_bytes::{ArrayBytes, ArrayBytesError, RawBytes, RawBytesOffsets},
     array_errors::{ArrayCreateError, ArrayError},
     array_metadata_options::ArrayMetadataOptions,
-    array_representation::{ArrayRepresentation, ChunkRepresentation},
+    array_representation::{ArrayRepresentation, ArraySize, ChunkRepresentation},
     bytes_representation::BytesRepresentation,
     chunk_grid::ChunkGrid,
     chunk_key_encoding::{ChunkKeyEncoding, ChunkKeySeparator},
     chunk_shape::{chunk_shape_to_array_shape, ChunkShape},
     codec::ArrayCodecTraits,
     codec::CodecChain,
     concurrency::RecommendedConcurrency,
-    data_type::DataType,
+    data_type::{DataType, DataTypeSize},
     dimension_name::DimensionName,
+    element::{Element, ElementFixedLength},
     endianness::{Endianness, NATIVE_ENDIAN},
     fill_value::FillValue,
     nan_representations::{ZARR_NAN_BF16, ZARR_NAN_F16, ZARR_NAN_F32, ZARR_NAN_F64},
@@ -641,9 +645,7 @@ impl<TStorage: ?Sized> Array<TStorage> {
 
 #[cfg(feature = "ndarray")]
 /// Convert an ndarray into a vec with standard layout
-fn ndarray_into_vec<T: bytemuck::Pod, D: ndarray::Dimension>(
-    array: ndarray::Array<T, D>,
-) -> Vec<T> {
+fn ndarray_into_vec<T: Clone, D: ndarray::Dimension>(array: ndarray::Array<T, D>) -> Vec<T> {
     if array.is_standard_layout() {
         array
     } else {
@@ -695,7 +697,7 @@ pub fn transmute_to_bytes_vec<T: bytemuck::NoUninit>(from: Vec<T>) -> Vec<u8> {
 
 /// Transmute from `&[T]` to `&[u8]`.
 #[must_use]
-fn transmute_to_bytes<T: bytemuck::NoUninit>(from: &[T]) -> &[u8] {
+pub fn transmute_to_bytes<T: bytemuck::NoUninit>(from: &[T]) -> &[u8] {
     bytemuck::must_cast_slice(from)
 }
 
@@ -733,17 +735,6 @@ fn iter_u64_to_usize<'a, I: Iterator<Item = &'a u64>>(iter: I) -> Vec<usize> {
         .collect::<Vec<_>>()
 }
 
-fn validate_element_size<T>(data_type: &DataType) -> Result<(), ArrayError> {
-    if data_type.size() == std::mem::size_of::<T>() {
-        Ok(())
-    } else {
-        Err(ArrayError::IncompatibleElementSize(
-            data_type.size(),
-            std::mem::size_of::<T>(),
-        ))
-    }
-}
-
 #[cfg(feature = "ndarray")]
 /// Convert a vector of elements to an [`ndarray::ArrayD`].
 ///