Add support for variable length data types
Data type sizes are now represented by `DataTypeSize` instead of usize

Also adds `ArraySize`.
Both are `Fixed` only for now. Need to support `Variable` throughout the codebase.

Change codec API in prep for variable sized data types

Enable `{Array,DataType}Size::Variable`

Implement `CowArrayBytes::validate()` and add `CodecError::InvalidVariableSizedArrayOffsets`

Use `CowArrayBytes::validate()`

impl `From` for `CowArrayBytes` for various types

Array `_element` methods now use `T: Element`

Add `vlen` codec metadata

Fix codecs bench

Implement an experimental vlen codec

Use `impl Into<ArrayBytesCow<'a>>` in array methods

Use `RawBytesCow` consistently

Remove various vlen todo's

Cleanup `ArrayBytes`

Use `ArrayError::InvalidElementValue` for invalid string encodings

Add `ArraySubset::contains()`

Add `FillValue::new_empty()`

Add remaining vlen support to array `store_` methods and improve vlen validation

Add remaining vlen support to array `retrieve_` methods

Partial decoding in the vlen filter

Fix async vlen errors

Sharding codec vlen support

Add vlen support to sharding partial decoder

vlen support for sharded_readable_ext

`offsets_u64_to_usize` handle 32-bit system

Minor FillValue doc update

Remove unused ArraySubset methods and add related convenience functions

Add cities test

Add `Arrow32` vlen encoding

Add support for Interleave32 (Zarr V2) vlen encoding

fmt

clippy

Set minimum version for num-complex

Fix `ArrayBytes` from `&[u8; N]` for rust < 1.77

Add `binary` data type

Vlen improve docs and test various encodings.

Fix `cities.csv` encoding.

`vlen` change encoding names

Validate `vlen` codec `length32` encoding against `zarr-python` v2

Don't store `zarrs` metadata in cities test output

Split `vlen` into `vlen` and `vlen_interleaved`

Vlen supports separate index/data encoding with full codec chains.

Fix typesize in vlen `index_codecs` metadata

Add support for `String` fill value metadata

Add `FillValueMetadata::Unsupported`

`ArrayMetadata` can be serialised and deserialised with an unsupported `fill_value`, but `Array` creation will fail.

vlen cleanup

Change vlen codec identifiers given they are experimental

Move duplicate `extract_decoded_regions` fn into `array_bytes`

+ other minor changes

Minor vlen_partial_decoder cleanup

Add support for `zarr-python` nonconformant `|O` V2 data type

Support conversion of Zarr V2 arrays with `vlen-*` codecs to V3

Update root docs for new vlen related codecs/data types

Cleanup `get_vlen_bytes_and_offsets`
LDeakin committed Jul 25, 2024
1 parent d54b89d commit 0c114bf
Showing 179 changed files with 53,040 additions and 2,198 deletions.
34 changes: 33 additions & 1 deletion CHANGELOG.md
@@ -7,14 +7,46 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added
- Add `ArrayBytes`, `RawBytes`, `RawBytesOffsets`, and `ArrayBytesError`
  - These can represent array data with fixed and variable length data types
- Add `array::Element[Owned]` traits representing array elements
  - Supports conversion to and from `ArrayBytes`
- Add `array::ElementFixedLength` marker trait
- Add experimental `vlen` and `vlen_interleaved` codecs for variable length data types
  - `vlen_interleaved` is for legacy support of Zarr V2 `vlen-utf8`/`vlen-bytes`/`vlen-array` codecs
- Add `DataType::{String,Binary}` data types
  - These are likely to become standardised in the future and are not feature gated
- Add `ArraySubset::contains()`
- Add `FillValueMetadata::{String,Unsupported}`
  - `ArrayMetadata` can be serialised and deserialised with an unsupported `fill_value`, but `Array` creation will fail.
- Implement `From<{[u8; N],&[u8; N],String,&str}>` for `FillValue`
- Add `ArraySize` and `DataTypeSize`
- Add `DataType::fixed_size()` that returns `Option<usize>`. Returns `None` for variable length data types.
- Add `ArrayError::IncompatibleElementType` (replaces `ArrayError::IncompatibleElementSize`)
- Add `ArrayError::InvalidElementValue`
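
As a rough illustration of what "fixed and variable length" array bytes could look like, the sketch below models the idea with a two-variant enum: fixed length data is a flat byte buffer, while variable length data pairs a byte buffer with `num_elements + 1` offsets. `SketchArrayBytes`, `from_strs`, and `is_valid` are hypothetical names chosen for this example, and the validation rules are assumptions; this is not the `ArrayBytes` implementation added by this commit.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for an array bytes type (not the zarrs `ArrayBytes`).
#[derive(Debug, Clone, PartialEq)]
pub enum SketchArrayBytes<'a> {
    /// Fixed length elements: `num_elements * element_size` bytes.
    Fixed(Cow<'a, [u8]>),
    /// Variable length elements: element `i` is `bytes[offsets[i]..offsets[i + 1]]`.
    Variable(Cow<'a, [u8]>, Cow<'a, [usize]>),
}

/// Pack string elements into the variable length layout.
pub fn from_strs(elements: &[&str]) -> SketchArrayBytes<'static> {
    let mut bytes = Vec::new();
    let mut offsets = Vec::with_capacity(elements.len() + 1);
    offsets.push(0);
    for element in elements {
        bytes.extend_from_slice(element.as_bytes());
        offsets.push(bytes.len());
    }
    SketchArrayBytes::Variable(Cow::Owned(bytes), Cow::Owned(offsets))
}

impl<'a> SketchArrayBytes<'a> {
    /// Check the bytes against an element count.
    /// `fixed_size` is `Some(size)` for fixed length types and `None` otherwise.
    /// Assumed rules: offsets are non-decreasing and never point past the bytes.
    pub fn is_valid(&self, num_elements: usize, fixed_size: Option<usize>) -> bool {
        match (self, fixed_size) {
            (SketchArrayBytes::Fixed(bytes), Some(size)) => {
                num_elements.checked_mul(size) == Some(bytes.len())
            }
            (SketchArrayBytes::Variable(bytes, offsets), None) => {
                offsets.len() == num_elements + 1
                    && offsets.windows(2).all(|w| w[0] <= w[1])
                    && offsets.last().map_or(false, |&last| last <= bytes.len())
            }
            // A fixed length type with variable bytes (or vice versa) is a mismatch.
            _ => false,
        }
    }
}
```

Calling `from_strs(&["a", "", "xyz"])` under these assumptions yields the bytes `axyz` with offsets `[0, 1, 1, 4]`; the empty string simply repeats an offset.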

### Changed
- Use `[async_]retrieve_array_subset_opt` internally in `Array::[async_]retrieve_chunks_opt`
- **Breaking**: Replace `[Async]ArrayPartialDecoderTraits::element_size()` with `data_type()`
- Array `_store` methods now use `impl Into<ArrayBytes<'a>>` instead of `&[u8]` for the input bytes
- **Breaking**: Array `_store_{elements,ndarray}` methods now use `T: Element` instead of `T: bytemuck::Pod`
- **Breaking**: Array `_retrieve_{elements,ndarray}` methods now use `T: ElementOwned` instead of `T: bytemuck::Pod`
- Optimised `Array::[async_]store_array_subset_opt` when the subset is a subset of a single chunk
- Make `transmute_to_bytes` public
- Relax `ndarray_into_vec` from `T: bytemuck::Pod` to `T: Clone`
- **Breaking**: `DataType::size()` now returns a `DataTypeSize` instead of `usize`
- **Breaking**: `ArrayCodecTraits::{encode/decode}` have been specialised into `ArrayTo{Array,Bytes}CodecTraits::{encode/decode}`
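
The move from a plain `usize` size to a size type with a variable case can be pictured with the sketch below. `SketchDataTypeSize`, `SketchDataType`, and `decoded_len` are hypothetical names for this example only; the real `DataTypeSize` and `DataType::fixed_size()` may differ in detail.

```rust
/// Hypothetical mirror of a data type size that may be fixed or variable.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchDataTypeSize {
    /// Every element occupies the same number of bytes.
    Fixed(usize),
    /// Element sizes differ (for example strings).
    Variable,
}

/// A tiny, illustrative subset of data types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchDataType {
    UInt16,
    Float64,
    String,
}

impl SketchDataType {
    /// The size of the data type, fixed or variable.
    pub fn size(self) -> SketchDataTypeSize {
        match self {
            SketchDataType::UInt16 => SketchDataTypeSize::Fixed(2),
            SketchDataType::Float64 => SketchDataTypeSize::Fixed(8),
            SketchDataType::String => SketchDataTypeSize::Variable,
        }
    }

    /// Convenience accessor: `None` for variable length data types.
    pub fn fixed_size(self) -> Option<usize> {
        match self.size() {
            SketchDataTypeSize::Fixed(size) => Some(size),
            SketchDataTypeSize::Variable => None,
        }
    }
}

/// The decoded byte length is only known up front for fixed length types.
pub fn decoded_len(data_type: SketchDataType, num_elements: usize) -> Option<usize> {
    data_type
        .fixed_size()
        .and_then(|size| size.checked_mul(num_elements))
}
```

Callers that previously multiplied an element count by a `usize` size would, under this shape, have to handle the `None` case explicitly, which is presumably why the change is flagged as breaking.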

### Removed
- **Breaking**: Remove `into_array_view` array and codec API
  - This was not fully utilised, not applicable to variable sized data types, and quite unsafe for a public API
-- Remove internal `ChunksPerShardError` and just use `CodecError::Other`
+- **Breaking**: Remove internal `ChunksPerShardError` and just use `CodecError::Other`
- **Breaking**: Remove `array_subset::{ArrayExtractBytesError,ArrayStoreBytesError}`
- **Breaking**: Remove `ArraySubset::{extract,store}_bytes[_unchecked]`, they are replaced by methods in `ArrayBytes`
- **Breaking**: Remove `array::validate_element_size` and `ArrayError::IncompatibleElementSize`
  - The internal validation in array `_element` methods is now more strict than just matching the element size
  - Example: `u16` must match `uint16` data type and will not match `int16` or `float16`
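
To make the stricter validation concrete, the sketch below contrasts a size-only check with a per-type check. `SketchDataType`, `SketchElement`, and `size_matches` are hypothetical names for this example; the actual `Element` trait in this commit has a different and larger interface.

```rust
use std::mem::size_of;

/// Three 2-byte data types that a size-only check cannot tell apart.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchDataType {
    Int16,
    UInt16,
    Float16,
}

impl SketchDataType {
    /// All three illustrative data types occupy two bytes.
    fn size(self) -> usize {
        2
    }
}

/// Size-only check, in the spirit of the removed `validate_element_size`.
pub fn size_matches<T>(data_type: SketchDataType) -> bool {
    data_type.size() == size_of::<T>()
}

/// Hypothetical per-type validation: each Rust type names the data type it accepts.
pub trait SketchElement {
    fn validate_data_type(data_type: SketchDataType) -> Result<(), String>;
}

impl SketchElement for u16 {
    fn validate_data_type(data_type: SketchDataType) -> Result<(), String> {
        if data_type == SketchDataType::UInt16 {
            Ok(())
        } else {
            Err(format!("u16 is incompatible with {:?}", data_type))
        }
    }
}

impl SketchElement for i16 {
    fn validate_data_type(data_type: SketchDataType) -> Result<(), String> {
        if data_type == SketchDataType::Int16 {
            Ok(())
        } else {
            Err(format!("i16 is incompatible with {:?}", data_type))
        }
    }
}
```

Here `size_matches::<u16>(SketchDataType::Int16)` is `true`, whereas `<u16 as SketchElement>::validate_data_type(SketchDataType::Int16)` is an error, which is the behaviour difference the changelog entry describes.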

### Fixed
- Fix an unnecessary copy in `async_store_set_partial_values`
6 changes: 5 additions & 1 deletion Cargo.toml
@@ -43,7 +43,7 @@ async-lock = { version = "3.2.0", optional = true }
 async-recursion = { version = "1.0.5", optional = true }
 async-trait = { version = "0.1.74", optional = true }
 blosc-sys = { version = "0.3.4", package = "blosc-src", features = ["snappy", "lz4", "zlib", "zstd"], optional = true }
-bytemuck = { version = "1.14.0", features = ["extern_crate_alloc", "must_cast"] }
+bytemuck = { version = "1.14.0", features = ["extern_crate_alloc", "must_cast", "min_const_generics"] }
 bytes = "1.6.0"
 bzip2 = { version = "0.4.4", optional = true, features = ["static"] }
 crc32c = { version = "0.6.5", optional = true }
@@ -75,6 +75,10 @@ zfp-sys = {version = "0.1.15", features = ["static"], optional = true }
 zip = { version = "2.1.3", optional = true }
 zstd = { version = "0.13.1", features = ["zstdmt"], optional = true }

+[dependencies.num-complex]
+version = "0.4.3"
+features = ["bytemuck"]
+
 [dev-dependencies]
 chrono = "0.4"
 criterion = "0.5.1"
8 changes: 5 additions & 3 deletions benches/codecs.rs
@@ -8,9 +8,10 @@ use zarrs::array::{
     codec::{
         array_to_bytes::bytes::Endianness,
         bytes_to_bytes::blosc::{BloscCompressor, BloscShuffleMode},
-        ArrayCodecTraits, BloscCodec, BytesCodec, BytesToBytesCodecTraits, CodecOptions,
+        ArrayCodecTraits, ArrayToBytesCodecTraits, BloscCodec, BytesCodec, BytesToBytesCodecTraits,
+        CodecOptions,
     },
-    BytesRepresentation, ChunkRepresentation, DataType,
+    BytesRepresentation, ChunkRepresentation, DataType, Element,
 };

 fn codec_bytes(c: &mut Criterion) {
@@ -35,12 +36,13 @@ fn codec_bytes(c: &mut Criterion) {
         .unwrap();

         let data = vec![0u8; size3.try_into().unwrap()];
+        let bytes = Element::into_array_bytes(&DataType::UInt8, &data).unwrap();
         group.throughput(Throughput::Bytes(size3));
         // encode and decode have the same implementation
         group.bench_function(BenchmarkId::new("encode_decode", size3), |b| {
             b.iter(|| {
                 codec
-                    .encode(Cow::Borrowed(&data), &rep, &CodecOptions::default())
+                    .encode(bytes.clone(), &rep, &CodecOptions::default())
                     .unwrap()
             });
         });
44 changes: 25 additions & 19 deletions doc/status/codecs.md
@@ -1,16 +1,18 @@
-| Codec Type | Codec<sup>†</sup> | ZEP | V3 | V2 | Feature Flag* |
-| -------------- | ------------------------- | ----------------- | ------- | ------- | ------------- |
-| Array to Array | [transpose] | [ZEP0001] | &check; | | **transpose** |
-| | [bitround] (experimental) | | &check; | | bitround |
-| Array to Bytes | [bytes] | [ZEP0001] | &check; | | |
-| | [sharding_indexed] | [ZEP0002] | &check; | | **sharding** |
-| | [zfp] (experimental) | | &check; | | zfp |
-| | [pcodec] (experimental) | | &check; | | pcodec |
-| Bytes to Bytes | [blosc] | [ZEP0001] | &check; | &check; | **blosc** |
-| | [gzip] | [ZEP0001] | &check; | &check; | **gzip** |
-| | [crc32c] | [ZEP0002] | &check; | | **crc32c** |
-| | [zstd] | [zarr-specs #256] | &check; | | zstd |
-| | [bz2] (experimental) | | &check; | &check; | bz2 |
+| Codec Type | Codec<sup>†</sup> | ZEP | V3 | V2 | Feature Flag* |
+| -------------- | -------------------------------------- | ----------------- | ------- | ------- | ------------- |
+| Array to Array | [transpose] | [ZEP0001] | &check; | | **transpose** |
+| | [bitround] (experimental) | | &check; | | bitround |
+| Array to Bytes | [bytes] | [ZEP0001] | &check; | | |
+| | [sharding_indexed] | [ZEP0002] | &check; | | **sharding** |
+| | [zfp] (experimental) | | &check; | | zfp |
+| | [pcodec] (experimental) | | &check; | | pcodec |
+| | [vlen] (experimental) | | &check; | | |
+| | V3 [vlen_interleaved] (experimental)<br>V2 vlen-utf8/vlen-bytes/vlen-array | | &check; | &check; | |
+| Bytes to Bytes | [blosc] | [ZEP0001] | &check; | &check; | **blosc** |
+| | [gzip] | [ZEP0001] | &check; | &check; | **gzip** |
+| | [crc32c] | [ZEP0002] | &check; | | **crc32c** |
+| | [zstd] | [zarr-specs #256] | &check; | | zstd |
+| | [bz2] (experimental) | | &check; | &check; | bz2 |

 <sup>\* Bolded feature flags are part of the default set of features.</sup>
 <br>
@@ -31,12 +33,16 @@
 [crc32c]: crate::array::codec::bytes_to_bytes::crc32c
 [zstd]: crate::array::codec::bytes_to_bytes::zstd
 [bz2]: crate::array::codec::bytes_to_bytes::bz2
+[vlen]: crate::array::codec::array_to_bytes::vlen
+[vlen_interleaved]: crate::array::codec::array_to_bytes::vlen_interleaved

 The `"name"` of experimental codecs in array metadata links to the codec documentation in this crate.

-| Experimental Codec | Name / URI |
-| ------------------ | ------------------------------------------------- |
-| `bitround` | <https://codec.zarrs.dev/array_to_array/bitround> |
-| `zfp` | <https://codec.zarrs.dev/array_to_bytes/zfp> |
-| `pcodec` | <https://codec.zarrs.dev/array_to_bytes/pcodec> |
-| `bz2` | <https://codec.zarrs.dev/bytes_to_bytes/bz2> |
+| Experimental Codec | Name / URI |
+| ------------------ | -------------------------------------------------------- |
+| `bitround` | <https://codec.zarrs.dev/array_to_array/bitround> |
+| `zfp` | <https://codec.zarrs.dev/array_to_bytes/zfp> |
+| `pcodec` | <https://codec.zarrs.dev/array_to_bytes/pcodec> |
+| `bz2` | <https://codec.zarrs.dev/bytes_to_bytes/bz2> |
+| `vlen` | <https://codec.zarrs.dev/array_to_bytes/vlen> |
+| `vlen_interleaved` | <https://codec.zarrs.dev/array_to_bytes/vlen_interleaved> |
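
For readers unfamiliar with interleaved variable length encodings, the sketch below shows one plausible layout: a little-endian `u32` element count, followed by each element as a little-endian `u32` length and its bytes. This layout is an assumption based on how the Zarr V2 `vlen-utf8`/`vlen-bytes` codecs are commonly described; it is not taken from this commit, and `encode_interleaved`, `decode_interleaved`, and `read_u32` are hypothetical helpers rather than part of the zarrs API.

```rust
/// Encode elements as: u32 count, then (u32 length, bytes) per element.
/// Sketch only: counts and lengths are assumed to fit in a `u32`.
pub fn encode_interleaved(elements: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(elements.len() as u32).to_le_bytes());
    for element in elements {
        out.extend_from_slice(&(element.len() as u32).to_le_bytes());
        out.extend_from_slice(element);
    }
    out
}

/// Read a little-endian u32 at `pos`, advancing `pos` on success.
fn read_u32(buf: &[u8], pos: &mut usize) -> Option<u32> {
    let end = pos.checked_add(4)?;
    let s = buf.get(*pos..end)?;
    *pos = end;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]))
}

/// Decode into concatenated element bytes plus `count + 1` offsets.
/// Returns `None` if the input is truncated.
pub fn decode_interleaved(encoded: &[u8]) -> Option<(Vec<u8>, Vec<usize>)> {
    let mut pos = 0usize;
    let count = read_u32(encoded, &mut pos)? as usize;
    let mut bytes = Vec::new();
    let mut offsets = vec![0usize];
    for _ in 0..count {
        let len = read_u32(encoded, &mut pos)? as usize;
        let end = pos.checked_add(len)?;
        bytes.extend_from_slice(encoded.get(pos..end)?);
        pos = end;
        offsets.push(bytes.len());
    }
    Some((bytes, offsets))
}
```

The decoded form (bytes plus offsets) is what lets element lengths be stored and compressed separately from the element data, which is the distinction the commit message draws between `vlen` and `vlen_interleaved`.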
9 changes: 8 additions & 1 deletion doc/status/data_types.md
@@ -1,8 +1,12 @@
-| Data Type | ZEP | V3 | V2 | Feature Flag |
+| Data Type<sup>†</sup> | ZEP | V3 | V2 | Feature Flag |
 | --------- | --- | ----- | -- | ------------ |
 | [bool]<br>[int8] [int16] [int32] [int64] [uint8] [uint16] [uint32] [uint64]<br>[float16] [float32] [float64]<br>[complex64] [complex128] | [ZEP0001] | &check; | &check; | |
 | [r* (raw bits)] | [ZEP0001] | &check; | | |
 | [bfloat16] | [zarr-specs #130] | &check; | | |
+| [string] (experimental) | [ZEP0007 (draft)] | &check; | | |
+| [binary] (experimental) | [ZEP0007 (draft)] | &check; | | |

+<sup>† Experimental data types are recommended for evaluation only.</sup>
+
 [bool]: crate::array::data_type::DataType::Bool
 [int8]: crate::array::data_type::DataType::Int8
@@ -20,6 +24,9 @@
 [complex128]: crate::array::data_type::DataType::Complex128
 [bfloat16]: crate::array::data_type::DataType::BFloat16
 [r* (raw bits)]: crate::array::data_type::DataType::RawBits
+[string]: crate::array::data_type::DataType::String
+[binary]: crate::array::data_type::DataType::Binary

 [ZEP0001]: https://zarr.dev/zeps/accepted/ZEP0001.html
 [zarr-specs #130]: https://github.com/zarr-developers/zarr-specs/issues/130
+[ZEP0007 (draft)]: https://github.com/zarr-developers/zeps/pull/47
12 changes: 6 additions & 6 deletions examples/sharded_array_write_read.rs
@@ -137,15 +137,15 @@ fn sharded_array_write_read() -> Result<(), Box<dyn std::error::Error>> {
         ArraySubset::new_with_start_shape(vec![0, 4], inner_chunk_shape.clone())?,
     ];
     let decoded_inner_chunks_bytes = partial_decoder.partial_decode(&inner_chunks_to_decode)?;
-    let decoded_inner_chunks_ndarray = decoded_inner_chunks_bytes
-        .into_iter()
-        .map(|bytes| bytes_to_ndarray::<u16>(&inner_chunk_shape, bytes.to_vec()))
-        .collect::<Result<Vec<_>, _>>()?;
     println!("Decoded inner chunks:");
     for (inner_chunk_subset, decoded_inner_chunk) in
-        std::iter::zip(inner_chunks_to_decode, decoded_inner_chunks_ndarray)
+        std::iter::zip(inner_chunks_to_decode, decoded_inner_chunks_bytes)
     {
-        println!("{inner_chunk_subset}\n{decoded_inner_chunk}\n");
+        let ndarray = bytes_to_ndarray::<u16>(
+            &inner_chunk_shape,
+            decoded_inner_chunk.into_fixed()?.into_owned(),
+        )?;
+        println!("{inner_chunk_subset}\n{ndarray}\n");
     }

     // Show the hierarchy
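
The `into_fixed()?.into_owned()` chain in the updated example suggests that decoded bytes are now an enum that must be narrowed to the fixed length case before being reinterpreted as `u16`. A minimal sketch of that pattern follows, assuming a `Cow`-backed enum; `SketchArrayBytes` and `bytes_to_u16_le` are hypothetical names for this example, and the error type is simplified to `String`.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for decoded array bytes (not the zarrs `ArrayBytes`).
pub enum SketchArrayBytes<'a> {
    /// Flat bytes of fixed length elements.
    Fixed(Cow<'a, [u8]>),
    /// Bytes plus offsets of variable length elements.
    Variable(Cow<'a, [u8]>, Cow<'a, [usize]>),
}

impl<'a> SketchArrayBytes<'a> {
    /// Narrow to fixed length bytes, failing for variable length data.
    pub fn into_fixed(self) -> Result<Cow<'a, [u8]>, String> {
        match self {
            SketchArrayBytes::Fixed(bytes) => Ok(bytes),
            SketchArrayBytes::Variable(..) => {
                Err("expected fixed length bytes, found variable length bytes".to_string())
            }
        }
    }
}

/// Reinterpret little-endian bytes as `u16` elements.
/// Returns `None` if the length is not a multiple of two.
pub fn bytes_to_u16_le(bytes: &[u8]) -> Option<Vec<u16>> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(2)
            .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
            .collect(),
    )
}
```

With this shape, `into_owned()` only copies when the `Cow` is borrowed, so a decoder that already produced an owned buffer hands it over without an extra allocation.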
25 changes: 8 additions & 17 deletions src/array.rs
@@ -22,6 +22,7 @@
 //! The documentation for [`Array`] details how to interact with arrays.

 mod array_builder;
+mod array_bytes;
 mod array_errors;
 mod array_metadata_options;
 mod array_representation;
@@ -33,6 +34,7 @@ pub mod codec;
 pub mod concurrency;
 pub mod data_type;
 mod dimension_name;
+mod element;
 mod endianness;
 mod fill_value;
 mod nan_representations;
@@ -47,18 +49,20 @@ use std::sync::Arc;

 pub use self::{
     array_builder::ArrayBuilder,
+    array_bytes::{ArrayBytes, ArrayBytesError, RawBytes, RawBytesOffsets},
     array_errors::{ArrayCreateError, ArrayError},
     array_metadata_options::ArrayMetadataOptions,
-    array_representation::{ArrayRepresentation, ChunkRepresentation},
+    array_representation::{ArrayRepresentation, ArraySize, ChunkRepresentation},
     bytes_representation::BytesRepresentation,
     chunk_grid::ChunkGrid,
     chunk_key_encoding::{ChunkKeyEncoding, ChunkKeySeparator},
     chunk_shape::{chunk_shape_to_array_shape, ChunkShape},
     codec::ArrayCodecTraits,
     codec::CodecChain,
     concurrency::RecommendedConcurrency,
-    data_type::DataType,
+    data_type::{DataType, DataTypeSize},
     dimension_name::DimensionName,
+    element::{Element, ElementFixedLength},
     endianness::{Endianness, NATIVE_ENDIAN},
     fill_value::FillValue,
     nan_representations::{ZARR_NAN_BF16, ZARR_NAN_F16, ZARR_NAN_F32, ZARR_NAN_F64},
@@ -641,9 +645,7 @@ impl<TStorage: ?Sized> Array<TStorage> {

 #[cfg(feature = "ndarray")]
 /// Convert an ndarray into a vec with standard layout
-fn ndarray_into_vec<T: bytemuck::Pod, D: ndarray::Dimension>(
-    array: ndarray::Array<T, D>,
-) -> Vec<T> {
+fn ndarray_into_vec<T: Clone, D: ndarray::Dimension>(array: ndarray::Array<T, D>) -> Vec<T> {
     if array.is_standard_layout() {
         array
     } else {
@@ -695,7 +697,7 @@ pub fn transmute_to_bytes_vec<T: bytemuck::NoUninit>(from: Vec<T>) -> Vec<u8> {

 /// Transmute from `&[T]` to `&[u8]`.
 #[must_use]
-fn transmute_to_bytes<T: bytemuck::NoUninit>(from: &[T]) -> &[u8] {
+pub fn transmute_to_bytes<T: bytemuck::NoUninit>(from: &[T]) -> &[u8] {
     bytemuck::must_cast_slice(from)
 }

@@ -733,17 +735,6 @@ fn iter_u64_to_usize<'a, I: Iterator<Item = &'a u64>>(iter: I) -> Vec<usize> {
         .collect::<Vec<_>>()
 }

-fn validate_element_size<T>(data_type: &DataType) -> Result<(), ArrayError> {
-    if data_type.size() == std::mem::size_of::<T>() {
-        Ok(())
-    } else {
-        Err(ArrayError::IncompatibleElementSize(
-            data_type.size(),
-            std::mem::size_of::<T>(),
-        ))
-    }
-}
-
 #[cfg(feature = "ndarray")]
 /// Convert a vector of elements to an [`ndarray::ArrayD`].
 ///