Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run tests against Zarr 3 #254

Open
tomwhite opened this issue Jun 17, 2024 · 13 comments
Open

Run tests against Zarr 3 #254

tomwhite opened this issue Jun 17, 2024 · 13 comments

Comments

@tomwhite
Copy link
Contributor

There's an alpha release available now that can be installed using pip install --pre

https://pypi.org/project/zarr/3.0.0a0/

@jeromekelleher
Copy link
Contributor

I had a quick look at this and it seems like quite a bit of stuff is broken. I hit a wall at finding a way to specify a name for a new array in a Group. This seems pretty basic and I don't have time to chase down the details, so I'll try again when the next alpha comes out.

@tomwhite
Copy link
Contributor Author

I'll try again when the next alpha comes out

There is no next alpha according to zarr-developers/zarr-python#1777.

@jeromekelleher
Copy link
Contributor

I just had another go with the tip of the v3 branch, and still hitting walls with array creation. The support for creating v2 arrays seems to be pretty thin, and it's not clear at all to me how we're supposed to go about it. I don't really know where to start tbh.

@jeromekelleher
Copy link
Contributor

For reference:

python3 -m pip install pip install git+https://zarr-developers/zarr-python

@tomwhite
Copy link
Contributor Author

tomwhite commented Jul 3, 2024

I'll take a look

@tomwhite
Copy link
Contributor Author

tomwhite commented Jul 3, 2024

Here's the branch where I've been experimenting with Zarr v3: https://github.com/tomwhite/bio2zarr/tree/zarr-v3

After making changes to adapt to the different v3 API, it's now failing because Zarr v3 doesn't support string dtypes:

_____________________________________________________________ TestVcfZarrWriterExample.test_encode_partition[0] _____________________________________________________________

self = <tests.test_vcz.TestVcfZarrWriterExample object at 0x7f9dc9a75b40>
icf_path = PosixPath('/private/var/folders/9j/h1v35g4166z6zt816fq7wymc0000gn/T/pytest-of-tom/pytest-1771/data11/example.exploded')
tmp_path = PosixPath('/private/var/folders/9j/h1v35g4166z6zt816fq7wymc0000gn/T/pytest-of-tom/pytest-1771/test_encode_partition_0_0'), partition = 0

    @pytest.mark.parametrize("partition", [0, 1, 2])
    def test_encode_partition(self, icf_path, tmp_path, partition):
        zarr_path = tmp_path / "x.zarr"
        vcf2zarr.encode_init(icf_path, zarr_path, 3, variants_chunk_size=3)
        partition_path = zarr_path / "wip" / "partitions" / f"p{partition}"
        assert not partition_path.exists()
>       vcf2zarr.encode_partition(zarr_path, partition)

tests/test_vcz.py:508: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
bio2zarr/vcf2zarr/vcz.py:1064: in encode_partition
    writer.encode_partition(partition)
bio2zarr/vcf2zarr/vcz.py:683: in encode_partition
    self.encode_id_partition(partition_index)
bio2zarr/vcf2zarr/vcz.py:801: in encode_id_partition
    vid.flush()
bio2zarr/core.py:145: in flush
    sync_flush_1d_array(
bio2zarr/core.py:163: in sync_flush_1d_array
    zarr_array[offset : offset + np_buffer.shape[0]] = np_buffer
../zarr-python/src/zarr/array.py:984: in __setitem__
    self.set_basic_selection(cast(BasicSelection, pure_selection), value, fields=fields)
../zarr-python/src/zarr/array.py:1198: in set_basic_selection
    sync(self._async_array._set_selection(indexer, value, fields=fields, prototype=prototype))
../zarr-python/src/zarr/sync.py:92: in sync
    raise return_result
../zarr-python/src/zarr/sync.py:51: in _runner
    return await coro
../zarr-python/src/zarr/array.py:518: in _set_selection
    value_buffer = prototype.nd_buffer.from_ndarray_like(value)
../zarr-python/src/zarr/buffer.py:339: in from_ndarray_like
    return cls(ndarray_like)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[AttributeError("'NDBuffer' object has no attribute '_data'") raised in repr()] NDBuffer object at 0x7f9ddd467250>
array = array(['.', '.', 'rs6054257'], dtype=object)

    def __init__(self, array: NDArrayLike):
        # assert array.ndim > 0
>       assert array.dtype != object
E       AssertionError

../zarr-python/src/zarr/buffer.py:286: AssertionError

This is a major limitation for us. The Zarr v3 core spec does not cover strings.

They are likely to be a future extension:

The set of data types specified in v3 is less than in v2. Additional data types will be defined via extensions.

@jeromekelleher
Copy link
Contributor

Thanks Tom!

Yeesh, the lack of strings is scary. Looks like we'll be on v2 for a long time then.

@d-v-b
Copy link

d-v-b commented Jul 9, 2024

fwiw I would love to chart a path to getting more dtypes into zarr v3 (since i happen to be working on the v3 fill value normalization right now). as you noted, there's a dtype extension mechanism built into the spec but we haven't exercised it yet. Could you share or link to a description of how you are using string arrays, either here or in a discussion over in zarr specs? That might help kick-start things.

@tomwhite
Copy link
Contributor Author

tomwhite commented Jul 9, 2024

Hi @d-v-b, thanks for commenting! We'd actually like to encode variable-length strings, which use an object dtype and a numcodecs.vlen.VLenUTF8 codec.

I've added a comment explaining our use case to the discussion at zarr-developers/zarr-specs#83

@tomwhite
Copy link
Contributor Author

I filed the following issues to improve Zarr Python v3 API compatibility here:

@tomwhite
Copy link
Contributor Author

As an experiment I tried creating a VLenUTF8Codec which uses numcodecs.vlen.VLenUTF8. I can successfully run a test that writes a VCF Zarr file and then validates it:

pytest 'tests/test_vcf_examples.py::test_by_validating[sample.vcf.gz]'

@jeromekelleher
Copy link
Contributor

Amazing! 🎉

@tomwhite
Copy link
Contributor Author

I updated https://github.com/tomwhite/bio2zarr/tree/zarr-v3 to use the code from zarr-developers/zarr-python#2036 and the test still passes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants