
zarr.core.Array.chunk_store changes type after zarr=2.13.3 (after PR #1304) #1362

Closed
valeriupredoi opened this issue Mar 6, 2023 · 21 comments


valeriupredoi commented Mar 6, 2023

Hey guys, just a heads up that the chunk_store object returned by the chunk_store property of a Zarr array changes type for objects (Zarr arrays) created by loading files with fsspec after zarr=2.13.3, i.e. after PR #1304 was merged:

  • with 2.13.3 my chunk_store (see below for reproducible code) is <KVStore: <fsspec.mapping.FSMap object at 0x7feba08afee0> at 0x7feba08aff40>
  • with 2.13.6 same chunk_store object is <zarr.storage.FSStore object at 0x7fd3ea52bf40>

This causes issues when one wants to access attributes of the KVStore object like _mutable_mapping. If this is intended, please both mention it in the release notes (as a breaking change) and provide a solution for how to still be able to use _mutable_mapping; if it's a bug - then you know what to do 😁 🍺

Here's my minimal code - like super minimal 😁 - and kudos to Zarr for preserving structure in such a Null case

import fsspec
import zarr

def open_zarr_group():
    url = fsspec.filesystem("reference", fo="test.json")
    mapper = url.get_mapper("")
    zarr_group = zarr.open_group(mapper)
    print(zarr_group)
    print(zarr_group.chunk_store)


if __name__ == '__main__':
    open_zarr_group()

where test.json can be an empty JSON file, e.g. containing just {}
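For completeness, the placeholder file can be created from a POSIX shell like so:

```shell
# create a file that is empty by JSON standards (contains only {})
echo '{}' > test.json
```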

Let me know what I can do to help BTW, and many thanks for all your good work! 🍻


rabernat commented Mar 6, 2023

Hi @valeriupredoi - thanks for reporting this potential bug! It looks like #1304 may have caused more problems than it fixed. 🤦

We'd love to be able to act on this, but I would really like to first encourage you to make your bug report more minimal. It would be great if you could remove the references to kerchunk, local file paths, etc. in your example. Here is a great guide for how to do this.

valeriupredoi (Author)

@rabernat cheers muchly! Absolutely, will do that in a sec - for now we've pinned zarr so no rush at our end, but I'll try to minimize the test so you guys can reproduce the issue without extra dependencies 🍺

valeriupredoi (Author)

@rabernat I have now edited the minimal test case - my apologies, I didn't know Zarr was so flexible that it really doesn't need any data whatsoever (everything can be mocked), with the functional types and structure still preserved - well done! 🍻


rabernat commented Mar 6, 2023

Hi @valeriupredoi - thanks for updating your example! However, it is still not reproducible.

When I tried to run your code, I got the following error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[1], line 13
      9     print(zarr_group.chunk_store)
     12 if __name__ == '__main__':
---> 13     open_zarr_group()

Cell In[1], line 5, in open_zarr_group()
      4 def open_zarr_group():
----> 5     url = fsspec.filesystem("reference", fo="test.json")
      6     mapper = url.get_mapper("")
      7     zarr_group = zarr.open_group(mapper)

...

FileNotFoundError: [Errno 2] No such file or directory: '/home/rabernat/test/test.json'

valeriupredoi (Author)

hi @rabernat apols, I mentioned it in the description but it's not clear - you'll have to create an empty test.json, i.e. empty by JSON standards: just put {} in the file 👍


rabernat commented Mar 6, 2023

I tried this:

import json

import fsspec
import zarr

def open_zarr_group():
    with open("test.json", mode="w") as fp:
        json.dump({}, fp)
    url = fsspec.filesystem("reference", fo="test.json")
    mapper = url.get_mapper("")
    zarr_group = zarr.open_group(mapper)
    print(zarr_group)
    print(zarr_group.chunk_store)

open_zarr_group()

And got the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 18
     13     print(zarr_group)
     14     print(zarr_group.chunk_store)
---> 18 open_zarr_group()

Cell In[6], line 12, in open_zarr_group()
     10 url = fsspec.filesystem("reference", fo="test.json")
     11 mapper = url.get_mapper("")
---> 12 zarr_group = zarr.open_group(mapper)
     13 print(zarr_group)
     14 print(zarr_group.chunk_store)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/zarr/hierarchy.py:1465, in open_group(store, mode, cache_attrs, synchronizer, path, chunk_store, storage_options, zarr_version, meta_array)
   1462 # determine read only status
   1463 read_only = mode == 'r'
-> 1465 return Group(store, read_only=read_only, cache_attrs=cache_attrs,
   1466              synchronizer=synchronizer, path=path, chunk_store=chunk_store,
   1467              zarr_version=zarr_version, meta_array=meta_array)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/zarr/hierarchy.py:164, in Group.__init__(self, store, path, read_only, chunk_store, cache_attrs, synchronizer, zarr_version, meta_array)
    162     mkey = _prefix_to_group_key(self._store, self._key_prefix)
    163     assert not mkey.endswith("root/.group")
--> 164     meta_bytes = store[mkey]
    165 except KeyError:
    166     if self._version == 2:

File /srv/conda/envs/notebook/lib/python3.10/site-packages/zarr/storage.py:1393, in FSStore.__getitem__(self, key)
   1391 key = self._normalize_key(key)
   1392 try:
-> 1393     return self.map[key]
   1394 except self.exceptions as e:
   1395     raise KeyError(key) from e

File /srv/conda/envs/notebook/lib/python3.10/site-packages/fsspec/mapping.py:143, in FSMap.__getitem__(self, key, default)
    141 k = self._key_to_str(key)
    142 try:
--> 143     result = self.fs.cat(k)
    144 except self.missing_exceptions:
    145     if default is not None:

File /srv/conda/envs/notebook/lib/python3.10/site-packages/fsspec/implementations/reference.py:313, in ReferenceFileSystem.cat(self, path, recursive, on_error, **kwargs)
    311 for proto, paths in proto_dict.items():
    312     fs = self.fss[proto]
--> 313     urls, starts, ends = zip(*[self._cat_common(p) for p in paths])
    314     urls2 = []
    315     starts2 = []

File /srv/conda/envs/notebook/lib/python3.10/site-packages/fsspec/implementations/reference.py:313, in <listcomp>(.0)
    311 for proto, paths in proto_dict.items():
    312     fs = self.fss[proto]
--> 313     urls, starts, ends = zip(*[self._cat_common(p) for p in paths])
    314     urls2 = []
    315     starts2 = []

File /srv/conda/envs/notebook/lib/python3.10/site-packages/fsspec/implementations/reference.py:230, in ReferenceFileSystem._cat_common(self, path, start, end)
    228     start1, end1 = start, end
    229 else:
--> 230     url, start0, size = part
    231     logger.debug(f"Reference: {path} => {url}, offset {start0}, size {size}")
    232     end0 = start0 + size

ValueError: too many values to unpack (expected 3)


rabernat commented Mar 6, 2023

Does this example really need the reference filesystem? Here is one without it

import fsspec
import zarr

fs = fsspec.filesystem("file")
mapper = fs.get_mapper("tmp.zarr")
zarr_group = zarr.open_group(mapper)
print(zarr_group)
# <zarr.hierarchy.Group '/'>
print(zarr_group.chunk_store)
# <zarr.storage.FSStore object at 0x7fe3b15dbfd0>

Does this reproduce the problem?

valeriupredoi (Author)

> Does this example really need the reference filesystem? Here is one without it
>
> import fsspec
> import zarr
>
> fs = fsspec.filesystem("file")
> mapper = fs.get_mapper("tmp.zarr")
> zarr_group = zarr.open_group(mapper)
> print(zarr_group)
> # <zarr.hierarchy.Group '/'>
> print(zarr_group.chunk_store)
> # <zarr.storage.FSStore object at 0x7fe3b15dbfd0>
>
> Does this reproduce the problem?

Indeed that should work! I tried the code in your previous comment (where the JSON file is created with the context manager) and that worked for me, but it's true, I don't think one needs the reference filesystem


rabernat commented Mar 6, 2023

Ok, glad we found a truly minimum reproducer! 😊

Now, can you help me understand what exactly is the problem? The FSMap has correctly been promoted to an FSStore. This object will give much better performance on concurrent read / write operations, as discussed in #1296.

Can you explain why the FSStore object is problematic for you?


valeriupredoi commented Mar 6, 2023

> Ok, glad we found a truly minimum reproducer! 😊

and a really nice proof Zarr supports Null testing nicely 😁

> Can you explain why the FSStore object is problematic for you?

We are treating and using chunk_store as a KVStore, so we are using one of its methods, _mutable_mapping, which is no longer an attribute of it; see e.g. this test failure: AttributeError: 'FSStore' object has no attribute '_mutable_mapping'. I didn't know this change was intended - it would be nice if it were communicated in the release notes/changelog, together with a suggestion for how to move to the new method/API
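In the meantime, a small compatibility shim can cope with both store flavours. This is a sketch based on the behaviour described in this thread (the helper name get_fs is ours, and the attribute probing is best-effort, not an official zarr API):

```python
from types import SimpleNamespace

def get_fs(chunk_store):
    """Best-effort retrieval of the underlying fsspec filesystem.

    Handles both an FSStore (zarr >= 2.13.6, which exposes .fs directly) and
    a KVStore wrapping an FSMap (zarr <= 2.13.3, where the mapping sits in
    the private _mutable_mapping attribute).
    """
    if hasattr(chunk_store, "fs"):  # FSStore-style store
        return chunk_store.fs
    mapping = getattr(chunk_store, "_mutable_mapping", None)  # KVStore-style
    if mapping is not None and hasattr(mapping, "fs"):
        return mapping.fs
    raise TypeError(f"cannot extract a filesystem from {type(chunk_store)!r}")

# Sanity check with stand-ins for the two store flavours:
new_style = SimpleNamespace(fs="fs-object")
old_style = SimpleNamespace(_mutable_mapping=SimpleNamespace(fs="fs-object"))
assert get_fs(new_style) == get_fs(old_style) == "fs-object"
```

With such a shim, the `fsref = ...chunk_store...fs.references` lookup discussed below works on either side of the version boundary.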


rabernat commented Mar 6, 2023

> we are using one of its methods _mutable_mapping

Ok I understand. Thanks for re-explaining!

In general, FSStore and FSMap implement a very similar public interface. They are both MutableMappings, so you can treat them as dictionaries and use getitem / setitem syntax on them. FSStore has a few public methods that FSMap does not, like getitems and setitems.
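That shared interface means store-agnostic code can stick to plain mapping operations. A dependency-free sketch, using a dict as a stand-in for any of the stores:

```python
from collections.abc import MutableMapping

def copy_store(src: MutableMapping, dst: MutableMapping) -> None:
    """Copy all entries between stores using only getitem/setitem/iteration -
    the interface shared by dict, FSMap, FSStore and KVStore alike."""
    for key in list(src):
        dst[key] = src[key]

# Demo with plain dicts standing in for two stores:
src = {".zgroup": b'{"zarr_format": 2}'}
dst: dict = {}
copy_store(src, dst)
assert dst == src
```

Code written this way keeps working when a library swaps one MutableMapping implementation for another, which is exactly the kind of change this issue tripped over.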

> we are using one of its methods _mutable_mapping,

._mutable_mapping is a private attribute. As developers, we do not make any guarantees about the stability of private attributes. They may change or disappear at any time. So I would suggest that you do not rely on them in your user code.

Can you explain why you were using the _mutable_mapping attribute? Maybe we can find a workaround.


valeriupredoi commented Mar 6, 2023

@rabernat that's a very good question - as I understand it, we use it to get the mapping of the input netCDF4 files onto the Zarr format, i.e. a full image of the data and metadata as Zarr would see it:

fsref = self.zds.chunk_store._mutable_mapping.fs.references

where self.zds is a Zarr array - i.e. the FS reference for the file. My colleague @bnlawrence can tell you more, since he implemented this method, but that should be the gist of it. If there is any way to grab this dictionary from FSStore, we'd gladly use that, especially if it were a public function 👍


rabernat commented Mar 6, 2023

You should be able to access the underlying fsspec filesystem object via the .fs attribute of the FSStore object. So zarr_group.chunk_store.fs.references should work.

Looking at the source code is probably helpful - the relevant class is FSStore(Store) in zarr/storage.py.

👋 @bnlawrence! Lovely to see you around these parts. I'd love to learn what you are up to and how we might collaborate around Zarr / kerchunk / etc.

valeriupredoi (Author)

cheers @rabernat - I'll have a look, many thanks for your help so far 🍺

valeriupredoi (Author)

whoa! It turns out that in the new configuration one doesn't need to access the _mutable_mapping attribute after all - this does it perfectly:

fsref = self.zds.chunk_store.fs.references

Cheers, @rabernat 🍺 I'll keep this open in case you still need it for bookkeeping and/or @bnlawrence's reply 😁


rabernat commented Mar 6, 2023

We can close the issue but feel free to keep the conversation going!


bnlawrence commented Mar 17, 2023

Ah, sorry @rabernat - @valeriupredoi has just pointed this thread out to me. I have to admit to mostly silencing my GitHub notifications, as I can't cope with the information onslaught.

For what it's worth, this is part of building a tool to allow the file system (or the S3 server) to do basic reductions on chunks rather than full chunk loads (to avoid data movement). We have four pieces in flight at the moment: a Python client application (cf-python, of course), some Python middleware (which uses this piece of the zarr ecosystem), a POSIX server, and an S3 proxy server ... but until we put them all together (due in Q2/Q3 this year), there's not much to show or talk about.

We absolutely intend the middleware to be reusable by any client application/library (including xarray), but until we have it working enough (i.e. with the storage clients as well) to sell the idea, we're not talking much :-)

rabernat (Contributor)

Very interesting. I would check out https://zarr.dev/zeps/draft/ZEP0005.html, by @hailiangzhang and colleagues at NASA Goddard. It aims to accomplish a very similar thing via a Zarr spec extension. Discussion at zarr-developers/zarr-specs#205

hailiangzhang (Contributor)

Thanks for referring to my proposal, @rabernat!
@bnlawrence, this proposal introduces a small, adjustable amount of extra storage, but it should make large-volume data reduction/aggregation services much faster and cheaper, since we only need to extract the chunks at the boundary. If this appears related to what you are trying to achieve, feel free to join the Zarr community call next Wed, when I plan to do a walkthrough. Thanks!

bnlawrence

Our approach is to push the computation of partial sums etc into the storage itself (which has plenty of compute for such tasks, given that it can do erasure coding etc). What is probably in common with our two approaches is how we would want to tell Dask when it can avoid having to do these calculations itself. @valeriupredoi Perhaps you or @davidhassell could see if you can join that call? If not, maybe we can set something specific up for a conversation - it is a bit niche :-)

hailiangzhang (Contributor)

> Our approach is to push the computation of partial sums etc into the storage itself (which has plenty of compute for such tasks, given that it can do erasure coding etc). What is probably in common with our two approaches is how we would want to tell Dask when it can avoid having to do these calculations itself. @valeriupredoi Perhaps you or @davidhassell could see if you can join that call? If not, maybe we can set something specific up for a conversation - it is a bit niche :-)

Thanks for the info, @bnlawrence! Yes, there may be something in common between our approaches (at least it looks like we are trying to achieve a similar goal :) ). Do you happen to have any links/resources describing your approach in more detail?

4 participants