
Confusion about the dimension_separator keyword #769

Closed
constantinpape opened this issue Jun 5, 2021 · 39 comments · Fixed by #773

@constantinpape

I don't really understand how to use the new dimension_separator keyword, in particular:

  1. Creating a DirectoryStore(dimension_separator="/") does not have the effect I would expect (see code and problem description below).
  2. Why does zarr still have the NestedDirectoryStore? Shouldn't it be equivalent to DirectoryStore(dimension_separator="/")? If so, I would assume that NestedDirectoryStore could either be removed or (if kept for backwards compatibility) simply map to DirectoryStore(dimension_separator="/").

Minimal, reproducible code sample, a copy-pastable example if possible

import zarr
store = zarr.DirectoryStore("test.zarr", dimension_separator="/")
g = zarr.open(store, mode="a")
ds = g.create_dataset("test", shape=(10, 10, 10))
ds[:] = 1

Problem description

Now, I would assume that the chunks are nested, but I get:

$ ls test.zarr/test
0.0.0

but, to my confusion, also this:

$ cat test.zarr/test/.zarray
...
"dimension_separator": "/",
...

If I use NestedDirectoryStore instead, the chunks are nested as expected.
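
For comparison, a minimal illustration of the nested layout that NestedDirectoryStore produces (the on-disk path in the comment assumes the default chunking for this small array):

import zarr

store = zarr.storage.NestedDirectoryStore("nested.zarr")
g = zarr.open(store, mode="a")
ds = g.create_dataset("test", shape=(10, 10, 10))
ds[:] = 1

# On disk the single chunk now lives at nested.zarr/test/0/0/0
# (nested directories) instead of the flat file nested.zarr/test/0.0.0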

Version and installation information

Please provide the following:

  • Value of zarr.__version__: 2.8.3
  • Value of numcodecs.__version__: 0.7.3
  • Version of Python interpreter: 3.8.6
  • Operating system: Linux
  • How Zarr was installed: using conda
@joshmoore
Member

2. Why does zarr still have the NestedDirectoryStore?

To prevent a breaking change.

I would assume that the chunks are nested, but I get:

$ ls test.zarr/test
0.0.0

Agreed. This surprises me as well.

@joshmoore
Member

So, I think DirectoryStore(dimension_separator="/") is probably just untested, likely because NestedDirectoryStore exists. So if we do want to deprecate NestedDirectoryStore (NDS), then DirectoryStore (DS) will need to make use of this helper:

def _nested_map_ckey(key):
    # Rewrite the trailing chunk id (e.g. "0.0.0") to use "/" as the separator;
    # anything that does not look like a chunk id (e.g. ".zarray") is left untouched.
    segments = list(key.split('/'))
    if segments:
        last_segment = segments[-1]
        if _prog_ckey.match(last_segment):
            last_segment = last_segment.replace('.', '/')
            segments = segments[:-1] + [last_segment]
            key = '/'.join(segments)
    return key
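
For reference, _prog_ckey is a module-level regex in zarr.storage that matches chunk ids such as "0.0.0"; a rough rendering of it and the net effect of the helper (the exact pattern is an assumption here):

import re

_prog_ckey = re.compile(r'^(\d+)(\.\d+)+$')  # chunk ids like "0.0.0"

assert _nested_map_ckey('test/0.0.0') == 'test/0/0/0'      # chunk key is nested
assert _nested_map_ckey('test/.zarray') == 'test/.zarray'  # metadata keys untouched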

An alternative would be to disallow "/" as a separator with the error message, "use NestedDirectoryStore".

@constantinpape
Author

Thanks for looking into this, @joshmoore. In my opinion, the issue with DirectoryStore(dimension_separator="/") should be fixed rather than disallowing it.
I would also vote for deprecating NestedDirectoryStore or simplifying it and making sure that it uses the same implementation as DirectoryStore, e.g. something like

class NestedDirectoryStore(DirectoryStore):
    def __init__(...):
        super().__init__(..., dimension_separator="/")
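
A minimal runnable sketch of that idea (the path and normalize_keys arguments are assumed to match DirectoryStore's signature in zarr 2.8):

from zarr.storage import DirectoryStore

class NestedDirectoryStore(DirectoryStore):
    # Thin wrapper: identical to DirectoryStore except for the separator.
    def __init__(self, path, normalize_keys=False):
        super().__init__(path, normalize_keys=normalize_keys,
                         dimension_separator="/")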

@constantinpape
Author

To add to this issue, the same problem exists when reading the data.
Given a zarr array at test.zarr/test with dimension_separator=/, the following does not work correctly:

import zarr
f = zarr.open('test.zarr')
data = f['test'][:]

The data returned will be all zeros (or more generally all fill_value).
Conversely, this works as expected:

import zarr
store = zarr.storage.NestedDirectoryStore('test.zarr')
f = zarr.open(store)
data = f['test'][:]

@joshmoore
Member

Yeah, that definitely counts as a bug. I almost wonder if it wouldn't be easiest to use FSStore to re-implement (Nested)DirectoryStore.

cc: @martindurant

@martindurant
Member

Maybe, but it can also be tricky to have one implementation that does too many things, with a load of options. Is that better or worse than a set of specific implementations?

When I initially trialled FSStore, it did have more options, which I removed for lack of testing. fsspec is a light dependency, but I imagine people would still be surprised to need it for local files (unless zarr were to make it a hard requirement).

@joshmoore
Member

(Working on a failing test & a fix now)

@joshmoore
Member

joshmoore commented Jun 14, 2021

So the failing test is pretty straightforward, but I'm beginning to fear that this use case shows that the Array, and not the Store, should be in charge of this information. (Not to mention what happens when a single store is asked to open multiple arrays with different separators.)

Edit: pondering out loud, an alternative would be to have a Key type which could carry this metadata, but other Store implementations wouldn't know what to do with this by default.

@martindurant
Member

Agreed, we need to sort out to what extent zarr metadata should describe not just how to turn bytes into arrays, but also the details of how the data is laid out in the storage backend. I am ambivalent - maybe the user just "needs to know", so that the metadata of a copy on a different backend stays the same.

Actually, an over-layer describing how to open a dataset works and is arguably the right place to separate out this kind of option. I would argue that Intake does this well.

@joshmoore
Member

Just thinking about what that would mean for the likes of dask/dask#7673, @martindurant: would you then pass an Intake URL to from_zarr?

@martindurant
Member

Oh no! I mean, I'd love it, but Dask doesn't want that...

I mean that the Intake prescription could include url="file:///local/path", dimension_separator="/" to ensure picking the right thing. To the user, it's not obvious that including "file://" picks FSStore, and that this store can make use of the extra parameter, which is why you'd want to hide this kind of arcana in a catalogue.

The most you'd want to do directly on the dask side is to pass on arguments to zarr as faithfully as you can, and provide good documentation around it. If users need to instantiate their own store instances, they'll have to do some digging.

@joshmoore
Member

Ok. It seems, then, like "encode it at a higher level" would end up reversing all the work that has gone in since #715. If others are similarly worried about the addition of dimension_separator and are pondering a rollback, it'd be good to know. cc: @zarr-developers/core-devs

Otherwise, I'm going to push forward with trying to make it behave sanely as currently implemented with the goal of having it behave the same across language implementations and without the user needing to know that an array was saved with nested vs. flat.

@martindurant
Member

I am not particularly advocating the "higher level", sorry if it sounded like it; we can surely fix the code in dask and anywhere else to "detect" the situation too. The Intake thing is more of a workaround for cases where we don't want to change the implementation. As extra arguments for storage go, this one is pretty low-level, almost at the C/Fortran layout layer. The argument was that, since it affects the names of the contained files, you couldn't simply copy the whole dataset anyway.

Other storage-oriented arguments are more divorced from that layer or specific to the storage, such as "consolidated" (which we support as an optional argument, but is not in the spec; it does survive copies) and "requester-pays" for an S3 dataset I happen to be looking at today.

@joshmoore
Member

"requester-pays" for an S3 dataset I happen to be looking at today.

Interesting! Please ping me on related issues when the time comes.

I am not particularly advocating the "higher level", sorry if it sounded like it

Understood. Thanks for the clarification.

@joshmoore
Member

joshmoore commented Jun 14, 2021

@d-v-b, if you still have any of the state of this loaded into your working memory after #709, it'd be interesting to hear your opinion, since I vaguely remember you arguing that it was in fact a Store concern rather than a core.py concern.

@d-v-b
Contributor

d-v-b commented Jun 15, 2021

@joshmoore I specifically objected to the dimension_separator keyword changing the string format of the internal chunk keys used in core.py. My view is that the chunk keys in the array API should always take the format foo/0.0.0 regardless of the storage backend. My assumption was that the storage layer is responsible for translating a chunk key like foo/0.0.0 into some pattern of file system access (e.g., chunk_data = store.storage_api.get(store.translate_chunk_key_into_name_for_storage_api(chunk_key))).

Insofar as dimension_separator concerns stuff happening on a file system, it seems very natural for it to be a property of the storage layer.
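
A sketch of that layering (the class and method names here are illustrative, not zarr's actual API): core.py always asks for foo/0.0.0, and the store translates the chunk id before touching its backend.

import re

_chunk_id = re.compile(r'^\d+(\.\d+)*$')  # "0", "0.0", "0.0.0", ...

class TranslatingStore:
    def __init__(self, backend, dimension_separator="."):
        self.backend = backend          # e.g. a plain dict for this sketch
        self.sep = dimension_separator

    def _storage_key(self, key):
        prefix, _, last = key.rpartition('/')
        if _chunk_id.match(last) and self.sep != '.':
            last = last.replace('.', self.sep)
        return f"{prefix}/{last}" if prefix else last

    def __getitem__(self, key):
        return self.backend[self._storage_key(key)]

store = TranslatingStore({"foo/0/0/0": b"chunk-bytes"}, dimension_separator="/")
assert store["foo/0.0.0"] == b"chunk-bytes"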

In terms of ergonomics, one pain point here is that if the storage layer is looking for chunk files with dimension_separator = "/" but the chunk files happen to be stored with the "." separator, the user does not get any warnings or errors; the requests for chunks from the storage backend all come back as a KeyError, a 404, or the equivalent, and these are treated as fill_value. This can be annoying and surprising. The user should probably get some loud feedback when attempting to access an array with a store that is invalid given the state of the file system.
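
One possible shape for that loud feedback (a sketch, not existing zarr behaviour): compare the separator recorded in .zarray with the one the store is configured with, and raise instead of silently serving fill_value.

import json
import os

def assert_separator_matches(array_path, store_separator="."):
    with open(os.path.join(array_path, ".zarray")) as f:
        recorded = json.load(f).get("dimension_separator") or "."
    if recorded != store_separator:
        raise ValueError(
            f"array at {array_path!r} was written with dimension_separator="
            f"{recorded!r}, but the store is configured with {store_separator!r}"
        )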

@joshmoore
Member

Thanks, @d-v-b. I was worried that I had remembered your concerns correctly. The issue we've run into here is that when passing the "a/b/0.0" style key from the array to the store, there's no way to pass along whether one is in a dimension_separator="/" or a "." context. If you have any ideas, I'm all ears.

The options I've considered are:

  1. What I've done in Fix DirectoryStore #773, i.e. make it an Array concern rather than a Store concern
  2. Introduce a more sophisticated Key concept (Key object or object with dictionary) to pass the dimension_separator from the Array
  3. And, as it sounds like you are suggesting, fail if a Store is used to open an Array that has a non-matching dimension_separator (I think I know where I could fail fast, but I'm not sure yet whether I could catch all cases)

The downsides of 3 are:

  • The client will always need to try one and then fall back to the other
  • It's not possible to support groups which mix nested and flat storage

The downside of 2 is that all Stores would need to learn this new protocol.

If I understand correctly, your concern with 1 is that in the N5 case it won't be possible to split the key portion from the chunk portion?

@d-v-b
Contributor

d-v-b commented Jun 15, 2021

It's not possible to support groups which mix nested and flat storage

Does this actually need to be supported? In my usage, the storage backend is what makes "." or "/" the more desirable key separator. If this is generally the case, then maybe it is safe to require that the key separator must be consistent across all arrays in a given store?

If I understand correct, your concern with 1 is that in the N5 case it won't be possible to split the key portion from the chunk portion?

Yes, n5 support requires that the store intercept the abstract chunk key (formatted as name/dim_0.dim_1...), transpose the order of the dimensions, and then use the "/" dimension separator. This becomes impossible if the structure of the chunk key does not allow separating the dimensions from the array key/name.
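
A sketch of the n5 translation described here (illustrative only, not zarr's actual N5Store code), applied to an abstract chunk key:

def n5_chunk_key(abstract_key):
    # Split "name/dim_0.dim_1..." into the array name and the chunk id,
    # reverse the dimension order, and join with "/".
    name, _, chunk_id = abstract_key.rpartition('/')
    dims = chunk_id.split('.')
    parts = ([name] if name else []) + list(reversed(dims))
    return '/'.join(parts)

assert n5_chunk_key('foo/0.1.2') == 'foo/2/1/0'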

and it sounds like you are saying, failing if a Store is used to open an Array that has a non-matching dimension_separator (I think I know where I could fail fast, but I'm not sure yet if I could catch all cases)

Yes, I think this is the right approach. If an array has been already saved to disk with dimension_separator="/" and the user tries to open it with dimension_separator=".", the user should see an error.

@joshmoore
Member

... the user should see an error.

Agreed. For all the cases where a user explicitly passes a dimension_separator, I agree that it's the most straightforward approach. But in the vast majority of cases that's not what's happening, and I fear the usability of this is fairly lousy and will lead to:

try:
  Store(".")   # try the flat "." separator first
except:
  Store("/")   # fall back to the nested "/" separator
becoming the standard idiom, in which case nothing has been won.

@d-v-b
Contributor

d-v-b commented Jun 15, 2021

The dimension_separator should only be specified when the store doesn't exist yet (i.e. during creation); otherwise, it can be inferred from the existing store -- do routines for this exist in the storage implementations?
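
For a directory-based store, such inference could look roughly like this (a sketch, not an existing zarr routine): peek at the on-disk chunk names of an existing array and see whether they are flat files or nested directories.

import os
import re

def infer_dimension_separator(array_path):
    for name in os.listdir(array_path):
        if re.fullmatch(r'\d+(\.\d+)+', name):
            return "."   # flat chunk files such as "0.0.0"
        if re.fullmatch(r'\d+', name) and os.path.isdir(os.path.join(array_path, name)):
            return "/"   # nested chunk directories such as "0/"
    return "."           # no chunks written yet: fall back to the v2 default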

@joshmoore
Member

The dimension_separator should only be specified when the store doesn't exist yet

Agreed, in principle at least. In practice, it's per array since there's no top-level metadata in v2.

otherwise, it can be inferred from the existing store

That's certainly what I'm trying to achieve, but it's not currently possible. An option perhaps (4?) is to allow Arrays to modify stores (!). I don't particularly like the idea (and it introduces a race condition), but it's the only way I've thought of so far to leave Stores in charge of this feature and still have the behavior you describe.

I'm already passing the _dimension_separator into the Array creation via a property on the Store. If the Store was not given a dimension_separator, the first time an Array is opened, the Array would set it on the given Store.

@constantinpape
Author

Agreed, in principle at least. In practice, it's per array since there's no top-level metadata in v2.

There is the .zgroup metadata (which afaik currently only holds the zarr version). Could we just add the dimension separator into the mix there to solve this issue?
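
For concreteness, what that could look like on disk (hypothetical: a dimension_separator field in .zgroup is not part of the v2 spec, which today only records zarr_format):

import json

zgroup = {
    "zarr_format": 2,
    "dimension_separator": "/",  # hypothetical group-level field
}
with open("test.zarr/.zgroup", "w") as f:
    json.dump(zgroup, f)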

@martindurant
Member

@constantinpape - probably not, since you should always be able to access the path of the array directly without knowing it is within a group.

@constantinpape
Author

probably not, since you should always be able to access the path of the array directly without knowing it is within a group.

But that could be solved by having it both in the array and group metadata, no?
This duplicates the information, but seems preferable to the other downsides listed above.

@martindurant
Member

martindurant commented Jun 15, 2021 via email

True, but then we must worry about keeping the two in sync!

@constantinpape
Author

True, but then we must worry about keeping the two in sync!

Yes, that's true. But that still sounds preferable to the other downsides discussed above. And I think that would be a good solution if we decide not to allow mixed separators per group, which I am in favor of.

Let's see what @joshmoore and @d-v-b think about this.

@jakirkham
Member

if you still have any of the state of this loaded into your working memory after #769, it'd be interesting to hear your opinion, since I vaguely remember you arguing that it was in fact a Store concern rather than a core.py concern.

@joshmoore this appears to link back to the current issue. Was there another issue this was intending to refer to? (just trying to get some missing context here)

@joshmoore
Member

Thanks for catching that, @jakirkham. #709 was what I meant. Fixed above.

@d-v-b
Contributor

d-v-b commented Jun 19, 2021

Introduce a more sophisticated Key concept (Key object or object with dictionary) to pass the dimension_separator from the Array

What about a length-2 tuple containing the array key and the chunk key, e.g. ('path/to/array/', '0.0.0') vs ('path/to/array/', '0/0/0')? This is explicit enough to allow the chunk renaming I need for n5.
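
A sketch of how a store might consume such a two-part key (illustrative only; this is not an existing zarr API):

def resolve_key(key, dimension_separator="."):
    if isinstance(key, tuple):   # (array_key, chunk_key)
        array_key, chunk_key = key
        chunk_key = chunk_key.replace(".", dimension_separator)
        return array_key.rstrip("/") + "/" + chunk_key
    return key                   # plain key, e.g. "path/to/array/.zarray"

assert resolve_key(("path/to/array/", "0.0.0"), "/") == "path/to/array/0/0/0"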

@joshmoore
Member

In terms of new APIs that would certainly work for me, but it would be a pretty significant change.

@DennisHeimbigner

WRT the netcdf-c implementation of the dimension separator:

  1. Internally, I end up breaking the chunk key into two parts as mentioned above. Effectively (/a/b/a, 0D0D0) where D is the dimension separator.
  2. I have to use a heuristic to figure out what part of the key is the chunk id. That heuristic looks for the trailing occurrence of [0-9]+D[0-9]+D... where D is the dimension separator. This has the unfortunate consequence that if D = / then you cannot have an array whose name is a number.
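
A Python rendering of that heuristic (the netcdf-c implementation is in C; this is only an illustration). With D = "/", the trailing-digits match also swallows a purely numeric array name, which is why such names become illegal.

import re

def split_chunk_key(key, sep):
    d = re.escape(sep)
    m = re.search(rf'(?:^|/)(\d+(?:{d}\d+)*)$', key)
    if m is None:
        raise ValueError(f"no chunk id found in {key!r}")
    return key[:m.start(1)].rstrip('/'), m.group(1)

assert split_chunk_key('/a/b/c/0.0.0', '.') == ('/a/b/c', '0.0.0')
# Ambiguity when sep == "/": the numeric array name "7" is absorbed into the chunk id.
assert split_chunk_key('/a/7/0/0/0', '/') == ('/a', '7/0/0/0')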

@joshmoore
Member

@d-v-b: sorry, I've been failing to find the time to really engage here, but "/"-separated storage is still quite broken. Do you have a branch/repository that I could test against my #773 and see if I can find a nice workaround for the N5 side of things?

@DennisHeimbigner: thanks for the info. If we're willing to break the API, then yes, I definitely see the value in clearly splitting the key into its individual parts.

In another issue (which I can dig up) there was a pretty strong vote from @jbms (#715 (comment)) against trying the heuristics (as I had attempted in a PR with @SabineEmbacher).

cc: @haileyajohnson

@DennisHeimbigner

Actually splitting the key occurs completely internally, so the user never sees it.
As I mentioned, the price is that certain variable names become "illegal".

@d-v-b
Contributor

d-v-b commented Jul 2, 2021

@joshmoore I will put together a branch that you can test against. And I think I figured out a stupid hack that lets a store infer the key separator even if it changes on the array side.

To summarize this issue as I understand it:

  • Different dimension separators are good for different storage backends.
  • It seems like "/" is better than "." for most (but not all) storage backends.
  • As far as I know, the informed choice of dimension separator is governed by the details of a given storage backend, therefore it is probably not important to support multiple dimension separators in the same zarr container. I.e., the dimension separator is a property of the whole container, not a property of each array in that container.
  • However, at present, it looks like array metadata is the only place the dimension separator can be stored, because of the constraint that root.zarr/bar/baz/array be accessible without touching the metadata of root.zarr. Because we think of the dimension separator as reflecting a container-wide constraint, and because we want arrays to be abstracted from the details of chunk storage, putting the dimension separator in array metadata is awkward: it leads to duplicated information and an abstraction leakage that complicates things like n5 support.

I would be interested in understanding the pros and cons of fixing this via storage metadata (i.e. some root metadata that contains information about storage for all elements of the container). If store-specific details (like dimension separator) were stored in root.zarr/.zgroup, and initializing a store involved reading that metadata, what would break?

@martindurant
Member

If store-specific details (like dimension separator) were stored in root.zarr/.zgroup, and initializing a store involved reading that metadata, what would break?

The obvious one would be an attempt to move (or copy) the dataset/file hierarchy from one storage system to another. Such a move is not necessarily made by a thing that has any knowledge of zarr and its conventions.

@joshmoore
Member

I will put together a branch that you can test against. And I think I figured out a stupid hack that lets a store infer the key separator even if it changes on the array side.

💯

  • As far as I know, the informed choice of dimension separator is governed by the details of a given storage backend, therefore it is probably not important to support multiple dimension separators in the same zarr container. I.e., the dimension separator is a property of the whole container, not a property of all the arrays in that container.

At least at the specification level, this is currently not the case in either V2 or V3. It's an array concern.

I would be interested in understanding the pros and cons of fixing this via storage metadata (i.e. some root metadata that contains information about storage for all elements of the container). If store-specific details (like dimension separator) were stored in root.zarr/.zgroup

root.zarr is not necessarily a group but can be an array, which is another way of saying that in v2 there's no clear place for such store-level metadata.

@clbarnes
Contributor

clbarnes commented Jul 22, 2021

Just chiming in to say I also hit the issue addressed by #773 !

... you should always be able to access the path of the array directly without knowing it is within a group.

This is certainly the current behaviour, but it's not the case in V3, is it?

The obvious one would be an attempt to move (or copy) the dataset/file hierarchy from one storage system to another. Such a move is not necessarily made by a thing that has any knowledge of zarr and its conventions.

Going into V3, we lose the ability to use such tools anyway, right (which is a shame)? Because all the metadata is stored in a directory hierarchy parallel to the groups/arrays.

@thewtex

thewtex commented Aug 11, 2021

Another +1 and thanks to @joshmoore for #773, which addressed the behavior I observed when trying to use DirectoryStore and dimension_separator='/' with zarr and xarray.

Are there plans to integrate #773?

@joshmoore
Member

After a lengthy back and forth, I think this is a last call for opinions on the fix. I'd propose to get it out in 2.9.0 (noting that 2.9.0 also drops support for Python 3.6).
