Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Idea: functionally-derived non-dimensional coordinates #3620

Open
eteq opened this issue Dec 13, 2019 · 15 comments
Open

Idea: functionally-derived non-dimensional coordinates #3620

eteq opened this issue Dec 13, 2019 · 15 comments

Comments

@eteq
Copy link

eteq commented Dec 13, 2019

@Cadair and I are from the solar and astrophysics communities, respectively (particularly SunPy and Astropy). In our fields, we have a concept of something called "World Coordinate Systems" (WCS) which basically are arbitrary mappings from pixel coordinates (which is often but not necessarily the same as the index) to physical coordinates. (For more on this and associated Python/Astropy APIs, see this document). If we are reading correctly, this concept maps roughly onto the xarray concept of "Non-dimension coordinates".

However, a critical difference is this: WCS are usually expressed as arbitrary complex mathematical functions, rather than coordinate arrays, as it is crucial to some of the science cases to carry sub-pixel or other coordinate-related metadata around along with the WCS.

So our question is: is it in-scope for xarray non-dimensional coordinates to be extended to be functional instead of having to be arrays? I.e., have the coordinate arrays be generated on-the-fly from some function instead of being realized as arrays at creation-time. We have thought about several ways this might be specified and are willing to do some trial implementations, but are first asking here if it is likely to be

  1. Easy
  2. Hard
  3. Impossible
  4. PR will immediately be rejected on philosophical grounds, regardless?

Thanks!

@rabernat
Copy link
Contributor

rabernat commented Dec 13, 2019 via email

@dcherian
Copy link
Contributor

It would also be good to hear about "sub-pixel metadata" → this seems to be the main reason why you want to carry around the analytic rather than the discrete evaluated form (which we basically support through dask). Is that right or am I missing something?

@eteq
Copy link
Author

eteq commented Dec 13, 2019

Thanks for the quick responses @rabernat and @dcherian!

When you say “arbitrary complex mathematical functions”: what are the arguments / inputs to these functions?

The short answer is: pixels. Which in astro is sometimes the same as "array index", but usually off-by-0.5 (i.e., the center of the pixel), and occasionally off-by-1 (for files generated by e.g. FORTRAN or other 1-based indexing schemes...).

The longer answer is: it depends on which direction you mean. Most WCS' are by design invertible, so you can go world-to-pixel or pixel-to-world. In the former case that means there's a bunch of possibly inputs - examples include wavelength (for a spectrum), energy (for a photon-counting experiment), or altitude and azimuth (celestial coordinates that are similar to lat/lon but on the celestial sphere instead of a reference geoid).

Additionally, the WCSs have parameters, which are usually not thought of as "inputs" to the WCS, but rather internal state or metadata (usually they are stored in file headers and interpreted by software).

If you want to see some concrete examples, you might check out Astropy's docs for our WCS interface, or the design document for the most recent iteration of that interface. But those might be impenetrable for a non-astronomer, so I'm open to more follow-ups!

It would also be good to hear about "sub-pixel metadata"

Sorry I wasn't clear there - I meant there are cases where the sub-pixel structure of the coordinates matter - e.g. telescopes often have pretty intense optical distortion, and we sometimes interpolate in such a way that the sub-pixel information in the WCS is critical to getting the interpolation right. And by "metadata" I meant both abstract parameters that encode that transformation (polynominal coefficients, information about which sky-projection is used, etc), and more physical metadata like "the temperature of the observatory, which slightly changes the shape of the distortion".

which we basically support through dask

Oh, tell me more about that - maybe this is the key for our needs?

@dcherian
Copy link
Contributor

dcherian commented Dec 14, 2019

I should have said "discrete lazily evaluated form (which we support through dask)". I think we already have what you want in principle (caveats at the end).

Here's an example:

import dask
import numpy as np
import xarray as xr

xr.set_options(display_style="html")

def arbitrary_function(dataset):
    return dataset["a"] * dataset["wavelength"] * dataset.attrs["wcs_param"]

ds = xr.Dataset()
# construct a dask array. 
# In practice this could represent an on-disk dataset, 
# with data reads only occurring when necessary
ds["a"] = xr.DataArray(dask.array.ones((10,)), dims=["wavelength"], coords={"wavelength": np.arange(10)})

# some coordinate system parameter
ds.attrs["wcs_param"] = 1.0

# complicated pixel to world function
# no compute happens since we are working with dask arrays
# so this is quite cheap.
ds.coords["azimuth"] = arbitrary_function(ds)
ds

image

So you can carry around your coordinate system parameters in the .attrs dictionary and the non-dimensional coordinate azimuth is only evaluated when needed e.g. when plotting

# Both 'a' and 'azimuth' are computed now, since actual values are required to plot
ds.a.plot(x="azimuth")

image

In practice, there are a few limitations. @djhoese and @snowman2 may have useful perspective here.

  1. xarray tends to compute "non-dimensional coordinates" more than necessary. The more egregious examples have been fixed (Multidimensional dask coordinates unexpectedly computed #3068, optimize compatibility checks in merge.unique_variable #3311, Large coordinate arrays trigger computation #3454, Optimize dask array equality checks. #3453) but there may still be some places where fixes are needed (Optimize isel for lazy array equality checking #3588).
  2. there's some discussion about carrying around Earth-specific coordinate system parameters here: Add CRS/projection information to xarray objects #2288; Checking non-dimensional coordinates for equality #2996.

Additional info:

  1. https://docs.dask.org/en/latest/array.html
  2. https://xarray.pydata.org/en/stable/dask.html
  3. https://blog.dask.org/2019/06/20/load-image-data

PS: If it helps, I'd be happy to chat over skype for a half hour getting you oriented with how we do things.

@snowman2
Copy link
Contributor

For reference, here is how CRS information is handled in rioxarray: CRS management docs.

@djhoese
Copy link
Contributor

djhoese commented Dec 14, 2019

For reference, here is how CRS information is handled in rioxarray: CRS management docs.

Nice! I didn't know you had documented this.

Sorry this is going to get long. I'd like to describe the CRS stuff we deal with and the lessons learned and the decisions I've been fighting with in the geoxarray project (https://github.com/geoxarray/geoxarray). I'm looking at this from a meteorological satellite point of view. @snowman2 please correct me if I'm wrong about anything.

  1. In our field(s) and software the CRS object that @snowman2 is talking about and has implemented in pyproj encapsulates our version of these "complex mathematical functions", although ours seem much simpler. The CRS object can hold things like the model of the Earth to use and other parameters defining the coordinate system like the reference longitude/latitude.
  2. When it comes to computing the coordinates of data on a CRS, the coordinates are typically on a cartesian plane so we have an X and a Y and points in between can be linearly interpolated. These work well as 1D xarray coordinates.
  3. These X and Y coordinates don't tell you all the information alone though so we need the CRS information. Xarray's current rasterio functionality adds this CRS definition as a string value in .attrs. The problem with using .attrs for this is most operations on the DataArray object will make this information disappear (ex. adding two DataArrays).
  4. In geoxarray I was going to try putting a pyproj CRS object in a DataArray's coordinates (.coords). I figured this would be good because then if you tried to combine two DataArrays on different CRSes, xarray would fail. Turns out xarray will just ignore the difference and drop the crs coordinate so that was no longer my "perfect" option. Additionally, to access the crs object would have to be accessed by doing my_data_arr.coords['crs'].item() because xarray stores the object as a scalar array.
  5. Xarray accessors, last time I checked, often have to be recreated when working with Dataset or DataArray objects. This has to do with how low-level xarray converts Variable objects to DataArrays. I didn't expect this when I started geoxarray and I'm not really sure how to continue now.

As for your use case(s), I'm wondering if an xarray accessor could work around some of the current limitations you're seeing. They could basically be set up like @dcherian described, but "arbitrary_function" could be accessed through x, y, z = my_data_arr.astro.world_coordinates(subpixels=4) or something. You could do my_data_arr.astro.wcs_parameters to get a dictionary of common WCS parameters stored in .attrs. The point being that the accessor could simplify the interface to doing these calculations and accessing these parameters (stored in .coords and .attrs) and maybe make changes to xarray core unnecessary.

@snowman2
Copy link
Contributor

I'm wondering if an xarray accessor could work around some of the current limitations you're seeing.

I think this sounds like a good idea.

@benbovy
Copy link
Member

benbovy commented Jun 7, 2021

This looks like a nice use case for the forthcoming Xarray's custom index feature.

How I see CRS/WCS-aware Xarray datasets with custom indexes:

  • A set of coordinate(s) and their attributes hold data or metadata relevant for public use and that could be easily (de)serialized

  • A custom index (CRSIndex or WCSIndex) provides CRS/WCS-aware implementations of common Xarray operations such as alignment (merge/concat) and data selection (sel), via Xarray.Index's equals, union, intersection and query methods added in Flexible indexes: add Index base class and xindexes properties #5102 and Internal refactor of label-based data selection #5322 (not yet ready for use outside of Xarray). Such custom index may also be used to hold some data that is tricky to propagate by other means, e.g., some internal information like "functional" coordinate parameters or a crs object. Xarray indexes should definitely provide more flexibility than coordinate data or attributes or accessor attributes for propagating this kind of information.

  • Xarray accessors may be used to extend Dataset/DataArray public API. They could use the information stored in the CRSIndex/WCSIndex, e.g., add a crs read-only property that returns the crs object stored in CRSIndex, or add some some extract_crs_parameters method to extract the parameters and store them in Dataset/DataArray attributes similarly to what @djhoese suggests in his comment above.

For this use case a possible workflow would then be something like this:

# create or open an Xarray dataset with  x, y, z "pixel" (possibly lazy) coordinates
# and set a WCS index
dataset = (
    xr.Dataset(...)
    .set_index(['x', 'y', 'z'], WCSIndex, wcs_params={...})
)

# select data using pixel coordinates
dataset.sel(x=..., y=..., z=...)

# select data using world coordinates (via the "astro" accessor,
# which may access methods/attributes of the WCS index)
dataset.astro.sel_world(x=..., y=..., z=...)

# return a new dataset where the x,y,z "pixel" coordnates are replaced by the "world" coordinates
# (again using the WCS index, and propagating it to the returned dataset)
world_dataset = dataset.astro.pixel_to_world(['x', 'y', 'z'])

# select data using world coordinates
world_dataset.sel(x=..., y=..., z=...)

# select data using pixel coordinates (via the "astro" accessor)
world_dataset.astro.sel_pixel(x=..., y=..., z=...)

# this could be reverted
pixel_dataset = world_dataset.astro.world_to_pixel(['x', 'y', 'z'])
assert pixel_dataset.identical(dataset)

# depending on the implementation in WCSIndex, would either raise an error
# or implicitly convert to either pixel or world coordinates
xr.merge([world_dataset, another_pixel_dataset])

@benbovy
Copy link
Member

benbovy commented Jun 7, 2021

We could also imagine

# returns a new dataset with both pixel and world (possibly lazy) coordinates
>>> new_dataset = dataset.astro.append_world({'x': 'xw', 'y': 'yw', 'z': 'zw'})

# so that we can directly select data either using the pixel coordinates...
>>> new_dataset.sel(x=..., y=..., z=...)

# ...or using the world coordinates
>>> new_dataset.sel(xw=..., yw=..., zw=...)

# the WCS index would be attached to both pixel and world dataset coordinates
>>> new_dataset
<xarray.Dataset>
Dimensions:  (x: 100, y: 100, z: 100)
Coordinates:
  * x                     (x) float64 ...
  * xw                    (x) float64 ...
  * y                     (y) float64 ...
  * yw                    (y) float64 ...
  * z                     (z) float64 ...
  * zw                    (z) float64 ...
Data variables:
    field                 (x, y, z) float64 ....
Indexes:
    x, y, z, zw, yw, zw    WCSIndex

@djhoese
Copy link
Contributor

djhoese commented Jun 7, 2021

@benbovy Thanks. This looks really promising and is pretty inline with what I saw geoxarray's internals doing for a user. In your opinion will this type of CRSIndex/WCSIndex work need #5322? If so, will it also require (or benefit from) the additional internal xarray refactoring you mention in #5322?

I can really see this becoming super easy for CRS-based dataset users where libraries like geoxarray (or xoak) "know" the common types of schemes/structures that might exist in the scientific field and have a simple .geo.set_index that figures out most of the parameters for .set_index by default.

@benbovy
Copy link
Member

benbovy commented Jun 7, 2021

In your opinion will this type of CRSIndex/WCSIndex work need #5322? If so, will it also require (or benefit from) the additional internal xarray refactoring you mention in #5322?

Yes, CRSIndex/WCSIndex will need to provide an implementation for the query method added in #5322. However, this could be "as simple as" internally using PandasIndex for each 1-d coordinate in case of raster/grid data, maybe with an additional check that the values provided to .sel are in the same CRS (for example in the case of advanced indexing where xarray.DataArray or xarray.Variable objects are passed as arguments).

What will be probably more tricky is to find some common way to handle CRS for various indexes (e.g., regular gridded data vs. irregular data), probably via some class inheritance hierarchy or using mixins.

I can really see this becoming super easy for CRS-based dataset users where libraries like geoxarray (or xoak) "know" the common types of schemes/structures that might exist in the scientific field and have a simple .geo.set_index that figures out most of the parameters for .set_index by default.

In case we load such data from a file/store, thanks to the Xarray backend system, maybe we won't even need a .geo.set_index but we'll be able to build the right index(es) when opening the dataset!

@djhoese
Copy link
Contributor

djhoese commented Jun 16, 2021

@benbovy I'm reading over the changes in #5322. All of this is preparing for the future, right? Is it worth it to start playing with these base classes (Index) in geoxarray or will I not be able to use them for a CRSIndex until more changes are done to xarray core? For example, none of this set_index for Index classes stuff you showed above is actually implemented yet, right?

@benbovy
Copy link
Member

benbovy commented Jun 22, 2021

@djhoese you're right, I thought it was better to do all the internal refactoring first but we maybe shouldn't wait too long before updating set_index and the DataArray / Dataset constructors so that you and others can start playing with custom indexes.

@djhoese
Copy link
Contributor

djhoese commented Sep 24, 2021

@benbovy It's been a while since I've looked into xarray's flexible index work. What's the current state of this work (sorry if there is some issue or blog I should be watching for)? Is it possible for me as a user to create my own index classes that xarray will accept?

@benbovy
Copy link
Member

benbovy commented Sep 24, 2021

@djhoese not yet but hopefully soon! Most of the work on explicit indexes is currently happening in #5692, which once merged (probably after the next release) will provide all the infrastructure for custom indexes. This is quite a big internal refactoring (bigger than I initially thought) that we cannot avoid as we're changing Xarray's core data model. After that, we'll need to update some public API (Xarray object constructors, .set_index(), etc.) so that Xarray will accept custom index classes. This should take much less work than #5692, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants