API design for pointwise indexing #475

jhamman · 2015-07-15T06:04:47Z

There have been a number of threads discussing possible improvements/extensions to xray indexing. The current indexing behavior for isel is orthogonal indexing - in other words, each coordinate is treated independently (see #214 and #411 for more discussion).

So the question: what is the best way to incorporate diagonal or pointwise indexing in xray? I see two main goals / applications:

support a simple form of numpy-style integer array indexing
support pointwise array indexing along coordinates via computation of nearest-neighbor indexes (I think this can also be thought of as a form of resampling)
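To make the distinction concrete, here is a minimal NumPy sketch (not xray code) contrasting the two behaviors on a 2x2 array. The variable names are illustrative only.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])  # axes: (x, y)
xi = [0, 1]
yi = [0, 1]

# Orthogonal (outer) indexing: every x index is paired with every y index.
orthogonal = arr[np.ix_(xi, yi)]

# Pointwise indexing: the i-th x index is paired with the i-th y index.
pointwise = arr[xi, yi]

print(orthogonal.shape)  # two-dimensional block
print(pointwise.shape)   # one value per (x, y) pair
```

The orthogonal result keeps both dimensions, while the pointwise result collapses them into a single new dimension with one entry per index pair.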

Input from @WeatherGod, @wholmgren, and @shoyer would be great.


shoyer · 2015-07-15T16:58:36Z

So, the good news is that once we figure out the API for pointwise indexing, I think the nearest-neighbor part could be as simple as supplying method='nearest'.

The challenge is that we want to go from a DataArray that looks like this:

In [4]: arr = xray.DataArray([[1, 2], [3, 4]], dims=['x', 'y'])

In [5]: arr
Out[5]:
<xray.DataArray (x: 2, y: 2)>
array([[1, 2],
       [3, 4]])
Coordinates:
  * x        (x) int64 0 1
  * y        (y) int64 0 1

To one that looks like this:

In [6]: xray.DataArray([1, 4], {'x': ('c', [0, 1]), 'y': ('c', [0, 1])}, dims='c')
Out[6]:
<xray.DataArray (c: 2)>
array([1, 4])
Coordinates:
    y        (c) int64 0 1
    x        (c) int64 0 1
  * c        (c) int64 0 1

Somehow, we need to figure out the name for the new dimension (c in this example).

My thought would be to have methods sel_points and isel_points that work similarly to sel and isel. This is straightforward if you already have xray 1D objects with a labeled dimension: arr.sel_points(x=x, y=y), where x and y are along the c dimension.

If you don't already have 1D xray objects, I suppose we could also allow arr.sel_points(x=('c', [0, 1]), y=('c', [0, 1])) or arr.sel_points('c', x=[0, 1], y=[0, 1]).

wholmgren · 2015-07-15T18:15:49Z

Seems like if your method is going to be named sel_points then points is a reasonable dimension name. Maybe support a name kwarg?

One thing to keep in mind is that for many of us the "nearest-neighbor" part isn't really method='nearest', but instead more like, method='ingridcell' where the grid cell might be roughly square or might be something pretty different. At least that's how I think of my data. Maybe what I really want is some other kind of more explicit support for gridded data, although my thoughts on this are too half-baked to clearly write down. I thought there was another issue related to this, but I couldn't find it.

shoyer · 2015-07-15T18:22:03Z

Seems like if your method is going to be named sel_points then points is a reasonable dimension name.

Yes, this is a reasonable choice for the case of 1d indexers.

Maybe support a name kwarg?

This is also a good idea, though I would probably call the parameter dim, not name.

One thing to keep in mind is that for many of us the "nearest-neighbor" part isn't really method='nearest', but instead more like, method='ingridcell' where the grid cell might be roughly square or might be something pretty different.

Indeed. As a start, we should be able to do nearest neighbor lookups with a tolerance soon -- I have a pandas PR that should add some of that basic functionality (pandas-dev/pandas#10411). In the long term, it would be useful to have some sort of representation of grid cells in the index itself, possibly something similar to IntervalIndex (pandas-dev/pandas#8707).
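As a rough illustration of what a nearest-neighbor lookup with a tolerance could look like for a sorted 1d coordinate, here is a NumPy-only sketch. The helper nearest_index is hypothetical (it is not the pandas implementation), and returning -1 for "no match within tolerance" is an assumption made for this example.

```python
import numpy as np

def nearest_index(coord, targets, tolerance=None):
    """Hypothetical helper: position in sorted 1-D `coord` nearest to each target.

    Targets farther than `tolerance` from every coordinate get -1.
    Assumes `coord` is sorted ascending and has at least two entries.
    """
    coord = np.asarray(coord, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Right-hand candidate neighbor for each target, clipped to a valid range.
    right = np.clip(np.searchsorted(coord, targets), 1, len(coord) - 1)
    left = right - 1
    # Keep whichever of the two candidates is closer to the target.
    idx = np.where(targets - coord[left] <= coord[right] - targets, left, right)
    if tolerance is not None:
        # Mark matches that are too far away as missing.
        idx = np.where(np.abs(coord[idx] - targets) <= tolerance, idx, -1)
    return idx

lon = np.array([0.0, 1.0, 2.0, 3.0])
idx = nearest_index(lon, [0.2, 1.9, 7.0], tolerance=0.5)
print(idx)
```

A grid-cell-aware lookup (the method='ingridcell' idea above) would need cell bounds rather than a single tolerance, which this sketch does not attempt.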

jhamman · 2015-07-15T23:51:14Z

I like:

DataArray.isel_points(x=[1, 2, 3], y=[0, 1, 2], dim='points')

I also like the nearest-neighbor / resample API of:

DataArray.sel_points(lon=[-123.25, -140.0, 72.5], lat=[45.0, 72.25, 65.75],
                     dim='points', method='nearest')

How do we want to do the nearest-neighbor selection? The simplest case would be to follow the cKDTree example from #214. However, when you're using lat/lon coordinates, it is usually best to map them from spherical coordinates to Cartesian coordinates (see here for a simple example using cKDTree). Is that a road we want to go down here?

Further along that subject, but not directly related: has anyone used pyresample?
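For reference, a small standard-library sketch of why the spherical-to-Cartesian mapping matters: two points on either side of the dateline are close on the sphere but far apart in raw degrees. The function names are illustrative, and a unit sphere is assumed.

```python
import math

def latlon_to_xyz(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a point on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def chord(p, q):
    """Straight-line (Euclidean) distance between two 3-D points."""
    return math.dist(p, q)

query = latlon_to_xyz(0.0, 179.9)
near = latlon_to_xyz(0.0, -179.9)  # 0.2 degrees away, across the dateline
far = latlon_to_xyz(0.0, 170.0)    # 9.9 degrees away

# A naive difference in degrees would rank `far` (9.9) ahead of `near` (359.8);
# the chord distance ranks them correctly.
near_is_closer = chord(query, near) < chord(query, far)
print(near_is_closer)
```

Chord length grows monotonically with great-circle distance, so a Euclidean KD-tree built on these x, y, z values returns the same nearest neighbor as a great-circle search would.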

wholmgren · 2015-07-16T00:42:08Z

Unidata also has a blog post benchmarking cKDTree and other methods and concludes "Your Mileage May Vary". I'd probably just go with a KDTree, but it's something to be aware of.

rabernat · 2015-07-16T01:09:12Z

There is a great kdtree-based geospatial resampling package you might want to consider building on:
https://github.com/pytroll/pyresample
It is fast (multithreaded) and has support for different map projections.

rabernat · 2015-07-16T01:15:25Z

Maybe this is off topic, but are there plans to support more general spatial resampling / regridding? Like if I have two DataArrays a and b with different spatial coords, it would be great to be able to do

c = a.regrid_like(b)

This is a pretty common practice in climate science, since different datasets are provided on different grids with different resolutions.

shoyer · 2015-07-16T02:47:30Z

I agree that regridding and resampling would be very nice, and pyresample looks like a decent option. I have no immediate plans to implement these features but contributions would be very welcome.

For n-dimensional indexing, kdtree seems sensible, especially if we can cache it on the coordinates. We probably want an explicit API for methods that add new coordinates -- something like ds.set_kdtree(['latitude', 'longitude']).

jhamman · 2015-07-16T15:45:59Z

As a first step, I'll volunteer (unless someone else is more keen on doing this work) to put together a pull request for isel_points.

After that, we'll want to add the sel_points and kdtree API, which will depend on isel_points.

Later on, I'm also interested in regridding and resampling between grids - let's open another issue for that. Maybe we use pyresample for that.

shoyer · 2015-07-16T15:59:18Z

@jhamman it would be great if you could put together a PR for isel_points. The main complexity is that you'll want to write a version that also works with dask arrays. Let me know if that part is confusing, I can certainly help with that.

As for sel_points, we only need a kdtree if the underlying coordinates are 2D. If latitude and longitude (for example) are 1d, we can just use the existing machinery for remapping label based indexers to integers. This should be pretty straightforward, following the example of isel:
https://github.com/xray/xray/blob/v0.5.1/xray/core/dataset.py#L1024
https://github.com/xray/xray/blob/v0.5.1/xray/core/indexing.py#L157

jhamman · 2015-07-17T06:54:37Z

Good point on the dask array business. From the dask docs:

Dask.array supports most of the NumPy slicing syntax.
...
It does not currently support the following:

Slicing one dask.array with another x[x > 0]
Slicing with lists in multiple axes x[[1, 2, 3], [3, 2, 1]]

Both of these are straightforward to add though. If you have a use case then raise an issue.

So, from browsing the closed dask issues, it seems like dask's support for multi-dimensional slicing and indexing is similar to xray's. This throws a bit of a wrench in my plan for how I was going to implement isel_points, as I had not fully considered the dask array complexities.

I'll have to put a bit more thought into this. Any suggestions on how to index the dask array without looping through individual points would be great.

shoyer · 2015-07-17T23:05:59Z

Any suggestions on how to index the dask array without looping through individual points would be great.

For now, I actually think selecting individual points and then concatenating the resulting arrays together would be a reasonable start. Yes, it's kind of slow, but once you have a first draft put together that way with the right API we can optimize later.
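A minimal sketch of that "select each point, then concatenate" approach, written against a plain NumPy array standing in for a lazy array. The helper name isel_points_loop and the (time, x, y) layout are assumptions for illustration, not the eventual xray implementation.

```python
import numpy as np

def isel_points_loop(data, x_idx, y_idx):
    """Hypothetical helper: select (x, y) pairs from a (time, x, y) array, one point at a time."""
    # Each scalar selection keeps the leading time axis: shape (time,).
    pieces = [data[:, i, j] for i, j in zip(x_idx, y_idx)]
    # Stack the pieces along a new trailing "points" axis: shape (time, points).
    return np.stack(pieces, axis=-1)

data = np.arange(24).reshape(2, 3, 4)
result = isel_points_loop(data, [0, 2, 1], [3, 0, 1])

# For an in-memory array, vectorized indexing gives the same values in one step.
vectorized = data[:, [0, 2, 1], [3, 0, 1]]
print(result.shape)
```

The loop only relies on basic integer indexing and stacking, which is the reason it is a plausible first draft for array types that lack multi-axis list indexing.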

jhamman · 2015-07-27T20:31:03Z

Now that the isel_points method is implemented, I think it makes sense to discuss the sel_points method in a bit more detail. The main outstanding question is: do we want to support spherical nearest-neighbor mapping? The use case is searching for the nearest neighbor using longitudes and latitudes. This example shows how to do this by projecting the coordinates onto a sphere. If we go this route, which is probably the most common use case here, we are committing to the coordinates being latitudes and longitudes. Maybe it is better to use a method='spherical' keyword to fall into this path.

shoyer · 2015-07-27T21:34:42Z

I would start with the easiest case -- lookups of 1d orthogonal arrays, e.g., grid.sel(latitude=stations.latitude, longitude=stations.longitude, method='nearest'). This would very straightforwardly leverage our current machinery.
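In NumPy terms, the 1d case amounts to a nearest lookup along each axis independently, followed by pointwise integer indexing. This is only a sketch under that reading; nearest_1d is a hypothetical brute-force helper and the grid values are made up.

```python
import numpy as np

grid_lat = np.array([40.0, 41.0, 42.0])
grid_lon = np.array([-125.0, -124.0, -123.0, -122.0])
temperature = np.arange(12.0).reshape(3, 4)  # axes: (latitude, longitude)

station_lat = np.array([40.2, 41.9])
station_lon = np.array([-122.1, -124.6])

def nearest_1d(coord, targets):
    """Hypothetical helper: index of the nearest `coord` entry for each target."""
    # Brute-force distance table of shape (len(coord), len(targets)).
    return np.abs(coord[:, None] - targets[None, :]).argmin(axis=0)

# Labels become integer positions, one axis at a time.
lat_idx = nearest_1d(grid_lat, station_lat)
lon_idx = nearest_1d(grid_lon, station_lon)

# The integer positions are then applied pointwise, one value per station.
at_stations = temperature[lat_idx, lon_idx]
print(at_stations)
```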

For 2D lookups, we need a KDTree. Here are some API ideas, just tossing things around...

>>> ds
<xray.Dataset>
Dimensions:      (x: 4, y: 5)
Coordinates:
    latitude     (x, y) float64 0.49 0.5682 -0.3541 -0.9305 -0.9669 0.01558 ...
    longitude    (x, y) float64 0.3758 1.429 -1.698 -1.344 0.5237 0.6152 ...
  * x            (x) int64 0 1 2 3
  * y            (y) int64 0 1 2 3 4
Data variables:
    temperature  (x, y) float64 0.5735 -0.4871 0.4708 0.4907 -0.3318 0.2883 ...

# perhaps set_ndindex is a better name?
>>> ds = ds.set_kdtree(['latitude', 'longitude'], name='latlon_index', method='spherical')
>>> ds
<xray.Dataset>
Dimensions:      (x: 4, y: 5)
Coordinates:
    latitude     (x, y) float64 0.49 0.5682 -0.3541 -0.9305 -0.9669 0.01558 ...
    longitude    (x, y) float64 0.3758 1.429 -1.698 -1.344 0.5237 0.6152 ...
  * latlon_index (x, y) float64 (0.49, 0.3758) (0.5682, 1.429) ...
  * x            (x) int64 0 1 2 3
  * y            (y) int64 0 1 2 3 4
Data variables:
    temperature  (x, y) float64 0.5735 -0.4871 0.4708 0.4907 -0.3318 0.2883 ...

result = ds.sel_points(latitude=other.latitude, longitude=other.longitude, method='nearest')
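To show what the 2D lookup has to compute, here is a brute-force NumPy sketch that a KD-tree would replace at scale. The nearest_cell helper is hypothetical, the data are random, and distances are plain Euclidean in (latitude, longitude) space with no spherical transform.

```python
import numpy as np

rng = np.random.default_rng(0)
latitude = rng.normal(size=(4, 5))     # 2-D coordinate on dims (x, y)
longitude = rng.normal(size=(4, 5))
temperature = rng.normal(size=(4, 5))

def nearest_cell(lat2d, lon2d, lat, lon):
    """Hypothetical helper: (x, y) integer indices of the cell closest to each query point."""
    # Flatten the grid into a list of points, as a KD-tree would require.
    grid = np.column_stack([lat2d.ravel(), lon2d.ravel()])   # (ncell, 2)
    query = np.column_stack([lat, lon])                      # (npoint, 2)
    # Brute-force squared distances between every query point and every cell.
    d2 = ((query[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    flat = d2.argmin(axis=1)
    # Map flat positions back onto the original (x, y) dimensions.
    return np.unravel_index(flat, lat2d.shape)

# Querying the exact location of cell (2, 3) should return that cell.
xi, yi = nearest_cell(latitude, longitude, [latitude[2, 3]], [longitude[2, 3]])
selected = temperature[xi, yi]
print(xi, yi)
```

The last step, mapping a flat tree index back to integer positions on x and y and then indexing pointwise, is where a method like isel_points would come in.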

shoyer · 2015-07-28T06:43:26Z

I started playing around with making an array wrapper for KDTree this evening:
https://gist.github.com/shoyer/ae30a1200f749c84b4c4

I think it has most of the necessary indexing machinery and you can put it in an xray.Dataset like an array. You could easily imagine hooking in a transform argument to KDTreeIndex to handle projection. But of course it hasn't been hooked up to any API yet.

jhamman · 2015-07-29T05:44:35Z

Very nice. This is the sort of API I was hoping for. It will be a while before I can come back around on this. In the meantime, if someone else wants to take the sel_points method on, that is fine by me.

shoyer · 2015-08-01T02:29:37Z

PR #507 implements my suggested 1d version of sel_points. Maybe we also want reindex_points, i.e., pointwise indexing by label that is guaranteed to succeed even if some labels are missing?

shoyer · 2016-08-23T18:05:09Z

A few recent developments relevant to this issue:

Indexing with alignment and broadcasting #974 discusses how we could add multi-dimensional indexing with broadcasting. This would subsume the need for separate methods like sel_points and also allow indexing grids with grids.
Multi-index levels as coordinates #947 adds first-class support for MultiIndex coordinates to xarray. This is a good model for how a KDTree could work.

So I'm now thinking an API more like this:

>>> ds = ds.set_kdtree(spatial_index=['latitude', 'longitude'])

>>> ds
<xray.Dataset>
Dimensions:        (x: 4, y: 5)
Coordinates:
  * x              (x) int64 0 1 2 3
  * y              (y) int64 0 1 2 3 4
  * spatial_index  (x, y) KDTree
    - latitude     (x, y) float64 0.49 0.5682 -0.3541 -0.9305 -0.9669 0.01558 ...
    - longitude    (x, y) float64 0.3758 1.429 -1.698 -1.344 0.5237 0.6152 ...
Data variables:
    temperature    (x, y) float64 0.5735 -0.4871 0.4708 0.4907 -0.3318 0.2883 ...

>>> result = ds.sel(latitude=other.latitude, longitude=other.longitude,
...                 method='nearest')

For building a tree with lat/lon remapped to spherical coordinates, we should write a method that converts lat and lon arrays into a tuple of x, y, z arrays (e.g., using apply_ufunc from #964). Then this looks like ds.set_kdtree(spatial_index=latlon_to_xyz(ds.latitude, ds.longitude)). Conceivably, we could add some sugar for this, e.g., ds.geo.set_kdtree(spatial_index=['latitude', 'longitude']).

burnpanck · 2016-10-25T22:44:30Z

Without following the discussion in detail, what is the status here? In particular, I would like to do pointwise selection on multiple 1D coordinates using multidimensional indexer arrays. I can do this with the current isel_points:

construct the multidimensional indexers
flatten them
create a corresponding MultiIndex
apply the flattened indexers using isel_points, and assign the multi-index as the new dimension
use unstack on the newly created dimension
The first three points can be somewhat simplified by instead putting all of the multidimensional indexer into a Dataset and then stack it to create consistent flat versions and their multi-index.
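The flatten / select / unstack procedure above can be sketched with plain NumPy arrays, where ravel and reshape stand in for stack and unstack. This is an analogy under those assumptions, not xarray code.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)   # dims (x, y)
ix = np.array([[0, 1], [2, 2]])     # 2-D indexer for x, dims (a, b)
iy = np.array([[3, 0], [1, 2]])     # 2-D indexer for y, dims (a, b)

# Flatten the indexers, select pointwise in 1-D, then restore the shape.
flat = arr[ix.ravel(), iy.ravel()]  # plays the role of isel_points
unstacked = flat.reshape(ix.shape)  # plays the role of unstack

# NumPy's multidimensional integer indexing gives the same answer directly.
direct = arr[ix, iy]
print(unstacked)
```

The result takes the shape of the indexers rather than the shape of the indexed array, which is the behavior being asked for here.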

Given this conceptually easy but somewhat tedious procedure, couldn't that be something that could quite easily be implemented into the current isel_points? Would a PR along that direction have a chance of being accepted?

shoyer · 2016-10-25T22:49:14Z

@burnpanck I don't think you need to do the flattening/multi-index bit. I believe isel_points/sel_points should just work for you already.

At this point we're really just talking about design refinements (I'll rename the topic).

burnpanck · 2016-10-25T23:13:37Z

Really? I get a ValueError: Indexers must be 1 dimensional (xarray/core/dataset.py:1031 in isel_points(self, dim, **indexers)) when I try. That is xarray 0.8.2, in fact from my fork recently cloned (~2-3 weeks ago), where I changed one or two asarray to asanyarray to work with units. Was there a recent change in this area?
EDIT: xarray/core/dataset.py looks very similar also here on master, and there are quite a few lines hinting that really only 1D indexers are supported.

shoyer · 2016-10-25T23:19:03Z

@burnpanck Nevermind, you are correct! I misread your comment. This cannot be done currently.

You certainly could try to put this into isel_points, and if you can do it in a clean fashion I am open to accepting it, but keep in mind that the method is going to go away when we finally get around to implementing #974. Work on #974 would probably be more productive, ultimately.

WeatherGod · 2017-11-07T17:11:49Z

So, what has become the consensus for performing regridding/resampling? I see a lot of suggestions, but I have no sense of what is mature enough to use in production-level code. I also haven't seen anything in the documentation about this topic, even if it just refers people to another project.

jhamman · 2017-11-07T17:28:17Z

@WeatherGod

Short answer: we don't have a tool that is production ready.

Longer answer: xESMF may be the best prospect in the near term. There are two main issues with its current implementation. 1) Lack of out-of-core abilities / integration with dask, and 2) lack of a test suite. Conceptually, it would be great to leverage the low-level remapping tools of ESMPy so I think this is a nice way to move forward as a community but I think everyone agrees it isn't ready for use in any sort of production environment.

This issue introduces the concept of point-wise indexing using nearest neighbor lookups on ND coordinates. @shoyer has an example implementation here but it hasn't moved forward in quite a while.

WeatherGod · 2017-11-07T18:29:12Z

Yeah, we need to move something forward, because the main benefit of xarray is the ability to manage datasets from multiple sources in a consistent way. And data from different sources will almost always be in different projections.

My current problem that I need to solve right now is that I am ingesting model data that is in a LCC projection and ingesting radar data that is in a simple regular lat/lon grid. Both dataset objects have latitude and longitude coordinate arrays, I just need to get both datasets to have the same lat/lon grid.

I guess I could continue using my old scipy-based solution (using map_coordinates() or RectBivariateSpline), but at the very least, it would make sense to have some documentation demonstrating how one might go about this very common problem, even if it is showing how to use the scipy-based tools with xarrays. If that is of interest, I can see what I can write up after I am done with my immediate task.

shoyer · 2017-11-07T18:31:30Z

Yes, a documentation example would be greatly appreciated. We have been making progress in this direction (especially with the new vectorised indexing support) but it has been slow going to do it right.


jhamman · 2018-01-02T05:04:08Z

ping @stefanomattia who seems to be interested in the KDTreeIndex concepts described in this issue.

rabernat · 2018-01-02T15:43:45Z

Subscribers to this thread will probably be interested in @JiaweiZhuang's recent progress on xESMF. That package is now a viable solution for 2D regridding of xarray datasets.
https://github.com/JiaweiZhuang/xESMF

stefanomattia · 2018-01-03T09:23:47Z

Thanks @jhamman, I'd love to contribute! I'm not that confident in my Python skills, but maybe with a little guidance? Let me know if or how I could help.

jhamman · 2018-01-03T18:14:51Z

@stefanomattia - I'd be happy to provide guidance and even to contribute to some of the development. Based on your blog post, I think you may be well on your way.

shoyer · 2018-01-03T18:16:29Z

@jhamman @stefanomattia can you share a link to this blog post? :)

jhamman · 2018-01-03T18:18:32Z

http://notes.stefanomattia.net/2017/12/12/The-quest-to-find-the-closest-ground-pixel/

stefanomattia · 2018-01-04T10:04:05Z

That post must look a bit amateurish, I reckon, but if you guys think it could be a starting point for a KD-tree search implementation in xarray, I would be thrilled to contribute! There is no learning without trying, after all. I could start from #475 (comment). @jhamman maybe you could send me an email with a few requirements?

benbovy · 2018-01-05T10:38:52Z

Note that it will probably be easier to implement such KDTreeIndex after having refactored indexes and multi-indexes in xarray (see #1603). I think this refactoring would represent a good amount of work, though, so maybe we can do it after if you don't want to wait too long for the KD-Tree feature?

duncanwp · 2018-01-09T16:01:16Z

Further to the comment I made in the related issue #486, I've now taken a simplified version of the collocation approach in CIS and created a stand-alone package which works with xarray objects: https://github.com/cistools/collocate.

This works essentially the same as the nice example shown in the above blog, with some key differences:

The points within a certain distance (tolerance) of each sample point can be aggregated or selected from using the built-in kernels, allowing fast operations over many sample points.
The horizontal distance constraint can be supplemented with constraints in other dimensions (such as time or altitude).
The transform from Cartesian to Euclidean coordinates is not needed as we use our own KD-Tree implementation which builds haversine rectangles. Depending on use cases this isn't always the fastest approach, but it does sidestep some nasty dateline issues.
In the case where only the nearest points in the horizontal are needed, the collocation falls back to the fast single-point lookup.
The KD-Tree implementation is (relatively well) separated, so it could easily be switched out for cKDTree or pyresample implementations.
There are some tests too, although no docs yet.

I'll try and put together a notebook building on the above blogpost so that the similarities and differences are a bit clearer.

I'm not familiar enough with xarray indexing to be able to say how well this would fit inside xarray, but hopefully it will be useful before we're able to crack KD-MultiIndexes!

stale · 2019-12-10T16:07:34Z

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

shoyer · 2020-05-24T06:18:00Z

@JimmyGao0204 I moved your comment to a new issue: #4090

benbovy · 2022-09-28T11:55:04Z

There hasn't been much activity here for quite some time.

Meanwhile, there has been the development of the xoak package that supports point-wise indexing of Xarray objects with various indexes (either generic like scipy.spatial.cKDTree or more specific like pys2index's S2PointIndex for lat/lon point data). xoak leverages Xarray's advanced indexing capabilities and supports selection using both coordinates and indexers with an arbitrary number of dimensions.

With the forthcoming Xarray release, it will be possible to create and assign custom indexes to DataArray / Dataset objects. The plan for xoak is then to just provide some custom indexes so that we can perform point-wise selection directly with Dataset.sel() instead of Dataset.xoak.sel().

benbovy · 2023-08-23T12:37:23Z

Can we close this issue and redirect the reader to https://github.com/xarray-contrib/xoak or #7041?

Or is there still a need to extend Xarray's API for supporting pointwise indexing, i.e., something that cannot be done with .isel or with .sel + a custom Xarray index?

jhamman changed the title ~~api design for pointwise indexing~~ API design for pointwise indexing Jul 15, 2015

shoyer mentioned this issue Jul 16, 2015

added xray netcdf read/write test Unidata/python-workshop#49

Closed

jhamman mentioned this issue Jul 17, 2015

Slicing with lists in multiple axes dask/dask#433

Closed

jhamman mentioned this issue Jul 20, 2015

Add pointwise indexing via isel_points method #481

Merged

shoyer mentioned this issue Jul 21, 2015

API for multi-dimensional resampling/regridding #486

Open

shoyer mentioned this issue Aug 1, 2015

Add sel_points for point-wise indexing by label #507

Merged

jhamman mentioned this issue Oct 2, 2015

Support Two-Dimensional Coordinate Variables #605

Closed

shoyer added the topic-indexing label Aug 23, 2016

jhamman mentioned this issue Jul 27, 2017

ENH: points coord from isel/sel_points should be a MultiIndex #1493

Closed

jhamman mentioned this issue Sep 18, 2017

ESMPy? jhamman/xmap#1

Closed

benbovy mentioned this issue Mar 4, 2018

Extend xarray with custom "coordinate wrappers" #1961

Closed

rsignell-usgs mentioned this issue Apr 23, 2019

Use a tree algorithm for finding closest point to extract time series reproducible-notebooks/COAWST-ROMS_Dashboards#4

Open

rabernat mentioned this issue May 14, 2019

Interpolation of geo-referenced data pangeo-data/pangeo#629

Closed

JiaweiZhuang mentioned this issue Nov 5, 2019

Handle different grid coordinate formats and naming JiaweiZhuang/xESMF#74

Open

stale bot added the stale label Dec 10, 2019

dcherian removed the stale label Dec 10, 2019

TomNicholas mentioned this issue Apr 7, 2020

Explicit indexes in xarray xarray-contrib/pint-xarray#1

Open

shoyer mentioned this issue May 24, 2020

Error with indexing 2D lat/lon coordinates #4090

Closed

pydata deleted a comment from JimmyGao0204 May 24, 2020

dcherian mentioned this issue Jun 12, 2020

Cell Boundary aware operations xarray-contrib/cf-xarray#10

Open
