Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to optionally return distances with .sel #41

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

kthyng
Copy link

@kthyng kthyng commented Jan 19, 2023

Changes include:

  • accessor optionally takes in a "distances_name" str with which to name the new variable in results to store calculated distances from query()
  • query() now returns distances
  • distances are returned (always?) in radians from the trees, but are converted to kilometers in query.
  • If a DataArray was input to .sel with "distances_name" input, a Dataset will be returned to contain the new variable
  • added example to intro notebook
  • added a test to test_accessor
  • also fixed an import in conftest

Notes:

  • I did not test this with the dask approach

Changes include:
* accessor optionally takes in a "distances_name" str with which to name the new variable in results to store calculated distances from query()
* query() now returns distances
* distances are returned (always?) in radians from the trees, but are converted to kilometers in query.
* If a DataArray was input to .sel with "distances_name" input, a Dataset will be returned to contain the new variable
* added example to intro notebook
* added a test to test_accessor
* also fixed an import in conftest

Notes:
* I did not test this with the dask approach
needed some updates in precommit too
@kthyng
Copy link
Author

kthyng commented Jan 19, 2023

Sorry for all the additions. I have continued to use the code in the project I'm working on and have realized more about the existing code and better approaches as I go.

@kthyng
Copy link
Author

kthyng commented Jan 20, 2023

I forgot to say this was meant to address #26.

@kthyng
Copy link
Author

kthyng commented Jan 23, 2023

@benbovy or @willirath are either of you available to take a look at this? I'm hoping to be able to use these changes in a version update of xoak with a package of mine that uses xoak. Thanks for your time.

Copy link
Member

@benbovy benbovy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kthyng! That's a very useful addition.

I've left two minor comments below.

More generally, I'm slightly concerned about the API. Alternatively, we could keep Dataset.xoak.sel() API as close as possible to Xarray's Dataset.sel (I plan to eventually refactor xoak in order to reuse Xarray's new, flexible indexes capabilities so that we could use directly Dataset.sel or DataArray.sel) and return both indices and distances via another method (e.g., Dataset.xoak.query, as suggested in #32)?

src/xoak/accessor.py Outdated Show resolved Hide resolved
src/xoak/accessor.py Outdated Show resolved Hide resolved
@kthyng
Copy link
Author

kthyng commented Jan 23, 2023

@benbovy I'm completely open to the API and detailed choices in the code if you want to change anything. I just need access to the distances! I am aware of the new indices in xarray but haven't seen them in action to see how they work or how to use them yet.

Copy link
Contributor

@willirath willirath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kthyng, first of all, thanks for looking into this. I'm really happy this package is seeing adoption! (Even more so as I have a few use cases for the distances myself.)

I have left a few small comments the major of which agrees with @benbovy on the question of units. I think I'd follow his suggestion to just expose whatever unit the distances provided by the indexer come in.

Also, feel free to ignore whichever of my comments overlap with @benbovy's remarks and follow his advice instead.

@@ -11,7 +11,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we wanted to leave these notebooks un-executed in the repo? (@benbovy I vaguely remember that this was the case?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should choose on this I suppose! Alternatively I've been switching my docs over from already-run notebooks to using myst-nb to be compiled by readthedocs, which I like a lot better. I can help switch this over to that if you want to give it a try.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for using myst-nb, if you want to give it a try that would be awesome!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benbovy I started to switch over to myst-nb to execute the docs notebooks, but I see that you are already doing that using nbsphinx! I actually didn't realize you could do that with nbsphinx. I can still change it over to myst-nb if you want, but I am not sure if there is any benefit to it over nbsphinx.

src/xoak/accessor.py Outdated Show resolved Hide resolved
src/xoak/accessor.py Outdated Show resolved Hide resolved
src/xoak/accessor.py Show resolved Hide resolved
src/xoak/accessor.py Outdated Show resolved Hide resolved
src/xoak/tests/conftest.py Show resolved Hide resolved
src/xoak/tests/test_accessor.py Outdated Show resolved Hide resolved
@kthyng kthyng requested a review from willirath January 23, 2023 19:23
@benbovy
Copy link
Member

benbovy commented Jan 24, 2023

I am aware of the new indices in xarray but haven't seen them in action to see how they work or how to use them yet.

Yes this is pretty new and still largely undocumented. Xoak is a good place to try this feature, but we need a smooth transition so I don't expect it to be straightforward.

I'm completely open to the API and detailed choices in the code if you want to change anything.

Sorry I could have provided more details in my previous comment. I was thinking about this alternative:

  • Keep Dataset.xoak.sel(indexers=None, **indexers_kwargs) unchanged so that it will be easier to later switch to using Dataset.sel((indexers=None, **indexers_kwargs) with a custom Xarray index implemented in xoak. We might want eventually provide custom options to Dataset.sel but it is still not clear how would look like the API (Pass arbitrary options to sel() pydata/xarray#7099).

  • Add a Dataset.query(indexers=None, **indexers_kwargs) that returns both distances and indices as Xarray DataArray and Dataset objects (Expose indexers #32).

This would look like this (reusing the examples in the documentation):

>>> ds_mesh
# <xarray.Dataset>
# Dimensions:  (x: 100, y: 100)
# Coordinates:
#     lat      (x, y) float64 ...
#     lon      (x, y) float64 ...
# Dimensions without coordinates: x, y
# Data variables:
#     field    (x, y) float64 ...

>>> ds_trajectory
# <xarray.Dataset>
# Dimensions:    (trajectory: 30)
# Dimensions without coordinates: trajectory
# Data variables:
#     latitude   (trajectory) ...
#     longitude  (trajectory) ...

>>> ds_mesh.xoak.set_index(['lat', 'lon'], 'sklearn_geo_balltree')

>>> distances, indices = ds_mesh.xoak.query(
...    lat=ds_trajectory.latitude,
...    lon=ds_trajectory.longitude,
... )

>>> distances
# <xarray.DataArray (trajectory: 30)>
# ...
# Dimensions without coordinates: trajectory

>>> indices
# <xarray.Dataset>
# Dimensions:  (trajectory: 30)
# Dimensions without coordinates: trajectory
# Data variables:
#     x        (trajectory) int64 ...
#     y        (trajectory) int64 ...

Having a query method that returns a (distances, indices) tuple is consistent with scipy.spatial.KDTree.query(), sklearn.neighbors.KDTree.query(), sklearn.neighbors.BallTree.query(), etc.

The distances and indices Xarray objects returned here are a bit confusing as they don't have the same type and structure, but it allows doing:

>>> results = ds_mesh.isel(indices).assign(distances=distances)
>>> results
# <xarray.Dataset>
# Dimensions:  (trajectory: 30)
# Coordinates:
#     lat        (trajectory) float64 ...
#     lon        (trajectory) float64 ...
# Dimensions without coordinates: trajectory
# Data variables:
#     field      (trajectory) float64 ...
#     distances  (trajectory) float64 ...

Now, this is clearly more verbose than what is proposed in this PR:

>>> ds_mesh.xoak.sel(
...     lat=ds_trajectory.latitude,
...     lon=ds_trajectory.longitude,
...     distances_name="distances"
... )

Also, because Dataset.xoak.query() doesn't return a single Xarray object it wouldn't be easy to chain it with other Dataset / DataArray methods (it is possible using pipe -> merge but this is less straightforward than just using Dataset.xoak.sel()). So maybe we could support returning the distances via both .xoak.query() and xoak.sel() after all? What do you think @kthyng @willirath?

@benbovy
Copy link
Member

benbovy commented Jan 24, 2023

So maybe we could support returning the distances via both .xoak.query() and xoak.sel() after all?

If we're going to support both, I think we can either merge this PR and then refactor #32 or the other way around.

@kthyng
Copy link
Author

kthyng commented Jan 24, 2023

@benbovy I like having the indices and distances explicitly available with

distances, indices = ds_mesh.xoak.query( ...

as you suggest. One thing I didn't like about my change is that when a DataArray is input but distances are returned, a Dataset is returned to contain both variables. It's been messing up my code here and there.

Maybe it would be better to keep it simple with .sel (no changes) and if you want to have the distances and indices explicitly, you have to do the two step process you laid out of

distances, indices = ds_mesh.xoak.query(
...    lat=ds_trajectory.latitude,
...    lon=ds_trajectory.longitude,
... )
results = ds_mesh.isel(indices).assign(distances=distances)

But I do think it should be up to you @benbovy and @willirath so let me know what you prefer and I can try to implement, or you can go ahead with it, either way.

@willirath
Copy link
Contributor

I think leaving .xoak.sel() unchanged and as close to the standard xarray .sel() API as possible is what we should do. The proposal to have .query() returning, both, distances and indices sounds like the best way to implement this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants