Add ability to change underlying array type #4234
Comments
cc @quasiben
@jacobtomlinson Have you considered implementing an accessor for this and any other CuPy-specific functionality? This is the approach we are taking with Pint integration (pint-xarray), and I wonder if it might also be a good approach for CuPy.
@jthielen Something like this?

```python
import xarray as xr
import cupy as cp

@xr.register_dataset_accessor("cupy")
class CupyAccessor:
    def __init__(self, xarray_obj):
        self._obj = xarray_obj

    def to_cupy(self):
        """Convert all data arrays to cupy."""
        for var in self._obj.data_vars:
            self._obj.data_vars[var].data = cp.asarray(self._obj.data_vars[var].data)
        return self._obj
```

Which would then be used like this:

```python
import xarray as xr
import cupy as cp

ds = xr.open_mfdataset("/path/to/files/*.nc")
gds = ds.cupy.to_cupy()
```
@jacobtomlinson Indeed! A similar approach. Though, to have those accessors be registered with just importing…
This does sound like an option. However there are many situations within xarray where we need explicit cupy logic. Converting back to numpy before plotting is one example. I don't think that kind of logic can live in an accessor. Unless you expect users to do something like this.

```python
import xarray as xr
import cupy as cp

ds = xr.tutorial.load_dataset("air_temperature")
gds = ds.cupy.to_cupy()

# Do some manipulation on the GPU

# Grab a time slice
time_slice = gds.air.isel(time=0)

time_slice.cupy.to_numpy().plot()  # I would hope that time_slice.plot() would work
```

I would be tempted to say that cupy is more like dask in that it is trying to implement the numpy array interface exactly, but in a different paradigm (distributed, GPU, etc). And of course there are limitations and divergences because of the different paradigm. However it's not the same as pint, which is trying to extend numpy and add more functionality. So it makes sense to me that accessors for pint exist to add this extra functionality to xarray. But at least in theory cupy should be a drop-in replacement for numpy, so I don't expect a huge amount of logic will live in an accessor, compared to the amount of compatibility code that will need to go into xarray itself.
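The "convert back to numpy before plotting" step discussed above can be sketched as a small host-transfer helper. This is illustrative only: `to_numpy_data` is a hypothetical name, not xarray or cupy API, though `cp.asnumpy` is cupy's documented device-to-host conversion.

```python
import numpy as np

def to_numpy_data(data):
    """Return a NumPy array whether `data` lives on the host (numpy) or GPU (cupy)."""
    try:
        import cupy as cp
        if isinstance(data, cp.ndarray):
            return cp.asnumpy(data)  # explicit device-to-host copy
    except ImportError:
        pass  # no GPU stack installed; data is already host-side
    return np.asarray(data)
```

An accessor's `to_numpy` method could delegate to something like this for each data variable before calling `.plot()`.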
@jacobtomlinson Makes complete sense! Just wanted to make sure the option was considered as a possibility.
The only things I can think of that would make sense initially in an accessor would be the conversion methods. An accessor does seem like a reasonable place to put that logic, but it also seems like a tiny amount of code to make, ship and maintain a separate package for. Plus those methods will either need to be used or duplicated in the core codebase to support things like plotting.
The accessors could also be where wrapped xarray methods could go (so in your previous example, it could be `time_slice.cupy.plot()`).
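The "wrapped methods" idea can be sketched without xarray at all. Everything here (`CupyPlotAccessor`, the `_FakeDataArray` stub) is hypothetical and only shows the delegation shape: the accessor converts to host memory, then forwards to the ordinary method.

```python
class CupyPlotAccessor:
    """Hypothetical accessor letting users call obj.cupy.plot()
    instead of obj.cupy.to_numpy().plot()."""

    def __init__(self, obj):
        self._obj = obj

    def plot(self, *args, **kwargs):
        # Move data off the GPU first, then delegate to the normal method.
        return self._obj.to_numpy().plot(*args, **kwargs)


class _FakeDataArray:
    """Stand-in for a cupy-backed DataArray, for illustration only."""

    def to_numpy(self):
        return self  # pretend device-to-host conversion

    def plot(self):
        return "plotted"


result = CupyPlotAccessor(_FakeDataArray()).plot()
```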
See similar discussion for sparse here: #3245
I think we are also open to special-casing here. IIRC there's some way to read from disk to GPU, isn't there? So it makes sense to expose that in our…

Re: index variables. Can we avoid this for now? Or are there going to be performance issues? The general problem will be handled as part of the index refactor (we've deferred pint support for indexes for this reason).
Yeah, it is possible to read direct to GPU from storage with GDS. We've experimented a little with zarr; I expect if something like zarr got GDS support and a zarr dataset was configured to use GDS, then the data would be read straight into GPU memory.
I think that would be best. However I did run into issues when trying to run the "Compare weighted and unweighted mean temperature" example with cupy. In that example the… In my testing I just cast the…
Wanted to update on some recent work. With NEP 35 (numpy/numpy#16935, numpy/numpy#17787) experimentally in NumPy, I've been exploring a bit more of using GPUs with xarray, starting off with basic groupby workflows. There are places in the code where xarray calls pandas directly, for example when building Indexes (lines 150 to 153 in 7152b41).

This is going to be challenging, primarily because xarray will need to determine which DataFrame library to use during Index creation (and possibly for other DataFrame objects). While there is a consortium developing potential solutions for libraries to leverage multiple dataframe libraries, I was going to keep hacking away and see what other issues may be lurking.
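NEP 35 mentioned above adds a `like=` argument to NumPy's array-creation functions (NumPy >= 1.20), which dispatches creation to the reference array's library via `__array_function__`. A minimal sketch, using a NumPy reference array here; passing a `cupy` array as `like` on a GPU machine would return a `cupy` array instead:

```python
import numpy as np

def arange_like(n, like):
    # NEP 35: `like=` dispatches creation to the type of the reference
    # array, so library code need not hard-code numpy vs cupy.
    return np.arange(n, like=like)

ref = np.empty(0)
result = arange_like(4, ref)  # with a cupy `ref`, this returns a cupy array
```

This is the mechanism that could let xarray's internal array-creation calls produce the right array type without importing cupy directly.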
The indexes story will change soon; we may even have our own index classes. We should have pretty decent support for NEP-18 arrays in… NEP 35 is cool; looks like we should use it in our…
**Is your feature request related to a problem? Please describe.**

In order to use Xarray with alternative array types like `cupy` the user needs to be able to specify the underlying array type without digging into internals. Right now I'm doing something like this.
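(The original snippet was lost in extraction.) The "digging into internals" pattern being described is, plausibly, replacing each variable's `.data` attribute by hand. A dependency-free sketch of that shape, where `SimpleNamespace` stands in for xarray data variables; on a real dataset this would be `ds[name].data = cp.asarray(ds[name].data)`:

```python
from types import SimpleNamespace
import numpy as np

# Stand-ins for two xarray data variables, each exposing a `.data` attribute.
ds = {
    "air": SimpleNamespace(data=np.ones((2, 3))),
    "rh": SimpleNamespace(data=np.zeros(3)),
}

# Manual per-variable conversion: workable, but brittle once there are
# many variables to keep track of.
for name in ds:
    ds[name].data = np.asarray(ds[name].data)  # cp.asarray on a GPU machine
```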
However this will become burdensome when there are many data arrays, and it feels brittle and error-prone. As I see it, a conversion could instead be done in a couple of places: on load, or as a utility method.
Currently Xarray supports NumPy and Dask arrays well. NumPy is the default, and the way you specify that a Dask array should be used is to give the `chunks` kwarg to an `open_` function or to call `.chunk()` on a `Dataset` or `DataArray`.

Side note: There are a few places where the Dask array API bleeds into Xarray in order to have compatibility; the `chunks` kwarg/method is one, the `.compute()` method is another. I'm hesitant to do this for other array types, although surfacing the `cupy.ndarray.get` method could feel natural for `cupy` users. But for now I think it would be best to treat Dask as a special case and try to be generic for everything else.

**Describe the solution you'd like**
For other array types I would like to propose the addition of an `asarray` kwarg for the `open_` methods and an `.asarray()` method on `Dataset` and `DataArray`. This should take either the array type (`cupy.ndarray`), the asarray method (`cp.asarray`), or preferably either. This would result in something like the following.
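(The illustrative code block here was lost in extraction; the proposed call shape would presumably be `xr.open_dataset(path, asarray=cp.asarray)` or `ds.asarray(cp.ndarray)`.) As a runnable approximation, the core of such a method is just mapping an asarray-style callable over every data variable. `convert_dataarrays` below is a hypothetical helper, not xarray API:

```python
import numpy as np

def convert_dataarrays(data_vars, asarray):
    """Apply an asarray-style callable (e.g. cp.asarray) to each array,
    mimicking what a Dataset.asarray() method would do internally."""
    return {name: asarray(arr) for name, arr in data_vars.items()}

host_vars = {"air": np.ones((2, 3)), "rh": np.zeros(3)}
gpu_vars = convert_dataarrays(host_vars, np.asarray)  # cp.asarray on a GPU
```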
These operations would convert all data arrays to `cupy` arrays. For the case that `ds` is backed by Dask arrays, it would use `map_blocks` to cast each block to the appropriate array type.

It is still unclear what to do about index variables, which are currently of type `pandas.Index`. For `cupy` it may be more appropriate to use a `cudf.Index` instead, to ensure both are on the GPU. However this would add a dependency on cudf and potentially increase complexity here.

**Describe alternatives you've considered**
Instead of an `asarray` kwarg/method, something like `to_cupy`/`from_cupy` could be done. However I feel this makes less sense, because the object type is not changing, just that of the underlying data structure.

Another option would be to go more high level with it. For example a `gpu` kwarg and `to_gpu`/`from_gpu` methods could be added in the same way. This would abstract things even further and give users a choice about hardware rather than software. This would also be a fine solution, but I think it may special-case too much; a more generic solution would be better.

**Additional context**
Related to #4212.
I'm keen to start implementing this, but would like some discussion/feedback before I dive in here.