Add Kvikio backend entrypoint #10
base: main
Conversation
So excited I couldn't wait to hack it together :P
Cool, great work @dcherian, it's like Christmas came early! I've tried to test this branch but am encountering some issues. Here's how I set things up:
# May need to install nvidia-gds first
# https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu-installation-common
sudo apt install nvidia-gds
git clone https://github.com/dcherian/cupy-xarray.git
cd cupy-xarray
mamba create --name cupy-xarray python=3.9 cupy=11.0 rapidsai-nightly::kvikio=22.10 jupyterlab=3.4.5 pooch=1.6.0 netcdf4=1.6.0 watermark=2.3.1
mamba activate cupy-xarray
python -m ipykernel install --user --name cupy-xarray
# https://github.com/pydata/xarray/pull/6874
pip install git+https://github.com/dcherian/xarray.git@kvikio
# https://github.com/zarr-developers/zarr-python/pull/934
# pip install git+https://github.com/madsbk/zarr-python.git@cupy_support
pip install zarr==2.13.0a2
# https://github.com/xarray-contrib/cupy-xarray/pull/10
git switch kvikio-entrypoint
pip install --editable=.
# Start jupyter lab
jupyter lab --no-browser
# Then open the docs/kvikio.ipynb notebook
With that, I got an error on the `xr.open_dataset` call:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Input In [4], in <cell line: 2>()
1 # Consolidated must be False
----> 2 ds = xr.open_dataset(store, engine="kvikio", consolidated=False)
3 print(ds.air._variable._data)
4 ds
File ~/mambaforge/envs/cupy-xarray/lib/python3.9/site-packages/xarray/backends/api.py:531, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, backend_kwargs, **kwargs)
519 decoders = _resolve_decoders_kwargs(
520 decode_cf,
521 open_backend_dataset_parameters=backend.open_dataset_parameters,
(...)
527 decode_coords=decode_coords,
528 )
530 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 531 backend_ds = backend.open_dataset(
532 filename_or_obj,
533 drop_variables=drop_variables,
534 **decoders,
535 **kwargs,
536 )
537 ds = _dataset_from_backend_dataset(
538 backend_ds,
539 filename_or_obj,
(...)
547 **kwargs,
548 )
549 return ds
File ~/Documents/github/cupy-xarray/cupy_xarray/kvikio.py:190, in KvikioBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel)
188 store_entrypoint = StoreBackendEntrypoint()
189 with close_on_error(store):
--> 190 ds = store_entrypoint.open_dataset(
191 store,
192 mask_and_scale=mask_and_scale,
193 decode_times=decode_times,
194 concat_characters=concat_characters,
195 decode_coords=decode_coords,
196 drop_variables=drop_variables,
197 use_cftime=use_cftime,
198 decode_timedelta=decode_timedelta,
199 )
200 return ds
File ~/mambaforge/envs/cupy-xarray/lib/python3.9/site-packages/xarray/backends/store.py:26, in StoreBackendEntrypoint.open_dataset(self, store, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
14 def open_dataset(
15 self,
16 store,
(...)
24 decode_timedelta=None,
25 ):
---> 26 vars, attrs = store.load()
27 encoding = store.get_encoding()
29 vars, attrs, coord_names = conventions.decode_cf_variables(
30 vars,
31 attrs,
(...)
38 decode_timedelta=decode_timedelta,
39 )
File ~/mambaforge/envs/cupy-xarray/lib/python3.9/site-packages/xarray/backends/common.py:125, in AbstractDataStore.load(self)
103 def load(self):
104 """
105 This loads the variables and attributes simultaneously.
106 A centralized loading function makes it easier to create
(...)
122 are requested, so care should be taken to make sure its fast.
123 """
124 variables = FrozenDict(
--> 125 (_decode_variable_name(k), v) for k, v in self.get_variables().items()
126 )
127 attributes = FrozenDict(self.get_attrs())
128 return variables, attributes
File ~/mambaforge/envs/cupy-xarray/lib/python3.9/site-packages/xarray/backends/zarr.py:461, in ZarrStore.get_variables(self)
460 def get_variables(self):
--> 461 return FrozenDict(
462 (k, self.open_store_variable(k, v)) for k, v in self.zarr_group.arrays()
463 )
File ~/mambaforge/envs/cupy-xarray/lib/python3.9/site-packages/xarray/core/utils.py:474, in FrozenDict(*args, **kwargs)
473 def FrozenDict(*args, **kwargs) -> Frozen:
--> 474 return Frozen(dict(*args, **kwargs))
File ~/mambaforge/envs/cupy-xarray/lib/python3.9/site-packages/xarray/backends/zarr.py:462, in <genexpr>(.0)
460 def get_variables(self):
461 return FrozenDict(
--> 462 (k, self.open_store_variable(k, v)) for k, v in self.zarr_group.arrays()
463 )
File ~/Documents/github/cupy-xarray/cupy_xarray/kvikio.py:130, in GDSZarrStore.open_store_variable(self, name, zarr_array)
128 else:
129 array_wrapper = CupyZarrArrayWrapper
--> 130 data = indexing.LazilyIndexedArray(array_wrapper(name, self))
132 attributes = dict(attributes)
133 encoding = {
134 "chunks": zarr_array.chunks,
135 "preferred_chunks": dict(zip(dimensions, zarr_array.chunks)),
136 "compressor": zarr_array.compressor,
137 "filters": zarr_array.filters,
138 }
File ~/mambaforge/envs/cupy-xarray/lib/python3.9/site-packages/xarray/backends/zarr.py:64, in ZarrArrayWrapper.__init__(self, variable_name, datastore)
61 self.datastore = datastore
62 self.variable_name = variable_name
---> 64 array = self.get_array()
65 self.shape = array.shape
67 dtype = array.dtype
File ~/Documents/github/cupy-xarray/cupy_xarray/kvikio.py:34, in EagerCupyZarrArrayWrapper.get_array(self)
33 def get_array(self):
---> 34 return np.asarray(self)
File ~/Documents/github/cupy-xarray/cupy_xarray/kvikio.py:31, in EagerCupyZarrArrayWrapper.__array__(self)
30 def __array__(self):
---> 31 return self.datastore.zarr_group[self.variable_name][:].get()
File ~/mambaforge/envs/cupy-xarray/lib/python3.9/site-packages/zarr/core.py:807, in Array.__getitem__(self, selection)
805 result = self.vindex[selection]
806 else:
--> 807 result = self.get_basic_selection(pure_selection, fields=fields)
808 return result
File ~/mambaforge/envs/cupy-xarray/lib/python3.9/site-packages/zarr/core.py:933, in Array.get_basic_selection(self, selection, out, fields)
930 return self._get_basic_selection_zd(selection=selection, out=out,
931 fields=fields)
932 else:
--> 933 return self._get_basic_selection_nd(selection=selection, out=out,
934 fields=fields)
File ~/mambaforge/envs/cupy-xarray/lib/python3.9/site-packages/zarr/core.py:976, in Array._get_basic_selection_nd(self, selection, out, fields)
970 def _get_basic_selection_nd(self, selection, out=None, fields=None):
971 # implementation of basic selection for array with at least one dimension
972
973 # setup indexer
974 indexer = BasicIndexer(selection, self)
--> 976 return self._get_selection(indexer=indexer, out=out, fields=fields)
File ~/mambaforge/envs/cupy-xarray/lib/python3.9/site-packages/zarr/core.py:1267, in Array._get_selection(self, indexer, out, fields)
1261 if not hasattr(self.chunk_store, "getitems") or \
1262 any(map(lambda x: x == 0, self.shape)):
1263 # sequentially get one key at a time from storage
1264 for chunk_coords, chunk_selection, out_selection in indexer:
1265
1266 # load chunk selection into output array
-> 1267 self._chunk_getitem(chunk_coords, chunk_selection, out, out_selection,
1268 drop_axes=indexer.drop_axes, fields=fields)
1269 else:
1270 # allow storage to get multiple items at once
1271 lchunk_coords, lchunk_selection, lout_selection = zip(*indexer)
File ~/mambaforge/envs/cupy-xarray/lib/python3.9/site-packages/zarr/core.py:1966, in Array._chunk_getitem(self, chunk_coords, chunk_selection, out, out_selection, drop_axes, fields)
1962 ckey = self._chunk_key(chunk_coords)
1964 try:
1965 # obtain compressed data for chunk
-> 1966 cdata = self.chunk_store[ckey]
1968 except KeyError:
1969 # chunk not initialized
1970 if self._fill_value is not None:
File ~/mambaforge/envs/cupy-xarray/lib/python3.9/site-packages/zarr/storage.py:1066, in DirectoryStore.__getitem__(self, key)
1064 filepath = os.path.join(self.path, key)
1065 if os.path.isfile(filepath):
-> 1066 return self._fromfile(filepath)
1067 else:
1068 raise KeyError(key)
File ~/mambaforge/envs/cupy-xarray/lib/python3.9/site-packages/kvikio/zarr.py:43, in GDSStore._fromfile(self, fn)
41 else:
42 nbytes = os.path.getsize(fn)
---> 43 with kvikio.CuFile(fn, "r") as f:
44 ret = cupy.empty(nbytes, dtype="u1")
45 read = f.read(ret)
File ~/mambaforge/envs/cupy-xarray/lib/python3.9/site-packages/kvikio/cufile.py:67, in CuFile.__init__(self, file, flags)
49 def __init__(self, file: Union[pathlib.Path, str], flags: str = "r"):
50 """Open and register file for GDS IO operations
51
52 CuFile opens the file twice and maintains two file descriptors.
(...)
65 "+" -> "open for updating (reading and writing)"
66 """
---> 67 self._handle = libkvikio.CuFile(file, flags)
File libkvikio.pyx:103, in kvikio._lib.libkvikio.CuFile.__init__()
RuntimeError: cuFile error at: /workspace/.conda-bld/work/cpp/include/kvikio/file_handle.hpp:172: internal error

Edit: Looking at https://docs.nvidia.com/gpudirect-storage/release-notes/index.html#known-limitations and https://docs.nvidia.com/gpudirect-storage/release-notes/index.html#support-matrix, it seems that GPUDirect Storage is only supported for certain filesystems.

Edit2: Nevermind, I just remembered my root folder is …
As in xarray-contrib/xbatcher#87 (comment). Now available in zarr-python 2.13.0a2 for testing.
Should add that zarr-python 2.13 final is out as well.
* main:
  Min xarray >= 0.19.0 (xarray-contrib#25)
  Fix broken dask_array_type import (xarray-contrib#24)
for more information, see https://pre-commit.ci
As an update, we need more upstream work in Xarray (pydata/xarray#8100) and some work in dask to get that to work (see the notebook).
@negin513 This should be OK to try on Derecho now. Dask will not work, but everything else should.
cupy_xarray/kvikio.py (Outdated)

open_kwargs = dict(
    mode=mode,
    synchronizer=synchronizer,
    path=group,
    ########## NEW STUFF
    meta_array=cp.empty(()),
)
open_kwargs["storage_options"] = storage_options

# TODO: handle consolidated
assert not consolidated

if chunk_store:
    open_kwargs["chunk_store"] = chunk_store
if consolidated is None:
    consolidated = False

store = kvikio.zarr.GDSStore(store)
Maybe we can refactor this to use kvikio.zarr.open_cupy_array once kvikio=23.10 is out? There's support for nvCOMP-based LZ4 compression now (that's compatible with Zarr's CPU-based LZ4 compressor), xref rapidsai/kvikio#267.
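For illustration, a rough sketch of what that refactor might look like; it assumes kvikio >= 23.10 exposes kvikio.zarr.open_cupy_array roughly as described above, and the exact arguments are an assumption to check against the kvikio docs:

# Hedged sketch: open a Zarr array as cupy via GDS, replacing the manual
# GDSStore + meta_array=cp.empty(()) plumbing above. Anything beyond the
# path and mode arguments is an assumption.
import kvikio.zarr

def open_gds_array(path):
    # Reads chunks with cuFile/GDS and keeps the data in GPU (cupy) memory.
    return kvikio.zarr.open_cupy_array(path, mode="r")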
PR welcome!
Debating on whether to start from scratch in a completely new branch, or rebase off of this one 😄
P.S. I'm starting some work over at https://github.com/weiji14/foss4g2023oceania for a conference talk on 18 Oct, hoping to get the kvikIO engine running on an ERA5 dataset. Somehow, I could only get things on this branch to work up to kvikIO==23.06; it seems like RAPIDS AI 23.08 moved to CUDA 12 and I've been getting some errors like `RuntimeError: Unable to open file: Too many open files`. I'll try to push some code in the next 2 weeks, and should be able to nudge this forward a little bit 😉
Just build on top of this branch in a new PR.
The optimization I was mentioning is to save `self.datastore.zarr_group[self.variable_name]` as `self._array`. Otherwise we keep pinging the store unnecessarily.
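A minimal sketch of that caching, using the older (variable_name, datastore) wrapper signature seen in the tracebacks above; treat it as illustrative, not the final implementation:

# Cache the zarr array once at construction time instead of looking it up
# from the store on every access. Class and attribute names follow this
# PR's wrappers and are assumptions about the surrounding code.
from xarray.backends.zarr import ZarrArrayWrapper

class CupyZarrArrayWrapper(ZarrArrayWrapper):
    def __init__(self, variable_name, datastore):
        # Set the cache before super().__init__, which may call get_array()
        # to record shape and dtype.
        self._array = datastore.zarr_group[variable_name]
        super().__init__(variable_name, datastore)

    def get_array(self):
        # Reuse the cached array; no extra store lookups.
        return self._array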
Ok, I got a bit too ambitious trying to work in support for LZ4 compression (via nvCOMP), and ran into some issues (see rapidsai/kvikio#297). There are stable releases of RAPIDS AI kvikIO 23.10.00 and xarray 2023.10.0 now which should have some nice enhancements for this PR. I'll try to squeeze out some time to work on it.
Obtaining a temporal subset of WeatherBench2 from 2020-2022, at pressure level 500hPa, with data variables: geopotential, u_component_of_wind, v_component_of_wind. Disabling compression when saving to Zarr so that it can be read back using cupy-xarray's kvikIO engine developed at xarray-contrib/cupy-xarray#10.
* 🧐 Subset WeatherBench2 to twenty years and select variables
  Obtaining a temporal subset of WeatherBench2 from 2000-2022, at pressure level 500hPa, with data variables: geopotential, u_component_of_wind, v_component_of_wind. Disabling compression when saving to Zarr so that it can be read back using cupy-xarray's kvikIO engine developed at xarray-contrib/cupy-xarray#10.
* 🗃️ Get full 1464x721 spatial resolution ERA5 for year 2000
  Changing from the 240x121 spatial resolution grid to 1464x721, so that a single time chunk is larger (about 4MB). Decreased time range from twenty years to one year (previous commit at c63bc43 was actually 20, not 2 years). Also had to rechunk along the latitude and longitude dims because the original WeatherBench2 Zarr store had 2 chunks per time slice.
* upstream/main:
  Documentation Updates 📖 (xarray-contrib#35)
  [pre-commit.ci] pre-commit autoupdate (xarray-contrib#37)
  [pre-commit.ci] pre-commit autoupdate (xarray-contrib#34)
  [pre-commit.ci] pre-commit autoupdate (xarray-contrib#32)
  Update .pre-commit-config.yaml (xarray-contrib#28)
  Expand installation doc (xarray-contrib#27)
Thanks! I'm running the benchmark locally on my laptop (with an RTX A2000 8GB GPU), see also zarr-developers/zarr-benchmark#14, and I've got a few more extra technical details mentioned at my own site https://weiji14.github.io/blog/when-cloud-native-geospatial-meets-gpu-native-machine-learning 😉
Yes, I saw https://earthmover.io/blog/cloud-native-dataloader, and think a lot of the pieces are coming together. We really should wrap up this PR at some point; I'd be happy to give this another review if you'd like to push more changes on this branch.
Allow it to be rendered under the User Guide section.
Will need it for the kvikio.zarr docs later.
Create new section in the API documentation page for the kvikIO engine. Added more docstrings to the kvikio.py file, and fixed some imports so things render nicely on the API page. Also added an intersphinx link to the kvikio docs at https://docs.rapids.ai/api/kvikio/stable.
Fixes error like `TypeError: ZarrArrayWrapper.__init__() takes 2 positional arguments but 3 were given`.
    array_wrapper = EagerCupyZarrArrayWrapper
else:
    array_wrapper = CupyZarrArrayWrapper
data = indexing.LazilyIndexedArray(array_wrapper(zarr_array))
Currently stuck on this line, and trying to figure out what to change in relation to pydata/xarray#6874. My guess is that in an ideal world where we've solved all the numpy/cupy duck typing issues, we could just use the recently modified ZarrArrayWrapper (pydata/xarray#8472) like this:

- data = indexing.LazilyIndexedArray(array_wrapper(zarr_array))
+ data = indexing.LazilyIndexedArray(ZarrArrayWrapper(zarr_array))
However, that leads to this error:
____________________________________________________________________________________ test_lazy_load[True] ____________________________________________________________________________________
consolidated = True, store = PosixPath('/tmp/pytest-of-weiji/pytest-32/test_lazy_load_True_0/kvikio.zarr')
@pytest.mark.parametrize("consolidated", [True, False])
def test_lazy_load(consolidated, store):
> with xr.open_dataset(store, engine="kvikio", consolidated=consolidated) as ds:
cupy_xarray/tests/test_kvikio.py:37:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../mambaforge/envs/cupy-xarray-doc/lib/python3.10/site-packages/xarray/backends/api.py:571: in open_dataset
backend_ds = backend.open_dataset(
cupy_xarray/kvikio.py:241: in open_dataset
ds = store_entrypoint.open_dataset(
../../../mambaforge/envs/cupy-xarray-doc/lib/python3.10/site-packages/xarray/backends/store.py:58: in open_dataset
ds = Dataset(vars, attrs=attrs)
../../../mambaforge/envs/cupy-xarray-doc/lib/python3.10/site-packages/xarray/core/dataset.py:711: in __init__
variables, coord_names, dims, indexes, _ = merge_data_and_coords(
../../../mambaforge/envs/cupy-xarray-doc/lib/python3.10/site-packages/xarray/core/dataset.py:425: in merge_data_and_coords
return merge_core(
../../../mambaforge/envs/cupy-xarray-doc/lib/python3.10/site-packages/xarray/core/merge.py:699: in merge_core
collected = collect_variables_and_indexes(aligned, indexes=indexes)
../../../mambaforge/envs/cupy-xarray-doc/lib/python3.10/site-packages/xarray/core/merge.py:362: in collect_variables_and_indexes
idx, idx_vars = create_default_index_implicit(variable)
../../../mambaforge/envs/cupy-xarray-doc/lib/python3.10/site-packages/xarray/core/indexes.py:1404: in create_default_index_implicit
index = PandasIndex.from_variables(dim_var, options={})
../../../mambaforge/envs/cupy-xarray-doc/lib/python3.10/site-packages/xarray/core/indexes.py:654: in from_variables
obj = cls(data, dim, coord_dtype=var.dtype)
../../../mambaforge/envs/cupy-xarray-doc/lib/python3.10/site-packages/xarray/core/indexes.py:589: in __init__
index = safe_cast_to_index(array)
../../../mambaforge/envs/cupy-xarray-doc/lib/python3.10/site-packages/xarray/core/indexes.py:469: in safe_cast_to_index
index = pd.Index(np.asarray(array), **kwargs)
../../../mambaforge/envs/cupy-xarray-doc/lib/python3.10/site-packages/xarray/core/indexing.py:509: in __array__
return np.asarray(self.get_duck_array(), dtype=dtype)
../../../mambaforge/envs/cupy-xarray-doc/lib/python3.10/site-packages/xarray/backends/common.py:181: in get_duck_array
return self[key] # type: ignore [index]
../../../mambaforge/envs/cupy-xarray-doc/lib/python3.10/site-packages/xarray/backends/zarr.py:104: in __getitem__
return indexing.explicit_indexing_adapter(
../../../mambaforge/envs/cupy-xarray-doc/lib/python3.10/site-packages/xarray/core/indexing.py:1017: in explicit_indexing_adapter
indexable = NumpyIndexingAdapter(result)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <[AttributeError("'NumpyIndexingAdapter' object has no attribute 'array'") raised in repr()] NumpyIndexingAdapter object at 0x7f71beb14a00>
array = array([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4])
def __init__(self, array):
# In NumpyIndexingAdapter we only allow to store bare np.ndarray
if not isinstance(array, np.ndarray):
> raise TypeError(
"NumpyIndexingAdapter only wraps np.ndarray. "
f"Trying to wrap {type(array)}"
)
E TypeError: NumpyIndexingAdapter only wraps np.ndarray. Trying to wrap <class 'cupy.ndarray'>
../../../mambaforge/envs/cupy-xarray-doc/lib/python3.10/site-packages/xarray/core/indexing.py:1493: TypeError
================================================================================== short test summary info ===================================================================================
FAILED cupy_xarray/tests/test_kvikio.py::test_lazy_load[True] - TypeError: NumpyIndexingAdapter only wraps np.ndarray. Trying to wrap <class 'cupy.ndarray'>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Looked into this a bit and got lost in all the xarray class inheritance logic and mixins... @dcherian, could you point out what's the recommended route forward? Is there something we still need to upstream into xarray? I'm testing this on xarray=2024.6.0 btw.
Output of xr.show_versions() for reference:
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 6.7.12+bpo-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_NZ.UTF-8
LOCALE: ('en_NZ', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2
xarray: 2024.6.0
pandas: 2.2.2
numpy: 1.26.4
scipy: None
netCDF4: 1.7.1
pydap: None
h5netcdf: None
h5py: None
zarr: 2.18.2
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.6.2
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.6.0
cupy: 13.2.0
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 70.0.0
pip: 24.0
conda: None
pytest: 8.2.2
mypy: None
IPython: 8.25.0
sphinx: 7.3.7
I'm also facing a problem at the same place in the code when it comes to dimension coordinates. Consider the dummy dataset:
<xarray.Dataset> Size: 352B
Dimensions: (instrument: 3, loc: 2, time: 4)
Coordinates:
* instrument (instrument) <U8 96B 'manufac1' 'manufac2' 'manufac3'
lat (loc) float64 16B ...
lon (loc) float64 16B ...
* time (time) datetime64[ns] 32B 2014-09-06 2014-09-07 ... 2014-09-09
Dimensions without coordinates: loc
Data variables:
temperature (loc, instrument, time) float64 192B ...
-------------------------------------------------
that I write to a zarr file:
myzarr = 'mytest.zarr'
ds.to_zarr(myzarr, mode="w")
which is ok to read back with:
ds2 = xr.open_dataset(myzarr)
But when I want to use the kvikio engine:
ds3 = xr.open_dataset(myzarr, engine='kvikio', consolidated=False)
it fails with:
Traceback (most recent call last):
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/xarray/conventions.py", line 450, in decode_cf_variables
new_vars[k] = decode_cf_variable(
^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/xarray/conventions.py", line 291, in decode_cf_variable
var = times.CFDatetimeCoder(use_cftime=use_cftime).decode(var, name=name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/xarray/coding/times.py", line 992, in decode
dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/xarray/coding/times.py", line 214, in _decode_cf_datetime_dtype
[first_n_items(values, 1) or [0], last_item(values) or [0]]
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/xarray/core/formatting.py", line 97, in first_n_items
return np.ravel(to_duck_array(array))[:n_desired]
^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/xarray/namedarray/pycompat.py", line 138, in to_duck_array
return np.asarray(data) # type: ignore[return-value]
^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/xarray/core/indexing.py", line 574, in __array__
return np.asarray(self.get_duck_array(), dtype=dtype)
^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/xarray/core/indexing.py", line 577, in get_duck_array
return self.array.get_duck_array()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/xarray/core/indexing.py", line 651, in get_duck_array
array = self.array[self.key]
~~~~~~~~~~^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/xarray/backends/zarr.py", line 104, in __getitem__
return indexing.explicit_indexing_adapter(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/xarray/core/indexing.py", line 1015, in explicit_indexing_adapter
result = raw_indexing_method(raw_key.tuple)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/xarray/backends/zarr.py", line 94, in _getitem
return self._array[key]
~~~~~~~~~~~^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/zarr/core.py", line 800, in __getitem__
result = self.get_basic_selection(pure_selection, fields=fields)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/zarr/core.py", line 926, in get_basic_selection
return self._get_basic_selection_nd(selection=selection, out=out, fields=fields)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/zarr/core.py", line 968, in _get_basic_selection_nd
return self._get_selection(indexer=indexer, out=out, fields=fields)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/zarr/core.py", line 1343, in _get_selection
self._chunk_getitems(
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/zarr/core.py", line 2183, in _chunk_getitems
self._process_chunk(
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/zarr/core.py", line 2096, in _process_chunk
chunk = self._decode_chunk(cdata)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/zarr/core.py", line 2352, in _decode_chunk
chunk = self._compressor.decode(cdata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "numcodecs/blosc.pyx", line 563, in numcodecs.blosc.Blosc.decode
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/numcodecs/compat.py", line 154, in ensure_contiguous_ndarray
return ensure_ndarray(
^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/numcodecs/compat.py", line 67, in ensure_ndarray
return np.array(ensure_ndarray_like(buf), copy=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "cupy/_core/core.pyx", line 1481, in cupy._core.core._ndarray_base.__array__
TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/orliac/RADIOBLOCKS/new_inst/test_kvikio.py", line 72, in <module>
ds3 = xr.open_dataset(myzarr, engine='kvikio', consolidated=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/xarray/backends/api.py", line 588, in open_dataset
backend_ds = backend.open_dataset(
^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/cupy-xarray/cupy_xarray/kvikio.py", line 248, in open_dataset
ds = store_entrypoint.open_dataset(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/xarray/backends/store.py", line 46, in open_dataset
vars, attrs, coord_names = conventions.decode_cf_variables(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/new_inst/VENV2/lib/python3.11/site-packages/xarray/conventions.py", line 461, in decode_cf_variables
raise type(e)(f"Failed to decode variable {k!r}: {e}") from e
TypeError: Failed to decode variable 'time': Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
Any idea?
Just a bit of context: I'm working on a large radio astronomy project called RADIOBLOCKS where I want to process very large Dask-backed Xarray Datasets on a GPU cluster, hence my interest in your work on kvikio. Our GPU cluster has H100s and GDS, if that is of interest to you for running benchmarks.
Thanks.
The core issue here is that the "eager" wrapper isn't working as expected. Dimension coordinates must be eagerly loaded into CPU memory, and this is a bit hacky at the moment.
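For context, a minimal sketch of what the eager path is meant to do, modelled on the EagerCupyZarrArrayWrapper visible in the tracebacks above; treat it as an illustration, not the actual implementation:

import numpy as np

class EagerCupyZarrArrayWrapper:
    """Eagerly pull a cupy-backed zarr array into host memory."""

    def __init__(self, zarr_array):
        # zarr_array is assumed to yield cupy chunks (store opened with
        # meta_array=cp.empty(())).
        self._array = zarr_array

    def __array__(self, dtype=None):
        # Read everything on the GPU, then copy device -> host with .get(),
        # so pandas indexes can be built from a plain np.ndarray.
        return np.asarray(self._array[:].get(), dtype=dtype)

    def get_array(self):
        return np.asarray(self)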
@weiji14 I think pydata/xarray#8408 fixes your issue. It needs a test though.
Tried @dcherian's suggestion, with
print("raw_key =\n", raw_key)
print("numpy_indices =\n", numpy_indices)
result = raw_indexing_method(raw_key.tuple)
print("result =", result, type(result))
if numpy_indices.tuple:
# index the loaded np.ndarray
#indexable = NumpyIndexingAdapter(result)
#result = apply_indexer(indexable, numpy_indices)
result = as_indexable(result)[numpy_indices]
return result
but now it fails with:
raw_key =
OuterIndexer((slice(0, 1, 1),))
numpy_indices =
OuterIndexer((slice(None, None, None),))
result =
[1867128.] <class 'numpy.ndarray'>
array is np.ndarray
Traceback (most recent call last):
File "/home/orliac/RADIOBLOCKS/kuma_inst/VENV/lib/python3.11/site-packages/xarray/conventions.py", line 448, in decode_cf_variables
new_vars[k] = decode_cf_variable(
^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/kuma_inst/VENV/lib/python3.11/site-packages/xarray/conventions.py", line 289, in decode_cf_variable
var = times.CFDatetimeCoder(use_cftime=use_cftime).decode(var, name=name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/kuma_inst/VENV/lib/python3.11/site-packages/xarray/coding/times.py", line 992, in decode
dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/kuma_inst/VENV/lib/python3.11/site-packages/xarray/coding/times.py", line 214, in _decode_cf_datetime_dtype
[first_n_items(values, 1) or [0], last_item(values) or [0]]
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/kuma_inst/VENV/lib/python3.11/site-packages/xarray/core/formatting.py", line 97, in first_n_items
return np.ravel(to_duck_array(array))[:n_desired]
^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/kuma_inst/VENV/lib/python3.11/site-packages/xarray/namedarray/pycompat.py", line 138, in to_duck_array
return np.asarray(data) # type: ignore[return-value]
^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/kuma_inst/VENV/lib/python3.11/site-packages/xarray/core/indexing.py", line 574, in __array__
return np.asarray(self.get_duck_array(), dtype=dtype)
^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/kuma_inst/VENV/lib/python3.11/site-packages/xarray/core/indexing.py", line 577, in get_duck_array
return self.array.get_duck_array()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/kuma_inst/VENV/lib/python3.11/site-packages/xarray/core/indexing.py", line 651, in get_duck_array
array = self.array[self.key]
~~~~~~~~~~^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/kuma_inst/VENV/lib/python3.11/site-packages/xarray/backends/netCDF4_.py", line 100, in __getitem__
return indexing.explicit_indexing_adapter(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/kuma_inst/VENV/lib/python3.11/site-packages/xarray/core/indexing.py", line 1031, in explicit_indexing_adapter
result = as_indexable(result)[numpy_indices]
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
File "/home/orliac/RADIOBLOCKS/kuma_inst/VENV/lib/python3.11/site-packages/xarray/core/indexing.py", line 1526, in __getitem__
self._check_and_raise_if_non_basic_indexer(indexer)
File "/home/orliac/RADIOBLOCKS/kuma_inst/VENV/lib/python3.11/site-packages/xarray/core/indexing.py", line 550, in _check_and_raise_if_non_basic_indexer
raise TypeError(
TypeError: Vectorized indexing with vectorized or outer indexers is not supported. Please use .vindex and .oindex properties to index the array.
TODO:
This PR registers a "kvikio" backend that allows us to use kvikio to read from network into GPU memory directly using "GDS".
To get this to work we need
This PR subclasses the existing ZarrStore and overrides the necessary methods. Most of the code is actually copied over from xarray/backends/zarr.py.
For a short demo see https://github.com/dcherian/cupy-xarray/blob/kvikio-entrypoint/docs/source/kvikio.ipynb
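For reference, a minimal usage example mirroring the call in that notebook ("air.zarr" is a placeholder path to an uncompressed, unconsolidated Zarr store containing an "air" variable, as in the demo):

import xarray as xr

# Open the store through the kvikio backend registered by this PR;
# chunks are read via GDS directly into GPU (cupy) memory.
ds = xr.open_dataset("air.zarr", engine="kvikio", consolidated=False)
print(ds.air._variable._data)  # lazily wraps a cupy-backed zarr array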
Tip: run /usr/local/cuda-12/gds/tools/gdscheck -p to check for GPU Direct Storage compatibility on your system.