Hi there,

I've been running some benchmark experiments with NVIDIA GPUDirect Storage (GDS) on a subset of ERA5, and would like to share some preliminary results (and hopefully get some more people to validate the numbers)! But first, a bar plot:

[Bar plot: average read throughput of the CPU-based `zarr` engine vs the GPU-based `kvikIO` engine]

Technical details:
- 18.2 GB WeatherBench2 ERA5 subset: 1 year at 6-hourly resolution, 3 data variables (float32); a rough preparation sketch follows this list
- Zarr v2 spec, no compression, no consolidated metadata
- Benchmarked on an RTX A2000 (8 GB) GPU, connected via PCIe Gen4 x8
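For anyone wanting to reproduce a similar input dataset, here's an illustrative sketch of how such a subset could be prepared with xarray. This is not the exact preprocessing used for the benchmark; the WeatherBench2 store URL, variable names, and year below are placeholders/assumptions:

```python
import xarray as xr

# Placeholder URL: substitute the actual WeatherBench2 ERA5 Zarr store you want to subset.
WB2_ERA5 = "gs://weatherbench2/datasets/era5/<six-hourly-store>.zarr"

ds = xr.open_zarr(WB2_ERA5)

# Hypothetical choice of 3 variables and 1 year of 6-hourly data.
subset = ds[["2m_temperature", "10m_u_component_of_wind", "10m_v_component_of_wind"]]
subset = subset.sel(time=slice("2022-01-01", "2022-12-31"))

# Write as an uncompressed Zarr store without consolidated metadata
# (with zarr-python 2.x this produces a Zarr v2 spec store, matching the setup above).
encoding = {var: {"compressor": None} for var in subset.data_vars}
subset.to_zarr("era5_subset.zarr", encoding=encoding, consolidated=False)
```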
From `nvtop`, I'm getting an average transfer rate of 1.3 GiB/s for the `zarr` engine (CPU-based) and 1.7 GiB/s for the `kvikIO` engine (GPU-based).
Code is at weiji14/foss4g2023oceania#7, including a `conda-lock.yml` file to reproduce the environment exactly. The benchmark code is probably not the best, as I just wanted to compare the `zarr` engine with the experimental `kvikIO` engine from xarray-contrib/cupy-xarray#10, but it's a start 🙂
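The core of the comparison is roughly the sketch below. This is a minimal reconstruction rather than the exact benchmark script, and it assumes the experimental kvikIO backend from xarray-contrib/cupy-xarray#10 is installed and registered with xarray as `engine="kvikio"`:

```python
import time

import xarray as xr

# Hypothetical path to the uncompressed Zarr v2 store described above.
store = "era5_subset.zarr"

def time_read(engine: str) -> float:
    """Open the store with the given xarray engine and time a full read."""
    ds = xr.open_dataset(store, engine=engine, consolidated=False)
    start = time.perf_counter()
    ds.load()  # zarr engine -> NumPy arrays on the CPU; kvikio engine -> CuPy arrays via GDS
    return time.perf_counter() - start

for engine in ("zarr", "kvikio"):
    print(f"{engine}: {time_read(engine):.2f} s")
```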
Caveats

- Data preprocessing steps would need to happen entirely on the GPU for maximum benefit. To avoid costly CPU RAM to GPU RAM transfers, current CPU-based functions may need to be rewritten using libraries like RAPIDS, NVIDIA DALI, etc. (see the sketch after this list).
- Support for NVIDIA GPUDirect Storage on cloud-based filesystems (AWS, Azure, GCP) may be lacking (see e.g. microsoft/planetary-computer-containers#51) and still needs to be ironed out. Extra latency may also be introduced if the NVMe storage is attached to a different PCIe switch than the one the GPU is on.
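To illustrate the first caveat, here's a minimal sketch (my own illustration, not code from the benchmark; the variable name is an assumption) of keeping a reduction on the GPU once the data has been read into CuPy-backed arrays, versus forcing a device-to-host copy:

```python
import cupy as cp
import xarray as xr

# Opened with the kvikio engine, so the variables are backed by CuPy arrays on the GPU.
ds = xr.open_dataset("era5_subset.zarr", engine="kvikio", consolidated=False)

# Stays on the GPU: xarray dispatches the reduction to CuPy.
gpu_mean = ds["2m_temperature"].mean(dim="time")  # "2m_temperature" is a hypothetical variable name

# Leaves the GPU: anything that needs a NumPy array triggers a device-to-host transfer.
cpu_values = cp.asnumpy(gpu_mean.data)
```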
GPU compression/decompression with nvCOMP

The benchmark I ran above did not use compression, as the branch at xarray-contrib/cupy-xarray#10 does not support it yet. But as mentioned at xarray-contrib/cupy-xarray#10 (comment), there's now a `kvikio.zarr.open_cupy_array` function released in `kvikIO=23.10.00` that has support for LZ4 compression. Example code from rapidsai/kvikio#297:
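For reference, a minimal sketch of what writing and reading an LZ4-compressed array with `open_cupy_array` might look like is shown below. This is my own reconstruction, not necessarily the exact snippet from rapidsai/kvikio#297; the store path and array parameters are placeholders:

```python
import cupy
import kvikio.zarr

# Write: open_cupy_array defaults to an LZ4-based compressor that runs on the GPU,
# while staying readable from the CPU side with a compatible LZ4 codec.
z = kvikio.zarr.open_cupy_array(
    store="lz4_test.zarr",  # hypothetical store path
    mode="w",
    shape=(100,),
    chunks=(10,),
    dtype="float32",
)
z[:] = cupy.arange(100, dtype="float32")

# Read: chunks are decompressed on the GPU and returned as CuPy arrays.
z = kvikio.zarr.open_cupy_array(store="lz4_test.zarr", mode="r")
assert isinstance(z[:], cupy.ndarray)
```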