Hi there,

I've been running some benchmark experiments with NVIDIA GPUDirect Storage (GDS) on a subset of ERA5, and would like to share some preliminary results (and hopefully get some more people to validate the numbers)! But first, a bar plot:

[Bar plot: average read throughput of the CPU-based `zarr` engine vs the GPU-based `kvikIO` engine]

Technical details:
- 18.2 GB WeatherBench2 ERA5 subset: 1 year at 6-hourly resolution, 3 data variables (float32); a rough preparation sketch follows this list
- Zarr v2 spec, no compression, no consolidated metadata
- Benchmarked on an RTX A2000 (8 GB) GPU, connected via PCIe Gen4 x8
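For anyone wanting to reproduce a similar input dataset, here's an illustrative sketch of how such a subset could be prepared with xarray. This is not the exact preprocessing used for the benchmark; the WeatherBench2 store URL, variable names, and year below are placeholders/assumptions:

```python
import xarray as xr

# Placeholder URL: substitute the actual WeatherBench2 ERA5 Zarr store you want to subset.
WB2_ERA5 = "gs://weatherbench2/datasets/era5/<six-hourly-store>.zarr"

ds = xr.open_zarr(WB2_ERA5)

# Hypothetical choice of 3 variables and 1 year of 6-hourly data.
subset = ds[["2m_temperature", "10m_u_component_of_wind", "10m_v_component_of_wind"]]
subset = subset.sel(time=slice("2022-01-01", "2022-12-31"))

# Write as an uncompressed Zarr store without consolidated metadata
# (with zarr-python 2.x this produces a Zarr v2 spec store, matching the setup above).
encoding = {var: {"compressor": None} for var in subset.data_vars}
subset.to_zarr("era5_subset.zarr", encoding=encoding, consolidated=False)
```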
From `nvtop`, I'm getting an average transfer rate of 1.3 GiB/s for the `zarr` engine (CPU-based) and 1.7 GiB/s for the `kvikIO` engine (GPU-based).
Code is at weiji14/foss4g2023oceania#7, including a `conda-lock.yml` file to reproduce the environment exactly. The benchmark code is probably not the best, as I just wanted to compare the `zarr` engine with the experimental `kvikIO` engine from xarray-contrib/cupy-xarray#10, but it's a start 🙂
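The core of the comparison is roughly the sketch below. This is a minimal reconstruction rather than the exact benchmark script, and it assumes the experimental kvikIO backend from xarray-contrib/cupy-xarray#10 is installed and registered with xarray as `engine="kvikio"`:

```python
import time

import xarray as xr

# Hypothetical path to the uncompressed Zarr v2 store described above.
store = "era5_subset.zarr"

def time_read(engine: str) -> float:
    """Open the store with the given xarray engine and time a full read."""
    ds = xr.open_dataset(store, engine=engine, consolidated=False)
    start = time.perf_counter()
    ds.load()  # zarr engine -> NumPy arrays on the CPU; kvikio engine -> CuPy arrays via GDS
    return time.perf_counter() - start

for engine in ("zarr", "kvikio"):
    print(f"{engine}: {time_read(engine):.2f} s")
```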
Caveats

- Data preprocessing steps would need to happen entirely on the GPU for maximum benefit. To avoid costly CPU RAM to GPU RAM transfers, current CPU-based functions may need to be rewritten using libraries like RAPIDS, NVIDIA DALI, etc. (see the sketch after this list).
- Support for NVIDIA GPUDirect Storage on cloud-based filesystems (AWS, Azure, GCP) may be lacking (see e.g. microsoft/planetary-computer-containers#51) and still needs to be ironed out. Extra latency may also be introduced if the NVMe storage is attached to a different PCIe switch than the one the GPU is on.
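To illustrate the first caveat, here's a minimal sketch (my own illustration, not code from the benchmark; the variable name is an assumption) of keeping a reduction on the GPU once the data has been read into CuPy-backed arrays, versus forcing a device-to-host copy:

```python
import cupy as cp
import xarray as xr

# Opened with the kvikio engine, so the variables are backed by CuPy arrays on the GPU.
ds = xr.open_dataset("era5_subset.zarr", engine="kvikio", consolidated=False)

# Stays on the GPU: xarray dispatches the reduction to CuPy.
gpu_mean = ds["2m_temperature"].mean(dim="time")  # "2m_temperature" is a hypothetical variable name

# Leaves the GPU: anything that needs a NumPy array triggers a device-to-host transfer.
cpu_values = cp.asnumpy(gpu_mean.data)
```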
GPU compression/decompression with nvCOMP

The benchmark I ran above did not use compression, as the branch at xarray-contrib/cupy-xarray#10 does not support it yet. But as mentioned at xarray-contrib/cupy-xarray#10 (comment), there's now a `kvikio.zarr.open_cupy_array` function released in `kvikIO=23.10.00` that has support for LZ4 compression. Example code from rapidsai/kvikio#297:
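For reference, a minimal sketch of what writing and reading an LZ4-compressed array with `open_cupy_array` might look like is shown below. This is my own reconstruction, not necessarily the exact snippet from rapidsai/kvikio#297; the store path and array parameters are placeholders:

```python
import cupy
import kvikio.zarr

# Write: open_cupy_array defaults to an LZ4-based compressor that runs on the GPU,
# while staying readable from the CPU side with a compatible LZ4 codec.
z = kvikio.zarr.open_cupy_array(
    store="lz4_test.zarr",  # hypothetical store path
    mode="w",
    shape=(100,),
    chunks=(10,),
    dtype="float32",
)
z[:] = cupy.arange(100, dtype="float32")

# Read: chunks are decompressed on the GPU and returned as CuPy arrays.
z = kvikio.zarr.open_cupy_array(store="lz4_test.zarr", mode="r")
assert isinstance(z[:], cupy.ndarray)
```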