[FEA] GDS Integration #1445
Labels
- epic: Issue that encompasses a significant feature or body of work
- feature request: New feature or request
- P1: Nice to have for release
- performance: A performance related task/issue
Motivation
For many jobs running on spark-rapids, the ratio of computation to data movement is low, making them I/O bound. GPUDirect Storage (GDS) enables Direct Memory Access (DMA) between storage and GPUs with higher bandwidth and lower latency, which could reduce file I/O overhead and improve overall job performance.
Goals
Non-Goals
Assumptions
In a deployment environment, we assume the user has correctly set up NVMe/NVMe-oF support for GDS, and installed the cuFile kernel module and the cuFile user library.
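Before taking any cuFile code path, the plugin could cheaply probe whether these prerequisites are present. A minimal sketch, assuming the default install locations for the nvidia-fs proc entry and libcufile (both paths are assumptions about a typical setup, not guaranteed):

```scala
import java.nio.file.{Files, Paths}

object GdsProbe {
  // Returns true if the GDS prerequisites appear to be installed. The paths
  // below assume a default install: the nvidia-fs kernel module exposes a
  // proc entry, and libcufile ships with the CUDA toolkit.
  def gdsAvailable(): Boolean = {
    val kernelModuleLoaded = Files.exists(Paths.get("/proc/driver/nvidia-fs"))
    val userLibInstalled = Files.exists(Paths.get("/usr/local/cuda/lib64/libcufile.so"))
    kernelModuleLoaded && userLibInstalled
  }

  def main(args: Array[String]): Unit =
    if (gdsAvailable()) println("GDS prerequisites found")
    else println("GDS prerequisites missing; fall back to POSIX I/O")
}
```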
Risks
Design
This blog post provides a good overview of GDS, and there is some public documentation.
For spark-rapids, integration with GDS can be considered roughly along two dimensions: the type of data and the type of storage.
A Spark job may deal with these types of data involving disk storage:
Storage can be accessed in several ways:
Here we can see how these two dimensions interact:
We should start with local storage since it is the easiest to set up. Possible first steps:
- Spark has a ShuffleDataIO interface that can be overridden by passing the configuration spark.shuffle.sort.io.plugin.class. A new class can be implemented on top of the cuFile API, as sketched after this list. See blog post, JIRA, and SPIP. The drawback again is that this is single-node only.

At this point it is hard to predict which approach would provide the biggest benefit, so they probably should be tried in parallel.
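To make the shape of that plugin concrete, here is a minimal sketch against Spark 3.x's ShuffleDataIO API. It delegates everything to Spark's built-in local-disk implementation and only marks where cuFile-backed I/O would be swapped in; GdsShuffleDataIO and the com.example package are hypothetical names, not existing spark-rapids code.

```scala
package com.example

import java.util.{Map => JMap}

import org.apache.spark.SparkConf
import org.apache.spark.shuffle.api.{ShuffleDataIO, ShuffleDriverComponents, ShuffleExecutorComponents, ShuffleMapOutputWriter}
import org.apache.spark.shuffle.sort.io.LocalDiskShuffleDataIO

// Skeleton ShuffleDataIO plugin: starts from Spark's local-disk implementation
// and marks the points where cuFile-backed writers would replace POSIX I/O.
class GdsShuffleDataIO(sparkConf: SparkConf) extends ShuffleDataIO {
  private val localDisk = new LocalDiskShuffleDataIO(sparkConf)

  // Driver-side bookkeeping is unchanged; only executor-side I/O needs GDS.
  override def driver(): ShuffleDriverComponents = localDisk.driver()

  override def executor(): ShuffleExecutorComponents = new ShuffleExecutorComponents {
    private val localExecutor = localDisk.executor()

    override def initializeExecutor(
        appId: String, execId: String, extraConfigs: JMap[String, String]): Unit = {
      // TODO: open the cuFile driver here (via a JNI binding) before any writes.
      localExecutor.initializeExecutor(appId, execId, extraConfigs)
    }

    override def createMapOutputWriter(
        shuffleId: Int, mapTaskId: Long, numPartitions: Int): ShuffleMapOutputWriter = {
      // TODO: return a writer that registers the shuffle file with cuFile and
      // issues direct GPU-to-storage writes instead of the local-disk writer.
      localExecutor.createMapOutputWriter(shuffleId, mapTaskId, numPartitions)
    }
  }
}
```

The plugin would then be enabled with spark.shuffle.sort.io.plugin.class=com.example.GdsShuffleDataIO (substituting the real fully qualified class name).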
Alternatives Considered
We may want to set up a distributed filesystem (e.g. NFS) on a storage server and implement everything on top of that. This would make the implementation production ready, but setting it up may be complex and performance may suffer, so it is a higher-risk first step.
Tasks
This is the initial list of tasks; it may be modified or expanded later.