
[FEA] Multiple buffer copy kernel #7076

Closed
jlowe opened this issue Jan 5, 2021 · 17 comments
Labels: 0 - Backlog (In queue waiting for assignment), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

Comments

jlowe (Member) commented Jan 5, 2021

Is your feature request related to a problem? Please describe.
During Spark shuffle there are cases where we need to copy multiple buffers simultaneously. For example, after partitioning a task's data into 200 parts we use nvcomp's LZ4 to compress the 200 buffers in a batch operation, producing 200 output buffers that are typically oversized (as we have to estimate the output size when allocating the buffer before compression occurs). To release the unused memory we reallocate them, copying the 200 buffers to "right-sized" allocations, and this is currently performed with 200 separate cudaMemcpyAsync calls. It's much more efficient to invoke a kernel that performs the 200 copies in parallel.

Similarly during UCX shuffle send, we need to copy partitions into the registered memory buffers (i.e.: bounce buffers), and often we pack the transfer with multiple partitions, leading to another situation where we need to copy N buffers simultaneously. On the receiving end there's a similar situation where we need to copy the data out of the receipt bounce buffer into separate allocations, another N-buffer copy situation.

Describe the solution you'd like
libcudf could provide a multi-buffer copy API that takes the following inputs:

  • a vector of source buffer starting addresses
  • a vector of destination buffer starting addresses
  • a vector of buffer sizes
  • the rmm::cuda_stream_view to use for the copy kernel

The libcudf API would copy the source buffers to the corresponding destination addresses using a single CUDA kernel rather than invoking separate cudaMemcpyAsync operations for each one.
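
A minimal sketch of the idea, assuming the pointer and size arrays have already been staged in device memory; none of these names exist in libcudf, and a real implementation would balance large buffers across blocks and use wider, aligned accesses:

#include <cstddef>
#include <rmm/cuda_stream_view.hpp>

// Hypothetical sketch only: one block per buffer, each block strides over its
// buffer byte by byte, so all N copies are serviced by a single kernel launch.
__global__ void batched_memcpy_kernel(void* const* dsts,
                                      void const* const* srcs,
                                      std::size_t const* sizes)
{
  auto dst              = static_cast<unsigned char*>(dsts[blockIdx.x]);
  auto src              = static_cast<unsigned char const*>(srcs[blockIdx.x]);
  std::size_t const len = sizes[blockIdx.x];
  for (std::size_t i = threadIdx.x; i < len; i += blockDim.x) {
    dst[i] = src[i];
  }
}

// Host wrapper: launches one block per buffer on the caller's stream. The
// dsts/srcs/sizes arrays must already live in device-accessible memory.
void batched_memcpy(void* const* d_dsts,
                    void const* const* d_srcs,
                    std::size_t const* d_sizes,
                    std::size_t num_buffers,
                    rmm::cuda_stream_view stream)
{
  if (num_buffers == 0) return;
  batched_memcpy_kernel<<<static_cast<unsigned int>(num_buffers), 256, 0,
                          stream.value()>>>(d_dsts, d_srcs, d_sizes);
}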

jlowe added the feature request, Needs Triage, libcudf, and Spark labels on Jan 5, 2021
kkraus14 (Collaborator) commented Jan 8, 2021

This sounds like something more general than libcudf, maybe it should live in RMM or somewhere else that's more general?

kkraus14 removed the Needs Triage label on Jan 8, 2021
jlowe (Member, Author) commented Jan 8, 2021

I filed it here since I believe libcudf already has similar batch-copy code (in cuio and contiguous_split, IIRC). It might be easy to refactor that into something externally callable.

However I don't really care where it lives as long as we can expose a Java interface to it. RMM is probably a more appropriate place if this kernel would be useful in other RAPIDS libs.

harrism (Member) commented Jan 12, 2021

I would put it in libcudf unless and until it is needed elsewhere. Unnecessary baggage for RMM if it is not.

kkraus14 (Collaborator) commented

Only other thought would be in RAFT if it would have any use for cuml / cugraph / etc.

github-actions (bot) commented

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

github-actions bot added the stale label on Feb 16, 2021
jlowe (Member, Author) commented Feb 16, 2021

I would still love to see this functionality.

github-actions bot removed the stale label on Feb 16, 2021
github-actions (bot) commented

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

jlowe (Member, Author) commented Mar 18, 2021

Still would love to see this, as the use-cases are still valid.

kkraus14 added the 0 - Backlog label and removed the inactive-30d label on Mar 18, 2021
ttnghia self-assigned this on Mar 23, 2021
nvdbaranec (Contributor) commented

@jlowe My assumption is that both source and destination addresses might be arbitrarily aligned. Is that correct?

jlowe (Member, Author) commented Mar 25, 2021

In the use-cases I can think of so far, the source addresses would be aligned but the destinations would not necessarily be, e.g.: post batch compression where we need to gather M source buffers into N destination buffers with M > N. We also have a use case where we could do a buffer scatter, but I suspect in that case we could find a way to ensure the sub-buffer offsets are always aligned within the parent buffer.

nvdbaranec (Contributor) commented

Seems like the safe thing to do would be to plan for the worst. Shouldn't be too bad.

jrhemstad (Contributor) commented

The core piece of functionality needed here is a function like:

template <typename Group, typename Size>
void memcpy(Group g, void* destination, void* source, Size s);

It uses the group g to copy s bytes from source to destination. Size is a template to allow using an aligned_size type to signal that the pointers are aligned.

Work on this functionality is already in progress internally, and so this feature should wait until that is done.

nvdbaranec (Contributor) commented

The thing about this though is: A single buffer memcpy ends up scaling badly when called many times. It's the same thing as with contiguous_split: you want it all done in a single kernel call.

jrhemstad (Contributor) commented

The thing about this though is: A single buffer memcpy ends up scaling badly when called many times. It's the same thing as with contiguous_split: you want it all done in a single kernel call.

The function I described is a __device__ function (takes a CG). You call it in parallel from a single kernel.
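
For illustration, a rough sketch of that pattern, using cooperative_groups::memcpy_async as a stand-in for the function described above (an assumption for the sketch, not the actual API in progress); each block is the group and handles one buffer, so all N copies still happen in a single kernel launch:

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cstddef>

namespace cg = cooperative_groups;

// Each block cooperatively copies one buffer; the grid size equals the number
// of buffers, so the whole batch is one kernel launch.
__global__ void copy_buffers(void* const* dsts,
                             void const* const* srcs,
                             std::size_t const* sizes)
{
  auto block = cg::this_thread_block();
  auto dst   = static_cast<unsigned char*>(dsts[blockIdx.x]);
  auto src   = static_cast<unsigned char const*>(srcs[blockIdx.x]);
  cg::memcpy_async(block, dst, src, sizes[blockIdx.x]);
  cg::wait(block);  // wait until this group's copy has completed
}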

sameerz (Contributor) commented May 10, 2021

Depends on NVIDIA/cccl#944

jrhemstad (Contributor) commented

Closing this as this feature does not belong in libcudf. Instead, working on it as a CUB algorithm here: NVIDIA/cub#297

jakirkham (Member) commented

FYI: PR NVIDIA/cub#359 has landed, and it looks like it will be part of CUB 2.1.0, so this could be used if it is still of interest.
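
For anyone landing here later, a usage sketch of that algorithm (cub::DeviceMemcpy::Batched); the wrapper name batched_copy is hypothetical, the pointer and size arrays are assumed to already be device-accessible, and error checking is omitted:

#include <cub/device/device_memcpy.cuh>
#include <cstddef>
#include <cstdint>

// Performs all num_buffers copies with one batched CUB algorithm invocation.
// d_srcs, d_dsts, and d_sizes must be device-accessible arrays.
void batched_copy(void const** d_srcs, void** d_dsts, std::size_t* d_sizes,
                  std::uint32_t num_buffers, cudaStream_t stream)
{
  void* d_temp_storage           = nullptr;
  std::size_t temp_storage_bytes = 0;

  // First call only queries the amount of temporary storage required.
  cub::DeviceMemcpy::Batched(d_temp_storage, temp_storage_bytes, d_srcs, d_dsts,
                             d_sizes, num_buffers, stream);

  cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);

  // Second call performs the actual batched copy.
  cub::DeviceMemcpy::Batched(d_temp_storage, temp_storage_bytes, d_srcs, d_dsts,
                             d_sizes, num_buffers, stream);

  cudaFreeAsync(d_temp_storage, stream);
}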
