
[FEA] Multiple buffer copy kernel #7076

Closed
jlowe opened this issue Jan 5, 2021 · 17 comments
Labels: 0 - Backlog (In queue waiting for assignment), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

Comments

jlowe (Member) commented Jan 5, 2021

Is your feature request related to a problem? Please describe.
During Spark shuffle there are cases where we need to copy multiple buffers simultaneously. For example, after partitioning a task's data into 200 parts we use nvcomp's LZ4 to compress the 200 buffers in a batch operation, producing 200 output buffers that are typically oversized (as we have to estimate the output size when allocating the buffer before compression occurs). To release the unused memory we reallocate them, copying the 200 buffers to "right-sized" allocations, and this is currently performed with 200 separate cudaMemcpyAsync calls. It's much more efficient to invoke a kernel that performs the 200 copies in parallel.

Similarly during UCX shuffle send, we need to copy partitions into the registered memory buffers (i.e.: bounce buffers), and often we pack the transfer with multiple partitions, leading to another situation where we need to copy N buffers simultaneously. On the receiving end there's a similar situation where we need to copy the data out of the receipt bounce buffer into separate allocations, another N-buffer copy situation.

Describe the solution you'd like
libcudf could provide a multi-buffer copy API that takes the following inputs:

  • a vector of source buffer starting addresses
  • a vector of destination buffer starting addresses
  • a vector of buffer sizes
  • the rmm::cuda_stream_view to use for the copy kernel

The libcudf API would copy the source buffers to the corresponding destination addresses using a single CUDA kernel rather than invoking separate cudaMemcpyAsync operations for each one.
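
A minimal sketch of the idea, assuming the pointer and size arrays have already been staged in device memory; none of these names exist in libcudf, and a real implementation would balance large buffers across blocks and use wider, aligned accesses:

#include <cstddef>
#include <rmm/cuda_stream_view.hpp>

// Hypothetical sketch only: one block per buffer, each block strides over its
// buffer byte by byte, so all N copies are serviced by a single kernel launch.
__global__ void batched_memcpy_kernel(void* const* dsts,
                                      void const* const* srcs,
                                      std::size_t const* sizes)
{
  auto dst              = static_cast<unsigned char*>(dsts[blockIdx.x]);
  auto src              = static_cast<unsigned char const*>(srcs[blockIdx.x]);
  std::size_t const len = sizes[blockIdx.x];
  for (std::size_t i = threadIdx.x; i < len; i += blockDim.x) {
    dst[i] = src[i];
  }
}

// Host wrapper: launches one block per buffer on the caller's stream. The
// dsts/srcs/sizes arrays must already live in device-accessible memory.
void batched_memcpy(void* const* d_dsts,
                    void const* const* d_srcs,
                    std::size_t const* d_sizes,
                    std::size_t num_buffers,
                    rmm::cuda_stream_view stream)
{
  if (num_buffers == 0) return;
  batched_memcpy_kernel<<<static_cast<unsigned int>(num_buffers), 256, 0,
                          stream.value()>>>(d_dsts, d_srcs, d_sizes);
}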

jlowe added the feature request, Needs Triage, libcudf, and Spark labels on Jan 5, 2021
kkraus14 (Collaborator) commented Jan 8, 2021

This sounds like something more general than libcudf, maybe it should live in RMM or somewhere else that's more general?

kkraus14 removed the Needs Triage label on Jan 8, 2021
jlowe (Member, Author) commented Jan 8, 2021

I filed it here since I believe libcudf already has similar batch-copy code (in cuio and contiguous_split, IIRC). It might be easy to refactor that into something externally callable.

However I don't really care where it lives as long as we can expose a Java interface to it. RMM is probably a more appropriate place if this kernel would be useful in other RAPIDS libs.

harrism (Member) commented Jan 12, 2021

I would put it in libcudf unless and until it is needed elsewhere. Unnecessary baggage for RMM if it is not.

kkraus14 (Collaborator) commented

Only other thought would be in RAFT if it would have any use for cuml / cugraph / etc.

github-actions (bot) commented

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

github-actions bot added the stale label on Feb 16, 2021
jlowe (Member, Author) commented Feb 16, 2021

I would still love to see this functionality.

github-actions bot removed the stale label on Feb 16, 2021
github-actions (bot) commented

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

jlowe (Member, Author) commented Mar 18, 2021

Still would love to see this, as the use-cases are still valid.

kkraus14 added the 0 - Backlog label and removed the inactive-30d label on Mar 18, 2021
ttnghia self-assigned this on Mar 23, 2021
nvdbaranec (Contributor) commented

@jlowe My assumption is that both source and destination addresses might be arbitrarily aligned. Is that correct?

jlowe (Member, Author) commented Mar 25, 2021

In the use-cases I can think of so far, the source addresses would be aligned but the destinations would not necessarily be, e.g.: post batch compression where we need to gather M source buffers into N destination buffers with M > N. We also have a use case where we could do a buffer scatter, but I suspect in that case we could find a way to ensure the sub-buffer offsets are always aligned within the parent buffer.

nvdbaranec (Contributor) commented

Seems like the safe thing to do would be to plan for the worst. Shouldn't be too bad.

jrhemstad (Contributor) commented

The core piece of functionality needed here is a function like:

template <typename Group, typename Size>
void memcpy(Group g, void* destination, void* source, Size s);

It uses the group g to copy s bytes from source to destination. Size is a template to allow using an aligned_size type to signal that the pointers are aligned.

Work on this functionality is already in progress internally, and so this feature should wait until that is done.

nvdbaranec (Contributor) commented

The thing about this though is: A single buffer memcpy ends up scaling badly when called many times. It's the same thing as with contiguous_split: you want it all done in a single kernel call.

jrhemstad (Contributor) commented

The thing about this though is: A single buffer memcpy ends up scaling badly when called many times. It's the same thing as with contiguous_split: you want it all done in a single kernel call.

The function I described is a __device__ function (takes a CG). You call it in parallel from a single kernel.
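
For illustration, a rough sketch of that pattern, using cooperative_groups::memcpy_async as a stand-in for the function described above (an assumption for the sketch, not the actual API in progress); each block is the group and handles one buffer, so all N copies still happen in a single kernel launch:

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cstddef>

namespace cg = cooperative_groups;

// Each block cooperatively copies one buffer; the grid size equals the number
// of buffers, so the whole batch is one kernel launch.
__global__ void copy_buffers(void* const* dsts,
                             void const* const* srcs,
                             std::size_t const* sizes)
{
  auto block = cg::this_thread_block();
  auto dst   = static_cast<unsigned char*>(dsts[blockIdx.x]);
  auto src   = static_cast<unsigned char const*>(srcs[blockIdx.x]);
  cg::memcpy_async(block, dst, src, sizes[blockIdx.x]);
  cg::wait(block);  // wait until this group's copy has completed
}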

sameerz (Contributor) commented May 10, 2021

Depends on NVIDIA/cccl#944

jrhemstad (Contributor) commented

Closing this as this feature does not belong in libcudf. Instead, working on it as a CUB algorithm here: NVIDIA/cub#297

jakirkham (Member) commented

FYI: PR NVIDIA/cub#359 has landed, and it looks like it will be part of CUB 2.1.0, so this could be used if it is still of interest.
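
For anyone landing here later, a usage sketch of that algorithm (cub::DeviceMemcpy::Batched); the wrapper name batched_copy is hypothetical, the pointer and size arrays are assumed to already be device-accessible, and error checking is omitted:

#include <cub/device/device_memcpy.cuh>
#include <cstddef>
#include <cstdint>

// Performs all num_buffers copies with one batched CUB algorithm invocation.
// d_srcs, d_dsts, and d_sizes must be device-accessible arrays.
void batched_copy(void const** d_srcs, void** d_dsts, std::size_t* d_sizes,
                  std::uint32_t num_buffers, cudaStream_t stream)
{
  void* d_temp_storage           = nullptr;
  std::size_t temp_storage_bytes = 0;

  // First call only queries the amount of temporary storage required.
  cub::DeviceMemcpy::Batched(d_temp_storage, temp_storage_bytes, d_srcs, d_dsts,
                             d_sizes, num_buffers, stream);

  cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);

  // Second call performs the actual batched copy.
  cub::DeviceMemcpy::Batched(d_temp_storage, temp_storage_bytes, d_srcs, d_dsts,
                             d_sizes, num_buffers, stream);

  cudaFreeAsync(d_temp_storage, stream);
}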
