[BUG] find a way to stop allocating from RMM on the shuffle-client thread #1711
Comments
We talked about adding a bulk free API to RMM, but I think a less intrusive option is to make the arena allocator accumulate these small buffers until the total size reaches a certain threshold, and then free them all at once. I'll work on a PR for that.
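For illustration, here is a minimal sketch of that accumulation idea, assuming a hypothetical `deferred_free_arena` wrapper and a `release_to_global_arena` helper; it is not the actual `arena_memory_resource` code, just the shape of a batched-free approach:

```cpp
// Sketch only: accumulate small freed blocks and release them in bulk once a
// size threshold is reached. Class and helper names are hypothetical and do
// not reflect rmm::mr::arena_memory_resource internals.
#include <cstddef>
#include <utility>
#include <vector>

class deferred_free_arena {
 public:
  explicit deferred_free_arena(std::size_t flush_threshold)
    : flush_threshold_{flush_threshold}
  {
  }

  // Instead of returning each small block immediately, queue it up.
  void deallocate(void* ptr, std::size_t bytes)
  {
    pending_.emplace_back(ptr, bytes);
    pending_bytes_ += bytes;
    if (pending_bytes_ >= flush_threshold_) { flush(); }
  }

  // Release all queued blocks at once (e.g. back to the global arena), paying
  // the coordination/synchronization cost a single time per batch.
  void flush()
  {
    for (auto const& block : pending_) {
      release_to_global_arena(block.first, block.second);  // hypothetical helper
    }
    pending_.clear();
    pending_bytes_ = 0;
  }

 private:
  void release_to_global_arena(void* /*ptr*/, std::size_t /*bytes*/) {}

  std::vector<std::pair<void*, std::size_t>> pending_;
  std::size_t pending_bytes_{0};
  std::size_t flush_threshold_;
};
```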
… by default (#732)

This is done similarly to #702. Previously `arena_memory_resource` maintained a set of allocated blocks, but this was only used for reporting/debugging purposes. Maintaining this set requires a `set::find` at every deallocation, which can get expensive when there are many allocated blocks. This PR moves the tracking behind a default-undefined preprocessor flag. This results in some speedup in the random allocations benchmark for `arena_memory_resource`. Tracking can be enabled by defining `RMM_POOL_TRACK_ALLOCATIONS`. This should also fix the Spark small shuffle buffer issue: NVIDIA/spark-rapids#1711

Before:

```console
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_RandomAllocations/arena_mr/1000/1            1.36 ms         1.36 ms          457
BM_RandomAllocations/arena_mr/1000/4            1.21 ms         1.21 ms          517
BM_RandomAllocations/arena_mr/1000/64           1.22 ms         1.22 ms          496
BM_RandomAllocations/arena_mr/1000/256          1.08 ms         1.07 ms          535
BM_RandomAllocations/arena_mr/1000/1024        0.949 ms        0.948 ms          583
BM_RandomAllocations/arena_mr/1000/4096        0.853 ms        0.848 ms          680
BM_RandomAllocations/arena_mr/10000/1           98.7 ms         98.3 ms            8
BM_RandomAllocations/arena_mr/10000/4           65.4 ms         65.4 ms            9
BM_RandomAllocations/arena_mr/10000/64          16.6 ms         16.5 ms           38
BM_RandomAllocations/arena_mr/10000/256         11.2 ms         11.2 ms           48
BM_RandomAllocations/arena_mr/10000/1024        9.45 ms         9.44 ms           62
BM_RandomAllocations/arena_mr/10000/4096        9.24 ms         9.20 ms           59
BM_RandomAllocations/arena_mr/100000/1          7536 ms         7536 ms            1
BM_RandomAllocations/arena_mr/100000/4          3002 ms         3002 ms            1
BM_RandomAllocations/arena_mr/100000/64          170 ms          170 ms            3
BM_RandomAllocations/arena_mr/100000/256         107 ms          107 ms            7
BM_RandomAllocations/arena_mr/100000/1024       96.0 ms         95.7 ms            6
BM_RandomAllocations/arena_mr/100000/4096       86.7 ms         86.7 ms            6
```

After:

```console
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_RandomAllocations/arena_mr/1000/1            1.20 ms         1.20 ms          519
BM_RandomAllocations/arena_mr/1000/4            1.08 ms         1.08 ms          588
BM_RandomAllocations/arena_mr/1000/64           1.11 ms         1.11 ms          552
BM_RandomAllocations/arena_mr/1000/256         0.957 ms        0.957 ms          611
BM_RandomAllocations/arena_mr/1000/1024        0.857 ms        0.857 ms          687
BM_RandomAllocations/arena_mr/1000/4096        0.795 ms        0.793 ms          724
BM_RandomAllocations/arena_mr/10000/1           73.0 ms         73.0 ms           10
BM_RandomAllocations/arena_mr/10000/4           45.7 ms         45.7 ms           14
BM_RandomAllocations/arena_mr/10000/64          14.4 ms         14.4 ms           40
BM_RandomAllocations/arena_mr/10000/256         9.87 ms         9.82 ms           60
BM_RandomAllocations/arena_mr/10000/1024        8.72 ms         8.72 ms           69
BM_RandomAllocations/arena_mr/10000/4096        7.32 ms         7.30 ms           85
BM_RandomAllocations/arena_mr/100000/1          6384 ms         6384 ms            1
BM_RandomAllocations/arena_mr/100000/4          2480 ms         2480 ms            1
BM_RandomAllocations/arena_mr/100000/64          147 ms          147 ms            5
BM_RandomAllocations/arena_mr/100000/256         103 ms          103 ms            7
BM_RandomAllocations/arena_mr/100000/1024       78.1 ms         78.1 ms            9
BM_RandomAllocations/arena_mr/100000/4096       72.3 ms         72.3 ms            9
```

@abellina

Authors:
  - Rong Ou (@rongou)

Approvers:
  - Mark Harris (@harrism)
  - Conor Hoekstra (@codereport)

URL: #732
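As a rough illustration of the pattern described above (not the actual `arena_memory_resource` implementation), the tracking set and its per-deallocation lookup can be compiled out unless `RMM_POOL_TRACK_ALLOCATIONS` is defined; the class and helper names below are hypothetical:

```cpp
// Sketch only: the general pattern of keeping allocation tracking behind a
// default-undefined preprocessor flag, as described for RMM_POOL_TRACK_ALLOCATIONS.
// This is a simplified stand-in, not the actual arena_memory_resource code.
#include <cstddef>
#include <set>

class arena_like_resource {
 public:
  void* do_allocate(std::size_t bytes)
  {
    void* p = allocate_from_arena(bytes);
#ifdef RMM_POOL_TRACK_ALLOCATIONS
    allocated_blocks_.insert(p);  // only maintained when tracking is enabled
#endif
    return p;
  }

  void do_deallocate(void* p, std::size_t bytes)
  {
#ifdef RMM_POOL_TRACK_ALLOCATIONS
    allocated_blocks_.erase(p);  // the set lookup that used to run on every free
#endif
    return_to_arena(p, bytes);
  }

 private:
  // Hypothetical fast paths, modeled here with plain operator new/delete.
  void* allocate_from_arena(std::size_t bytes) { return ::operator new(bytes); }
  void return_to_arena(void* p, std::size_t) { ::operator delete(p); }

#ifdef RMM_POOL_TRACK_ALLOCATIONS
  std::set<void*> allocated_blocks_;
#endif
};
```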
@abellina this should be fixed when RMM makes it into the next nightly cudf jni jar. Can you verify when you get the chance?
@rongou, I can confirm that small buffers are no longer seeing the synchronization train I saw after the `concat`.
Yes, looks good.
We have recently seen traces where a train of `cudaStreamSynchronize` calls can be seen after a `concat` (the shuffle iterator gets a batch on next, we concat them at `concat`, and then many synchronizes follow). This is happening because the shuffle-client thread can allocate RMM buffers that are smaller than the `minimum_superblock_size` (256 KiB) and then hand them off to another thread (the task thread) to actually perform the work and eventually free them.

When ARENA is freeing the buffer in our Spark task thread/stream, it wants to place the freed block in the global arena, because the allocation came from the shuffle-client arena (i.e. this is an "external" buffer). The synchronization done here helps prevent another stream from taking the buffer from the global arena and starting to write to it before it is safe.
This task is to figure out a way to have the task thread do this instead of the shuffle-client thread, such that frees are "free" in this case, since the blocks will be "deposited" back into the local stream arena instead of going to the global arena.
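A hedged sketch of that idea, using plain host-side stand-ins: the queue, `pending_buffer`, and the `new`/`delete` calls below are hypothetical placeholders for the shuffle handoff and the RMM/stream operations. The shuffle-client thread only enqueues the received bytes; the task thread performs the allocation, the work, and the free, so the block is returned to the task thread's own arena rather than the global one.

```cpp
// Sketch only: move the device allocation onto the task thread so that
// allocation and free happen in the same per-thread/per-stream arena.
#include <condition_variable>
#include <cstddef>
#include <cstring>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct pending_buffer {
  std::vector<char> host_data;  // stand-in for what the shuffle client received
};

std::queue<pending_buffer> handoff;
std::mutex m;
std::condition_variable cv;
bool done = false;

// Shuffle-client thread: does NOT allocate from its own arena; it only enqueues.
void shuffle_client()
{
  for (int i = 0; i < 8; ++i) {
    std::lock_guard<std::mutex> lk{m};
    handoff.push(pending_buffer{std::vector<char>(64 * 1024)});  // smaller than 256 KiB
    cv.notify_one();
  }
  {
    std::lock_guard<std::mutex> lk{m};
    done = true;
  }
  cv.notify_one();
}

// Task thread: performs the allocation, the work, and the free, so the block is
// deposited back into this thread's local arena rather than the global arena.
void task_thread()
{
  while (true) {
    std::unique_lock<std::mutex> lk{m};
    cv.wait(lk, [] { return !handoff.empty() || done; });
    if (handoff.empty() && done) { return; }
    pending_buffer buf = std::move(handoff.front());
    handoff.pop();
    lk.unlock();

    // Stand-in for an RMM allocation plus an async copy on the task stream.
    char* dev_like = static_cast<char*>(::operator new(buf.host_data.size()));
    std::memcpy(dev_like, buf.host_data.data(), buf.host_data.size());
    // ... do the actual work here ...
    ::operator delete(dev_like);  // freed by the same thread/arena that allocated it
  }
}

int main()
{
  std::thread producer{shuffle_client};
  std::thread consumer{task_thread};
  producer.join();
  consumer.join();
}
```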
@jlowe @rongou fyi.