Load balance optimization for contiguous_split #9755
Conversation
…opied over all the SMs as best it can. Previously, it was vulnerable to low numbers of splits/columns or to large discrepancies between column sizes (e.g., large string columns combined with columns of shorts).
Have you compared performance for a non-degenerate case?
cpp/src/copying/contiguous_split.cu (outdated)

```cpp
int device;
cudaGetDevice(&device);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device);
```
I do not recommend basing this on the exact SM count. A more canonical way to do this kind of load balancing is to just pick a partition size and divide up your input data according to that partition size and then distribute those partitions among CTAs.
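If it helps to picture the suggestion, here is a minimal sketch of this kind of partition-based distribution (illustrative names only, not the PR's actual kernel): the grid size is derived from a fixed partition size rather than from the device's SM count, and the hardware scheduler handles the balancing.

```cpp
// Minimal sketch: each thread block copies one fixed-size partition of the source,
// so occupancy is driven by the partition count, not by a queried SM count.
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

__global__ void copy_partitions(uint8_t const* src,
                                uint8_t* dst,
                                std::size_t total_bytes,
                                std::size_t partition_size)
{
  // each block owns the [begin, end) byte range of its partition
  std::size_t const begin = static_cast<std::size_t>(blockIdx.x) * partition_size;
  std::size_t end         = begin + partition_size;
  if (end > total_bytes) { end = total_bytes; }
  for (std::size_t i = begin + threadIdx.x; i < end; i += blockDim.x) {
    dst[i] = src[i];
  }
}

// launch: one block per partition; no cudaGetDeviceProperties query needed
// auto const partition_size = std::size_t{1} << 20;  // chosen empirically
// auto const num_partitions = (total_bytes + partition_size - 1) / partition_size;
// copy_partitions<<<num_partitions, 256, 0, stream>>>(src, dst, total_bytes, partition_size);
```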
I suggest looking at the load balancing algorithm @elstehle developed for NVIDIA/cub#297.
Are you suggesting changing contiguous split to use cooperative groups? That's kind of a massive change.
No. My main point is to avoid the cudaGetDeviceProperties call and basing the load balancing on the exact SM count. That isn't necessary. As is, this code calculates a chunk size based on the number of SMs and chunks the input data accordingly. I'm saying to pick a chunk size independent of the number of SMs (based on empirical results from benchmarking).
That could still lead to load imbalance if the incoming chunks are already very small, so you'd really want to do something where you can also aggregate incoming chunks up to your chunk size. At that point you'd be in the territory of the load balancing algorithm @elstehle developed in NVIDIA/cub#297, which is why I suggested taking a look.
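A rough sketch of the aggregation idea mentioned above (hypothetical names; this is neither cub's BatchMemcpy nor the PR's code): treat all buffers as one logical byte range via a prefix sum of their sizes, cut that range into fixed-size work items, and map a work item back to the buffer it starts in with a binary search.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// buffer_offsets is the exclusive prefix sum of buffer sizes (starts at 0).
// Returns the index of the first buffer touched by work item k of size W bytes.
std::size_t first_buffer_of_work_item(std::vector<std::size_t> const& buffer_offsets,
                                      std::size_t k,
                                      std::size_t W)
{
  std::size_t const start_byte = k * W;
  // find the last buffer whose starting offset is <= start_byte
  auto const it = std::upper_bound(buffer_offsets.begin(), buffer_offsets.end(), start_byte);
  return static_cast<std::size_t>(it - buffer_offsets.begin()) - 1;
}
```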
Edit: Jake beat me to it. I don't think @jrhemstad necessarily referred to using cooperative groups. I think Jake rather referred to the scenario where copying chunk `0` (on SM `0`) takes (considerably) longer than copying chunk `1` (on SM `1`). Assume we had just 2 SMs and we're copying 2 chunks of 1 GB each, and for some reason copying chunk `0` is 25% slower. On a vertical time scale we'd be looking at something like this:

| t | SM0 | SM1 |
|---|---|---|
| 0.0ms | Buff#0.bytes:[0,200) | Buff#1.bytes:[0,250) |
| 0.1ms | Buff#0.bytes:[200,400) | Buff#1.bytes:[250,500) |
| 0.2ms | Buff#0.bytes:[400,600) | Buff#1.bytes:[500,750) |
| 0.3ms | Buff#0.bytes:[600,800) | Buff#1.bytes:[750,1000) |
| 0.4ms | Buff#0.bytes:[800,1000) | |

So, from `0.4ms` until `0.5ms`, `SM1` is sitting idle, while in the best case it could help out copying `Buff#0` once it has finished its own chunk. If, instead of splitting the total work into roughly `num_sms` pieces, we subdivide the work into smaller "work items" of, say, 100 MB, and have thread block `threadBlockIdx.x` work on the `threadBlockIdx.x`-th such work item, then the GPU's scheduler can schedule a new thread block at the time when `SM1` would otherwise be sitting idle. The smaller the work items, the smaller the load imbalance. The `100 MB` is what Jake is referring to as the "partition size". I'm not yet sure what a good choice is here, probably something on the order of 1 MB? If we assume copy throughput to be around 250 GB/s, that would make the worst-case load imbalance around `4 us` (which seems acceptable). The trade-off being: the smaller the partition size, the more involved the computation of the assignment from partitions to the byte range within a buffer (involved = run time & memory capacity for materializing the assignments).
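(For readers following along, the `4 us` estimate above is just the work-item size divided by the assumed copy throughput, taking 1 MB ≈ 10^6 bytes:)

$$\frac{1\ \text{MB}}{250\ \text{GB/s}} \approx \frac{10^{6}\ \text{B}}{250\times10^{9}\ \text{B/s}} = 4\ \mu\text{s}$$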
> At that point you'd be in the territory of the load balancing algorithm @elstehle developed in NVIDIA/cub#297, which is why I suggested taking a look.
@jrhemstad, do you think we should consider the scenario where a buffer is too small to be copied by a whole thread block (i.e., we need sub-thread-block granularity)? If we do, I will try to better understand how we can refactor BatchMemcpy to meet this use case, because I think it will otherwise become too complex to keep the logic here.
I see. The way this works is that it does a pretty good job of making sure the # of bytes being copied by each block is the same. It's not always exactly the same, but it's quite close. So the "outlier" case is typically "many equally-sized buffers, plus a number of much smaller ones", not "many equally-sized buffers, plus a number of much larger ones". The idea being: the much smaller ones will end up churning through 1 SM while the remainder take up an approximately equal amount of time on the rest.
If I'm reading this right, you're thinking maybe just say: "divide this up into 1 MB pieces across the board". As you mention, this will increase the time spent running the setup code, but maybe not too badly.
I'll try this as well as the idea from a few days ago: using some multiple of the # of SMs to squeeze the difference between buffers that get repartitioned and the ones that don't (typically just validity), which end up being small outliers.
So I tried out a couple of things:
- Uniform subdivision of buffers at various levels of granularity, ranging from 64 KB to 2 MB.
- Uniform subdivision of buffers based on a larger multiple of the SM count (8), with a minimum size of 256 KB.

These all worked out roughly the same on my various interesting benchmarks; however, they did edge out the implementation in the existing PR a tiny bit:

4 GB, 4 columns, no splits: 198.642 G/s -> 208-212 G/s
1 GB, 1 column + validity, no splits: 189.784 G/s -> 211-213 G/s
The new code (both subdivisions):
```cpp
// A: straight constant
auto const desired_chunk_size = size_t{1 * 1024 * 1024};

// B: larger multiple of SM count with uniform subdivision across all columns
// (only one of A or B is active at a time)
auto const num_sms            = prop.multiProcessorCount * 8;
auto const desired_chunk_size = max(size_t{256 * 1024}, util::round_up_safe(total_bytes / num_sms, split_align));

thrust::transform(
  rmm::exec_policy(stream),
  _d_dst_buf_info,
  _d_dst_buf_info + num_bufs,
  chunks.begin(),
  [desired_chunk_size] __device__(dst_buf_info const& buf) {
    // how many chunks do we want to subdivide this buffer into
    size_t const bytes = buf.num_elements * buf.element_size;
    // can happen for things like lists and strings (the root columns store no data)
    if (bytes == 0) { return thrust::pair<size_t, size_t>{1, 0}; }
    size_t const num_chunks =
      max(size_t{1}, util::round_up_safe(bytes, desired_chunk_size) / desired_chunk_size);
    // NOTE: leaving chunk size as a separate parameter for future tuning possibilities,
    // even though in the current implementation it will be a constant.
    return thrust::pair<size_t, size_t>{num_chunks, desired_chunk_size};
  });
```
We'll want to run this against the Spark benchmarks again, but it seems like a reasonable change. Does this seem more in line with what you guys are thinking?
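As a side note for readers: a simplified host-side sketch, with hypothetical names, of one way the per-buffer `(num_chunks, chunk_size)` pairs produced above could be expanded into flat per-chunk copy assignments, where each resulting chunk would then be handled by one thread block.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct chunk_assignment {
  std::size_t buffer_index;
  std::size_t begin;  // byte offset within the buffer
  std::size_t end;    // one past the last byte to copy
};

// buffer_bytes[i] is the size of buffer i; chunking[i] is its (num_chunks, chunk_size)
std::vector<chunk_assignment> expand_chunks(
  std::vector<std::size_t> const& buffer_bytes,
  std::vector<std::pair<std::size_t, std::size_t>> const& chunking)
{
  std::vector<chunk_assignment> out;
  for (std::size_t b = 0; b < buffer_bytes.size(); ++b) {
    auto const [num_chunks, chunk_size] = chunking[b];
    for (std::size_t c = 0; c < num_chunks; ++c) {
      std::size_t const begin = c * chunk_size;
      std::size_t const end   = std::min(begin + chunk_size, buffer_bytes[b]);
      out.push_back({b, begin, end});  // zero-sized buffers yield a single empty chunk
    }
  }
  return out;  // one copy thread block per entry
}
```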
Looks good 👍 Thanks for making the changes and evaluation. Performance numbers look robust (at least amongst the two evaluated cases). Hope we can confirm this with additional Spark benchmarks.
Rob did a run on Friday with the new code. It does carve out some additional gains for us. Code updated.
Codecov Report
```diff
@@            Coverage Diff             @@
##         branch-22.02    #9755    +/-   ##
============================================
- Coverage       10.49%   10.41%   -0.08%
============================================
  Files             119      119
  Lines           20305    20480     +175
============================================
+ Hits             2130     2134       +4
- Misses          18175    18346     +171
```
Continue to review full report at Codecov.
Some basic comments pending changes to the load balancing algorithm.
…. Simply subdivide each buffer to be copied into 1 MB chunks.
Looking like a great speedup here. Note that copyrights need to be updated as well.
LGTM
@gpucibot merge
The existing contiguous_split implementation was vulnerable to situations where (number of columns N) * (number of splits M) was less than the number of SMs on the GPU. This PR implements a postprocessing step which attempts to distribute the number of bytes to be copied as evenly as possible across all available SMs.

The PR has been updated to repartition using a constant chunk size of 1 MB. This yields better results than the initial approach.
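To make the degenerate case concrete with an assumed illustration (a T4 has 40 SMs): a 4-column table with no splits (a single output partition) yields roughly

N × M = 4 × 1 = 4 large copy buffers < 40 SMs,

so, before this change, only a small fraction of the GPU had any copy work scheduled on it.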
Before/after benchmarks for some particularly degenerate cases (T4)