Load balance optimization for contiguous_split #9755
Conversation
…opied over all the SMs as best it can. Previously, it was vulnerable to low numbers of splits/columns or to large discrepancies between column sizes (e.g., large string columns combined with columns of shorts).
Have you compared performance for a non-degenerate case?
cpp/src/copying/contiguous_split.cu (outdated)

```cpp
int device;
cudaGetDevice(&device);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device);
```
I do not recommend basing this on the exact SM count. A more canonical way to do this kind of load balancing is to just pick a partition size and divide up your input data according to that partition size and then distribute those partitions among CTAs.
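If it helps to picture the suggestion, here is a minimal sketch of this kind of partition-based distribution (illustrative names only, not the PR's actual kernel): the grid size is derived from a fixed partition size rather than from the device's SM count, and the hardware scheduler handles the balancing.

```cpp
// Minimal sketch: each thread block copies one fixed-size partition of the source,
// so occupancy is driven by the partition count, not by a queried SM count.
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

__global__ void copy_partitions(uint8_t const* src,
                                uint8_t* dst,
                                std::size_t total_bytes,
                                std::size_t partition_size)
{
  // each block owns the [begin, end) byte range of its partition
  std::size_t const begin = static_cast<std::size_t>(blockIdx.x) * partition_size;
  std::size_t end         = begin + partition_size;
  if (end > total_bytes) { end = total_bytes; }
  for (std::size_t i = begin + threadIdx.x; i < end; i += blockDim.x) {
    dst[i] = src[i];
  }
}

// launch: one block per partition; no cudaGetDeviceProperties query needed
// auto const partition_size = std::size_t{1} << 20;  // chosen empirically
// auto const num_partitions = (total_bytes + partition_size - 1) / partition_size;
// copy_partitions<<<num_partitions, 256, 0, stream>>>(src, dst, total_bytes, partition_size);
```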
I suggest looking at the load balancing algorithm @elstehle developed for NVIDIA/cub#297.
Are you suggesting changing contiguous split to use cooperative groups? That's kind of a massive change.
No. My main point is to avoid the cudaGetDeviceProperties call and basing the load balancing on the exact SM count. That isn't necessary. As is, this code calculates a chunk size based on the number of SMs and chunks the input data accordingly. I'm saying to pick a chunk size independent of the number of SMs (based on empirical results from benchmarking).
That could still lead to load imbalance if the incoming chunks are already very small, so you'd really want to do something where you can also aggregate incoming chunks up to your chunk size. At that point you'd be in the territory of the load balancing algorithm @elstehle developed in NVIDIA/cub#297, which is why I suggested taking a look.
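A rough sketch of the aggregation idea mentioned above (hypothetical names; this is neither cub's BatchMemcpy nor the PR's code): treat all buffers as one logical byte range via a prefix sum of their sizes, cut that range into fixed-size work items, and map a work item back to the buffer it starts in with a binary search.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// buffer_offsets is the exclusive prefix sum of buffer sizes (starts at 0).
// Returns the index of the first buffer touched by work item k of size W bytes.
std::size_t first_buffer_of_work_item(std::vector<std::size_t> const& buffer_offsets,
                                      std::size_t k,
                                      std::size_t W)
{
  std::size_t const start_byte = k * W;
  // find the last buffer whose starting offset is <= start_byte
  auto const it = std::upper_bound(buffer_offsets.begin(), buffer_offsets.end(), start_byte);
  return static_cast<std::size_t>(it - buffer_offsets.begin()) - 1;
}
```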
Edit: Jake beat me to it. I don't think @jrhemstad necessarily referred to using cooperative groups. I think Jake rather referred to the scenario where copying chunk `0` (on SM `0`) takes (considerably) longer than copying chunk `1` (on SM `1`). Assume we had just 2 SMs and we're copying 2 chunks of 1 GB each, and for some reason copying chunk `0` is 25% slower. On a vertical time scale we'd be looking at something like this:

| t | SM0 | SM1 |
|---|---|---|
| 0.0ms | Buff#0.bytes:[0,200) | Buff#1.bytes:[0,250) |
| 0.1ms | Buff#0.bytes:[200,400) | Buff#1.bytes:[250,500) |
| 0.2ms | Buff#0.bytes:[400,600) | Buff#1.bytes:[500,750) |
| 0.3ms | Buff#0.bytes:[600,800) | Buff#1.bytes:[750,1000) |
| 0.4ms | Buff#0.bytes:[800,1000) | |

So, from `0.4ms` until `0.5ms`, `SM1` is sitting idle, while in the best case it could help out copying `Buff#0` once it has finished its own chunk. If, instead of splitting the total work into roughly `num_sms` pieces, we subdivide the work into smaller "work items" of, say, 100 MB, and have thread block `threadBlockIdx.x` work on the `threadBlockIdx.x`-th such work item, then the GPU's scheduler can schedule a new thread block at the time when `SM1` would otherwise be sitting idle. The smaller the work items, the smaller the load imbalance. The `100 MB` is what Jake is referring to as the "partition size". I'm not yet sure what a good choice is here, probably something on the order of 1 MB? If we assume copy throughput to be around 250 GB/s, that would make the worst-case load imbalance around `4 us` (which seems acceptable). The trade-off being: the smaller the partition size, the more involved the computation of the assignment from partitions to the byte range within a buffer (involved = run time & memory capacity for materializing the assignments).
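(For readers following along, the `4 us` estimate above is just the work-item size divided by the assumed copy throughput, taking 1 MB ≈ 10^6 bytes:)

$$\frac{1\ \text{MB}}{250\ \text{GB/s}} \approx \frac{10^{6}\ \text{B}}{250\times10^{9}\ \text{B/s}} = 4\ \mu\text{s}$$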
> At that point you'd be in the territory of the load balancing algorithm @elstehle developed in NVIDIA/cub#297, which is why I suggested taking a look.
@jrhemstad, do you think we should consider the scenario where a buffer is too small to be copied by a whole thread block (i.e., we need sub-thread-block granularity)? If we do, I will try to better understand how we can refactor BatchMemcpy to meet this use case, because I think it will otherwise become too complex to keep the logic here.
I see. The way this works is that it does a pretty good job of making sure the # of bytes being copied by each block is the same. It's not always exactly the same, but it's quite close. So the "outlier" case is typically "many equally-sized buffers, plus a number of much smaller ones", not "many equally-sized buffers, plus a number of much larger ones". The idea being: the much smaller ones will end up churning through 1 SM while the remainder take up an approximately equal amount of time on the rest.
If I'm reading this right, you're thinking maybe just say: "divide this up into 1 MB pieces across the board". As you mention, this will increase the time spent running the setup code, but maybe not too badly.
I'll try this as well as the idea from a few days ago: using some multiple of the # of SMs to squeeze the difference between buffers that get repartitioned and the ones that don't (typically just validity), which end up being small outliers.
So I tried out a couple of things:
- Uniform subdivision of buffers at various levels of granularity, ranging from 64 KB to 2 MB.
- Uniform subdivision of buffers based on a larger multiple of the SM count (8), with a minimum size of 256 KB.

These all worked out roughly the same on my various interesting benchmarks; however, they did edge out the implementation in the existing PR a tiny bit:

4 GB, 4 columns, no splits: 198.642 G/s -> 208-212 G/s
1 GB, 1 column + validity, no splits: 189.784 G/s -> 211-213 G/s
The new code (both subdivisions):
```cpp
// A: straight constant
auto const desired_chunk_size = size_t{1 * 1024 * 1024};

// B: larger multiple of SM count with uniform subdivision across all columns
// (only one of A or B is active at a time)
auto const num_sms            = prop.multiProcessorCount * 8;
auto const desired_chunk_size = max(size_t{256 * 1024}, util::round_up_safe(total_bytes / num_sms, split_align));

thrust::transform(
  rmm::exec_policy(stream),
  _d_dst_buf_info,
  _d_dst_buf_info + num_bufs,
  chunks.begin(),
  [desired_chunk_size] __device__(dst_buf_info const& buf) {
    // how many chunks do we want to subdivide this buffer into
    size_t const bytes = buf.num_elements * buf.element_size;
    // can happen for things like lists and strings (the root columns store no data)
    if (bytes == 0) { return thrust::pair<size_t, size_t>{1, 0}; }
    size_t const num_chunks =
      max(size_t{1}, util::round_up_safe(bytes, desired_chunk_size) / desired_chunk_size);
    // NOTE: leaving chunk size as a separate parameter for future tuning possibilities,
    // even though in the current implementation it will be a constant.
    return thrust::pair<size_t, size_t>{num_chunks, desired_chunk_size};
  });
```
We'll want to run this against the Spark benchmarks again, but it seems like a reasonable change. Does this seem more in line with what you guys are thinking?
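As a side note for readers: a simplified host-side sketch, with hypothetical names, of one way the per-buffer `(num_chunks, chunk_size)` pairs produced above could be expanded into flat per-chunk copy assignments, where each resulting chunk would then be handled by one thread block.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct chunk_assignment {
  std::size_t buffer_index;
  std::size_t begin;  // byte offset within the buffer
  std::size_t end;    // one past the last byte to copy
};

// buffer_bytes[i] is the size of buffer i; chunking[i] is its (num_chunks, chunk_size)
std::vector<chunk_assignment> expand_chunks(
  std::vector<std::size_t> const& buffer_bytes,
  std::vector<std::pair<std::size_t, std::size_t>> const& chunking)
{
  std::vector<chunk_assignment> out;
  for (std::size_t b = 0; b < buffer_bytes.size(); ++b) {
    auto const [num_chunks, chunk_size] = chunking[b];
    for (std::size_t c = 0; c < num_chunks; ++c) {
      std::size_t const begin = c * chunk_size;
      std::size_t const end   = std::min(begin + chunk_size, buffer_bytes[b]);
      out.push_back({b, begin, end});  // zero-sized buffers yield a single empty chunk
    }
  }
  return out;  // one copy thread block per entry
}
```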
Looks good 👍 Thanks for making the changes and evaluation. Performance numbers look robust (at least amongst the two evaluated cases). Hope we can confirm this with additional Spark benchmarks.
Rob did a run on Friday with the new code. It does carve out some additional gains for us. Code updated.
Codecov Report
```diff
@@            Coverage Diff             @@
##         branch-22.02    #9755    +/-   ##
============================================
- Coverage       10.49%   10.41%   -0.08%
============================================
  Files             119      119
  Lines           20305    20480     +175
============================================
+ Hits             2130     2134       +4
- Misses          18175    18346     +171
```
Continue to review full report at Codecov.
Some basic comments pending changes to the load balancing algorithm.
…. Simply subdivide each buffer to be copied into 1 MB chunks.
Looking like a great speedup here. Note that copyrights need to be updated as well.
LGTM
@gpucibot merge
The existing contiguous_split implementation was vulnerable to situations where (number of columns N) * (number of splits M) was less than the number of SMs on the GPU. This PR implements a postprocessing step which attempts to distribute the number of bytes to be copied as evenly as possible across all available SMs.

The PR has been updated to repartition using a constant chunk size of 1 MB. This yields better results than the initial approach.
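To make the degenerate case concrete with an assumed illustration (a T4 has 40 SMs): a 4-column table with no splits (a single output partition) yields roughly

N × M = 4 × 1 = 4 large copy buffers < 40 SMs,

so, before this change, only a small fraction of the GPU had any copy work scheduled on it.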
Before/after benchmarks for some particularly degenerate cases (T4)