
Refactor distinct using static_map insert_or_apply #16484

Open · wants to merge 10 commits into base: branch-24.10
Conversation

srinivasyadav18 (Contributor)

Description

This PR refactors distinct using static_map::insert_or_apply for keep_first, keep_last and keep_none options.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 2, 2024
auto keys = rmm::device_uvector<size_type>(map_size, stream, mr);
auto values = rmm::device_uvector<size_type>(map_size, stream, mr);

map.retrieve_all(keys.begin(), values.begin(), stream);
Contributor

Can we use a discard iterator for the keys? I don't think we need to materialize them since they're not returned.
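A discard output iterator simply swallows writes. Below is a minimal host-side sketch of the idea (a hypothetical analogue — in the PR the real thrust::discard_iterator would be passed as the keys output of map.retrieve_all so that only the values are materialized):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical analogue of thrust::discard_iterator: an output iterator
// that accepts writes and throws them away.
struct discard_iterator {
    struct sink {
        template <typename T> sink& operator=(T&&) { return *this; }
    };
    sink operator*() const { return {}; }
    discard_iterator& operator++() { return *this; }
    discard_iterator operator++(int) { return *this; }
};

// Sketch of retrieve_all: writes to keys_out are discarded when KeyOut
// is discard_iterator, so only the values buffer is materialized.
template <typename KeyOut, typename ValOut>
void retrieve_all_sketch(const std::vector<std::pair<int, int>>& map_contents,
                         KeyOut keys_out, ValOut values_out) {
    for (auto const& kv : map_contents) {
        *keys_out++   = kv.first;   // discarded
        *values_out++ = kv.second;  // kept
    }
}
```

This avoids allocating and filling the `keys` device_uvector entirely when the keys are never read back.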

Comment on lines 136 to 137
auto plusop = plus_op{};
map.insert_or_apply(pairs, pairs + num_rows, plusop, stream.value());
Contributor

Suggested change:
- auto plusop = plus_op{};
- map.insert_or_apply(pairs, pairs + num_rows, plusop, stream.value());
+ map.insert_or_apply(pairs, pairs + num_rows, plus_op{}, stream.value());

Comment on lines 151 to 152
[values = values.begin(),
keys = keys.begin(),
Contributor

Feels more natural to order these as keys, then values.

Suggested change:
- [values = values.begin(),
-  keys = keys.begin(),
+ [keys = keys.begin(),
+  values = values.begin(),

}
});

auto const map_end = thrust::copy_if(
Contributor

The for_each above is unnecessary afaict. I think we can use a single copy_if with a counting transform iterator that checks the value for a given index, and copies the key for that index if so. That avoids materializing the intermediate output_indices.
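As a rough host-side sketch of that fusion (hypothetical names and predicate — on device this would be a single thrust::copy_if over a counting iterator, with the predicate reading the value at each index, avoiding the intermediate output_indices buffer):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single-pass sketch: iterate a counting sequence [0, n), check the value
// for each index, and copy the key for that index when the predicate holds.
// Here the predicate "count == 1" stands in for a keep_none-style check;
// the actual condition in the PR may differ.
std::vector<int> copy_keys_if(const std::vector<int>& keys,
                              const std::vector<int>& values) {
    std::vector<int> out;
    for (std::size_t idx = 0; idx < keys.size(); ++idx) {
        if (values[idx] == 1) {       // predicate on the value at idx
            out.push_back(keys[idx]); // copy the key for that index
        }
    }
    return out;
}
```

The for_each + copy_if pair collapses into one kernel launch and one pass over the data.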

Member

Good catch

Contributor

This header is now empty. Can we delete it and remove it from #includes?

Contributor

Likewise, this file is now empty and should be deleted and removed from CMakeLists.txt.

@github-actions github-actions bot added the CMake CMake build issue label Aug 2, 2024
@srinivasyadav18 srinivasyadav18 added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 2, 2024
@srinivasyadav18 srinivasyadav18 marked this pull request as ready for review August 2, 2024 23:24
@srinivasyadav18 srinivasyadav18 requested review from a team as code owners August 2, 2024 23:24
@srinivasyadav18 (Contributor Author)

@bdice @PointKernel, I pushed all the cleanups and minor optimizations. The PR is ready for review.

Also, I don't think we need any extra optimization for keep_none. As @bdice mentioned, a counting transform iterator with copy_if should be efficient enough.

Reverting this commit, as the compilation time of distinct.cu
is almost 6 minutes.

This reverts commit 5b66240.
@github-actions github-actions bot removed the CMake CMake build issue label Aug 4, 2024
@PointKernel (Member)

@srinivasyadav18 can you please share the performance comparisons before and after this PR?

@bdice bdice mentioned this pull request Aug 5, 2024
@GregoryKimball (Contributor)

Based on this gist, here are the most important performance signals:
[benchmark comparison plot]

I'm comparing the 1B row case, dropping keep=any because it's not impacted by this PR, and dropping keep=last because it is identical to keep=first.

  • drop at 1B for keep=first
  • drop at medium-low cardinality (1-100K)
  • improvement at low cardinality (100) for I32
  • improvement at 1-100M

Overall, I believe this performance signature is a wash. Perhaps the shifting throughputs are due to cache locality differences. It's worth profiling the 1B rows, 1B cardinality, keep first case and the 1B rows, 100K cardinality, keep first case.

@srinivasyadav18 (Contributor Author)

srinivasyadav18 commented Aug 6, 2024

Thanks @GregoryKimball for providing the plot and the important performance signals.

I have done some profiling on 1 billion keys with 1 billion cardinality, and this is what I found.

Overall, the distinct algorithm has two major steps:

  • distinct_indices
    • initialize data structures
    • perform the reductions (reduce_by_row)
    • return the output_indices (all the distinct indices based on the keep option)
  • gather
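For illustration, the two steps above can be sketched host-side (a hypothetical analogue using std::unordered_map in place of the device hash table; keep_first semantics shown — the real implementation uses cuco on device and works on table rows, not plain ints):

```cpp
#include <algorithm>
#include <unordered_map>
#include <vector>

// Step 1 (distinct_indices): reduce row indices per key, keeping the
// first occurrence of each key (keep_first).
std::vector<int> distinct_indices_keep_first(const std::vector<int>& keys) {
    std::unordered_map<int, int> map;  // key -> first row index
    for (int i = 0; i < static_cast<int>(keys.size()); ++i) {
        map.emplace(keys[i], i);  // emplace inserts only if the key is absent
    }
    std::vector<int> indices;
    for (auto const& kv : map) indices.push_back(kv.second);
    return indices;  // note: unordered, like the device hash table's output
}

// Step 2 (gather): copy the rows at the selected indices.
std::vector<int> gather(const std::vector<int>& rows,
                        const std::vector<int>& indices) {
    std::vector<int> out;
    for (int i : indices) out.push_back(rows[i]);
    return out;
}
```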

Time taken for each step with base branch-24.10:

  • distinct_indices
    • initialize data structures (cuco::static_set, reduction_results device_uvector) TIME = 5 ms
    • perform the reductions (reduce_by_row) TIME = 65.589 ms
    • return the output_indices (all the distinct indices based on the keep option) TIME = 4 ms
  • gather TIME = 15 ms

Below is the profile of base branch-24.10 (the green shaded region shows the gather runtime):

[profile screenshot: base]

Time taken for each step with insert_or_apply:

  • distinct_indices
    • initialize data structures TIME = 7.8 ms
    • perform the reductions (insert_or_apply) (thrust::for_each with set_ref) TIME = 55.7 ms
    • return the output_indices (retrieve_all) TIME = 11.8 ms
  • gather TIME = 53.364 ms

Below is the profile with insert_or_apply (the green shaded region shows the gather runtime):

[profile screenshot: insert_or_apply_gather]

In summary, the gather algorithm is the main bottleneck.
It appears that gather is efficient only when the row indices are already sorted (which makes sense, since sorted indices mean less data movement).

I tested a variant of the insert_or_apply implementation that sorts the indices before returning from the distinct algorithm, and performance improved significantly because gather now takes very little time.
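The sorting variant can be sketched host-side (a hypothetical analogue — in the actual code the device_uvector of indices would be sorted on device, e.g. with a thrust sort, before the gather):

```cpp
#include <algorithm>
#include <vector>

// Sort the distinct indices before gathering, so the gather reads rows
// in increasing address order (better locality, less data movement).
// The extra sort is the "red region" overhead in the profile.
std::vector<int> sorted_gather(const std::vector<int>& rows,
                               std::vector<int> indices) {
    std::sort(indices.begin(), indices.end());  // extra sort step
    std::vector<int> out;
    out.reserve(indices.size());
    for (int i : indices) out.push_back(rows[i]);  // near-sequential reads
    return out;
}
```

The trade-off: the sort pays off when the gather dominates (large inputs), but its fixed cost can hurt small inputs.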

Below is the profile with insert_or_apply plus sorting of the distinct indices (the green shaded region shows the gather runtime; red shows the sorting overhead):

[profile screenshot: sort_insert_or_apply_gather]

@srinivasyadav18 (Contributor Author)

In summary, the branch-24.10 implementation is already efficient, as it performs the reduction while also maintaining the ordering of the reduction values (they are row indices).

The insert_or_apply kernel improves performance by more than 15%, but because its output indices are unordered, the gather algorithm that post-processes them regresses by more than 50%, causing an overall performance regression in the pipeline.

Sorting helps for large inputs, but causes a regression for small inputs (see the gist for numbers).

@GregoryKimball (Contributor)

Thank you @srinivasyadav18 for studying the 1B cardinality case. The gist you posted shows the benefit.
[benchmark plot from the gist]

It would be worth profiling and checking the 100K cardinality case, but I don't think moderate-cardinality performance is a blocker.

@ttnghia (Contributor)

ttnghia commented Aug 6, 2024

Instead of sorting, how about using scatter-gather approach?

  • Initialize a gather map with all 0
  • Scatter the output indices into the gather map, marking the position of these indices as 1
  • Gather using that gather map.
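A host-side sketch of this scatter-gather idea (hypothetical analogue — on device the scatter and the masked copy would be thrust calls such as scatter and copy_if with a stencil):

```cpp
#include <cstddef>
#include <vector>

// 1) Initialize a gather map (mask) of all 0s.
// 2) Scatter 1s at the output indices.
// 3) Gather the rows whose mask is 1, walking the mask in index order —
//    so the output comes back in row order without any explicit sort.
std::vector<int> scatter_gather(const std::vector<int>& rows,
                                const std::vector<int>& output_indices) {
    std::vector<int> mask(rows.size(), 0);     // step 1
    for (int i : output_indices) mask[i] = 1;  // step 2 (scatter)
    std::vector<int> out;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        if (mask[i]) out.push_back(rows[i]);   // step 3 (gather)
    }
    return out;
}
```

Compared with sorting the indices, this trades the sort for one extra full pass over the mask plus the mask allocation.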

I'm starting to review this PR. Sorry for jumping late.

Comment on lines +78 to +83
auto const map_end = thrust::copy_if(
rmm::exec_policy_nosync(stream),
output_iter,
output_iter + num_distinct_keys,
output_indices.begin(),
cuda::proclaim_return_type<bool>([] __device__(auto const idx) { return idx != -1; }));
Contributor

@ttnghia ttnghia Aug 6, 2024

Please add back all the explanatory comments (and add new comments for the new code). Although I wrote the original code, I now hardly understand what it is doing without the comments.

Comment on lines +52 to +54
struct plus_op {
template <cuda::thread_scope Scope>
__device__ void operator()(cuda::atomic_ref<size_type, Scope> ref, size_type const val)
Contributor

Please do not put __device__ code in an hpp file.

@@ -23,11 +23,13 @@
#include <rmm/device_uvector.hpp>
#include <rmm/resource_ref.hpp>

#include <cuco/static_map.cuh>
Contributor

Now that it includes the cuco header, this file should be renamed to _helper.cuh. Otherwise, please move this header and its relevant code into _helper.cuh and keep this file host-only.

rapids-bot bot pushed a commit that referenced this pull request Aug 8, 2024
This PR adopts some work from @srinivasyadav18 with additional modifications. This is meant to complement #16484.

Authors:
  - Bradley Dice (https://github.com/bdice)
  - Srinivas Yadav (https://github.com/srinivasyadav18)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Srinivas Yadav (https://github.com/srinivasyadav18)

URL: #16497
@GregoryKimball (Contributor)

GregoryKimball commented Aug 16, 2024

We hope that #15700 will improve the overall performance outlook of static_map in distinct.

[benchmark plot]

Labels
  improvement: Improvement / enhancement to an existing function
  libcudf: Affects libcudf (C++/CUDA) code
  non-breaking: Non-breaking change
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

5 participants