Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support nan_equality in cudf::distinct #11118

Merged
merged 115 commits into from
Jun 23, 2022

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Jun 16, 2022

This adds nan_equality parameter to cudf::distinct, allowing to specify the desired behavior when dealing with floating-point data: NaN should be compared equally to other NaN or not.

Depends on #11052 (built on top of it).
Closes #11092.
This is a blocker for set-like operations (#11043) and also the last blocker for #11053.

# Conflicts:
#	cpp/benchmarks/stream_compaction/distinct.cpp
#	cpp/include/cudf/stream_compaction.hpp
#	cpp/src/dictionary/detail/concatenate.cu
#	cpp/src/dictionary/set_keys.cu
#	cpp/src/stream_compaction/distinct.cu
#	cpp/src/stream_compaction/distinct_reduce.cu
#	cpp/src/stream_compaction/distinct_reduce.cuh
#	cpp/src/transform/encode.cu
#	cpp/tests/stream_compaction/distinct_tests.cpp
@ttnghia ttnghia force-pushed the distinct_with_nans_equality branch from 97a6cb6 to be3b2fe Compare June 22, 2022 19:04
@github-actions github-actions bot removed the CMake CMake build issue label Jun 22, 2022
@ttnghia ttnghia added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress 0 - Blocked Cannot progress due to external reasons labels Jun 22, 2022
@ttnghia ttnghia marked this pull request as ready for review June 22, 2022 19:12
@ttnghia ttnghia requested a review from a team as a code owner June 22, 2022 19:12
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small suggestions. Otherwise this is a straightforward extension of the comparator API to forward the NaN policy through. Nice work.

cpp/include/cudf/detail/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/detail/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/src/stream_compaction/distinct.cu Outdated Show resolved Hide resolved
cpp/src/stream_compaction/distinct.cu Outdated Show resolved Hide resolved
cpp/src/stream_compaction/distinct_reduce.cu Outdated Show resolved Hide resolved
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. LGTM!

map.get_device_view(), key_hasher, key_equal, keep, reduction_results.begin()});
auto const row_comp = cudf::experimental::row::equality::self_comparator(preprocessed_input);

auto const reduce_by_row = [&](auto const value_comp) {
Copy link
Contributor

@bdice bdice Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Along the lines of what @PointKernel was suggesting -- one alternative I considered was making this lambda actually allocate and return the output vector, rather than binding in reduction_results as an output iterator and returning void. That felt like an implicit "output parameter" in the lambda. If the lambda were a real function, we'd avoid the output parameter and return the rmm::device_uvector directly from the function. I don't have strong feelings on this, so feel free to keep it as-is.

edit: Initially wrote "IIFE" where I meant lambda. Fixed.

Copy link
Contributor

@bdice bdice Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we did that, however, almost the entire body of hash_reduce_by_row becomes a lambda and the part at the end is just a nans_equal dispatcher. 😛

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 23, 2022

Thanks all. I really appreciate your help with polishing everything 🍬

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 23, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 80d7cc7 into rapidsai:branch-22.08 Jun 23, 2022
@ttnghia ttnghia deleted the distinct_with_nans_equality branch June 23, 2022 03:08
@ttnghia ttnghia mentioned this pull request Jun 23, 2022
rapids-bot bot pushed a commit that referenced this pull request Jul 26, 2022
This PR adds the following APIs for set operations:
 * `lists::have_overlap`
 * `lists::intersect_distinct`
 * `lists::union_distinct`
 * `lists::difference_distinct`

### Name Convention
Except for the first API (`lists::have_overlap`) that returns a boolean column, the suffix `_distinct` of the rest APIs denotes that their results will be lists columns in which all list rows have been post-processed to remove duplicates. As such, their results are actually "set" columns in which each row is a "set" of distinct elements.

---

Depends on:
 * #10945
 * #11017
 * NVIDIA/cuCollections#175
 * #11052
 * #11118
 * #11100
 * #11149

Closes #10409.

Authors:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11043
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support nan_equality in cudf::distinct
3 participants