Generate group offsets from element labels #11017

Merged
merged 67 commits into from
Jun 3, 2022

Conversation

ttnghia
Contributor

@ttnghia ttnghia commented Jun 1, 2022

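Given an array of integer labels, where each label identifies the list that the corresponding element belongs to, we want to generate an array of offsets from which a lists column can be created. Labels that never appear in the input correspond to empty lists, so their offsets repeat the preceding value.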

For example:

input_labels = [ 0, 0, 0, 0, 1, 1, 4, 4, 4, 4 ]
output = [ 0, 4, 6, 6, 6, 10 ]
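
The example above can be reproduced with a minimal CPU sketch in Python. This is not the libcudf implementation from this PR; it only illustrates the idea of counting the elements per label and taking a running sum. The function name `offsets_from_labels` and the `num_groups` parameter are assumptions made for illustration. The number of groups has to be supplied separately, because trailing empty groups cannot be inferred from the labels alone.

```python
from itertools import accumulate


def offsets_from_labels(labels, num_groups):
    """Hypothetical helper: build list offsets from per-element labels.

    Assumes every label is an integer in [0, num_groups) and that the
    labels are in non-decreasing order, as in the example above.
    """
    # Count how many elements carry each label.
    counts = [0] * num_groups
    for label in labels:
        counts[label] += 1
    # Offsets are the running sum of the counts, starting at 0.
    return [0] + list(accumulate(counts))


input_labels = [0, 0, 0, 0, 1, 1, 4, 4, 4, 4]
output = offsets_from_labels(input_labels, num_groups=5)
print(output)
```

Labels 2 and 3 do not occur in the input, so their counts are zero and the offset value 6 is repeated, which marks two empty lists.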

This is basically the reverse operation of #10945.
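
For contrast, the reverse direction (offsets to labels) can be sketched as follows. This assumes that #10945 produces one label per element from an offsets array, which is inferred from the sentence above rather than from that PR itself; the name `labels_from_offsets` is hypothetical.

```python
def labels_from_offsets(offsets):
    """Hypothetical helper: expand list offsets into per-element labels.

    Assumes `offsets` is non-decreasing and starts at 0.
    """
    labels = []
    # Each consecutive pair of offsets delimits one list.
    for list_index, (begin, end) in enumerate(zip(offsets, offsets[1:])):
        # Every element of list `list_index` gets that index as its label.
        labels.extend([list_index] * (end - begin))
    return labels


print(labels_from_offsets([0, 4, 6, 6, 6, 10]))
```

Applied to the offsets from the example, this yields the original `input_labels`; the empty lists at indices 2 and 3 contribute no labels.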

Closes #10955.

Signed-off-by: Nghia Truong <[email protected]>
# Conflicts:
#	cpp/include/cudf/detail/labeling/label_segments.cuh
#	cpp/src/lists/drop_list_duplicates.cu
@ttnghia ttnghia added the feature request, 3 - Ready for Review, libcudf blocker, libcudf, Spark, and non-breaking labels Jun 1, 2022
@ttnghia ttnghia requested a review from a team as a code owner June 1, 2022 18:05
@ttnghia ttnghia self-assigned this Jun 1, 2022
@ttnghia ttnghia mentioned this pull request Jun 3, 2022
@ttnghia
Contributor Author

ttnghia commented Jun 3, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit a042be6 into rapidsai:branch-22.08 Jun 3, 2022
@ttnghia ttnghia deleted the offsets_from_labels branch June 3, 2022 21:43
rapids-bot bot pushed a commit that referenced this pull request Jul 26, 2022
This PR adds the following APIs for set operations:
 * `lists::have_overlap`
 * `lists::intersect_distinct`
 * `lists::union_distinct`
 * `lists::difference_distinct`

### Naming Convention
Except for the first API (`lists::have_overlap`), which returns a boolean column, the `_distinct` suffix on the remaining APIs denotes that their results are lists columns in which every list row has been post-processed to remove duplicates. As such, their results are effectively "set" columns in which each row is a "set" of distinct elements.
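
The row-wise semantics described above can be illustrated with a plain-Python sketch that models a lists column as a list of Python lists. The function names with the `_sketch` suffix are hypothetical and are not the libcudf signatures. The sketch ignores nulls and NaNs, and it sorts each result row only to make the output deterministic; no claim is made here about the element order that libcudf produces.

```python
def have_overlap_sketch(lhs, rhs):
    # One boolean per row: do the two lists share at least one element?
    return [not set(a).isdisjoint(b) for a, b in zip(lhs, rhs)]


def intersect_distinct_sketch(lhs, rhs):
    # Distinct elements present in both lists of each row.
    return [sorted(set(a) & set(b)) for a, b in zip(lhs, rhs)]


def union_distinct_sketch(lhs, rhs):
    # Distinct elements present in either list of each row.
    return [sorted(set(a) | set(b)) for a, b in zip(lhs, rhs)]


def difference_distinct_sketch(lhs, rhs):
    # Distinct elements of the left list that are absent from the right list.
    return [sorted(set(a) - set(b)) for a, b in zip(lhs, rhs)]


lhs = [[1, 2, 2, 3], [4, 4], []]
rhs = [[2, 3, 3], [5], [1]]
print(have_overlap_sketch(lhs, rhs))
print(intersect_distinct_sketch(lhs, rhs))
print(union_distinct_sketch(lhs, rhs))
print(difference_distinct_sketch(lhs, rhs))
```

Note that duplicates in the inputs (such as the repeated `2` and `4`) never appear in the `_distinct` results, which is what makes each output row behave like a set.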

---

Depends on:
 * #10945
 * #11017
 * NVIDIA/cuCollections#175
 * #11052
 * #11118
 * #11100
 * #11149

Closes #10409.

Authors:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11043
Successfully merging this pull request may close these issues.

[FEA] Generate offsets from element labels