Forward-merge branch-23.06 to branch-23.08 #13416

Merged 8 commits on May 23, 2023

GPUtester
Collaborator

Forward-merge triggered by push to branch-23.06 that creates a PR to keep branch-23.08 up-to-date. If this PR is unable to be immediately merged due to conflicts, it will remain open for the team to manually merge.

wence- and others added 8 commits May 19, 2023 14:25
Scan-based groupbys are massaged back into pandas (original dataframe)
order by a post-processing step. Previously, this did the wrong thing
if the grouping key contained null (or nan) keys. In this situation
dropna=True will cause libcudf to produce an output table that is
smaller than the input frame. To mimic pandas we need to expand this
output to the original frame size, inserting nulls in the missing rows
and reordering correctly.

Furthermore, the previous reordering code had an out-of-bounds memory
access when there were null keys: we asked the grouping object to group
a column of the same length as the result, but it expects columns with
the length of the original input (which is larger with dropna=True and
null keys).

To fix these issues, compute the reordering on a column of appropriate
size, and, if dropna is true and any of the key columns have nulls, go
down a more expensive reordering path that inserts nulls correctly by
reindexing the result.
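
To make the reindexing step concrete, here is a minimal pure-Python sketch of the idea, not the actual cuDF implementation. The function name `scan_then_reorder` and the list-based data layout are hypothetical; the grouped cumulative sum stands in for any scan aggregation.

```python
def scan_then_reorder(keys, values):
    """Grouped cumulative sum that mimics dropna=True and original row order.

    Hypothetical illustration only: plain lists stand in for columns.
    """
    # Rows with a null key are dropped from the grouped result,
    # so the grouped table is smaller than the input.
    kept = [i for i, k in enumerate(keys) if k is not None]
    # A stable sort by key puts each group's rows together.
    grouped_order = sorted(kept, key=lambda i: keys[i])

    # Scan (cumulative sum) within each group, in grouped order.
    grouped_result = []
    prev_key = object()  # sentinel that compares unequal to every key
    total = 0
    for i in grouped_order:
        if keys[i] != prev_key:
            total = 0
            prev_key = keys[i]
        total += values[i]
        grouped_result.append(total)

    # Reindex: expand to the original length, leaving None (null)
    # in the rows whose key was null.
    out = [None] * len(keys)
    for pos, i in enumerate(grouped_order):
        out[i] = grouped_result[pos]
    return out


result = scan_then_reorder(["a", None, "b", "a", "b"], [1, 2, 3, 4, 5])
print(result)  # [1, None, 3, 5, 8]
```

The key point is that the output list is allocated at the size of the original input, not the size of the grouped result, which is the size mismatch the old reordering code got wrong.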

- Closes #13349
- Closes #12055

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #13389
Putting this up as an optimization for ColumnVector.EventHandler. 

As it stands today, code in the spark-rapids plugin that wants to use this has to maintain Java objects that encapsulate state about the columns, both for reference counting and for tracking the bag of `ColumnVector`s that are spillable at any given time. If we pass the `ColumnVector` instance to the event handler, one of these objects can be removed, and we can simplify the implementation in the plugin. I am putting this up as a draft while my local tests pass, but I think it should be fairly straightforward.

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #13386
This PR supersedes part of #11656.

It adds a public API for `cudf::stable_distinct`, mirroring that of `cudf::distinct` but preserving the order of the input table. The `stable_distinct` implementation was refactored to use `apply_boolean_mask`, which reduces the number of kernels needed. I also added tests/benchmarks for `cudf::stable_distinct`.
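
As a rough illustration of the boolean-mask approach, the following pure-Python sketch builds a keep/drop mask and then applies it in one pass. It is not the libcudf code: the function name and the keep-first policy are assumptions made for the example, and rows are modelled as hashable tuples.

```python
def stable_distinct_sketch(rows):
    """Return distinct rows, preserving input order (keeps first occurrence).

    Hypothetical sketch: rows must be hashable, e.g. tuples.
    """
    seen = set()
    mask = []
    for row in rows:
        # True only the first time a row value is encountered.
        mask.append(row not in seen)
        seen.add(row)
    # Applying the mask keeps the surviving rows in their input order.
    return [row for row, keep in zip(rows, mask) if keep]


rows = [(3, "x"), (1, "y"), (3, "x"), (2, "z"), (1, "y")]
print(stable_distinct_sketch(rows))  # [(3, 'x'), (1, 'y'), (2, 'z')]
```

Because the mask is applied to the rows in their original positions, no sort or reordering step is needed to recover input order.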

I split out the C++ changes from #11656 because that PR size was getting too large. Also these C++ changes are non-breaking, but the Python changes are breaking (and depend on these C++ changes), so separating into a new PR seemed like a good idea.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #13392
This PR drops two dependencies from the cudf conda recipe.

- `numba` is only a `run` dependency and should not be listed in `host`.
- `fastavro` is only a test dependency. It is listed in `dependencies.yaml` but does not need to be listed in `meta.yaml` as a run dependency. This is already listed as a testing dependency in `pyproject.toml`.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Ray Douglass (https://github.com/raydouglass)

URL: #13406
Log whether kvikIO's compatibility mode is on for a given file input/output.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #13363
Depends on #13392.

Closes #11638
Closes #12449
Closes #11230
Closes #5286

This PR re-implements Python's `DataFrame.drop_duplicates` / `Series.drop_duplicates` to use the `stable_distinct` algorithm.

This fixes a large number of correctness issues (results are now ordered the same way as pandas) and also improves performance by eliminating a sorting step.

As a consequence of changing the behavior of `drop_duplicates`, a lot of refactoring was needed. The `drop_duplicates` function was used to implement `unique()`, which cascaded into changes for several groupby functions, one-hot encoding, `np.unique` array function dispatches, and more. Those downstream functions relied on the sorting order of `drop_duplicates` and `unique`, which is _not_ promised by pandas.
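
The ordering difference that affected those downstream functions can be shown with a small pure-Python sketch. The helper name `unique_in_order` is hypothetical; the point is only that a caller needing sorted output has to sort explicitly once uniqueness is computed in first-seen order.

```python
def unique_in_order(values):
    """Unique values in order of first appearance (hypothetical helper)."""
    # dict preserves insertion order, so duplicates collapse onto
    # the position of their first occurrence.
    return list(dict.fromkeys(values))


values = ["b", "a", "b", "c", "a"]
first_seen = unique_in_order(values)
sorted_unique = sorted(first_seen)  # explicit sort for callers that need it

print(first_seen)     # ['b', 'a', 'c']
print(sorted_unique)  # ['a', 'b', 'c']
```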

Authors:
  - https://github.com/brandon-b-miller
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Matthew Roeschke (https://github.com/mroeschke)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11656
GPUtester requested review from a team as code owners May 23, 2023 18:10
GPUtester merged commit c823dd3 into branch-23.08 May 23, 2023
@GPUtester
Collaborator Author

SUCCESS - forward-merge complete.

github-actions bot added the CMake, conda, Java, Python, and libcudf labels May 23, 2023
7 participants