Forward-merge branch-23.06 to branch-23.08 #13416
Merged
Scan-based groupbys are massaged back into pandas (original dataframe) order by a post-processing step. Previously, this step produced incorrect results if the grouping keys contained nulls (or NaNs). In this situation dropna=True causes libcudf to produce an output table that is smaller than the input frame. To mimic pandas we need to expand this output to the original frame size, inserting nulls in the missing rows and reordering correctly. Furthermore, the previous reordering code had an out-of-bounds memory access when there were null keys, since we were asking to group a column of the same length as the result, but the grouping object expects columns of the same length as the original input (which is larger with dropna=True and null keys). To fix these issues, compute the reordering on a column of appropriate size, and, if dropna is true and any of the key columns have nulls, go down a more expensive reordering path that inserts nulls correctly by reindexing the result. - Closes #13349 - Closes #12055 Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #13389
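The expand-and-reorder step described above can be illustrated with a small pure-Python sketch. It is not the cuDF implementation: `grouped_cumsum` is a hypothetical helper, `None` stands in for a null key, and a cumulative sum stands in for any scan aggregation. It only assumes that the grouped result comes back in group order and is shorter than the input when null keys are dropped.

```python
def grouped_cumsum(keys, values):
    """Cumulative sum per group with dropna=True semantics (illustrative only)."""
    # Rows with a null key are dropped from the grouped computation.
    kept = [i for i, k in enumerate(keys) if k is not None]

    # Stable sort of the surviving row positions by key: this is the order
    # in which the (hypothetical) grouped result is produced.
    order = sorted(kept, key=lambda i: keys[i])

    # Scan within each group; `grouped` has len(kept) entries, which is
    # fewer than len(keys) whenever there are null keys.
    grouped = []
    prev_key, total = object(), 0
    for i in order:
        if keys[i] != prev_key:
            prev_key, total = keys[i], 0
        total += values[i]
        grouped.append(total)

    # Expand to the original frame size: scatter each result back to its
    # original row position, leaving None where the key was null.
    out = [None] * len(keys)
    for pos, val in zip(order, grouped):
        out[pos] = val
    return out


keys = ["a", None, "b", "a", None, "b"]
values = [1, 2, 3, 4, 5, 6]
result = grouped_cumsum(keys, values)
print(result)
```

The key point is that the scatter indexes into a buffer sized to the original input, not to the shorter grouped result.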
Putting this up as an optimization for ColumnVector.EventHandler. As it stands today, code in the spark-rapids plugin that wants to use this has to maintain Java objects that encapsulate state about the columns, covering reference counting and the bag of `ColumnVector`s that are spillable at any given time. If we pass the `ColumnVector` instance to the event handler, one of these objects can be removed, and we can simplify the implementation in the plugin. I am putting this up as a draft while my local tests run, but I think it should be fairly straightforward. Authors: - Alessandro Bellina (https://github.com/abellina) Approvers: - Jason Lowe (https://github.com/jlowe) - Robert (Bobby) Evans (https://github.com/revans2) URL: #13386
Closes #13393 Authors: - Ashwin Srinath (https://github.com/shwina) Approvers: - https://github.com/brandon-b-miller - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #13394
This PR supersedes part of #11656. It adds a public API for `cudf::stable_distinct`, mirroring that of `cudf::distinct` but preserving the order of the input table. The `stable_distinct` implementation was refactored to use `apply_boolean_mask`, which reduces the number of kernels needed. I also added tests/benchmarks for `cudf::stable_distinct`. I split out the C++ changes from #11656 because that PR size was getting too large. Also these C++ changes are non-breaking, but the Python changes are breaking (and depend on these C++ changes), so separating into a new PR seemed like a good idea. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Nghia Truong (https://github.com/ttnghia) - Robert Maynard (https://github.com/robertmaynard) URL: #13392
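As a rough illustration of the mask-based approach mentioned above, the following pure-Python sketch builds a boolean mask marking the first occurrence of each row and then applies it, so surviving rows keep their input order. The function name mirrors `cudf::stable_distinct` only for readability; the keep-first policy and the use of a hash set are assumptions of this sketch, not a description of the libcudf kernels.

```python
def stable_distinct(rows):
    """Return the distinct rows of `rows`, preserving input order (sketch)."""
    seen = set()
    mask = []
    for row in rows:
        # True only for the first occurrence of each row (assumed policy).
        mask.append(row not in seen)
        seen.add(row)
    # Applying the boolean mask keeps rows in their original relative order.
    return [row for row, keep in zip(rows, mask) if keep]


rows = [(3, "x"), (1, "y"), (3, "x"), (2, "z"), (1, "y")]
distinct_rows = stable_distinct(rows)
print(distinct_rows)
```

Because the mask is applied positionally, no sort is needed to restore the input order afterwards.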
Closes #13397 Authors: - Ashwin Srinath (https://github.com/shwina) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Bradley Dice (https://github.com/bdice) URL: #13398
This PR drops two dependencies from the cudf conda recipe. - `numba` is only a `run` dependency and should not be listed in `host`. - `fastavro` is only a test dependency. It is listed in `dependencies.yaml` but does not need to be listed in `meta.yaml` as a run dependency. This is already listed as a testing dependency in `pyproject.toml`. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Ray Douglass (https://github.com/raydouglass) URL: #13406
Log whether kvikIO's compatibility mode is on for a given file input/output. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Nghia Truong (https://github.com/ttnghia) - Mike Wilson (https://github.com/hyperbolic2346) URL: #13363
Depends on #13392. Closes #11638 Closes #12449 Closes #11230 Closes #5286 This PR re-implements Python's `DataFrame.drop_duplicates` / `Series.drop_duplicates` to use the `stable_distinct` algorithm. This fixes a large number of correctness issues (results are now ordered the same way as in pandas) and also improves performance by eliminating a sorting step. As a consequence of changing the behavior of `drop_duplicates`, a lot of refactoring was needed. The `drop_duplicates` function was used to implement `unique()`, which cascaded into changes for several groupby functions, one-hot encoding, `np.unique` array function dispatches, and more. Those downstream functions relied on the sorting order of `drop_duplicates` and `unique`, which is _not_ promised by pandas. Authors: - https://github.com/brandon-b-miller - Bradley Dice (https://github.com/bdice) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Matthew Roeschke (https://github.com/mroeschke) - Nghia Truong (https://github.com/ttnghia) URL: #11656
SUCCESS - forward-merge complete.
github-actions bot added the CMake (CMake build issue), conda, Java (Affects Java cuDF API.), Python (Affects Python cuDF API.), and libcudf (Affects libcudf (C++/CUDA) code.) labels on May 23, 2023
Forward-merge triggered by push to branch-23.06 that creates a PR to keep branch-23.08 up-to-date. If this PR is unable to be immediately merged due to conflicts, it will remain open for the team to manually merge.