[Datasets] Add vectorized global and grouped aggregations. #23478
Conversation
lg! I'll take a closer look tomorrow.
looking good!
    if self.num_rows() == 0:
        return None
why None instead of 0?
We need to be able to distinguish a non-empty table that is all nulls (0) from an empty table (None) when ignore_nulls=False, since we want to propagate that None in the latter case. See the Mean and Std agg function definitions.
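For illustration only, here is a minimal sketch of the convention under discussion; the helper name block_sum and its exact null semantics are hypothetical, not the PR's code:

```python
from typing import List, Optional


def block_sum(values: List[Optional[float]], ignore_nulls: bool) -> Optional[float]:
    # Hypothetical sketch of the return convention being discussed: an empty
    # block returns None so that Mean/Std can tell "no rows at all" apart from
    # a non-empty block whose values are all null or sum to 0.
    if len(values) == 0:
        return None
    if not ignore_nulls and any(v is None for v in values):
        # Nulls poison the aggregate when they are not being ignored.
        return None
    return sum(v for v in values if v is not None)
```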
    # column will result in an all-None column of object type, which will raise
    # a type error when attempting to do most binary operations. We explicitly
    # check for this type failure here so we can properly propagate a null.
    if np.issubdtype(col.dtype, np.object_) and col.isnull().all():
Could we just check beforehand?
I'm purposefully not checking beforehand since this check is expensive: col.isnull().all() creates a full copy of the column and then traverses it (2 scans). By only checking on such a TypeError, we keep this expensive check off the common critical path.
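A minimal sketch of that pattern (the helper add_columns is hypothetical, not the PR's code), where the expensive check only runs in the failure path:

```python
import numpy as np
import pandas as pd


def add_columns(col: pd.Series, other: pd.Series) -> pd.Series:
    # Try the vectorized binary op first; only inspect the column when it
    # fails, keeping the O(n) all-null check off the common critical path.
    try:
        return col + other
    except TypeError:
        # An all-None column materializes as object dtype, and most binary
        # ops on it raise TypeError; treat that case as an all-null result.
        if np.issubdtype(col.dtype, np.object_) and col.isnull().all():
            return pd.Series([None] * len(col), index=col.index)
        raise
```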
@jjyao I addressed your feedback, ready for another pass.
… to base Block arg.
… block accumulation.
@jjyao Ping on this!
Nice!
Can we have a representative perf test to measure the improvement? (Can be a follow-up after this PR.)
This PR adds support for vectorized global and grouped aggregations, porting the built-in aggregations to vectorized block aggregations for tabular datasets.
The AggregateFn abstraction is extended with an optional vectorized_aggregate function that performs a vectorized aggregation on a single block, allowing aggregations to opt in to vectorized block aggregation. AggregateFn also exposes a can_vectorize_for_block() API, which allows aggregations to opt in to vectorized block aggregation for only certain block types, e.g. only for Arrow and Pandas blocks. The built-in set of aggregations currently only opts in to vectorized block aggregation on tabular datasets, i.e. only for Arrow and Pandas blocks, since vectorized aggregation of simple blocks would amount to the accumulator loop with extra copying for each group slice (no zero-copy views are possible for Python lists).

For Arrow blocks, vectorized block aggregation is supported by creating zero-copy views of each group slice within each partition and applying the vectorized aggregation to these group slice views. This currently entails two scans of each block partition: one to determine the group view boundaries, and one to process each group. As a future optimization, we could eliminate this extra scan by gathering group boundaries while sorting and partitioning each block along the sample boundaries.
Checks
I've run scripts/format.sh to lint the changes in this PR.