[Datasets] Multi-aggregations [2/3]: Add groupby multi-column/multi-lambda aggregation #20074
Conversation
for i in range(4, 8):
    assert values[i]["value"] == arr2[i - 4]
np.testing.assert_array_equal(
    values, np.expand_dims(np.concatenate((arr1, arr2)), axis=1))
A happy accident of adding a `__len__()` definition to `ArrowRow` is that consuming rows from a NumPy Dataset now transparently removes the single-column table nesting.
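The effect described above can be sketched without Ray. The toy `Row` class below is a hypothetical stand-in for `ArrowRow`, not Ray's actual implementation: once a class exposes both `__getitem__` and `__len__`, NumPy treats instances as sequences and recurses into them, so a list of single-column rows converts into an `(n, 1)` numeric array rather than an object array of wrappers.

```python
import numpy as np


# Hypothetical stand-in for a single-column row wrapper (not Ray's ArrowRow).
# With __getitem__ alone, NumPy sees opaque objects; adding __len__ completes
# the sequence protocol, so NumPy unwraps each row while building the array.
class Row:
    def __init__(self, values):
        self._values = values

    def __getitem__(self, i):
        return self._values[i]

    def __len__(self):
        return len(self._values)


rows = [Row([1]), Row([2]), Row([3])]
arr = np.array(rows)  # each length-1 row is unwrapped -> shape (3, 1)
np.testing.assert_array_equal(
    arr, np.expand_dims(np.array([1, 2, 3]), axis=1))
```

This is why the rewritten test above can compare the consumed rows directly against `np.expand_dims(np.concatenate((arr1, arr2)), axis=1)`.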
print(f"Seeding RNG for test_groupby_arrow_multicolumn with: {seed}")
random.seed(seed)
xs = list(range(100))
random.shuffle(xs)
For bonus randomness I'd also `.repartition(random(0, 100))` to induce random partitioning.
Otherwise every partition would have 1 element only.
I opted to go with a `[1, 10, 100]` partitioning set; I think this gives us the best tradeoff of partitioning edge-case coverage against test time.
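The invariant this test exercises can be sketched in plain Python (no Ray; `partition` and `groupby_sum` below are hypothetical helpers, not Ray APIs): a two-phase groupby aggregation should produce the same result regardless of how the rows are split into blocks. With 100 rows, the `[1, 10, 100]` set covers one big block, moderately sized blocks, and exactly one row per block.

```python
import random
from collections import defaultdict


def partition(xs, num_parts):
    """Round-robin split of xs into num_parts blocks (blocks may be empty)."""
    blocks = [[] for _ in range(num_parts)]
    for i, x in enumerate(xs):
        blocks[i % num_parts].append(x)
    return blocks


def groupby_sum(blocks):
    """Two-phase aggregation: partial sums per block, then a global combine."""
    totals = defaultdict(int)
    for block in blocks:
        partial = defaultdict(int)
        for x in block:
            partial[x % 3] += x  # group key: x % 3
        for k, v in partial.items():
            totals[k] += v
    return dict(totals)


random.seed(1234)
xs = list(range(100))
random.shuffle(xs)

expected = groupby_sum([xs])  # single block serves as ground truth
for num_parts in [1, 10, 100]:
    # Result must not depend on the partitioning; num_parts=100 is the
    # one-element-per-partition edge case discussed above.
    assert groupby_sum(partition(xs, num_parts)) == expected
```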
…egations for a multi-column aggregation into a utility, shared by both Dataset and GroupedDataset.
…repartitioning." This reverts commit 99438ab.
I think the empty block handling shouldn't be done in repartition.
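A minimal sketch of the driveby fix under discussion (hypothetical code, not Ray's actual `repartition` internals): when `num_partitions > num_rows`, the empty tail partitions must be built with the same block type as the input, otherwise downstream operations that expect one block format receive another.

```python
def repartition(rows, num_partitions, empty_block_factory=list):
    """Split rows into num_partitions blocks, padding with empty blocks.

    Hypothetical sketch of the fix described in the PR: the empty blocks
    are created via empty_block_factory so they match the input block type
    (previously an Arrow block was emitted for a simple-block dataset).
    """
    blocks = []
    base, extra = divmod(len(rows), num_partitions)
    start = 0
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)
        if size == 0:
            # num_partitions > num_rows: emit an empty block of the SAME type.
            blocks.append(empty_block_factory())
        else:
            blocks.append(list(rows[start:start + size]))
            start += size
    return blocks


blocks = repartition([1, 2, 3], num_partitions=5)
assert len(blocks) == 5
assert all(isinstance(b, list) for b in blocks)  # no mixed block types
assert sum(blocks, []) == [1, 2, 3]              # no rows lost
```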
@ericl Could you take another pass?
Datasets tests passed and the other Python failures are unrelated; I think this is ready to merge. cc @ericl
As a stacked follow-up to #20044 (only the last commit contains incremental changes from original PR), this PR adds multi-column/multi-lambda aggregations, making it much easier to express "do some aggregation on multiple columns".
Mean on 3 columns - old API
Mean on 3 columns - new API
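The example snippets for these two headings are not included here. As a hedged reconstruction from the PR description, the old API required one explicit aggregation per column, roughly `ds.groupby("A").aggregate(Mean("B"), Mean("C"), Mean("D"))`, while the new API accepts a list of columns, roughly `ds.groupby("A").mean(["B", "C", "D"])` (column names hypothetical, calls not verified against the repo). The runnable sketch below pins down the intended semantics, one mean per (group, column) pair, in plain Python:

```python
from collections import defaultdict

# Toy rows standing in for a tabular Dataset; column names are hypothetical.
rows = [
    {"A": 0, "B": 1.0, "C": 2.0, "D": 3.0},
    {"A": 0, "B": 3.0, "C": 4.0, "D": 5.0},
    {"A": 1, "B": 5.0, "C": 6.0, "D": 7.0},
]


def groupby_mean(rows, key, on):
    """Compute, per group, the mean of every column listed in `on`."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
        for col in on:
            sums[row[key]][col] += row[col]
    return {
        k: {col: sums[k][col] / counts[k] for col in on}
        for k in sums
    }


result = groupby_mean(rows, key="A", on=["B", "C", "D"])
assert result[0] == {"B": 2.0, "C": 3.0, "D": 4.0}
assert result[1] == {"B": 5.0, "C": 6.0, "D": 7.0}
```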
This PR also adds thorough checking of the `on` aggregation argument.

Drivebys

- `.repartition()` wasn't properly handling the `num_partitions > num_rows` case: for a simple `Dataset` with simple blocks, the generated empty blocks were Arrow blocks instead of simple blocks, which causes most downstream operations to break. We explicitly handle the empty block case in `.repartition()`. We should find a way to fix this in general, so I opened an issue for that effort.

Checks

- [x] I've run `scripts/format.sh` to lint the changes in this PR.