[CHORE] Swordfish specific test fixtures #3164
Conversation
CodSpeed Performance Report: merging #3164 will not alter performance.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:
@@            Coverage Diff             @@
##              main    #3164      +/-   ##
==========================================
+ Coverage   79.01%   79.02%   +0.01%
==========================================
  Files         634      634
  Lines       76942    76962      +20
==========================================
+ Hits        60792    60823      +31
+ Misses     16150    16139      -11
tests/dataframe/test_sort.py
Outdated
@@ -13,6 +14,12 @@
@pytest.fixture(scope="function", autouse=True)
I think we should preferably call the context manager explicitly in our tests, or pass the fixture into tests when we want to use it -- we've had issues in the past where autouse makes it confusing for people debugging our unit tests.
Makes sense, making the fixture explicit.
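For illustration, a minimal sketch of the explicit (non-autouse) pattern being suggested here; the fixture and test names are hypothetical, while execution_config_ctx is the context manager already used in this PR's fixtures.

```python
import pytest
import daft


@pytest.fixture(scope="function")
def small_morsels():
    # Explicit fixture: tests opt in by requesting it by name instead of autouse=True,
    # so it is obvious which tests depend on the morsel-size override.
    with daft.context.execution_config_ctx(default_morsel_size=1):
        yield


def test_sort_with_small_morsels(small_morsels):
    df = daft.from_pydict({"a": [3, 1, 2]})
    assert df.sort("a").to_pydict() == {"a": [1, 2, 3]}
```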
    &[aggregation.clone()],
    &aggregate_schema.into(),
    &group_by_with_pivot,
);
Isn't the two-stage agg only necessary for partitioned distributed computations? To my knowledge, we do this 2-stage thing so that we can perform a local agg (as an optimization to reduce data cardinality) before the shuffle, and then perform the second local agg + final project to correctly execute the operation.
For local computations on swordfish, would it not be more performant to just perform a fully-materializing single stage aggregation before the pivot?
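For context, a rough sketch of the two patterns being compared, written as plain Python over in-memory partitions (the names and data are illustrative, not Daft's actual operators):

```python
from collections import defaultdict

# Hypothetical partitions: lists of (group_key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5)],
]


def local_sum(rows):
    # Partial agg over one chunk of rows, reducing cardinality before any shuffle.
    acc = defaultdict(int)
    for key, value in rows:
        acc[key] += value
    return acc


def merge(partials):
    # Final agg over the partial results.
    acc = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            acc[key] += value
    return acc


# Two-stage: partial agg per partition, then a final merge.
two_stage = merge(local_sum(p) for p in partitions)

# Single-stage: fully materialize all rows, then aggregate once.
single_stage = local_sum([row for p in partitions for row in p])

assert two_stage == single_stage  # {"a": 8, "b": 7}
```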
It was actually faster to still use the two-stage agg pattern on swordfish vs. fully materializing then aggregating (at least for the TPC-H questions), so I kept it around.
But the simpler and far more effective way to do it locally would be a .fold-like pattern. This is something I plan on doing.
I probably shouldn't use the two-stage agg for pivot then; I'll remove it and keep it simple.
Modified it to be a fully-materializing single-stage agg.
tests/conftest.py
Outdated
def with_morsel_size(request):
    morsel_size = request.param
    with daft.context.execution_config_ctx(default_morsel_size=morsel_size):
        yield
nit: maybe yield the morsel size
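A minimal sketch of that suggestion, assuming the [1, None] parametrization mentioned in the PR description (the scope shown here is illustrative):

```python
import pytest
import daft


@pytest.fixture(scope="function", params=[1, None])
def with_morsel_size(request):
    morsel_size = request.param
    with daft.context.execution_config_ctx(default_morsel_size=morsel_size):
        # Yield the configured size so tests can reference it directly
        # instead of reaching back into request.param.
        yield morsel_size
```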
@pytest.fixture(
    scope="module",
    params=[1, 2] if daft.context.get_context().daft_execution_config.enable_native_executor is False else [1],
)
LGTM I think, but in the future it should really just be if daft.context.get_context().runner == "ray", since partitioning only makes sense for the Ray runner.
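A hedged sketch of that suggestion; the fixture name and params are hypothetical, and the runner check is taken verbatim from the comment above:

```python
import pytest
import daft


@pytest.fixture(
    scope="module",
    # Only exercise multiple partitions on the Ray runner, where partitioning matters.
    params=[1, 2] if daft.context.get_context().runner == "ray" else [1],
)
def n_partitions(request):
    return request.param
```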
One last comment -- if we're concerned with verbosity of passing
This PR sets up a few swordfish-related test fixtures, specifically:

- default_morsel_size = [1, None] for dataframe tests that do any into/repartitioning. This is to make sure that the operator parallelism is working.
- test_iter.py: the tests assert df.sort(col) == expected, but there are other columns in df that may not be sorted, and this won't be enough if morsel_size = 1. This isn't a problem with swordfish but with the test, where the sort should actually involve more columns (see the sketch after this description).

Big note: There was a problem with pivot not getting applied correctly. This is because a dataframe pivot operation comprises an agg plus the actual pivoting, but previously the pivot was implemented as an intermediate operator, and the results of the agg were getting buffered. For the pivot to work, it has to receive all values with the same group_by keys. This PR simplifies Pivot into a BlockingSink, so all the work is done in there.
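For the test_iter.py point above, a minimal sketch of making the sort assertion deterministic by sorting on a tie-breaking column as well (column names and data are illustrative):

```python
import daft
from daft import col

df = daft.from_pydict({"a": [1, 2, 1], "b": [20, 30, 10]})

# Sorting on "a" alone leaves the order of ties in "b" unspecified,
# so the assertion also sorts on "b" to pin down a single expected result.
result = df.sort([col("a"), col("b")]).to_pydict()
assert result == {"a": [1, 1, 2], "b": [10, 20, 30]}
```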