Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade Datafusion 40 #771

Merged
merged 42 commits into from
Jul 31, 2024
Merged

Upgrade Datafusion 40 #771

merged 42 commits into from
Jul 31, 2024

Conversation

Michael-J-Ward
Copy link
Contributor

@Michael-J-Ward Michael-J-Ward commented Jul 24, 2024

Part of #776.
Closes #778.
Closes #774.

What changes are included in this PR?

Release notes for upstream datafusion v40

Upstream datafusion continues its migration from built-in aggregate functions to User Defined Aggregate Functions. The APIs around these functions have been in serious flux, so the focus of this PR is to upgrade the deps and get all existing tests to pass.

Our regr_* func wrappers are now properly tested.

And aliases from functions.rs have been removed, leaning on our python wrappers.

@Michael-J-Ward Michael-J-Ward marked this pull request as draft July 24, 2024 19:32
@Michael-J-Ward Michael-J-Ward marked this pull request as ready for review July 24, 2024 23:20
@Michael-J-Ward Michael-J-Ward marked this pull request as draft July 25, 2024 02:31
This required trait method was added upstream [0] and recommends to simply forward to `static_name`.

[0]: apache/datafusion#10266
Upstream signatures were changed for the new new `AggregateBuilder` api [0].

This simply gets the code to work. We should better incorporate that API into `datafusion-python`.

[0] apache/datafusion#10560
Builtin Count was removed upstream.

TBD whether we want to re-implement `count_star` with new API.

Ref: apache/datafusion#10893
… UDAF

Ref: approx_distinct apache/datafusion#10851
Ref: approx_median apache/datafusion#10840
Ref: approx_percentile_cont and _with_weight apache/datafusion#10917
The python wrapper now provides stddev_samp alias.
The functions now take a single expression instead of a Vec<_>.

Ref: apache/datafusion#10930
`approx_percentile_cont` is now returning a DoubleArray instead of an IntArray.

This may be a bug upstream; it requires further investigation.
`first_value` and `last_value` are currently failing and marked as xfail.
The behavior of `first_value` and `last_value` UDAFs currently does not match the built-in behavior.
This allowed me to remove `marks=pytest.xfail` from the window tests.
@Michael-J-Ward Michael-J-Ward marked this pull request as ready for review July 25, 2024 19:48
@Michael-J-Ward Michael-J-Ward changed the title WIP: Upgrade Datafusion 40 Upgrade Datafusion 40 Jul 25, 2024
Copy link
Contributor

@timsaucer timsaucer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I don't see any showstoppers. Most of my comments are thoughts, just the one issue I see about maybe updating the macro.

If possible I would like to get these merged in before the release. Both are approved.
#768
#770

Comment on lines +40 to +53
#[pyfunction]
pub fn approx_distinct(expression: PyExpr) -> PyExpr {
functions_aggregate::expr_fn::approx_distinct::approx_distinct(expression.expr).into()
}

#[pyfunction]
pub fn approx_median(expression: PyExpr, distinct: bool) -> PyResult<PyExpr> {
let expr = functions_aggregate::expr_fn::approx_median(expression.expr);
if distinct {
Ok(expr.distinct().build()?.into())
} else {
Ok(expr.into())
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like these are a lot of work. Is it too difficult to instead update aggregate_function! macro?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

One of the first functions I went to port was first_value, and I came away certain that we'd end up some Builder API around the aggregate / window functions. Additionally, the functions currently expose different builder args.

So, I decided to go the explicit route instead of writing a macro_rules! just to change it next release

(That's why I was excited to see you doing precisesly that Builder API work upstream.)

Comment on lines +1484 to +1506
def bit_and(arg: Expr, distinct: bool = False) -> Expr:
"""Computes the bitwise AND of the argument."""
args = [arg.expr for arg in args]
return Expr(f.bit_and(*args, distinct=distinct))
return Expr(f.bit_and(arg.expr, distinct=distinct))


def bit_or(*args: Expr, distinct: bool = False) -> Expr:
def bit_or(arg: Expr, distinct: bool = False) -> Expr:
"""Computes the bitwise OR of the argument."""
args = [arg.expr for arg in args]
return Expr(f.bit_or(*args, distinct=distinct))
return Expr(f.bit_or(arg.expr, distinct=distinct))


def bit_xor(*args: Expr, distinct: bool = False) -> Expr:
def bit_xor(arg: Expr, distinct: bool = False) -> Expr:
"""Computes the bitwise XOR of the argument."""
args = [arg.expr for arg in args]
return Expr(f.bit_xor(*args, distinct=distinct))
return Expr(f.bit_xor(arg.expr, distinct=distinct))


def bool_and(*args: Expr, distinct: bool = False) -> Expr:
def bool_and(arg: Expr, distinct: bool = False) -> Expr:
"""Computes the boolean AND of the arugment."""
args = [arg.expr for arg in args]
return Expr(f.bool_and(*args, distinct=distinct))
return Expr(f.bool_and(arg.expr, distinct=distinct))


def bool_or(*args: Expr, distinct: bool = False) -> Expr:
def bool_or(arg: Expr, distinct: bool = False) -> Expr:
"""Computes the boolean OR of the arguement."""
args = [arg.expr for arg in args]
return Expr(f.bool_or(*args, distinct=distinct))
return Expr(f.bool_or(arg.expr, distinct=distinct))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I think these are on me - I should have realized they should only have one argument.

Copy link
Contributor Author

@Michael-J-Ward Michael-J-Ward Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nay - these wrappers were correct for v39.

The regr_* functions were incorrect, though. But that led me to discover that the tests for those are bypassing the wrappers: #778

src/functions.rs Outdated
}

#[pyfunction]
#[pyo3(signature = (*args, distinct = false, filter = None, order_by = None, null_treatment = None))]
#[pyo3(signature = (expr, distinct = false, filter = None, order_by = None, null_treatment = None))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the wrappers now in place, I'm wondering if we should just remove all the pyo3 signatures just to reduce clutter in the repo. That shouldn't hold up this PR though. What do you think?

Comment on lines +572 to +578
// TODO: should we just expose this in python?
/// Create a COUNT(1) aggregate expression
#[pyfunction]
fn count_star() -> PyResult<PyExpr> {
Ok(PyExpr {
expr: Expr::AggregateFunction(AggregateFunction {
func_def: datafusion_expr::expr::AggregateFunctionDefinition::BuiltIn(
aggregate_function::AggregateFunction::Count,
),
args: vec![lit(1)],
distinct: false,
filter: None,
order_by: None,
null_treatment: None,
}),
})
fn count_star() -> PyExpr {
functions_aggregate::expr_fn::count(lit(1)).into()
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the question:

I think it's worth adding some guidance about what to put in python vs what to put in rust for this repo. In my mind the things that should go into the python side are

  • Trivial aliases (this is a good example)
  • Simple type conversion, like path -> string of the path or number to lit(number)
  • More complex type conversion if it makes sense to do in python. For example, in the named_struct where sending in a dictionary on the python side makes a lot more sense and isn't as simple to do via pyo3.

And then I'd lean towards everything else sitting on the rust side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I add that to the functions.rs module docs and/or somewhere in the contributor guide?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say the contributor guide since it's more general guidance and not specific to functions

@Michael-J-Ward
Copy link
Contributor Author

@timsaucer I've captured the follow-on issues and added your two to the tracking issue #776

I'd prefer to merge large one in and follow-on with smaller PRs for the remaining.

@timsaucer
Copy link
Contributor

I've pushed this PR that just ensures we haven't missed any exports with the new wrappers. It might make sense to get it into the 40.0 release.

#782

@Michael-J-Ward
Copy link
Contributor Author

Michael-J-Ward commented Jul 27, 2024

An aside:

To me, the process of releasing a new datafusion-python version:

  1. Upgrade the datafusion deps and migrate code so new code compiles
  2. Integrate new upstream features that are available.
  3. Any additional fixes / improvements we want to get into this release
  4. Tag a commit as new version and do the release.

In my view, this PR is just step 1, but I feel like that's not clear enough.

And the "blast-radius" of the upgrade PR is such that I'd prefer to merge once CI is clean and then do the additional steps 2 and 3.

@timsaucer
Copy link
Contributor

Thank you. That is very helpful!

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I planend on reviewing this today but ran out of time. I skimmed throught this and the changes look reasonable, so am approving on that basis. Thanks @Michael-J-Ward and thanks for the review @timsaucer.

@andygrove andygrove merged commit f580155 into apache:main Jul 31, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test_regr_funcs does not test datafusion-python functions Remove duplicate aliases
3 participants