Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert rank / dense_rank and percent_rank builtin functions to UDWF #12718

Merged

Conversation

jatin510
Copy link
Contributor

@jatin510 jatin510 commented Oct 2, 2024

Which issue does this PR close?

Closes #12648

Rationale for this change

Same as described in #8709.

What changes are included in this PR?

  1. Converts BuiltinWindowFunction::Rank, BuiltinWindowFunction::DenseRank, BuiltinWindowFunction::PercentRank to user-defined window function.
  2. Export a fluent-style API for creating rank, dense_rank, percent_rank expressions.

Are these changes tested?

Updates explain output in sqllogictests to match lowercase rank, dense_rank and percent_rank.

Are there any user-facing changes?

@jatin510 jatin510 marked this pull request as draft October 2, 2024 12:43
@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate proto Related to proto crate labels Oct 2, 2024
@@ -508,6 +508,9 @@ message ScalarUDFExprNode {
enum BuiltInWindowFunction {
UNSPECIFIED = 0; // https://protobuf.dev/programming-guides/dos-donts/#unspecified-enum
// ROW_NUMBER = 0;
// TODO
// if i will remove this
// should i renumber all variables?
RANK = 1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcsherin how to handle this ?
I want to remove rank, percent_rank and dense_rank from here ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You just need to comment out the lines (not remove them):

enum BuiltInWindowFunction {
  UNSPECIFIED = 0; // https://protobuf.dev/programming-guides/dos-donts/#unspecified-enum
  // ROW_NUMBER = 0;
  // RANK = 1;
  // DENSE_RANK = 2;
  // PERCENT_RANK = 3;

@github-actions github-actions bot added the sql SQL Planner label Oct 3, 2024
\n Aggregate: groupBy=[[ROLLUP (person.state, person.last_name)]], aggr=[[sum(person.age), grouping(person.state), grouping(person.last_name)]]\
\n TableScan: person";
// TODO
// fix this
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcsherin when i am running cargo test,
it is not able to detect the rank, dense_rank and percent_rank udwf functions.
I even added the window_function at line 2637 as well.
Any idea where to register the newly created udwf's.

Thanks, In advance .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UDWF is registered but the lookup in get_window_meta still returns None.

fn get_window_meta(&self, _name: &str) -> Option<Arc<WindowUDF>> {
None
}

This PR is coming along well 😄. Thanks @jatin510.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) substrait labels Oct 4, 2024
/// TODO
/// how to handle import of percent_rank function from
/// datafusion+functions_window
/// as using this is causing the circular depency issue
/// # // first_value is an aggregate function in another crate
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the expr function have written docs for the percent_rank,
but when I try to import the percent_rank from datafusion_functions_window workspace,
its throwing error related to the circular depency

how to handle such cases ?

Copy link
Contributor

@jcsherin jcsherin Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of ways to get around this. But here I noticed that the first_value function is a mock and not the real first_value window function.

So you can consider doing the same for percent_rank. Then you don't have to import anything. This is alright because the test is not testing the functionality of percent_rank in the doc test.

Another advantage of mocking in this specific case is that the change remains local to the doc test which I believe is good.

@alamb
Copy link
Contributor

alamb commented Oct 4, 2024

This is so exciting to see

@jatin510
Copy link
Contributor Author

jatin510 commented Oct 7, 2024

in window.slt file
while running cargo test
getting error:

Running "map.slt"
External error: query failed: DataFusion error: Arrow error: Invalid argument error: column types must match schema types, expected UInt64 but found Float64 at column index 6
[SQL] SELECT
  c9,
  RANK() OVER(ORDER BY c1) AS rank1,
  RANK() OVER(ORDER BY c1 ROWS BETWEEN 10 PRECEDING and 1 FOLLOWING) as rank2,
  DENSE_RANK() OVER(ORDER BY c1) as dense_rank1,
  DENSE_RANK() OVER(ORDER BY c1 ROWS BETWEEN 10 PRECEDING and 1 FOLLOWING) as dense_rank2,
  PERCENT_RANK() OVER(ORDER BY c1) as percent_rank1,
  PERCENT_RANK() OVER(ORDER BY c1 ROWS BETWEEN 10 PRECEDING and 1 FOLLOWING) as percent_rank2
  FROM aggregate_test_100
  ORDER BY c9
  LIMIT 5
at test_files/window.slt:1151

but percent_rank is supposed to return the float64 value

@jcsherin
Copy link
Contributor

jcsherin commented Oct 7, 2024

but percent_rank is supposed to return the float64 value

The built-in implementation doesn't define the result type as Datatype::Int64. It depends on the type of the input expression.

fn field(&self) -> Result<Field> {
let nullable = false;
Ok(Field::new(self.name(), self.data_type.clone(), nullable))
}

You can use the WindowUDFFieldArgs::get_input_type method to set it.

/// Returns `Some(DataType)` of input expression at index, otherwise
/// returns `None` if the index is out of bounds.
pub fn get_input_type(&self, index: usize) -> Option<DataType> {
self.input_types.get(index).cloned()
}

@jatin510 You are right, the return type is DataType::Float64 for percent_rank. Your fix in f1c7547 looks right to me.

The percent_rank takes zero arguments, so it does not depend on the type of the input expression.

The return_type for the BuiltInWindowFunction is defined here:

match self {
BuiltInWindowFunction::Rank
| BuiltInWindowFunction::DenseRank
| BuiltInWindowFunction::Ntile => Ok(DataType::UInt64),
BuiltInWindowFunction::PercentRank | BuiltInWindowFunction::CumeDist => {
Ok(DataType::Float64)
}

@jatin510 jatin510 marked this pull request as ready for review October 8, 2024 04:00
@jatin510
Copy link
Contributor Author

jatin510 commented Oct 8, 2024

Thanks for your help, @jcsherin

@jatin510 jatin510 changed the title wip: converting rank builtin function to UDWF Converting rank builtin function to UDWF Oct 8, 2024
@jatin510 jatin510 requested a review from jcsherin October 8, 2024 07:22
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add test cases for the expression API here:

async fn roundtrip_expr_api() -> Result<()> {


pub fn with_window_function(mut self, window_function: Arc<WindowUDF>) -> Self {
self.window_functions.insert(
window_function.name().to_string().to_lowercase(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the to_lowercase() is needed here for WindowUDFs.

@jcsherin
Copy link
Contributor

jcsherin commented Oct 8, 2024

@jatin510 Great job! This PR is almost there ✊.

The Rank user-defined window function implementation is now split across three modules. This has resulted in duplicated code in many places.

It would be nicer if we continue to maintain the same form factor as the original implementation.

You can reuse the RankType enum used in the built-in window function implementation in WindowUDF:

pub enum RankType {
    Basic,
    Dense,
    Percent,
}

And in PartitionEvaluator::evaluate you can specialize like this,

/// Evaluates the window function inside the given range.
    fn evaluate(
        &mut self,
        values: &[ArrayRef],
        range: &Range<usize>,
    ) -> Result<ScalarValue> {

      /// no more code duplication here ...

       match self.rank_type {
            RankType::Basic => Ok(ScalarValue::UInt64(Some(
                self.state.last_rank_boundary as u64 + 1,
            ))),
            RankType::Dense => Ok(ScalarValue::UInt64(Some(self.state.n_rank as u64))),
            RankType::Percent => {
                exec_err!("Can not execute PERCENT_RANK in a streaming fashion")
            }
        }
}

Would you like to unify this? If not, this can be completed in a follow-on PR.

define_udwf_and_expr!(
DenseRank,
dense_rank,
"Returns the rank of each row within a window partition without gaps."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Returns the rank of each row within a window partition without gaps."
"Returns rank of the current row without gaps. This function counts peer groups"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

define_udwf_and_expr!(
PercentRank,
percent_rank,
"Returns the relative rank of the current row within a window partition."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Returns the relative rank of the current row within a window partition."
"Returns the relative rank of the current row: (rank - 1) / (total rows - 1)"

Copy link
Contributor

@jcsherin jcsherin Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jatin510
Copy link
Contributor Author

jatin510 commented Oct 8, 2024

@jatin510 Great job! This PR is almost there ✊.

The Rank user-defined window function implementation is now split across three modules. This has resulted in duplicated code in many places.

It would be nicer if we continue to maintain the same form factor as the original implementation.

You can reuse the RankType enum used in the built-in window function implementation in WindowUDF:

pub enum RankType {
    Basic,
    Dense,
    Percent,
}

And in PartitionEvaluator::evaluate you can specialize like this,

/// Evaluates the window function inside the given range.
    fn evaluate(
        &mut self,
        values: &[ArrayRef],
        range: &Range<usize>,
    ) -> Result<ScalarValue> {

      /// no more code duplication here ...

       match self.rank_type {
            RankType::Basic => Ok(ScalarValue::UInt64(Some(
                self.state.last_rank_boundary as u64 + 1,
            ))),
            RankType::Dense => Ok(ScalarValue::UInt64(Some(self.state.n_rank as u64))),
            RankType::Percent => {
                exec_err!("Can not execute PERCENT_RANK in a streaming fashion")
            }
        }
}

Would you like to unify this? If not, this can be completed in a follow-on PR.

I will create a follow up PR for this change

@jatin510 jatin510 requested a review from jcsherin October 8, 2024 14:40
Comment on lines +943 to +945
rank(),
dense_rank(),
percent_rank(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jatin510
Copy link
Contributor Author

jatin510 commented Oct 8, 2024

The PR is ready to be reviewed @jcsherin @alamb

@alamb
Copy link
Contributor

alamb commented Oct 8, 2024

Awesome -- than you @jatin510 -- I have startd the CI and plan to review this tomorrow morning

@alamb alamb changed the title Converting rank builtin function to UDWF Convert rank builtin function to UDWF Oct 9, 2024
@alamb alamb added the api change Changes the API exposed to users of the crate label Oct 9, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jatin510 and @jcsherin -- this is a really nice improvement. We are getting very close to having no built in functions 🎉 .

FYI @mustafasrepo @ozankabak and @berkaysynnada

// specific language governing permissions and limitations
// under the License.

//! Defines physical expression for `dense_rank` that can evaluated at runtime during query execution
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically speaking this module contains the WindowUDFImpl for dense_rank, but I think this wording is consistent with the other code in this crate.

I have this PR checked out to resolve some logical conflicts anyways, so I will pushed a commit to change it in this PR too

@alamb alamb changed the title Convert rank builtin function to UDWF Convert rank / dense_rank and percen_rank builtin functions to UDWF Oct 9, 2024
@jatin510
Copy link
Contributor Author

jatin510 commented Oct 9, 2024

Thanks, @alamb and @jcsherin, for all your help! You both are awesome!

@jatin510 jatin510 changed the title Convert rank / dense_rank and percen_rank builtin functions to UDWF Convert rank / dense_rank and percent_rank builtin functions to UDWF Oct 10, 2024
@alamb alamb merged commit 939ef9e into apache:main Oct 10, 2024
26 checks passed
@alamb
Copy link
Contributor

alamb commented Oct 10, 2024

🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions proto Related to proto crate sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Convert BuiltInWindowFunction::{Rank, PercentRank, DenseRank} to a user defined functions
3 participants