Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introducing a new optimizer framework for datafusion. #2633

Closed
liurenjie1024 opened this issue May 27, 2022 · 15 comments
Closed

Introducing a new optimizer framework for datafusion. #2633

liurenjie1024 opened this issue May 27, 2022 · 15 comments

Comments

@liurenjie1024
Copy link
Contributor

There are some discussions about datafusion's optimizer framework in #440 and #1972. And I tried to build a framework based datafusion's expression system with the following features:

  1. Includes a heuristic optimizer that applies rules to plan iteratively in different ways: top-down or bottom-up. The optimizer stops applying rules to the plan until reaching a fixpoint or max iteration times.
  2. Includes a cascades style cost-based optimizer framework.
  3. Pattern matching and binding for rules.

Following is an example of removing limit rule, including patten definition and rule implementation:

pattern(|op| matches!(op, Logical(LogicalLimit(_))))
    .leaf(|op| matches!(op, Logical(LogicalProjection(_))))
.finish()
if let (Logical(LogicalLimit(limit1)), Logical(LogicalLimit(limit2))) =
            (input.get_operator(&_ctx)?, input[0].get_operator(&_ctx)?)
{
    let new_limit = min(limit1.limit(), limit2.limit());

    let ret = input[0].clone_with_inputs(Logical(LogicalLimit(Limit::new(new_limit))));

    result.add(ret);
    Ok(())
} else {
    bail!("Pattern miss matched")
}

And the source code can be found here: https://github.com/liurenjie1024/rust-opt-framework

Welcome to discuss and share your thoughts.

@liurenjie1024
Copy link
Contributor Author

cc @mingmwang @alamb @andygrove

@alamb
Copy link
Contributor

alamb commented May 27, 2022

THanks @liurenjie1024 -- I ran out of time today to review this but I will try and find time tomorrow

@alamb
Copy link
Contributor

alamb commented May 28, 2022

I looked at https://github.com/liurenjie1024/rust-opt-framework for bit today -- it looks very neat and a good example of a more general purpose optimization framework.

I would personally be very interested in seeing an Proof Of Concept of this framework connected into DataFusion (aka instead of the existing hard coded ordering https://github.com/apache/arrow-datafusion/blob/894be6719373be85fa777028fe3ec534536660e3/datafusion/core/src/execution/context.rs#L1259-L1278)

In terms of features I think are valuable for DataFusion are:

  1. Easy to understand default behavior that is robust (in the sense it doesn't degrade horribly without statistics, etc)
  2. Customizable: some way to allow users of DataFusion to mix/match / mashup the existing optimizer passes and their own to tailor DataFusion to their own use

There are probably more things I haven't thought of yet

@liurenjie1024
Copy link
Contributor Author

I looked at https://github.com/liurenjie1024/rust-opt-framework for bit today -- it looks very neat and a good example of a more general purpose optimization framework.

I would personally be very interested in seeing an Proof Of Concept of this framework connected into DataFusion (aka instead of the existing hard coded ordering https://github.com/apache/arrow-datafusion/blob/894be6719373be85fa777028fe3ec534536660e3/datafusion/core/src/execution/context.rs#L1259-L1278)

In terms of features I think are valuable for DataFusion are:

  1. Easy to understand default behavior that is robust (in the sense it doesn't degrade horribly without statistics, etc)

  2. Customizable: some way to allow users of DataFusion to mix/match / mashup the existing optimizer passes and their own to tailor DataFusion to their own use

There are probably more things I haven't thought of yet

Good suggestion, I will write a poc to integrate with datafusion.

@andygrove
Copy link
Member

Thanks for raising this @liurenjie1024. I am very interested in this effort since I will likely be spending time contributing to the optimizer rules in the coming weeks.

@liurenjie1024
Copy link
Contributor Author

liurenjie1024 commented Jun 29, 2022

Hi, @alamb @andygrove I've finished a simple poc to integrate new optimizer framework: https://github.com/liurenjie1024/rust-opt-framework/tree/main/src/datafusion_poc

Here are the general ideas:

  1. To adopt new heuristic optimizer, we can wrap HeuristicOptimizer as a optimizer rule, and it works as following:
Datafusion Logical Plan -> Our Logical Plan -> HeuristicOptimizer -> Our Logical Plan -> Datafusion Logical Plan

You can find an implementation here:
https://github.com/liurenjie1024/rust-opt-framework/blob/main/src/datafusion_poc/rule.rs

  1. To adopt new cascades style cost based optimizer, we can implement a new QueryPlanner, which works as following:
Datafusion logical plan -> Our logical plan -> Cost based optimizer -> Our physical plan -> Datafusion physical plan

You can find implementation here:
https://github.com/liurenjie1024/rust-opt-framework/blob/main/src/datafusion_poc/planner.rs

  1. For robust behavior of cbo without statistics, I prefer to use trivial cost model. For example, add penalty for operators like sort, nest loop join, etc. Currently I don't have implementation for this, but I think the optimizer framework is flexible enough and we can add them later.

@andygrove
Copy link
Member

Thanks @liurenjie1024 I will review this next week.

@liurenjie1024
Copy link
Contributor Author

liurenjie1024 commented Jul 13, 2022

@andygrove @alamb PTAL when you are available, looking forward to hear your feedback

@alamb
Copy link
Contributor

alamb commented Jul 13, 2022

Hi @liurenjie1024 -- I'll try and find some time this weekend to review this

@liurenjie1024
Copy link
Contributor Author

I would like to donate this optimizer to datafusion-contrib so that we can develop it with community.

@alamb
Copy link
Contributor

alamb commented Jul 25, 2022

Sounds like a great idea to me -- sorry I haven't had a chance to review this @liurenjie1024 -- what would you like to call the repo in datafusion-contrib? I can make one for you

@liurenjie1024
Copy link
Contributor Author

liurenjie1024 commented Jul 25, 2022

Sounds like a great idea to me -- sorry I haven't had a chance to review this @liurenjie1024 -- what would you like to call the repo in datafusion-contrib? I can make one for you

Thanks for response. How about calling it datafusion-dolomite? No special meaning, just came up with something randomly 😂

@alamb
Copy link
Contributor

alamb commented Jul 25, 2022

I created https://github.com/datafusion-contrib/datafusion-dolomite and invited you as a maintainer -- let me know if you would like / want admin access as well to that repo

@liurenjie1024
Copy link
Contributor Author

I created https://github.com/datafusion-contrib/datafusion-dolomite and invited you as a maintainer -- let me know if you would like / want admin access as well to that repo

Thanks, I'll have a try.

@liurenjie1024
Copy link
Contributor Author

Welcome to join discussion and development in new repo https://github.com/datafusion-contrib/datafusion-dolomite, and we can close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants