Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve pushdown optimization and logical to physical transformation #1091

Merged
merged 13 commits into from
Dec 7, 2022

Conversation

dai-chen
Copy link
Collaborator

@dai-chen dai-chen commented Nov 19, 2022

Signed-off-by: Chen Dai [email protected]

Background

This section introduces the current architecture of logical optimizer and physical transformation.

Logical-to-Logical Optimization

Currently each storage engine adds its own logical operator as concrete implementation for TableScanOperator abstraction. Typically each data source needs to add 2 logical operators for table scan with and without aggregation. Take OpenSearch for example, there are OpenSearchLogicalIndexScan and OpenSearchLogicalIndexAgg and a bunch of pushdown optimization rules for each accordingly.

class LogicalPlanOptimizer:
  /*
   * OpenSearch rules include:
   *   MergeFilterAndRelation
   *   MergeAggAndIndexScan
   *   MergeAggAndRelation
   *   MergeSortAndRelation
   *   MergeSortAndIndexScan
   *   MergeSortAndIndexAgg
   *   MergeSortAndIndexScan
   *   MergeLimitAndRelation
   *   MergeLimitAndIndexScan
   *   PushProjectAndRelation
   *   PushProjectAndIndexScan
   *
   * that return *OpenSearchLogicalIndexAgg*
   *  or *OpenSearchLogicalIndexScan* finally
   */
  val rules: List<Rule>

  def optimize(plan: LogicalPlan):
    for rule in rules:
      if rule.match(plan):
        plan = rules.apply(plan)
    return plan.children().forEach(this::optimize)

Logical-to-Physical Transformation

After logical transformation, planner will let the Table in LogicalRelation (identified before logical transformation above) transform the logical plan to physical plan.

class OpenSearchIndex:

  def implement(plan: LogicalPlan):
    return plan.accept(
      DefaultImplementor():
        def visitNode(node):
          if node is OpenSearchLogicalIndexScan:
            return OpenSearchIndexScan(...)
          else if node is OpenSearchLogicalIndexAgg:
            return OpenSearchIndexScan(...)

Problem Statement

The current planning architecture causes 2 serious problems:

  1. Each data source adds special logical operator and explode the optimizer rule space. For example, Prometheus also has PrometheusLogicalMetricAgg and PrometheusLogicalMetricScan accordingly. They have the exactly same pattern to match query plan tree as OpenSearch.
  2. A bigger problem is the difficulty of transforming from logical to physical when there are 2 Tables in query plan. Because only 1 of them has the chance to do the implement(). This is a blocker for supporting INSERT ... SELECT ... statement or JOIN query. See code below.
  public PhysicalPlan plan(LogicalPlan plan) {
    Table table = findTable(plan);
    if (table == null) {
      return plan.accept(new DefaultImplementor<>(), null);
    }
    return table.implement(
        table.optimize(optimize(plan)));
  }

Solution

TableScanBuilder

A new abstraction TableScanBuilder is added as a transition operator during logical planning and optimization. Each data source provides its implementation class by Table interface. The push down difference in non-aggregate and aggregate query is hidden inside specific scan builder, for example OpenSearchIndexScanBuilder rather than exposed to core module.

TableScanBuilder

TablePushDownRules

In this way, LogicalOptimizier in core module always have the same set of rule for all push down optimization.

LogicalPlanOptimizer

Examples

The following diagram illustrates how TableScanBuilder along with TablePushDownRule solve the problem aforementioned.

optimizer-Page-1

Similarly, TableWriteBuilder will be added and work in the same way in separate PR: #1094

optimizer-Page-2

TODO

  1. Refactor Prometheus optimize rule and enforce table scan builder
  2. Figure out how to implement AD commands
  3. Deprecate optimize() and implement() if item 1 and 2 complete
  4. Introduce fixed point or maximum iteration limit for iterative optimization
  5. Investigate if CBO should be part of current optimizer or distributed planner in future
  6. Remove pushdownHighlight once it's moved to OpenSearch storage
  7. Move TableScanOperator to the new read package (leave it in this PR to avoid even more file changed)

Issues Resolved

#948

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dai-chen dai-chen added the maintenance Improves code quality, but not the product label Nov 19, 2022
@dai-chen dai-chen self-assigned this Nov 19, 2022
@codecov-commenter
Copy link

codecov-commenter commented Nov 19, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.78%. Comparing base (354e843) to head (51196ab).
Report is 399 commits behind head on 2.x.

Additional details and impacted files
@@             Coverage Diff              @@
##                2.x    #1091      +/-   ##
============================================
- Coverage     98.31%   95.78%   -2.53%     
- Complexity     3485     3502      +17     
============================================
  Files           348      350       +2     
  Lines          8707     9306     +599     
  Branches        555      669     +114     
============================================
+ Hits           8560     8914     +354     
- Misses          142      334     +192     
- Partials          5       58      +53     
Flag Coverage Δ
query-workbench 62.76% <ø> (?)
sql-engine 98.30% <100.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Yury-Fridlyand
Copy link
Collaborator

Is it related to #1088?

@dai-chen
Copy link
Collaborator Author

#1088

Not related directly. I think it's related to #811 which is the solution?

@dai-chen dai-chen marked this pull request as ready for review November 23, 2022 22:37
@dai-chen dai-chen requested a review from a team as a code owner November 23, 2022 22:37
@Yury-Fridlyand
Copy link
Collaborator

I like the PR description. Can you add it to the docs section?

@dai-chen
Copy link
Collaborator Author

I like the PR description. Can you add it to the docs section?

Sure, I'm drafting another PR for updating our dev docs: #1092. Will add this too. Thanks!

@dai-chen
Copy link
Collaborator Author

dai-chen commented Dec 7, 2022

ML integration seems missing IT. Just tested manually to confirm it is not impacted.

In particular, source = accounts | AD can be translated to ADOperator as before.

@dai-chen dai-chen merged commit 64a3794 into opensearch-project:2.x Dec 7, 2022
@dai-chen dai-chen deleted the improve-optimizer branch December 16, 2022 18:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Improves code quality, but not the product
Development

Successfully merging this pull request may close these issues.

5 participants