docs/design: add proposal for skyline pruning #9184

Merged: 6 commits, Feb 12, 2019
40 changes: 40 additions & 0 deletions docs/design/2019-01-25-skyline-pruning.md
@@ -0,0 +1,40 @@
# Proposal: Support Skyline Pruning

- Author(s): [Haibin Xie](https://github.com/lamxTyler)
- Last updated: 2019-01-25
- Discussion at:

## Abstract

This proposal introduces some heuristics and a general framework for pruning access paths. With them, the optimizer can avoid some wrong access path choices.

## Background

Currently, the choice of access path depends heavily on statistics, so we may choose the wrong index when the statistics are outdated. However, many of these wrong choices can be eliminated by simple heuristics. For example, if the primary key or a unique index can be fully matched, we can choose it without referring to the statistics.

## Proposal

The most important factors in choosing an access path are the number of rows that need to be scanned, whether the path matches the required physical property, and whether it requires a double scan. Among these three factors, only the scan row count depends on statistics. So how can we compare the scan row count without statistics? Let's take a look at an example:
Contributor:

Do these two factors have different priorities?
- single read / double read
- required properties

Contributor Author:

No, they are equal.


```sql
create table t(a int, b int, c int, index idx1(b, a), index idx2(a));
select * from t where a = 1 and b = 1;
```

From the query and schema, we can see that the access condition of `idx1` strictly covers that of `idx2`. Therefore the number of rows scanned by `idx1` will be no more than the number scanned by `idx2`, so `idx1` is better than `idx2` in this case.
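
The covering check in this example can be sketched as a comparison of column sets. This is only an illustration, not the proposed implementation: it assumes every predicate is an equality joined by `AND`, and the helper names `access_columns` and `strictly_covers` are hypothetical.

```python
def access_columns(index_columns, eq_columns):
    """Return the leading index columns that are bound by equality predicates.

    An index can only use a column in its access condition if all columns
    before it in the index definition are also bound.
    """
    used = set()
    for col in index_columns:
        if col not in eq_columns:
            break
        used.add(col)
    return used


def strictly_covers(x_cols, y_cols):
    """True if x's access columns are a proper superset of y's."""
    return x_cols > y_cols


# select * from t where a = 1 and b = 1;
eq_columns = {"a", "b"}
idx1 = access_columns(["b", "a"], eq_columns)  # index idx1(b, a)
idx2 = access_columns(["a"], eq_columns)       # index idx2(a)

# idx1 filters on every column idx2 filters on, plus one more,
# so it cannot scan more rows than idx2.
print(strictly_covers(idx1, idx2))
```

Note that two column sets may be incomparable (neither is a superset of the other); in that case this rule says nothing and the decision is left to the statistics.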

So how can we combine these factors to prune the access paths? Consider two access paths `x` and `y`. If `x` is not worse than `y` on any factor, and there exists at least one factor on which `x` is better than `y`, then we can prune `y` before referring to the statistics, because `x` will be at least as good as `y` in all circumstances. This is also called skyline pruning.
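
A minimal sketch of this dominance test, assuming the three factors from this section are summarized per access path. The `AccessPath` class and its field names are hypothetical and chosen only for illustration; they are not taken from the actual code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPath:
    name: str
    access_cols: frozenset   # columns used in the access condition
    matches_property: bool   # satisfies the required physical property
    single_scan: bool        # True if no double scan is needed


def compare_sets(x, y):
    """Return 1 if x strictly covers y, -1 for the reverse, 0 if equal,
    and None if the two sets are incomparable."""
    if x == y:
        return 0
    if x > y:
        return 1
    if x < y:
        return -1
    return None


def compare_bools(x, y):
    """Return 1 if only x holds, -1 if only y holds, 0 otherwise."""
    return int(x) - int(y)


def dominates(x, y):
    """True if x is not worse than y on every factor and better on at least one."""
    results = [
        compare_sets(x.access_cols, y.access_cols),
        compare_bools(x.matches_property, y.matches_property),
        compare_bools(x.single_scan, y.single_scan),
    ]
    if any(r is None or r < 0 for r in results):
        return False
    return any(r > 0 for r in results)


# The two indices from the example above: both need a double scan for
# `select *`, and no ordering is required, so only the columns differ.
idx1 = AccessPath("idx1", frozenset({"a", "b"}), True, False)
idx2 = AccessPath("idx2", frozenset({"a"}), True, False)
print(dominates(idx1, idx2))
```

All factors are treated with equal weight here: a path that is worse on any single factor never dominates, however much better it is on the others.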
Member:

Could you add all the rules that you decided to implement in the proposed skyline pruning process?

Contributor Author:

Member:

Oh, actually I mean this:

> how can we compare the scan row count without statistics

For now, there is only one rule:

> the access condition of idx1 could strictly covers idx2, therefore the number of rows scanned by idx1 will be no more than idx2, so idx1 will be better than idx2 in this case.

Is there any other rule?

Contributor Author:

All three factors will be rules, but the rules for double read and required properties are quite simple, so I did not mention them. Do I need to describe them?

Contributor:

For the first factor, the number of rows that need to be scanned, are there any other heuristic rules now?

Contributor Author:

No, currently there is only one rule for the first factor.


## Rationale

Skyline pruning is also implemented in other databases, including MySQL and OceanBase. Without it, we may choose the wrong access path in some simple cases.

## Compatibility

It does not affect compatibility.

## Implementation

Since we need to decide whether the access path matches the physical property, we need to do the skyline pruning when finding the best task for the data source. Because there usually won't be too many indices, a naive nested-loops algorithm will suffice. The comparison of any two access paths has been explained in the previous `Proposal` section.
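
The nested-loops pruning could look roughly like the sketch below. It is an illustration under simplifying assumptions rather than the actual implementation: `skyline_prune` and `dominates_by_score` are hypothetical names, and each candidate is reduced to a tuple of numeric per-factor scores (higher is better), which stands in for the real pairwise comparison.

```python
def skyline_prune(paths, dominates):
    """Keep every path that no other candidate dominates.

    Compares each pair of candidates, so the cost is O(n^2) in the number
    of access paths, which is acceptable for a handful of indices.
    """
    kept = []
    for i, y in enumerate(paths):
        if not any(dominates(x, y) for j, x in enumerate(paths) if j != i):
            kept.append(y)
    return kept


def dominates_by_score(x, y):
    """x dominates y if it is >= on every factor and > on at least one."""
    xs, ys = x[1], y[1]
    return (all(a >= b for a, b in zip(xs, ys))
            and any(a > b for a, b in zip(xs, ys)))


# Hypothetical scores: (matched columns, matches property, single scan).
candidates = [
    ("idx1", (2, 1, 0)),
    ("idx2", (1, 1, 0)),
    ("table_scan", (0, 0, 1)),
]
survivors = skyline_prune(candidates, dominates_by_score)
print([name for name, _ in survivors])
```

Only `idx2` is pruned here, because `idx1` is at least as good on every factor. The table scan survives since it is the only candidate that avoids a double scan, so the choice between it and `idx1` still falls to the cost model and the statistics.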

## Open issues (if applicable)
1 change: 1 addition & 0 deletions docs/design/README.md
@@ -22,6 +22,7 @@ Writing a design document can promote us to think deliberately and gather knowle

- [Proposal: A new command to restore dropped table](./2018-08-10-restore-dropped-table.md)
- [Proposal: Support SQL Plan Management](./2018-12-11-sql-plan-management.md)
- [Proposal: Support Skyline Pruning](./2019-01-25-skyline-pruning.md)

### In Progress
