Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't ignore failed optimizer rules #4615

Closed
crepererum opened this issue Dec 14, 2022 · 19 comments · Fixed by #6265
Closed

Don't ignore failed optimizer rules #4615

crepererum opened this issue Dec 14, 2022 · 19 comments · Fixed by #6265
Labels
bug Something isn't working

Comments

@crepererum
Copy link
Contributor

crepererum commented Dec 14, 2022

Describe the bug
Some logical optimizer rules like type coercion are actually essential for correct query execution, but are currently silently ignored when they failed.

To Reproduce
TBD

Expected behavior
Don't skip failed rules (this probably requires us to remove the non-failing OptimizerRule::optimize method).

Workaround
This behavior is controlled by a config option (that defaults to true meaning to silently skip failed errors).

You can see this in the datafusion cli:

DataFusion CLI v20.0.0
❯ show all;
+-----------------------------------------------------------+------------+
| name                                                      | setting    |
+-----------------------------------------------------------+------------+
...
| datafusion.optimizer.skip_failed_rules                    | true       |
...
+-----------------------------------------------------------+------------+
35 rows in set. Query took 0.038 seconds.

Datafusion can be configured to fail hard if an optimizer rule fails like

-- default behaviorset datafusion.optimizer.skip_failed_rules = false;
0 rows in set. Query took 0.023 seconds.

Additional context
See this discussion.

@crepererum crepererum added the bug Something isn't working label Dec 14, 2022
@jackwener
Copy link
Member

jackwener commented Dec 14, 2022

Make sense to me. It's a good discussion point.
Some rule is essential, we shouldn't tolerate them failed.
BTW, I think we also should make optimizer rule more stable instead of ignoring error.
IMHO, I think it's not good that tolerate error in optimizer. If a rule want to tolerate failed, I think it may be better handle inside itself.

@alamb
Copy link
Contributor

alamb commented Dec 14, 2022

I think the original "ignore optimizer rule if error" was added (maybe by @andygrove or @avantgardnerio ) because at that time there was no way to distinguish between "Optimizer didn't error but could not handle the query" and "optimizer hit a real error"

Now that Optimizer::try_optimize exists, which can distinguish between those two cases, I think removing OptimizerRile::optimize is 👍

@alamb
Copy link
Contributor

alamb commented Dec 14, 2022

Related #4619

@HaoYang670
Copy link
Contributor

HaoYang670 commented Dec 15, 2022

Some logical optimizer rules like type coercion are actually essential for correct query execution

Some rule is essential, we shouldn't tolerate them failed

at that time there was no way to distinguish between "Optimizer didn't error but could not handle the query" and "optimizer hit a real error"

The type coercion and other rules for correctness should be moved in front of the optimizer.
It is the duty of the optimizer to make the query plan to achieve better performance, but not to correct the query.
In another word, the Datafusion should work even if we remove the optimizer.

It could be good if we could extract the type coercion from the optimizer and put it into a type checker for correcting the query. The pipeline could be very similar to a compiler:

SQL ---(parser)---> logical plan ---(type checker)---> verified logical plan ---(logical optimizer)---> optimized logical plan ...
          |                                 |
         \|/                               \|/
       Syntax error                     Type error

If we could do this, one different idea of the return type of the optimizer is that I think the optimizer should return an Option<LogicalPlan> instead of Result<Option<LogicalPlan>> or Result<LogicalPlan>. Because we should not allow an error (even if it is an Internal Error) returned from the optimizer, with the assumption that the input plan has been verified. The variant of the returned values should only be:

  1. None: when the plan cannot be optimized.
  2. Some(plan): return the optimized plan
  3. panic!(..): there is a bug in the optimizer.

@crepererum
Copy link
Contributor Author

On application of failing during an optimization may also be to detect malformed plans early. E.g. while working on #4370 I've figured that we could reject malformed regex patterns that are available during planning as literal early. I'm wondering if this should then be done by failing the optimization (and by fixing this very issue) or by inserting "terminal"/"always error" expressions into the plan that are later picked up by the physical planner.

@alamb
Copy link
Contributor

alamb commented Dec 15, 2022

The type coercion and other rules for correctness should be moved in front of the optimizer.
It is the duty of the optimizer to make the query plan to achieve better performance, but not to correct the query.
In another word, the Datafusion should work even if we remove the optimizer.

I agree this would be a good change

@liukun4515 made a similar observation / suggestion #3582

I'm wondering if this should then be done by failing the optimization (and by fixing this very issue) or by inserting "terminal"/"always error" expressions into the plan that are later picked up by the physical planner.

I recommend the inserting an "always error" expr node. The rationale being that you could have branches that are never actually hit in the query

A trivial example is like

CASE
  WHEN FALSE THEN 1 / 0 
  WHEN TRUE THEN 5.0
END

@alamb
Copy link
Contributor

alamb commented Dec 15, 2022

panic!(..): there is a bug in the optimizer.

I would personally rather have an InternalError rather than a panic

@andygrove
Copy link
Member

original "ignore optimizer rule if error" was added (maybe by @andygrove or @avantgardnerio ) because at that time there was no way to distinguish between "Optimizer didn't error but could not handle the query" and "optimizer hit a real error"

That's correct. Now that we have the try_optimize method I support failing on errors and we will soon fix cases where rules incorrectly return errors.

@HaoYang670
Copy link
Contributor

HaoYang670 commented Dec 21, 2022

I'm wondering if this should then be done by failing the optimization (and by fixing this very issue) or by inserting "terminal"/"always error" expressions into the plan that are later picked up by the physical planner.

My suggestion is to fail as early as possible if we have detected some errors in the logical plan.

I recommend the inserting an "always error" expr node. The rationale being that you could have branches that are never actually hit in the query

A trivial example is like

CASE
  WHEN FALSE THEN 1 / 0 
  WHEN TRUE THEN 5.0
END

Why do we cover up the error, instead of exposing them? Besides, we should not return a compile error at runtime, even if the branch won't be hit. No static typed language would do this (I guess SQL is not a dynamic typed language 🤔).

@HaoYang670
Copy link
Contributor

I tried to remove the config skip_failed_rules and met several test failures tracked in #4685.
Feel free to update the task-list if you find more and welcome to fix them if you have time!

@alamb
Copy link
Contributor

alamb commented Dec 21, 2022

Why do we cover up the error, instead of exposing them? Besides, we should not return a compile error at runtime, even if the branch won't be hit. No static typed language would do this (I guess SQL is not a dynamic typed language 🤔).

Well I think the rationale / question is if SQL expression semantics allow / require eager evaluation or if delayed evaluation

A more realistic example might be to use the CASE to explicitly guard against a divide by zero

CASE
  WHEN num <> 0 THEN value / num
  ELSE  0.0
END

So in that case evaluating value/num might cause a divide by zero if it were done prior to evaluating the WHEN guard 🤔

@HaoYang670
Copy link
Contributor

Well I think the rationale / question is if SQL expression semantics allow / require eager evaluation or if delayed evaluation

These are the variants for runtime behavior, not related to compile time checking I think.
I've done some tests on Postgres, and find the expression when FALSE then ... will be removed before type checking, so if you have the expr when FALSE then 1/0 in your query, it is always valid: https://onecompiler.com/postgresql/3yskvh35x

But if the invalid expression can not be removed at compile time, a Divide by zero error will always return before executing the query, even if the invalid code won't be touched at runtime: https://onecompiler.com/postgresql/3yskvqaxf

@crepererum
Copy link
Contributor Author

The "when to check errors" is highly dependent on the database. I've worked with MS cloud SQL and Exasol before and in some cases, type errors (e.g. due to incompatible operators or functions that only accept a certain type) were only raised if at least a single row was passed throw it (post joins and where selections). These errors happened at query execution (not planning). From a user's perspective this was highly confusing. I recommend we take a look at the SQL standard and if it allows for it we bail out of invalid queries as early as possible, no matter if there's any data to query.

@alamb
Copy link
Contributor

alamb commented Jan 2, 2023

I believe that the SQL standard says it is "implementation defined":

6.3.3.3 Rule evaluation order

...

Where the precedence is not determined by the Formats or by parentheses, effective evaluation of expressions
is generally performed from left to right. However, it is implementation-dependent whether expressions are
actually evaluated left to right, particularly when operands or operators might cause conditions to be raised or
if the results of the expressions can be determined without completely evaluating all parts of the expression.

@wolffcm
Copy link
Contributor

wolffcm commented Mar 24, 2023

We recently ran into this when implementing gap-filling rule HandleGapFill for IOx which makes use of the logical planner:

  • it looks for a call to a UDF DATE_BIN_GAPFILL and if it exists it is removed (there is no actual implementation)
  • Inserts a GapFill node into the plan

https://github.com/influxdata/influxdb_iox/issues/7330

In this case the rule is really performing "lowering" (compiler jargon) in the sense that we need to get rid of this UDF for the query to be valid. Any error that occurs should be fatal for the query, so we can return an appropriate message to the user.

So I just want to voice that TypeCoercion isn't in a category by itself, in the sense that it needs to run and succeed for the query to be valid.

@alamb
Copy link
Contributor

alamb commented Mar 27, 2023

I updated the description with a workaround (to set datafusion.optimizer.skip_failed_rules to false).

@alamb
Copy link
Contributor

alamb commented Mar 27, 2023

FWIW @HaoYang670 's ticket #4685 describes the work needed to make sure all queries pass with skip_failed_rules set to false

@alamb
Copy link
Contributor

alamb commented Apr 10, 2023

I hope to look into this sometime over the next two weeks

@alamb
Copy link
Contributor

alamb commented May 9, 2023

@jackwener has a PR up to fix this ❤ #6265

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants