Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance Regression: Backtraces in errors slow down planning time (Expensive backtraces) #7522

Closed
crepererum opened this issue Sep 11, 2023 · 8 comments · Fixed by #7527
Closed
Labels
bug Something isn't working

Comments

@crepererum
Copy link
Contributor

DataFusion creates loads of errors even on the happy path. However as of #7434, we now gather a backtrace for each error. This is rather expensive. Here is a profile dump from a prod workload:

Screenshot from 2023-09-11 18-35-55

In the said workload, there is a LOT of going for bookkeeping (that's why I had the profiler running in the first place) but the backtraces alone make up for 30% of the time. The place looks like this:

logical:
Projection: ...
  TableScan: ...                                                                                                                                                                                                                                                                                                                                                                                               

physical:
ProjectionExec: ..
  CoalesceBatchesExec: target_batch_size=8192
    FilterExec: ...
      ParquetExec: ...

(had to remove a good amount of details due to data protection, but the filters / predicates are rather simple)

I think there are two paths forward:

  • do NOT generate backtraces for errors
  • do NOT use errors for the happy path but rather Option or some other enum
@alamb alamb changed the title Expensive backtraces Backtraces in errors slow down planning time (Expensive backtraces) Sep 11, 2023
@alamb alamb changed the title Backtraces in errors slow down planning time (Expensive backtraces) Performance Regression: Backtraces in errors slow down planning time (Expensive backtraces) Sep 11, 2023
@alamb alamb added the bug Something isn't working label Sep 11, 2023
@alamb
Copy link
Contributor

alamb commented Sep 11, 2023

cc @comphead

I think there are two paths forward:
do NOT generate backtraces for errors
do NOT use errors for the happy path but rather Option or some other enum

I agree with this assessment. What I propose we do is:

  1. revert Add backtrace to error messages #7434 while we sort things out

  2. Try and work on avoiding errors on the happy path (Make a faster way to check column existence in optimizer (not is_err()) #5309 is related) which is a somewhat unfortunate pattern (as each error requires at least one string allocation!) among other things

@comphead
Copy link
Contributor

Thanks for reporting this. I'm thinking of adding an extra check whether backtrace should be generated.

For the happy path, would you mind to clarify a bit the idea?

@comphead
Copy link
Contributor

@crepererum if you run your workload with RUST_BACKTRACE=0 do you still experience performance degradation?

@alamb
Copy link
Contributor

alamb commented Sep 11, 2023

For the happy path, would you mind to clarify a bit the idea?

I think the idea is that there are many places in the code that use fallible functions (on DFSchema and otherwise) and then call ok() or is_ok() and discard the error. This means that in the happy path many DataFusionErrors are created that are not returned to the user but instead are created during the course of normal query processing. #5309 is one example of code that does this, but there are several others

@comphead
Copy link
Contributor

as a quick fix I can do backtrace even more optional, like crate feature(the same done in anyhow crate) or put another datafusion config param to enable/disable backtrace.

Also its probably good to have a benchmark test for planner, not sure how to do it though, but its doable. @alamb @crepererum let me know your thoughts

@crepererum
Copy link
Contributor Author

I can check w/ RUST_BACKTRACE=0 later. However note that we (InfluxData) run our Rust binaries with RUST_BACKTRACE=1 in prod because otherwise you're pretty helpless in case of unexpected panics. So if you wanna make this env-based, maybe use a different variable instead of hijacking the one that Rust offers?

@alamb
Copy link
Contributor

alamb commented Sep 12, 2023

Also its probably good to have a benchmark test for planner, not sure how to do it though, but its doable. @alamb @crepererum let me know your thoughts

I think there is one benchmark for the planner here:

cargo bench --bench sql_planner

I haven't run it recently -- it would be great to eventually run these over time and track their performance #5504

@alamb
Copy link
Contributor

alamb commented Sep 13, 2023

I looked into why Errors were being created so frequently and I found and filed #7553

I also filed #7552 to track the general notion of not using errors in the happy path

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants