Optimizer: Predicate Rewrite pass for TPCH Q19 #217
Labels: datafusion, enhancement, performance
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As @Dandandan and I were discussing on #78 (comment):

The good news is that after #78, DataFusion can run TPCH Q19 🎉. The downside is that Q19 currently has abysmal performance (basically it will never finish) because DataFusion plans it as a CROSS JOIN followed by a filter. A better plan would recognize that a join predicate can be used (making it an INNER JOIN), as well as several "single column predicates" and "single table predicates" which could be pushed down to the scans (aka applied prior to the joins).
For reference, TPCH Q19 looks like this
Note that while the predicate is one big `OR`, it can be rewritten so that the conjuncts shared by every branch (the join predicate among them) sit outside the `OR`, in which case the input cardinality into the join would be much lower.
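A minimal sketch of that factoring step, assuming a toy predicate tree rather than DataFusion's real expression type: `Pred`, `leaf`, `conjuncts`, `and_of`, and `factor_common` are hypothetical names introduced only for illustration, and leaves are opaque strings. It rewrites `(c AND x) OR (c AND y)` into `c AND (x OR y)` by intersecting the conjuncts of every `OR` branch.

```rust
use std::collections::BTreeSet;

/// Toy predicate tree (hypothetical, not DataFusion's `Expr`).
/// Leaves are opaque strings such as "p_partkey = l_partkey".
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Pred {
    Leaf(String),
    And(Vec<Pred>),
    Or(Vec<Pred>),
}

/// Convenience constructor for a leaf predicate.
fn leaf(s: &str) -> Pred {
    Pred::Leaf(s.to_string())
}

/// Flatten nested ANDs into a list of top-level conjuncts.
fn conjuncts(p: &Pred) -> Vec<Pred> {
    match p {
        Pred::And(parts) => parts.iter().flat_map(conjuncts).collect(),
        other => vec![other.clone()],
    }
}

/// Wrap a list of conjuncts in an AND, unless there is only one.
fn and_of(mut parts: Vec<Pred>) -> Pred {
    if parts.len() == 1 {
        parts.pop().unwrap()
    } else {
        Pred::And(parts)
    }
}

/// Pull conjuncts shared by every OR branch out in front of the OR.
/// Anything that is not an OR, or has no shared conjunct, is returned unchanged.
/// Note: conjuncts pass through a BTreeSet, so their order is sorted, not preserved.
fn factor_common(p: &Pred) -> Pred {
    let branches = match p {
        Pred::Or(b) if !b.is_empty() => b,
        _ => return p.clone(),
    };
    let sets: Vec<BTreeSet<Pred>> = branches
        .iter()
        .map(|b| conjuncts(b).into_iter().collect())
        .collect();

    // Intersect the conjunct sets of all branches.
    let mut common = sets[0].clone();
    for s in &sets[1..] {
        common = common.intersection(s).cloned().collect();
    }
    if common.is_empty() {
        return p.clone();
    }

    // Keep what is left of each branch once the shared conjuncts are removed.
    let mut rest = Vec::new();
    for s in &sets {
        let remainder: Vec<Pred> = s.difference(&common).cloned().collect();
        if remainder.is_empty() {
            // One branch is exactly the shared part, so the whole OR reduces to it.
            return and_of(common.into_iter().collect());
        }
        rest.push(and_of(remainder));
    }

    let mut out: Vec<Pred> = common.into_iter().collect();
    out.push(Pred::Or(rest));
    Pred::And(out)
}
```

With `j = leaf("p_partkey = l_partkey")` and two branches `j AND b1` and `j AND b2`, `factor_common` returns `j AND (b1 OR b2)`. A planner could then treat `j` as a join predicate instead of evaluating it after a CROSS JOIN.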
Note there are further rewrites possible (aka introducing additional single table predicates like `p_size between 1 and 15`) that can filter the input to the joins even further (although the final filter is also still needed).

Describe the solution you'd like
The "classic" way to implement this is as a "predicate rewrite" pass that rearranges predicates for further downstream operations.

The goal is basically to get the predicate into a form of `good_predicate1 AND good_predicate2 AND ...`, where `good_predicate` means the predicate has special support in the execution engine.

Since `OR` is not typically handled specially, rewrites to `AND` are helpful. Some common rewrites:
For example, an `OR` of equality checks on one column against the values `A`, `B`, and `C` can become a single list-membership predicate, which the execution engine can then treat like a single column predicate (push down to scan) and build a hash table for `(A, B, C)` to do fast filtering.

This kind of rewrite can get all sorts of fancy and sometimes needs a cost model (to estimate, for example, if redundantly applying a filter during scan and after a join is worthwhile). It probably makes sense to implement a basic rewrite pass with the single table predicate extraction first, and then make it fancier from there.
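Two of the rewrites discussed above can be sketched as follows. This is an illustration on toy types, not DataFusion's API: `Expr`, `eq`, `or_to_in_list`, `build_filter`, `Between`, and `implied_range` are all hypothetical names. The first part assumes the hash table for `(A, B, C)` comes from an `OR` of equality checks on a single column. The second derives a weaker single table range (such as `p_size between 1 and 15`) that can be pushed to the scan while the original filter is kept after the join.

```rust
use std::collections::HashSet;

/// Toy expression type (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Eq { column: String, value: String },
    InList { column: String, values: Vec<String> },
    Or(Vec<Expr>),
}

/// Convenience constructor for `column = value`.
fn eq(column: &str, value: &str) -> Expr {
    Expr::Eq { column: column.to_string(), value: value.to_string() }
}

/// Rewrite `col = A OR col = B OR ...` into `col IN (A, B, ...)`.
/// Returns the input unchanged unless every branch is an equality on the same column.
fn or_to_in_list(e: &Expr) -> Expr {
    let branches = match e {
        Expr::Or(b) if !b.is_empty() => b,
        _ => return e.clone(),
    };
    let mut column: Option<&str> = None;
    let mut values: Vec<String> = Vec::new();
    for b in branches {
        match b {
            Expr::Eq { column: c, value } if column.map_or(true, |seen| seen == c.as_str()) => {
                column = Some(c.as_str());
                if !values.contains(value) {
                    values.push(value.clone());
                }
            }
            _ => return e.clone(),
        }
    }
    // `branches` is non-empty and every branch matched, so `column` is set.
    Expr::InList { column: column.unwrap().to_string(), values }
}

/// Build the lookup table an engine could probe once per row during the scan.
fn build_filter(values: &[String]) -> HashSet<&str> {
    values.iter().map(|s| s.as_str()).collect()
}

/// Inclusive range, as in `col BETWEEN lo AND hi`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Between {
    lo: i64,
    hi: i64,
}

/// Given the range each OR branch puts on one column (None = unconstrained),
/// return the smallest single range implied by the whole OR, if any.
/// The result may admit rows the OR rejects, so the original filter must stay.
fn implied_range(branches: &[Option<Between>]) -> Option<Between> {
    let mut iter = branches.iter();
    let mut hull = (*iter.next()?)?;
    for b in iter {
        let b = (*b)?; // an unconstrained branch means nothing can be derived
        hull.lo = hull.lo.min(b.lo);
        hull.hi = hull.hi.max(b.hi);
    }
    Some(hull)
}
```

Assuming the three `OR` branches constrain `p_size` to 1 to 5, 1 to 10, and 1 to 15, `implied_range` yields 1 to 15, matching the `p_size between 1 and 15` example. Whether applying such a redundant scan filter pays off is the cost model question raised above, which this sketch does not address.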