Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve documentation on TreeNode #10035

Merged
merged 11 commits into from
Apr 22, 2024
Merged

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Apr 10, 2024

Which issue does this PR close?

Part of #8913 and part of #7013 and part of #10121

Rationale for this change

As I was working on various rewrites using the TreeNode APIs (which are awesome) I sometimes get lost in all the similarly named functions (map, apply, visit, rewrite, etc) and what the difference is. Now that I have that all loaded into my brain, I wanted to write it down for my future self (and others hopefully too).

What changes are included in this PR?

Improve documentation on TreeNode, TreeNodeVisitor, and TreeNodeRewriter

Are these changes tested?

Doc tests

Are there any user-facing changes?

Better docs, no functional changes

@alamb alamb added the documentation Improvements or additions to documentation label Apr 10, 2024
@alamb
Copy link
Contributor Author

alamb commented Apr 10, 2024

FYI @peter-toth

@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Apr 10, 2024
/// Defines a visitable and rewriteable tree node. This trait is implemented
/// for plans ([`ExecutionPlan`] and [`LogicalPlan`]) as well as expression
/// trees ([`PhysicalExpr`], [`Expr`]) in DataFusion.
/// API for visiting (read only) and rewriting tree data structures .
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just tried to explain how this API works better.

Comment on lines +60 to +62
/// * `map_*`: applies a transformation to rewrite owned nodes
/// * `apply_*`: invokes a function on borrowed nodes
/// * `transform_`: applies a transformation to rewrite owned nodes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im probably missing it but does it make sense to add more details to distinguish map_* and transform_ here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't understand the difference between map and transform here. Is the former creating a new node and the latter mutating an existing node?

Copy link
Contributor

@wiedld wiedld Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a shot at consolidating the terminology. It's probably wrong, but hopefully it will help the conversation. 🙈 (edited)

API action traversal ref actor
apply traverse, no rewrite top-down, pre-order &self FnMut
transform traverse & rewrite suffix dependent self suffix dependent
- - *_up suffix = bottom-up, post-order - -
- - *_down suffix = top-down, pre-order - -
- - *_up_down suffix = does both, see docs - -
- - - - Fn
- - - - *_mut suffix = FnMut
visit visit, no rewrite depth-first † &self dyn TreeNodeVisitor
rewrite visit & rewrite depth-first † &self dyn TreeNodeRewrite
apply_children inspect children, no rewrite not traversal per se‡ &self Fn
map_children inspect children & rewrite in place not traversal per se‡ self FnMut

† The specific implementations of the visitors (TreeNodeVisitor or TreeNodeRewrite) can impact the traversal patterns.
‡ I think the implementations decide how to handle any travsersal (a.k.a. does the mapping of children inspect it's own children). But I'm not really clear on this one.

Caveat (edited): I didn't attempt to generalize the map_* terminology (vs transform) as there are several other use cases across intersecting code paths (and traits), which I haven't dug into yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above table is not quite right and misses a few things. But, let me comment on this ticket tomorrow when I'm back home and have better github access.

Copy link
Contributor

@peter-toth peter-toth Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous efforts (especially #8891 but others too: #9876, #9913, #9965, #9999) consolidated the behaviour of existing APIs (consistent TreeNodeRecursion behaviour and common Transformed return types of transforming APIs) but didn't try to normalize naming of the APIs in favour of causing the least incompatibility with previous DataFusion versions. So this means that the current API names evolved organically and thus they might not be intuitive in all cases and are a bit inconsistent here and there...

Basically, there are 2 kinds of TreeNode APIs:

  1. "Inspecting" APIs to taverse over a tree and inspect &TreeNodes.
  2. "Transforming" APIs to traverse over a tree while consuming TreeNodes and produce possibly changed TreeNodes
    (i.e. there is no in-place mutation API, that uses &mut TreeNode, but we are improving the consume and produce APIs to cleate less new objects unnecessarly e.g.: Avoid LogicalPlan::clone() in LogicalPlan::map_children when possible #9999)

There are only 2 inspecting TreeNode APIs:

  • TreeNode::apply():
    Top-down (pre-order) traverses the nodes and applies the povided f closure on the nodes.
    • As TreeNode trees are hierarchical collections of nodes, something like foreach() would better describe the operation.
    • Because the API name doesn't explicitely defines the traversal direction IMO its docs shouldn't say anything about its implementation just the fact that it should be used for order agnostic inspection of nodes.
    • Its transforming counterpart should be TreeNode::transform(). But TreeNode::transform() is an alias for TreeNode::transform_up() so it does a bottom-up (post-order) traversal.
  • TreeNode::visit():
    Requires a TreeNodeVisitor that incorporates 2 methods to do a combined traversal. TreeNodeVisitor::f_down() is called in top-down order, TreeNodeVisitor::f_up() is called in bottom-up order on the nodes in one combined traversal.
    • As there is no apply_up() exists, visit() can be used to do a bottom-up inspecting traversal (TreeNodeVisitor::f_down() can remain its default implementation, that does nothing).
    • Its transforming counterpart is TreeNode::rewrite().

There are 7 transforming TreeNode APIs:

  • TreeNode::transform():
    An alias for TreeNode::transform_up() so it does bottom-up traversal on the nodes. Consumes them with the provided f closure and produces a possibly transformed node.
    • Should be the transforming counterpart of apply() but this runs bottom-up for some reason...
    • IMO its docs shouldn't mention traversal direction and probably should be used for order agnostic transformation of nodes.
  • TreeNode::transform_down() and TreeNode::transform_down_mut():
    These 2 are top-down traverse the nodes, consume them with the provided f closure and produce a possibly transformed node.
    • I have no idea why we have the ..._mut() version, most likely for just convenience, but IMO one transform_down() with FnMut should be fair enough.
  • TreeNode::transform_up() and TreeNode::transform_up_mut():
    These 2 are the bottom-up versions of the above.
    • I have no idea why we have the ..._mut() version.
    • Don't have an inspecting counterpart
  • TreeNode::transform_down_up():
    This API accepts 2 closures f_down and f_up. f_down is called in top-down order and f_up is called in bottom-up order on the nodes in one combined traversal.
    • Doesn't have an inspecting counterpart
  • TreeNode::rewrite():
    Requires a TreeNodeRewriter that incorporates 2 methods to do a combined traversal. TreeNodeRewriter::f_down() is called in top-down order, TreeNodeRewriter::f_up() is called in bottom-up order on the nodes in one combined traversal.
    • This API is the transforming counterpart of TreeNode::visit()
    • If there is no mutable shared data between TreeNodeRewriter::f_down() and TreeNodeRewriter::f_up() then it is easier to use TreeNode::transform_down_up() that accepts f_down and f_up closures directly and no need to define a TreeNodeRewriter instance.

Additional TreeNode APIs:

  • TreeNode::apply_children() and TreeNode::map_children():
    Building blocks for the above 2 + 7 TreeNode APIs. apply_children() inspects, map_children() transforms the children of a TreeNode node with the provided f closure.
    • TreeNode implementations (e.g. LogicalPlan, Expr, Arc<dyn ExecutionPlan>, Arc<dyn PhysicalExpr>, ...) have to implement apply_children() for inspecting and map_children() for transforming APIs.
    • These 2 methods are public, but they should almost never be called explicitely by the TreeNode API users. If, for some reason, the above 2 + 7 APIs don't cover the recursion needed then these 2 methods can be used to define a custom algorithms.
    • The apply_children() name is aligned with apply(), but have no idea why we call the other map_children(), transform_children() would probably be a better fit...

Besides the above general TreeNode APIs we have some LogicalPlan APIs:

  • LogicalPlan::..._with_subqueries():
    All the above 2 + 7 TreeNode APIs have a LogicalPlan::..._with_subqueries() version that do the same as the original TreeNode API but also recurse into subqueries (same type inner LogicalPlan trees that are defined in the expressions of the original LogicalPlan tree nodes).
    • We could generalize this concept into TreeNode but LogicalPlan seems to be only TreeNode type that makes use of these APIs.

Additional LogicalPlan APIs:

  • LogicalPlan::apply_expressions(), LogicalPlan::map_expressions(), LogicalPlan::apply_subqueries() and LogicalPlan::map_subqueries():
    Building blocks for LogicalPlan::..._with_subqueries() APIs. apply_expressions() and apply_subqueries() inspect, map_expressions() and map_subqueries() transform the expressions/subqueries of a LogicalPlan node with the provided f closure.
    • These 4 are similar to TreeNode::apply_children() and TreeNode::map_children() and so they are rarely called directly but can be used to define a custom algorithms.

Here is the summary of the current situation:

TreeNode APIs:

traversal order inspecting transforming
order agnostic transform()
top-down apply() transform_down(), transform_down_mut()
bottom-up transform_up(), transform_up_mut()
combined with separete f_down and f_up closures transform_down_up()
combined with incorporated f_down() and f_up() methods into an object visit() + TreeNodeVisitor rewrite() + TreeNodeRewriter

Additional TreeNode APIs: apply_children(), map_children().

LogicalPlan APIs:

traversal order inspecting transforming
order agnostic transform_with_subqueries()
top-down apply_with_subqueries() transform_down_with_subqueries(), transform_down_mut_with_subqueries()
bottom-up transform_up_with_subqueries(), transform_up_mut_with_subqueries()
combined with separete f_down and f_up closures transform_down_up_with_subqueries()
combined with incorporated f_down() and f_up() methods into an object visit_with_subqueries() + TreeNodeVisitor rewrite_with_subqueries() + TreeNodeRewriter

Additional LogicalPlan APIs: apply_expressions(), map_expressions(), apply_subqueries(), map_subqueries()

Copy link
Contributor

@peter-toth peter-toth Apr 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, if we wanted to consolidate the TreeNode API terminonlogy I would suggest the following:

traversal order inspecting transforming
order agnostic for_each() (new API, an alias to for_each_up(), but its docs shouldn't specify any order) transform() (remains an alias to transform_up() but its docs shouldn't specify any order)
top-down for_each_down() (new API), apply() (deprecated, from now just an alias to for_each_down()) transform_down() (changes to f: &mut F where F: FnMut(Self) -> Result<Transformed<Self>>, which is a breaking change but requires a trivial fix from the users), transform_down_mut() (deprecated, from now just an alias to transform_down())
bottom-up for_each_up() (new API) transform_up() (changes to f: &mut F where F: FnMut(Self) -> Result<Transformed<Self>>, which is a breaking change but requires a trivial fix from the users), transform_up_mut() (deprecated, from now just an alias to transform_up())
combined with separete f_down and f_up closures transform_down_up()
combined with incorporated f_down() and f_up() methods into an object visit() + TreeNodeVisitor rewrite() + TreeNodeRewriter

+ Maybe rename apply_children() to for_each_children() and map_children() to transform_children(). Although renaming these would be a breaking change if something that is built on DataFusion defines a new kind of TreeNode, but fixing them would also be trivial.

+ LogicalPlan::..._with_subqueries() APIs should be alligned with the above TreeNode ones, but as they are fairly new (#9913) they haven't landed in any release yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this information Peter. Amazing and mind-blowing

I think there are two threads to pursue here. First what currently exists and second what we can do to improve the situation.

I will attempt to reformat this information about how things currently work into this pr. Then I will see if there is anything we can pull out into follow on tasks (eg remove the _mut variants)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are two threads to pursue here. First what currently exists and second what we can do to improve the situation.

I will attempt to reformat this information about how things currently work into this pr. Then I will see if there is anything we can pull out into follow on tasks (eg remove the _mut variants)

Yes, I totally agree with you. It is more important to document what we already have. I just wanted to give you some details about the remaining inconsistency in API naming. If we pursue to fix it, I'm happy to open a new PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, update here:

  1. I incorporated the information (and table!) from @wiedld and @peter-toth in this PR's documentation
  2. Added the suggestion for consolidated terminology to Epic: Unified TreeNode rewrite API #8913 (comment)
  3. Filed Deprecate TreeNode transform_xx_mut methods #10097 to clean up the transform APIs

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Apr 16, 2024
@alamb
Copy link
Contributor Author

alamb commented Apr 16, 2024

This PR is ready for another look if anyone has the time and inclination to read more docs 🙏

Copy link
Contributor

@peter-toth peter-toth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I have only minor comments.

datafusion/common/src/tree_node.rs Outdated Show resolved Hide resolved
datafusion/common/src/tree_node.rs Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor Author

alamb commented Apr 19, 2024

Can a committer possibly review and approve this PR? I think it is ready to go

@alamb alamb added the documentation Improvements or additions to documentation label Apr 21, 2024
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Apr 21, 2024
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

datafusion/common/src/tree_node.rs Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor Author

alamb commented Apr 22, 2024

Thank you for the review @Jefffrey 🙏

@alamb alamb added the documentation Improvements or additions to documentation label Apr 22, 2024
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Apr 22, 2024
@alamb
Copy link
Contributor Author

alamb commented Apr 22, 2024

🚀 📖

@alamb alamb merged commit 16369d8 into apache:main Apr 22, 2024
23 checks passed
@alamb alamb added the documentation Improvements or additions to documentation label Apr 22, 2024
@alamb alamb deleted the alamb/more_tree_node_docs branch April 22, 2024 20:30
ccciudatu pushed a commit to hstack/arrow-datafusion that referenced this pull request Apr 26, 2024
* Improve documentation on `TreeNode`

* Document inspecting, rewriting  APIs. Add chart

* tweak

* Add references to PlanContext and ExprContext

* refine TreeNodeRewriter docs

* Add note about exists

* Update datafusion/common/src/tree_node.rs

Co-authored-by: Jeffrey Vo <[email protected]>

---------

Co-authored-by: Jeffrey Vo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants