Add ability to receive an iterator over the inputs of a LogicalPlan instead of a Vec. #10808

LorrensP-2158466 · 2024-06-05T18:16:32Z

Is your feature request related to a problem or challenge?

Currently, the only way to get the inputs of a LogicalPlan is to call inputs(), which returns a Vec<&LogicalPlan>. But I have noticed that there can be unnecessary calls to into_iter() or iter() on this vector.
Furthermore, the function returns a lot of Vectors of size 1, which can create unnecessary allocations.
This also applies to UserDefinedLogicalNode(Core), since I don't think the compiler can see through the use of inputs() and convert the Vec to an iterator.

Describe the solution you'd like

To change the API to return an Iterator instead of a vector requires a lot of rewriting, so I think it's maybe nicer to create a new function that returns an iterator over the inputs like this:
fn inputs_iter(&self) -> impl Iterator<Item = &LogicalPlan> {}

We also have to extend the API of UserDefinedLogicalNode(Core) to have the same function so extension node's have this ability as well.

So instead of all the vec![ input ] calls, we can use std::iter::once, empty vec's can be empty iterators and in the case of an extension node we can just call node.inputs_iter().

Describe alternatives you've considered

No response

Additional context

This issue is purely for changing the API, so I'm willing to do it if this is accepted. To use this function in the source code is a bit more work, so I think it will be better to open another issue for changing the calls to inputs() into inputs_iter().

I may be entirely wrong since I'm fairly new to DataFusion, so any feedback is greatly appreciated.

The text was updated successfully, but these errors were encountered:

alamb · 2024-06-07T12:59:35Z

I believe @peter-toth and @ozankabak and I discussed something similar in #10543 (comment)

The new reference walking APIs added in #10543 (which I don't think are released yet) may also be related -- specifically TreeNode::visit can now return references that can be kepy of the underlying nodes

peter-toth · 2024-06-07T13:47:33Z

@LorrensP-2158466, I as far as I get you just want to change the return type of LogicalPlan::inputs() to use iterators instead of the current Vec. I think that is not really related to #10543 and actually can be a good idea to avoid unnecessary Vec allocations.

But I feel you will run into problems with the impl Iterator<Item = &LogicalPlan> return type as it doesn't allow returning different iterator implementations on different LogicalPlans. (And I think we want to avoid dyn Iterator too.)
But please give it a try if you have some time...

LorrensP-2158466 · 2024-06-07T15:19:28Z

Oh yeah, you are right, I don't know why I didn't think about that. I will try it anyway, maybe I will come up with a different solution. Thanks Peter, for letting me know!
I will also look at #10543 for extra info.

alamb · 2024-06-07T17:59:40Z

Oh yeah, you are right, I don't know why I didn't think about that. I will try it anyway, maybe I will come up with a different solution. Thanks Peter, for letting me know!

In order to make it work, you might be able to implement a custom iterator like

enum InputIterator<'a> {
  SingleInput(&LogicalPlan),
  VecInput(&[LogicalPlan])),
..
}

And then implement the Iterator trait appropriately

LorrensP-2158466 · 2024-06-10T18:36:25Z

So I have found a solution that i can compile, but at this stage I'm not very happy with it.
I will first try to explain my reasoning as to how i got to my current solution, it is a bit much, so if you are only interested the final you can Jump to the end. I did underestimate this problem...

Explanation

I first tried the solution Andrew mentioned, but it runs into following problems:

The Join Plans and Recursive Query
Both the Join and CrossJoin hold the left and right inputs separately in an Arc, so we need another variant Double() which needs to know if the first input is retrieved and than the second. This is because you can't transform both the Arc's into a &[LogicalPlan]. This also applies to Recursive Queries which have 2 inputs.
The Union node holds a Vec<Arc<LogicalPlan>> which you also can't transform safely into a &[LogicalPlan].
The Extension node doesn't allow us to know how the inputs are stored, we have to assume this can have multiple inputs all contained in Arc's, so we get the same problem as the Union Node but a little worse because some Extension nodes just have 1 input but we treat them as having multiple.

To support all these cases i came up with:

pub enum InputIterator<'parent> {
    Empty(Empty<&'parent LogicalPlan>),
    Single(Once<&'parent LogicalPlan>),
    Double(DoubleIter<&'parent LogicalPlan>), // DoubleIter<T> = std::array::IntoIter<T, 2>
    Multiple(std::vec::IntoIter<&'parent LogicalPlan>),
}

Built-In LogicalPlans

Most cases can be handled by the Empty, Single or Double variant.
The Multiple variant has a Vec::IntoIter, so we can handle the Union case easier, this is because we have to transform the Vec<Arc<LogicalPlan>> into a collection of references, which results in us allocating and transforming it back into a iterator. This also happens in the inputs()function but with just the collect() call.

Extension Plans

To handle the Extension nodes i added the following function into the UserDefinedLogicalNodetrait:

fn inputs_iter(&self) -> InputIterator<'_>;

The user of this trait can then choose their own iterator, and we only have to call extension.node.inputs_iter().

Solutions

To fix this i can split up the Multiple variant into 2 new variants:

pub enum InputIterator{
    // ...
    Slice(slice::IntoIter<&LogicalPlan>),
    FromArcs(Map<Iter<'_, Arc<LogicalPlan>>, AsRef::as_ref>), // maps Arc<LogicalPlan> into &LogicalPlan
}

This does sum up to a total of 5 different iterator types, but i don't know how i can cover every possible way of holding onto multiple inputs. For example if node implementation holds their inputs in some other collection (like BTreeMap, HashMap, ...) their respective Iter implementations have different types, so if someone wants to return a InputIterator they are forced to also keep their inputs in a Vec<LogicalPlan> or Vec<Arc<LogicalPlan>>.

Currently

So currently the InputIterator looks like this...

pub enum InputIterator<'parent> {
    Empty(Empty<&'parent LogicalPlan>),
    Single(Once<&'parent LogicalPlan>),
    Double(DoubleIter<&'parent LogicalPlan>),
    Slice(SliceIter<'parent, LogicalPlan>),
    FromArcs(FromArcsIter<'parent, LogicalPlan>),
}

I have made a macro which let's me "delegate" a call to this iterator to any of the inner iterators, e.g. fn next():

fn next(&mut self) -> Option<Self::Item> {
    delegate!(self, iter => iter.next())
}

Things to do:

find a way to accept many more IteratorTypes, (maybe just have a fallback in the form of Vec::into_iter or dyn Iterator).
Minimize Iter variants while still keeping the implementation clear, maybe we can collapse the Empty, Single and Double into one variant, but this does cause more checks.

If you guys really think this can be helpful i can open up a PR so you can look at some details, but it is maybe worthwhile to just wait until Rust allows us to return multiple types in a -> impl Trait function.

alamb · 2024-06-11T01:44:59Z

If you guys really think this can be helpful i can open up a PR so you can look at some details, but it is maybe worthwhile to just wait until Rust allows us to return multiple types in a -> impl Trait function.

I agree your (clearly very impressive skills) might be better spent on other projects.

Is there any particular issue or area you are interested in working on?

LorrensP-2158466 · 2024-06-11T14:35:36Z

Thanks, that means a lot. I'm interested in anything data science or database related. I came in contact with DataFusion because of my Bachelor's project this year, and I liked it so much that I wanted to contribute in any way I could. But because of school, I can't find the time to do this actively. The project was about detecting α-acyclic joins to help out a PhD project at my University, so I had to use LogicalPlans and the logical optimizer, where I came up with this "issue." Now that it was done, I wanted to try it out.

As I said, I like to help wherever I can, but I'm not familiar enough with DataFusion to know where I can help. Maybe you guys know some places where I can look/help?

Thanks for the interest and help in this issue!

alamb · 2024-06-11T15:09:39Z

The project was about detecting α-acyclic joins to help out a PhD project at my University, so I had to use LogicalPlans and the logical optimizer, where I came up with this "issue." Now that it was done, I wanted to try it out.

This is very cool -- is the code somehere pubic ? Hopefully doing this kind of analysis would be easier now with the nicer tree node API from @peter-toth .

One thing that might be interesting / relevant perhaps then would be to add an example of that kind of analysis.

For example, this ticket describes an example for an analyzer rule #10855, but writing an example of sql_analaysis.rs that shows how to parse some sql and then use DataFusion structures to do an analysis (like maybe join counts, or predicate analysis or something) would be really neat -- I filed #10871 to track this

LorrensP-2158466 · 2024-06-11T16:13:36Z

This is very cool -- is the code somehere pubic?

I can make my implementation public AcyclicJoin Node. In short, it introduces a new logical node (acyclic join) that is coupled with a physical node (join impl of that acyclic join), but the physical node is part of that PhD project, which I can't share. To create those acyclic join nodes, I had to detect if a particular join tree (i.e. subsequent joins in a LogicalPlan) is acyclic or not.

I also think adding an example of SQL analysis would be nice. I'll move to #10871 so this can be closed.

alamb · 2024-06-11T20:51:38Z

I can make my implementation public AcyclicJoin Node. In short, it introduces a new logical node (acyclic join) that is coupled with a physical node (join impl of that acyclic join), but the physical node is part of that PhD project, which I can't share. To create those acyclic join nodes, I had to detect if a particular join tree (i.e. subsequent joins in a LogicalPlan) is acyclic or not.

That is very cool -- thank you for sharing @LorrensP-2158466 . Since DataFusion is totally open it is always cool to see what people are doing with it.

(BTW if you need a paper about DataFusion to cite -- we now have one https://dl.acm.org/doi/10.1145/3626246.3653368 :) )

LorrensP-2158466 added the enhancement New feature or request label Jun 5, 2024

alamb mentioned this issue Jun 11, 2024

Add example for writing an SQL analysis pass #10871

Closed

alamb closed this as completed Jun 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ability to receive an iterator over the inputs of a LogicalPlan instead of a Vec. #10808

Add ability to receive an iterator over the inputs of a LogicalPlan instead of a Vec. #10808

LorrensP-2158466 commented Jun 5, 2024 •

edited

Loading

alamb commented Jun 7, 2024 •

edited

Loading

peter-toth commented Jun 7, 2024 •

edited

Loading

LorrensP-2158466 commented Jun 7, 2024

alamb commented Jun 7, 2024

LorrensP-2158466 commented Jun 10, 2024

alamb commented Jun 11, 2024

LorrensP-2158466 commented Jun 11, 2024

alamb commented Jun 11, 2024

LorrensP-2158466 commented Jun 11, 2024

alamb commented Jun 11, 2024

Add ability to receive an iterator over the inputs of a LogicalPlan instead of a Vec. #10808

Add ability to receive an iterator over the inputs of a LogicalPlan instead of a Vec. #10808

Comments

LorrensP-2158466 commented Jun 5, 2024 • edited Loading

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

alamb commented Jun 7, 2024 • edited Loading

peter-toth commented Jun 7, 2024 • edited Loading

LorrensP-2158466 commented Jun 7, 2024

alamb commented Jun 7, 2024

LorrensP-2158466 commented Jun 10, 2024

Explanation

Built-In LogicalPlans

Extension Plans

More Problems

Solutions

Currently

alamb commented Jun 11, 2024

LorrensP-2158466 commented Jun 11, 2024

alamb commented Jun 11, 2024

LorrensP-2158466 commented Jun 11, 2024

alamb commented Jun 11, 2024

LorrensP-2158466 commented Jun 5, 2024 •

edited

Loading

alamb commented Jun 7, 2024 •

edited

Loading

peter-toth commented Jun 7, 2024 •

edited

Loading