Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to receive an iterator over the inputs of a LogicalPlan instead of a Vec. #10808

Closed
LorrensP-2158466 opened this issue Jun 5, 2024 · 10 comments
Labels
enhancement New feature or request

Comments

@LorrensP-2158466
Copy link
Contributor

LorrensP-2158466 commented Jun 5, 2024

Is your feature request related to a problem or challenge?

Currently, the only way to get the inputs of a LogicalPlan is to call inputs(), which returns a Vec<&LogicalPlan>. But I have noticed that there can be unnecessary calls to into_iter() or iter() on this vector.
Furthermore, the function returns a lot of Vectors of size 1, which can create unnecessary allocations.
This also applies to UserDefinedLogicalNode(Core), since I don't think the compiler can see through the use of inputs() and convert the Vec to an iterator.

Describe the solution you'd like

To change the API to return an Iterator instead of a vector requires a lot of rewriting, so I think it's maybe nicer to create a new function that returns an iterator over the inputs like this:
fn inputs_iter(&self) -> impl Iterator<Item = &LogicalPlan> {}

We also have to extend the API of UserDefinedLogicalNode(Core) to have the same function so extension node's have this ability as well.

So instead of all the vec![ input ] calls, we can use std::iter::once, empty vec's can be empty iterators and in the case of an extension node we can just call node.inputs_iter().

Describe alternatives you've considered

No response

Additional context

This issue is purely for changing the API, so I'm willing to do it if this is accepted. To use this function in the source code is a bit more work, so I think it will be better to open another issue for changing the calls to inputs() into inputs_iter().

I may be entirely wrong since I'm fairly new to DataFusion, so any feedback is greatly appreciated.

@LorrensP-2158466 LorrensP-2158466 added the enhancement New feature or request label Jun 5, 2024
@alamb
Copy link
Contributor

alamb commented Jun 7, 2024

I believe @peter-toth and @ozankabak and I discussed something similar in #10543 (comment)

The new reference walking APIs added in #10543 (which I don't think are released yet) may also be related -- specifically TreeNode::visit can now return references that can be kepy of the underlying nodes

@peter-toth
Copy link
Contributor

peter-toth commented Jun 7, 2024

@LorrensP-2158466, I as far as I get you just want to change the return type of LogicalPlan::inputs() to use iterators instead of the current Vec. I think that is not really related to #10543 and actually can be a good idea to avoid unnecessary Vec allocations.

But I feel you will run into problems with the impl Iterator<Item = &LogicalPlan> return type as it doesn't allow returning different iterator implementations on different LogicalPlans. (And I think we want to avoid dyn Iterator too.)
But please give it a try if you have some time...

@LorrensP-2158466
Copy link
Contributor Author

Oh yeah, you are right, I don't know why I didn't think about that. I will try it anyway, maybe I will come up with a different solution. Thanks Peter, for letting me know!
I will also look at #10543 for extra info.

@alamb
Copy link
Contributor

alamb commented Jun 7, 2024

Oh yeah, you are right, I don't know why I didn't think about that. I will try it anyway, maybe I will come up with a different solution. Thanks Peter, for letting me know!

In order to make it work, you might be able to implement a custom iterator like

enum InputIterator<'a> {
  SingleInput(&LogicalPlan),
  VecInput(&[LogicalPlan])),
..
}

And then implement the Iterator trait appropriately

@LorrensP-2158466
Copy link
Contributor Author

So I have found a solution that i can compile, but at this stage I'm not very happy with it.
I will first try to explain my reasoning as to how i got to my current solution, it is a bit much, so if you are only interested the final you can Jump to the end. I did underestimate this problem...

Explanation

I first tried the solution Andrew mentioned, but it runs into following problems:

  1. The Join Plans and Recursive Query
    Both the Join and CrossJoin hold the left and right inputs separately in an Arc, so we need another variant Double() which needs to know if the first input is retrieved and than the second. This is because you can't transform both the Arc's into a &[LogicalPlan]. This also applies to Recursive Queries which have 2 inputs.
  2. The Union node holds a Vec<Arc<LogicalPlan>> which you also can't transform safely into a &[LogicalPlan].
  3. The Extension node doesn't allow us to know how the inputs are stored, we have to assume this can have multiple inputs all contained in Arc's, so we get the same problem as the Union Node but a little worse because some Extension nodes just have 1 input but we treat them as having multiple.

To support all these cases i came up with:

pub enum InputIterator<'parent> {
    Empty(Empty<&'parent LogicalPlan>),
    Single(Once<&'parent LogicalPlan>),
    Double(DoubleIter<&'parent LogicalPlan>), // DoubleIter<T> = std::array::IntoIter<T, 2>
    Multiple(std::vec::IntoIter<&'parent LogicalPlan>),
}

Built-In LogicalPlans

Most cases can be handled by the Empty, Single or Double variant.
The Multiple variant has a Vec::IntoIter, so we can handle the Union case easier, this is because we have to transform the Vec<Arc<LogicalPlan>> into a collection of references, which results in us allocating and transforming it back into a iterator. This also happens in the inputs()function but with just the collect() call.

Extension Plans

To handle the Extension nodes i added the following function into the UserDefinedLogicalNodetrait:

fn inputs_iter(&self) -> InputIterator<'_>;

The user of this trait can then choose their own iterator, and we only have to call extension.node.inputs_iter().

More Problems

But there are still some problems with this (i think):

  1. The Multiplevariant still causes us to allocate in a lot of cases, some UserDefined plans in examples have their inputs stored as Vec<LogicalPlan>, so we need to do inputs.iter().collect().into_iter().
    We could add another Iterator type:
Slice(slice::IntoIter<&'parent LogicalPlan>)

which is an slice iterator, so the above would be inputs.as_slice().into_iter(). But adding another variant that still says "i have multiple inputs" does not look nice.
2. Some Nodes, like Union hold their inputs in Vec<Arc<LogicalPlan>> which also causes us to transform and collect this into a Vec<&LogicalPlan> to turn this into a iterator the Multiple variant accepts.

Solutions

To fix this i can split up the Multiple variant into 2 new variants:

pub enum InputIterator{
    // ...
    Slice(slice::IntoIter<&LogicalPlan>),
    FromArcs(Map<Iter<'_, Arc<LogicalPlan>>, AsRef::as_ref>), // maps Arc<LogicalPlan> into &LogicalPlan
}

This does sum up to a total of 5 different iterator types, but i don't know how i can cover every possible way of holding onto multiple inputs. For example if node implementation holds their inputs in some other collection (like BTreeMap, HashMap, ...) their respective Iter implementations have different types, so if someone wants to return a InputIterator they are forced to also keep their inputs in a Vec<LogicalPlan> or Vec<Arc<LogicalPlan>>.

Currently

So currently the InputIterator looks like this...

pub enum InputIterator<'parent> {
    Empty(Empty<&'parent LogicalPlan>),
    Single(Once<&'parent LogicalPlan>),
    Double(DoubleIter<&'parent LogicalPlan>),
    Slice(SliceIter<'parent, LogicalPlan>),
    FromArcs(FromArcsIter<'parent, LogicalPlan>),
}

I have made a macro which let's me "delegate" a call to this iterator to any of the inner iterators, e.g. fn next():

fn next(&mut self) -> Option<Self::Item> {
    delegate!(self, iter => iter.next())
}

Things to do:

  • find a way to accept many more IteratorTypes, (maybe just have a fallback in the form of Vec::into_iter or dyn Iterator).
  • Minimize Iter variants while still keeping the implementation clear, maybe we can collapse the Empty, Single and Double into one variant, but this does cause more checks.

If you guys really think this can be helpful i can open up a PR so you can look at some details, but it is maybe worthwhile to just wait until Rust allows us to return multiple types in a -> impl Trait function.

@alamb
Copy link
Contributor

alamb commented Jun 11, 2024

If you guys really think this can be helpful i can open up a PR so you can look at some details, but it is maybe worthwhile to just wait until Rust allows us to return multiple types in a -> impl Trait function.

I agree your (clearly very impressive skills) might be better spent on other projects.

Is there any particular issue or area you are interested in working on?

@LorrensP-2158466
Copy link
Contributor Author

Thanks, that means a lot. I'm interested in anything data science or database related. I came in contact with DataFusion because of my Bachelor's project this year, and I liked it so much that I wanted to contribute in any way I could. But because of school, I can't find the time to do this actively. The project was about detecting α-acyclic joins to help out a PhD project at my University, so I had to use LogicalPlans and the logical optimizer, where I came up with this "issue." Now that it was done, I wanted to try it out.

As I said, I like to help wherever I can, but I'm not familiar enough with DataFusion to know where I can help. Maybe you guys know some places where I can look/help?

Thanks for the interest and help in this issue!

@alamb
Copy link
Contributor

alamb commented Jun 11, 2024

The project was about detecting α-acyclic joins to help out a PhD project at my University, so I had to use LogicalPlans and the logical optimizer, where I came up with this "issue." Now that it was done, I wanted to try it out.

This is very cool -- is the code somehere pubic ? Hopefully doing this kind of analysis would be easier now with the nicer tree node API from @peter-toth .

One thing that might be interesting / relevant perhaps then would be to add an example of that kind of analysis.

For example, this ticket describes an example for an analyzer rule #10855, but writing an example of sql_analaysis.rs that shows how to parse some sql and then use DataFusion structures to do an analysis (like maybe join counts, or predicate analysis or something) would be really neat -- I filed #10871 to track this

@LorrensP-2158466
Copy link
Contributor Author

This is very cool -- is the code somehere pubic?

I can make my implementation public AcyclicJoin Node. In short, it introduces a new logical node (acyclic join) that is coupled with a physical node (join impl of that acyclic join), but the physical node is part of that PhD project, which I can't share. To create those acyclic join nodes, I had to detect if a particular join tree (i.e. subsequent joins in a LogicalPlan) is acyclic or not.

I also think adding an example of SQL analysis would be nice. I'll move to #10871 so this can be closed.

@alamb
Copy link
Contributor

alamb commented Jun 11, 2024

I can make my implementation public AcyclicJoin Node. In short, it introduces a new logical node (acyclic join) that is coupled with a physical node (join impl of that acyclic join), but the physical node is part of that PhD project, which I can't share. To create those acyclic join nodes, I had to detect if a particular join tree (i.e. subsequent joins in a LogicalPlan) is acyclic or not.

That is very cool -- thank you for sharing @LorrensP-2158466 . Since DataFusion is totally open it is always cool to see what people are doing with it.

(BTW if you need a paper about DataFusion to cite -- we now have one https://dl.acm.org/doi/10.1145/3626246.3653368 :) )

@alamb alamb closed this as completed Jun 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants