Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support nested schema projection (#5148) #5149

Merged
merged 3 commits into from
Nov 29, 2023

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #5148.

Rationale for this change

See ticket

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 29, 2023
/// ]);
/// assert_eq!(filtered, expected);
/// ```
pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self {
Copy link
Contributor Author

@tustvold tustvold Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally I had two methods, one which was called for all Fields, however, it wasn't clear to me how such an API could be used - as it would lack any context as to where in the schema a given field appeared. I therefore just stuck with filter_leaves which has immediate concrete utility for #5135. A case could be made for a fully-fledged visitor pattern, but I'm not aware of any immediate use-cases for this, and therefore what the requirements might be for such an API.

FYI @Jefffrey

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how we would support selecting a subfield that is itself a field?

For example:

{ 
  a : {
    b: {
      c: 1, 
    },
    d: 2
}

Could I use this API to support only selecting a.b?

{ 
  a : {
    b: {
      c: 1, 
    },
}

Maybe we could if the parent FieldRef was also passed

    /// calls filter(depth, parent_field, child_field)` 
    pub fn filter_leaves<F: FnMut(usize, &FieldRef, &FieldRef) -> bool>(&self, mut filter: F) -> Self {
...
}

🤔

Copy link
Contributor Author

@tustvold tustvold Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You would select b by selecting all its children, if you call filter with intermediate fields you end up in a position of how much context do you provide, just the parent, what about the parent's parent - reasoning in terms of leaves is the only coherent mechanism I can think of

@@ -99,6 +100,93 @@ impl Fields {
.all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
}

/// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
///
/// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure this could be expressed better, but I struggled to come up with something 😅

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tustvold -- this seems like a good API to me -- I tried to suggest some comment improvements and had some questions about making is more general, but I think since it is better than what currently exists, this is a nice step forward

@@ -99,6 +100,93 @@ impl Fields {
.all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
}

/// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a proposal of how to better phrase what this is doing (as well as give a usecase that might not be obvious to the casual reader)

Suggested change
/// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
/// Returns a copy of this [`Fields`], with only those non nested (leaf) [`FieldRef`]s that
/// pass a predicate.
///
/// This can be used to select only a subset of fields in nested types such as
/// [`DataType::Struct`].
/// Leaf [`FieldRef`]s `DataType`s have no nested `FieldRef`s`. For example
/// a field with [`DatatType::Int32`] would be a leaf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I incoporated this into 8d97950. I tweaked the wording a little as the depth-first part is critical to understanding what the leaf ordering means. I also added some comments to the doctest to highlight

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you -- the comments are most helpful

Comment on lines 105 to 106
/// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a
/// count of the number of previous calls to `filter` - i.e. the leaf's index.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a
/// count of the number of previous calls to `filter` - i.e. the leaf's index.
/// Invokes `filter` with each leaf [`FieldRef`] and a
/// count of the depth to the `field` (i.e. the number of previous calls to `filter`, and the leaf's index.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't correct, it is the count of leaves encountered, not the depth

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to clarify this in 8d97950

/// ]);
/// assert_eq!(filtered, expected);
/// ```
pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how we would support selecting a subfield that is itself a field?

For example:

{ 
  a : {
    b: {
      c: 1, 
    },
    d: 2
}

Could I use this API to support only selecting a.b?

{ 
  a : {
    b: {
      c: 1, 
    },
}

Maybe we could if the parent FieldRef was also passed

    /// calls filter(depth, parent_field, child_field)` 
    pub fn filter_leaves<F: FnMut(usize, &FieldRef, &FieldRef) -> bool>(&self, mut filter: F) -> Self {
...
}

🤔

Dictionary(k, v) => (Some(k.clone()), v.as_ref()),
d => (None, d),
};
let d = match v {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Map and RunEndEncoded appear to have embedded fields too. It seems like they might also need to be handled 🤔

The usecase might be "I want only the values of the map? Or maybe Arrow can't express that concept (Apply the filter to only the values 🤔 )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Map is there, I missed RunEndEncoded 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RunEndEncoded support added in 8d97950

use DataType::*;

let (k, v) = match f.data_type() {
Dictionary(k, v) => (Some(k.clone()), v.as_ref()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this presumes the key itself doesn't have any leaves which I think is reasonable, but figured I would point it out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments in 8d97950

@alamb
Copy link
Contributor

alamb commented Nov 29, 2023

Looks very nice @tustvold -- thank you

@alamb
Copy link
Contributor

alamb commented Nov 29, 2023

One question I still have is does this API support the usecase in #5149 (comment) (selecting a nested subfield)? If not, I can file a ticket tracking that feature request

@tustvold tustvold merged commit 6d4b8bb into apache:master Nov 29, 2023
26 checks passed
@tustvold
Copy link
Contributor Author

One question I still have is does this API support the usecase in #5149 (comment) (selecting a nested subfield)?

Yes, I responded to the comment as such?

@alamb
Copy link
Contributor

alamb commented Nov 30, 2023

One question I still have is does this API support the usecase in #5149 (comment) (selecting a nested subfield)?

Yes, I responded to the comment as such?

Sorry -- the github UI is confusing me. I see the response in #5149 (comment) now. Thanks

Screenshot 2023-11-29 at 7 41 04 PM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nested Schema Projection
2 participants