Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parsing reference for nested fields #965

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

sungwy
Copy link
Collaborator

@sungwy sungwy commented Jul 25, 2024

Fixes: #953

This may be a backward incompatible change because it seemingly removes support for allowing the table name within the parser string expression. https://github.com/apache/iceberg-python/pull/965/files#diff-d86e9b5642a7250398c2a3e9062e053dbd6314f83ad471e6d09b1664160a5cc7R55

However, row_filter is always passed as an input to a Table method (scan, delete, overwrite. Hence maintaining optional support for parsing of the table name on a row_filter feels redundant. This assumption, also prevents us from being able to parse a nested field in the string expression, as we will always only take the name of the field at the lowest level, which prevents us from binding the field using the name in the expression string. This PR proposes to use '.' delimiter to represent field hierarchy in nested fields because:

  • it is the flat representation convention used in NameMapping
  • it is the flat representation convention used in selected_fields, which is another input parameter to the scan method

Similar to the issue highlighted in the issue with optional catalog name support in identifiers, having optional support for table_name is bound to result in an issue if we use the same delimiter for table and namespace association with column hierarchy.

For example, if we have the nested field: 'person.age' and our table is named: 'person', and have optional support for table names in the row_filter, 'person.age' will remain an ambiguous field.

NOTE: There is a larger conversation around correctly supporting name mapping, and the correctness of expressing field names in flat form #935

@sungwy sungwy requested a review from kevinjqliu July 26, 2024 15:03
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! I was able to verify that this works for previously errored case
#953 (comment)

Since we're using the . character to represent nested access, what if I have a column named foo.bar? Can we add a test case for that?

@@ -83,7 +83,7 @@

@column.set_parse_action
def _(result: ParseResults) -> Reference:
return Reference(result.column[-1])
return Reference(".".join(result.column))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extract out "." as a variable

@sungwy
Copy link
Collaborator Author

sungwy commented Jul 26, 2024

Since we're using the . character to represent nested access, what if I have a column named foo.bar? Can we add a test case for that?

Thank you for the suggestion! This test will definitely break with the current approach

@sungwy
Copy link
Collaborator Author

sungwy commented Jul 26, 2024

@kevinjqliu - I've given this a bit more thought, and if we are looking to use . to represent nested fields, we will always run into an issue with . named fields if we split based on the delimiter. Unfortunately, I think this means that the scale of this problem is larger than a simple bug fix onto the Expression parser, expression_to_pyarrow and the Reference classes.

I think we first have to come up with a proposal for representing nested fields in a flat string that doesn't result in these edge cases, or is at least configurable (e.g. Config parameter PYICEBERG__NESTED_FIELD_DELIMITER defined at the session level that defaults to .)

@sungwy sungwy marked this pull request as draft July 26, 2024 17:53
@kevinjqliu
Copy link
Contributor

Unfortunately, I think this means that the scale of this problem is larger than a simple bug fix onto the Expression parser, expression_to_pyarrow and the Reference classes.

I'm +1 to tracking this issue in another thread. And proceed with this current PR to resolve a valid bug with nested fields.
@HonahX / @Fokko wdyt?

@Fokko Fokko self-requested a review August 6, 2024 21:00
@@ -572,51 +572,54 @@ def _convert_scalar(value: Any, iceberg_type: IcebergType) -> pa.scalar:


class _ConvertToArrowExpression(BoundBooleanExpressionVisitor[pc.Expression]):
def _flat_name_to_list(self, name: str) -> List[str]:
return name.split(".")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a risky route. I'd rather migrate to a Tuple[str] situation internally so we can actually support fields with .'s



def test_nested_field_equality() -> None:
assert EqualTo("foo.first", "a") == parser.parse("foo.first == 'a'")
Copy link
Contributor

@Fokko Fokko Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we first have to come up with a proposal for representing nested fields in a flat string that doesn't result in these edge cases, or is at least configurable (e.g. Config parameter PYICEBERG__NESTED_FIELD_DELIMITER defined at the session level that defaults to .)

I think the key to success is to have some kind of syntax for quoting literals. For example: https://spark.apache.org/docs/latest/sql-ref-literals.html

Then we can parse something like:

'a.b' -> Reference(('a.b',))
'a.b'.c -> Reference(('a.b', 'c'))
a.b.c -> Reference(('a', 'b', 'c'))

Or folks have to use:

row_filter=EqualTo(('a.b',), 123)
row_filter=EqualTo(('a.b', 'c'), 123)
row_filter=EqualTo(('a', 'b', 'c'), 123)

Copy link
Contributor

@Fokko Fokko Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also an interesting proposal on the spec side of things: apache/iceberg#10883

Related: apache/iceberg#598

@kevinjqliu kevinjqliu added this to the PyIceberg 0.9.0 release milestone Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Query on nested struct field with PyIceberg?
3 participants