-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix parsing reference for nested fields #965
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix! I was able to verify that this works for previously errored case
#953 (comment)
Since we're using the .
character to represent nested access, what if I have a column named foo.bar
? Can we add a test case for that?
@@ -83,7 +83,7 @@ | |||
|
|||
@column.set_parse_action | |||
def _(result: ParseResults) -> Reference: | |||
return Reference(result.column[-1]) | |||
return Reference(".".join(result.column)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extract out "."
as a variable
Thank you for the suggestion! This test will definitely break with the current approach |
@kevinjqliu - I've given this a bit more thought, and if we are looking to use I think we first have to come up with a proposal for representing nested fields in a flat string that doesn't result in these edge cases, or is at least configurable (e.g. Config parameter PYICEBERG__NESTED_FIELD_DELIMITER defined at the session level that defaults to |
I'm +1 to tracking this issue in another thread. And proceed with this current PR to resolve a valid bug with nested fields. |
@@ -572,51 +572,54 @@ def _convert_scalar(value: Any, iceberg_type: IcebergType) -> pa.scalar: | |||
|
|||
|
|||
class _ConvertToArrowExpression(BoundBooleanExpressionVisitor[pc.Expression]): | |||
def _flat_name_to_list(self, name: str) -> List[str]: | |||
return name.split(".") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a risky route. I'd rather migrate to a Tuple[str]
situation internally so we can actually support fields with .
's
|
||
|
||
def test_nested_field_equality() -> None: | ||
assert EqualTo("foo.first", "a") == parser.parse("foo.first == 'a'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we first have to come up with a proposal for representing nested fields in a flat string that doesn't result in these edge cases, or is at least configurable (e.g. Config parameter PYICEBERG__NESTED_FIELD_DELIMITER defined at the session level that defaults to
.
)
I think the key to success is to have some kind of syntax for quoting literals. For example: https://spark.apache.org/docs/latest/sql-ref-literals.html
Then we can parse something like:
'a.b' -> Reference(('a.b',))
'a.b'.c -> Reference(('a.b', 'c'))
a.b.c -> Reference(('a', 'b', 'c'))
Or folks have to use:
row_filter=EqualTo(('a.b',), 123)
row_filter=EqualTo(('a.b', 'c'), 123)
row_filter=EqualTo(('a', 'b', 'c'), 123)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also an interesting proposal on the spec side of things: apache/iceberg#10883
Related: apache/iceberg#598
Fixes: #953
This may be a backward incompatible change because it seemingly removes support for allowing the table name within the parser string expression. https://github.com/apache/iceberg-python/pull/965/files#diff-d86e9b5642a7250398c2a3e9062e053dbd6314f83ad471e6d09b1664160a5cc7R55
However,
row_filter
is always passed as an input to a Table method (scan, delete, overwrite. Hence maintaining optional support for parsing of the table name on arow_filter
feels redundant. This assumption, also prevents us from being able to parse a nested field in the string expression, as we will always only take the name of the field at the lowest level, which prevents us from binding the field using the name in the expression string. This PR proposes to use '.' delimiter to represent field hierarchy in nested fields because:Similar to the issue highlighted in the issue with optional catalog name support in identifiers, having optional support for table_name is bound to result in an issue if we use the same delimiter for table and namespace association with column hierarchy.
For example, if we have the nested field: 'person.age' and our table is named: 'person', and have optional support for table names in the row_filter, 'person.age' will remain an ambiguous field.
NOTE: There is a larger conversation around correctly supporting name mapping, and the correctness of expressing field names in flat form #935