Fix parsing reference for nested fields #965

sungwy · 2024-07-25T12:58:42Z

Fixes: #953

This may be a backward-incompatible change because it seemingly removes support for including the table name in the parser string expression: https://github.com/apache/iceberg-python/pull/965/files#diff-d86e9b5642a7250398c2a3e9062e053dbd6314f83ad471e6d09b1664160a5cc7R55

However, row_filter is always passed as an input to a Table method (scan, delete, overwrite), so maintaining optional support for parsing the table name in a row_filter feels redundant. That support also prevents us from parsing a nested field in the string expression: we always take only the name of the field at the lowest level, which prevents us from binding the field using the name in the expression string. This PR proposes using the '.' delimiter to represent field hierarchy in nested fields because:

- it is the flat representation convention used in NameMapping
- it is the flat representation convention used in selected_fields, which is another input parameter to the scan method

Similar to the issue with optional catalog name support in identifiers, optional support for table_name is bound to cause problems if we use the same delimiter both for table and namespace association and for column hierarchy.

For example, if we have the nested field 'person.age', our table is named 'person', and table names are optionally supported in the row_filter, then 'person.age' is ambiguous.
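A minimal stand-alone sketch of that ambiguity, not using PyIceberg itself. The helpers `resolve_last` and `resolve_joined` are hypothetical names that only mimic the old `result.column[-1]` behavior and the `".".join(result.column)` behavior proposed in this PR:

```python
from typing import List


def resolve_last(parts: List[str]) -> str:
    # Mimics the old parse action: everything before the last component
    # is treated as an optional table/namespace prefix and dropped.
    return parts[-1]


def resolve_joined(parts: List[str]) -> str:
    # Mimics the proposed parse action: the full dotted path is kept,
    # so a nested field can be bound by its flat name.
    return ".".join(parts)


# The same string could mean "column age of table person" or
# "field age nested in struct column person".
parts = "person.age".split(".")

print(resolve_last(parts))    # age
print(resolve_joined(parts))  # person.age
```

With the old behavior the struct name is lost, so the nested field can never be bound; with the proposed behavior the table-name reading is no longer available.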

NOTE: There is a larger conversation around correctly supporting name mapping, and the correctness of expressing field names in flat form #935

kevinjqliu

Thanks for the fix! I was able to verify that this works for the previously failing case:
#953 (comment)

Since we're using the . character to represent nested access, what if I have a column named foo.bar? Can we add a test case for that?

kevinjqliu · 2024-07-26T16:21:54Z

pyiceberg/expressions/parser.py

@@ -83,7 +83,7 @@

 @column.set_parse_action
 def _(result: ParseResults) -> Reference:
-    return Reference(result.column[-1])
+    return Reference(".".join(result.column))


nit: extract out "." as a variable

sungwy · 2024-07-26T16:52:40Z

> Since we're using the . character to represent nested access, what if I have a column named foo.bar? Can we add a test case for that?

Thank you for the suggestion! This test will definitely break with the current approach

sungwy · 2024-07-26T17:53:25Z

@kevinjqliu - I've given this a bit more thought, and if we are looking to use . to represent nested fields, we will always run into an issue with fields whose names contain . if we split on the delimiter. Unfortunately, I think this means that the scale of this problem is larger than a simple bug fix to the Expression parser, expression_to_pyarrow and the Reference classes.

I think we first have to come up with a proposal for representing nested fields in a flat string that doesn't result in these edge cases, or is at least configurable (e.g. Config parameter PYICEBERG__NESTED_FIELD_DELIMITER defined at the session level that defaults to .)
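A rough sketch of what the configurable-delimiter idea might look like. `PYICEBERG__NESTED_FIELD_DELIMITER` is only the name floated in this comment, not an existing PyIceberg setting, and `nested_field_delimiter` and `split_field_path` are hypothetical helpers:

```python
import os
from typing import List, Mapping

# Name taken from the proposal above; this is NOT an existing PyIceberg option.
ENV_KEY = "PYICEBERG__NESTED_FIELD_DELIMITER"


def nested_field_delimiter(env: Mapping[str, str] = os.environ) -> str:
    # Fall back to "." when the variable is unset or empty.
    return env.get(ENV_KEY) or "."


def split_field_path(name: str, env: Mapping[str, str] = os.environ) -> List[str]:
    # Split a flat field name into its path components using the
    # session-level delimiter.
    return name.split(nested_field_delimiter(env))


# Default delimiter: a column literally named "foo.bar" is split in two.
print(split_field_path("foo.bar", env={}))  # ['foo', 'bar']

# Custom delimiter: "foo.bar" survives as one component.
print(split_field_path("foo.bar/baz", env={ENV_KEY: "/"}))  # ['foo.bar', 'baz']
```

A configurable delimiter only moves the edge case: any column name containing the chosen delimiter would still be split incorrectly.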

kevinjqliu · 2024-07-26T18:44:48Z

> Unfortunately, I think this means that the scale of this problem is larger than a simple bug fix onto the Expression parser, expression_to_pyarrow and the Reference classes.

I'm +1 to tracking this issue in another thread and proceeding with the current PR to resolve a valid bug with nested fields.
@HonahX / @Fokko wdyt?

Fokko · 2024-08-07T16:22:33Z

pyiceberg/io/pyarrow.py

@@ -572,51 +572,54 @@ def _convert_scalar(value: Any, iceberg_type: IcebergType) -> pa.scalar:


 class _ConvertToArrowExpression(BoundBooleanExpressionVisitor[pc.Expression]):
+    def _flat_name_to_list(self, name: str) -> List[str]:
+        return name.split(".")


This is a risky route. I'd rather migrate to a Tuple[str] situation internally so we can actually support fields with .'s
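To illustrate the risk, here is a small sketch in plain Python (no PyIceberg imports). `flat_name_to_list` mirrors the split in the diff above; the tuple values are illustrative paths, not an existing internal representation:

```python
from typing import List, Tuple


def flat_name_to_list(name: str) -> List[str]:
    # Same logic as the _flat_name_to_list helper in the diff above.
    return name.split(".")


# A top-level column literally named "foo.bar" ...
dotted_column: Tuple[str, ...] = ("foo.bar",)
# ... versus a field "bar" nested inside a struct column "foo".
nested_field: Tuple[str, ...] = ("foo", "bar")

# Both collapse to the same flat string, so the split cannot tell them apart.
flat_dotted = ".".join(dotted_column)
flat_nested = ".".join(nested_field)
print(flat_dotted == flat_nested)        # True
print(flat_name_to_list(flat_dotted))    # ['foo', 'bar']

# As tuples, the two paths stay distinct.
print(dotted_column != nested_field)     # True
```

Keeping the path as a tuple end to end avoids the lossy round trip through a flat string.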

Fokko · 2024-08-07T16:25:41Z

tests/expressions/test_parser.py

+
+
+def test_nested_field_equality() -> None:
+    assert EqualTo("foo.first", "a") == parser.parse("foo.first == 'a'")


> I think we first have to come up with a proposal for representing nested fields in a flat string that doesn't result in these edge cases, or is at least configurable (e.g. Config parameter PYICEBERG__NESTED_FIELD_DELIMITER defined at the session level that defaults to .)

I think the key to success is to have some kind of syntax for quoting literals. For example: https://spark.apache.org/docs/latest/sql-ref-literals.html

Then we can parse something like:

'a.b' -> Reference(('a.b',))
'a.b'.c -> Reference(('a.b', 'c'))
a.b.c -> Reference(('a', 'b', 'c'))

Or folks have to use:

row_filter=EqualTo(('a.b',), 123)
row_filter=EqualTo(('a.b', 'c'), 123)
row_filter=EqualTo(('a', 'b', 'c'), 123)

There is also an interesting proposal on the spec side of things: apache/iceberg#10883

Related: apache/iceberg#598
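The quoting idea above can be sketched with a small hand-rolled splitter. This is not the pyparsing grammar used in `pyiceberg/expressions/parser.py`; `parse_field_path` is a hypothetical helper, it returns a plain tuple rather than a `Reference`, and it does not handle escaped quotes inside names:

```python
from typing import List, Tuple


def parse_field_path(expr: str) -> Tuple[str, ...]:
    """Split a dotted path, treating single-quoted segments as literal names."""
    parts: List[str] = []
    current: List[str] = []
    in_quote = False
    for ch in expr:
        if ch == "'":
            # Toggle quoting; the quote character itself is not part of the name.
            in_quote = not in_quote
        elif ch == "." and not in_quote:
            # An unquoted dot separates two levels of nesting.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if in_quote:
        raise ValueError(f"unterminated quote in {expr!r}")
    parts.append("".join(current))
    return tuple(parts)


print(parse_field_path("'a.b'"))    # ('a.b',)
print(parse_field_path("'a.b'.c"))  # ('a.b', 'c')
print(parse_field_path("a.b.c"))    # ('a', 'b', 'c')
```

A real grammar would also need to distinguish quoted identifiers from quoted string literals such as 'a' in foo.first == 'a', which this sketch does not attempt.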

fix parsing reference for nested fields

e1132e0

sungwy mentioned this pull request Jul 25, 2024

Query on nested struct field with PyIceberg? #953

Open

sungwy and others added 2 commits July 26, 2024 14:03

fixes 953

2653f1c

Merge branch 'main' into nested-parse

1423a65

sungwy requested a review from kevinjqliu July 26, 2024 15:03

test fix

5f14dbb

kevinjqliu reviewed Jul 26, 2024

failing edge case

cb07171

sungwy marked this pull request as draft July 26, 2024 17:53

Fokko self-requested a review August 6, 2024 21:00

Fokko reviewed Aug 7, 2024

kevinjqliu added this to the PyIceberg 0.9.0 release milestone Oct 30, 2024
