Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NameMapping flattens the names and causes a.b field to collide with child b field of field a #935

Closed
sungwy opened this issue Jul 16, 2024 · 3 comments · Fixed by #1014
Closed
Assignees

Comments

@sungwy
Copy link
Collaborator

sungwy commented Jul 16, 2024

Apache Iceberg version

None

Please describe the bug 🐞

According to the Iceberg documentation on Column Projection:

A name may contain . but this refers to a literal name, not a nested field. For example, a.b refers to a field named a.b, not child field b of field a.

The current implementation of NameMapping flattens the name by joining the parent child relationships with a .. This causes name collisions issues with fields that should not collide with each other.

For example, this flat map causes a.b field to collide with child b field of field a.

We should update _field_by_name() and find() methods of NameMapping to use a tree structure instead of a flat dict, and traverse the tree in order to retrieve MappedField of the provided name.

@cached_property
def _field_by_name(self) -> Dict[str, MappedField]:
return visit_name_mapping(self, _IndexByName())
def find(self, *names: str) -> MappedField:
name = ".".join(names)
try:
return self._field_by_name[name]
except KeyError as e:
raise ValueError(f"Could not find field with name: {name}") from e

@Fokko
Copy link
Contributor

Fokko commented Jul 16, 2024

Thanks for tracking this @syun64, can I pick this one up? :)

@sungwy
Copy link
Collaborator Author

sungwy commented Jul 16, 2024

Yes of course!

Just a note that's hopefully helpful: while working on covering more cases for #921 , I realized this may require a bit more work than I originally thought. We currently rely on a flat name mapping in many places throughout the repository, including when we aggregate stats from the parquet files:

statistics = data_file_statistics_from_parquet_metadata(
parquet_metadata=writer.writer.metadata,
stats_columns=compute_statistics_plan(file_schema, table_metadata.properties),
parquet_column_mapping=parquet_path_to_id_mapping(file_schema),
)

So I think we will need to build a tree representation of the Name to ID mapping for a given pyarrow schema as well.

for pos in range(parquet_metadata.num_columns):
column = row_group.column(pos)
field_id = parquet_column_mapping[column.path_in_schema]

@sungwy
Copy link
Collaborator Author

sungwy commented Jul 16, 2024

Hi @Fokko - I wanted to make note of this Rest Catalog Open API Spec PR, where the community may be weighing the pros and cons of flattening the nested field names in our APIs:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants