NameMapping flattens the names and causes `a.b` field to collide with child `b` field of field `a` #935

sungwy · 2024-07-16T01:39:46Z

Apache Iceberg version

None

Please describe the bug 🐞

According to the Iceberg documentation on Column Projection:

A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`.

The current implementation of NameMapping flattens the names by joining each parent and child name with a `.`. This causes name collisions between fields that should not collide with each other.

For example, this flat map causes a.b field to collide with child b field of field a.
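To make the collision concrete, here is a minimal sketch. It does not use PyIceberg itself: `Field` and `flatten` are hypothetical stand-ins for `MappedField` and the flattening index, and the "last entry wins" behavior is just how a plain `dict` behaves, not necessarily what the real visitor does.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Field:
    """Hypothetical stand-in for MappedField: an ID, its names, and child fields."""

    field_id: int
    names: List[str]
    children: List["Field"] = field(default_factory=list)


def flatten(fields: List[Field], prefix: Tuple[str, ...] = ()) -> List[Tuple[str, int]]:
    """Join parent and child names with '.', the way a flat index would."""
    pairs: List[Tuple[str, int]] = []
    for f in fields:
        for name in f.names:
            path = prefix + (name,)
            pairs.append((".".join(path), f.field_id))
            pairs.extend(flatten(f.children, path))
    return pairs


mapping = [
    Field(1, ["a"], [Field(2, ["b"])]),  # struct `a` with child `b`
    Field(3, ["a.b"]),  # top-level field literally named `a.b`
]

pairs = flatten(mapping)
# Two different field IDs (2 and 3) end up under the same flat key "a.b".
flat_index: Dict[str, int] = dict(pairs)  # a plain dict keeps only the last one
print(flat_index)  # {'a': 1, 'a.b': 3}
```

Field 2 (child `b` of `a`) is no longer reachable by name once the two paths share a key.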

We should update _field_by_name() and find() methods of NameMapping to use a tree structure instead of a flat dict, and traverse the tree in order to retrieve MappedField of the provided name.

iceberg-python/pyiceberg/table/name_mapping.py

Lines 73 to 82 in e27cd90

    @cached_property
    def _field_by_name(self) -> Dict[str, MappedField]:
        return visit_name_mapping(self, _IndexByName())

    def find(self, *names: str) -> MappedField:
        name = ".".join(names)
        try:
            return self._field_by_name[name]
        except KeyError as e:
            raise ValueError(f"Could not find field with name: {name}") from e
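A tree-based lookup could look roughly like the sketch below. This is only an illustration of the idea, not the eventual fix: `Field` and the free function `find` are hypothetical stand-ins for `MappedField` and `NameMapping.find()`, and each name is matched literally at one level of the tree instead of being joined into a single string.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical stand-in for MappedField: an ID, its names, and child fields."""

    field_id: int
    names: List[str]
    children: List["Field"] = field(default_factory=list)


def find(fields: List[Field], *names: str) -> Field:
    """Walk the tree one level per name; a '.' inside a name is never split."""
    if not names:
        raise ValueError("At least one name is required")
    current = fields
    found: Optional[Field] = None
    for name in names:
        found = next((f for f in current if name in f.names), None)
        if found is None:
            raise ValueError(f"Could not find field with name: {names}")
        current = found.children
    assert found is not None  # guaranteed because names is non-empty
    return found


mapping = [
    Field(1, ["a"], [Field(2, ["b"])]),  # struct `a` with child `b`
    Field(3, ["a.b"]),  # top-level field literally named `a.b`
]

nested = find(mapping, "a", "b")  # child `b` of field `a`
literal = find(mapping, "a.b")  # the field literally named `a.b`
print(nested.field_id, literal.field_id)  # 2 3
```

With this shape, `find("a", "b")` and `find("a.b")` resolve to different fields, which is the distinction the spec asks for.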


Fokko · 2024-07-16T19:13:21Z

Thanks for tracking this @syun64, can I pick this one up? :)

sungwy · 2024-07-16T19:51:45Z

Yes of course!

Just a note that's hopefully helpful: while working on covering more cases for #921, I realized this may require a bit more work than I originally thought. We currently rely on a flat name mapping in many places throughout the repository, including when we aggregate stats from the Parquet files:

iceberg-python/pyiceberg/io/pyarrow.py

Lines 2027 to 2031 in 0f2e19e

    statistics = data_file_statistics_from_parquet_metadata(
        parquet_metadata=writer.writer.metadata,
        stats_columns=compute_statistics_plan(file_schema, table_metadata.properties),
        parquet_column_mapping=parquet_path_to_id_mapping(file_schema),
    )

So I think we will need to build a tree representation of the Name to ID mapping for a given pyarrow schema as well.

iceberg-python/pyiceberg/io/pyarrow.py

Lines 1934 to 1936 in 0f2e19e

    for pos in range(parquet_metadata.num_columns):
        column = row_group.column(pos)
        field_id = parquet_column_mapping[column.path_in_schema]

sungwy · 2024-07-16T22:31:48Z

Hi @Fokko - I wanted to make a note of this REST Catalog OpenAPI spec PR, where the community may be weighing the pros and cons of flattening the nested field names in our APIs:

sungwy mentioned this issue Jul 25, 2024

Fix parsing reference for nested fields #965

Draft

Fokko self-assigned this Aug 7, 2024

Fokko mentioned this issue Aug 7, 2024

Use VisitorWithPartner for name-mapping #1014

Merged

sungwy closed this as completed in #1014 Aug 13, 2024
