Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python: Add Schema class #4318

Merged
merged 23 commits into from
Apr 4, 2022
Merged

Python: Add Schema class #4318

merged 23 commits into from
Apr 4, 2022

Conversation

samredai
Copy link
Collaborator

@samredai samredai commented Mar 13, 2022

This adds the Schema class by carving it out of PR #3228 by @nssalian and building upon it.

A Schema object is created by providing a list of NestedField instances, a schema ID, and optionally a map of aliases to field IDs. The get_field_id(...) method allows you to retrieve a field ID by providing the field name or an alias, and the get_field(...) method allows you to retrieve the field object by providing the field ID. There's also a get_type method that lets you retrieve the type of field by providing the field ID.

from iceberg.table.schema import Schema
from iceberg.types import BooleanType, IntegerType, NestedField, StringType

fields = [
    NestedField(field_id=1, name="foo", field_type=StringType(), is_optional=False),
    NestedField(field_id=2, name="bar", field_type=IntegerType(), is_optional=True),
    NestedField(field_id=3, name="baz", field_type=BooleanType(), is_optional=False),
]
table_schema = Schema(fields=fields, schema_id=1, aliases={"qux": 3})
print(table_schema)

output

1: name=foo, type=string, required=True
2: name=bar, type=int, required=False
3: name=baz, type=boolean, required=True
table_schema.find_field_id_by_name("foo")  # 1
table_schema.find_field_by_id(1)  # NestedField(field_id=1, name='foo', field_type=StringType(), is_optional=False)
table_schema.find_field_type(1). # StringType()

@samredai
Copy link
Collaborator Author

@cabhishek


return field_id

def get_field(self, field_id: int) -> NestedField:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should use maps that are constructed by indexing with a SchemaVisitor. All of the methods on schema are intended to return a field by full name or by ID from anywhere in the schema. That's why schema is not just a regular struct. Structs will return fields by name or ID, but are limited to just that struct.

That's also why we use names like find_field rather than get_field (in addition, we avoid get in methods names generally).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, I saw that in the legacy implementation but now I understand the purpose. I'll update this today.

Copy link
Collaborator Author

@samredai samredai Mar 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue I updated the PR to use the visitor pattern and added the IndexById and IndexByName schema visitors. There are four more schema visitor classes but I think those should be added in follow-up PRs. Let me know what you think:

  • GetProjectedIds
  • PruneColumns
  • AssignFreshIds
  • CheckCompatibility

return field
raise ValueError(f"Cannot get field, ID does not exist: {field_id}")

def find_field_type(self, field_id: int) -> IcebergType:
Copy link
Contributor

@cabhishek cabhishek Mar 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be re-named to def find_field_type_by_id(...) to be more explicit?
This seems a little ambiguous

table_schema.find_field_type(1) # StringType()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, updated 👍

@samredai
Copy link
Collaborator Author

Rebased to add the changes from the docstest PR that was merged in.

@samredai
Copy link
Collaborator Author

@rdblue thanks for the review! I brought over the the schema and visitors from the partition spec PR which also addresses the other comments.

@samredai
Copy link
Collaborator Author

samredai commented Mar 28, 2022

The latest click release has broken black (issue here). I'm sure it'll be resolved soon enough so we can ignore these test failures for now. Alternatively we can temporarily add a pin for click in the tox file: click == 8.0.4

This has been resolved

raise ValueError("Cannot find field: {name_or_id}")
return matched_fields[0]

def find_field(self, name_or_id: Union[str, int], case_sensitive: bool = True) -> NestedField:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know when we can use PEP 604?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been back-ported the earliest version we support but not enabled by default. We'd have to add from __future__ import annotations to the top of every file where we use it. I do like the newer syntax, let me know if the visual cost of the import feels worth it and I'll go ahead and update these.

@samredai
Copy link
Collaborator Author

samredai commented Mar 30, 2022

Added an IndexByName visitor and did a comparison with the java client which produces the same result.

Java

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.types.TypeUtil;

Schema schema = new Schema(
      Types.NestedField.required(1, "foo", Types.StringType.get()),
      Types.NestedField.optional(2, "bar", Types.IntegerType.get()),
      Types.NestedField.required(3, "baz", Types.BooleanType.get()),
      Types.NestedField.required(4, "qux", Types.ListType.ofOptional(5, Types.StringType.get())),
      Types.NestedField.required(6, "quux", Types.MapType.ofOptional(7, 8, Types.StringType.get(), Types.MapType.ofOptional(9, 10, Types.StringType.get(), Types.IntegerType.get())))
);
Map<String, Integer> index = TypeUtil.indexByName(schema.asStruct());
System.out.println(index);

output:

{foo=1, bar=2, baz=3, qux=4, qux.element=5, quux=6, quux.key=7, quux.value=8, quux.value.key=9, quux.value.value=10}

Python

from iceberg.types import (
    BooleanType,
    IntegerType,
    ListType,
    MapType,
    NestedField,
    StringType,
    StructType,
)
from iceberg.table.schema import Schema, index_by_name

schema = Schema(
    NestedField(field_id=1, name="foo", field_type=StringType(), is_optional=False),
    NestedField(field_id=2, name="bar", field_type=IntegerType(), is_optional=True),
    NestedField(field_id=3, name="baz", field_type=BooleanType(), is_optional=False),
    NestedField(field_id=4, name="qux", field_type=ListType(element_id=5, element_type=StringType(), element_is_optional=True), is_optional=True),
    NestedField(field_id=6, name="quux", field_type=MapType(key_id=7, key_type=StringType(), value_id=8, value_type=MapType(key_id=9, key_type=StringType(), value_id=10, value_type=IntegerType(), value_is_optional=True), value_is_optional=True), is_optional=True)
)
    
index = index_by_name(schema)
print(index)

output:

{'foo': 1, 'bar': 2, 'baz': 3, 'qux': 4, 'qux.element': 5, 'quux': 6, 'quux.key': 7, 'quux.value': 8, 'quux.value.key': 9, 'quux.value.value': 10}

@samredai
Copy link
Collaborator Author

Also, here's a comparison to TypeUtil.indexById in Java:

Java

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.types.TypeUtil;

Schema schema = new Schema(
      Types.NestedField.required(1, "foo", Types.StringType.get()),
      Types.NestedField.optional(2, "bar", Types.IntegerType.get()),
      Types.NestedField.required(3, "baz", Types.BooleanType.get()),
      Types.NestedField.required(4, "qux", Types.ListType.ofOptional(5, Types.StringType.get())),
      Types.NestedField.required(6, "quux", Types.MapType.ofOptional(7, 8, Types.StringType.get(), Types.MapType.ofOptional(9, 10, Types.StringType.get(), Types.IntegerType.get())))
);
Map<Integer, Types.NestedField> index = TypeUtil.indexById(schema.asStruct());
System.out.println(index);

output:

{1=1: foo: required string, 2=2: bar: optional int, 3=3: baz: required boolean, 4=4: qux: required list<string>, 5=5: element: optional string, 6=6: quux: required map<string, map<string, int>>, 7=7: key: required string, 8=8: value: optional map<string, int>, 9=9: key: required string, 10=10: value: optional int}

Python

from iceberg.types import (
    BooleanType,
    IntegerType,
    ListType,
    MapType,
    NestedField,
    StringType,
    StructType,
)
from iceberg.table.schema import Schema, index_by_id

schema = Schema(
    NestedField(field_id=1, name="foo", field_type=StringType(), is_optional=False),
    NestedField(field_id=2, name="bar", field_type=IntegerType(), is_optional=True),
    NestedField(field_id=3, name="baz", field_type=BooleanType(), is_optional=False),
    NestedField(field_id=4, name="qux", field_type=ListType(element_id=5, element_type=StringType(), element_is_optional=True), is_optional=True),
    NestedField(field_id=6, name="quux", field_type=MapType(key_id=7, key_type=StringType(), value_id=8, value_type=MapType(key_id=9, key_type=StringType(), value_id=10, value_type=IntegerType(), value_is_optional=True), value_is_optional=True), is_optional=True)
)
    
index = index_by_id(schema)
print(index)

output:

{
 1: NestedField(field_id=1, name='foo', field_type=StringType(), is_optional=False),
 2: NestedField(field_id=2, name='bar', field_type=IntegerType(), is_optional=True),
 3: NestedField(field_id=3, name='baz', field_type=BooleanType(), is_optional=False),
 4: NestedField(field_id=4, name='qux', field_type=ListType(element_id=5, element_type=StringType(), element_is_optional=True), is_optional=True),
 5: NestedField(field_id=5, name='element', field_type=StringType(), is_optional=True),
 6: NestedField(field_id=6, name='quux', field_type=MapType(key_id=7, key_type=StringType(), value_id=8, value_type=MapType(key_id=9, key_type=StringType(), value_id=10, value_type=IntegerType(), value_is_optional=True), value_is_optional=True), is_optional=True),
 7: NestedField(field_id=7, name='key', field_type=StringType(), is_optional=False),
 8: NestedField(field_id=8, name='value', field_type=MapType(key_id=9, key_type=StringType(), value_id=10, value_type=IntegerType(), value_is_optional=True), is_optional=True),
 9: NestedField(field_id=9, name='key', field_type=StringType(), is_optional=False),
 10: NestedField(field_id=10, name='value', field_type=IntegerType(), is_optional=True),
}

Co-authored-by: nssalian <[email protected]>
@samredai
Copy link
Collaborator Author

Rebased and also updated it to call index_by_id(...) lazily and then cache it.

index_by_name(...) on the other hand is called right away in the init for Schema and cached.

str: The column name
"""
column = self._lazy_id_to_field().get(column_id)
return None if column is None else column.name # type: ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually needs to return the full name, not the field name. In Java, this uses the same index visitor as the name to ID, but it produces the byId map.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, other than the find_column_name method doesn't return the full column name.

@rdblue
Copy link
Contributor

rdblue commented Apr 4, 2022

@samredai, if you want, you can throw NotImplementedError in find_column_name and we can add it in a follow-up.

@rdblue rdblue merged commit 656717d into apache:master Apr 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants