Python: Add Schema class #4318

samredai · 2022-03-13T22:06:26Z

This adds the Schema class by carving it out of PR #3228 by @nssalian and building upon it.

A Schema object is created by providing a list of NestedField instances, a schema ID, and optionally a map of aliases to field IDs. The get_field_id(...) method allows you to retrieve a field ID by providing the field name or an alias, and the get_field(...) method allows you to retrieve the field object by providing the field ID. There's also a get_type method that lets you retrieve the type of a field by providing the field ID.

from iceberg.table.schema import Schema
from iceberg.types import BooleanType, IntegerType, NestedField, StringType

fields = [
    NestedField(field_id=1, name="foo", field_type=StringType(), is_optional=False),
    NestedField(field_id=2, name="bar", field_type=IntegerType(), is_optional=True),
    NestedField(field_id=3, name="baz", field_type=BooleanType(), is_optional=False),
]
table_schema = Schema(fields=fields, schema_id=1, aliases={"qux": 3})
print(table_schema)

output

1: name=foo, type=string, required=True
2: name=bar, type=int, required=False
3: name=baz, type=boolean, required=True

table_schema.find_field_id_by_name("foo")  # 1
table_schema.find_field_by_id(1)  # NestedField(field_id=1, name='foo', field_type=StringType(), is_optional=False)
table_schema.find_field_type(1)  # StringType()

samredai · 2022-03-13T22:08:20Z

@cabhishek

rdblue · 2022-03-13T23:53:13Z

python/src/iceberg/table/schema.py

+
+        return field_id
+
+    def get_field(self, field_id: int) -> NestedField:


These should use maps that are constructed by indexing with a SchemaVisitor. All of the methods on schema are intended to return a field by full name or by ID from anywhere in the schema. That's why schema is not just a regular struct. Structs will return fields by name or ID, but are limited to just that struct.

That's also why we use names like find_field rather than get_field (in addition, we avoid get in method names generally).
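
To illustrate the difference between a struct-local lookup and a schema-wide index, here is a minimal sketch. The classes `Primitive`, `Field`, `Struct`, and `ListOf` and the function `index_fields_by_id` are hypothetical stand-ins invented for this example; they are not the iceberg types or the SchemaVisitor from this PR.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


# Hypothetical stand-ins for the real type classes (assumptions for illustration).
@dataclass(frozen=True)
class Primitive:
    name: str


@dataclass(frozen=True)
class Field:
    field_id: int
    name: str
    field_type: "FieldType"


@dataclass(frozen=True)
class Struct:
    fields: Tuple[Field, ...]


@dataclass(frozen=True)
class ListOf:
    element: Field


FieldType = Union[Primitive, Struct, ListOf]


def index_fields_by_id(node: FieldType) -> Dict[int, Field]:
    """Walk the whole type tree and map every field ID to its field."""
    index: Dict[int, Field] = {}
    if isinstance(node, Struct):
        for field in node.fields:
            index[field.field_id] = field
            index.update(index_fields_by_id(field.field_type))
    elif isinstance(node, ListOf):
        index[node.element.field_id] = node.element
        index.update(index_fields_by_id(node.element.field_type))
    return index


point = Struct((Field(3, "x", Primitive("int")), Field(4, "y", Primitive("int"))))
root = Struct((
    Field(1, "id", Primitive("long")),
    Field(2, "points", ListOf(Field(5, "element", point))),
))

# A struct-local lookup only sees the top-level fields: IDs 1 and 2.
local_ids = {f.field_id for f in root.fields}
# The schema-wide index reaches every nested field as well.
schema_index = index_fields_by_id(root)
```

Building the index once with a traversal means later lookups by ID do not need to walk the nested types again.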

Thanks for the explanation, I saw that in the legacy implementation but now I understand the purpose. I'll update this today.

@rdblue I updated the PR to use the visitor pattern and added the IndexById and IndexByName schema visitors. There are four more schema visitor classes but I think those should be added in follow-up PRs. Let me know what you think:

GetProjectedIds

PruneColumns

AssignFreshIds

CheckCompatibility

cabhishek · 2022-03-15T03:13:21Z

python/src/iceberg/table/schema.py

+                return field
+        raise ValueError(f"Cannot get field, ID does not exist: {field_id}")
+
+    def find_field_type(self, field_id: int) -> IcebergType:


should this be re-named to def find_field_type_by_id(...) to be more explicit?
This seems a little ambiguous

table_schema.find_field_type(1) # StringType()

Sounds good, updated 👍

samredai · 2022-03-25T22:35:45Z

Rebased to add the changes from the docstest PR that was merged in.

samredai · 2022-03-28T17:15:56Z

@rdblue thanks for the review! I brought over the schema and visitors from the partition spec PR which also addresses the other comments.

samredai · 2022-03-28T18:14:28Z

The latest click release has broken black (issue here). I'm sure it'll be resolved soon enough so we can ignore these test failures for now. Alternatively we can temporarily add a pin for click in the tox file: click == 8.0.4

This has been resolved

rdblue · 2022-03-28T19:54:49Z

python/src/iceberg/table/schema.py

+            raise ValueError(f"Cannot find field: {name_or_id}")
+        return matched_fields[0]
+
+    def find_field(self, name_or_id: Union[str, int], case_sensitive: bool = True) -> NestedField:


Do you know when we can use PEP 604?

It's been back-ported to the earliest version we support but not enabled by default. We'd have to add from __future__ import annotations to the top of every file where we use it. I do like the newer syntax, let me know if the visual cost of the import feels worth it and I'll go ahead and update these.
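
For reference, a minimal sketch of what the PEP 604 spelling looks like with the future import. The function `describe` is a hypothetical helper invented for this example and is not part of the PR.

```python
from __future__ import annotations  # must be the first statement in the module


def describe(name_or_id: str | int, case_sensitive: bool = True) -> str:
    """Hypothetical helper that only exists to show the annotation syntax."""
    if isinstance(name_or_id, int):
        return f"id={name_or_id}"
    return f"name={name_or_id if case_sensitive else name_or_id.lower()}"


# With the future import, annotations are stored as strings and are not
# evaluated at definition time, so `str | int` is accepted before Python 3.10.
annotation = describe.__annotations__["name_or_id"]
```

The import only affects annotations. Runtime expressions such as `isinstance(x, str | int)` still need Python 3.10 or later, which is why the body above checks against `int` directly.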

samredai · 2022-03-30T04:47:30Z

Added an IndexByName visitor and did a comparison with the Java client, which produces the same result.

Java

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.types.TypeUtil;

Schema schema = new Schema(
      Types.NestedField.required(1, "foo", Types.StringType.get()),
      Types.NestedField.optional(2, "bar", Types.IntegerType.get()),
      Types.NestedField.required(3, "baz", Types.BooleanType.get()),
      Types.NestedField.required(4, "qux", Types.ListType.ofOptional(5, Types.StringType.get())),
      Types.NestedField.required(6, "quux", Types.MapType.ofOptional(7, 8, Types.StringType.get(), Types.MapType.ofOptional(9, 10, Types.StringType.get(), Types.IntegerType.get())))
);
Map<String, Integer> index = TypeUtil.indexByName(schema.asStruct());
System.out.println(index);

output:

{foo=1, bar=2, baz=3, qux=4, qux.element=5, quux=6, quux.key=7, quux.value=8, quux.value.key=9, quux.value.value=10}

Python

from iceberg.types import (
    BooleanType,
    IntegerType,
    ListType,
    MapType,
    NestedField,
    StringType,
    StructType,
)
from iceberg.table.schema import Schema, index_by_name

schema = Schema(
    NestedField(field_id=1, name="foo", field_type=StringType(), is_optional=False),
    NestedField(field_id=2, name="bar", field_type=IntegerType(), is_optional=True),
    NestedField(field_id=3, name="baz", field_type=BooleanType(), is_optional=False),
    NestedField(field_id=4, name="qux", field_type=ListType(element_id=5, element_type=StringType(), element_is_optional=True), is_optional=True),
    NestedField(field_id=6, name="quux", field_type=MapType(key_id=7, key_type=StringType(), value_id=8, value_type=MapType(key_id=9, key_type=StringType(), value_id=10, value_type=IntegerType(), value_is_optional=True), value_is_optional=True), is_optional=True)
)
    
index = index_by_name(schema)
print(index)

output:

{'foo': 1, 'bar': 2, 'baz': 3, 'qux': 4, 'qux.element': 5, 'quux': 6, 'quux.key': 7, 'quux.value': 8, 'quux.value.key': 9, 'quux.value.value': 10}
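
The dotted names in both outputs come from prefixing each nested field with the full name of its parent. A minimal sketch of that idea follows, using a hypothetical tuple representation `(name, field_id, children)` and a hypothetical function `index_names`; neither is the IndexByName visitor from this PR.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a schema: each entry is (name, field_id, children),
# where children holds the nested "element", "key", or "value" entries.
Node = Tuple[str, int, list]


def index_names(nodes: List[Node], prefix: str = "") -> Dict[str, int]:
    """Map every full (dotted) field name to its field ID, depth first."""
    index: Dict[str, int] = {}
    for name, field_id, children in nodes:
        full_name = f"{prefix}.{name}" if prefix else name
        index[full_name] = field_id
        index.update(index_names(children, full_name))
    return index


schema = [
    ("foo", 1, []),
    ("bar", 2, []),
    ("baz", 3, []),
    ("qux", 4, [("element", 5, [])]),
    ("quux", 6, [("key", 7, []), ("value", 8, [("key", 9, []), ("value", 10, [])])]),
]
name_index = index_names(schema)
```

For this input, `name_index` holds the same ten keys and IDs as the Python output above.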

samredai · 2022-03-30T04:56:58Z

Also, here's a comparison to TypeUtil.indexById in Java:

Java

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.types.TypeUtil;

Schema schema = new Schema(
      Types.NestedField.required(1, "foo", Types.StringType.get()),
      Types.NestedField.optional(2, "bar", Types.IntegerType.get()),
      Types.NestedField.required(3, "baz", Types.BooleanType.get()),
      Types.NestedField.required(4, "qux", Types.ListType.ofOptional(5, Types.StringType.get())),
      Types.NestedField.required(6, "quux", Types.MapType.ofOptional(7, 8, Types.StringType.get(), Types.MapType.ofOptional(9, 10, Types.StringType.get(), Types.IntegerType.get())))
);
Map<Integer, Types.NestedField> index = TypeUtil.indexById(schema.asStruct());
System.out.println(index);

output:

{1=1: foo: required string, 2=2: bar: optional int, 3=3: baz: required boolean, 4=4: qux: required list<string>, 5=5: element: optional string, 6=6: quux: required map<string, map<string, int>>, 7=7: key: required string, 8=8: value: optional map<string, int>, 9=9: key: required string, 10=10: value: optional int}

Python

from iceberg.types import (
    BooleanType,
    IntegerType,
    ListType,
    MapType,
    NestedField,
    StringType,
    StructType,
)
from iceberg.table.schema import Schema, index_by_id

schema = Schema(
    NestedField(field_id=1, name="foo", field_type=StringType(), is_optional=False),
    NestedField(field_id=2, name="bar", field_type=IntegerType(), is_optional=True),
    NestedField(field_id=3, name="baz", field_type=BooleanType(), is_optional=False),
    NestedField(field_id=4, name="qux", field_type=ListType(element_id=5, element_type=StringType(), element_is_optional=True), is_optional=True),
    NestedField(field_id=6, name="quux", field_type=MapType(key_id=7, key_type=StringType(), value_id=8, value_type=MapType(key_id=9, key_type=StringType(), value_id=10, value_type=IntegerType(), value_is_optional=True), value_is_optional=True), is_optional=True)
)
    
index = index_by_id(schema)
print(index)

output:

{
 1: NestedField(field_id=1, name='foo', field_type=StringType(), is_optional=False),
 2: NestedField(field_id=2, name='bar', field_type=IntegerType(), is_optional=True),
 3: NestedField(field_id=3, name='baz', field_type=BooleanType(), is_optional=False),
 4: NestedField(field_id=4, name='qux', field_type=ListType(element_id=5, element_type=StringType(), element_is_optional=True), is_optional=True),
 5: NestedField(field_id=5, name='element', field_type=StringType(), is_optional=True),
 6: NestedField(field_id=6, name='quux', field_type=MapType(key_id=7, key_type=StringType(), value_id=8, value_type=MapType(key_id=9, key_type=StringType(), value_id=10, value_type=IntegerType(), value_is_optional=True), value_is_optional=True), is_optional=True),
 7: NestedField(field_id=7, name='key', field_type=StringType(), is_optional=False),
 8: NestedField(field_id=8, name='value', field_type=MapType(key_id=9, key_type=StringType(), value_id=10, value_type=IntegerType(), value_is_optional=True), is_optional=True),
 9: NestedField(field_id=9, name='key', field_type=StringType(), is_optional=False),
 10: NestedField(field_id=10, name='value', field_type=IntegerType(), is_optional=True),
}

Co-authored-by: nssalian <[email protected]>

samredai · 2022-03-31T01:08:47Z

Rebased and also updated it to call index_by_id(...) lazily and then cache it.

index_by_name(...) on the other hand is called right away in the init for Schema and cached.
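
A minimal sketch of the eager versus lazy split described above. The class `CachedIndexes`, its `(field_id, name)` tuples, and the `id_index_builds` counter are hypothetical and exist only to make the caching visible; the real Schema class may differ.

```python
from typing import Dict, List, Optional, Tuple


class CachedIndexes:
    """Hypothetical holder showing eager and lazy index construction."""

    def __init__(self, fields: List[Tuple[int, str]]) -> None:
        self._fields = list(fields)
        # Eager: the name index is built once, during construction.
        self.name_to_id: Dict[str, int] = {name: fid for fid, name in self._fields}
        # Lazy: the ID index stays unset until something asks for it.
        self._id_to_field: Optional[Dict[int, Tuple[int, str]]] = None
        self.id_index_builds = 0

    def _lazy_id_to_field(self) -> Dict[int, Tuple[int, str]]:
        if self._id_to_field is None:
            self.id_index_builds += 1
            self._id_to_field = {fid: (fid, name) for fid, name in self._fields}
        return self._id_to_field

    def find_field(self, field_id: int) -> Optional[Tuple[int, str]]:
        return self._lazy_id_to_field().get(field_id)


indexes = CachedIndexes([(1, "foo"), (2, "bar")])
builds_before = indexes.id_index_builds  # nothing built yet
indexes.find_field(1)
indexes.find_field(2)  # reuses the cached index
```

After both lookups the ID index has been built exactly once, so repeated calls pay the traversal cost only on first use.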

rdblue · 2022-04-04T20:36:41Z

python/src/iceberg/schema.py

+            str: The column name
+        """
+        column = self._lazy_id_to_field().get(column_id)
+        return None if column is None else column.name  # type: ignore


This actually needs to return the full name, not the field name. In Java, this uses the same index visitor as the name to ID, but it produces the byId map.
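
One possible way to get full names, assuming the name-to-ID index is available, is to invert it. This is only a sketch of the idea and not how the Java implementation is written; the variable names are hypothetical.

```python
from typing import Dict

# Name-to-ID index copied from the index_by_name output earlier in this thread.
name_to_id: Dict[str, int] = {
    "foo": 1,
    "bar": 2,
    "baz": 3,
    "qux": 4,
    "qux.element": 5,
    "quux": 6,
    "quux.key": 7,
    "quux.value": 8,
    "quux.value.key": 9,
    "quux.value.value": 10,
}

# Inverting the index gives ID -> full name, so field 9 resolves to
# "quux.value.key" rather than the bare field name "key".
id_to_full_name: Dict[int, str] = {field_id: name for name, field_id in name_to_id.items()}
```

The inversion is safe here because field IDs are unique within the index, so no two names collapse onto the same key.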

rdblue

Looks great, except that the find_column_name method doesn't return the full column name.

rdblue · 2022-04-04T20:39:10Z

@samredai, if you want, you can throw NotImplementedError in find_column_name and we can add it in a follow-up.

github-actions bot added the python label Mar 13, 2022

samredai requested review from jun-he, rdblue and kbendick March 13, 2022 22:07

rdblue reviewed Mar 13, 2022

cabhishek reviewed Mar 15, 2022

kbendick reviewed Mar 15, 2022

samredai requested review from rdblue and kbendick March 15, 2022 18:32

rdblue reviewed Mar 27, 2022

rdblue reviewed Mar 28, 2022

Add Schema class

f27f181

Co-authored-by: nssalian <[email protected]>

samredai added 6 commits March 30, 2022 18:05

Add temporary pin for click

c163b76

Add index_by_name and clean up tests

90ea3bf

Adding ASF headers

bffd182

Add schema_id and identifier_field_ids Schema properties

d957c9b

Schema: lazily generate id_index, generate name_index during init

4743588

Adding a couple missing return typehints

338304a

Adding back singledispatch to install_requires

48cf967

rdblue reviewed Apr 3, 2022

Incorporate PR feedback

46f486b

rdblue reviewed Apr 4, 2022

Raise NotImplementedError for find_column_name

8ccfca4

rdblue approved these changes Apr 4, 2022

rdblue merged commit 656717d into apache:master Apr 4, 2022
