[Python] #3228: PartitionSpec python API implementation #3407

nssalian · 2021-10-28T02:31:07Z

In this PR:

Added the initial implementation of partition_field, schema and partition_spec for [Python] support partition spec in iceberg python library #3228
Formatted using black.

Most items are pending since Transforms and TypeUtil haven't been implemented in the Python API

CC: @jun-he, @rdblue , @samredai to help review and understand how to proceed.

nssalian · 2021-10-28T16:56:04Z

Tests pass locally for 3.7 and 3.9.5. I'll fix the line nits in upcoming commits, waiting for review comments first.

python/src/iceberg/partition_field.py

python/src/iceberg/partition_spec.py

rdblue · 2021-10-28T22:19:48Z

python/src/iceberg/partition_spec.py

+
+class PartitionSpec(object):
+    fields_by_source_id: defaultdict[list] = None
+    field_list: List[PartitionField] = None


Can you help me understand the scope of these? Are these essentially instance variable declarations?

Yes, that's right. Internal to the class.

These aren't needed until an individual instance is initialized, right? I think they should be arguments to __init__(). A few reasons: it's unnecessary noise when doing dir(PartitionSpec), since you don't expect a user to set or get them at class scope, and the help() documentation could be misleading, since most users will treat __init__() as the comprehensive list of what initialization requires and these would be missing there.

Similar thoughts as #3407 (comment)
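A minimal sketch of the distinction discussed in this thread. `PartitionField` is a simplified stand-in, and `ClassLevelSpec` and `InitSpec` are hypothetical names used only for illustration, not classes from the PR:

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional


class PartitionField:
    """Hypothetical stand-in for the PR's PartitionField."""

    def __init__(self, source_id: int, name: str):
        self.source_id = source_id
        self.name = name


class ClassLevelSpec:
    # Declared at class scope: these live on the class object itself.
    fields_by_source_id: Optional[DefaultDict[int, List[PartitionField]]] = None
    field_list: Optional[List[PartitionField]] = None


class InitSpec:
    def __init__(self, fields: Optional[List[PartitionField]] = None):
        # Created per instance, alongside the arguments __init__ documents.
        self.field_list: List[PartitionField] = list(fields or [])
        self.fields_by_source_id: DefaultDict[int, List[PartitionField]] = defaultdict(list)


# Class-scope declarations are attributes of the class object...
print("field_list" in vars(ClassLevelSpec))  # True
# ...while attributes assigned in __init__ exist only on instances.
print("field_list" in vars(InitSpec))        # False
print("field_list" in vars(InitSpec()))      # True
```

With the second form, `dir(InitSpec)` stays free of the internal names, which is the noise concern raised above.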

python/src/iceberg/partition_spec.py

rdblue · 2021-10-28T22:24:28Z

python/src/iceberg/partition_spec.py

+        # TODO: Needs transform
+        pass
+
+    def _generate_unpartitioned_spec(self):


Why have a separate method for this? Shouldn't the unpartitioned spec be a constant somewhere in this file?

Sorry if I'm off here, but if this is just a helper for getting a commonly used instantiation of this class, shouldn't these just be default arguments to __init__() so a user gets this if they just do p_spec = PartitionSpec()?

Sam's idea sounds good to me. We probably don't need this in Python at all unless you think there is some reason to have a singleton unpartitioned spec, or if there is value in readability from using PartitionSpec.unpartitioned().
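A rough sketch of the default-argument idea, assuming a heavily simplified `PartitionSpec` whose fields are plain strings (the real class also carries a schema and `PartitionField` objects). `is_unpartitioned` is a hypothetical helper added here for illustration:

```python
from typing import List, Optional


class PartitionSpec:
    """Simplified, hypothetical spec used only to illustrate defaults."""

    def __init__(self, spec_id: int = 0, fields: Optional[List[str]] = None):
        self.spec_id = spec_id
        # Avoid a mutable default argument; store an immutable copy.
        self.fields = tuple(fields or ())

    def is_unpartitioned(self) -> bool:
        return len(self.fields) == 0


# With defaults in place, the bare constructor is the unpartitioned spec.
p_spec = PartitionSpec()
print(p_spec.is_unpartitioned())  # True
```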

rdblue · 2021-10-28T22:26:27Z

python/src/iceberg/partition_spec.py

+        )
+
+    def unpartitioned(self) -> PartitionSpec:
+        return self._generate_unpartitioned_spec()


I think this should either be a @staticmethod or a method in the package and not a method on PartitionSpec. Probably the latter. The only reason why it was a static method on the PartitionSpec class in Java is because you can't associate methods with packages directly in Java.

There's also an argument to be made for making this a @classmethod because it is a factory method. In that case it would be attached to the class.
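The two options mentioned here can be sketched side by side. This assumes a simplified `PartitionSpec` with string fields; `unpartitioned_spec` is a hypothetical name for the module-level variant:

```python
from typing import Tuple


class PartitionSpec:
    def __init__(self, spec_id: int = 0, fields: Tuple[str, ...] = ()):
        self.spec_id = spec_id
        self.fields = fields

    @classmethod
    def unpartitioned(cls) -> "PartitionSpec":
        # Factory attached to the class; using cls keeps it subclass-friendly.
        return cls(spec_id=0, fields=())


def unpartitioned_spec() -> PartitionSpec:
    # Module-level alternative: no class or instance is needed at the call site.
    return PartitionSpec(spec_id=0, fields=())


class CustomSpec(PartitionSpec):
    """Hypothetical subclass, to show what cls buys us."""


print(type(CustomSpec.unpartitioned()).__name__)  # CustomSpec
```

Note the quoted return annotation: inside the class body the name `PartitionSpec` is not bound yet, so an unquoted `-> PartitionSpec` needs `from __future__ import annotations` to avoid a NameError.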

rdblue · 2021-10-28T22:26:58Z

python/src/iceberg/partition_spec.py

+    def unpartitioned(self) -> PartitionSpec:
+        return self._generate_unpartitioned_spec()
+
+    def check_compatibility(self, spec: PartitionSpec, schema: Schema):


Same here. This is probably a method in the package, not attached to the class.

python/src/iceberg/partition_spec.py

rdblue · 2021-10-28T22:29:45Z

python/src/iceberg/partition_spec.py

+        spec = PartitionSpec(
+            self.schema, self.spec_id, self.fields, self.last_assigned_field_id
+        )
+        PartitionSpec().check_compatibility(spec, self.schema)


This should definitely not create a partition spec with only default values just to call check_compatibility.

+1 and I agree that check_compatibility should be a function scoped to the package. In general I feel that if self is not used then scoping to the package makes more sense.
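As a sketch of what a package-scoped function could look like: the signature below is simplified (a list of fields and a dict of schema column ids instead of the PR's `spec` and `schema`), and the single check shown is an assumption about one thing such a validation might cover, not a mirror of the Java logic:

```python
from typing import Dict, List, NamedTuple


class PartitionField(NamedTuple):
    source_id: int
    name: str


class ValidationException(Exception):
    """Hypothetical stand-in for the PR's validation_exception module."""


def check_compatibility(fields: List[PartitionField], schema_ids: Dict[int, str]) -> None:
    """Raise if a partition field refers to a column missing from the schema."""
    for field in fields:
        if field.source_id not in schema_ids:
            raise ValidationException(
                f"Cannot find source column for partition field: {field.name}"
            )


# No throwaway PartitionSpec() is needed just to reach the function.
check_compatibility([PartitionField(1, "id_bucket")], {1: "id"})
```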

python/src/iceberg/validation_exception.py

python/src/iceberg/partition_field.py

python/src/iceberg/schema.py

rdblue · 2021-10-28T22:37:46Z

python/src/iceberg/partition_spec.py

+            ):
+                index += 1
+                # TODO: Add transform check
+                return False


This looks suspicious to me. We should be careful to mirror what Java is doing.

python/tests/test_partition_field.py

python/src/iceberg/validation_exception.py

python/src/iceberg/schema.py

python/src/iceberg/partition_spec.py

samredai · 2021-10-29T01:20:38Z

python/src/iceberg/partition_spec.py

+
+        index = 0
+        for field in self.fields:
+            other_field: PartitionField = other.fields[index]


Instead of adding typing here, doesn't it make more sense to just add it to the function signature? If other is typed as PartitionSpec, then the type hint for fields is implied by the type hint in the PartitionSpec constructor, where there is:

... part_fields: List[PartitionField], ...
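A sketch of how this could read with the hint on the signature only. The field-by-field comparison is my assumption about the intent and would need to be checked against the Java implementation; `transform` is a plain string placeholder because transforms were not yet available in the Python API:

```python
from typing import List, NamedTuple


class PartitionField(NamedTuple):
    source_id: int
    name: str
    transform: str  # placeholder for a real transform object


class PartitionSpec:
    def __init__(self, part_fields: List[PartitionField]):
        self.fields: List[PartitionField] = list(part_fields)

    def compatible_with(self, other: "PartitionSpec") -> bool:
        if len(self.fields) != len(other.fields):
            return False
        # zip pairs fields positionally: no manual index, no per-variable hint.
        return all(
            mine.source_id == theirs.source_id
            and mine.name == theirs.name
            and mine.transform == theirs.transform
            for mine, theirs in zip(self.fields, other.fields)
        )
```

Because `other` is annotated as `PartitionSpec`, a type checker can infer that each `theirs` is a `PartitionField` without a local annotation.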

samredai · 2021-10-29T01:22:52Z

python/src/iceberg/partition_spec.py

+        spec = PartitionSpec(
+            self.schema, self.spec_id, self.fields, self.last_assigned_field_id
+        )
+        PartitionSpec().check_compatibility(spec, self.schema)


+1 and I agree that check_compatibility should be a function scoped to the package. In general I feel that if self is not used then scoping to the package makes more sense.

python/src/iceberg/schema.py

python/src/iceberg/partition_spec.py

python/src/iceberg/partition_field.py

jun-he · 2021-10-29T06:54:16Z

python/src/iceberg/partition_spec.py

+            return self._value
+
+
+class Builder(object):


Wondering if we can simplify it to avoid a Java kind of builder class.

I think that the builder translates well to python because you're constructing the object through an API that does checking while you build. This also gives us a good way to build partition specs by passing the builder around. I'd vote to keep it for now.

I like constructing objects in this kind of pattern, but I think we can achieve it in a more Pythonic way, without using a typical Java builder. For example,

def buildermethod(func):
    def wrapper(self, *args, **kwargs):
        func(self, *args, **kwargs)
        return self
    return wrapper

class PartitionSpec:
    def __init__(self):
        pass

    @buildermethod
    def with_schema(self, schema):
        # additional checks
        self._schema = schema

    ...

partition_spec = PartitionSpec().with_schema(schema)...

Also, in Python we won't be able to get the immutability that a Java builder offers.

That's not really how builders work. It looks more like refinement, but it modifies the original object. I think that will create confusion.

I'd prefer to stick with the builder API that we have already rather than trying to come up with a new pattern.

But we cannot enforce immutability of the original object in python and it is always mutable. So it misses the main benefit of builder pattern in Java. So I am not sure if there is a confusion for python users that the original object is expected to be immutable or not.

A builder class here seems to be a boilerplate class with an extra layer to prevent calling the target object's methods but it won't prevent the field in the target object being mutated.

There are also multiple other places in Iceberg that use the builder pattern. It is better to keep this consistent (either use a builder class everywhere or nowhere). I will start a thread in the Slack channel to get more opinions from the Python community about builder classes.

Discussion continued here on Slack: https://apache-iceberg.slack.com/archives/C029EE6HQ5D/p1635538598009300
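For reference, a sketch of the separate-builder shape being debated, assuming Python 3.7+ dataclasses and a simplified spec with string fields. `with_spec_id` and `add_field` are hypothetical method names. The point is that all mutation stays on the `Builder`, while the built object rejects ordinary attribute assignment:

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PartitionSpec:
    spec_id: int
    fields: Tuple[str, ...]


class Builder:
    def __init__(self) -> None:
        self._spec_id = 0
        self._fields: List[str] = []

    def with_spec_id(self, spec_id: int) -> "Builder":
        self._spec_id = spec_id
        return self

    def add_field(self, name: str) -> "Builder":
        # Validation happens while building, before any spec exists.
        if name in self._fields:
            raise ValueError(f"Duplicate partition field: {name}")
        self._fields.append(name)
        return self

    def build(self) -> PartitionSpec:
        return PartitionSpec(self._spec_id, tuple(self._fields))


spec = Builder().with_spec_id(1).add_field("day").build()
try:
    spec.spec_id = 2
except FrozenInstanceError:
    print("spec is read-only after build()")
```

A frozen dataclass only blocks normal attribute assignment; `object.__setattr__` can still bypass it, so this is a strong convention rather than the hard guarantee Java provides.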

nssalian · 2021-10-29T16:54:25Z

Thanks for the review. I'll address the comments in upcoming commits.

jun-he · 2021-10-29T23:20:05Z

python/src/iceberg/partition_field.py

+            )
+        return False
+
+    def __str__(self):


You may also want to add def __repr__ here.
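A small sketch of the two methods together, assuming a simplified `PartitionField` with three attributes; the string formats are illustrative choices, not the PR's:

```python
class PartitionField:
    def __init__(self, source_id: int, field_id: int, name: str):
        self.source_id = source_id
        self.field_id = field_id
        self.name = name

    def __str__(self) -> str:
        # Human-oriented summary.
        return f"{self.field_id}: {self.name} ({self.source_id})"

    def __repr__(self) -> str:
        # Unambiguous form that mirrors the constructor call.
        return (
            f"PartitionField(source_id={self.source_id!r}, "
            f"field_id={self.field_id!r}, name={self.name!r})"
        )


field = PartitionField(source_id=1, field_id=1000, name="id_bucket")
print(str(field))  # 1000: id_bucket (1)
print([field])     # containers use __repr__ for their elements
```

Without `__repr__`, the second print would show the default `<... object at 0x...>` form, which is what makes debugging output and test failures hard to read.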

python/src/iceberg/partition_field.py

python/src/iceberg/partition_spec.py

jun-he · 2021-10-29T23:31:28Z

python/src/iceberg/schema.py

+    alias_to_id: dict = None
+    id_to_field = {}
+    name_to_id: dict = None
+    lowercase_name_to_id: dict = None


IMO, those 4 fields can be included within __init__.
Also, if they are not needed in this change, we might consider adding them later, when we actually need them.

I think these are simply class variables that might be used for methods. They don't need to be part of the __init__ since they aren't initialized externally.
Let me fix the PR with the rest of the changes and see if we need the variables for this iteration of change. I'll remove them if not needed.

It seems this is declaring mutable variables as class attributes. This can be problematic because class attributes are shared by every instance and persist for the whole Python session. For example:

class Spec:
    id: str = None
    name: str = None

spec1 = Spec()
spec2 = Spec()
assert(spec1.id is None)

# change class variable from an instance
spec2.__class__.id = "spec2_id"
assert(spec1.id == "spec2_id")
assert(Spec.id == "spec2_id")

spec3 = Spec()
assert(spec3.id == "spec2_id")

Spec.id = "class_spec_id"
assert(spec1.id == "class_spec_id")
assert(spec2.id == "class_spec_id")
assert(spec3.id == "class_spec_id")

In the above example, an instance can change the class variable and it affects all other current and future instances. This can lead to surprising behavior.

To have instance-scoped variables, we can set them inside __init__ without exposing them in the signature, i.e.

def __init__(self, struct: StructType, schema_id: int, identifier_field_ids: [int]):
    self._struct = struct
    self._schema_id = schema_id
    self._identifier_field_ids = identifier_field_ids
    # not initialized by the user externally
    self.alias_to_id: dict = None
    self.id_to_field = {}
    self.name_to_id: dict = None
    self.lowercase_name_to_id: dict = None

Agreed. This is the approach that makes sense. Fixing in the upcoming commits.
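The `id_to_field = {}` line is the riskiest of the four, because a dict can be mutated in place without ever touching `__class__`. A minimal sketch, with `SharedSchema` and `InstanceSchema` as hypothetical names:

```python
class SharedSchema:
    # Class attribute: one dict object shared by every instance.
    id_to_field = {}


class InstanceSchema:
    def __init__(self):
        # Instance attribute: a fresh dict per object.
        self.id_to_field = {}


a, b = SharedSchema(), SharedSchema()
a.id_to_field[1] = "id"   # mutates the shared dict in place
print(b.id_to_field)      # {1: 'id'}

c, d = InstanceSchema(), InstanceSchema()
c.id_to_field[1] = "id"
print(d.id_to_field)      # {}
```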

jun-he · 2021-10-29T23:34:05Z

Also, should we always implement __repr__ in those classes?

Python: Add basic Schema and visitors to types

xinbinhuang · 2021-11-06T23:15:26Z

python/src/iceberg/partition_spec.py

+    def get_fields_by_source_id(self, field_id: int) -> List[PartitionField]:
+        return self._generate_fields_by_source_id().get(field_id, None)


Hi, I'm new to Iceberg, but this signature looks confusing. Are source_id and field_id the same thing?

Suggested change

-    def get_fields_by_source_id(self, field_id: int) -> List[PartitionField]:
-        return self._generate_fields_by_source_id().get(field_id, None)
+    def get_fields_by_source_id(self, field_id: int) -> Optional[List[PartitionField]]:
+        return self._generate_fields_by_source_id().get(field_id, None)

The return type should be Optional[List[PartitionField]] instead if the method can return None.

I'm confused about what _generate_fields_by_source_id tries to achieve. Why do we need to generate fields_source_to_field_dict every time it's called when self.fields_by_source_id is None?
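On the `Optional` point, it may help to see why `.get()` on a `defaultdict` can return `None` at all. A standalone sketch using plain strings in place of `PartitionField` objects:

```python
from collections import defaultdict

fields_by_source_id = defaultdict(list)
fields_by_source_id[1].append("id_bucket")

# .get() bypasses default_factory, so a missing key yields None...
missing = fields_by_source_id.get(99, None)
# ...whereas indexing creates, stores, and returns an empty list.
created = fields_by_source_id[42]

print(missing)                    # None
print(created)                    # []
print(99 in fields_by_source_id)  # False
print(42 in fields_by_source_id)  # True
```

So callers either need the `Optional[...]` annotation, or the method could index the dict and always return a (possibly empty) list.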

xinbinhuang · 2021-11-06T23:21:48Z

python/src/iceberg/partition_spec.py

+            fields_source_to_field_dict = defaultdict(list)
+            for field in self.fields:
+                fields_source_to_field_dict[field.source_id] = [field]


Can a single source_id map to multiple fields?
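If the answer is yes (which I'm assuming here: one source column backing several partition fields), then assigning `[field]` silently keeps only the last field per source id. A sketch that appends instead and also builds the index only once, using a simplified `PartitionField` and a hypothetical `_fields_by_source_id` cache attribute:

```python
from collections import defaultdict
from typing import DefaultDict, List, NamedTuple, Optional


class PartitionField(NamedTuple):
    source_id: int
    name: str


class PartitionSpec:
    def __init__(self, fields: List[PartitionField]):
        self.fields = list(fields)
        self._fields_by_source_id: Optional[DefaultDict[int, List[PartitionField]]] = None

    def _generate_fields_by_source_id(self) -> DefaultDict[int, List[PartitionField]]:
        # Build the index on first use, then reuse it on later calls.
        if self._fields_by_source_id is None:
            index: DefaultDict[int, List[PartitionField]] = defaultdict(list)
            for field in self.fields:
                # append keeps every field; assigning [field] would drop earlier ones.
                index[field.source_id].append(field)
            self._fields_by_source_id = index
        return self._fields_by_source_id

    def get_fields_by_source_id(self, source_id: int) -> Optional[List[PartitionField]]:
        return self._generate_fields_by_source_id().get(source_id)


spec = PartitionSpec([PartitionField(1, "ts_day"), PartitionField(1, "ts_hour")])
print([f.name for f in spec.get_fields_by_source_id(1)])  # ['ts_day', 'ts_hour']
```

The parameter is named `source_id` here to match the method name, which also addresses the naming question raised earlier in this review.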

nssalian · 2021-11-10T18:28:43Z

Added some fixes. Still some more left. Will get to it soon.

jun-he

Additionally, can you rebase and run build again?

jun-he · 2022-01-24T22:21:41Z

python/src/iceberg/partitioning.py

+        return True
+
+
+class Builder:


Should we make Builder an inner class of PartitionSpec?

jun-he · 2022-01-24T22:22:00Z

python/src/iceberg/partitioning.py

+        self.spec_id = 0
+        self.last_assigned_field_id = PARTITION_DATA_ID_START - 1
+
+    def builder_for(self, schema):


nit: add type for schema

jun-he · 2022-01-24T22:22:20Z

python/src/iceberg/partitioning.py

+    def builder_for(self, schema):
+        return Builder(schema=schema)
+
+    def next_field_id(self):


add type info for the return result.
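A sketch of both nits applied. `Schema` is an empty placeholder, the value of `PARTITION_DATA_ID_START` is an assumption for illustration, the body of `next_field_id` is my guess at the intent, and `builder_for` is shown as a module-level function, which is a choice made here rather than what the PR does:

```python
PARTITION_DATA_ID_START = 1000  # assumed value, for illustration only


class Schema:
    """Hypothetical placeholder for the PR's Schema class."""


class Builder:
    def __init__(self, schema: Schema) -> None:
        self.schema = schema
        self.spec_id = 0
        self.last_assigned_field_id = PARTITION_DATA_ID_START - 1

    def next_field_id(self) -> int:
        # Advance the counter and hand out the newly assigned id.
        self.last_assigned_field_id += 1
        return self.last_assigned_field_id


def builder_for(schema: Schema) -> Builder:
    return Builder(schema=schema)


builder = builder_for(Schema())
print(builder.next_field_id())  # 1000
```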

jun-he · 2022-01-24T22:25:44Z

@nssalian BTW, you may also check if the unit test coverage is good enough to pass the threshold.

rdblue · 2022-06-29T17:28:33Z

I'm going to close this, since PartitionSpec was added in #4717.

PartitionSpec python implementation.WIP

22178f2

github-actions bot added the python label Oct 28, 2021

nssalian changed the title ~~PartitionSpec python API implementation~~ [Python] #3228: PartitionSpec python API implementation Oct 28, 2021

Python: Add basic Schema and visitors to types

e09d261

nssalian added 2 commits November 1, 2021 17:47

Merge pull request #1 from rdblue/py-partition-spec

f6da13a

Python: Add basic Schema and visitors to types

Merge branch 'master' into py-partition-spec

408f019

PR comments fixes. WIP

afcf346

PR fixes

ce189ca

nssalian requested review from jun-he, samredai and rdblue November 23, 2021 02:07

dramaticlly mentioned this pull request Apr 25, 2022

Python: PartitionSpec Construction #4631

Closed

rdblue closed this Jun 29, 2022
