-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Python] #3228: PartitionSpec python API implementation #3407
Conversation
Tests pass locally for 3.7 and 3.9.5. I'll fix the line nits in upcoming commits, waiting for review comments first. |
python/src/iceberg/partition_spec.py
Outdated
|
||
class PartitionSpec(object): | ||
fields_by_source_id: defaultdict[list] = None | ||
field_list: List[PartitionField] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you help me understand the scope of these? Are the essentially instance variable declarations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's right. Internal to the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These aren't needed until an individual instance is initialized right? I think they should be arguments to __init__()
. A few reasons being that it's unnecessary noise when doing dir(PartitionSpec)
since you don't expect a user to set/get them at the class scope, and also the help()
documentation could be misleading as most users will refer to __init__()
as a comprehensive requirement for initialization and these would be missing there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar thoughts as #3407 (comment)
python/src/iceberg/partition_spec.py
Outdated
# TODO: Needs transform | ||
pass | ||
|
||
def _generate_unpartitioned_spec(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why have a separate method for this? Shouldn't the unpartitioned spec be a constant somewhere in this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if I'm off here, but if this is just a helper for getting a commonly used instantiation of this class, shouldn't these just be default arguments to __init__()
so a user gets this if they just do p_spec = PartitionSpec()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sam's idea sounds good to me. We probably don't need this in Python at all unless you think there is some reason to have a singleton unpartitioned spec, or if there is value in readability from using PartitionSpec.unpartitioned()
.
python/src/iceberg/partition_spec.py
Outdated
) | ||
|
||
def unpartitioned(self) -> PartitionSpec: | ||
return self._generate_unpartitioned_spec() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should either be a @staticmethod
or a method in the package and not a method on PartitionSpec
. Probably the latter. The only reason why it was a static method on the PartitionSpec
class in Java is because you can't associate methods with packages directly in Java.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's also an argument to be made for making this a @classmethod
because it is a factory method. In that case it would be attached to the class.
python/src/iceberg/partition_spec.py
Outdated
def unpartitioned(self) -> PartitionSpec: | ||
return self._generate_unpartitioned_spec() | ||
|
||
def check_compatibility(self, spec: PartitionSpec, schema: Schema): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. This is probably a method in the package, not attached to the class.
python/src/iceberg/partition_spec.py
Outdated
spec = PartitionSpec( | ||
self.schema, self.spec_id, self.fields, self.last_assigned_field_id | ||
) | ||
PartitionSpec().check_compatibility(spec, self.schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should definitely not create a partition spec with only default values just to call check_compatibility
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 and I agree that check_compatibility
should be a function scoped to the package. In general I feel that if self
is not used then scoping to the package makes more sense.
python/src/iceberg/partition_spec.py
Outdated
): | ||
index += 1 | ||
# TODO: Add transform check | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks suspicious to me. We should be careful to mirror what Java is doing.
python/src/iceberg/partition_spec.py
Outdated
|
||
index = 0 | ||
for field in self.fields: | ||
other_field: PartitionField = other.fields[index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding typing here, doesn't it make more sense to just add it to the function signature? If other
is typed to PartitionSpec
than the type hint for fields is implied by the type hint in the PartitionSpec
constructor where there is:
...
part_fields: List[PartitionField],
...
python/src/iceberg/partition_spec.py
Outdated
spec = PartitionSpec( | ||
self.schema, self.spec_id, self.fields, self.last_assigned_field_id | ||
) | ||
PartitionSpec().check_compatibility(spec, self.schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 and I agree that check_compatibility
should be a function scoped to the package. In general I feel that if self
is not used then scoping to the package makes more sense.
python/src/iceberg/partition_spec.py
Outdated
return self._value | ||
|
||
|
||
class Builder(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if we can simplify it to avoid a Java kind of builder class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the builder translates well to python because you're constructing the object through an API that does checking while you build. This also gives us a good way to build partition specs by passing the builder around. I'd vote to keep it for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like constructing objects in this kind of pattern but I think we can achieve it without using typical Java builder in a more pythonic way. For example,
def buildermethod(func):
def wrapper(self, *args, **kwargs):
func(self, *args, **kwargs)
return self
return wrapper
class PartitionSpec:
def __init__(self):
pass
@buildermethod
def with_schema(self, schema):
# additional checks
self_schema = schema
...
partition_spec = PartitionSpec().with_schema(schama)...
Also, we won't be able to get immutability that Java builder offers in Python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not really how builders work. It looks more like refinement, but it modifies the original object. I think that will create confusion.
I'd prefer to stick with the builder API that we have already rather than trying to come up with a new pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we cannot enforce immutability of the original object in python and it is always mutable. So it misses the main benefit of builder pattern in Java. So I am not sure if there is a confusion for python users that the original object is expected to be immutable or not.
A builder class here seems to be a boilerplate class with an extra layer to prevent calling the target object's methods but it won't prevent the field in the target object being mutated.
There are also multiple other places using builder pattern in iceberg. It is better that we keep this consistent (either use or not use builder class). I will start a thread in the slack channel to get more options about it from python community about builder class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussion continued here on Slack: https://apache-iceberg.slack.com/archives/C029EE6HQ5D/p1635538598009300
Thanks for the review. I'll address the comments in upcoming commits. |
) | ||
return False | ||
|
||
def __str__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may also add def __repr__
as well.
python/src/iceberg/schema.py
Outdated
alias_to_id: dict = None | ||
id_to_field = {} | ||
name_to_id: dict = None | ||
lowercase_name_to_id: dict = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, those 4 fields can be included within __init__
.
Also, if they are not needed in this change, we might consider to add them in the future when we actually need them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are simply class variables that might be used for methods. They don't need to be part of the __init__
since they aren't initialized externally.
Let me fix the PR with the rest of the changes and see if we need the variables for this iteration of change. I'll remove them if not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that here is declaring mutable variables as class attributes. This can be problematic because class attributes are mutable and persist across the same python session. i.e.
class Spec:
id: str = None
name: str = None
spec1 = Spec()
spec2 = Spec()
assert(spec1.id is None)
# change class variable from an instance
spec2.__class__.id = "spec2_id"
assert(spec1.id == "spec2_id")
assert(Spec.id == "spec2_id")
spec3 = Spec()
assert(spec3.id == "spec2_id")
Spec.id = "class_spec_id"
assert(spec1.id == "class_spec_id")
assert(spec2.id == "class_spec_id")
assert(spec3.id == "class_spec_id")
In the above example, an instance can change the class variable and it affects all other current and future instances. This can lead to surprising behavior.
To have instance scope only variables, we can put them inside the init without exposing them to the signature. i.e.
def __init__(self, struct: StructType, schema_id: int, identifier_field_ids: [int]):
self._struct = struct
self._schema_id = schema_id
self._identifier_field_ids = identifier_field_ids
# not initialized by the user externally
self.alias_to_id: dict = None
self.id_to_field = {}
self.name_to_id: dict = None
self.lowercase_name_to_id: dict = None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. This is the approach that makes sense. Fixing in the upcoming commits.
Also, should we always implement |
Python: Add basic Schema and visitors to types
python/src/iceberg/partition_spec.py
Outdated
def get_fields_by_source_id(self, field_id: int) -> List[PartitionField]: | ||
return self._generate_fields_by_source_id().get(field_id, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi I'm new to iceberg but this signature looks confusing. Is source_id
and field_id
the same thing?
def get_fields_by_source_id(self, field_id: int) -> List[PartitionField]: | |
return self._generate_fields_by_source_id().get(field_id, None) | |
def get_fields_by_source_id(self, field_id: int) -> Optional[List[PartitionField]]: | |
return self._generate_fields_by_source_id().get(field_id, None) |
The returns should be Optional[List[PartitionField]]
instead if the the method will returns None
.
I'm confused about what _generate_fields_by_source_id
tries to achieve. Why do we need to generate fields_source_to_field_dict
every time it's called when self.fields_by_source_id is None
?
python/src/iceberg/partition_spec.py
Outdated
fields_source_to_field_dict = defaultdict(list) | ||
for field in self.fields: | ||
fields_source_to_field_dict[field.source_id] = [field] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can a single source_id
map to multiple fields
?
Added some fixes. Still some more left. Will get to it soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, can you rebase and run build again?
return True | ||
|
||
|
||
class Builder: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make the Builder to be the inner class of PartitionSpec class?
self.spec_id = 0 | ||
self.last_assigned_field_id = PARTITION_DATA_ID_START - 1 | ||
|
||
def builder_for(self, schema): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add type for schema
def builder_for(self, schema): | ||
return Builder(schema=schema) | ||
|
||
def next_field_id(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add type info for the return result.
@nssalian BTW, you may also check if the unit test coverage is good enough to pass the threshold. |
I'm going to close this, since PartitionSpec was added in #4717. |
In this PR:
partition_field
,schema
andpartition_spec
for [Python] support partition spec in iceberg python library #3228Most items are pending since Transforms and TypeUtil haven't been implemented in the Python API
CC: @jun-he, @rdblue , @samredai to help review and understand how to proceed.