Python: Add PartitionSpec #4717
@@ -37,6 +37,7 @@ NaN
 nan
 NestedField
 nullability
+PartitionField
 pragma
 PrimitiveType
 pyarrow
@@ -14,9 +14,16 @@
 # KIND, either express or implied. See the License for the
 # specific language governing permissions and limitations
 # under the License.
+from dataclasses import dataclass, field
+from typing import Dict, List, Tuple
+
+from iceberg.schema import Schema
+from iceberg.transforms import Transform
+
+_PARTITION_DATA_ID_START: int = 1000
+
+
+@dataclass(frozen=True)
 class PartitionField:
     """
     PartitionField is a single element with name and unique id,

@@ -29,38 +36,81 @@ class PartitionField:
     name(str): The name of this partition field
     """

-    def __init__(self, source_id: int, field_id: int, transform: Transform, name: str):
-        self._source_id = source_id
-        self._field_id = field_id
-        self._transform = transform
-        self._name = name
+    source_id: int
+    field_id: int
+    transform: Transform
+    name: str

     def __str__(self):
         return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"

-    @property
-    def source_id(self) -> int:
-        return self._source_id
-
-    @property
-    def field_id(self) -> int:
-        return self._field_id
-
-    @property
-    def name(self) -> str:
-        return self._name
-
-    @property
-    def transform(self) -> Transform:
-        return self._transform
+
+@dataclass(eq=False, frozen=True)
+class PartitionSpec:
+    """
+    PartitionSpec captures the transformation from table data to partition values
+
+    Attributes:
+        schema(Schema): the schema of the data table
+        spec_id(int): any change to PartitionSpec will produce a new spec_id
+        fields(List[PartitionField]): list of partition fields to produce partition values
+        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
+    """
+
+    schema: Schema
+    spec_id: int
+    fields: Tuple[PartitionField, ...]
+    last_assigned_field_id: int
+    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False, repr=False)
+
+    def __post_init__(self):
+        source_id_to_fields_map = dict()
+        for partition_field in self.fields:
+            source_column = self.schema.find_column_name(partition_field.source_id)
+            if not source_column:
+                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
+            existing = source_id_to_fields_map.get(partition_field.source_id, [])
+            existing.append(partition_field)
+            source_id_to_fields_map[partition_field.source_id] = existing
+        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)

     def __eq__(self, other):
-        return (
-            self.field_id == other.field_id
-            and self.source_id == other.source_id
-            and self.name == other.name
-            and self.transform == other.transform
-        )
+        """
+        Produce a boolean to return True if two objects are considered equal
+
+        Note:
+            Equality of PartitionSpec is determined by spec_id and partition fields only
+        """
+        if not isinstance(other, PartitionSpec):
+            return False
+        return self.spec_id == other.spec_id and self.fields == other.fields

     def __str__(self):
-        return f"{self.field_id}: {self.name}: {self.transform}({self.source_id})"
-
-    def __repr__(self):
-        return f"PartitionField(field_id={self.field_id}, name={self.name}, transform={repr(self.transform)}, source_id={self.source_id})"
+        """
+        Produce a human-readable string representation of PartitionSpec
+
+        Note:
+            Only include the list of partition fields in the PartitionSpec's string representation
+        """
+        result_str = "["
+        if self.fields:
+            result_str += "\n " + "\n ".join([str(field) for field in self.fields]) + "\n"
+        result_str += "]"
+        return result_str
+
+    def is_unpartitioned(self) -> bool:
+        return len(self.fields) < 1
+
+    def fields_by_source_id(self, field_id: int) -> List[PartitionField]:
+        return self.source_id_to_fields_map[field_id]
+
+    def compatible_with(self, other: "PartitionSpec") -> bool:
+        """
+        Produce a boolean to return True if two PartitionSpec are considered compatible
+        """
+        return all(
+            this_field.source_id == that_field.source_id
+            and this_field.transform == that_field.transform
+            and this_field.name == that_field.name
+            for this_field, that_field in zip(self.fields, other.fields)
+        )

Review comment on the PartitionSpec docstring:

Would be great if we can also add an example: https://github.com/apache/iceberg/blob/master/python/src/iceberg/types.py#L83-L87 This will also test the `__str__` method, as the examples are executed as tests as well 👍🏻

Reply:

Thanks Fokko, I think this is where I want to leave out the example, as I intended to construct the PartitionSpec via a dedicated builder (which is not included in this PR), since that is the desired way to construct a PartitionSpec, with convenient transform helper methods and equipped with validation. I can include the example there, what do you think?
Instead of implementing `__eq__` and `__hash__` ourselves, we could leverage the dataclass library. If we set `eq=True` and `frozen=True` (which makes it immutable, which is also nice), then we get hash automatically:

If `eq` and `frozen` are both true, by default dataclass() will generate a `__hash__()` method for you. If `eq` is true and `frozen` is false, `__hash__()` will be set to `None`, marking it unhashable (which it is, since it is mutable). If `eq` is false, `__hash__()` will be left untouched, meaning the `__hash__()` method of the superclass will be used (if the superclass is `object`, this means it will fall back to id-based hashing).

More information here: https://docs.python.org/3/library/dataclasses.html
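The rules quoted above can be checked directly. A short sketch with throwaway `Point` classes (not part of the PR):

```python
from dataclasses import dataclass


@dataclass(eq=True, frozen=True)
class FrozenPoint:
    x: int
    y: int


@dataclass(eq=True, frozen=False)
class MutablePoint:
    x: int
    y: int


# eq=True + frozen=True: a field-based __hash__ is generated,
# so two equal values collapse to a single set entry.
assert len({FrozenPoint(1, 2), FrozenPoint(1, 2)}) == 1

# eq=True + frozen=False: __hash__ is set to None, so instances are unhashable.
assert MutablePoint.__hash__ is None
try:
    hash(MutablePoint(1, 2))
except TypeError as e:
    print(e)  # unhashable type: 'MutablePoint'
```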
Yeah, I agree: PartitionField is an immutable class after construction, so a dataclass with both `eq` and `frozen` sounds fair to me. For reference, this is what the immutable PartitionField would look like, with all test cases passing (a small ordering change on `repr`, but I think the default one is very close to what we have today in the Java impl).

On the other side, I think the biggest benefit of the dataclass is the `__post_init__` method, which allows for Java-builder-pattern-equivalent processing when we build the PartitionSpec. There is a collection of validations that need to happen, and I am discussing this with @samredai in #4631 (comment).

From what I can tell, we will need a `PartitionSpecBuilder` class with a convenient way to construct the PartitionSpec, but we also want to make sure we avoid duplicating the builder logic in an overly complex init method for PartitionSpec.
@dramaticlly that's an interesting idea I hadn't thought of. A big argument for using the builder pattern was that we wanted `PartitionSpec` to be immutable, which would require us to include a ton of validation logic (everything that would be in a builder) in the `__init__` method. If I understand your suggestion, using `__post_init__` would allow us to have a typical init method, but then include the builder-type validation logic in `__post_init__`, which would fail the initialization of any invalid `PartitionSpec`. cc: @rdblue what do you think?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the plan to use `@dataclass`? I like that idea, but I won't hold off on reviewing if we want to get it in like this first.
We can also go full pythonic and bump the dataclass to a pydantic model. With pydantic you can annotate the fields with validators: https://pydantic-docs.helpmanual.io/usage/validators/

We could use the generated OpenAPI classes as the base classes and extend from those: https://github.com/apache/iceberg/pull/4858/files#diff-4f32e455c8da9fc5dc641048dc398741b72e928f359bfb9e5ef3640e7d32873e

This also allows us to add validation. For example, the `BaseUserModel` is the one generated from open-api, and the `UserModel` is the one extended with all the (convenience) methods attached to it.
I really like this idea of wrapping the classes generated from the OpenAPI spec. The class naming here might be tricky. Module namespacing allows us to re-use the same name if we want. We shouldn't expect users to import from the openapi module directly, so we shouldn't need to worry about naming conflicts, right? Maybe we should name it `_openapi` just to be super clear about that.
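A rough sketch of the wrapping idea (the `BasePartitionSpec` name and the method body are hypothetical, not the actual generated code): the generated model stays untouched, and the public class extends it with convenience methods while re-using the same name.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BasePartitionSpec:
    # Stand-in for a class generated from the OpenAPI spec; in the proposal
    # this would live in a private module such as iceberg._openapi.
    spec_id: int
    fields: Tuple[str, ...]


class PartitionSpec(BasePartitionSpec):
    # Public class exported under the same name; users never import the
    # generated base directly, so there is no naming conflict.
    def is_unpartitioned(self) -> bool:
        # Convenience method layered on top of the generated fields.
        return len(self.fields) == 0


spec = PartitionSpec(spec_id=0, fields=())
print(spec.spec_id, spec.is_unpartitioned())  # 0 True
```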
I'll hold off on commenting too much until I have a chance to look into the pydantic project as well as look at the other PR.

My first ask would be: how many dependencies are we bringing in if we add pydantic? I know that some folks were concerned about adding very many external Python dependencies, so as not to conflict with their own dependencies, but if the benefit is very large I'm not personally opposed to it (I believe it was somebody / some group from Netflix that originally requested we keep the number of required dependencies down).

But validation, either via a library or via a common pattern we settle on, is something that would be very beneficial.
Great point, I just checked the dependencies for pydantic, and the good news is that all it requires is `typing-extensions`, which is probably just for some Python 3.7 backports. That will probably even get dropped at some point when they no longer support 3.7.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not consider this blocking for now.