Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] #3228: PartitionSpec python API implementation #3407

Closed
wants to merge 6 commits into from

Conversation

nssalian
Copy link
Contributor

@nssalian nssalian commented Oct 28, 2021

In this PR:

  1. Added the initial implementation of partition_field, schema and partition_spec for [Python] support partition spec in iceberg python library #3228
  2. Formatted using black.

Most items are pending since Transforms and TypeUtil haven't been implemented in the Python API

CC: @jun-he, @rdblue , @samredai to help review and understand how to proceed.

@nssalian nssalian changed the title PartitionSpec python API implementation [Python] #3228: PartitionSpec python API implementation Oct 28, 2021
@nssalian
Copy link
Contributor Author

Tests pass locally for 3.7 and 3.9.5. I'll fix the line nits in upcoming commits, waiting for review comments first.


class PartitionSpec(object):
fields_by_source_id: defaultdict[list] = None
field_list: List[PartitionField] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand the scope of these? Are the essentially instance variable declarations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right. Internal to the class.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These aren't needed until an individual instance is initialized right? I think they should be arguments to __init__(). A few reasons being that it's unnecessary noise when doing dir(PartitionSpec) since you don't expect a user to set/get them at the class scope, and also the help() documentation could be misleading as most users will refer to __init__() as a comprehensive requirement for initialization and these would be missing there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar thoughts as #3407 (comment)

# TODO: Needs transform
pass

def _generate_unpartitioned_spec(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have a separate method for this? Shouldn't the unpartitioned spec be a constant somewhere in this file?

Copy link
Collaborator

@samredai samredai Oct 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I'm off here, but if this is just a helper for getting a commonly used instantiation of this class, shouldn't these just be default arguments to __init__() so a user gets this if they just do p_spec = PartitionSpec()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sam's idea sounds good to me. We probably don't need this in Python at all unless you think there is some reason to have a singleton unpartitioned spec, or if there is value in readability from using PartitionSpec.unpartitioned().

)

def unpartitioned(self) -> PartitionSpec:
return self._generate_unpartitioned_spec()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should either be a @staticmethod or a method in the package and not a method on PartitionSpec. Probably the latter. The only reason why it was a static method on the PartitionSpec class in Java is because you can't associate methods with packages directly in Java.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also an argument to be made for making this a @classmethod because it is a factory method. In that case it would be attached to the class.

def unpartitioned(self) -> PartitionSpec:
return self._generate_unpartitioned_spec()

def check_compatibility(self, spec: PartitionSpec, schema: Schema):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. This is probably a method in the package, not attached to the class.

spec = PartitionSpec(
self.schema, self.spec_id, self.fields, self.last_assigned_field_id
)
PartitionSpec().check_compatibility(spec, self.schema)
Copy link
Contributor

@rdblue rdblue Oct 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should definitely not create a partition spec with only default values just to call check_compatibility.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 and I agree that check_compatibility should be a function scoped to the package. In general I feel that if self is not used then scoping to the package makes more sense.

):
index += 1
# TODO: Add transform check
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks suspicious to me. We should be careful to mirror what Java is doing.

python/tests/test_partition_field.py Outdated Show resolved Hide resolved
python/tests/test_partition_field.py Outdated Show resolved Hide resolved
python/src/iceberg/validation_exception.py Outdated Show resolved Hide resolved
python/src/iceberg/validation_exception.py Outdated Show resolved Hide resolved
python/src/iceberg/schema.py Outdated Show resolved Hide resolved
python/src/iceberg/partition_spec.py Outdated Show resolved Hide resolved

index = 0
for field in self.fields:
other_field: PartitionField = other.fields[index]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding typing here, doesn't it make more sense to just add it to the function signature? If other is typed to PartitionSpec than the type hint for fields is implied by the type hint in the PartitionSpec constructor where there is:

...
part_fields: List[PartitionField],
...

spec = PartitionSpec(
self.schema, self.spec_id, self.fields, self.last_assigned_field_id
)
PartitionSpec().check_compatibility(spec, self.schema)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 and I agree that check_compatibility should be a function scoped to the package. In general I feel that if self is not used then scoping to the package makes more sense.

python/src/iceberg/schema.py Outdated Show resolved Hide resolved
python/src/iceberg/schema.py Outdated Show resolved Hide resolved
python/src/iceberg/partition_field.py Outdated Show resolved Hide resolved
return self._value


class Builder(object):
Copy link
Collaborator

@jun-he jun-he Oct 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we can simplify it to avoid a Java kind of builder class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the builder translates well to python because you're constructing the object through an API that does checking while you build. This also gives us a good way to build partition specs by passing the builder around. I'd vote to keep it for now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like constructing objects in this kind of pattern but I think we can achieve it without using typical Java builder in a more pythonic way. For example,

def buildermethod(func):
  def wrapper(self, *args, **kwargs):
    func(self, *args, **kwargs)
    return self
  return wrapper

class PartitionSpec:
  def __init__(self):
    pass

  @buildermethod
  def with_schema(self, schema):
    # additional checks
    self_schema = schema
...

partition_spec = PartitionSpec().with_schema(schama)...

Also, we won't be able to get immutability that Java builder offers in Python.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not really how builders work. It looks more like refinement, but it modifies the original object. I think that will create confusion.

I'd prefer to stick with the builder API that we have already rather than trying to come up with a new pattern.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we cannot enforce immutability of the original object in python and it is always mutable. So it misses the main benefit of builder pattern in Java. So I am not sure if there is a confusion for python users that the original object is expected to be immutable or not.

A builder class here seems to be a boilerplate class with an extra layer to prevent calling the target object's methods but it won't prevent the field in the target object being mutated.

There are also multiple other places using builder pattern in iceberg. It is better that we keep this consistent (either use or not use builder class). I will start a thread in the slack channel to get more options about it from python community about builder class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nssalian
Copy link
Contributor Author

Thanks for the review. I'll address the comments in upcoming commits.

)
return False

def __str__(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may also add def __repr__ as well.

alias_to_id: dict = None
id_to_field = {}
name_to_id: dict = None
lowercase_name_to_id: dict = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, those 4 fields can be included within __init__.
Also, if they are not needed in this change, we might consider to add them in the future when we actually need them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are simply class variables that might be used for methods. They don't need to be part of the __init__ since they aren't initialized externally.
Let me fix the PR with the rest of the changes and see if we need the variables for this iteration of change. I'll remove them if not needed.

Copy link
Contributor

@xinbinhuang xinbinhuang Nov 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that here is declaring mutable variables as class attributes. This can be problematic because class attributes are mutable and persist across the same python session. i.e.

class Spec:
    id: str = None
    name: str = None


spec1 = Spec()
spec2 = Spec()

assert(spec1.id is None)

# change class variable from an instance
spec2.__class__.id = "spec2_id"
assert(spec1.id == "spec2_id")
assert(Spec.id == "spec2_id")

spec3 = Spec() 
assert(spec3.id == "spec2_id")

Spec.id = "class_spec_id"
assert(spec1.id == "class_spec_id")
assert(spec2.id == "class_spec_id")
assert(spec3.id == "class_spec_id")

In the above example, an instance can change the class variable and it affects all other current and future instances. This can lead to surprising behavior.

To have instance scope only variables, we can put them inside the init without exposing them to the signature. i.e.

   def __init__(self, struct: StructType, schema_id: int, identifier_field_ids: [int]):
        self._struct = struct
        self._schema_id = schema_id
        self._identifier_field_ids = identifier_field_ids
        # not initialized by the user externally
        self.alias_to_id: dict = None
        self.id_to_field = {}
        self.name_to_id: dict = None
        self.lowercase_name_to_id: dict = None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. This is the approach that makes sense. Fixing in the upcoming commits.

@jun-he
Copy link
Collaborator

jun-he commented Oct 29, 2021

Also, should we always implement __repr__ in those classes?

Comment on lines 72 to 73
def get_fields_by_source_id(self, field_id: int) -> List[PartitionField]:
return self._generate_fields_by_source_id().get(field_id, None)
Copy link
Contributor

@xinbinhuang xinbinhuang Nov 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi I'm new to iceberg but this signature looks confusing. Is source_id and field_id the same thing?

Suggested change
def get_fields_by_source_id(self, field_id: int) -> List[PartitionField]:
return self._generate_fields_by_source_id().get(field_id, None)
def get_fields_by_source_id(self, field_id: int) -> Optional[List[PartitionField]]:
return self._generate_fields_by_source_id().get(field_id, None)

The returns should be Optional[List[PartitionField]] instead if the the method will returns None.

I'm confused about what _generate_fields_by_source_id tries to achieve. Why do we need to generate fields_source_to_field_dict every time it's called when self.fields_by_source_id is None?

Comment on lines 66 to 68
fields_source_to_field_dict = defaultdict(list)
for field in self.fields:
fields_source_to_field_dict[field.source_id] = [field]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a single source_id map to multiple fields?

@nssalian
Copy link
Contributor Author

Added some fixes. Still some more left. Will get to it soon.

Copy link
Collaborator

@jun-he jun-he left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, can you rebase and run build again?

return True


class Builder:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make the Builder to be the inner class of PartitionSpec class?

self.spec_id = 0
self.last_assigned_field_id = PARTITION_DATA_ID_START - 1

def builder_for(self, schema):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add type for schema

def builder_for(self, schema):
return Builder(schema=schema)

def next_field_id(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add type info for the return result.

@jun-he
Copy link
Collaborator

jun-he commented Jan 24, 2022

@nssalian BTW, you may also check if the unit test coverage is good enough to pass the threshold.

@rdblue
Copy link
Contributor

rdblue commented Jun 29, 2022

I'm going to close this, since PartitionSpec was added in #4717.

@rdblue rdblue closed this Jun 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants