Python: Add generated classes from OpenAPI spec #4858
Conversation
Force-pushed from 64a1f7f to b7f5143
This makes it type-safe to generate the classes for the REST Catalog spec
This looks great! It would be good to devote a section of the README to this, and maybe some basics on how it should be incorporated into the codebase.
- working-directory: ./python
  run: |
    make generate-openapi
    if ! git diff --exit-code src/iceberg/openapi/rest_catalog.py; then
Awesome!
class PartitionField(BaseModel):
    field_id: Optional[int] = Field(None, alias='field-id')
I know Python doesn't leave us the option here to use `field-id` as the name, but does `alias` here mean we'd allow either `field-id` or `field_id` in a response? Are there any implications to not strictly requiring `field-id`? cc: @kbendick
You can't use a `-` in a variable name in Python, therefore it uses the alias. When going back and forth between dicts and JSON, you can explicitly tell it to use the alias:
➜ iceberg git:(fd-open-api) ✗ python3
Python 3.9.13 (main, May 19 2022, 13:48:47)
[Clang 13.1.6 (clang-1316.0.21.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pydantic import BaseModel, Field
>>>
>>> class BarModel(BaseModel):
... what_ever: int = Field(alias='what-ever')
...
>>> BarModel(**{'what-ever': 123})
BarModel(what_ever=123)
>>> model = BarModel(**{'what-ever': 123})
>>> model.dict()
{'what_ever': 123}
>>> model.dict(by_alias=True)
{'what-ever': 123}
>>> model.json(by_alias=True)
'{"what-ever": 123}'
We would need to use the proper name (kebab-case `field-id`) any time it's serialized, be that stored in a file or sent over the network for things like the REST catalog etc.

I don't personally think it's a good idea to start allowing multiple accepted forms when stored in files or sent over the network, as we'll always have to support those then and it adds unnecessary complexity. We have some places where we have additional logic for things like 3-level lists in Parquet (as their form changed during some versions before). So once those files are written, we always have to support that alternative form, or we have to make the choice to break people's existing tables (that they might not have touched for a while).

So generally that's something we'd avoid when serializing within files or within REST requests.

Otherwise, within the Python project and code itself, it is fine to use underscores and other things. E.g. Java doesn't allow `field-id` as an identifier, so we would generally use `fieldId`.

We'd just want to be sure that we test that the JSON is always generated and used correctly (either via some helper that ensures `by_alias` is used when needed, or just extensive testing).
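For illustration, a minimal sketch of such a helper (pydantic 1.x; the `IcebergBaseModel` name here is hypothetical, not something from this PR):

from pydantic import BaseModel, Field

class IcebergBaseModel(BaseModel):
    # Hypothetical shared base that defaults to alias (kebab-case) serialization,
    # so callers can't forget to pass by_alias=True.
    def json(self, **kwargs):
        kwargs.setdefault("by_alias", True)
        return super().json(**kwargs)

class PartitionField(IcebergBaseModel):
    field_id: int = Field(alias='field-id')

print(PartitionField(**{'field-id': 1}).json())  # -> {"field-id": 1}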
.github/workflows/openapi-ci.yml
- uses: actions/setup-python@v2
  with:
    python-version: 3.9
- working-directory: ./python
Is a name here helpful? Maybe something like `name: "Validate that python OpenAPI code is in sync"`.
Thanks, that got dropped somewhere 👍🏻
We need a
@samredai indeed, that's missing. Thanks for pointing that out! 👍🏻
Is this worth it?

Many of the classes in the REST spec mirror structures from `TableMetadata`, but when we talked last I thought that we were going to create `to_dict`/`from_dict` methods to serialize and deserialize those. Having autogenerated classes for those seems like duplication.

Also, I don't think we're getting much out of this. It generates classes that are simple and could easily be statically defined, and it requires an additional library to get something similar to what we get from `@dataclass`. I'd probably opt not to pull in the additional dependency.
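For reference, a rough sketch of the statically defined alternative being suggested here, using only the standard library (a hypothetical example, not code from this PR):

from dataclasses import dataclass
from typing import Any, Dict

@dataclass(frozen=True)
class PartitionField:
    source_id: int
    field_id: int
    name: str
    transform: str

    def to_dict(self) -> Dict[str, Any]:
        # Emit the kebab-case keys the spec uses on the wire
        return {
            'source-id': self.source_id,
            'field-id': self.field_id,
            'name': self.name,
            'transform': self.transform,
        }

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> 'PartitionField':
        return cls(d['source-id'], d['field-id'], d['name'], d['transform'])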
Thanks @Fokko for looking into this.
I'm generally unsure about using the OpenAPI doc currently for generating classes, as it's presently intended more for documentation purposes than for use as models and for generating code. To that extent, it seems that the generated classes are at times not that helpful (like `Updates`).
I've also noticed a lot of variance in the generated code depending on which library is used to do it. I know one Python generation tool I used to generate code made all models just dictionaries... not even classes that were secretly dictionaries but just... dictionaries.
I do think this could be useful, for example for finding ways we might make the OpenAPI document more clear by looking at the generated code. But again, it's a reference specification of behaviors and happens to be an OpenAPI document because that provides a lot of existing tooling for looking at it etc. But it's not as much something intended for generating code.
If we get to a place where the OpenAPI doc can be used as a drop-in, that would be great. But talking to some people who are very familiar with OpenAPI, my understanding is that some projects have behavior that's simply more complex than any of the OpenAPI generation tools can really account for. I've been told that either the code has to be written in a very specific style, or that it can take years to get to such a place.
But I'd like to hear from others, specifically @rdblue who has done a good amount of work on the OpenAPI-based spec.
EDIT - It seems while I was writing this that Ryan weighed in already =)
@@ -44,11 +44,13 @@ packages = find:
python_requires = >=3.8
install_requires =
    mmh3
    pydantic==1.9.1
Is `pydantic` required if we were to move forward with this, or is it just added in this PR for other reasons?
I know that we're trying to keep the required dependencies as minimal as possible for the Python library.
I think it is! Allow me to elaborate on why.
That's correct. The hand-written serialization methods would be replaced by this PR, because we get them for free:

➜ python git:(fd-open-api) python3
Python 3.9.13 (main, May 19 2022, 13:48:47)
[Clang 13.1.6 (clang-1316.0.21.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from iceberg.openapi.rest_catalog import PartitionSpec, PartitionField
>>> spec = PartitionSpec(
...     spec_id=123,
...     fields=[
...         PartitionField(
...             field_id=234,
...             source_id=1,
...             name='date',
...             transform='identity'
...         )
...     ]
... )
>>> spec
PartitionSpec(spec_id=123, fields=[PartitionField(field_id=234, source_id=1, name='date', transform=Transform(__root__='identity'))])
# Serialize
>>> spec.json()
'{"spec_id": 123, "fields": [{"field_id": 234, "source_id": 1, "name": "date", "transform": "identity"}]}'
# Deserialize
>>> PartitionSpec.parse_raw(spec.json())
PartitionSpec(spec_id=123, fields=[PartitionField(field_id=234, source_id=1, name='date', transform=Transform(__root__='identity'))])

I think this is much less error-prone than code like this: https://github.com/apache/iceberg/pull/3677/files#diff-9d9a8492ccd85cbbddfc202b9a954b57a079f2cca33e96eb0a9106cf1a4a8130R37-R58

Next to that, the maintenance cost is much higher when we have to keep it in sync by hand. If we generate the code, we can check whether we break compatibility (using mypy).
This is correct, but we get some more on top of the data classes: validators. Instead of using a builder pattern, we can just validate the input nicely using decorators: https://pydantic-docs.helpmanual.io/usage/validators/

This looks like:

>>> from iceberg.openapi.rest_catalog import PartitionSpec, PartitionField
>>> spec = PartitionSpec(
...     spec_id=123,
...     fields=[
...         PartitionField(
...             field_id=234,
...             name='date',
...             transform='identity'
...         )
...     ]
... )
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for PartitionField
source-id
  field required (type=value_error.missing)
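For illustration, a minimal sketch of such a decorator-based validator (pydantic 1.x style; the `name_must_not_be_empty` rule is a made-up example, not something in the generated models):

from pydantic import BaseModel, Field, validator

class PartitionField(BaseModel):
    source_id: int = Field(alias='source-id')
    name: str

    @validator('name')
    def name_must_not_be_empty(cls, v):
        # Reject empty partition field names at construction time
        if not v:
            raise ValueError('partition field name must not be empty')
        return v

PartitionField(**{'source-id': 1, 'name': ''})  # raises ValidationError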
I see your point, but it is a bit of a catch-22: until we start using it, it won't mature. And instead of appending methods to the Python classes, it is just as simple to amend the OpenAPI spec and regenerate the code.
Yes, there are a variety of generators, and I found this one the best. It nicely generates all the aliases as well. I already bumped into some issues with the plain Python one: openapi-generators/openapi-python-client#618
Great point, and I think we'll always need to extend the generated classes with convenience methods or additional validation. An example of how to do this is given in #4717 (comment)
For this implementation, pydantic is required: https://github.com/koxudaxi/datamodel-code-generator. The code generator specifically generates pydantic models (and brings in the validation as well). It fits our use case perfectly: https://python.plainenglish.io/an-introduction-to-the-pydantic-stack-9e490d606c8d. It comes from the FastAPI movement and is becoming increasingly popular. Also, pydantic doesn't pull in the whole universe; it has just one dependency: https://github.com/samuelcolvin/pydantic/blob/master/setup.py#L131-L133
Okay, so assuming that we go this direction, how do we integrate these generated classes with the other code that needs to be there?
Subclassing them in another module would be the way to go. In that class we then add all the convenience methods and additional validation, while inheriting the (de)serialization of the actual data from the OpenAPI spec. This would rewrite the following from PR #4717 (iceberg/python/src/iceberg/table/partitioning.py, lines 48 to 111 at f5de18f):
To the following (tests are passing :):

from typing import Any, Dict, List

from pydantic import Field, root_validator

from iceberg.openapi import rest_catalog
from iceberg.openapi.rest_catalog import PartitionField
from iceberg.schema import Schema as IcebergSchema  # assumed import path for the aliased Schema

class PartitionSpec(rest_catalog.PartitionSpec):
    """
    PartitionSpec captures the transformation from table data to partition values

    Attributes:
        table_schema(IcebergSchema): the schema of the data table
    """

    # Fokko: I've aliased the schema to IcebergSchema, because we also have a schema in the open_api spec
    # This would go away later on
    table_schema: IcebergSchema = Field()
    _source_id_to_fields_map: Dict[int, List[PartitionField]] = Field(init=False)

    @root_validator
    def check_fields_in_schema(cls, values: Dict[str, Any]):
        schema: IcebergSchema = values['table_schema']
        source_id_to_fields_map = dict()
        for partition_field in values['fields']:
            source_column = schema.find_column_name(partition_field.source_id)
            if not source_column:
                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
            existing = source_id_to_fields_map.get(partition_field.source_id, [])
            existing.append(partition_field)
            source_id_to_fields_map[partition_field.source_id] = existing
        values["_source_id_to_fields_map"] = source_id_to_fields_map
        return values

    def __eq__(self, other):
        """
        Equality check on spec_id and partition fields only
        """
        return self.spec_id == other.spec_id and self.fields == other.fields

    def __str__(self):
        """
        PartitionSpec str method highlights the partition fields only
        """
        result_str = "["
        for partition_field in self.fields:
            result_str += f"\n {str(partition_field)}"
        if len(self.fields) > 0:
            result_str += "\n"
        result_str += "]"
        return result_str

    def is_unpartitioned(self) -> bool:
        return len(self.fields) < 1

    def fields_by_source_id(self, field_id: int) -> List[PartitionField]:
        return self._source_id_to_fields_map[field_id]

    def compatible_with(self, other: "PartitionSpec") -> bool:
        """
        Returns true if this partition spec is equivalent to the other, with partition field_id ignored.
        That is, if both specs have the same number of fields, field order, field names, source column ids, and transforms.
        """
        # Check the field count explicitly; zip() alone would silently truncate
        return len(self.fields) == len(other.fields) and all(
            this_field.source_id == that_field.source_id
            and this_field.transform == that_field.transform
            and this_field.name == that_field.name
            for this_field, that_field in zip(self.fields, other.fields)
        )

There are also some discrepancies between the OpenAPI models and the current classes in Python/Java. For example, the Schema derives from the StructType, while in Java/Python the schema class has a struct as a variable. But we could use the fields in Python 👍🏻 It would also consolidate the naming; currently we use
Alright, I did some more in-depth investigation over the weekend by just converting everything to the generated models. It went quite well, and it works quite nicely with the encoders/decoders. We're able to export the schema from our subclassed types by taking the parent schema of the pydantic model, which is quite cool. But I've hit a wall now. At the top we have the following (iceberg/open-api/rest-catalog-open-api.yaml, lines 940 to 945 at f5de18f):
With the generated models, we're unable to construct the type hierarchy, because the
Unfortunately, we can't change the primitive type to an object. That serialization is part of the table spec, so we can't update the representation. We could possibly change the OpenAPI representation, though. It didn't make sense to me that the generator was expecting
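For context, a minimal sketch of the custom-root-type pattern the generator emits for string-serialized types such as `Transform` (as seen in the REPL output earlier; an illustration under pydantic 1.x, not code from this PR):

from pydantic import BaseModel

class Transform(BaseModel):
    # The generator wraps spec primitives in a custom root type: the whole
    # model is a single string such as 'identity' or 'bucket[16]'.
    __root__: str

t = Transform(__root__='identity')
print(t.json())  # -> "identity" (a bare JSON string)
print(repr(Transform.parse_obj('bucket[16]')))  # -> Transform(__root__='bucket[16]')

# Every transform collapses into one string-typed model, which is part of why
# it is hard to build a Python type hierarchy on top of the generated code.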
The abovementioned issue has been fixed in #4899. Based on the discussion at the Python sync, let's close this for now. We can't integrate the code into our existing codebase, and we don't want to maintain the two separately :)