
Python: Add generated classes from OpenAPI spec #4858

Closed
Fokko wants to merge 5 commits from the fd-open-api branch

Conversation

Fokko
Contributor

@Fokko Fokko commented May 24, 2022

This makes it type-safe to generate the classes for the REST Catalog spec

@Fokko Fokko marked this pull request as ready for review May 24, 2022 19:21
@Fokko Fokko force-pushed the fd-open-api branch 3 times, most recently from 64a1f7f to b7f5143 on May 24, 2022 20:58
This makes it type-safe to generate the classes for the REST Catalog spec
Collaborator
@samredai samredai left a comment

This looks great! It would be good to devote a section of the readme to this, and maybe some basics on how it should be incorporated into the codebase.

- working-directory: ./python
  run: |
    make generate-openapi
    if ! git diff --exit-code src/iceberg/openapi/rest_catalog.py; then
Collaborator

Awesome!



class PartitionField(BaseModel):
    field_id: Optional[int] = Field(None, alias='field-id')
Collaborator

I know Python doesn't leave us the option of using field-id as the name here, but does alias mean we'd allow either field-id or field_id in a response? Are there any implications to not strictly requiring field-id? cc: @kbendick

Contributor Author

You can't use a - in a Python identifier, so the generator uses an alias. When going back and forth between dicts and JSON, you can explicitly tell it to use the alias:

iceberg git:(fd-open-api) ✗ python3
Python 3.9.13 (main, May 19 2022, 13:48:47) 
[Clang 13.1.6 (clang-1316.0.21.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pydantic import BaseModel, Field
>>> 
>>> class BarModel(BaseModel):
...     what_ever: int = Field(alias='what-ever')
... 
>>> BarModel(**{'what-ever': 123})
BarModel(what_ever=123)
>>> model = BarModel(**{'what-ever': 123})
>>> model.dict()
{'what_ever': 123}
>>> model.dict(by_alias=True)
{'what-ever': 123}
>>> model.json(by_alias=True)
'{"what-ever": 123}'

Contributor
@kbendick kbendick May 27, 2022

We would need to use the proper name (kebab case field-id) any time it's serialized, be that stored in a file or sent over the network for things like the REST catalog etc.

I don't personally think it's a good idea to start accepting multiple forms when data is stored in files or sent over the network, as we'll then always have to support those forms and it adds unnecessary complexity. We have some places where we have additional logic for things like 3-level lists in Parquet (as their form changed during some versions before). So once those files are written, we always have to support that alternative form, or we have to make the choice to break people's existing tables (that they might not have touched for a while).

So generally that's something we'd avoid when serializing within files or within REST requests.

Otherwise, within the Python project and code itself, it is fine to use underscore and other things. E.g. Java doesn't allow for field-id as an identifier so we would generally use fieldId.

We'd just want to be sure that we test that the JSON is always generated and used correctly (either via some helper that ensures by_alias is used when needed or just extensive testing).
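
A minimal sketch of such a helper plus a round-trip test (the to_rest_json helper is hypothetical, reusing the BarModel example above; pydantic v1 syntax):

from pydantic import BaseModel, Field


class BarModel(BaseModel):
    what_ever: int = Field(alias='what-ever')


def to_rest_json(model: BaseModel) -> str:
    # Always serialize with the kebab-case aliases, so the wire
    # format matches the REST spec regardless of the call site.
    return model.json(by_alias=True)


def test_round_trip_uses_aliases():
    original = BarModel(**{'what-ever': 123})
    payload = to_rest_json(original)
    # The serialized form must use the spec's kebab-case name
    assert '"what-ever"' in payload
    # and parsing it back must yield an equal model
    assert BarModel.parse_raw(payload) == original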

- uses: actions/setup-python@v2
  with:
    python-version: 3.9
- working-directory: ./python
Collaborator

Would a name here be helpful? Maybe something like name: "Validate that Python OpenAPI code is in sync"

Contributor Author

Thanks, that got dropped somewhere 👍🏻

@samredai
Collaborator

We need an iceberg/openapi/__init__.py file, right?

@Fokko
Contributor Author

Fokko commented May 25, 2022

@samredai indeed, that's missing. Thanks for pointing that out! 👍🏻

Contributor
@rdblue rdblue left a comment

Is this worth it?

Many of the classes in the REST spec mirror structures from TableMetadata, but when we talked last I thought that we were going to create to_dict / from_dict methods to serialize and deserialize those. Having autogenerated classes for those seems like duplication.

Also, I don't think we're getting much out of this. This makes classes that are simple and could easily be statically defined. And it requires an additional library that looks similar to what we get from @dataclass. I'd probably opt not to pull in the additional dependency.

Contributor
@kbendick kbendick left a comment

Thanks @Fokko for looking into this.

I'm generally unsure about using the OpenAPI doc for generating classes at the moment, as it's presently intended more for documentation purposes than for generating models and code. To that extent, it seems that the generated classes are at times not that helpful (like Updates).

I've also noticed a lot of variance in the generated code depending on which library is used to do it. I know one Python generation tool I used to generate code made all models just dictionaries... not even classes that were secretly dictionaries but just... dictionaries.

I do think this could be useful, for example for finding ways we might make the OpenAPI document more clear by looking at the generated code. But again, it's a reference specification of behaviors and happens to be an OpenAPI document because that provides a lot of existing tooling for looking at it etc. But it's not as much something intended for generating code.

If we get to a place where the OpenAPI doc can be used as a drop-in, that would be great. But talking to some people who are very familiar with OpenAPI, my understanding is that some projects have behavior that's simply more complex than any of the OpenAPI generation tools can really account for. I've been told that either the code has to be written in a very specific style or that it can take years to get to such a place.

But I'd like to hear from others, specifically @rdblue who has done a good amount of work on the OpenAPI-based spec.

EDIT - It seems while I was writing this that Ryan weighed in already =)

@@ -44,11 +44,13 @@ packages = find:
python_requires = >=3.8
install_requires =
    mmh3
    pydantic==1.9.1
Contributor

Is pydantic required if we were to move forward with this, or is it just added in this PR for other reasons?

I know that we're trying to keep the required dependencies as minimal as possible for the Python library.

@Fokko
Contributor Author

Fokko commented May 27, 2022

> Is this worth it?

I think it is! Allow me to elaborate on why I think that's the case.

> Many of the classes in the REST spec mirror structures from TableMetadata, but when we talked last I thought that we were going to create to_dict / from_dict methods to serialize and deserialize those. Having autogenerated classes for those seems like duplication.

That's correct. The hand-written serialization methods would be replaced by this PR, because we get them for free:

python git:(fd-open-api) python3
Python 3.9.13 (main, May 19 2022, 13:48:47) 
[Clang 13.1.6 (clang-1316.0.21.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from iceberg.openapi.rest_catalog import PartitionSpec, PartitionField
>>> spec = PartitionSpec(
...   spec_id=123,
...   fields=[
...     PartitionField(
...       field_id=234,
...       source_id=1,
...       name='date',
...       transform='identity'
...     )
...   ]
... )
>>> spec
PartitionSpec(spec_id=123, fields=[PartitionField(field_id=234, source_id=1, name='date', transform=Transform(__root__='identity'))])

# Serialize
>>> spec.json()
'{"spec_id": 123, "fields": [{"field_id": 234, "source_id": 1, "name": "date", "transform": "identity"}]}'

# Deserialize
>>> PartitionSpec.parse_raw(spec.json())
PartitionSpec(spec_id=123, fields=[PartitionField(field_id=234, source_id=1, name='date', transform=Transform(__root__='identity'))])

I think this is much less error-prone than code like this: https://github.com/apache/iceberg/pull/3677/files#diff-9d9a8492ccd85cbbddfc202b9a954b57a079f2cca33e96eb0a9106cf1a4a8130R37-R58

On top of that, the maintenance cost is much higher when we have to keep it in sync by hand. If we generate it, we can check whether we break compatibility (using mypy).

> Also, I don't think we're getting much out of this. This makes classes that are simple and could easily be statically defined. And it requires an additional library that looks similar to what we get from @dataclass. I'd probably opt not to pull in the additional dependency.

This is correct, but we get more on top of the data classes: validators. Instead of using a builder pattern, we can just nicely validate the input using decorators: https://pydantic-docs.helpmanual.io/usage/validators/

This looks like:

>>> from iceberg.openapi.rest_catalog import PartitionSpec, PartitionField
>>> spec = PartitionSpec(
...   spec_id=123,
...   fields=[
...     PartitionField(
...       field_id=234,
...       name='date',
...       transform='identity'
...     )
...   ]
... )
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for PartitionField
source-id
  field required (type=value_error.missing)
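
For custom rules beyond required fields, a minimal sketch of such a decorator-based validator (hypothetical rule on a simplified PartitionField; pydantic v1 syntax):

from pydantic import BaseModel, validator


class PartitionField(BaseModel):
    source_id: int
    field_id: int
    name: str
    transform: str

    @validator('name')
    def name_must_not_be_empty(cls, v: str) -> str:
        # Reject empty partition field names at construction time,
        # instead of deferring the check to a builder.
        if not v:
            raise ValueError('partition field name must not be empty')
        return v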

> I'm generally unsure about using the OpenAPI doc for generating classes at the moment, as it's presently intended more for documentation purposes than for generating models and code. To that extent, it seems that the generated classes are at times not that helpful (like Updates).

I see your point, but it is a bit of a catch-22: until we start using it, it won't mature. Instead of appending methods to the Python classes, it is just as simple to amend the OpenAPI spec and regenerate the code.

> I've also noticed a lot of variance in the generated code depending on which library is used to do it. I know one Python generation tool I used to generate code made all models just dictionaries... not even classes that were secretly dictionaries but just... dictionaries.

Yes, there are a variety of generators, and I found this one to be the best. It nicely generates all the aliases as well. I already bumped into some issues with the plain Python one: openapi-generators/openapi-python-client#618

> If we get to a place where the OpenAPI doc can be used as a drop-in, that would be great. But talking to some people who are very familiar with OpenAPI, my understanding is that some projects have behavior that's simply more complex than any of the OpenAPI generation tools can really account for. I've been told that either the code has to be written in a very specific style or that it can take years to get to such a place.

Great point, and I think we'll always need to extend the generated classes with convenience methods or additional validation. How to do this is shown in #4717 (comment)

> Is pydantic required if we were to move forward with this, or is it just added in this PR for other reasons?

For this implementation, pydantic is required: https://github.com/koxudaxi/datamodel-code-generator The code generator specifically generates pydantic objects (and brings in the validation as well). It fits our use case perfectly: https://python.plainenglish.io/an-introduction-to-the-pydantic-stack-9e490d606c8d

It comes from the FastAPI movement and is becoming increasingly popular. Also, pydantic doesn't pull in the whole universe; it has just one dependency: https://github.com/samuelcolvin/pydantic/blob/master/setup.py#L131-L133

@rdblue
Contributor

rdblue commented May 27, 2022

Okay, so assuming that we go this direction, how do we integrate these generated classes with the other code that needs to be there? PartitionSpec is a great example, where we can use this to get a PartitionSpec object that we can serialize and deserialize properly, along with basic validation. But how do we expose methods like PartitionSpec.partitionType() as these change? Do we make subclasses?

@Fokko
Contributor Author

Fokko commented May 28, 2022

Subclassing them in another module would be the way to go. In that class, we then add all the convenience methods and additional validation, while inheriting the (de)serialization of the actual data from the OpenAPI spec.

This would rewrite this class from PR #4717:

@dataclass(eq=False, frozen=True)
class PartitionSpec:
    """
    PartitionSpec captures the transformation from table data to partition values

    Attributes:
        schema(Schema): the schema of the data table
        spec_id(int): any change to PartitionSpec will produce a new spec_id
        fields(List[PartitionField]): list of partition fields to produce partition values
        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
    """

    schema: Schema
    spec_id: int
    fields: Tuple[PartitionField]
    last_assigned_field_id: int
    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False)

    def __post_init__(self):
        source_id_to_fields_map = dict()
        for partition_field in self.fields:
            source_column = self.schema.find_column_name(partition_field.source_id)
            if not source_column:
                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
            existing = source_id_to_fields_map.get(partition_field.source_id, [])
            existing.append(partition_field)
            source_id_to_fields_map[partition_field.source_id] = existing
        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)

    def __eq__(self, other):
        """
        Equality check on spec_id and partition fields only
        """
        return self.spec_id == other.spec_id and self.fields == other.fields

    def __str__(self):
        """
        PartitionSpec str method highlights the partition fields only
        """
        result_str = "["
        for partition_field in self.fields:
            result_str += f"\n  {str(partition_field)}"
        if len(self.fields) > 0:
            result_str += "\n"
        result_str += "]"
        return result_str

    def is_unpartitioned(self) -> bool:
        return len(self.fields) < 1

    def fields_by_source_id(self, field_id: int) -> List[PartitionField]:
        return self.source_id_to_fields_map[field_id]

    def compatible_with(self, other: "PartitionSpec") -> bool:
        """
        Returns true if this partition spec is equivalent to the other, with partition field_id ignored.
        That is, if both specs have the same number of fields, field order, field name, source column ids, and transforms.
        """
        return all(
            this_field.source_id == that_field.source_id
            and this_field.transform == that_field.transform
            and this_field.name == that_field.name
            for this_field, that_field in zip(self.fields, other.fields)
        )

To the following (tests are passing :):

from iceberg.openapi import rest_catalog


class PartitionSpec(rest_catalog.PartitionSpec):
    """
    PartitionSpec capture the transformation from table data to partition values
    Attributes:
        table_schema(IcebergSchema): the schema of data table
    """
    # Fokko: I've aliased the schema to IcebergSchema, because we also have a schema in the open_api spec
    # This would go away later on
    table_schema: IcebergSchema = Field()
    _source_id_to_fields_map: Dict[int, List[PartitionField]] = Field(init=False)

    @root_validator
    def check_fields_in_schema(cls, values: Dict[str, Any]):
        schema: IcebergSchema = values['table_schema']
        source_id_to_fields_map = dict()
        for partition_field in values['fields']:
            source_column = schema.find_column_name(partition_field.source_id)
            if not source_column:
                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
            existing = source_id_to_fields_map.get(partition_field.source_id, [])
            existing.append(partition_field)
            source_id_to_fields_map[partition_field.source_id] = existing
        values["_source_id_to_fields_map"] = source_id_to_fields_map
        return values

    def __eq__(self, other):
        """
        Equality check on spec_id and partition fields only
        """
        return self.spec_id == other.spec_id and self.fields == other.fields

    def __str__(self):
        """
        PartitionSpec str method highlight the partition field only
        """
        result_str = "["
        for partition_field in self.fields:
            result_str += f"\n  {str(partition_field)}"
        if len(self.fields) > 0:
            result_str += "\n"
        result_str += "]"
        return result_str

    def is_unpartitioned(self) -> bool:
        return len(self.fields) < 1

    def fields_by_source_id(self, field_id: int) -> List[PartitionField]:
        return self._source_id_to_fields_map[field_id]

    def compatible_with(self, other: "PartitionSpec") -> bool:
        """
        Returns true if this partition spec is equivalent to the other, with partition field_id ignored.
        That is, if both specs have the same number of fields, field order, field name, source column ids, and transforms.
        """
        return all(
            this_field.source_id == that_field.source_id
            and this_field.transform == that_field.transform
            and this_field.name == that_field.name
            for this_field, that_field in zip(self.fields, other.fields)
        )

There are also some discrepancies between the OpenAPI models and the current classes in Python/Java. For example, in the OpenAPI spec the Schema derives from StructType, while in Java/Python the schema class has a struct as a variable. But we could use the fields in Python 👍🏻 It would also consolidate the naming: currently we use columns in Python, in contrast to fields in the OpenAPI spec. Or required vs is_optional.

@Fokko
Contributor Author

Fokko commented May 29, 2022

Alright, I did some more in-depth investigation over the weekend by just converting everything to the generated models. It went quite well, and it works nicely with the encoders/decoders. We're able to export the schema from our subclassed types by taking the parent schema of the pydantic model, which is quite cool.

But I've hit a wall now. At the top of the type hierarchy we have Type:

Type:
  anyOf:
    - $ref: '#/components/schemas/PrimitiveType'
    - $ref: '#/components/schemas/StructType'
    - $ref: '#/components/schemas/ListType'
    - $ref: '#/components/schemas/MapType'

With the generated models, we're unable to construct the type hierarchy, because PrimitiveType is a plain string while the other ones are of type object 😭 The combination of pydantic and subclassing isn't able to cope with this. One way forward would be to change PrimitiveType to an object as well (and add a type field to discriminate the variants). This would also allow us to add properties for Fixed and Decimal instead of having to parse them from the name.
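
A hypothetical sketch of what that object form could look like (the precision/scale/length properties are illustrative assumptions, not part of the actual spec):

PrimitiveType:
  type: object
  required:
    - type
  properties:
    type:
      # e.g. 'decimal', 'fixed', 'string'; discriminates the variant
      type: string
    precision:
      # illustrative: only meaningful for decimal types
      type: integer
    scale:
      # illustrative: only meaningful for decimal types
      type: integer
    length:
      # illustrative: only meaningful for fixed types
      type: integer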

@rdblue
Contributor

rdblue commented May 29, 2022

Unfortunately, we can't change the primitive type to an object. That serialization is part of the table spec, so we can't update the representation. We could possibly change the OpenAPI representation, though. It didn't make sense to me that the generator was expecting Type branches to be uniform, so I looked at the anyOf docs. It looks like anyOf can match one or more branches, which implies that there is some similarity across branches. However, oneOf expects the type to match exactly one branch. Maybe that would fix this issue?
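
For reference, a sketch of that change against the snippet above (same schema refs, only anyOf swapped for oneOf):

Type:
  oneOf:
    - $ref: '#/components/schemas/PrimitiveType'
    - $ref: '#/components/schemas/StructType'
    - $ref: '#/components/schemas/ListType'
    - $ref: '#/components/schemas/MapType'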

@Fokko
Contributor Author

Fokko commented Jun 1, 2022

The above-mentioned issue has been fixed in #4899

Based on the discussion at the Python sync, let's close this for now. We can't integrate the code into our existing codebase, and we don't want to maintain the two separately :)

@Fokko Fokko closed this Jun 1, 2022
@Fokko Fokko deleted the fd-open-api branch June 1, 2022 17:42
@Fokko Fokko added OPENAPI and removed OPENAPI labels May 14, 2024