
Python: Add generated classes from OpenAPI spec #4858

Closed
Fokko wants to merge 5 commits from the fd-open-api branch

Conversation

Fokko
Contributor

@Fokko Fokko commented May 24, 2022

This makes it type-safe to generate the classes for the REST Catalog spec

@Fokko Fokko marked this pull request as ready for review May 24, 2022 19:21
@Fokko Fokko force-pushed the fd-open-api branch 3 times, most recently from 64a1f7f to b7f5143 on May 24, 2022 20:58
This makes it type-safe to generate the classes for the REST Catalog spec
Collaborator
@samredai samredai left a comment

This looks great! It would be good to devote a section of the readme to this, and maybe some basics on how it should be incorporated into the codebase.

- working-directory: ./python
  run: |
    make generate-openapi
    if ! git diff --exit-code src/iceberg/openapi/rest_catalog.py; then
Collaborator

Awesome!



class PartitionField(BaseModel):
    field_id: Optional[int] = Field(None, alias='field-id')
Collaborator

I know Python doesn't leave us the option of using field-id as the name here, but does alias mean we'd allow either field-id or field_id in a response? Are there any implications to not strictly requiring field-id? cc: @kbendick

Contributor Author

You can't use a - in a Python identifier, so the generator uses an alias. When going back and forth between dicts and JSON, you can explicitly tell it to use the alias:

iceberg git:(fd-open-api) ✗ python3
Python 3.9.13 (main, May 19 2022, 13:48:47) 
[Clang 13.1.6 (clang-1316.0.21.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pydantic import BaseModel, Field
>>> 
>>> class BarModel(BaseModel):
...     what_ever: int = Field(alias='what-ever')
... 
>>> BarModel(**{'what-ever': 123})
BarModel(what_ever=123)
>>> model = BarModel(**{'what-ever': 123})
>>> model.dict()
{'what_ever': 123}
>>> model.dict(by_alias=True)
{'what-ever': 123}
>>> model.json(by_alias=True)
'{"what-ever": 123}'

Contributor
@kbendick kbendick May 27, 2022

We would need to use the proper name (kebab case field-id) any time it's serialized, be that stored in a file or sent over the network for things like the REST catalog etc.

I don't personally think it's a good idea to start accepting multiple forms when data is stored in files or sent over the network, as we'll then always have to support those forms and it adds unnecessary complexity. We have some places where we have additional logic for things like 3-level lists in Parquet (as their form changed during some versions before). So once those files are written, we always have to support that alternative form, or we have to make the choice to break people's existing tables (that they might not have touched for a while).

So generally that's something we'd avoid when serializing within files or within REST requests.

Otherwise, within the Python project and code itself, it is fine to use underscore and other things. E.g. Java doesn't allow for field-id as an identifier so we would generally use fieldId.

We'd just want to be sure that we test that the JSON is always generated and used correctly (either via some helper that ensures by_alias is used when needed or just extensive testing).
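
A minimal sketch of such a helper plus a round-trip test (the to_rest_json helper is hypothetical, reusing the BarModel example above; pydantic v1 syntax):

from pydantic import BaseModel, Field


class BarModel(BaseModel):
    what_ever: int = Field(alias='what-ever')


def to_rest_json(model: BaseModel) -> str:
    # Always serialize with the kebab-case aliases, so the wire
    # format matches the REST spec regardless of the call site.
    return model.json(by_alias=True)


def test_round_trip_uses_aliases():
    original = BarModel(**{'what-ever': 123})
    payload = to_rest_json(original)
    # The serialized form must use the spec's kebab-case name
    assert '"what-ever"' in payload
    # and parsing it back must yield an equal model
    assert BarModel.parse_raw(payload) == original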

- uses: actions/setup-python@v2
  with:
    python-version: 3.9
- working-directory: ./python
Collaborator

Would a name here be helpful? Maybe something like name: "Validate that Python OpenAPI code is in sync"

Contributor Author

Thanks, that got dropped somewhere 👍🏻

@samredai
Collaborator

We need an iceberg/openapi/__init__.py file, right?

@Fokko
Contributor Author

Fokko commented May 25, 2022

@samredai indeed, that's missing. Thanks for pointing that out! 👍🏻

Contributor
@rdblue rdblue left a comment

Is this worth it?

Many of the classes in the REST spec mirror structures from TableMetadata, but when we talked last I thought that we were going to create to_dict / from_dict methods to serialize and deserialize those. Having autogenerated classes for those seems like duplication.

Also, I don't think we're getting much out of this. This makes classes that are simple and could easily be statically defined. And it requires an additional library that looks similar to what we get from @dataclass. I'd probably opt not to pull in the additional dependency.

Contributor
@kbendick kbendick left a comment

Thanks @Fokko for looking into this.

I'm generally unsure about using the OpenAPI doc for generating classes at the moment, as it's presently intended more for documentation purposes than for generating models and code. To that extent, it seems that the generated classes are at times not that helpful (like Updates).

I've also noticed a lot of variance in the generated code depending on which library is used to do it. I know one Python generation tool I used to generate code made all models just dictionaries... not even classes that were secretly dictionaries but just... dictionaries.

I do think this could be useful, for example for finding ways we might make the OpenAPI document more clear by looking at the generated code. But again, it's a reference specification of behaviors and happens to be an OpenAPI document because that provides a lot of existing tooling for looking at it etc. But it's not as much something intended for generating code.

If we get to a place where the OpenAPI doc can be used as a drop-in, that would be great. But talking to some people who are very familiar with OpenAPI, my understanding is that some projects have behavior that's simply more complex than any of the OpenAPI generation tools can really account for. I've been told that either the code has to be written in a very specific style or that it can take years to get to such a place.

But I'd like to hear from others, specifically @rdblue who has done a good amount of work on the OpenAPI-based spec.

EDIT - It seems while I was writing this that Ryan weighed in already =)

@@ -44,11 +44,13 @@ packages = find:
python_requires = >=3.8
install_requires =
    mmh3
    pydantic==1.9.1
Contributor

Is pydantic required if we were to move forward with this, or is it just added in this PR for other reasons?

I know that we're trying to keep the required dependencies as minimal as possible for the Python library.

@Fokko
Contributor Author

Fokko commented May 27, 2022

> Is this worth it?

I think it is! Allow me to elaborate on why I think that's the case.

> Many of the classes in the REST spec mirror structures from TableMetadata, but when we talked last I thought that we were going to create to_dict / from_dict methods to serialize and deserialize those. Having autogenerated classes for those seems like duplication.

That's correct. The hand-written serialization methods would be replaced by this PR, because we get them for free:

python git:(fd-open-api) python3
Python 3.9.13 (main, May 19 2022, 13:48:47) 
[Clang 13.1.6 (clang-1316.0.21.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from iceberg.openapi.rest_catalog import PartitionSpec, PartitionField
>>> spec = PartitionSpec(
...   spec_id=123,
...   fields=[
...     PartitionField(
...       field_id=234,
...       source_id=1,
...       name='date',
...       transform='identity'
...     )
...   ]
... )
>>> spec
PartitionSpec(spec_id=123, fields=[PartitionField(field_id=234, source_id=1, name='date', transform=Transform(__root__='identity'))])

# Serialize
>>> spec.json()
'{"spec_id": 123, "fields": [{"field_id": 234, "source_id": 1, "name": "date", "transform": "identity"}]}'

# Deserialize
>>> PartitionSpec.parse_raw(spec.json())
PartitionSpec(spec_id=123, fields=[PartitionField(field_id=234, source_id=1, name='date', transform=Transform(__root__='identity'))])

I think this is much less error-prone than code like this: https://github.com/apache/iceberg/pull/3677/files#diff-9d9a8492ccd85cbbddfc202b9a954b57a079f2cca33e96eb0a9106cf1a4a8130R37-R58

On top of that, the maintenance cost is much higher when we have to keep it in sync by hand. If we generate it, we can check whether we break compatibility (using mypy).

> Also, I don't think we're getting much out of this. This makes classes that are simple and could easily be statically defined. And it requires an additional library that looks similar to what we get from @dataclass. I'd probably opt not to pull in the additional dependency.

This is correct, but we get more on top of the data classes: validators. Instead of using a builder pattern, we can just nicely validate the input using decorators: https://pydantic-docs.helpmanual.io/usage/validators/

This looks like:

>>> from iceberg.openapi.rest_catalog import PartitionSpec, PartitionField
>>> spec = PartitionSpec(
...   spec_id=123,
...   fields=[
...     PartitionField(
...       field_id=234,
...       name='date',
...       transform='identity'
...     )
...   ]
... )
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for PartitionField
source-id
  field required (type=value_error.missing)
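
For custom rules beyond required fields, a minimal sketch of such a decorator-based validator (hypothetical rule on a simplified PartitionField; pydantic v1 syntax):

from pydantic import BaseModel, validator


class PartitionField(BaseModel):
    source_id: int
    field_id: int
    name: str
    transform: str

    @validator('name')
    def name_must_not_be_empty(cls, v: str) -> str:
        # Reject empty partition field names at construction time,
        # instead of deferring the check to a builder.
        if not v:
            raise ValueError('partition field name must not be empty')
        return v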

> I'm generally unsure about using the OpenAPI doc for generating classes at the moment, as it's presently intended more for documentation purposes than for generating models and code. To that extent, it seems that the generated classes are at times not that helpful (like Updates).

I see your point, but it is a bit of a catch-22: until we start using it, it won't mature. Instead of appending methods to the Python classes, it is just as simple to amend the OpenAPI spec and regenerate the code.

> I've also noticed a lot of variance in the generated code depending on which library is used to do it. I know one Python generation tool I used to generate code made all models just dictionaries... not even classes that were secretly dictionaries but just... dictionaries.

Yes, there are a variety of generators, and I found this one to be the best. It nicely generates all the aliases as well. I already bumped into some issues with the plain Python one: openapi-generators/openapi-python-client#618

> If we get to a place where the OpenAPI doc can be used as a drop-in, that would be great. But talking to some people who are very familiar with OpenAPI, my understanding is that some projects have behavior that's simply more complex than any of the OpenAPI generation tools can really account for. I've been told that either the code has to be written in a very specific style or that it can take years to get to such a place.

Great point, and I think we'll always need to extend the generated classes with convenience methods or additional validation. How to do this is shown in #4717 (comment)

> Is pydantic required if we were to move forward with this, or is it just added in this PR for other reasons?

For this implementation, pydantic is required: https://github.com/koxudaxi/datamodel-code-generator The code generator specifically generates pydantic objects (and brings in the validation as well). It fits our use case perfectly: https://python.plainenglish.io/an-introduction-to-the-pydantic-stack-9e490d606c8d

It comes from the FastAPI movement and is becoming increasingly popular. Also, pydantic doesn't pull in the whole universe; it has just one dependency: https://github.com/samuelcolvin/pydantic/blob/master/setup.py#L131-L133

@rdblue
Contributor

rdblue commented May 27, 2022

Okay, so assuming that we go this direction, how do we integrate these generated classes with the other code that needs to be there? PartitionSpec is a great example, where we can use this to get a PartitionSpec object that we can serialize and deserialize properly, along with basic validation. But how do we expose methods like PartitionSpec.partitionType() as these change? Do we make subclasses?

@Fokko
Contributor Author

Fokko commented May 28, 2022

Subclassing them in another module would be the way to go. In that class, we then add all the convenience methods and additional validation, while inheriting the (de)serialization of the actual data from the OpenAPI spec.

This would rewrite this class from PR #4717:

@dataclass(eq=False, frozen=True)
class PartitionSpec:
    """
    PartitionSpec captures the transformation from table data to partition values

    Attributes:
        schema(Schema): the schema of the data table
        spec_id(int): any change to PartitionSpec will produce a new spec_id
        fields(List[PartitionField]): list of partition fields to produce partition values
        last_assigned_field_id(int): auto-increment partition field id starting from PARTITION_DATA_ID_START
    """

    schema: Schema
    spec_id: int
    fields: Tuple[PartitionField]
    last_assigned_field_id: int
    source_id_to_fields_map: Dict[int, List[PartitionField]] = field(init=False)

    def __post_init__(self):
        source_id_to_fields_map = dict()
        for partition_field in self.fields:
            source_column = self.schema.find_column_name(partition_field.source_id)
            if not source_column:
                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
            existing = source_id_to_fields_map.get(partition_field.source_id, [])
            existing.append(partition_field)
            source_id_to_fields_map[partition_field.source_id] = existing
        object.__setattr__(self, "source_id_to_fields_map", source_id_to_fields_map)

    def __eq__(self, other):
        """
        Equality check on spec_id and partition fields only
        """
        return self.spec_id == other.spec_id and self.fields == other.fields

    def __str__(self):
        """
        PartitionSpec str method highlights the partition fields only
        """
        result_str = "["
        for partition_field in self.fields:
            result_str += f"\n  {str(partition_field)}"
        if len(self.fields) > 0:
            result_str += "\n"
        result_str += "]"
        return result_str

    def is_unpartitioned(self) -> bool:
        return len(self.fields) < 1

    def fields_by_source_id(self, field_id: int) -> List[PartitionField]:
        return self.source_id_to_fields_map[field_id]

    def compatible_with(self, other: "PartitionSpec") -> bool:
        """
        Returns true if this partition spec is equivalent to the other, with partition field_id ignored.
        That is, if both specs have the same number of fields, field order, field name, source column ids, and transforms.
        """
        return all(
            this_field.source_id == that_field.source_id
            and this_field.transform == that_field.transform
            and this_field.name == that_field.name
            for this_field, that_field in zip(self.fields, other.fields)
        )

To the following (tests are passing :):

from iceberg.openapi import rest_catalog


class PartitionSpec(rest_catalog.PartitionSpec):
    """
    PartitionSpec capture the transformation from table data to partition values
    Attributes:
        table_schema(IcebergSchema): the schema of data table
    """
    # Fokko: I've aliased the schema to IcebergSchema, because we also have a schema in the open_api spec
    # This would go away later on
    table_schema: IcebergSchema = Field()
    _source_id_to_fields_map: Dict[int, List[PartitionField]] = Field(init=False)

    @root_validator
    def check_fields_in_schema(cls, values: Dict[str, Any]):
        schema: IcebergSchema = values['table_schema']
        source_id_to_fields_map = dict()
        for partition_field in values['fields']:
            source_column = schema.find_column_name(partition_field.source_id)
            if not source_column:
                raise ValueError(f"Cannot find source column: {partition_field.source_id}")
            existing = source_id_to_fields_map.get(partition_field.source_id, [])
            existing.append(partition_field)
            source_id_to_fields_map[partition_field.source_id] = existing
        values["_source_id_to_fields_map"] = source_id_to_fields_map
        return values

    def __eq__(self, other):
        """
        Equality check on spec_id and partition fields only
        """
        return self.spec_id == other.spec_id and self.fields == other.fields

    def __str__(self):
        """
        PartitionSpec str method highlight the partition field only
        """
        result_str = "["
        for partition_field in self.fields:
            result_str += f"\n  {str(partition_field)}"
        if len(self.fields) > 0:
            result_str += "\n"
        result_str += "]"
        return result_str

    def is_unpartitioned(self) -> bool:
        return len(self.fields) < 1

    def fields_by_source_id(self, field_id: int) -> List[PartitionField]:
        return self._source_id_to_fields_map[field_id]

    def compatible_with(self, other: "PartitionSpec") -> bool:
        """
        Returns true if this partition spec is equivalent to the other, with partition field_id ignored.
        That is, if both specs have the same number of fields, field order, field name, source column ids, and transforms.
        """
        return all(
            this_field.source_id == that_field.source_id
            and this_field.transform == that_field.transform
            and this_field.name == that_field.name
            for this_field, that_field in zip(self.fields, other.fields)
        )

There are also some discrepancies between the OpenAPI models and the current classes in Python/Java. For example, in the OpenAPI spec the Schema derives from StructType, while in Java/Python the schema class has a struct as a variable. But we could use the fields in Python 👍🏻 It would also consolidate the naming: currently we use columns in Python, in contrast to fields in the OpenAPI spec. Or required vs is_optional.

@Fokko
Contributor Author

Fokko commented May 29, 2022

Alright, I did some more in-depth investigation over the weekend by just converting everything to the generated models. It went quite well, and it works nicely with the encoders/decoders. We're able to export the schema from our subclassed types by taking the parent schema of the pydantic model, which is quite cool.

But I've hit a wall now. At the top of the type hierarchy we have Type:

Type:
  anyOf:
    - $ref: '#/components/schemas/PrimitiveType'
    - $ref: '#/components/schemas/StructType'
    - $ref: '#/components/schemas/ListType'
    - $ref: '#/components/schemas/MapType'

With the generated models, we're unable to construct the type hierarchy, because PrimitiveType is a plain string while the other ones are of type object 😭 The combination of pydantic and subclassing isn't able to cope with this. One way forward would be to change PrimitiveType to an object as well (and add a type field to discriminate the variants). This would also allow us to add properties for Fixed and Decimal instead of having to parse them from the name.
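
A hypothetical sketch of what that object form could look like (the precision/scale/length properties are illustrative assumptions, not part of the actual spec):

PrimitiveType:
  type: object
  required:
    - type
  properties:
    type:
      # e.g. 'decimal', 'fixed', 'string'; discriminates the variant
      type: string
    precision:
      # illustrative: only meaningful for decimal types
      type: integer
    scale:
      # illustrative: only meaningful for decimal types
      type: integer
    length:
      # illustrative: only meaningful for fixed types
      type: integer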

@rdblue
Contributor

rdblue commented May 29, 2022

Unfortunately, we can't change the primitive type to an object. That serialization is part of the table spec, so we can't update the representation. We could possibly change the OpenAPI representation, though. It didn't make sense to me that the generator was expecting Type branches to be uniform, so I looked at the anyOf docs. It looks like anyOf can match one or more branches, which implies that there is some similarity across branches. However, oneOf expects the type to match exactly one branch. Maybe that would fix this issue?
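
For reference, a sketch of that change against the snippet above (same schema refs, only anyOf swapped for oneOf):

Type:
  oneOf:
    - $ref: '#/components/schemas/PrimitiveType'
    - $ref: '#/components/schemas/StructType'
    - $ref: '#/components/schemas/ListType'
    - $ref: '#/components/schemas/MapType'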

@Fokko
Contributor Author

Fokko commented Jun 1, 2022

The above-mentioned issue has been fixed in #4899

Based on the discussion at the Python sync, let's close this for now. We can't integrate the code into our existing codebase, and we don't want to maintain the two separately :)

@Fokko Fokko closed this Jun 1, 2022
@Fokko Fokko deleted the fd-open-api branch June 1, 2022 17:42
@Fokko Fokko added OPENAPI and removed OPENAPI labels May 14, 2024