Fix Union type with dataclass ambiguous error and support superset comparison #5858

mao3267 · 2024-10-18T05:50:26Z

Tracking issue

Related to #5489

Why are the changes needed?

When a function accepts a Union of two dataclasses as input, Flyte cannot distinguish which dataclass matches the user's input. This is because Flyte only compares the simple types, and both dataclasses are identified as flyte.SimpleType_STRUCT in this scenario. As a result, there will be multiple matches, causing ambiguity and leading to an error.

union_test_dataclass.py

from typing import Union
from dataclasses import dataclass
from dataclasses_json import dataclass_json
from flytekit import task, workflow

@dataclass_json 
@dataclass
class A:
    a: int
    

@dataclass_json
@dataclass
class B:
    b: int


@task
def bar() -> A:
    return A(a=1)

@task
def foo(inp: Union[A, B]):
    print(inp)

@workflow
def wf():
    v = bar()
    foo(inp=v)
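
The ambiguity described above can be reproduced with a toy model of the check. This is only a sketch, not the flytepropeller code: `simple_type_of` and `matching_variants` are hypothetical helpers, and the string "STRUCT" stands in for flyte.SimpleType_STRUCT.

```python
from dataclasses import dataclass, is_dataclass


@dataclass
class A:
    a: int


@dataclass
class B:
    b: int


def simple_type_of(py_type: type) -> str:
    """Hypothetical stand-in: every dataclass collapses to the same simple type."""
    if is_dataclass(py_type):
        return "STRUCT"
    return py_type.__name__.upper()


def matching_variants(upstream: type, union_variants: list) -> list:
    """Return the union variants whose simple type equals the upstream one."""
    return [v for v in union_variants if simple_type_of(v) == simple_type_of(upstream)]


matches = matching_variants(A, [A, B])
# Both A and B are returned, so Union[A, B] cannot be resolved by simple type alone.
print([m.__name__ for m in matches])
```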

What changes were proposed in this pull request?

To distinguish between types that are transported as a protobuf struct (dataclass, Pydantic.BaseModel), we compare their JSON schemas. For dataclasses, the schema is generated by marshmallow_jsonschema.JSONSchema (draft-07) or mashumaro.jsonschema.build_json_schema (draft 2020-12); a Pydantic.BaseModel generates its own schema. To check equivalence, we compare the bytes obtained by marshaling the JSON schemas, provided they use the same draft version. For now, we only support comparing schemas of the same version.
We plan to support superset matching for dataclass/Pydantic.BaseModel with schemas in draft 2020-12, meaning that class A and class supersetA are considered a match in the following example (the Pydantic.BaseModel example is in the screenshot section):

superset_A.py

from dataclasses import dataclass
from typing import Optional

# downstream
@dataclass
class A:
    a: int
    b: Optional[int] = None
    c: str = "Flyte"

superset_dataclass.py

from dataclasses import dataclass
from typing import Union
from flytekit import task, workflow
from superset_A import A as supersetA
# upstream 
@dataclass
class A:
    a: int

@dataclass
class B:
    b: str

@task
def foo() -> A:
    return A(a=1)

@task
def my_task(input: Union[supersetA, B]):
    print(input)

@workflow
def wf():
    a = foo()
    my_task(a)

Unit tests will be added for different versions of json schema, including one-level, two-level, and superset examples.
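
The equivalence and superset checks proposed above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Go implementation in this PR: the schemas are hand-written and reduced to one level, and `canonical_bytes`, `is_exact_match`, and `is_superset` are hypothetical names.

```python
import json


def canonical_bytes(schema: dict) -> bytes:
    """Serialize a schema deterministically so equal schemas give equal bytes."""
    return json.dumps(schema, sort_keys=True, separators=(",", ":")).encode()


def is_exact_match(upstream: dict, downstream: dict) -> bool:
    """Equivalence check: compare the marshaled bytes of both schemas."""
    return canonical_bytes(upstream) == canonical_bytes(downstream)


def is_superset(upstream: dict, downstream: dict) -> bool:
    """Assumed one-level rule: downstream may add fields, but only optional ones."""
    up_props = upstream.get("properties", {})
    down_props = downstream.get("properties", {})
    # Every upstream property must exist downstream with an identical definition.
    if any(down_props.get(name) != spec for name, spec in up_props.items()):
        return False
    # Every field the downstream requires must be supplied by the upstream.
    return set(downstream.get("required", [])) <= set(up_props)


schema_a = {"type": "object", "properties": {"a": {"type": "integer"}}, "required": ["a"]}
schema_superset_a = {
    "type": "object",
    "properties": {
        "a": {"type": "integer"},
        "b": {"anyOf": [{"type": "integer"}, {"type": "null"}], "default": None},
        "c": {"type": "string", "default": "Flyte"},
    },
    "required": ["a"],
}
schema_b = {"type": "object", "properties": {"b": {"type": "string"}}, "required": ["b"]}

exact = is_exact_match(schema_a, schema_a)
superset_ok = is_superset(schema_a, schema_superset_a)
unrelated = is_superset(schema_a, schema_b)
print(exact, superset_ok, unrelated)
```

With these inputs, A matches supersetA but not B, so Union[supersetA, B] resolves to a single variant.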

How was this patch tested?

Run an example using union input with identical dataclass on remote (union_test_dataclass.py)
Run an example with superset on remote (superset_dataclass.py)
Run an example using union input with identical BaseModel class on remote (union_test_basemodel.py)

union_test_basemodel.py

from pydantic import BaseModel
from typing import Union
from flytekit import task, workflow
from flytekit.image_spec import ImageSpec

flytekit_hash = "3475ddc41f2ba31d23dd072362be704d7c2470a0"
flytekit = f"git+https://github.com/flyteorg/flytekit.git@{flytekit_hash}"

# Define custom image for the task
image = ImageSpec(
    packages=[
        flytekit,
        "pydantic>2",
        "pandas",
        "pyarrow",
    ],
    apt_packages=["git"],
    registry="localhost:30000",
    builder="default",
)

class A(BaseModel):
    a: int

class B(BaseModel):
    b: str

@task(container_image=image)
def bar() -> A:
    return A(a=1)

@task(container_image=image)
def foo(inp: Union[A, B]):
    print(inp)

@workflow
def wf():
    v = bar()
    foo(inp=v)

if __name__ == "__main__":
    wf()

Run an example with superset on remote (superset_basemodel.py)

superset_basemodel.py

from pydantic import BaseModel
from typing import Optional, Union
from flytekit import task, workflow
from flytekit.image_spec import ImageSpec
from superset_A import A as supersetA

flytekit_hash = "3475ddc41f2ba31d23dd072362be704d7c2470a0"
flytekit = f"git+https://github.com/flyteorg/flytekit.git@{flytekit_hash}"

# Define custom image for the task
image = ImageSpec(
    packages=[
        flytekit,
        "pydantic>2",
        "pandas",
        "pyarrow",
    ],
    apt_packages=["git"],
    registry="localhost:30000",
    builder="default",
)

# downstream (superset_A.py)
class A(BaseModel):
    a: int
    b: Optional[int] = None
    c: str = "Flyte"

# upstream
class A(BaseModel):
    a: int

class B(BaseModel):
    b: str

@task(container_image=image)
def bar() -> A:
    return A(a=1)

@task(container_image=image)
def foo(inp: Union[supersetA, B]):
    print(inp)

@workflow
def wf():
    v = bar()
    foo(inp=v)

if __name__ == "__main__":
    wf()

Setup process

git clone https://github.com/flyteorg/flyte.git
gh pr checkout 5858
make compile
POD_NAMESPACE=flyte ./flyte start --config flyte-single-binary-local.yaml

Screenshots

Example using union input with identical dataclass on remote (union_test_dataclass.py)

Example with superset on remote (superset_dataclass.py)

Example using union input with identical BaseModel class on remote (union_test_basemodel.py)

Example with superset on remote (superset_basemodel.py)

Input is dataclass and Superset is BaseModel inherited

Note for Optional values in JSON Schemas

While handling Optional values in Python, both NoneType and the target type are accepted. However, when defining such values, default values must still be provided. This is why Optional properties without default values are marked as required in JSON schemas.
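
The point about Optional fields can be checked with plain dataclasses. The sketch below uses only the standard library; `required_field_names` is a hypothetical helper that mirrors how a schema generator would decide which properties are required, not code taken from any of the libraries mentioned above.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Optional


@dataclass
class WithDefault:
    a: int
    b: Optional[int] = None


@dataclass
class WithoutDefault:
    a: int
    b: Optional[int]  # accepts None, but the caller must still pass a value


def required_field_names(cls) -> list:
    """Fields with neither a default nor a default factory must always be provided."""
    return [
        f.name
        for f in fields(cls)
        if f.default is MISSING and f.default_factory is MISSING
    ]


print(required_field_names(WithDefault))     # ['a']
print(required_field_names(WithoutDefault))  # ['a', 'b']

try:
    WithoutDefault(a=1)
    constructed = True
except TypeError as exc:
    constructed = False
    print(f"TypeError: {exc}")
```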

Check all the applicable boxes

I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

Related PRs

None

Docs link

TODO

codecov · 2024-10-21T18:06:53Z

Codecov Report

Attention: Patch coverage is 64.76190% with 37 lines in your changes missing coverage. Please review.

Project coverage is 36.96%. Comparing base (b5f23a6) to head (ada05ed).
Report is 3 commits behind head on master.

Files with missing lines	Patch %	Lines
flytepropeller/pkg/compiler/validators/typing.go	64.76%	29 Missing and 8 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5858      +/-   ##
==========================================
+ Coverage   36.90%   36.96%   +0.06%     
==========================================
  Files        1310     1310              
  Lines      131372   131487     +115     
==========================================
+ Hits        48477    48608     +131     
+ Misses      78682    78658      -24     
- Partials     4213     4221       +8

Flag	Coverage Δ
unittests-datacatalog	`51.58% <ø> (ø)`
unittests-flyteadmin	`54.07% <ø> (+0.01%)`	⬆️
unittests-flytecopilot	`22.23% <ø> (ø)`
unittests-flytectl	`62.39% <ø> (ø)`
unittests-flyteidl	`6.92% <ø> (ø)`
unittests-flyteplugins	`53.84% <ø> (ø)`
unittests-flytepropeller	`43.15% <64.76%> (+0.25%)`	⬆️
unittests-flytestdlib	`55.31% <ø> (ø)`

fg91 · 2024-11-01T14:25:16Z

One note @mao3267, this problem does not only affect dataclasses but any other type that uses protobuf struct as transport. Even combinations of different types that all use protobuf struct.
We for instance have an internal type transformer for pydantic base models (historic reasons before an official one was introduced as a plugin). It has exactly the same problem because it also uses protobuf struct.
It would be nice if the solution found for this problem was general and not only working for dataclasses. That's what I'm slightly worried about when reading

distinguish between different dataclasses, we compare their JSON schemas generated by either marshmallow_jsonschema.JSONSchema (draft-07) or mashumaro.jsonschema.build_json_schema

Maybe we can use the literal type's "type structure" field for this? Or we should document how transformers for other types can provide the schema in a way that they can "participate in the logic".

TL;DR: It would be nice if the compiler logic in flytepropeller didn't have "special treatment" for dataclasses but general treatment for json-like structures with schemas that the dataclass transformer makes use of - but other transformers can as well.

Future-Outlier · 2024-11-08T10:56:33Z

One note @mao3267, this problem does not only affect dataclasses but any other type that uses protobuf struct as transport. Even combinations of different types that all use protobuf struct. We for instance have an internal type transformer for pydantic base models (historic reasons before an official one was introduced as a plugin). It has exactly the same problem because it also uses protobuf struct. It would be nice if the solution found for this problem was general and not only working for dataclasses. That's what I'm slightly worried about when reading

distinguish between different dataclasses, we compare their JSON schemas generated by either marshmallow_jsonschema.JSONSchema (draft-07) or mashumaro.jsonschema.build_json_schema

Maybe we can use the literal type's "type structure" field for this? Or we should document how transformers for other types can provide the schema in a way that they can "participate in the logic".

TL;DR: It would be nice if the compiler logic in flytepropeller didn't have "special treatment" for dataclasses but general treatment for json-like structures with schemas that the dataclass transformer makes use of - but other transformers can as well.

Just discussed with @mao3267; he will explain how it works now. In summary, this will support both dataclass and Pydantic BaseModel.

Future-Outlier · 2024-11-11T02:36:00Z

Let's get this done this week @mao3267

mao3267 · 2024-11-11T03:13:03Z

Just discussed with @mao3267; he will explain how it works now. In summary, this will support both dataclass and Pydantic BaseModel.

Currently, we support both Pydantic BaseModel and dataclass, including their combinations and nested structures. For dataclass_json, we only support equivalence without superset matching due to significant differences between its JSON schema (draft-07) and the newer version used by both Pydantic and dataclass (draft 2020-12).

Although the schemas from Pydantic BaseModel and dataclass adhere to the same draft version, they differ in how they handle fields. For instance, only the schema generated by Pydantic BaseModel records the title in properties. Additionally, the required and additionalProperties fields are omitted if no required properties exist or if additional properties are disallowed. To address these discrepancies, we preprocess the schemas before comparison by removing the title field, and we adjust the comparison logic for the required and additionalProperties fields.
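
A rough Python sketch of this preprocessing is shown below. It is not the Go code in typing.go: `strip_property_titles` and `normalize` are hypothetical helpers, and treating a missing additionalProperties as false follows the description in this comment rather than the JSON Schema default.

```python
import copy


def _strip_in_place(schema: dict) -> None:
    """Remove the "title" key from every property definition, including nested ones."""
    for spec in schema.get("properties", {}).values():
        if isinstance(spec, dict):
            spec.pop("title", None)
            _strip_in_place(spec)


def strip_property_titles(schema: dict) -> dict:
    """Return a cleaned copy so the caller's schema is left untouched."""
    cleaned = copy.deepcopy(schema)
    _strip_in_place(cleaned)
    return cleaned


def normalize(schema: dict) -> dict:
    """Fill in the fields that one generator omits and the other writes out."""
    cleaned = strip_property_titles(schema)
    cleaned.setdefault("required", [])
    # Assumption taken from this thread: an absent value is read as false.
    cleaned.setdefault("additionalProperties", False)
    return cleaned


pydantic_schema = {"properties": {"a": {"title": "A", "type": "integer"}}}
dataclass_schema = {"properties": {"a": {"type": "integer"}}, "additionalProperties": False}

same_after_normalizing = normalize(pydantic_schema) == normalize(dataclass_schema)
print(same_after_normalizing)
```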

wild-endeavor · 2024-11-11T18:07:21Z

the exact match part is okay and we should fix that but I'm confused about the other part.

looking at this example

@task
def my_task(input: Union[supersetA, B]):
    print(input)

@workflow
def wf():
    a = foo()
    my_task(a)

can you explain why this should work? foo creates an A object with only one field. my_task is a task that takes in supersetA or B. B is not relevant here. supersetA takes in three fields.

Why doesn't mypy complain in this example? Or does it?

I almost feel like if we're going to go down this route, it should be the other way around. If foo returned supersetA and my_task took in union of A and B. The reason is because supersetA contains more fields than A

cc @fg91 if you want to take a look as well.

wasn't there another pr where we were discussing an LGPL library also?

mao3267 · 2024-11-12T14:11:44Z

flytepropeller/pkg/compiler/validators/typing.go

@@ -19,6 +19,8 @@ type trivialChecker struct {
 }

 func removeTitleFieldFromProperties(schema map[string]*structpb.Value) {
+	// TODO: Explain why we need this
+	// TODO: givse me example about dataclass vs. Pydantic BaseModel


This is an example comparing dataclass and Pydantic BaseModel. As shown, the schema for dataclass includes a title field that records the name of the class. Additionally, the additionalProperties field is absent from the Pydantic BaseModel schema because its value is false. cc @eapolinario

Screenshots: dataclass schema, Pydantic.BaseModel schema

To add the comment, writing the entire schema would make it too lengthy. Would it be acceptable to use something like this instead?

class A:
    a: int

Pydantic.BaseModel: {"properties": {"a": {"title": "A", "type": "integer"}}}
dataclass: {"properties": {"a": {"type": "integer"}}, "additionalProperties": false}

fg91 · 2024-11-12

Are you proposing to preprocess the schemas so that one can mix and match dataclasses and base models given their schemas are aligned? I.e. task expects a dataclass with schema "A" and I pass a base model that has the same schema.

I personally feel this is not necessary and think it would be totally acceptable to consider a dataclass and a base model not a match by default. Especially if this makes things a lot more complicated in the backend otherwise because the schemas need to be aligned. What do you think about this?

If you are confident in the logic I'm of course not opposing the feature but if you feel this makes things complicated and brittle, I'd rather keep it simple and more robust.

mao3267 · 2024-11-13

I think it actually makes things more complicated; I will remove the related logic.

mao3267 · 2024-11-12T14:36:16Z

can you explain why this should work? foo creates an A object with only one field. my_task is a task that takes in supersetA or B. B is not relevant here. supersetA takes in three fields.
Why doesn't mypy complain in this example? Or does it?

Class B is used as an example of a type that does not match Class A.
Mypy doesn't report any errors. I am not familiar with mypy, what kind of error do you expect mypy to raise?

I almost feel like if we're going to go down this route, it should be the other way around. If foo returned supersetA and my_task took in union of A and B. The reason is

In the discussion here, we assumed that upstream refers to the exact input type and downstream refers to the expected type for the task input. Did we misunderstand this? By the way, I’m also curious about the reason for supporting this superset matching. It would help us decide our route.

wasn't there another pr where we were discussing an LGPL library also?

The LGPL discussion applies to this PR as well, it is not mentioned because we are no longer using it.

cc @Future-Outlier @wild-endeavor

fg91 · 2024-11-12T20:00:27Z

Currently, we support both Pydantic BaseModel and dataclass, including their combinations and nested structures.

@Future-Outlier @mao3267
My question above wasn't specifically about base models but about how generalizable the solution is.
Let's consider the scenario that an org wants to build a custom internal type transformer for a json-like type similar to dataclasses or base models.
Is there a way they can provide the schema of their type in the to_literal_type method of their type transformer so that the backend can automatically perform schema checks for Union types? Or are there implementation details that are required in the backend too that limit this to dataclasses/base models and would be required for any additional json-like type?

The latter would be slightly concerning to me. Glancing over the code gives me the impression that there is quite a bit of dataclass/pydantic logic we need to apply. I wonder whether this could be done in the respective flytekit type transformer so that the backend is agnostic to the respective type and as long as the type transformer provides the schema in the right way, the backend can make use of it.

It would be really great if there was a tutorial in https://docs.flyte.org/en/latest/api/flytekit/types.extend.html in the end that documents how users need to provide the schema in their respective to_literal_type implementation so that the backend can automatically make use of it in the union type checker.

wild-endeavor · 2024-11-12T23:49:23Z

can you explain why this should work? foo creates an A object with only one field. my_task is a task that takes in supersetA or B. B is not relevant here. supersetA takes in three fields.
Why doesn't mypy complain in this example? Or does it?

Class B is used as an example of a type that does not match Class A. Mypy doesn't report any errors. I am not familiar with mypy, what kind of error do you expect mypy to raise?

I almost feel like if we're going to go down this route, it should be the other way around. If foo returned supersetA and my_task took in union of A and B. The reason is

In the discussion here, we assumed that upstream refers to the exact input type and downstream refers to the expected type for the task input. Did we misunderstand this?

Oh got it, but this only works because there are defaults right? The original case you linked to should work because
So if your supersetA was

@dataclass
class A:
    a: int
    b: Optional[int] = None
    c: str

then this should not work correct? (because c is missing).
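
A small standard-library sketch of why a required field without a default breaks the match. `SupersetWithRequiredC` is a hypothetical variant of the class above; `c` is moved ahead of `b` only because Python rejects a dataclass in which a field without a default follows one with a default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class A:
    a: int


@dataclass
class SupersetWithRequiredC:
    a: int
    c: str  # no default, so every caller must supply it
    b: Optional[int] = None


# What a task returning A(a=1) would hand to the downstream task.
upstream_payload = {"a": 1}

try:
    SupersetWithRequiredC(**upstream_payload)
    built = True
except TypeError as exc:
    built = False
    print(f"cannot build the downstream type: {exc}")
```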

By the way, I’m also curious about the reason for supporting this superset matching. It would help us decide our route.

I don't think of this as superset matching. I think of this as schema compatibility, which is why I was thinking we'd find some off-the-shelf library that can just do it for us. 'Is this schema compatible with this other schema?'

wasn't there another pr where we were discussing an LGPL library also?
The LGPL discussion applies to this PR as well, it is not mentioned because we are no longer using it.

Is there a comment somewhere that explains why we no longer need, or can't use, that library or a library like it?

Re @fg91's comments

Is there a way they can provide the schema of their type in the to_literal_type method of their type transformer so that the backend can automatically perform schema checks for Union types? Or are there implementation details that are required in the backend too that limit this to dataclasses/base models and would be required for any additional json-like type?

I don't know the answer but it should be yes to the first question. It should be possible to easily provide a json schema and have everything on the backend just work. Isn't this the case @Future-Outlier? There should be no special logic for dataclasses or pydantic in the backend, at all. We should remove it if there is.

mao3267 · 2024-11-13T17:03:50Z

Replying @wild-endeavor

then this should not work correct? (because c is missing).

Yes. This should not work.

Is there a comment somewhere that explains why we no longer need, or can't use, that library or a library like it?

No. I just decided to use another package that directly compares the JSON schema. This is not documented anywhere.

Replying @fg91

My question above wasn't specifically about base models but about how generalizable the solution is.
Let's consider the scenario that an org wants to build a custom internal type transformer for a json-like type similar to dataclasses or base models.
Is there a way they can provide the schema of their type in the to_literal_type method of their type transformer so that the backend can automatically perform schema checks for Union types? Or are there implementation details that are required in the backend too that limit this to dataclasses/base models and would be required for any additional json-like type?

The latter would be slightly concerning to me. Glancing over the code gives me the impression that there is quite a bit of dataclass/pydantic logic we need to apply. I wonder whether this could be done in the respective flytekit type transformer so that the backend is agnostic to the respective type and as long as the type transformer provides the schema in the right way, the backend can make use of it.

It would be really great if there was a tutorial in https://docs.flyte.org/en/latest/api/flytekit/types.extend.html in the end that documents how users need to provide the schema in their respective to_literal_type implementation so that the backend can automatically make use of it in the union type checker.

To support any other custom types, could we limit the provided schema from a certain package like Mashumaro for the compatibility check? Without limitations, it will be hard to cover all kinds of scenarios. Or does anyone know possible solutions to support JSON schema from different versions and packages?

fg91 · 2024-11-13T18:03:48Z

There should be no special logic for dataclasses or pydantic in the backend, at all. We should remove it if there is.

This is exactly what I'm trying to say :)

To support any other custom types, could we limit the provided schema from a certain package like Mashumaro for the compatibility check? Without limitations, it will be hard to cover all kinds of scenarios. Or does anyone know possible solutions to support JSON schema from different versions and packages?

Yes, I think restricting what kind of schema needs to be supplied is absolutely reasonable! I think it would be good to add a tutorial here https://docs.flyte.org/en/latest/api/flytekit/types.extend.html that states something like "if you want propeller to understand the schema of your type e.g. to distinguish in union types, you need to provide a schema in the to_literal_type method in this specific way". And I personally feel that the dataclass and pydantic type transformers should provide the schema in this general way so that the backend doesn't have to have type specific implementations for dataclasses/base models.
What do you think about this?

(As a side note, as I described here, for cache checks we don't use the schema in metadata but the so-called type structure. Maybe it's difficult to fix this in hindsight but I kinda wished that there was a single unified way type transformers make the schema available to propeller that is used for everything, cache checks, union type checks, ...)

wild-endeavor · 2024-11-13T18:09:35Z

@fg91 and @EngHabu mind chiming in on dataclass compatibility? Just thinking about it from the basics, if I have json schemas representing two dataclasses, let's say,

@dataclass
class A2:
    a: int
    b: Optional[int] = None
    c: str = "hello"

and

@dataclass
class A1:
    a: int

which of the following two are valid?
Case 1

def wf():
  a1 = create_a1()  # -> A1
  use_a2(a2=a1)  # a2: A2

Case 2

def wf():
  a2 = create_a2()  # -> A2
  use_a1(a1=a2)  # a1: A1

Just thinking about compatibility in the loosest sense, both should be valid. The reason is that in the first case, when calling the downstream task use_a2, fields b and c have defaults.

In the second case, the a field can be taken from the a2 object and b and c discarded.

The implication here though is that if you have

@task
def make_a1() -> A1: ...

@task
def use_either(a: typing.Union[A1, A2]): ...

Then this will fail

use_either(a=make_a1())

because A1 will match more than one variant. flytekit itself will not fail I think (right @Future-Outlier?) but we'll never get there because the compiler will fail.
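
The ambiguity can be shown with a toy compatibility rule. This is a sketch of the "loosest sense" described above, not the compiler's logic: `loosely_compatible` is a hypothetical helper that only checks that the upstream type supplies every downstream field that has no default.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Optional


@dataclass
class A1:
    a: int


@dataclass
class A2:
    a: int
    b: Optional[int] = None
    c: str = "hello"


def loosely_compatible(upstream: type, downstream: type) -> bool:
    """Upstream must provide each downstream field that has no default, with the same type."""
    upstream_types = {f.name: f.type for f in fields(upstream)}
    for f in fields(downstream):
        has_default = f.default is not MISSING or f.default_factory is not MISSING
        if has_default:
            continue
        if upstream_types.get(f.name) != f.type:
            return False
    return True


case_1 = loosely_compatible(A1, A2)  # A1 value passed where A2 is expected
case_2 = loosely_compatible(A2, A1)  # A2 value passed where A1 is expected
print(case_1, case_2)

# An A1 value checked against Union[A1, A2]: more than one variant matches.
variants = [v.__name__ for v in (A1, A2) if loosely_compatible(A1, v)]
print(variants)
```

Under this rule both cases pass, and an A1 value matches both variants of Union[A1, A2], which is the multiple-match situation the compiler rejects.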

Should we just do exact matches only? Plus, both of the earlier examples (cases 1 and 2) will fail mypy type checking.

feat: fix Union type with dataclass ambiguous error

f06cdc6

Signed-off-by: mao3267 <[email protected]>

pingsutw assigned mao3267 and pingsutw and unassigned pingsutw Oct 23, 2024

Future-Outlier self-assigned this Oct 28, 2024

mao3267 added 6 commits November 1, 2024 17:00

Merge branch 'master' of https://github.com/mao3267/flyte into fix/fl…

8660db5

…yteorg#5489-dataclass-mismatch Signed-off-by: mao3267 <[email protected]>

fix: direct json comparison for superset

47ccbd1

Signed-off-by: mao3267 <[email protected]>

fix: go.mod missing entry for error

85489dc

Signed-off-by: mao3267 <[email protected]>

fix: update go module and sum

cc685bb

Signed-off-by: mao3267 <[email protected]>

refactor: gci format

3a629e1

Signed-off-by: mao3267 <[email protected]>

test: add dataset casting tests for same (one/two levels) and superse…

aa4d98e

…t (one level) dataclass Signed-off-by: mao3267 <[email protected]>

mao3267 added 2 commits November 8, 2024 14:01

Merge branch 'master' of https://github.com/mao3267/flyte into fix/fl…

b282e5f

…yteorg#5489-dataclass-mismatch

fix: support Pydantic BaseModel comparison

818afb7

Signed-off-by: mao3267 <[email protected]>

mao3267 changed the title ~~[WIP] Fix Union type with dataclass ambiguous error and support superset comparison~~ Fix Union type with dataclass ambiguous error and support superset comparison Nov 8, 2024

fix: handle nested pydantic basemodel

d6468b6

Signed-off-by: mao3267 <[email protected]>

mao3267 marked this pull request as ready for review November 11, 2024 02:48

Reviews from Eduardo

ada05ed

Signed-off-by: Future-Outlier <[email protected]>
